Researchers have developed a deep learning algorithm that can predict what will happen next in a video clip based on a single still frame from the footage.
The Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology (MIT) made the breakthrough in predictive vision by training an algorithm on 600 hours of YouTube videos.
By searching for patterns and recognizable objects such as hands and faces, the algorithm was able to predict human interactions such as hugging, kissing, shaking hands, or high-fiving.
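At its core, this kind of system maps visual features extracted from a still frame to one of a handful of interaction labels. The sketch below is a hedged illustration of that idea only, not MIT's actual architecture: the `ACTIONS` list, the feature vector, the weights, and the `predict_action` helper are all invented for the example.

```python
import numpy as np

# The four interaction categories mentioned in the article.
ACTIONS = ["hug", "kiss", "handshake", "high-five"]

def predict_action(frame_features: np.ndarray, weights: np.ndarray) -> str:
    """Score each candidate action from frame features and return the most likely one."""
    logits = weights @ frame_features          # one raw score per action
    probs = np.exp(logits - logits.max())      # numerically stable softmax
    probs /= probs.sum()
    return ACTIONS[int(np.argmax(probs))]

# Toy usage with made-up numbers: 4 actions x 3 features.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 3))
features = np.array([0.2, 0.9, 0.1])
print(predict_action(features, weights))
```

In the real system the features would come from a deep network trained on the YouTube footage rather than being hand-supplied; the linear scoring step here just makes the frame-to-label mapping concrete.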
The research is set to be presented this week at the Conference on Computer Vision and Pattern Recognition (CVPR).
“Humans automatically learn to anticipate actions through experience, which is what made us interested in trying to imbue computers with the same sort of common sense,” said MIT PhD student and the paper’s first author Carl Vondrick.
“We wanted to show that just by watching large amounts of video, computers can gain enough knowledge to consistently make predictions about their surroundings,” he added.
In tests, the algorithm predicted the correct action 43 percent of the time when shown a still frame taken one second before the action happened. By way of comparison, human subjects correctly predicted the action 71 percent of the time.
Vondrick and his fellow researchers hope that the algorithm could one day help improve the way robots interact with humans.
“There’s a lot of subtlety to understanding and forecasting human interactions,” Vondrick said. “We hope to be able to work off of this example to be able to soon predict even more complex tasks.
“I’m excited to see how much better the algorithms get if we can feed them a lifetime’s worth of videos. We might see some significant improvements that would get us closer to using predictive-vision in real-world situations.”