This Technique Can Make It Easier for AI to Understand Videos

Whether it’s dubious viral memes, gaffe-prone presidential debates, or surreal TikTok remixes, you could spend the rest of your life trying to watch all the video footage posted on YouTube in a single day. Researchers want to let artificial intelligence algorithms watch and make sense of it instead.

A group from MIT and IBM developed an algorithm capable of accurately recognizing actions in videos while consuming a small fraction of the processing power previously required, potentially changing the economics of applying AI to large amounts of video. The method adapts an AI approach used to process still images to give it a crude concept of passing time.

The work is a step towards having AI recognize what’s happening in video, perhaps helping to tame the vast amounts now being generated. On YouTube alone, over 500 hours of video were uploaded every minute during May 2019.

Companies would like to use AI to automatically generate detailed descriptions of videos, letting users discover clips that haven’t been annotated. And, of course, they would love to sell ads based on what’s happening in a video, perhaps showing pitches for tennis lessons as soon as someone starts live-streaming a match. Facebook and Google also hope to use AI to automatically spot and filter illegal or malicious content, although this may prove an ongoing game of cat and mouse. It will be a challenge to do all this without significantly increasing the carbon footprint of AI.

Tech companies like to flaunt their use of AI, but it’s still not used much to analyze video. YouTube, Facebook, and TikTok use machine learning algorithms to sort and recommend clips, but they appear to rely primarily on the metadata associated with a video, such as the description, tags, and when and where it was uploaded. All are working on methods that analyze the contents of videos, but these approaches require a lot more computer power.

“Video understanding is so important,” says Song Han, an assistant professor at MIT who led the new work. “But the amount of computation is prohibitive.”

The energy consumed by AI algorithms is rising at an alarming rate, too. By some estimates, the amount of computer power used in cutting-edge AI experiments doubles about every three and a half months. In July, researchers at the Allen Institute for Artificial Intelligence called on the field to publish details of the energy efficiency of its algorithms, to help address this looming environmental problem.

This could be especially important as companies tap AI to analyze video. There have been big advances in image recognition in recent years, largely thanks to deep learning, a statistical technique for extracting meaning from complex data. Deep learning algorithms can detect objects based on the pixels shown in an image.

But deep learning is less adept at interpreting video. Analyzing a video frame won’t reveal what’s happening unless that frame is compared with the ones that come before and after—a person holding a door may be opening it or closing it, for example. And while Facebook researchers developed, in 2015, a version of deep learning that incorporates changes over time, that approach is relatively unwieldy.

By Han’s estimates, it can take 50 times as much data, and eight times as much processing power, to train a deep learning algorithm to interpret a video as a still image.
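To get a rough feel for the difference, consider the weights alone. The PyTorch snippet below is purely illustrative (the 64-channel layer sizes are arbitrary choices, not drawn from any particular system): a 3D convolution’s kernel spans an extra temporal dimension, so an otherwise identical layer carries roughly three times as many weights, and it must be applied across a stack of frames rather than one image at a time.

```python
import torch.nn as nn

# Illustrative only: 64-channel layers with 3x3 (and 3x3x3) kernels.
conv2d = nn.Conv2d(64, 64, kernel_size=3, padding=1)          # per-frame
conv3d = nn.Conv3d(64, 64, kernel_size=(3, 3, 3), padding=1)  # across frames

params_2d = sum(p.numel() for p in conv2d.parameters())  # 64*64*3*3   + 64 = 36,928
params_3d = sum(p.numel() for p in conv3d.parameters())  # 64*64*3*3*3 + 64 = 110,656

print(f"3D layer holds {params_3d / params_2d:.1f}x the weights of the 2D layer")  # ~3.0x
```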

Together with two colleagues, Han developed a solution, dubbed the Temporal Shift Module. Conventional deep learning algorithms for video recognition perform a 3D operation (known as a convolution) on multiple video frames at once. Han’s approach uses a more efficient 2D algorithm of the sort typically used for still images. The Temporal Shift Module provides a way to capture the relationship between the pixels in one frame and those in the next without performing the full 3D operation. As the 2D algorithm processes each frame in turn while incorporating information from adjacent frames, it gains a sense of things unfolding over time, allowing it to detect the actions shown.
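The article describes the shift only at a high level, so the PyTorch sketch below is a minimal illustration of the general idea rather than the team’s released code: before an ordinary 2D convolution runs, a fraction of each frame’s feature channels is displaced one frame forward or backward in time, letting the per-frame convolution mix in its neighbors’ features. The class name, the one-eighth shift fraction, and the tensor sizes are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalShift2D(nn.Module):
    """Illustrative wrapper: shift some channels along time, then run a 2D conv.

    Expects clips flattened to (batch * frames, channels, height, width).
    """

    def __init__(self, conv2d: nn.Conv2d, n_frames: int, shift_div: int = 8):
        super().__init__()
        self.conv2d = conv2d        # a plain per-frame 2D convolution
        self.n_frames = n_frames    # frames per clip
        self.shift_div = shift_div  # shift 1/shift_div of channels each way

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        nt, c, h, w = x.shape
        n = nt // self.n_frames
        x = x.view(n, self.n_frames, c, h, w)  # recover the time axis

        fold = c // self.shift_div
        out = torch.zeros_like(x)
        # One slice of channels moves forward in time (frame t sees t-1) ...
        out[:, 1:, :fold] = x[:, :-1, :fold]
        # ... another moves backward (frame t sees t+1) ...
        out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]
        # ... and the remaining channels stay put.
        out[:, :, 2 * fold:] = x[:, :, 2 * fold:]

        # The 2D conv still runs frame by frame, but each frame now
        # carries a sliver of its neighbors' features.
        return self.conv2d(out.view(nt, c, h, w))

# Usage: two clips of eight frames each, 64 feature channels per frame.
clips = torch.randn(2 * 8, 64, 56, 56)
layer = TemporalShift2D(nn.Conv2d(64, 64, kernel_size=3, padding=1), n_frames=8)
features = layer(clips)  # shape: (16, 64, 56, 56)
```

Because the shift itself only moves data, with no multiplications, the extra temporal context comes at almost no additional compute, which is the crux of the efficiency claim.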

The trio tested their algorithm on a video dataset known as Something-Something, which was created by paying thousands of people to perform simple tasks, from pouring tea to opening jars.
