Video classification is similar to image classification: an algorithm uses a feature extractor, such as a convolutional neural network (CNN), to extract feature descriptors from a sequence of images and then classifies them. Unlike images, however, which can be cropped and rescaled to a fixed size, videos vary widely in temporal extent and cannot easily be fed into a fixed-size architecture. Each clip in a video contains many contiguous frames, so the network's connectivity can be extended along the time dimension to learn spatio-temporal features.
Deep learning in video classification provides a means to analyze, classify & track activity contained in visual data sources such as a video stream. Video classification has many applications such as gesture recognition, surveillance, human activity recognition, and anomaly detection to name a few.
Videos are simply a series of frames, but training neural networks on videos is not that simple: the sheer amount of data makes it challenging. A typical approach takes an image-based network and trains it on all the frames from all videos in the dataset. Another naive method is to pass each video through a CNN and classify each frame individually, independently of the other frames: label each frame with its highest-probability class, then assign the video the label that occurs most often among its frames.
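The naive frame-by-frame approach can be sketched in a few lines. This is a minimal illustration, assuming the per-frame probabilities have already been produced by some image classifier (the `frame_probs` input is hypothetical data, not the output of a real model):

```python
from collections import Counter

def classify_video(frame_probs, labels):
    """Label a video by majority vote over per-frame predictions.

    frame_probs: list of per-frame class-probability lists, e.g. the
                 softmax output of an image CNN run on each frame.
    labels:      class names, index-aligned with the probabilities.
    """
    # Pick the most probable class for each frame independently.
    per_frame = [labels[max(range(len(p)), key=p.__getitem__)]
                 for p in frame_probs]
    # Assign the video the label chosen for the most frames.
    return Counter(per_frame).most_common(1)[0][0]

# Toy example: 3 classes, 4 frames.
labels = ["walking", "running", "jumping"]
probs = [
    [0.7, 0.2, 0.1],   # frame 1 -> walking
    [0.6, 0.3, 0.1],   # frame 2 -> walking
    [0.2, 0.7, 0.1],   # frame 3 -> running
    [0.8, 0.1, 0.1],   # frame 4 -> walking
]
print(classify_video(probs, labels))  # -> walking
```

The weakness of this scheme is visible in the code: no information crosses frame boundaries, so any motion cue is lost.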
A common deep learning architecture for video classification is the 3D-CNN, whose 3D convolutional blocks capture the spatio-temporal information needed to classify video content. Another common design is the multi-stream architecture, in which spatial and temporal information are processed separately and the features extracted from the different streams are then fused to make a decision. Two common ways of processing the temporal information are recurrent neural networks (RNNs) and optical flow.
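The core operation of a 3D-CNN is convolution over time as well as space. A minimal single-channel, single-filter sketch (deliberately naive loops, no deep learning framework) shows how a kernel slides across frames:

```python
import numpy as np

def conv3d(clip, kernel):
    """Valid 3D convolution (really cross-correlation, as is conventional
    in deep learning) of a single-channel clip with one filter.

    clip:   (T, H, W) stack of grayscale frames
    kernel: (t, h, w) spatio-temporal filter
    """
    T, H, W = clip.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):           # slide over time
        for j in range(out.shape[1]):       # slide over height
            for k in range(out.shape[2]):   # slide over width
                out[i, j, k] = np.sum(clip[i:i+t, j:j+h, k:k+w] * kernel)
    return out

clip = np.random.rand(8, 16, 16)   # 8 frames of 16x16
kernel = np.random.rand(3, 3, 3)   # a 3x3x3 spatio-temporal filter
features = conv3d(clip, kernel)
print(features.shape)  # (6, 14, 14)
```

Because the kernel spans several frames, each output value mixes appearance and motion, which is exactly what a 2D convolution applied per frame cannot do.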
The development of the 3D-CNN paved the way for automatic video classification models built on a variety of deep learning architectures. Spatiotemporal convolutional networks integrate temporal and spatial information within convolutional networks, relying primarily on convolution and pooling layers. Two-stream (and multi-stream) networks use stacked optical flow to capture motion in addition to the appearance of the context frames. Recurrent spatial networks use recurrent neural networks (RNNs), such as LSTMs or GRUs, to model the temporal information in videos. Mixed convolutional models, often built on the ResNet architecture, are of particular interest: they use 3D convolution only in the bottom or top layers and 2D convolution in the remainder, and this family also includes methods based on mixed temporal convolutions with different kernel sizes. Besides these architectures, there are also hybrid approaches that integrate CNN and RNN architectures.
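The recurrent family above boils down to folding a sequence of per-frame feature vectors into one hidden state. A minimal sketch, using a plain Elman-style recurrence rather than a full LSTM/GRU and random weights in place of trained ones, illustrates the idea:

```python
import numpy as np

def rnn_classify(frame_feats, Wxh, Whh, Why):
    """Run a minimal (Elman-style) RNN over per-frame feature vectors and
    classify the video from the final hidden state. A real pipeline would
    use a trained LSTM/GRU over CNN features; this shows only the recurrence.
    """
    h = np.zeros(Whh.shape[0])
    for x in frame_feats:              # one step per frame: temporal modeling
        h = np.tanh(Wxh @ x + Whh @ h)
    scores = Why @ h                   # logits over video classes
    e = np.exp(scores - scores.max())
    return e / e.sum()                 # softmax class probabilities

rng = np.random.default_rng(0)
feat_dim, hidden, n_classes, n_frames = 4, 8, 3, 10
frames = rng.normal(size=(n_frames, feat_dim))
probs = rnn_classify(frames,
                     rng.normal(size=(hidden, feat_dim)) * 0.1,
                     rng.normal(size=(hidden, hidden)) * 0.1,
                     rng.normal(size=(n_classes, hidden)))
print(probs)  # one probability per video class
```

Unlike the majority-vote scheme, the hidden state carries information across frames, so the order of the frames matters to the prediction.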
The visual modality works better than the text and audio modalities, and combining all modalities can yield better performance at the cost of increased computation.
The architectures employing CNN/RNN for feature extraction can perform better than hand-crafted features provided that enough data is available for training.
Tensor-Train Layer-based RNNs, such as LSTMs and GRUs, outperform plain RNN architectures for video classification.
Optical flow is sometimes necessary for datasets like UCF-101.
Optical flow is not always helpful, especially for videos taken in the wild, e.g., Sports-1M.
It is important to use a sophisticated sequence-processing architecture, such as an LSTM, to take advantage of optical flow.
LSTMs applied to both the optical flow and the image frames yield the highest performance on the Sports-1M benchmark dataset.
Augmenting the RGB input with optical flow helps improve performance.
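The combination of RGB and optical-flow streams reported in these findings is usually realized as late fusion: each stream produces its own class probabilities, which are then averaged. A minimal sketch, with the stream weight and the toy probabilities as illustrative assumptions:

```python
import numpy as np

def fuse_streams(rgb_probs, flow_probs, w_rgb=0.5):
    """Late fusion of a spatial (RGB) stream and a temporal (optical-flow)
    stream: a weighted average of per-class probabilities. The weight is a
    tunable assumption; two-stream setups often weight the flow stream
    more heavily.
    """
    fused = w_rgb * rgb_probs + (1.0 - w_rgb) * flow_probs
    return fused / fused.sum()

rgb = np.array([0.5, 0.3, 0.2])    # appearance stream favors class 0
flow = np.array([0.1, 0.8, 0.1])   # motion stream favors class 1
print(fuse_streams(rgb, flow, w_rgb=0.4).argmax())  # -> 1
```

In the toy example the motion evidence outweighs the appearance evidence, so the fused prediction follows the flow stream, which is the kind of correction that makes the two-stream design effective on motion-heavy classes.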
Furthermore, deep learning approaches are outperforming other state-of-the-art approaches for video classification. Google Trends interest in deep learning is still growing and remains above that of several other well-known machine learning algorithms. However, recent developments in deep learning are still under evaluation and require further investigation for video classification tasks. One such example is geometric deep learning: trends data suggests that interest in this topic is still confined to a few states of the US, and it has yet to be developed and investigated further. Using geometric deep learning to extract rich spatial information from videos could be a new research direction for better accuracy in video classification tasks.
For large-scale video classification use cases, DesiCrew offers video labeling services to provide you with a consistent flow of expertly labeled data. Our tried-and-tested methodology and long-term talent pool ensure consistent labeling practices across time and projects. Reach us at firstname.lastname@example.org to learn more.