Full metadata

Title

Video2Vec: learning semantic spatio-temporal embedding for video representations

Description

High-level inference tasks in video applications such as recognition, video retrieval, and zero-shot classification have become an active research area in recent years. One fundamental requirement for such applications is to extract high-quality features that maintain high-level information in the videos.

Many video feature extraction algorithms have been proposed, such as STIP, HOG3D, and Dense Trajectories. The features these algorithms produce are often referred to as “handcrafted” because they were deliberately designed around reasonable assumptions rather than learned from data. However, these algorithms may fail when dealing with high-level tasks or videos of complex scenes. Due to the success of deep convolutional neural networks (CNNs) in extracting global representations for static images, researchers have been applying similar techniques to video content. Typical techniques first extract spatial features by processing raw frames with deep convolutional architectures designed for static image classification. Simple averaging, concatenation, or classifier-based fusion/pooling methods are then applied to the extracted features. I argue that features extracted in such ways do not capture enough representative information: videos, unlike images, should be characterized as a temporal sequence of semantically coherent visual content, and thus need to be represented in a manner that considers both semantic and spatio-temporal information.

In this thesis, I propose a novel architecture that learns a semantic spatio-temporal embedding for videos to support high-level video analysis. The proposed method encodes the spatial and temporal information of a video separately, using a deep architecture that consists of two channels of convolutional neural networks (capturing appearance and local motion), each followed by a corresponding Fully Connected Gated Recurrent Unit (FC-GRU) encoder that captures the longer-term temporal structure of the CNN features. The resulting spatio-temporal representation (a vector) is used to learn a mapping, via a Fully Connected Multilayer Perceptron (FC-MLP), to the word2vec semantic embedding space, which gives the video vector a semantic interpretation that supports high-level analysis. I evaluate the usefulness and effectiveness of this new video representation through experiments on action recognition, zero-shot video classification, and semantic (word-to-video) retrieval, using the UCF101 action recognition dataset.
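
The data flow described in the abstract can be illustrated with a minimal, untrained sketch in pure Python. It is not the thesis implementation. Everything specific here is an assumption made for illustration: random vectors stand in for per-frame CNN features, the GRU is a standard bias-free formulation, the two channel encodings are fused by concatenation, a single linear layer stands in for the FC-MLP, the tiny 3-dimensional label vectors stand in for word2vec embeddings, and labels are matched by cosine similarity. All names (GRUEncoder, embed_video, nearest_label) are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class GRUEncoder:
    """Standard GRU without biases; the final hidden state summarizes the sequence."""

    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        # Input and recurrent weights for update gate z, reset gate r, candidate n.
        self.W = {g: rand_matrix(hid_dim, in_dim, rng) for g in "zrn"}
        self.U = {g: rand_matrix(hid_dim, hid_dim, rng) for g in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], rh))]
        # Blend the previous state with the candidate state.
        return [(1 - zi) * hi + zi * ni for zi, hi, ni in zip(z, h, n)]

    def encode(self, frames):
        h = [0.0] * self.hid_dim
        for x in frames:
            h = self.step(x, h)
        return h


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def embed_video(appearance, motion, enc_a, enc_m, W_map):
    """Encode each channel, concatenate (assumed fusion), map to the semantic space."""
    video_vec = enc_a.encode(appearance) + enc_m.encode(motion)
    return matvec(W_map, video_vec)  # one linear layer standing in for the FC-MLP


def nearest_label(vec, label_vectors):
    """Pick the label whose embedding is most similar to vec (assumed matching rule)."""
    return max(label_vectors, key=lambda name: cosine(vec, label_vectors[name]))


rng = random.Random(0)
FEAT, HID, SEM = 8, 4, 3  # toy sizes: frame feature, GRU state, semantic space
enc_a = GRUEncoder(FEAT, HID, rng)  # appearance channel
enc_m = GRUEncoder(FEAT, HID, rng)  # local-motion channel
W_map = rand_matrix(SEM, 2 * HID, rng)

# Five frames of placeholder features per channel.
appearance = [[rng.random() for _ in range(FEAT)] for _ in range(5)]
motion = [[rng.random() for _ in range(FEAT)] for _ in range(5)]

# Placeholder label embeddings; real ones would come from a word2vec model.
labels = {"archery": [1.0, 0.0, 0.0], "biking": [0.0, 1.0, 0.0], "diving": [0.0, 0.0, 1.0]}

vec = embed_video(appearance, motion, enc_a, enc_m, W_map)
print(nearest_label(vec, labels))
```

Because the weights are random, the printed label is meaningless; the sketch only shows how a video could be reduced to one vector and compared against label embeddings. Under the same assumptions, word-to-video retrieval would reverse the comparison: fix one label vector and rank the stored video vectors by cosine similarity to it.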

Date Created

2016

Contributors

Hu, Sheng-Hung (Author)
Li, Baoxin (Thesis advisor)
Turaga, Pavan (Committee member)
Liang, Jianming (Committee member)
Tong, Hanghang (Committee member)
Arizona State University (Publisher)

Topical Subject

Resource Type

Text

Genre

Masters Thesis

Academic theses

Extent

vii, 54 pages : illustrations (chiefly color)

Language

eng

Copyright Statement

In Copyright

Reuse Permissions

Primary Member of

ASU Electronic Theses and Dissertations

Peer-reviewed

No

Open Access

No

Handle

https://hdl.handle.net/2286/R.I.40765

Statement of Responsibility

by Sheng-Hung Hu

Description Source

Viewed on January 17, 2017

Level of coding

full

Note

thesis

Partial requirement for: M.S., Arizona State University, 2016

bibliography

Includes bibliographical references (pages 51-54)

Field of study: Computer science

System Created

2016-12-01 07:03:54

System Modified

2021-08-30 01:20:32
