This dissertation targets the recognition of human actions in realistic video data, such as movies. To this end, we develop state-of-the-art feature extraction algorithms that robustly encode video information for both action classification and action localization.
In the first part, we study bag-of-features approaches for action classification. Recent approaches that use a bag-of-features representation have shown excellent results on realistic video data. We therefore conduct an extensive comparison of existing methods for local feature detection and description. We then propose two new approaches for describing local features in videos. The first method extends histograms of gradient orientations to the spatio-temporal domain. The second method describes trajectories of spatially detected interest points. Both descriptors are evaluated in a bag-of-features setup and improve on the state of the art for action classification.
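To make the bag-of-features representation concrete, the following is a minimal sketch of the pipeline the comparison builds on: local descriptors extracted from a video are quantized against a visual vocabulary (here a toy k-means, not the thesis's actual clustering setup), and the video is represented as a normalized histogram of visual-word occurrences. All function names and parameters are illustrative assumptions.

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=10, seed=0):
    """Toy k-means: cluster local descriptors into k visual words."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), size=k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest cluster center
        dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centers as the mean of their assigned descriptors
        for j in range(k):
            if np.any(labels == j):
                centers[j] = descriptors[labels == j].mean(axis=0)
    return centers

def bag_of_features(descriptors, centers):
    """Quantize descriptors and return an L1-normalized word histogram."""
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    hist = np.bincount(dists.argmin(axis=1), minlength=len(centers)).astype(float)
    return hist / hist.sum()
```

In practice the histogram would be fed to a classifier (e.g. an SVM); the sketch stops at the representation itself.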
In the second part, we investigate how human detection can help action recognition. First, we develop an approach that combines human detection with a bag-of-features model. Its performance is evaluated for action classification with varying resolutions of spatial layout information. Next, we explore the spatio-temporal localization of human actions in Hollywood movies. We extend a human tracking approach to work robustly on realistic video data. Furthermore, we develop an action representation that is adapted to human tracks. Our experiments suggest that action localization benefits significantly from human detection. In addition, our system shows a large improvement over current state-of-the-art approaches.
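The idea of localizing an action along a human track can be sketched as follows: given per-frame features pooled over a track's bounding boxes, slide a temporal window along the track, score each window with a linear classifier, and keep the best interval. This is an illustrative simplification, not the thesis's actual localization method; the classifier weights `w` and window length `win` are hypothetical placeholders.

```python
import numpy as np

def localize_action(track_feats, w, win=5):
    """Slide a temporal window along a human track and return the
    best-scoring frame interval [start, end) plus its score.

    track_feats: (n_frames, dim) per-frame features along the track
    w:           (dim,) linear classifier weights
    """
    scores = []
    for t in range(len(track_feats) - win + 1):
        pooled = track_feats[t:t + win].mean(axis=0)  # average-pool over window
        scores.append(float(w @ pooled))              # linear classifier score
    best = int(np.argmax(scores))
    return (best, best + win), scores[best]
```

A full system would run this per action class and apply non-maximum suppression over overlapping windows; the sketch only shows the core scoring loop.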