Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos.

Published in CVPR, 2023

This paper studies weakly supervised sequential video understanding, where accurate timestamp-level text-video alignment is not provided. Borrowing ideas from CLIP, we aggregate frame-level features into a video-level representation and separately encode the text corresponding to each action and to the whole video.
[paper][code]
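
A minimal sketch of the CLIP-style idea described above: per-frame features are temporally aggregated into a video-level embedding and matched against text embeddings via a symmetric contrastive loss. This is not the authors' released implementation; the module names, dimensions, and the single transformer layer used as the aggregator are illustrative assumptions.

```python
# Illustrative sketch only: aggregate frame features, project video and text
# into a shared space, and apply a CLIP-style contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTextAligner(nn.Module):
    def __init__(self, frame_dim=512, text_dim=512, embed_dim=256, num_heads=4):
        super().__init__()
        # Temporal aggregation of frame-level features (a single transformer
        # encoder layer stands in for the real aggregator here).
        self.temporal = nn.TransformerEncoderLayer(
            d_model=frame_dim, nhead=num_heads, batch_first=True
        )
        self.video_proj = nn.Linear(frame_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        # Learnable temperature, as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, frame_feats, text_feats):
        # frame_feats: (B, T, frame_dim) per-frame features of each video
        # text_feats:  (B, text_dim) text embedding (per action or whole video)
        video = self.temporal(frame_feats).mean(dim=1)         # (B, frame_dim)
        video = F.normalize(self.video_proj(video), dim=-1)    # (B, embed_dim)
        text = F.normalize(self.text_proj(text_feats), dim=-1) # (B, embed_dim)
        # Symmetric contrastive loss over the batch: matched video-text
        # pairs lie on the diagonal of the similarity matrix.
        logits = self.logit_scale.exp() * video @ text.t()     # (B, B)
        labels = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))
```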

Recommended citation:

@inproceedings{dong2023weakly,
  title={Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos},
  author={Dong, Sixun and Hu, Huazhang and Lian, Dongze and Luo, Weixin and Qin, Yanwei and Gao, Shenghua},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}