Vae94 OP t1_iy7xyk6 wrote

Great stuff. I see the 3D ViT examples only cover several (dozen) frames, though, not hundreds of thousands.

In my experiments so far I tried classifying these with an LSTM network, but the number of input features is too massive for realistic training, and I was only experimenting with videos already an order of magnitude smaller than what I want.