BrohammerOK t1_iy9ap9d wrote on November 29, 2022 at 6:38 PM

My first approach would be to sample N key frames uniformly from each long video and see whether training on those gives good validation performance (tune N as you see fit). I wouldn't use a 3D transformer: the sampled frames will be very far apart in time, so the sequential nature of the data shouldn't matter much unless your videos have some kind of general structure; you would know that, I guess. I would build a baseline that average-pools single-frame embeddings and feeds the result into a classification head, then check whether adding the time dimension helps at all. Because the sampling is random, you can draw many different sets of key frames from the same video and so create a lot of training data. Always inspect the sets of key frames visually first to make sure the approach makes sense. It is a good idea to spend a good amount of time looking at the data before even thinking about models and hyperparameters, especially if it isn't a standard dataset.
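
To make the sampling and pooling steps concrete, here is a rough sketch in plain Python. It assumes one particular reading of "uniformly": split the video into N equal segments and pick one random frame from each, so the frames cover the whole video but differ on every draw. The names `sample_key_frames` and `mean_pool` are made up for illustration, and the per-frame embeddings are assumed to come from whatever image encoder you pick (not shown here).

```python
import random


def sample_key_frames(num_frames, n, rng=None):
    """Return n frame indices, one drawn at random from each of n equal segments.

    Assumption: "uniform" means segment-wise random sampling, so indices are
    spread over the whole video and come out in increasing order.
    """
    if n <= 0 or num_frames < n:
        raise ValueError("need 1 <= n <= num_frames")
    if rng is None:
        rng = random.Random()
    # Segment i covers frames [bounds[i], bounds[i + 1]).
    bounds = [(i * num_frames) // n for i in range(n + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(n)]


def mean_pool(embeddings):
    """Average a list of equal-length per-frame vectors into one video vector."""
    if not embeddings:
        raise ValueError("no embeddings to pool")
    dim = len(embeddings[0])
    count = len(embeddings)
    return [sum(vec[d] for vec in embeddings) / count for d in range(dim)]
```

Calling `sample_key_frames` several times on the same video gives different key-frame sets, which is where the extra training data comes from. The pooled vector from `mean_pool` is what would go into the classification head for the baseline; it ignores frame order by construction, so any gain from a temporal model shows up as an improvement over it.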