BrohammerOK t1_j9wvrl7 wrote

You can work with 2 splits, which is a common practice. For a small dataset you can run 5- or 10-fold cross-validation with shuffling on 75-80% of the dataset (train) for hyperparameter tuning / model selection, fit the best model on the entirety of that set, and then evaluate/test on the remaining 20-25% that you held out. You can repeat the process multiple times with different seeds to get a better estimate of the expected performance, assuming that the input data at inference time comes from the same distribution as your dataset.
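
A minimal sketch of that workflow with scikit-learn; the dataset, estimator, and parameter grid here are illustrative assumptions, not part of the original advice:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X, y = load_breast_cancer(return_X_y=True)  # stand-in for a small dataset

scores = []
for seed in range(5):  # repeat with different seeds for a more stable estimate
    # hold out 20% as the test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=True, stratify=y, random_state=seed
    )
    # 5-fold CV with shuffling on the 80% train split, used only for model selection
    search = GridSearchCV(
        RandomForestClassifier(random_state=seed),
        param_grid={"max_depth": [3, 5, None], "n_estimators": [100, 300]},
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=seed),
    )
    search.fit(X_train, y_train)  # refit=True: best model is refit on the full train split
    scores.append(search.score(X_test, y_test))  # evaluate once on the held-out 20%

print(f"expected accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```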

1

BrohammerOK t1_j2ekoyj wrote

If you do use both in the same network, dropout should never be applied right before batch or layer norm, because the features zeroed out by dropout would distort the mean and variance statistics. For example, in CNNs it is common to use batch norm inside the conv blocks and apply dropout after the global average pooling (right before the final fc layer). Sometimes you even see dropout between conv blocks; take a look at Google's EfficientNet.
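
A minimal PyTorch sketch of that placement, assuming a toy CNN (the channel counts and dropout rate are illustrative): batch norm lives inside the conv blocks, and dropout only appears after global average pooling, where no norm layer sits downstream.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10, p_drop: float = 0.3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.BatchNorm2d(32),   # batch norm right after the conv
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling
        self.dropout = nn.Dropout(p_drop)    # dropout AFTER pooling, never right before a norm
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = self.pool(x).flatten(1)
        x = self.dropout(x)  # zeroed features feed only the fc layer, no norm statistics
        return self.fc(x)

model = SmallCNN()
logits = model(torch.randn(8, 3, 32, 32))  # (batch, num_classes)
```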

1

BrohammerOK t1_iy9ap9d wrote

My first approach would be to sample N key frames uniformly from each long video and see if I get good validation performance training on that (tune the value of N as you wish). I wouldn't use a 3D transformer: the sampled frames will be very far apart in time, so the sequential nature of the data shouldn't matter much unless your videos have some kind of overall structure, which you would know better than I do. I would build a baseline with something like average pooling of single-frame embeddings and a classification head, then check whether adding the time dimension helps at all. By randomly sampling in this way you can create a lot of data to train your model. Always inspect the sets of key frames visually first to make sure the approach makes sense. It's a good idea to spend a good amount of time looking at the data before even thinking about models and hyperparameters, especially if it isn't a standard dataset.
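
A minimal PyTorch sketch of that baseline, under assumed details (ResNet-18 as the frame backbone, N=8, and the tensor shapes are all illustrative): sample frames uniformly, embed each with a shared 2D backbone, average over time, and classify.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def sample_key_frames(video: torch.Tensor, n: int) -> torch.Tensor:
    """Uniformly sample n frames from a (T, C, H, W) video tensor."""
    idx = torch.linspace(0, video.shape[0] - 1, n).long()
    return video[idx]

class AvgPoolBaseline(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()  # keep the 512-d frame embeddings
        self.backbone = backbone
        self.head = nn.Linear(512, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, N, C, H, W) -> fold time into the batch dimension
        b, n = frames.shape[:2]
        emb = self.backbone(frames.flatten(0, 1))  # (B*N, 512)
        emb = emb.view(b, n, -1).mean(dim=1)       # average-pool over frames
        return self.head(emb)

video = torch.randn(300, 3, 224, 224)                # stand-in for a long video
frames = sample_key_frames(video, n=8).unsqueeze(0)  # (1, 8, 3, 224, 224)
logits = AvgPoolBaseline(num_classes=5)(frames)
```

Adding the time dimension back (e.g. a small transformer over the 8 frame embeddings) can then be compared against this to see whether temporal order actually helps.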

1