Submitted by Vae94 t3_z7rn5o in MachineLearning
Hello,
I am trying to figure out a classification problem with a non-trivial quantity of input features. Right now I am looking at binary classification of long videos (~1 million frames each), but so far I am stuck at barely 70,000 frames.
Is there some trick to dealing with these types of problems? The only thing that comes to mind at this point is to somehow compress/decimate my frames to shrink the input features in a way that ML can still predict something from them.
The other way would be to manually label a lot of frames one by one and construct some sort of meta-algorithm, but I'd like to try something less labour-intensive first.
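For the compress/decimate idea, the simplest form is probably temporal subsampling: keep a fixed number of frames spread evenly over the video and drop the rest. Below is a minimal sketch of just the index selection. The function name `uniform_frame_indices` and the frame budget are hypothetical, and whether evenly spaced frames retain enough signal depends entirely on the task.

```python
from typing import List


def uniform_frame_indices(n_frames: int, n_keep: int) -> List[int]:
    """Pick n_keep frame indices spread evenly across a video of n_frames."""
    if n_frames <= 0 or n_keep <= 0:
        return []
    if n_keep >= n_frames:
        # Nothing to decimate: keep every frame.
        return list(range(n_frames))
    # Split the video into n_keep equal segments and take the middle of each.
    return [(2 * i + 1) * n_frames // (2 * n_keep) for i in range(n_keep)]


# Example: shrink a 1,000,000-frame video to a budget of 4 frames.
kept = uniform_frame_indices(1_000_000, 4)
```

The kept indices can then be used to read only those frames from disk, so the full video never has to sit in memory.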
Deep-Station-1746 t1_iy7wq8f wrote
If you want to classify the entire video into one label, you can first try to embed the video frames into something that's better suited for binary classification. For example, take ViT and modify it to output a binary class per video: as it stands it outputs 1000 classes, so a batch of 4 inputs gives an output of shape (4, 1000). Then do the training.
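A rough sketch of what that head swap amounts to, using NumPy stand-ins rather than a real ViT. Everything here is an assumption made for illustration: the random arrays stand in for backbone features, `EMBED_DIM = 768` and 16 frames per video are arbitrary choices, and mean pooling over frames is just one simple way to get a single vector per video.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 768      # assumed feature width, not taken from the thread
FRAMES_PER_VIDEO = 16  # assumed number of sampled frames per video


def linear_head(features, weights, bias):
    """Plain linear classification head: features @ weights + bias."""
    return features @ weights + bias


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Stand-in for backbone output: 4 videos x 16 frames x EMBED_DIM features.
frame_features = rng.normal(size=(4, FRAMES_PER_VIDEO, EMBED_DIM))

# Collapse the frame axis so each video is a single vector (mean pooling).
video_features = frame_features.mean(axis=1)            # shape (4, 768)

# Original-style head: 1000 classes per input.
old_weights = rng.normal(size=(EMBED_DIM, 1000)) * 0.01
old_logits = linear_head(video_features, old_weights, np.zeros(1000))

# Replacement head: a single logit per video, squashed to a probability.
new_weights = rng.normal(size=(EMBED_DIM, 1)) * 0.01
new_logits = linear_head(video_features, new_weights, np.zeros(1))[:, 0]
video_probs = sigmoid(new_logits)                        # shape (4,)
```

In a real setup only the new head's weights would start from scratch, and training would use a binary cross-entropy loss on `video_probs` against the per-video labels.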
If you need to label each frame separately, use something like ViT on the individual frames. Same idea, but each frame gets evaluated separately. It just depends on what you want.
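For the per-frame case, the difference is that nothing gets pooled: the binary head is applied to every frame's features, and with around a million frames it helps to do that in chunks. A minimal sketch, assuming per-frame features are already available as an array; `label_frames`, `chunk_size`, and `threshold` are hypothetical names and values.

```python
import numpy as np


def label_frames(frame_features, weights, bias, chunk_size=4096, threshold=0.5):
    """Return one boolean label per frame, scoring the frames chunk by chunk."""
    n_frames = len(frame_features)
    labels = np.empty(n_frames, dtype=bool)
    for start in range(0, n_frames, chunk_size):
        chunk = frame_features[start:start + chunk_size]
        # Linear head followed by a sigmoid gives a per-frame probability.
        probs = 1.0 / (1.0 + np.exp(-(chunk @ weights + bias)))
        labels[start:start + chunk_size] = probs >= threshold
    return labels


# Tiny example: three frames with one-dimensional features.
example_labels = label_frames(
    np.array([[2.0], [-2.0], [0.0]]), np.array([1.0]), 0.0, chunk_size=2
)
```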
So, what do you need?