Submitted by Vae94 t3_z7rn5o in MachineLearning

Hello,

I am trying to figure out a classification problem with a non-trivial number of input features. I am looking at binary classification of long videos, on the order of a million frames, but right now I am stuck at barely 70,000 frames.

Is there some trick to dealing with these types of problems? The only thing that comes to mind at this point is to compress/decimate my frames to shrink the input features in a way that an ML model can still predict something from them.

Another way would be to manually label a lot of frames one by one and construct some sort of meta-algorithm, but I'd like to try something less labour-intensive first.


Comments


Deep-Station-1746 t1_iy7wq8f wrote

If you want to classify the entire video into one label, you can first try to embed the video frames into something that's better suited for binary classification. Like this:

# From https://github.com/lucidrains/vit-pytorch#vivit
import torch
from vit_pytorch.vivit import ViT

v = ViT(
    image_size = 128,          # image size
    frames = 16,               # number of frames
    image_patch_size = 16,     # image patch size
    frame_patch_size = 2,      # frame patch size
    num_classes = 1000,
    dim = 1024,
    spatial_depth = 6,         # depth of the spatial transformer
    temporal_depth = 6,        # depth of the temporal transformer
    heads = 8,
    mlp_dim = 2048
)

video = torch.randn(4, 3, 16, 128, 128) # (batch, channels, frames, height, width)

preds = v(video) # (4, 1000)

Modify the ViT to output a binary class per video (it currently outputs 1000 classes; see the output shape (4, 1000)), then do the training.
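For the binary case, a minimal sketch (assuming the same ViViT setup as above; the loss and optimizer choices here are just illustrative, not anything from the repo):

import torch
from vit_pytorch.vivit import ViT

v = ViT(
    image_size = 128,
    frames = 16,
    image_patch_size = 16,
    frame_patch_size = 2,
    num_classes = 1,           # single logit per video -> binary classification
    dim = 1024,
    spatial_depth = 6,
    temporal_depth = 6,
    heads = 8,
    mlp_dim = 2048
)

optimizer = torch.optim.Adam(v.parameters(), lr = 3e-4)
loss_fn = torch.nn.BCEWithLogitsLoss()

video = torch.randn(4, 3, 16, 128, 128)     # (batch, channels, frames, height, width)
labels = torch.randint(0, 2, (4,)).float()  # one 0/1 label per video

loss = loss_fn(v(video).squeeze(-1), labels)  # logits have shape (4,)
loss.backward()
optimizer.step()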

If you need to label each frame separately, use something like a plain (2D) ViT instead. Same idea, but each frame gets evaluated separately. It just depends on what you want.
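A rough sketch of that per-frame variant with the plain ViT from the same repo (the frame count and sizes are placeholders, not values from the thread):

import torch
from vit_pytorch import ViT

frame_model = ViT(
    image_size = 128,
    patch_size = 16,
    num_classes = 1,   # one logit per frame for a binary label
    dim = 1024,
    depth = 6,
    heads = 8,
    mlp_dim = 2048
)

frames = torch.randn(16, 3, 128, 128)   # treat the frame axis as the batch axis
per_frame_logits = frame_model(frames)  # (16, 1), one prediction per frame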

So, what do you need?


Vae94 OP t1_iy7xyk6 wrote

Great stuff. I see the 3D ViT examples only handle a few dozen frames, though, not hundreds of thousands.

In my experiments so far I tried an LSTM network to classify these, but the number of input features is too massive for realistic training, and I was only experimenting with videos an order of magnitude smaller than what I want.


eeng_ t1_iy82r1q wrote

This is probably obvious to you, but most of the frames in a long video are redundant and provide little additional information. You could easily extract some key frames (e.g. subtract the previous frame from the current frame and apply a fixed threshold), run your network only on those key frames, and then ensemble the key-frame predictions into a single label per video.
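A rough sketch of that frame-differencing idea with OpenCV (the threshold value and the grayscale conversion are my own illustrative choices, not something specified in the comment):

import cv2
import numpy as np

def extract_key_frames(path, diff_threshold = 30.0):
    cap = cv2.VideoCapture(path)
    key_frames = []
    prev_gray = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # keep a frame if it differs enough from the previous one
        if prev_gray is None or np.mean(cv2.absdiff(gray, prev_gray)) > diff_threshold:
            key_frames.append(frame)
        prev_gray = gray
    cap.release()
    return key_frames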


Vae94 OP t1_iy8fy1f wrote

Yes. Thanks for the sanity check!

I was thinking of first coming up with an algorithm to find outlier frames and then training the LSTM only on those outliers. For that I should assemble some meta-algorithm, I guess, and train both the LSTM and the trimming network at the same time.

I was wondering if something like this already exists in the literature?


BrohammerOK t1_iy9ap9d wrote

My first approach would be to sample N key frames uniformly from each long video and see if I get good validation performance training on that (tune the value of N as you wish). I wouldn't use a 3D transformer, because the sampled frames will be very far apart and the sequential nature of the data shouldn't matter that much, unless your videos have some kind of general structure; you would know that, I guess. I would build a baseline with something like average pooling of single-frame embeddings plus a classification head, then check whether adding the time dimension helps at all. By randomly sampling in this way you could create a lot of data to train your model.

Always inspect the sets of key frames visually first to make sure the approach makes sense. It is a good idea to spend a good amount of time looking at the data before even thinking about models and hyperparameters, especially if it isn't a standard dataset.
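A loose sketch of that baseline: uniform frame sampling, average-pooled frame embeddings, and a classification head. The ResNet-18 encoder and the value N = 32 are my own stand-ins; the comment doesn't specify an encoder.

import torch
import torch.nn as nn
import torchvision

class AveragePoolBaseline(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet18(weights = "IMAGENET1K_V1")
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the final FC layer
        self.head = nn.Linear(512, 1)                                  # single logit for binary classification

    def forward(self, frames):                       # frames: (batch, N, 3, H, W)
        b, n, c, h, w = frames.shape
        feats = self.encoder(frames.view(b * n, c, h, w)).view(b, n, -1)  # (b, n, 512)
        pooled = feats.mean(dim = 1)                 # average over the sampled frames
        return self.head(pooled)                     # (b, 1)

def sample_uniform_frames(video, n = 32):            # video: (num_frames, 3, H, W)
    idx = torch.linspace(0, video.shape[0] - 1, n).long()
    return video[idx]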
