Deep-Station-1746 t1_iy7wq8f wrote

If you want to classify an entire video with a single label, you can first try embedding the video frames into something better suited to binary classification. Like this:

# From
import torch
from vit_pytorch.vivit import ViT

v = ViT(
    image_size = 128,          # image size
    frames = 16,               # number of frames
    image_patch_size = 16,     # image patch size
    frame_patch_size = 2,      # frame patch size
    num_classes = 1000,
    dim = 1024,
    spatial_depth = 6,         # depth of the spatial transformer
    temporal_depth = 6,        # depth of the temporal transformer
    heads = 8,
    mlp_dim = 2048
)

video = torch.randn(4, 3, 16, 128, 128) # (batch, channels, frames, height, width)

preds = v(video) # (4, 1000)

Modify ViT to output a binary class per video. Right now it outputs 1000 classes (see the output shape: (4, 1000)), so shrink the output head by lowering num_classes. Then do the training.
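
A minimal sketch of what that binary setup could look like, not necessarily what the commenter had in mind: give the model a single output logit (for the ViT above, that would mean passing num_classes = 1) and train against 0/1 labels with binary_cross_entropy_with_logits. TinyVideoNet, train_step, and the labels below are illustrative assumptions; the stand-in model is only there so the snippet runs with plain PyTorch.

```python
import torch
from torch import nn


class TinyVideoNet(nn.Module):
    """Hypothetical stand-in for a video model with a single-logit head.

    Swap in the ViT from the snippet above, built with num_classes = 1.
    """

    def __init__(self, channels=3):
        super().__init__()
        self.head = nn.Linear(channels, 1)

    def forward(self, video):
        # video: (batch, channels, frames, height, width)
        pooled = video.mean(dim=(2, 3, 4))   # (batch, channels)
        return self.head(pooled)             # (batch, 1) raw logits


def train_step(model, video, labels, optimizer):
    """Run one optimisation step; labels are 0/1 floats of shape (batch,)."""
    model.train()
    optimizer.zero_grad()
    logits = model(video).squeeze(-1)        # (batch,)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
model = TinyVideoNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

video = torch.randn(4, 3, 16, 128, 128)      # same layout as above
labels = torch.tensor([0., 1., 1., 0.])      # made-up labels, one per video

loss = train_step(model, video, labels, optimizer)

# At inference time, a sigmoid turns each logit into a probability.
model.eval()
with torch.no_grad():
    probs = torch.sigmoid(model(video)).squeeze(-1)   # shape (4,)
```

A single logit with a sigmoid is one option; keeping two output classes and using cross-entropy would work as well.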

If you need to label each frame separately, use something like a regular image ViT instead. Same idea, but each frame gets evaluated separately. It just depends on what you want.
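
For the per-frame case, one way to reuse an image classifier is to fold the frame axis into the batch axis, run the model once, and unfold the result. This is only a sketch under assumptions: TinyFrameNet is a hypothetical stand-in for an image model such as a 2D ViT, and label_frames is an illustrative helper, not part of any library.

```python
import torch
from torch import nn


class TinyFrameNet(nn.Module):
    """Hypothetical stand-in for an image classifier (e.g. a 2D ViT)."""

    def __init__(self, channels=3, num_classes=1):
        super().__init__()
        self.head = nn.Linear(channels, num_classes)

    def forward(self, images):
        # images: (n, channels, height, width)
        return self.head(images.mean(dim=(2, 3)))   # (n, num_classes)


def label_frames(image_model, video):
    """Apply an image model to every frame of a batch of videos."""
    b, c, f, h, w = video.shape
    # Move frames next to batch, then merge the two axes.
    frames = video.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)
    logits = image_model(frames)                    # (b * f, num_classes)
    return logits.reshape(b, f, -1)                 # (b, f, num_classes)


torch.manual_seed(0)
frame_model = TinyFrameNet()
video = torch.randn(4, 3, 16, 128, 128)   # (batch, channels, frames, height, width)

with torch.no_grad():
    frame_logits = label_frames(frame_model, video)  # shape (4, 16, 1)
```

Each entry frame_logits[i, t] is the prediction for frame t of video i, so per-frame labels can be trained or thresholded independently.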

So, what do you need?


Vae94 OP t1_iy7xyk6 wrote

Great stuff. I see the 3D ViT examples only cover a few dozen frames, though, not hundreds of thousands.

In my experiments so far I tried an LSTM network to classify these, but the number of input features is too large for realistic training, and I was only experimenting with videos already an order of magnitude smaller than what I want.