Deep-Station-1746 t1_iy7wq8f wrote

If you want to classify an entire video into one label, you can first embed the video frames into a representation better suited for binary classification. Like this:

```python
# From https://github.com/lucidrains/vit-pytorch#vivit
import torch
from vit_pytorch.vivit import ViT

v = ViT(
    image_size = 128,          # image size
    frames = 16,               # number of frames
    image_patch_size = 16,     # image patch size
    frame_patch_size = 2,      # frame patch size
    num_classes = 1000,
    dim = 1024,
    spatial_depth = 6,         # depth of the spatial transformer
    temporal_depth = 6,        # depth of the temporal transformer
    heads = 8,
    mlp_dim = 2048
)

video = torch.randn(4, 3, 16, 128, 128) # (batch, channels, frames, height, width)

preds = v(video) # (4, 1000)
```

Modify the ViT to output a binary class per video — it currently outputs 1000 classes, as you can see from the output shape (4, 1000), so set num_classes accordingly. Then do the training.
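A minimal sketch of that training setup, assuming you set `num_classes = 1` and train with `BCEWithLogitsLoss` — the tiny `model` below is a hypothetical stand-in for the ViViT backbone above, just to keep the example runnable:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the ViT backbone, reduced to one output logit;
# with vit_pytorch you would instead pass num_classes = 1 to ViT(...).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 8 * 8, 1))

criterion = nn.BCEWithLogitsLoss()  # takes raw logits, applies sigmoid internally
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

video = torch.randn(4, 3, 16, 8, 8)      # (batch, channels, frames, height, width), tiny for the sketch
labels = torch.tensor([0., 1., 1., 0.])  # one binary label per video

logits = model(video).squeeze(-1)        # (4,) — one logit per video
loss = criterion(logits, labels)

optimizer.zero_grad()
loss.backward()
optimizer.step()

probs = torch.sigmoid(logits)            # per-video probability of the positive class
```

Thresholding `probs` at 0.5 then gives the binary label per video.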

If you need to label each frame separately, use a plain (2D) ViT instead. Same idea, but each frame gets evaluated separately. It just depends on what you want.
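The per-frame variant can be sketched by folding the frame dimension into the batch — `frame_model` here is a hypothetical stand-in for a 2D ViT (e.g. `vit_pytorch.ViT`):

```python
import torch
import torch.nn as nn

# Hypothetical per-frame classifier standing in for a 2D ViT with one output class.
frame_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 1))

B, C, T, H, W = 2, 3, 4, 8, 8
video = torch.randn(B, C, T, H, W)

# Fold frames into the batch so every frame is scored independently.
frames = video.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)  # (B*T, C, H, W)
frame_logits = frame_model(frames).view(B, T)                  # one logit per frame
```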

So, what do you need?


Vae94 OP t1_iy7xyk6 wrote

Great stuff. I see the 3D ViT examples only handle several dozen frames, though, not hundreds of thousands.

In my experiments so far I tried an LSTM network to classify these, but the number of input features is too massive for realistic training, and I was only experimenting with videos an order of magnitude smaller than what I want to handle.
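One way to make the LSTM route tractable is the parent comment's embedding idea: compress each frame to a small vector first, then run the LSTM over those embeddings, shrinking the per-step input by orders of magnitude. A sketch with a hypothetical frame encoder (in practice a pretrained ViT or CNN backbone) and an assumed embedding size:

```python
import torch
import torch.nn as nn

emb_dim = 32  # assumed embedding size; each frame goes from C*H*W values to emb_dim

# Hypothetical frame encoder — in practice a pretrained ViT / CNN feature extractor.
frame_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, emb_dim))

lstm = nn.LSTM(input_size=emb_dim, hidden_size=64, batch_first=True)
head = nn.Linear(64, 1)  # binary label for the whole video

B, C, T, H, W = 2, 3, 100, 8, 8  # T can now be much larger
video = torch.randn(B, C, T, H, W)

# Encode every frame, then treat the embeddings as the LSTM's input sequence.
frames = video.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)
emb = frame_encoder(frames).view(B, T, emb_dim)   # (B, T, emb_dim)

_, (h_n, _) = lstm(emb)                  # h_n: (num_layers, B, 64), final hidden state
video_logit = head(h_n[-1]).squeeze(-1)  # (B,) — one logit per video
```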
