Submitted by Vae94 t3_z7rn5o in MachineLearning
Deep-Station-1746 t1_iy7wq8f wrote
If you want to classify entire video into one label, you can first try to embed the video frames into something that's better suited for binary classification. Like this:
# From https://github.com/lucidrains/vit-pytorch#vivit
import torch
from vit_pytorch.vivit import ViT
v = ViT(
    image_size = 128,        # image size
    frames = 16,             # number of frames
    image_patch_size = 16,   # image patch size
    frame_patch_size = 2,    # frame patch size
    num_classes = 1000,
    dim = 1024,
    spatial_depth = 6,       # depth of the spatial transformer
    temporal_depth = 6,      # depth of the temporal transformer
    heads = 8,
    mlp_dim = 2048
)
video = torch.randn(4, 3, 16, 128, 128) # (batch, channels, frames, height, width)
preds = v(video) # (4, 1000)
Modify the ViT to output a binary label per video (it currently outputs 1000 classes; see the output shape, (4, 1000)). Then do the training.
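A minimal sketch of what that modification looks like, assuming you set num_classes = 1 and train with BCEWithLogitsLoss (the linear head here is a hypothetical stand-in for the ViT above, so the example runs without the full model):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the ViT trunk: any model whose final layer maps
# an embedding to a single logit works for binary video classification.
head = nn.Linear(1024, 1)            # replaces the 1000-way classification head
criterion = nn.BCEWithLogitsLoss()   # expects raw logits, applies sigmoid internally

embeddings = torch.randn(4, 1024)               # (batch, dim), as from the ViT trunk
labels = torch.randint(0, 2, (4, 1)).float()    # one binary label per video

logits = head(embeddings)            # (4, 1): one logit per video
loss = criterion(logits, labels)
loss.backward()                      # standard training step from here

probs = torch.sigmoid(logits)        # per-video probability of the positive class
```

With the real model you would pass num_classes = 1 to the ViT constructor instead of 1000 and feed the (batch, channels, frames, height, width) video tensor directly.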
If you need to label each frame separately, use something like a plain (2D) ViT instead. It just depends on what you want: same idea, but each frame gets evaluated separately.
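The usual trick for per-frame labels is to fold the frame axis into the batch axis so a 2D image classifier scores every frame independently. A sketch, using a hypothetical linear classifier in place of a real ViT so it runs standalone:

```python
import torch
import torch.nn as nn

# Hypothetical per-frame classifier; swap in a 2D ViT (or any image model)
# that maps a (channels, H, W) frame to a single logit.
frame_classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 128 * 128, 1))

video = torch.randn(4, 3, 16, 128, 128)        # (batch, channels, frames, H, W)
frames = video.permute(0, 2, 1, 3, 4)          # (batch, frames, channels, H, W)
frames = frames.reshape(-1, 3, 128, 128)       # (batch * frames, channels, H, W)

per_frame_logits = frame_classifier(frames)    # (64, 1): one logit per frame
per_frame_logits = per_frame_logits.view(4, 16)  # regroup as (batch, frames)
```

Each frame then gets its own binary decision, with no temporal modeling at all.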
So, what do you need?
Vae94 OP t1_iy7xyk6 wrote
Great stuff. I see the 3D ViT examples only handle several dozen frames, though, not hundreds of thousands.
In my experiments so far I tried an LSTM network to classify these, but the number of input features is too massive for realistic training, and I was only experimenting with videos an order of magnitude smaller than what I actually want to handle.