eeng_ t1_iy82r1q wrote

This is probably obvious to you, but most of the frames in a long video are redundant and provide little additional information. You could easily extract some key frames (e.g. subtract the previous frame from the current frame and apply a fixed threshold to the difference), then run your network only on the key frames and ensemble those key-frame predictions into a single label per video.
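A rough sketch of what that could look like, in case it helps. Everything here is an assumption for illustration: frames are flat sequences of grayscale pixel values (to keep it dependency-free; in practice they would be arrays from a video decoder), the difference measure is the mean absolute pixel difference, the ensemble is a plain majority vote, and `classify` is a hypothetical stand-in for your network.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence

# Assumption: a frame is a flat sequence of grayscale pixel intensities.
Frame = Sequence[float]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute per-pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_key_frames(frames: Sequence[Frame], threshold: float) -> List[int]:
    """Return indices of frames that differ from the previous frame by more
    than `threshold`. The first frame is always kept, so a completely static
    video still yields one key frame."""
    if not frames:
        return []
    keys = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[i - 1]) > threshold:
            keys.append(i)
    return keys


def predict_video(
    frames: Sequence[Frame],
    classify: Callable[[Frame], Hashable],  # hypothetical per-frame model
    threshold: float,
) -> Hashable:
    """Run `classify` on key frames only and majority-vote a single label."""
    keys = select_key_frames(frames, threshold)
    if not keys:
        raise ValueError("video has no frames")
    votes = Counter(classify(frames[i]) for i in keys)
    return votes.most_common(1)[0][0]
```

Note that comparing against the previous frame can miss slow drift, where no single step exceeds the threshold; comparing against the last selected key frame instead is one possible variant. If your network outputs probabilities, averaging them over the key frames is another way to ensemble.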

3