Submitted by Dense-Smf-6032 t3_y6iu0l in MachineLearning

Hello All

How is video tracking different from image detection? From my understanding, tracking within a video can simply be done with per-frame object detection, then using NMS to combine these objects (based on their overlap). However, my friend told me this might not be an efficient method (because it runs at the per-frame level).

What is the current norm for doing video tracking? Do these methods run at the per-frame level?
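To be concrete, this is roughly what I had in mind (just a sketch; the per-frame detector is assumed to exist, boxes are in (x1, y1, x2, y2) format, and the greedy IoU matching is my own naive placeholder rather than any standard method):

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def link_frames(prev_boxes, curr_boxes, iou_thresh=0.5):
    """Greedily link each detection in the current frame to the
    best-overlapping, not-yet-used detection from the previous frame."""
    links = {}   # index into curr_boxes -> index into prev_boxes
    used = set()
    for j, cb in enumerate(curr_boxes):
        best_i, best_o = None, iou_thresh
        for i, pb in enumerate(prev_boxes):
            if i in used:
                continue
            o = iou(pb, cb)
            if o > best_o:
                best_i, best_o = i, o
        if best_i is not None:
            links[j] = best_i
            used.add(best_i)
    return links
```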

5

Comments


marcus_hk t1_ispjwcw wrote

Object detection involves a predetermined set of object classes.

Video tracking means tracking an arbitrary thing with a bounding box around it. There are no classes per se.

4

saynay t1_isplccn wrote

Object detection per-frame is certainly one of the simpler ways to do it, but it has some limitations depending on the use case you are looking to solve. It doesn't handle occlusion very well, for example. Depending on the type of video you are operating on, and the number of frames you are processing, it can also be pretty inefficient (after all, almost nothing has changed from one frame to the next, so in theory you should not have to fully re-analyze the entire frame, and could instead carry forward some of the processing you did on the last frame).
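As a toy example of what I mean by carrying work forward (just a sketch; `detect` is a placeholder for whatever per-frame detector you use, and it naively assumes detections come back in the same order every time):

```python
import numpy as np

def track_with_skipped_detection(frames, detect, stride=5):
    """Run the (expensive) detector only every `stride` frames and propagate
    boxes with a simple constant-velocity estimate in between."""
    boxes, velocity = None, 0.0
    tracks = []
    for t, frame in enumerate(frames):
        if t % stride == 0 or boxes is None:
            new_boxes = np.asarray(detect(frame))        # (N, 4) array of boxes
            # assumes detections keep a consistent order (a big simplification)
            if boxes is not None and new_boxes.shape == boxes.shape:
                velocity = (new_boxes - boxes) / stride  # average per-frame displacement
            boxes = new_boxes
        else:
            boxes = boxes + velocity                     # cheap propagation, no detector call
        tracks.append(boxes.copy())
    return tracks
```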

There are also quite a number of sub-tasks. Are you trying to track a single object only? Are you trying to track every object of a given class? Do you need to identify and track new objects as they enter the scene, or do you know everything you want to track from the first frame? Do you need to be running in realtime?

Multi-object tracking (MOT) is the computer vision term most commonly used for the task, so you can find a lot of algorithms under that name. DeepSORT was one I found pretty interesting, even though it is not that great anymore, just for the combination of methods it uses to accomplish the task: it detects the objects, estimates a frame-to-frame velocity, predicts the most likely locations with a Kalman filter, then uses a neural network to re-identify the target in the next frame.
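To illustrate just the Kalman-prediction part of that pipeline (this is not DeepSORT's actual filter, just a toy constant-velocity model on the box centre):

```python
import numpy as np

class ConstantVelocityKF:
    """Toy constant-velocity Kalman filter on a box centre (cx, cy).
    State is [cx, cy, vx, vy]; much simpler than DeepSORT's real state."""

    def __init__(self, cx, cy, q=1.0, r=10.0):
        self.x = np.array([cx, cy, 0.0, 0.0])   # state estimate
        self.P = np.eye(4) * 100.0               # state covariance
        self.F = np.eye(4)                       # constant-velocity motion model
        self.F[0, 2] = self.F[1, 3] = 1.0
        self.H = np.eye(2, 4)                    # we only observe the position
        self.Q = np.eye(4) * q                   # process noise
        self.R = np.eye(2) * r                   # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                        # predicted centre, used for matching

    def update(self, z):
        """z is the matched detection's centre in the new frame."""
        y = np.asarray(z) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```

Each track would keep one of these: `predict()` gives the location to match against the next frame's detections, `update()` folds in the matched detection, and DeepSORT adds an appearance embedding on top for the re-identification step.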

4

Dense-Smf-6032 OP t1_ispqwhq wrote

Thanks for the answer. What do people use nowadays, besides DeepSORT?

2

sje397 t1_isrm4mi wrote

The difference is that when you're tracking, you want to identify whether the bounding boxes in two successive frames belong to the same object or to two different objects of the same type. There's a bunch of complexity, like the linear sum assignment problem (that is, greedily assigning the same object ID to the closest pair of bounding boxes in successive frames can give a worse result than minimising the total distance between matched boxes overall), and whether you track the centres of bounding boxes or look at e.g. IoU (intersection over union).
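For example, the globally optimal version of that matching is exactly what `scipy.optimize.linear_sum_assignment` solves. A rough sketch, where building the IoU matrix is left out and the 0.3 gate is an arbitrary choice:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_tracks_to_detections(iou_matrix, min_iou=0.3):
    """iou_matrix[i, j] = IoU between track i's predicted box and detection j.
    Solves the assignment globally instead of greedily picking nearest boxes."""
    cost = 1.0 - iou_matrix                       # higher IoU -> lower cost
    track_idx, det_idx = linear_sum_assignment(cost)
    matches = [(i, j) for i, j in zip(track_idx, det_idx)
               if iou_matrix[i, j] >= min_iou]    # gate out implausible pairs
    return matches
```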

2

Anaphylaxisofevil t1_ispq1tl wrote

In the most general sense, tracking is the same as detection, but with priors based on a prediction of what you're expecting to see, given past history. So tracking obviously requires an image sequence, while detection only needs a single timestep. Tracking can potentially be faster and more accurate, because you have access to this extra information, which limits your search space, but it is also reliant on the quality of your prediction; bad predictions mean tracking failures.
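A crude sketch of how a prediction prior can shrink the search space: run the detector only on a crop around where you expect the object to be (`detect` and the margin are placeholders, not any particular library):

```python
def detect_around_prediction(frame, predicted_box, detect, margin=0.5):
    """Run the detector only on a crop around the predicted box,
    then map the results back to full-frame coordinates."""
    x1, y1, x2, y2 = predicted_box
    w, h = x2 - x1, y2 - y1
    # expand the predicted box by a margin to allow for motion / prediction error
    cx1 = max(0, int(x1 - margin * w))
    cy1 = max(0, int(y1 - margin * h))
    cx2 = min(frame.shape[1], int(x2 + margin * w))
    cy2 = min(frame.shape[0], int(y2 + margin * h))
    crop = frame[cy1:cy2, cx1:cx2]
    boxes = detect(crop)                          # much smaller input than the full frame
    # shift crop-relative boxes back into full-frame coordinates
    return [(bx1 + cx1, by1 + cy1, bx2 + cx1, by2 + cy1)
            for bx1, by1, bx2, by2 in boxes]
```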

I'm not completely sure if this is the level of answer you're looking for though.

1

Dense-Smf-6032 OP t1_ispqyw8 wrote

I see. How do I make the video tracker do fast inference (if I don't want to run it at the per-frame level)?

1

Anaphylaxisofevil t1_isptitd wrote

It really comes down to making (and training) an adaptation of your per-frame detector that incorporates prediction priors, then devising a method for producing those priors from previous frames' output. I'm not familiar enough with the specifics of your particular problem to add much more, I'm afraid.

1