suflaj t1_jebxmmx wrote on March 30, 2023 at 10:03 PM

If you have a small dataset, then Transformers are out of the question, especially if we're talking pretraining and all.

Seems to me like you might be interested in classical ML methods, such as XGBoost. Since you have tabular data, it will probably outperform all other methods at first. From there, you would be trying to find a model better tailored to the task, depending on how you want to use your data. Given your data situation, you would be looking at deep LSTMs for the end game. But currently, it doesn't matter if it's 20 or 2000 samples (idk how you count them); that's not enough to solve something you claim is too difficult to model mathematically outright.

Reinforcement learning might not be adequate, given that you say the problem is too difficult to model mathematically. RL will only be useful to you if the problem is difficult to model because it is wide, i.e. it is hard for you to narrow it down to a general formula. If the problem is hard in the sense that it is difficult or narrow, then your agent might not be able to figure out how to solve the task at all, and you would have to think out the training regimen really well to teach it anything. RL is not really well suited for very hard problems.

Finally, it doesn't seem to me that you have an environment set up for the agent, because if you did, your problem would already be solved, given that setting one up would require you to model it mathematically. And if it were easy to obtain data in the first place, you would have way more than 20 or 2000 samples. That's why I presume that RL is completely out of the question for you as well.

I would personally not tackle this problem with trajectories. If you want to solve this using DL, then you should create a bigger dataset using actual camera recordings, and then either label the bounding boxes or segment the images. Then you can use any of the pretrained backbones and simply train an object detector. Given the object's offset in the next frame, you can calculate the movement for the camera.
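To make the last step concrete, here is a minimal sketch of turning detector output into a camera movement. Everything in it is an assumption for illustration: boxes are taken to be `(x_min, y_min, x_max, y_max)` in pixels, the camera is treated as an ideal pinhole with a known field of view, and the function names are hypothetical. A real setup would need calibration and lens distortion handling.

```python
import math

def box_center(box):
    """Center (x, y) of a box given as (x_min, y_min, x_max, y_max) in pixels."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)

def pixel_offset(prev_box, next_box):
    """Pixel displacement of the box center between two consecutive frames."""
    (px, py), (nx, ny) = box_center(prev_box), box_center(next_box)
    return (nx - px, ny - py)

def recentering_angles(box, image_size, fov_deg):
    """Pan/tilt angles (degrees) that would bring the box center to the image center.

    Assumes an ideal pinhole camera; fov_deg is (horizontal, vertical) field of view.
    """
    width, height = image_size
    hfov, vfov = fov_deg
    # Focal lengths in pixels derived from the assumed field of view.
    fx = (width / 2.0) / math.tan(math.radians(hfov) / 2.0)
    fy = (height / 2.0) / math.tan(math.radians(vfov) / 2.0)
    cx, cy = box_center(box)
    pan = math.degrees(math.atan2(cx - width / 2.0, fx))
    tilt = math.degrees(math.atan2(cy - height / 2.0, fy))
    return pan, tilt

# Hypothetical detections in a 640x480 frame with a 90 x 60 degree field of view.
prev_box = (100, 100, 120, 120)
next_box = (110, 95, 130, 115)
print(pixel_offset(prev_box, next_box))
print(recentering_angles(next_box, (640, 480), (90.0, 60.0)))
```

The sign convention (positive pan to the right, positive tilt downward in image coordinates) is also an assumption and would have to match the actual camera mount.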

This is a task so generic that with just a few hundred to a thousand samples you can probably get a semi-supervised labelling scheme going - with some other model labelling the images automatically, and then you just need a few humans judging these labels or correcting them. And this task is so trivial and widespread that you can find a workforce to do this anywhere.

The question is what performance you would expect. But in all cases I would say that if you need a very robust solution, you should probably look into mathematically modelling it - we are presumably talking about a differential system in the background, which is not going to be easily solved by any mainstream DL model. All methods mentioned here can essentially be dumbed down to a very large non-linear equation. They can only mimic a differential system up to a certain precision, determined by their width and depth, as well as the statistical significance of your samples.
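As a toy illustration of that last point (not a DL model, just a stand-in I am assuming for the sake of the example): take the differential system dx/dt = -x with x(0) = 1, whose exact solution is exp(-t), and approximate it with a fixed-size closed-form expression, here a truncated power series. The number of terms plays the role of model capacity, and the achievable precision is capped by it.

```python
import math

def truncated_solution(t, n_terms):
    """First n_terms of the power series of exp(-t), the solution of dx/dt = -x, x(0) = 1."""
    return sum((-t) ** k / math.factorial(k) for k in range(n_terms))

def max_error(n_terms, t_max=2.0, n_points=50):
    """Largest absolute deviation from the exact solution on [0, t_max]."""
    ts = [t_max * i / (n_points - 1) for i in range(n_points)]
    return max(abs(truncated_solution(t, n_terms) - math.exp(-t)) for t in ts)

# More terms (more "capacity") gives a smaller worst-case error on the same interval.
for n_terms in (4, 8, 12):
    print(n_terms, max_error(n_terms))
```

A learned model has the additional handicap that its coefficients come from noisy, finite samples rather than from the equation itself.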

x11ry0 OP t1_jefvk0p wrote on March 31, 2023 at 6:36 PM

Hi, thanks a lot for this very detailed answer. I will have a look at your suggestions. Yes, we are looking into the mathematical equations too. We fear that they will be quite imprecise, but on the other hand I understand very well the issue with the small dataset currently available. We may end up exploring different solutions.