
Alors_HS t1_is0f9h0 wrote

I had to deploy my models on NVIDIA Jetson / TX devices for my last job, about 1.5 years ago.

In these use cases there is a lot of optimisation to do. A list of available methods: pruning, mixed-precision training/inference, quantization, CUDA/ONNX/NVIDIA-specific optimizations, and training models that perform well on lower-resolution data via knowledge distillation from models trained on higher-resolution data...
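For a concrete taste of one of those methods, here is a minimal post-training dynamic quantization sketch in PyTorch; the toy model, shapes, and layer choices are made up for illustration and are not the models mentioned above:

```python
import torch
import torch.nn as nn

# Stand-in model; imagine your trained network here.
model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
).eval()

# Swap Linear layers for int8 dynamically-quantized equivalents,
# shrinking the weights roughly 4x and speeding up CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 10])
```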

Look around, this is off the top of my head from a while ago. There are plenty of resources now for inference on the edge.


muunbo OP t1_is169qt wrote

Cool, how did you learn all those techniques? And how did you determine what the major cause of the excessive memory usage / slow inference time was?


Alors_HS t1_is1871l wrote

Well, I needed to solve my problem, so I looked at papers / software solutions. Then it was a slow process of many iterations of trial and error.

I couldn't tell you what would be best for your use case, though. I'm afraid it's been too long for me to remember the details. Besides, each method may be more or less effective depending on the results you need on your metrics / inference time and on the training methods and resources you can afford.

I can give you a tip: I initialized my inference script only once per boot and then put it in "waiting mode", so I wouldn't have to initialize the model for each inference (that's the largest source of lost time). Upon receiving a socket message, the script would read a data file, do an inference pass, write the results to another file, delete/move the data to storage, and wait for the next socket message. It's obvious when you think about it that you absolutely don't want to call/initialize your inference script once per inference, but, well, you never know what people think about :p
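Roughly, that pattern looks like the minimal sketch below: the paths, port, and message protocol are hypothetical, and it assumes a TorchScript model with NumPy-readable data files rather than the exact setup described above.

```python
import socket
import shutil
from pathlib import Path

import numpy as np
import torch

MODEL_PATH = Path("model.ts")      # hypothetical TorchScript export
STORAGE_DIR = Path("processed")    # where handled inputs get moved
SOCKET_ADDR = ("localhost", 5000)  # illustrative address

def main():
    # Heavy work happens exactly once per boot: load the weights.
    model = torch.jit.load(str(MODEL_PATH)).eval()

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(SOCKET_ADDR)
    server.listen(1)

    while True:
        # "Waiting mode": block until a client announces a new data file.
        conn, _ = server.accept()
        with conn:
            data_path = Path(conn.recv(1024).decode().strip())

            # One cheap inference pass per message, no re-initialization.
            batch = torch.from_numpy(np.load(data_path)).float()
            with torch.no_grad():
                preds = model(batch)

            # Write the results to another file, then archive the input.
            np.save(data_path.with_suffix(".preds.npy"), preds.numpy())
            STORAGE_DIR.mkdir(exist_ok=True)
            shutil.move(str(data_path), STORAGE_DIR / data_path.name)

            conn.sendall(b"done")

if __name__ == "__main__":
    main()
```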


muunbo OP t1_is2atui wrote

Haha, that's amazing to hear because I had a similar experience! The data scientist on the team was re-initializing the model, the image matrices, and other objects in the data pipeline over and over again for every inference. I had to decouple those in a similar way to what you did.

Everyone usually thinks the model is the problem, but I'm starting to think that the rest of the code is usually what actually needs to be optimized.
