StephaneCharette t1_itsddz6 wrote

Note from the Darknet + YOLO FAQ: "Can I train a neural network using synthetic images?"

>No.
>
>Or, to be more precise, you'll probably end up with a neural network that is great at detecting your synthetic images, but unable to detect much in real-world images.

Source: https://www.ccoderun.ca/programming/darknet_faq/#synthetic_images

I made that statement several years ago, and after all this time I still think the correct answer is "no". Every time I have tried to use synthetic images, it has not worked out as planned.

Looking at your "Link1" and "Link2", it is immediately obvious this is not going to work. You cannot crop your objects: https://www.ccoderun.ca/programming/darknet_faq/#crop_training_images

Darknet/YOLO (and under the covers, I believe Ultralytics is using Darknet) learns from context, not only from what is inside the bounding boxes. So if you are trying to detect snowboarders with those symbols, you'll do OK. But if you are expecting to pass in images or video frames with clothes, then that snowboarder and that bus are doing nothing to help you.

Want proof? Here is a YOLO neural network video I happened to upload to YouTube today: https://www.youtube.com/watch?v=m3Trxxt9RzE

Note the "6" and "9" on those cards. They are correctly recognized, no confusion even though the font used makes those 2 numbers look identical when rotated 180 degrees. YOLO really does look at much more than just the bounding box.


StephaneCharette t1_ir7nomj wrote

I cannot help but think, "oh yeah, this framework over here is 50x faster than anything else, but everyone has forgotten about it until just now..."

If <something> really gave a 50x improvement, wouldn't everyone already be using it?

Having said that, the reason I use Darknet/YOLO is specifically because the whole thing compiles to a C++ library: a DLL on Windows, and a .a or .so on Linux. I can squeeze out a few more FPS by using the OpenCV DNN implementation instead of Darknet directly, but it is not trivial to use correctly.

However, if you're working with ONNX then I suspect you're already achieving speeds higher than using Darknet or OpenCV as the framework.

One thing to remember: resizing images (aka video frames) is SLOWER than inference. I don't know what your PyTorch and ONNX frameworks do when the input image is larger than the network, but when I take timing measurements with Darknet/YOLO and OpenCV's DNN, I end up spending more time resizing the video frames than I do in inference. This is a BIG deal, which most people ignore or trivialize. If you can size your network correctly, or you can adjust the video capture to avoid resizing, you'll likely more than double your FPS. See these performance numbers for example: https://www.ccoderun.ca/programming/2021-10-16_darknet_fps/#resize
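As a sketch of the "size your network correctly" advice: in a Darknet `.cfg`, the `[net]` section sets the network's input dimensions, and both must be multiples of 32. The values below are illustrative assumptions (a hypothetical 640x480 capture), not figures from the post:

```ini
# [net] section of a hypothetical Darknet/YOLO .cfg, sized so the
# network input exactly matches a 640x480 video capture -- no per-frame
# resize is needed before inference. Both dimensions are multiples of 32,
# which Darknet requires.
[net]
batch=1          # inference: process one frame at a time
subdivisions=1
width=640        # matches the capture width (assumed 640x480 camera)
height=480       # matches the capture height
channels=3       # BGR/RGB frames
```

If the camera cannot be configured to a 32-divisible resolution, the alternative is to pick the nearest valid network size and accept one resize, measuring both paths to see which is faster end to end.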
