Viewing a single comment thread. View all comments

chazzmoney t1_iw4l6c8 wrote

If you only need to detect hotdog/pizza and dog/cat, its a fine solution. I was using those as examples, but usually its much more drastic - “transcribe speech to text” and “identify speech from background noise” and “identify speaker”. Or “answer trivia question”, “fill in the blank with the correct choice”, “determine if the text has a positive or negative sentiment”, “determine the main topic of the text”,. Etc…. quite complicated tasks.

Thus, there are a few reason it doesn’t work in practice:

  • Generality

  • Efficiency (hardware memory)

  • Efficiency (training computation)

  • Efficiency (inference latency)

Having a general network is more interesting - a single network that can solve multiple problems is more useful and more applicable to problems that may be similar or you don’t know you have yet. It can also be easier to “fine-tune” an existing network because you don’t have enough data on a given problem to train it from scratch.

Efficiency is (in my opinion), the bigger one:

  1. To run a network, the entire network and parameters for it must be stored in memory. These days, they are on the order of gigabytes for “interesting” networks. Putting a multiplier (multiple networks) on this makes scaling quite challenging.

  2. Training one general network may be harder, but it is much faster than training a new network from scratch for each problem. If you have thousands of problems, you don’t want to be training thousands of networks.

  3. The size of “interesting” models makes inference challenging as well. The bigger (more interesting)the model, the more computation it must perform on each input; some modeling techniques are loops and require thousands of runs for each input. This seems fine at first but if a single inputs takes more than 10ms, a thousand loops will take longer than one second. Usually this means that the most interesting models have to be used on high end cloud equipment, which brings about further scaling challenges.

So your answer isn’t wrong; the practicality of the situation just makes it infeasible when you are working with large models (vision, video, language, etc) - the number of parameters is often in the billions.

4

StevenTM t1_iw4nmz3 wrote

Thank you for the in-depth answer, it was very interesting!

3

chazzmoney t1_iw4s9ic wrote

Of course - thanks for the great questions!

3