
NTIASAAHMLGTTUD t1_j8c0cj3 wrote

In the interest of skepticism, can anyone pour any cold water on this or is it as good as it sounds?


turnip_burrito t1_j8c1932 wrote

It's only tested on one benchmark, called ScienceQA. Testing it on others would let us see how well it really stacks up.


el_chaquiste t1_j8c1z83 wrote

If I understand correctly, the training set (a science exam with solved exercises and detailed responses) is smaller than GPT-3.5's, yet the model outperforms both GPT-3.5 and humans by some percentage on problems similar to those in the exam, and by more when it has multimodal training that includes visual data.

I honestly don't know whether we should get too excited about this, but it suggests we could build smaller models focused on specific scientific and technical domains, with better accuracy in their responses than generalist LLMs.



SoylentRox t1_j8cblun wrote

Theoretically it should query a large number of models, assign each a "confidence" based on how likely that model's answer is to be correct, and then return the most confident answer.
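A minimal sketch of that ensemble idea, with placeholder model functions standing in for real LLM calls (the names, answers, and confidence values are all illustrative assumptions, not any real API):

```python
# Sketch: query several models, each returning (answer, confidence),
# then pick the answer with the highest confidence score.
# model_a / model_b are hypothetical stand-ins for real model calls.

def model_a(question: str) -> tuple[str, float]:
    # Returns (answer, confidence in [0, 1]); values are illustrative.
    return "B", 0.72

def model_b(question: str) -> tuple[str, float]:
    return "C", 0.55

def most_confident_answer(question: str, models) -> str:
    # Collect every model's (answer, confidence) pair.
    answers = [m(question) for m in models]
    # Keep the answer whose confidence is highest.
    best_answer, _ = max(answers, key=lambda pair: pair[1])
    return best_answer

print(most_confident_answer("Which option?", [model_a, model_b]))  # → B
```

In practice the confidence would come from something like the model's token log-probabilities or a learned verifier, not a hardcoded number.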


ReadSeparate t1_j8fb4cr wrote

One can easily imagine a generalist LLM outputting an action token which represents prompting the specialized LLM, which then gets routed to the specialized LLM, then the response is formatted and put into context by the generalist.
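That routing loop could look roughly like the following sketch, where the action token format, the specialist registry, and both model functions are invented placeholders rather than any real system:

```python
# Sketch: a generalist model emits an action token naming a specialist,
# the sub-prompt is forwarded to that specialist, and the specialist's
# reply is folded back into the generalist's context.
# All model calls here are hypothetical stand-ins.

SPECIALISTS = {
    "science": lambda prompt: "Photosynthesis converts light into chemical energy.",
}

def generalist(prompt: str) -> str:
    # A real generalist would decide when to delegate; hardcoded here.
    return "<CALL:science> Explain photosynthesis briefly."

def run(prompt: str) -> str:
    output = generalist(prompt)
    if output.startswith("<CALL:"):
        # Parse "<CALL:name> sub-prompt" into its parts.
        name, _, sub_prompt = output[len("<CALL:"):].partition("> ")
        specialist_reply = SPECIALISTS[name](sub_prompt)
        # Put the formatted specialist response back into context.
        return f"{prompt}\n[specialist:{name}] {specialist_reply}"
    return output

print(run("Explain photosynthesis."))
```

The interesting design question is the action token itself: it has to be something the generalist can learn to emit reliably, which is why tool-use systems typically train on examples of the delegation format.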


Cryptizard t1_j8d86dt wrote

The humans they tested on were random people on Mechanical Turk, so that data point is not very illuminating.


Borrowedshorts t1_j8erk8h wrote

It's as good as it sounds, and you can't really fake performance on a dataset like this. Multimodal models will change the game. I don't think multimodal models by themselves are the end game, but they appear poised to take over state-of-the-art performance for the foreseeable future.