astonzhang

astonzhang t1_j79i4jj wrote

Hi, I am an author of the paper. Opinions below are my own.

After we arXiv-ed our "Automatic Chain of Thought Prompting in Large Language Models" paper in Oct 2022 (here's a TLDR, ICLR'23), we were asking ourselves:

"If AGI (artificial general intelligence) is the goal, what kind of chain of thought (CoT) research do we need next? Is relying on a text-only generalist model that can perform text-only multitasks the final answer?"

"How can we connect the dots between NLP and CV communities so more researchers can contribute?"

"Since not everyone can afford playing with large models, how can we deal with input in more general form (text and images) *without* relying on larger models so a larger research community can contribute?"

One day I was teaching my kid how to solve arithmetic reasoning problems (not from the MultiArith dataset...). My kid told me that it's much easier to understand reasoning problems with the help of figure illustrations.

"Oh, can we leverage vision input to improve chain of thought reasoning?"

"The current generalist models like GPT-3.5 (text-davinci-002/003) only offer a blackbox API (at a cost) for transforming text input into text output. Why not just fine-tune a smaller model where we have full control of all its layers (whitebox) to fuse inputs in a more general form?"

Fortunately, Pan Lu et al. released the ScienceQA benchmark just in time. This is a great contribution to the community, and we benefited from it by testing our idea early on this benchmark (see the acknowledgement in our GitHub repo). Showing the promise of fine-tuning a smaller model on task-specific datasets (rather than feeding in-context learning demos to a larger generalist LLM) is exactly what we wanted in this study (you may feel more motivated after reading the T-Few paper).

If you feel motivated to try parameter-efficient fine-tuning (PEFT) ideas from the aforementioned T-Few paper to improve Multimodal-CoT, you may also wish to check out our recent PEFT design space paper at ICLR'23 (here's a TLDR).
