Submitted by Singularian2501 t3_10svwch in MachineLearning
Paper: https://arxiv.org/abs/2302.00923
Github: https://github.com/amazon-science/mm-cot
Twitter: https://paperswithcode.com/top-social
Abstract:
>Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies are mostly isolated in the language modality with LLMs, where LLMs are hard to deploy. To elicit CoT reasoning in multimodality, a possible solution is to fine-tune small language models by fusing the vision and language features to perform CoT reasoning. The key challenge is that those language models tend to generate hallucinated reasoning chains that mislead the answer inference. To mitigate the effect of such mistakes, we propose Multimodal-CoT that incorporates vision features in a decoupled training framework. The framework separates the rationale generation and answer inference into two stages. By incorporating the vision features in both stages, the model is able to generate effective rationales that contribute to answer inference. With Multimodal-CoT, our model under 1 billion parameters outperforms the previous state-of-the-art LLM (GPT-3.5) by 16% (75.17%->91.68%) on the ScienceQA benchmark and even surpasses human performance.
astonzhang t1_j79i4jj wrote
Hi, I am an author of the paper. Opinions below are my own.
​
After we arXiv-ed our "Automatic Chain of Though Prompting in Large Language Models" paper in Oct 2022 (here's a TLDR, ICLR'23), we were asking ourselves:
"If AGI (artificial general intelligence) is the goal, what kind of chain of thought (CoT) research do we need next? Is relying on a text-only generalist model that can perform text-only multitasks the final answer?"
"How can we connect the dots between NLP and CV communities so more researchers can contribute?"
"Since not everyone can afford playing with large models, how can we deal with input in more general form (text and images) *without* relying on larger models so a larger research community can contribute?"
​
One day I was teaching my kid how to solve arithmetic reasoning problems (not from the MultiArith dataset...). My kid told me that it's much easier to understand reasoning problems with the help from figure illustrations.
"Oh, can we leverage vision input to improve chain of thought reasoning?"
"The current generalist models like GPT-3.5 (text-davinci-002/003) only offer a blackbox API (at a cost) for transforming text input into text output. Why not just fine-tune a smaller model where we have full control of all its layers (whitebox) to fuse inputs in a more general form?"
​
Fortunately, Pan Lu et al. released the ScienceQA benchmark, just in time. This is a great contribution to the community and we benefited from it by testing our idea early on this benchmark (see acknowledgement in our GitHub repo). Showing the promise of fine-tuning a smaller model with task-specific datasets (rather than feeding in-context learning demos to a larger generalist LLM) is exactly what we wanted in this study (you may feel more motivated after reading the T-Few paper).
If you feel motivated to try parameter-efficient fine-tuning (PEFT) ideas from the aforementioned T-Few paper to improve Multimodal-CoT, you may also wish to check out our recent PEFT design space paper at ICLR'23 (here's a TLDR).