
utilop t1_itwxusd wrote

I think that would make sense, and I could see the small models, in particular with CoT, failing to produce a valid answer.

For both MMLU and BBH, they report a worse average score with CoT than the direct prompt.

I would take that as CoT not reliably producing correct explanations, since it does not lead to better answers.

Could be that the problem is their prompt, few-shot setup, or calibration though?

Maybe for the sake of experimentation, take one of the tasks where CoT performs considerably better than the direct prompt?
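
A minimal sketch of such a comparison, assuming a hypothetical `generate(prompt) -> str` callable that wraps whatever model is being tested. The two prompt templates, the `(A)`-style answer format, and the `extract_choice` helper are illustrative assumptions, not the setup from the paper. It counts unparseable outputs separately from wrong ones, which would show whether CoT is failing by not producing a valid answer at all.

```python
import re
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical templates; the paper's exact prompts may differ.
DIRECT_TEMPLATE = "Q: {question}\nA:"
COT_TEMPLATE = "Q: {question}\nA: Let's think step by step."


def extract_choice(output: str) -> Optional[str]:
    """Return the last option written as (A)..(D) in the output, or None."""
    matches = re.findall(r"\(([A-D])\)", output)
    return matches[-1] if matches else None


def evaluate(
    generate: Callable[[str], str],
    examples: Iterable[Tuple[str, str]],
    template: str,
) -> Dict[str, int]:
    """Tally correct, wrong, and invalid (unparseable) answers for one template."""
    counts = {"correct": 0, "wrong": 0, "invalid": 0}
    for question, gold in examples:
        output = generate(template.format(question=question))
        choice = extract_choice(output)
        if choice is None:
            counts["invalid"] += 1
        elif choice == gold:
            counts["correct"] += 1
        else:
            counts["wrong"] += 1
    return counts
```

Running `evaluate` once with `DIRECT_TEMPLATE` and once with `COT_TEMPLATE` on the same examples gives two tallies that can be compared directly. A high `invalid` count under CoT would point at the prompt or decoding setup rather than at the reasoning itself.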

2

AuspiciousApple OP t1_itwy8bj wrote

>Maybe for the sake of experimentation, take one of the tasks where CoT performs considerably better than the direct prompt?

That sounds like a good idea, though NLP isn't really my field, so I might also be using the wrong sampling parameters or making subtle mistakes in writing the question (e.g. punctuation, line breaks, etc.), so I was hoping someone here would know more.

Even for English to German translation, the model often generated obvious nonsense, sometimes even just repeating the English phrase, despite using the prompt as it is in the Hugging Face config/paper.
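
One way to put a number on the "just repeating the English phrase" failure is a small echo check over (source, output) pairs. This is only a sketch: `normalize`, `is_echo`, and `echo_rate` are hypothetical helper names, and the comparison ignores case, punctuation, and whitespace so that trivial differences do not hide an echo.

```python
import re
from typing import Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase and keep only word tokens, joined by single spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))


def is_echo(source: str, output: str) -> bool:
    """True if the output is the source phrase repeated back unchanged."""
    return normalize(source) == normalize(output)


def echo_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (source, output) pairs where the model echoed the source."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(is_echo(src, out) for src, out in pairs) / len(pairs)
```

Tracking this rate while varying one thing at a time (sampling parameters, trailing punctuation, line breaks in the prompt) would show which of those choices the model is sensitive to.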

1