axm92 t1_j6uf2a7 wrote on February 1, 2023 at 11:23 PM

Reply to comment by LetterRip in [R] Faithful Chain-of-Thought Reasoning by starstruckmon

Ah I see, thanks for clarifying. I see your point, but I wouldn't say that the prompts require an extensive knowledge of the test set. After all:

> As an example, for the ~10 math reasoning datasets used in PaL, identical prompts were used (same prompt for all datasets, without changing anything).

Notably, take a look at the section on GSM-hard (4.1). You may also enjoy the analysis in the new version of the paper (Section 6: https://arxiv.org/pdf/2211.10435.pdf).

Further, "Let's think step by step" is outperformed by "Write Python code to solve this." We'll add the numbers in the next version, but if you are interested please lmk and I can share the results earlier.

Thanks again for reading our work and sharing your feedback, I really appreciate it.

LetterRip t1_j6uj087 wrote on February 1, 2023 at 11:50 PM

> Further, "Let's think step by step" is outperformed by "Write Python code to solve this."

Interesting. While reading that paper, I was just wondering how well that would work compared to the n-shot prompts.

> Ah I see, thanks for clarifying. I see your point, but I wouldn't say that the prompts require an extensive knowledge of the test set. After all:

>> As an example, for the ~10 math reasoning datasets used in PaL, identical prompts were used (same prompt for all datasets, without changing anything).

That's fair. My thoughts were mostly directed at the "Table 2: Solve rate on three symbolic reasoning datasets and two algorithmic datasets" items. I think you could be right that my comments don't apply to the results in Figure 5 (GSM8K, GSM-HARD, SVAMP, ASDIV, SINGLEEQ, SINGLEOP, ADDSUB, MULTIARITH).

I'd be curious how well the "Write Python code to solve this" prompt performs on its own vs. the "Let's think step by step" prompt.
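
To make the comparison concrete, here is a minimal sketch of the two zero-shot setups being contrasted. It is not the paper's code or prompts: the `Q:`/`A:` layout, the helper names, the convention that the generated program stores its result in a variable called `answer`, and the hand-written `example_completion` (standing in for a model response) are all assumptions made for illustration.

```python
# Hypothetical sketch: zero-shot "think step by step" vs. "write Python code".
# Nothing here comes from the PaL codebase; names and formats are assumptions.

COT_TRIGGER = "Let's think step by step."
CODE_TRIGGER = "Write Python code to solve this."


def build_zero_shot_prompt(question: str, trigger: str) -> str:
    """Attach a zero-shot trigger phrase to a question (assumed Q:/A: layout)."""
    return f"Q: {question}\nA: {trigger}"


def run_generated_program(program: str) -> object:
    """Execute model-written code and return its `answer` variable.

    Assumes the program assigns its final result to `answer`.
    exec() runs arbitrary code, so real use would need sandboxing.
    """
    namespace: dict = {}
    exec(program, namespace)
    return namespace["answer"]


question = (
    "A cafeteria had 23 apples. They used 20 for lunch and bought 6 more. "
    "How many apples do they have?"
)

cot_prompt = build_zero_shot_prompt(question, COT_TRIGGER)
code_prompt = build_zero_shot_prompt(question, CODE_TRIGGER)

# Hand-written stand-in for what a model might return for `code_prompt`.
example_completion = """
apples_start = 23
apples_used = 20
apples_bought = 6
answer = apples_start - apples_used + apples_bought
"""

result = run_generated_program(example_completion)
print(result)
```

With the code prompt, the arithmetic is done by the Python interpreter rather than by the model, so the final answer can be checked by running the program instead of parsing free-form text.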