YonatanBitton OP t1_iua6jz7 wrote

Thank you :) Random chance with 10-12 candidates is pretty low, 17%-24%, so the fine-tuned model performance of 55% is well above random chance. However, we still see that humans perform much better. A possible explanation for this gap is that the dataset is challenging: it contains complex social and cultural cues that challenge current models, which were not trained on similar tasks. We explored this direction in the last section (Table 6), where there are easier classes like "visually salient" (more similar to the model's pre-training task) with performance of 67%, and more difficult ones (different from the pre-training) like "visually non-salient" with 36%.
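For intuition, here is a rough Monte Carlo sketch of how a random-chance baseline like this could be estimated. The 10- and 12-candidate settings come from the comment above; the gold-set size of 5 and the assumption that the guesser selects a subset of the same size as the gold set are illustrative only, not the paper's exact protocol.

```python
import random

def random_chance_jaccard(n_candidates, gold_size, trials=100_000):
    """Estimate the expected Jaccard index of a random guess against a gold set.

    Assumes the guesser picks a random subset of the same size as the gold set;
    without loss of generality the gold set is the first `gold_size` candidates.
    """
    gold = set(range(gold_size))
    total = 0.0
    for _ in range(trials):
        guess = set(random.sample(range(n_candidates), gold_size))
        total += len(gold & guess) / len(gold | guess)
    return total / trials

# Illustrative gold-set size of 5; candidate counts of 10 and 12 are from the comment.
for n in (10, 12):
    print(n, round(random_chance_jaccard(n, gold_size=5), 3))
```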

2

YonatanBitton OP t1_iu02pl6 wrote

This is a great point, thank you. The interpretation of common-sense tasks varies from person to person, and common-sense reasoning involves some ambiguity. WinoGAViL, however, only uses instances that were solved well by three human solvers (over 80% Jaccard index). To validate our dataset, we recruited other players (who did not take part in the data generation task) and verified that they solved it with high human accuracy (90%).
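As a minimal sketch of the Jaccard index used as the agreement measure: it is the size of the intersection of two selections divided by the size of their union. The image names below are made up, and how the three solvers' scores are combined (e.g., averaged) is an assumption here.

```python
def jaccard(selection_a, selection_b):
    """Jaccard index between two sets of selected candidate images."""
    a, b = set(selection_a), set(selection_b)
    return len(a & b) / len(a | b)

# Hypothetical selections by the instance creator and one solver.
creator = {"img_1", "img_3", "img_7"}
solver = {"img_1", "img_3", "img_8"}
print(jaccard(creator, solver))  # 0.5 -> below the 80% bar, so this instance would be filtered out
```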

4