
Kaleidophon t1_jah1ke1 wrote

I think what you are looking for is the Gumbel-softmax trick, which is basically differentiable sampling for categorical distributions. But in your case the problem is that BLEU itself is not differentiable, and in MT you often find that directly optimizing for some translation quality metric makes the actual quality, as assessed by human judges, go down.
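For reference, PyTorch ships this as `torch.nn.functional.gumbel_softmax`. A minimal sketch (the shapes and the toy loss are made up for illustration, not taken from OP's setup):

```python
import torch
import torch.nn.functional as F

# Toy logits: batch of 4 positions over a vocabulary of 10 (illustrative shapes).
logits = torch.randn(4, 10, requires_grad=True)

# Soft relaxation: a differentiable point on the simplex; approaches one-hot as tau -> 0.
soft_sample = F.gumbel_softmax(logits, tau=0.5, hard=False)

# Straight-through variant: forward pass is a discrete one-hot sample,
# backward pass uses the soft relaxation's gradient.
hard_sample = F.gumbel_softmax(logits, tau=0.5, hard=True)

# Toy loss against a made-up "gold" one-hot target, just to show gradients flow
# back through the discrete sample to the logits.
gold = F.one_hot(torch.tensor([3, 1, 4, 1]), num_classes=10).float()
loss = ((hard_sample - gold) ** 2).mean()
loss.backward()
print(logits.grad.shape)  # torch.Size([4, 10])
```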

1

Emergency_Apricot_77 t1_jah9rb7 wrote

Why go with BLEU though? OP didn't particularly mention optimizing sequence-level metrics. Can't we still use cross-entropy? Something like the following:

1. Sample the first token, compute cross-entropy against the first gold token.
2. Sample the second token, compute cross-entropy against the second gold token.
3. Sample the third token, compute cross-entropy against the third gold token.

...and so on?


This way we still have a differentiable metric, but with much better alignment between the training and inference scenarios (as opposed to the current mismatch of teacher-forcing training and sampling at inference), which I thought was what the OP was going for. Rough sketch of the loop below.
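Something like this, assuming a hypothetical step-wise decoder interface (`decoder`, `embed`, and `bos_id` are placeholders, not anything from the thread). Note that the sampling step itself is not backpropagated through here; the gradient comes only from the per-step cross-entropy terms:

```python
import torch
import torch.nn.functional as F

def sampled_decoding_loss(decoder, embed, gold, bos_id):
    """gold: LongTensor of shape (batch, seq_len) holding the reference token ids."""
    batch, seq_len = gold.shape
    inp = torch.full((batch,), bos_id, dtype=torch.long)  # start from <bos>
    state = None
    loss = 0.0
    for t in range(seq_len):
        # One decoding step: logits over the vocabulary for position t.
        logits, state = decoder(embed(inp), state)            # (batch, vocab)
        # Cross-entropy against the gold token at this position.
        loss = loss + F.cross_entropy(logits, gold[:, t])
        # Instead of teacher-forcing gold[:, t], feed the model its own sample,
        # so training-time inputs match what the model sees at inference.
        probs = F.softmax(logits, dim=-1)
        inp = torch.multinomial(probs, num_samples=1).squeeze(1)
    return loss / seq_len
```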

1

Kaleidophon t1_jalv0qs wrote

>Why go with BLEU though? OP didn't particularly mention optimizing sequence-level metrics.

From the OP's post above:

>is this possible at all ? I think if we can backprop through beam sampling, we can directly optimise for bleu ?

1