
vwings t1_iw768hl wrote

I think it's valuable, but not huge. There have been several recent works that use the idea of describing a sample by similar samples to enrich its representation:

  • the cross-attention mechanism in Transformers does this to some extent
  • AlphaFold: a protein is enriched with similar (by multiple sequence alignment) proteins
  • CLOOB: a sample is enriched with similar samples from the current batch
  • MHNfs: a sample is enriched with similar samples from a large context.

This paper uses the same concept, but applies it differently: the vector of cosine similarities, which in other works is softmaxed and then used as weights for averaging, is used directly as the representation. That this works and that you can backprop over this is remarkable, but not huge... Just my two cents... [Edits: typos, grammar]
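To make the difference concrete, here's a minimal PyTorch sketch (my own toy example, not the paper's code; the sizes and temperature are made up):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n_anchors = 64, 10                    # toy sizes, chosen arbitrarily
x = torch.randn(d)                       # embedding of the current sample
anchors = torch.randn(n_anchors, d)      # embeddings of the anchor samples

# Cosine similarity of x to every anchor, shape (n_anchors,).
sims = F.cosine_similarity(x.unsqueeze(0), anchors, dim=-1)

# (a) Attention-style enrichment (cross-attention / CLOOB / MHNfs flavour):
# softmax the similarities and use them as weights to average the anchors.
weights = F.softmax(sims / 0.1, dim=-1)  # temperature 0.1 is arbitrary
enriched = weights @ anchors             # stays in the original d-dim space

# (b) This paper's idea as I read it: the similarity vector itself is the
# representation, so its dimensionality equals the number of anchors.
relative_repr = sims
```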

3

machinelearner77 t1_iw7omuy wrote

That it works seems interesting, especially since I would have thought it might depend too much on the choice of anchors (a hyperparameter), which apparently it doesn't. But why shouldn't you be able to "backprop over this"? It's just cosine similarity; everything is naturally differentiable.
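E.g., a quick autograd sanity check (toy sizes, nothing paper-specific):

```python
import torch
import torch.nn.functional as F

x = torch.randn(32, requires_grad=True)       # sample embedding (toy size)
anchors = torch.randn(8, 32)                  # anchor embeddings

sims = F.cosine_similarity(x.unsqueeze(0), anchors, dim=-1)
sims.sum().backward()
print(x.grad.norm())                          # non-zero: gradients flow through the cosine
```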

3

vwings t1_iw857q2 wrote

Yes, sure you can backprop, but what I meant is that you can actually train a network reasonably with this, even though in the backward pass the gradient gets diluted across all anchor samples. I thought you would at least need softmax attention in the forward pass to route the gradients back reasonably.
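A toy illustration of what I mean by routing vs. dilution (my own sketch, arbitrary sizes and temperature, not from the paper):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(16)                               # sample embedding (toy size)
anchors = torch.randn(20, 16, requires_grad=True) # anchor embeddings
w = torch.randn(20)                               # stand-in for a downstream linear layer

def anchor_grad_norms(use_softmax):
    anchors.grad = None
    sims = F.cosine_similarity(x.unsqueeze(0), anchors, dim=-1)
    rep = F.softmax(sims / 0.05, dim=-1) if use_softmax else sims  # 0.05: arbitrary temperature
    (rep * w).sum().backward()
    return anchors.grad.norm(dim=-1)              # one gradient norm per anchor

print(anchor_grad_norms(False))  # raw cosine: every anchor gets a comparable share
print(anchor_grad_norms(True))   # softmax: mass concentrates on the highest-similarity anchors
```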

1