
cnapun t1_jai24sf wrote

What I was trying to say was that doing this sampling approach (in a transformer) seems like it would have similar issues to an RNN, in that your computational graph will be repeated N times, where N is the rollout size. This makes me suspect you'll get a lot of noise in your gradient estimates if N is large (also, iirc, Gumbel-softmax gradients are biased, which might cause some more issues if you chain them).
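To illustrate what I mean (a rough sketch, not anyone's actual setup; the model, rollout length, and objective are made up), this is roughly what backprop through N chained Gumbel-softmax samples looks like in PyTorch. The graph ends up containing N copies of the sampling block, just like an unrolled RNN:

```python
# Minimal sketch of chaining differentiable samples over a rollout.
# All sizes and the stand-in objective are arbitrary, for illustration only.
import torch
import torch.nn.functional as F

vocab_size, hidden_dim, N = 50, 32, 16  # N = rollout length

embed = torch.nn.Embedding(vocab_size, hidden_dim)
proj = torch.nn.Linear(hidden_dim, vocab_size)

h = torch.zeros(1, hidden_dim)
loss = torch.tensor(0.0)
for _ in range(N):
    logits = proj(h)
    # Differentiable "sample": a soft one-hot over the vocab. Gradients flow
    # through the relaxation, but the estimator is biased.
    y_soft = F.gumbel_softmax(logits, tau=1.0, hard=False)
    # Feed the sample back in; each step adds another copy of this block
    # to the computational graph.
    h = h + y_soft @ embed.weight
    loss = loss + logits.pow(2).mean()  # stand-in objective

loss.backward()  # backprop traverses all N repeated blocks
```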

1

cnapun t1_jage50a wrote

I'm not an expert on this topic, but I've discussed it with coworkers. I do believe you should be able to backprop through sampling, mathematically at least. My suspicion is that you'll run into the same problem as you have with RNNs, where backpropping through many steps leads to high variance in gradients. I'd search for some papers that have explored this; I assume they exist.

5

cnapun t1_j10a9jz wrote

In my experience, negative sampling is sadly super application-dependent (especially for retrieval). FB had a paper discussing how they train a search retrieval model (no hard negatives), while Amazon used hard negatives combined with easy negatives in product search (the FB paper mentions they tried this but it didn't help, though they did some other stuff). Both of them use hinge loss, but other places use softmax more often. I'm a fan of random negatives (and distance-weighted sampling), but eventually we found that mixed negatives + softmax with sample probability correction works a little better in a lot of cases.

One of the big challenges is that there are so many possible hyperparameters here: do you concatenate negatives or sum losses, how many in-batch negatives do you use, if you have items drawn from a different distribution than the positives, can you use them as in-batch negatives, and what's the ratio of in-batch to random negatives? Depending on the application, different configurations here can yield better or worse results.
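To make "mixed negatives + softmax with sample probability correction" concrete, here's a rough sketch of the loss (my own illustration, not taken from any of the papers below; the shapes, names, and temperature are assumptions):

```python
# Sketch of a sampled-softmax loss with mixed (in-batch + random) negatives
# and a logQ sample-probability correction. Names/shapes are illustrative.
import torch
import torch.nn.functional as F

def mixed_negative_softmax_loss(q, pos, rand_neg, pos_logq, rand_logq, temp=0.05):
    """
    q:         (B, D) query embeddings
    pos:       (B, D) positive item embeddings (also reused as in-batch negatives)
    rand_neg:  (R, D) randomly sampled negative item embeddings
    pos_logq:  (B,)   log sampling probability of each in-batch item
    rand_logq: (R,)   log sampling probability of each random negative
    """
    # Candidate pool = in-batch items concatenated with random negatives.
    cand = torch.cat([pos, rand_neg], dim=0)        # (B + R, D)
    logq = torch.cat([pos_logq, rand_logq], dim=0)  # (B + R,)

    logits = q @ cand.t() / temp                    # (B, B + R)
    # logQ correction: down-weight frequently sampled items so the sampled
    # softmax better approximates the full softmax over the corpus.
    logits = logits - logq.unsqueeze(0)

    labels = torch.arange(q.size(0))  # item i is the positive for query i
    return F.cross_entropy(logits, labels)

# Toy usage; in practice logq comes from estimated item sampling frequencies.
B, R, D = 8, 16, 32
q = torch.randn(B, D, requires_grad=True)
loss = mixed_negative_softmax_loss(
    q, torch.randn(B, D), torch.randn(R, D), torch.zeros(B), torch.zeros(R))
loss.backward()
```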

Some not super-recent papers I can think of:

https://research.google/pubs/pub50257/

https://arxiv.org/abs/1706.07567

https://arxiv.org/abs/2010.14395

https://arxiv.org/abs/1907.00937 (3.2)

https://arxiv.org/abs/2006.11632 (2.2/2.4,6.1)

5

cnapun t1_j0z2van wrote

User behavior is pretty stochastic and not really well captured in the datasets available to academia. There's also a second class of papers that explores ranking rather than candidate generation, which imo are usually more interesting, but also harder to find good data for in academia.

I take all results in papers discussing embedding/two-tower models (for retrieval) with a grain of salt because, in my experience, the number one thing that matters for these in practice is negative sampling (but people rarely do ablations on this; see this paper showing that metric learning hasn't really progressed as much as papers would have you think). They can still be good to read for ideas, though.
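For context, this is the kind of two-tower model I mean, a bare-bones sketch of my own (all sizes are arbitrary); the architecture itself is trivial, which is part of why the choice of negatives for the loss ends up being what matters:

```python
# Bare-bones two-tower (dual encoder) retrieval model: queries and items are
# embedded independently and scored by dot product, so item embeddings can be
# precomputed and served from an ANN index. Sizes here are arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    def __init__(self, num_users, num_items, dim=64):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Embedding(num_users, dim), nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.item_tower = nn.Sequential(
            nn.Embedding(num_items, dim), nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, user_ids, item_ids):
        u = F.normalize(self.user_tower(user_ids), dim=-1)
        v = F.normalize(self.item_tower(item_ids), dim=-1)
        return u, v

model = TwoTower(num_users=1000, num_items=5000)
u, v = model(torch.randint(0, 1000, (8,)), torch.randint(0, 5000, (8,)))
scores = u @ v.t()  # (8, 8): diagonal = positives, off-diagonal = in-batch negatives
```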

18

cnapun t1_iu3cqta wrote

I'm probably not the target demographic here (I work in mid-size(?) tech), but I have a couple of vague thoughts:

  • training speed == dev velocity: you can train more models and either get things ready faster or make the model better in the same amount of time
  • training speed == training cost if you're using on-demand compute. Depending on the company, they might not use on-demand (or might not care about cost). What I usually have seen happen is a never-ending cycle of slow training -> optimize -> add something that ends up causing a performance regression (maybe a new feature slows dataloading) -> optimize again -> ... forever. Because of this, I think fundamental training optimizations can be useful, but it's super easy to introduce regressions and just accept them because they're not usually a priority
  • For realtime systems powered by ML, latency == engagement. You can get substantial improvements in engagement from running inference on ranking models faster

5