smallest_meta_review OP t1_ivcf2tb wrote

While the critique is fair, if the alternative is to always train agents from scratch, then reincarnating RL seems like the more reasonable option. Furthermore, dependence on prior computation doesn't stop NLP / vision researchers from reusing prior computation (pretrained models), so it seems worthwhile to do so in RL research too.

Re the role of distillation: the paper combines online distillation (DAgger) + RL to increase model capacity (rather than decrease capacity, as in SL) and weans off the distillation loss over time, so the agent is eventually trained with the RL loss alone; the paper calls this a simple baseline. Also, it's unclear what the best way is to reuse prior computation when it's given in a form other than a learned agent, which is what the paper argues we should study.
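For concreteness, a rough sketch of that weaning-off idea (my own toy PyTorch code, not the paper's implementation; the function name and the linear decay schedule are just placeholders):

```python
import torch
import torch.nn.functional as F

def reincarnation_loss(student_logits, teacher_logits, rl_loss, step, decay_steps):
    """Combine an RL loss with an online distillation loss whose weight decays to zero."""
    # Linearly decay the distillation weight from 1 to 0 over decay_steps,
    # so the (larger) student ends up trained with the RL loss alone.
    lam = max(0.0, 1.0 - step / decay_steps)
    distill = F.kl_div(
        F.log_softmax(student_logits, dim=-1),   # student's action log-probs
        F.softmax(teacher_logits, dim=-1),       # frozen teacher's action probs
        reduction="batchmean",
    )
    return rl_loss + lam * distill

# Dummy usage: 32 states, 6 discrete actions.
student_logits = torch.randn(32, 6)
teacher_logits = torch.randn(32, 6)   # the "prior computation" (a learned agent)
rl_loss = torch.tensor(0.5)           # placeholder for whatever RL loss is used
loss = reincarnation_loss(student_logits, teacher_logits, rl_loss, step=1000, decay_steps=10000)
```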

Re the source of gains: if the aim is to benchmark RL methods in an RRL context, all methods would use exactly the same prior computation and the same reincarnating RL method, for a fair comparison. In this setup, the supervised learning losses (if used) would likely add stability to the RL training process.

2

Nameless1995 t1_ivhscyv wrote

> (rather than decrease capacity, as in SL)

Distillation in the supervised learning literature doesn't always reduce the student's capacity. I believe iterative distillation and similar setups have also been explored, where the student has the same capacity as the teacher but ends up better calibrated (or something along those lines; I forget the details). (https://arxiv.org/abs/2206.08491, https://proceedings.neurips.cc/paper/2020/hash/1731592aca5fb4d789c4119c65c10b4b-Abstract.html)
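For what it's worth, here's a toy PyTorch sketch of that same-capacity (self-)distillation setup -- just my own illustration of the standard soft-label recipe, not code from those papers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def self_distillation_loss(student, teacher, x, y, alpha=0.5, T=2.0):
    """Mix hard-label cross-entropy with KL to the teacher's softened predictions."""
    with torch.no_grad():
        teacher_logits = teacher(x)              # teacher is frozen during this step
    student_logits = student(x)
    hard = F.cross_entropy(student_logits, y)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # standard temperature-scaling factor
    return alpha * hard + (1 - alpha) * soft

# Teacher and student share the exact same architecture (same capacity):
teacher = nn.Linear(16, 10)
student = nn.Linear(16, 10)                      # fresh initialization, identical size
x, y = torch.randn(8, 16), torch.randint(0, 10, (8,))
loss = self_distillation_loss(student, teacher, x, y)
```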

2

smallest_meta_review OP t1_ivhz0g2 wrote

Interesting. So self-distillation uses the same capacity model for student and teacher -- are there papers that significantly increase model capacity? I thought the main use of distillation in SL was reducing inference time, but I'd be interested to know of cases where we actually use a much bigger student model.

1

Nameless1995 t1_ivi33nf wrote

I am not sure. It's not my area of research; I learned of some of these ideas from a presentation someone gave years ago. Some of these recent papers essentially draw a connection between distillation and label smoothing (essentially a way to provide "soft" labels -- this probably connects with mixup techniques too). On that ground, you can justify using any kind of teacher/student, I think.

Based on the label smoothing connection, some papers go for "teacher-free" distillation, and others seem to introduce a "lightweight" teacher instead (I am not sure if the lightweight teacher has lower capacity than the student, which would be what you were looking for -- a higher-capacity student. I haven't really read it beyond the abstract; I just found it a few minutes ago from googling): https://arxiv.org/pdf/2005.09163.pdf (it doesn't seem like a very popular paper, though, given it was published on arXiv in 2020 and has only 1 citation). A similar idea to self-distillation was also explored under the moniker of "born-again networks" (similar to the reincarnation moniker): https://arxiv.org/abs/1805.04770
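If it helps, the label-smoothing connection is basically that the soft targets can come from a uniform mixture instead of a real teacher -- a toy sketch (my own, not from those papers):

```python
import torch
import torch.nn.functional as F

def label_smoothing_targets(y, num_classes, eps=0.1):
    """'Teacher-free' soft labels: mix the one-hot label with a uniform distribution."""
    one_hot = F.one_hot(y, num_classes).float()
    return one_hot * (1.0 - eps) + eps / num_classes

y = torch.tensor([2, 0, 5])
soft = label_smoothing_targets(y, num_classes=10)
# Training with cross-entropy against `soft` looks like distilling from a
# "teacher" that always mixes in a slightly-uniform distribution.
```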

1