Comments


TiredOldCrow t1_iv8tqar wrote

I know it's naive to expect machine learning to imitate life too closely, but for animals, "models" that are successful enough to produce offspring pass on elements of those "weights" to their children through nature+nurture.

The idea of weighting more successful previous models more heavily when "reincarnating" future models, and potentially borrowing some concepts from genetic algorithms with respect to combining multiple successful models seems interesting to me.
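For fun, a toy sketch of what "weighting successful ancestors more heavily" could look like when the prior models share an architecture (everything here -- the function, the softmax-over-returns weighting, the numbers -- is made up, and naive weight averaging only really makes sense for identically shaped, reasonably aligned networks):

```python
import numpy as np

def combine_ancestors(checkpoints, returns, temperature=100.0):
    """checkpoints: list of dicts {param_name: np.ndarray}, one per prior model.
    returns: the episode return (fitness) each prior model achieved."""
    fitness = np.asarray(returns, dtype=np.float64)
    # Softmax over returns: more successful ancestors contribute more to the child.
    weights = np.exp((fitness - fitness.max()) / temperature)
    weights /= weights.sum()
    child = {}
    for name in checkpoints[0]:
        child[name] = sum(w * ckpt[name] for w, ckpt in zip(weights, checkpoints))
    return child

# Toy usage: two "ancestor" checkpoints with the same (tiny) architecture.
a = {"w": np.ones((2, 2)), "b": np.zeros(2)}
b = {"w": 3 * np.ones((2, 2)), "b": np.ones(2)}
child = combine_ancestors([a, b], returns=[100.0, 300.0])  # b dominates the mix
```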

60

life_is_harsh t1_iva1h5l wrote

I feel both are useful, no? I thought of reincarnation as how humans learn: we don't learn from a blank slate but often reuse our own learned knowledge or learn from others during our lifetime (e.g., when learning to play a sport, we might learn from an instructor but eventually learn on our own).

7

smallest_meta_review OP t1_iva27vt wrote

While nurture + nature seems useful across lifetimes, reincarnation might be how we learn during our lifetimes? I am not an expert but I found this comment interesting:

> This must be a fundamental part of how primates like us learn, piggybacking off of an existing policy at some level, so I'm all for RL research that tries to formalize ways it can work computationally.

OG Comment

2

smallest_meta_review OP t1_iva4dj7 wrote

> Tabula rasa RL vs. Reincarnating RL (RRL). While tabula rasa RL focuses on learning from scratch, RRL is based on the premise of reusing prior computational work (e.g., prior learned agents) when training new agents or improving existing agents, even in the same environment. In RRL, new agents need not be trained from scratch, except for initial forays into new problems.

More at https://ai.googleblog.com/2022/11/beyond-tabula-rasa-reincarnating.html?m=1

5

whothatboah t1_ivadajt wrote

Very ignorantly speaking, but this is giving a bit of genetic algorithm vibes...

1

smurfpiss t1_ivaf5ia wrote

Not experienced with RL much, but how is that different from an algorithm going through training iterations?

In that case the parameters are tweaked from past learned parameters. What's the benefit of learning from another algorithm? Is it some kind of weird offspring of skip connections and transfer learning?

5

smallest_meta_review OP t1_ivaghqa wrote

Good question. The original blog post somewhat covers this:

> Imagine a researcher who has trained an agent A_1 for some time, but now wants to experiment with better architectures or algorithms. While the tabula rasa workflow requires retraining another agent from scratch, Reincarnating RL provides the more viable option of transferring the existing agent A_1 to a different agent and training this agent further, or simply fine-tuning A_1.

But this is not what happens in research. For example, each time we train a new agent to play, say, an Atari game, we train it from scratch, ignoring all the prior agents trained on that game. This work asks why not reuse learned knowledge from existing agents when training new agents (which may be totally different).
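As a rough illustration of the two workflows (purely schematic; every name here is invented, and the `transfer` step stands in for whatever reuse mechanism you pick, e.g. distillation or weight copying):

```python
# Schematic only: contrasting the tabula rasa and reincarnation workflows.
def tabula_rasa_workflow(env, make_new_agent, train):
    agent = make_new_agent()           # random init; all prior agents are ignored
    return train(agent, env)           # pay the full training cost from scratch

def reincarnation_workflow(env, old_agent, make_new_agent, transfer, train):
    new_agent = make_new_agent()       # may use a different architecture/algorithm
    transfer(old_agent, new_agent)     # e.g., distill the old policy, copy weights,
                                       # or reuse its replay data / checkpoints
    return train(new_agent, env)       # continue training from the reused knowledge
```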

3

Dendriform1491 t1_ivaj27w wrote

At least in nature this happens because the environment is always changing and the value of training decays (some sort of "data drift").

1

ingambe t1_ivaj6e5 wrote

Evolution strategies work very much like the process you described. For very small neural networks they work very well, especially in environments with sparse or quasi-sparse rewards. But as soon as you try larger neural nets (CNN + MLP, or Transformer-like architectures), the process becomes super noisy and you either need to produce tons of offspring for the population or use gradient-based techniques.
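For anyone curious, a minimal sketch of the kind of loop being described (in the spirit of OpenAI-style ES; all names and hyperparameters here are illustrative, not from any particular paper):

```python
import numpy as np

def evolution_step(theta, fitness_fn, pop_size=50, sigma=0.1, lr=0.02, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # One Gaussian perturbation ("offspring") per population member.
    noise = rng.standard_normal((pop_size, theta.size))
    rewards = np.array([fitness_fn(theta + sigma * eps) for eps in noise])
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Fitness-weighted average of the perturbations approximates a gradient step.
    return theta + lr / (pop_size * sigma) * noise.T @ advantages

# Toy fitness: move theta toward a target vector (reward = negative squared error).
rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])
theta = np.zeros(3)
for _ in range(300):
    theta = evolution_step(theta, lambda t: -np.sum((t - target) ** 2), rng=rng)
```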

9

anonymousTestPoster t1_ival53k wrote

How is this idea different from using pre-trained networks (functions) and then adapting them for a new problem context?

5

smallest_meta_review OP t1_ivancqx wrote

Good question. I feel it's going one step further and saying: why not reuse prior computational work (e.g., existing learned agents) on the same problem, especially if that problem is computationally demanding (large-scale RL papers do this but research papers don't)? So, the next time we train a new RL agent, we reuse prior computation rather than starting from scratch (e.g., we train new agents on Atari games given a pretrained DQN agent from 2015).

Also, in reincarnating RL, we don't have to stick to the same pretrained network architecture and can possibly try some other architecture too.

2

_der_erlkonig_ t1_ivb0pya wrote

Not to be that guy, but it kind of seems like this is just finally acknowledging that distillation is a good idea for RL too. They even use the teacher-student terminology. Distilling a teacher into a student with a different architecture is something they make a big deal about in the paper, but people have been doing this for years in supervised learning. It's neat and important work, but the RRL branding is obnoxious and unnecessary IMO.

From a scientific standpoint, I think this methodology is also less useful than the authors advertise. Unlike supervised learning, RL is infamously sensitive to initial conditions, and adding another huge variable like the exact form of distillation used (which may reduce the compute used) will make it even more difficult to isolate the source of "gains" in RL research.

9

luchins t1_ivbuz90 wrote

> I feel it's going one step further and saying why not reuse prior computational work (e.g., existing learned agents) in the same problem

Could you give me an example, please? I don't get what you mean by using agents with different architectures.

1

TheLastVegan t1_ivbvx23 wrote

>As reincarnating RL leverages existing computational work (e.g., model checkpoints), it allows us to easily experiment with such hyperparameter schedules, which can be expensive in the tabula rasa setting. Note that when fine-tuning, one is forced to keep the same network architecture; in contrast, reincarnating RL grants flexibility in architecture and algorithmic choices, which can surpass fine-tuning performance (Figures 1 and 5).

Okay so agents can communicate weights between architectures. That's a reasonable conclusion. Sort of like a parent teaching their child how to human.

I thought language models already do this at inference time. So the goal of the RRL method is to subvert the agent's trust..?

1

smallest_meta_review OP t1_ivcf2tb wrote

While the critique is fair, if the alternative is always training agents from scratch, then reincarnating RL seems like the more reasonable option. Furthermore, dependence on prior computation doesn't stop NLP / vision researchers from reusing prior computation (pretrained models), so it seems worthwhile to do so in RL research too.

Re the role of distillation: the paper combines online distillation (DAgger) + RL to increase model capacity (rather than decrease capacity, as in SL) and weans off the distillation loss over time so the agent is eventually trained with the RL loss alone; the paper calls it a simple baseline. Also, it's unclear what the best way is to reuse prior computation given in a form other than learned agents, which is what the paper argues for studying.
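Just to make the "wean off the distillation loss" part concrete, here's a rough sketch of that recipe (QDagger-style in spirit only; the exact details differ from the paper, and the function name, schedule, and hyperparameters here are made up):

```python
import torch
import torch.nn.functional as F

def reincarnation_loss(student_q, teacher_q, rl_loss, step, decay_steps=100_000, tau=1.0):
    """student_q, teacher_q: Q-values of shape [batch, num_actions];
    rl_loss: the student's usual TD loss; step: current training step."""
    # Online distillation term: KL between the teacher's and student's softmax policies.
    distill = F.kl_div(
        F.log_softmax(student_q / tau, dim=-1),
        F.softmax(teacher_q / tau, dim=-1),
        reduction="batchmean",
    )
    # Linearly anneal the distillation coefficient to zero, so later training is
    # driven purely by the RL loss.
    coeff = max(0.0, 1.0 - step / decay_steps)
    return rl_loss + coeff * distill

# Toy usage with random tensors standing in for real network outputs.
student_q = torch.randn(32, 6)
teacher_q = torch.randn(32, 6)
loss = reincarnation_loss(student_q, teacher_q, rl_loss=torch.tensor(0.5), step=10_000)
```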

Re source of gains, if the aim is to benchmark RL methods in an RRL context, all methods would use the exact same prior computation and same reincarnating RL method for fair comparison. In this setup, it's likely that the supervised learning losses (if used) would add stability to the RL training process.

2

smallest_meta_review OP t1_ivcghme wrote

Oh, so one of the examples in the blog post is that we start with a DQN agent with a 3-layer CNN architecture and reincarnate a Rainbow agent with a ResNet architecture (Impala-CNN) using the QDagger approach. Once reincarnated, the ResNet Rainbow agent is further trained with RL to maximize reward. See the paper here for more details: https://openreview.net/forum?id=t3X5yMI_4G2

1

Nameless1995 t1_ivhscyv wrote

> (rather than decrease capacity akin to SL)

Distillation in the supervised literature doesn't always reduce capacity for the student. I believe iterative distillation and the like have also been explored, where students have the same capacity but it leads to better calibration or something, I forget. (https://arxiv.org/abs/2206.08491, https://proceedings.neurips.cc/paper/2020/hash/1731592aca5fb4d789c4119c65c10b4b-Abstract.html)

2

smallest_meta_review OP t1_ivhz0g2 wrote

Interesting. So self-distillation uses the same-capacity model as student and teacher -- are there papers which significantly increase model capacity? I thought the main use of distillation in SL was reducing inference time, but I would be interested to know of cases where we actually use a much bigger student model.

1

Nameless1995 t1_ivi33nf wrote

I am not sure; it's not my area of research. I learned of some of these ideas in a presentation someone made years ago. Some of these recent papers essentially draw a connection between distillation and label smoothing (essentially a way to provide "soft" labels -- this probably connects up with mixup techniques too). So on that ground, you can justify using any kind of teacher/student, I think.

Based on the label-smoothing connection, some papers go for "teacher-free" distillation. And some others seem to introduce a "lightweight" teacher instead (I am not sure if the lightweight teacher has lower capacity than the student, which would make it what you were looking for -- students having higher capacity; I haven't really read it beyond the abstract, just found it a few minutes ago from googling): https://arxiv.org/pdf/2005.09163.pdf (doesn't seem like a very popular paper though, given it was published on arXiv in 2020 and has only 1 citation). Looks like a similar idea to self-distillation was also explored under the moniker of "born-again networks" (similar to the reincarnation moniker): https://arxiv.org/abs/1805.04770
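To make that label-smoothing connection concrete (my own toy illustration, not taken from the linked papers): label smoothing is just soft-target training where the "teacher" mixes the true label with a uniform distribution, while regular distillation swaps in a real teacher's softened predictions.

```python
import torch
import torch.nn.functional as F

def soft_target_loss(student_logits, target_probs):
    # Cross-entropy of the student against an arbitrary soft target distribution.
    return -(target_probs * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()

def label_smoothing_targets(labels, num_classes, eps=0.1):
    one_hot = F.one_hot(labels, num_classes).float()
    uniform = torch.full_like(one_hot, 1.0 / num_classes)
    return (1 - eps) * one_hot + eps * uniform       # the "teacher-free teacher"

# With a real teacher you would instead use target_probs = softmax(teacher_logits / tau).
logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = soft_target_loss(logits, label_smoothing_targets(labels, 10))
```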

1