tdgros t1_j41f1nz wrote

You'll still pay the full price at train time, right? Early decoding works by attaching decoders to earlier layers at train time, so every layer still has to run during training. Conversely, if you want to spend more compute on some tokens, you need more layers at train time, so at some point you will hit your memory/complexity limits.
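
To put rough numbers on that, here is a toy cost model, not a real implementation. It assumes one unit of compute per layer applied to one token, and the function names, layer count, and per-token exit depths are all made up for illustration.

```python
# Toy cost model (hypothetical): 1 unit = one layer applied to one token.

def train_cost(num_tokens, num_layers):
    # With an exit decoder on each layer, every token is pushed through
    # every layer during training so each decoder gets a loss signal.
    return num_tokens * num_layers


def inference_cost(exit_layers):
    # At inference each token stops at its own exit layer (1-indexed),
    # so the cost is the sum of the exit depths.
    return sum(exit_layers)


num_layers = 12                 # assumed model depth
exit_layers = [3, 12, 5, 8]     # made-up exit depth for each of 4 tokens

full = train_cost(len(exit_layers), num_layers)
early = inference_cost(exit_layers)
print(f"train: {full} units, inference with early exit: {early} units")
```

In this model the inference saving depends only on where tokens exit, while the training cost scales with the full depth. Letting some tokens go deeper means raising `num_layers`, and that raises the training cost for every token.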