canbooo t1_j8mfn6z wrote on February 15, 2023 at 11:37 AM

Since you are waiting since 6h without any response, let me share my 5c. You are probably inspired by chatgpt and the success of HRL so why not start there: https://openreview.net/forum?id=20-xDadEYeU

But this idea is not novel, only its application to nlp. It has been applied to other stuff like games and autonomous driving. They use PPO, which is to me the most robust on-policy algorithm. However, any other on-policy algorithm could also have been used instead and stuff like SAC could improve sample efficiency but might run into convergence problems. Also, you can try to be more generalistic and try off-policy algorithms independent of a specific language model. This would allow using same experience/value model to fine tune other LMs. But it might require much much more data to achieve a similar performance. In any case, the application of RL to NLP (except for language based games) is quite new and many points remain yet to be answered.

Smooth-Stick-5751 OP t1_j8mj4y6 wrote on February 15, 2023 at 12:17 PM

Thank you very much for your valuable opinion, I will look into your suggestions.