Comments


canbooo t1_j8mfn6z wrote

Since you have been waiting 6 hours without any response, let me share my 5 cents. You are probably inspired by ChatGPT and the success of RLHF, so why not start there: https://openreview.net/forum?id=20-xDadEYeU

But the idea itself is not novel, only its application to NLP; it has already been applied to other domains like games and autonomous driving. They use PPO, which is to me the most robust on-policy algorithm. However, any other on-policy algorithm could have been used instead, and off-policy methods like SAC could improve sample efficiency but might run into convergence problems. You could also try to be more general and use off-policy algorithms that are independent of a specific language model. This would allow reusing the same experience/value model to fine-tune other LMs, but it might require much more data to reach similar performance. In any case, the application of RL to NLP (except for language-based games) is quite new, and many questions remain open.
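
To make the PPO part concrete, here is a minimal PyTorch sketch of the clipped surrogate loss, treating each generated token as one action. The tensors are toy placeholders, not values from the linked paper or any real model:

```python
import torch

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Probability ratio pi_new(a|s) / pi_old(a|s) for each generated token
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the clipped surrogate; we return its negative as a loss
    return -torch.min(unclipped, clipped).mean()

# Toy numbers standing in for one batch of generated tokens
old_lp = torch.tensor([-2.3, -1.1, -0.7, -3.0])
new_lp = torch.tensor([-2.1, -1.2, -0.5, -2.8], requires_grad=True)
adv = torch.tensor([0.5, -0.2, 1.0, 0.1])  # would come from a learned value head

loss = ppo_clipped_loss(new_lp, old_lp, adv)
loss.backward()  # gradients flow only through the new policy's log-probs
print(loss.item())
```

In a real fine-tuning loop the log-probs would come from the LM itself, the advantages from a value head on top of it, and an optimizer step would update the policy.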

3

Smooth-Stick-5751 OP t1_j8mj4y6 wrote

Thank you very much for your valuable input; I will look into your suggestions.

2

loadage t1_j8rgptu wrote

My answer is less refined than some of the others, and my experience with RL is minimal, but wouldn't the action space be too large? Could you contain it at all if any word or phrase is allowed (a near-infinite space)? You could try limiting it to single letters, but, much as CNNs rely on relationships between neighboring inputs, you'd be missing out on the relationships between letters, and you'd still have a 26-character action space, assuming you don't use punctuation or numbers. My friend spent two years working on an RL algorithm with only a 6-action space... I can't imagine 4x that.
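
To put rough numbers on this: with subword tokenizers, each action is usually one token out of the whole vocabulary, so the discrete action space is tens of thousands wide rather than 26. A minimal sketch of one decoding step viewed as an RL action (the vocabulary size is GPT-2's, used purely as a reference point; the logits are random stand-ins):

```python
import torch

vocab_size = 50_257                # GPT-2's BPE vocabulary, as a reference point
logits = torch.randn(vocab_size)   # stand-in for one decoding step's logits

# The policy is just a categorical distribution over the entire vocabulary
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()             # one "action" = one generated token
print(vocab_size, action.item(), dist.log_prob(action).item())
```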

2

Smooth-Stick-5751 OP t1_j8rtqhn wrote

I see. I'm just a beginner in this field as well, so I don't know most of how it works, but I will take your thoughts into consideration. Thanks.

1