Comments


canbooo t1_j8mfn6z wrote

Since you have been waiting 6 hours without any response, let me share my 5 cents. You are probably inspired by ChatGPT and the success of RLHF, so why not start there: https://openreview.net/forum?id=20-xDadEYeU

But the idea itself is not novel, only its application to NLP; it has already been applied to other domains like games and autonomous driving. They use PPO, which is to me the most robust on-policy algorithm. However, any other on-policy algorithm could have been used instead, and off-policy methods like SAC could improve sample efficiency but might run into convergence problems. You could also try to be more general and use off-policy algorithms that are independent of a specific language model. This would allow reusing the same experience/value model to fine-tune other LMs, but it might require much more data to reach similar performance. In any case, the application of RL to NLP (except for language-based games) is quite new, and many questions remain open.
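
To make the PPO part concrete, here is a minimal PyTorch sketch of the clipped surrogate loss, treating each generated token as one action. The tensors are toy placeholders, not values from the linked paper or any real model:

```python
import torch

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Probability ratio pi_new(a|s) / pi_old(a|s) for each generated token
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the clipped surrogate; we return its negative as a loss
    return -torch.min(unclipped, clipped).mean()

# Toy numbers standing in for one batch of generated tokens
old_lp = torch.tensor([-2.3, -1.1, -0.7, -3.0])
new_lp = torch.tensor([-2.1, -1.2, -0.5, -2.8], requires_grad=True)
adv = torch.tensor([0.5, -0.2, 1.0, 0.1])  # would come from a learned value head

loss = ppo_clipped_loss(new_lp, old_lp, adv)
loss.backward()  # gradients flow only through the new policy's log-probs
print(loss.item())
```

In a real fine-tuning loop the log-probs would come from the LM itself, the advantages from a value head on top of it, and an optimizer step would update the policy.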

3

Smooth-Stick-5751 OP t1_j8mj4y6 wrote

Thank you very much for your valuable input; I will look into your suggestions.

2

loadage t1_j8rgptu wrote

My answer is less refined than some of the others, and my experience with RL is minimal, but wouldn't the action space be too large? Could you contain it at all if any word or phrase is allowed (a near-infinite space)? You could try limiting it to single letters, but, much as CNNs rely on relationships between neighboring inputs, you'd be missing out on the relationships between letters, and you'd still have a 26-character action space, assuming you don't use punctuation or numbers. My friend spent two years working on an RL algorithm with only a 6-action space... I can't imagine 4x that.
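
To put rough numbers on this: with subword tokenizers, each action is usually one token out of the whole vocabulary, so the discrete action space is tens of thousands wide rather than 26. A minimal sketch of one decoding step viewed as an RL action (the vocabulary size is GPT-2's, used purely as a reference point; the logits are random stand-ins):

```python
import torch

vocab_size = 50_257                # GPT-2's BPE vocabulary, as a reference point
logits = torch.randn(vocab_size)   # stand-in for one decoding step's logits

# The policy is just a categorical distribution over the entire vocabulary
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()             # one "action" = one generated token
print(vocab_size, action.item(), dist.log_prob(action).item())
```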

2

Smooth-Stick-5751 OP t1_j8rtqhn wrote

I see. I'm just a beginner in this field as well, so I don't know most of how it works, but I will take your thoughts into consideration. Thanks.

1