Submitted by _underlines_ t3_zstequ in MachineLearning
meyerhot t1_j1cg6jj wrote
Reply to comment by londons_explorer in [D] When chatGPT stops being free: Run SOTA LLM in cloud by _underlines_
Does anyone have any idea how they assigned rewards? Do they somehow take the sum of prob(logits) over each token in the sentence and multiply that by the reward?
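To make the question concrete, here is a rough sketch of what I mean, written as a plain REINFORCE-style loss. This is only my guess at the general shape, not a claim about how ChatGPT was actually trained. It sums log-probabilities rather than raw probabilities, since that is the usual policy-gradient form, and all function names and toy numbers are made up for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one step's vocabulary logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(step_logits, token_ids):
    # Sum of log p(token) over the sampled tokens of one sentence.
    return sum(
        math.log(softmax(logits)[tok])
        for logits, tok in zip(step_logits, token_ids)
    )

def reinforce_loss(step_logits, token_ids, reward):
    # Hypothetical loss: one scalar reward for the whole sentence scales
    # the summed log-probability. The minus sign means that minimizing
    # the loss makes positively rewarded sentences more likely.
    return -reward * sequence_log_prob(step_logits, token_ids)

# Toy example: 2 generated tokens, vocabulary of 3 (made-up numbers).
step_logits = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2]]
token_ids = [0, 1]
loss = reinforce_loss(step_logits, token_ids, reward=1.0)
```

In a real setup the logits would come from the language model and the gradient of this loss would flow back into its weights; here they are just fixed lists.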