meyerhot t1_j1cfyrj wrote
I'm really interested in this and have been looking into doing some sort of fine-tuning on an LLM like GLM or Bloom. I had this idea for human-in-the-loop training back in grad school, but I couldn't figure out how to assign rewards to sentences when text generation happens token by token.
meyerhot t1_j1cg6jj wrote
Reply to comment by londons_explorer in [D] When chatGPT stops being free: Run SOTA LLM in cloud by _underlines_
Anyone have any ideas about how they assigned rewards? Somehow take the sum of the probabilities (from the logits) of each token in the sentence and multiply that by the reward?
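Roughly that, as far as I can tell, except with log-probabilities rather than raw probabilities: a scalar reward for the whole sentence times the sum of the generated tokens' log-probs is the vanilla REINFORCE / policy-gradient objective. (The InstructGPT paper actually uses PPO on top of a learned reward model, so treat this as a simplification.) A minimal PyTorch sketch of the idea, with toy shapes and a made-up reward value, none of it their actual code:

```python
# REINFORCE-style sketch: one scalar reward for the whole sentence
# scales the log-probability of the sequence the model generated.
import torch
import torch.nn.functional as F

def reinforce_loss(logits: torch.Tensor,
                   generated_ids: torch.Tensor,
                   reward: float) -> torch.Tensor:
    """logits: (seq_len, vocab_size) from the policy model.
    generated_ids: (seq_len,) tokens the model actually sampled.
    reward: scalar score for the whole sentence (e.g. from a human
    rater or a learned reward model)."""
    log_probs = F.log_softmax(logits, dim=-1)                 # (seq_len, vocab)
    # Log-prob of each token the model chose at each step.
    chosen = log_probs.gather(1, generated_ids.unsqueeze(1)).squeeze(1)
    # Sum of per-token log-probs = log-prob of the whole sequence.
    # Scaling by the reward and negating gives a loss to minimize.
    return -reward * chosen.sum()

# Hypothetical usage with random "model output", just to show shapes.
seq_len, vocab_size = 12, 50_000
logits = torch.randn(seq_len, vocab_size, requires_grad=True)
generated_ids = torch.randint(vocab_size, (seq_len,))
loss = reinforce_loss(logits, generated_ids, reward=0.8)
loss.backward()  # a positive reward pushes up the chosen tokens' log-probs
```

The per-token credit assignment falls out for free here: every token in the sentence shares the sentence-level reward through the sum, which is exactly why you don't need a separate reward for each token.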