Submitted by AutoModerator t3_yntyhz in MachineLearning
stjernen t1_iw20n2s wrote
Kind of a stupid question, but I'm having a hard time understanding rewards and how to apply them.
- Is reward an input?
- Is reward the process of constant retraining?
- Is reward the process of labeling?
- Can it only be used with MDPs?
- Can it only be used in QL / DQL?
- I don't use CNNs and images, can it be done without?
- Lots of examples out there using «gym», can you do it without?
- Many examples use -100 to 100 as reward, should it not be -1 to 1?
Can't really wrap my head around it. Currently making a card-playing NN, with some success using features and labels. Want to take the next step into maybe DQL.
csreid t1_iwrh9rt wrote
>- Is reward an input?
Kind of, in that it comes from the environment: after every action the environment hands back a scalar reward along with the next state.
>- Is reward the process of constant retraining?
I'm not sure what this means, but no: the reward isn't the training process itself, it's just a number the environment returns after each step, and training is how you use that number to update your network.
>- Is reward the process of labeling?
No, probably not, though again I'm not sure what you mean. A reward isn't a label: nothing tells the network which action was the "correct" one, it just gets a scalar score for how well the outcome turned out.
>- Can it only be used with MDPs?
MDP is part of the mathematical backbone of reinforcement learning, but there's also work on decision processes that don't satisfy the Markov property (a good google term for your card-playing use case would probably be "partially observable Markov decision processes", for example)
>- Can it only be used in QL / DQL?
No, every bit of reinforcement learning uses a reward, AFAIK: not just Q-learning / DQL but policy-gradient and actor-critic methods too.
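Maybe this helps: a tiny tabular Q-learning sketch showing exactly where the reward goes (the states/actions here are just hashable placeholders, not your actual card encoding):

```python
from collections import defaultdict

alpha = 0.1   # learning rate
gamma = 0.99  # discount factor
Q = defaultdict(float)   # Q[(state, action)] -> estimated return

def q_update(s, a, r, s_next, next_actions, done):
    """One temporal-difference update; r is the scalar reward for this one transition.
    next_actions is the set of legal actions in s_next."""
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in next_actions)
    td_target = r + gamma * best_next            # the reward enters here
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```

DQN does the same thing, except Q is a neural net and the update is a gradient step toward that same TD target.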
>- I don't use CNNs and images, can it be done without?
Absolutely! The training process is the same regardless of the underlying design of your Q/critic/actor/etc. function.
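E.g., a minimal PyTorch sketch of a Q-network over a hand-crafted feature vector instead of images (the feature/action counts are made-up placeholders):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a hand-crafted feature vector (e.g. an encoded card hand) to one Q-value per action."""
    def __init__(self, n_features=52, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, x):
        return self.net(x)

q_values = QNetwork()(torch.zeros(1, 52))   # one row of Q-values, no CNN anywhere
```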
>- Lots of examples out there using «gym», can you do it without?
You can, you just need something which provides an initial state and then takes actions and returns a new state, a reward, and (sometimes) an "end of episode" flag.
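Something like this is all gym is really doing for you (the "game" inside is a dummy counter just to show the shape of the interface):

```python
import random

class TinyEnv:
    """Bare-bones stand-in for gym: reset() gives an initial state,
    step() returns (next_state, reward, done)."""

    def reset(self):
        self.t = 0
        self.score = 0
        return self._state()

    def step(self, action):
        self.t += 1
        self.score += 1 if action == 1 else -1                # dummy game logic
        reward = float(self.score) if self.t == 10 else 0.0   # sparse reward at episode end
        done = self.t == 10
        return self._state(), reward, done

    def _state(self):
        return [self.t, self.score]

# the usual interaction loop
env = TinyEnv()
state, done, total = env.reset(), False, 0.0
while not done:
    action = random.choice([0, 1])            # your policy / Q-network goes here
    state, reward, done = env.step(action)
    total += reward
```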
>- Many examples use -100 to 100 as reward, should it not be -1 to 1?
Magnitude of reward isn't super important as long as it's consistent. If you have sparse rewards (e.g. 0 except on a win or loss), it might help to have larger values so the signal propagates back through the trajectory, but that's just me guessing. You can always try scaling to -1/1 and see how it goes.
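If you want to try that, it's just a one-liner before you store or learn from the reward (the max_abs value is whatever your game's raw reward range actually is):

```python
def scale_reward(r, max_abs=100.0):
    """Squash a raw reward (e.g. in -100..100) into roughly [-1, 1]."""
    return max(-1.0, min(1.0, r / max_abs))
```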
I read "Reinforcement Learning" by Sutton and Barto (2018 edition) over a summer and it was excellent. Well-written, clear, and extremely helpful. I think what you're missing is maybe the Bellman background context.