
csreid t1_iwrh9rt wrote

>- Is reward an input?

Kind of, in that it comes from the environment: at each step, the environment hands back a reward along with the next state.
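
To make that concrete, here's a toy sketch in Python (everything here is made up; the point is just that the reward comes back from the environment's step, not from the agent):

```python
import random

def env_step(action):
    # the environment, not the agent, decides the reward
    next_state = random.random()
    reward = 1.0 if action == 1 else -1.0
    return next_state, reward

state = random.random()
for _ in range(3):
    action = random.choice([0, 1])      # agent's choice
    state, reward = env_step(action)    # reward arrives as part of the env's response
```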

>- Is reward the process of constant retraining?

I'm not sure what this means.

>- Is reward the process of labeling?

Probably not, though again I'm not sure what you mean. Reward isn't labeling in the supervised-learning sense; it's a scalar signal the environment emits in response to the agent's actions.

>- Can it only be used with mdp?

MDPs are part of the mathematical backbone of reinforcement learning, but there's also work on decision processes that don't satisfy the Markov property. For your card-playing use case, a good Google term would probably be "partially observable Markov decision process" (POMDP), for example.

>- Can it only be used in ql / dql?

Reward isn't specific to Q-learning or deep Q-learning; every bit of reinforcement learning uses a reward, as far as I know.

>- I don't use CNNs and images, can it be done without?

Absolutely! The training process is the same regardless of the underlying design of your Q/critic/actor/etc. function.
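
For example, a Q-function over a flat feature vector can just be a small MLP. A rough sketch assuming PyTorch (the layer sizes and state/action dimensions are placeholders):

```python
import torch.nn as nn

# Q-network over a flat feature vector -- no CNN, no images.
state_dim, n_actions = 52, 4   # whatever your problem defines (hypothetical here)
q_net = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),  # one Q-value per action
)
```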

>- Lots of examples out there using «gym», can you do it without?

You can; you just need something that provides an initial state and then takes actions and returns a new state, a reward, and (sometimes) an "end of episode" flag.
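
Something like this is all «gym» is really giving you. A rough sketch (the class name, state encoding, and reward logic are all placeholders for your card game):

```python
import random

class CardGameEnv:
    """Minimal gym stand-in: reset() gives an initial state,
    step(action) returns (next_state, reward, done)."""

    def reset(self):
        self.turn = 0
        self.state = [0] * 10                 # whatever encodes your game state
        return self.state

    def step(self, action):
        self.turn += 1
        # ...apply the action to your game state here...
        reward = random.choice([-1, 0, 1])    # placeholder reward logic
        done = self.turn >= 10                # end-of-episode flag
        return self.state, reward, done

env = CardGameEnv()
state = env.reset()
done = False
while not done:
    action = random.randrange(4)              # stand-in for your agent's choice
    state, reward, done = env.step(action)
```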

>- Many examples use -100 to 100 as reward, should it not be -1 to 1?

The magnitude of the reward isn't super important as long as it's consistent. If you have sparse rewards (e.g. 0 except on a win or loss), larger values might make it easier for the signal to propagate back through the trajectory, but that's just me guessing. You can always try scaling to -1/1 and see how it goes.
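
If you do want to try it, the scaling itself is trivial, e.g.:

```python
reward = 75                                           # raw reward from the environment
scaled_reward = max(-1.0, min(1.0, reward / 100.0))   # squash roughly [-100, 100] into [-1, 1]
```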

I read "Reinforcement Learning" by Sutton and Barto (2018 edition) over a summer and it was excellent. Well-written, clear, and extremely helpful. I think what you're missing is maybe the Bellman background context.
