Open
Description
The Actor Critic example (which is actually an implementation of REINFORCE-with-baseline, as pointed out in #573) does not use the discount rate properly.
The loss should include a \gamma^t factor, as shown in the pseudo-code box on page 330 of Sutton & Barto:
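For concreteness, the policy-parameter update in that box carries the γᵗ factor. Reconstructed from memory of the second edition (so not a verbatim quote), it looks roughly like:

```latex
\theta \leftarrow \theta + \alpha^{\theta} \, \gamma^{t} \, \delta \, \nabla_{\theta} \ln \pi(A_t \mid S_t, \theta)
```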
Activity
dknathalage commented on Jul 30, 2020
Actually, the code implementation does include γᵗ; see line 105 of actor_critic.py:
examples/reinforcement_learning/actor_critic.py, line 105 (commit 8df8e74)
The loop iterates over the rewards in reverse, repeatedly multiplying gamma by the discounted return of the following timestep and inserting the result at the beginning of the list.
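For context, a minimal sketch of the kind of reverse-discounting loop being described (the exact snippet at line 105 is not reproduced here, and the names below are illustrative rather than taken from the script):

```python
from typing import List

def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Compute G_t = r_t + gamma * G_{t+1} by iterating over the rewards in reverse."""
    returns: List[float] = []
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R       # discount the return of the following timestep
        returns.insert(0, R)    # prepend so returns[t] corresponds to timestep t
    return returns
```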
rodrigodesalvobraz commented on Jul 30, 2020
Thanks for the reply. However, the section of code you indicate seems to correspond to the calculation of G in the book's pseudo-code (see the more complete pseudo-code box below). This portion of the pseudo-code (and the code you indicate) applies the discount from timestep t until the end of the episode.
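Roughly, the per-timestep computations in that box are the following (reconstructed from memory, so not a verbatim reproduction of the book's figure):

```latex
G \leftarrow \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k
\delta \leftarrow G - \hat{v}(S_t, \mathbf{w})
\mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}} \, \delta \, \nabla \hat{v}(S_t, \mathbf{w})
\theta \leftarrow \theta + \alpha^{\theta} \, \gamma^{t} \, \delta \, \nabla_{\theta} \ln \pi(A_t \mid S_t, \theta)
```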
However, the book additionally applies the discount rate from the beginning of the episode up to t, via the γᵗ factor in the last line of the pseudo-code. It seems to me that it is this application of the discount rate that is missing in the code.
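A minimal sketch of how that factor could be folded into the loss, assuming per-timestep lists of log-probabilities, value estimates, and returns (none of these names come from the example script itself):

```python
import torch

def policy_loss_with_time_discount(log_probs, values, returns, gamma):
    """Hypothetical sketch: scale each timestep's policy-gradient term by gamma**t,
    matching the last line of the REINFORCE-with-baseline pseudo-code.
    log_probs and values are 0-dim tensors; returns are scalars."""
    losses = []
    for t, (log_prob, value, R) in enumerate(zip(log_probs, values, returns)):
        advantage = R - value.item()                  # delta = G - v(S_t, w)
        losses.append(-(gamma ** t) * log_prob * advantage)
    return torch.stack(losses).sum()
```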
msaroufim commented on Mar 9, 2022
@rodrigodesalvobraz I'd suggest you try out your improved version and see whether it converges faster or to a better result, and make a PR if it does.