Part of Advances in Neural Information Processing Systems 7 (NIPS 1994)
Robert Crites, Andrew Barto
We prove the convergence of an actor/critic algorithm that is equivalent to Q-learning by construction. Its equivalence is achieved by encoding Q-values within the policy and value function of the actor and critic. The resultant actor/critic algorithm is novel in two ways: it updates the critic only when the most probable action is executed from any given state, and it rewards the actor using criteria that depend on the relative probability of the action that was executed.
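The sketch below illustrates the two novelties the abstract names in a tabular setting. It is an assumption-laden reconstruction, not the paper's algorithm: the softmax actor parameterization, the learning rates, the placeholder environment, and the exact actor-reward weighting are all hypothetical stand-ins for details given in the full paper.

```python
import numpy as np

n_states, n_actions = 10, 4
rng = np.random.default_rng(0)

V = np.zeros(n_states)               # critic: state-value estimates
p = np.zeros((n_states, n_actions))  # actor: action preferences
alpha, beta, gamma = 0.1, 0.1, 0.95  # critic rate, actor rate, discount

def policy(s):
    """Softmax over actor preferences (one common, assumed parameterization)."""
    z = np.exp(p[s] - p[s].max())
    return z / z.sum()

def step(s, a):
    """Placeholder environment: returns (reward, next_state)."""
    return rng.normal(), int(rng.integers(n_states))

s = 0
for _ in range(1000):
    probs = policy(s)
    a = rng.choice(n_actions, p=probs)
    r, s_next = step(s, a)
    td_error = r + gamma * V[s_next] - V[s]

    # Novelty 1 (per the abstract): update the critic only when the
    # most probable action was the one actually executed.
    if a == probs.argmax():
        V[s] += alpha * td_error

    # Novelty 2 (per the abstract): reward the actor with a criterion that
    # depends on the relative probability of the executed action. The
    # (1 - probs[a]) weighting here is an illustrative assumption.
    p[s, a] += beta * td_error * (1.0 - probs[a])
    s = s_next
```

Under the paper's construction, the actor's preferences and the critic's values jointly encode Q-values, which is what makes the scheme equivalent to Q-learning; the sketch above only mirrors the update structure, not that encoding.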