Reinforcement Learning
-
Basics
-
Still Assume MDP
-
\(T(s, a, s')\) and \(R(s, a, s')\) are now unknown
-
Model-Based Learning
-
Build an empirical model based on experiences
-
\(\hat T(s, a, s')\) and \(\hat R(s, a, s')\) are estimated from experience
(by averaging seen values)
-
Solve the MDP using \(\hat T(s, a, s')\) and \(\hat R(s, a, s')\) (e.g. with value iteration)
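As a concrete illustration, here is a minimal sketch of the estimation step in Python; the episode format (lists of \((s, a, r, s')\) tuples) is an assumption of the example, not a fixed API:

```python
from collections import defaultdict

def estimate_model(episodes):
    """Estimate T_hat(s, a, s') and R_hat(s, a, s') by averaging seen values.

    `episodes` is assumed to be a list of lists of (s, a, r, s_next) tuples.
    """
    transition_counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s'] = N(s, a, s')
    reward_sums = defaultdict(float)                           # total reward observed for (s, a, s')

    for episode in episodes:
        for s, a, r, s_next in episode:
            transition_counts[(s, a)][s_next] += 1
            reward_sums[(s, a, s_next)] += r

    T_hat, R_hat = {}, {}
    for (s, a), outcomes in transition_counts.items():
        total = sum(outcomes.values())
        for s_next, n in outcomes.items():
            T_hat[(s, a, s_next)] = n / total                        # empirical transition probability
            R_hat[(s, a, s_next)] = reward_sums[(s, a, s_next)] / n  # average observed reward
    return T_hat, R_hat
```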
-
Model-Free Learning
-
Instead of using \(T\) and \(R\), take actions and use the actual observed outcomes to estimate
expected values
-
A method is model-free if it never uses \(T\) and \(R\)
-
Passive Reinforcement Learning
-
Takes in a policy \(\pi\) to follow and learns the values of states under that policy
-
Does not update the policy
-
Direct Evaluation: add up the (discounted) rewards actually experienced from each state onward, and
average over all visits to that state across episodes
-
Direct evaluation discards information about state transitions and estimates each state's value independently, so it ignores what connected states could tell us about each other
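A rough sketch of direct evaluation, assuming each episode is recorded as a list of \((s, r)\) pairs gathered while following the fixed policy \(\pi\) (the episode format and the default \(\gamma\) are assumptions of the example):

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=0.9):
    """Average the discounted return observed from each state onward, over all visits."""
    return_sums = defaultdict(float)
    visit_counts = defaultdict(int)

    for episode in episodes:
        # Discounted return from each time step to the end of the episode.
        G = 0.0
        returns = []
        for _, r in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()

        for (s, _), G_t in zip(episode, returns):
            return_sums[s] += G_t
            visit_counts[s] += 1

    return {s: return_sums[s] / visit_counts[s] for s in return_sums}
```

Note that no transition information is used: each state's estimate is just an average of returns that started there, which is exactly the weakness described above.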
-
Temporal Difference Learning:
\(\text{sample} = R(s, \pi(s), s') + \gamma V^{\pi}(s')\)
\(V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha\,(\text{sample} - V^{\pi}(s))\)
-
\(\alpha\) is the learning rate: how much weight new information gets relative to old information;
e.g. if \(\alpha = 0\) all old information is kept (nothing is learned), and if \(\alpha = 1\) old
information is completely overwritten by the new sample
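A minimal TD(0) sketch, assuming experience arrives one transition at a time as \((s, r, s')\) tuples while following \(\pi\) (the toy transitions in the usage example are made up):

```python
from collections import defaultdict

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update of V^pi after observing the transition (s, r, s')."""
    sample = r + gamma * V[s_next]      # one-sample estimate of the value of s
    V[s] += alpha * (sample - V[s])     # move V[s] a fraction alpha toward the sample

# Usage: values default to 0 and are updated online as experience streams in.
V = defaultdict(float)
for s, r, s_next in [("A", 1.0, "B"), ("B", 0.0, "C"), ("A", 1.0, "C")]:
    td_update(V, s, r, s_next)
```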
-
Active Reinforcement Learning
-
Policy and values are learned
-
Q-Learning: sample-based Q-value iteration (a code sketch follows below)
-
\(\text{sample} = R(s,a,s') + \gamma \max\limits_{a'} Q(s',a')\)
\(Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha \cdot \text{sample}\)
-
\(T(s,a,s')\) is implicitly learned, since \((s,a,s')\) triples with higher transition probability are
seen more often and hence carry more weight in the running average
-
Q-learning converges to the optimal Q-values if all state-action pairs are visited infinitely often (and the learning rate is decayed appropriately)
-
\(\pi(s) = \text{argmax}_a\ Q(s,a)\)
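Putting the Q-learning items above together, here is a compact tabular sketch; the environment interface (`reset()` returning a state, `step(a)` returning \((s', r, \text{done})\)) and the action list are assumptions of the example, not a particular library:

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=0.9):
    """Tabular Q-learning from sampled transitions."""
    Q = defaultdict(float)  # Q[(s, a)] defaults to 0

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Purely random exploration here; epsilon-greedy and exploration
            # functions (next section) are the usual refinements.
            a = random.choice(actions)
            s_next, r, done = env.step(a)

            # sample = R(s,a,s') + gamma * max_a' Q(s',a'); terminal states add no future value
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
            sample = r + gamma * best_next
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
            s = s_next

    # pi(s) = argmax_a Q(s, a)
    def policy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    return Q, policy
```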
-
Approximate Q-Learning
-
Basic Q-learning keeps a separate value for every q-state, with no notion of similarity between states, so it must store and learn all of the values individually
-
idea: use features and represent states with a feature vector
-
feature-based representation allows us to generalize learning experiences
-
keep a trainable weight vector, so that \(Q(s,a) = \sum_i w_i f_i(s, a)\)
-
weight update:
\(\text{difference} = \left[R(s, a, s') + \gamma \max_{a'} Q(s', a')\right] - Q(s, a)\)
difference: the gap between the sample and the current predicted Q-value
-
\(w_{i} \leftarrow w_{i}+\alpha \cdot \text {difference} \cdot f_{i}(s, a)\)
update weights after each sample
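A sketch of the feature-based version, assuming a user-supplied `feature_fn(s, a)` that returns a list of feature values \(f_i(s,a)\); that function and the action list are placeholders for the example:

```python
def approx_q(weights, features):
    """Q(s, a) = sum_i w_i * f_i(s, a)."""
    return sum(w * f for w, f in zip(weights, features))

def approx_q_update(weights, feature_fn, s, a, r, s_next, actions, alpha=0.01, gamma=0.9):
    """Return updated weights after one (s, a, r, s') sample."""
    # difference = [R(s,a,s') + gamma * max_a' Q(s',a')] - Q(s,a)
    best_next = max(approx_q(weights, feature_fn(s_next, a2)) for a2 in actions)
    difference = (r + gamma * best_next) - approx_q(weights, feature_fn(s, a))

    # w_i <- w_i + alpha * difference * f_i(s, a)
    return [w + alpha * difference * f for w, f in zip(weights, feature_fn(s, a))]
```

Because the same weights are shared across states, one sample shifts the predicted Q-values of every state that shares features with \((s, a)\), which is what makes the generalization possible.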
-
Exploration and Exploitation
-
\(\epsilon\)-greedy: choose a random action with probability \(\epsilon\), otherwise act greedily w.r.t. the current Q-values
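A small sketch of \(\epsilon\)-greedy selection (the Q-table layout `Q[(s, a)]` and the `actions` list are assumptions carried over from the Q-learning sketch above):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon explore (random action), otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.choice(actions)                # explore
    return max(actions, key=lambda a: Q[(s, a)])     # exploit: argmax_a Q(s, a)
```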
-
exploration functions: add a bonus to less-visited q-states by replacing \(Q(s', a')\) with
\(f(s', a')\) in the update:
\(Q(s, a) \leftarrow (1-\alpha)\, Q(s, a) + \alpha\left[R(s, a, s') + \gamma \max_{a'} f(s', a')\right]\)
-
usually, \(f(s, a)=Q(s, a)+\frac{k}{N(s, a)}\) where \(N(s, a)\) is the number of times \((s,a)\)
has been seen
exploration term gets smaller as \(N(s,a)\) increases
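A sketch of the count-based exploration function plugged into the Q-update; the \(k/(N+1)\) form is a small deviation from the formula above so that unseen pairs do not divide by zero, and the table layouts are assumptions of the example:

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(s, a)]
N = defaultdict(int)     # N[(s, a)]: how many times (s, a) has been taken

def exploration_value(s, a, k=1.0):
    """Optimistic value f(s, a) = Q(s, a) + k / (N(s, a) + 1); the +1 avoids division by zero."""
    return Q[(s, a)] + k / (N[(s, a)] + 1)

def q_update_with_exploration(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q-update that values the successor state with f(s', a') instead of Q(s', a')."""
    N[(s, a)] += 1
    sample = r + gamma * max(exploration_value(s_next, a2) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```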