Reinforcement Learning
-
Basics
-
Still Assume MDP
-
\(T(s, a, s')\) and \(R(s, a, s')\) are now unknown
-
Model-Based Learning
-
Build an empirical model based on experiences
-
\(\hat T(s, a, s')\) and \(\hat R(s, a, s')\) are estimated from experience
(by averaging seen values)
-
Solve the MDP using \(\hat T(s, a, s')\) and \(\hat R(s, a, s')\) (e.g. with value iteration)
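As a concrete illustration, here is a minimal sketch of the estimation step in Python; the episode format (lists of \((s, a, r, s')\) tuples) is an assumption of the example, not a fixed API:

```python
from collections import defaultdict

def estimate_model(episodes):
    """Estimate T_hat(s, a, s') and R_hat(s, a, s') by averaging seen values.

    `episodes` is assumed to be a list of lists of (s, a, r, s_next) tuples.
    """
    transition_counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s'] = N(s, a, s')
    reward_sums = defaultdict(float)                           # total reward observed for (s, a, s')

    for episode in episodes:
        for s, a, r, s_next in episode:
            transition_counts[(s, a)][s_next] += 1
            reward_sums[(s, a, s_next)] += r

    T_hat, R_hat = {}, {}
    for (s, a), outcomes in transition_counts.items():
        total = sum(outcomes.values())
        for s_next, n in outcomes.items():
            T_hat[(s, a, s_next)] = n / total                        # empirical transition probability
            R_hat[(s, a, s_next)] = reward_sums[(s, a, s_next)] / n  # average observed reward
    return T_hat, R_hat
```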
-
Model-Free Learning
-
Instead of using \(T\) and \(R\), take actions and use the actual observed outcomes to estimate
expected values
-
A method is model-free if it never uses \(T\) and \(R\)
-
Passive Reinforcement Learning
-
Takes in a policy \(\pi\) to follow and learns the values of states under that policy
-
Does not update the policy
-
Direct Evaluation: add up the (discounted) rewards actually experienced from each state onward, and
average over all visits to that state across episodes
-
Direct evaluation discards information about state transitions and estimates each state's value independently, so it ignores what connected states could tell us about each other
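A rough sketch of direct evaluation, assuming each episode is recorded as a list of \((s, r)\) pairs gathered while following the fixed policy \(\pi\) (the episode format and the default \(\gamma\) are assumptions of the example):

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=0.9):
    """Average the discounted return observed from each state onward, over all visits."""
    return_sums = defaultdict(float)
    visit_counts = defaultdict(int)

    for episode in episodes:
        # Discounted return from each time step to the end of the episode.
        G = 0.0
        returns = []
        for _, r in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()

        for (s, _), G_t in zip(episode, returns):
            return_sums[s] += G_t
            visit_counts[s] += 1

    return {s: return_sums[s] / visit_counts[s] for s in return_sums}
```

Note that no transition information is used: each state's estimate is just an average of returns that started there, which is exactly the weakness described above.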
-
Temporal Difference Learning:
\(\text{sample} = R(s, \pi(s), s') + \gamma V^{\pi}(s')\)
\(V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha\,(\text{sample} - V^{\pi}(s))\)
-
\(\alpha\) is the learning rate: how much weight new information gets relative to old information;
e.g. if \(\alpha = 0\) all old information is kept (nothing is learned), and if \(\alpha = 1\) old
information is completely overwritten by the new sample
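A minimal TD(0) sketch, assuming experience arrives one transition at a time as \((s, r, s')\) tuples while following \(\pi\) (the toy transitions in the usage example are made up):

```python
from collections import defaultdict

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update of V^pi after observing the transition (s, r, s')."""
    sample = r + gamma * V[s_next]      # one-sample estimate of the value of s
    V[s] += alpha * (sample - V[s])     # move V[s] a fraction alpha toward the sample

# Usage: values default to 0 and are updated online as experience streams in.
V = defaultdict(float)
for s, r, s_next in [("A", 1.0, "B"), ("B", 0.0, "C"), ("A", 1.0, "C")]:
    td_update(V, s, r, s_next)
```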
-
Active Reinforcement Learning
-
Policy and values are learned
-
Q-Learning: sample-based Q-value iteration (a code sketch follows below)
-
\(\text{sample} = R(s,a,s') + \gamma \max\limits_{a'} Q(s',a')\)
\(Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha \cdot \text{sample}\)
-
\(T(s,a,s')\) is implicitly learned, since \((s,a,s')\) triples with higher transition probability are
seen more often and hence carry more weight in the running average
-
Q-learning converges to the optimal Q-values if all state-action pairs are visited infinitely often (and the learning rate is decayed appropriately)
-
\(\pi(s) = \text{argmax}_a\ Q(s,a)\)
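Putting the Q-learning items above together, here is a compact tabular sketch; the environment interface (`reset()` returning a state, `step(a)` returning \((s', r, \text{done})\)) and the action list are assumptions of the example, not a particular library:

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=0.9):
    """Tabular Q-learning from sampled transitions."""
    Q = defaultdict(float)  # Q[(s, a)] defaults to 0

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Purely random exploration here; epsilon-greedy and exploration
            # functions (next section) are the usual refinements.
            a = random.choice(actions)
            s_next, r, done = env.step(a)

            # sample = R(s,a,s') + gamma * max_a' Q(s',a'); terminal states add no future value
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
            sample = r + gamma * best_next
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
            s = s_next

    # pi(s) = argmax_a Q(s, a)
    def policy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    return Q, policy
```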
-
Approximate Q-Learning
-
Basic Q-learning keeps a separate value for every q-state, with no notion of similarity between states, so it must store and learn all of the values individually
-
idea: use features and represent states with a feature vector
-
feature-based representation allows us to generalize learning experiences
-
keep a trainable weight vector, so that \(Q(s,a) = \sum_i w_i f_i(s, a)\)
-
weight update:
\(\text{difference} = \left[R(s, a, s') + \gamma \max_{a'} Q(s', a')\right] - Q(s, a)\)
difference: the gap between the sample and the current predicted Q-value
-
\(w_{i} \leftarrow w_{i}+\alpha \cdot \text {difference} \cdot f_{i}(s, a)\)
update weights after each sample
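A sketch of the feature-based version, assuming a user-supplied `feature_fn(s, a)` that returns a list of feature values \(f_i(s,a)\); that function and the action list are placeholders for the example:

```python
def approx_q(weights, features):
    """Q(s, a) = sum_i w_i * f_i(s, a)."""
    return sum(w * f for w, f in zip(weights, features))

def approx_q_update(weights, feature_fn, s, a, r, s_next, actions, alpha=0.01, gamma=0.9):
    """Return updated weights after one (s, a, r, s') sample."""
    # difference = [R(s,a,s') + gamma * max_a' Q(s',a')] - Q(s,a)
    best_next = max(approx_q(weights, feature_fn(s_next, a2)) for a2 in actions)
    difference = (r + gamma * best_next) - approx_q(weights, feature_fn(s, a))

    # w_i <- w_i + alpha * difference * f_i(s, a)
    return [w + alpha * difference * f for w, f in zip(weights, feature_fn(s, a))]
```

Because the same weights are shared across states, one sample shifts the predicted Q-values of every state that shares features with \((s, a)\), which is what makes the generalization possible.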
-
Exploration and Exploitation
-
\(\epsilon\)-greedy: choose a random action with probability \(\epsilon\), otherwise act greedily w.r.t. the current Q-values
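A small sketch of \(\epsilon\)-greedy selection (the Q-table layout `Q[(s, a)]` and the `actions` list are assumptions carried over from the Q-learning sketch above):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon explore (random action), otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.choice(actions)                # explore
    return max(actions, key=lambda a: Q[(s, a)])     # exploit: argmax_a Q(s, a)
```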
-
exploration functions: add a bonus to less-visited q-states by replacing \(Q(s', a')\) with
\(f(s', a')\) in the update:
\(Q(s, a) \leftarrow (1-\alpha)\, Q(s, a) + \alpha\left[R(s, a, s') + \gamma \max_{a'} f(s', a')\right]\)
-
usually, \(f(s, a)=Q(s, a)+\frac{k}{N(s, a)}\) where \(N(s, a)\) is the number of times \((s,a)\)
has been seen
exploration term gets smaller as \(N(s,a)\) increases
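A sketch of the count-based exploration function plugged into the Q-update; the \(k/(N+1)\) form is a small deviation from the formula above so that unseen pairs do not divide by zero, and the table layouts are assumptions of the example:

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(s, a)]
N = defaultdict(int)     # N[(s, a)]: how many times (s, a) has been taken

def exploration_value(s, a, k=1.0):
    """Optimistic value f(s, a) = Q(s, a) + k / (N(s, a) + 1); the +1 avoids division by zero."""
    return Q[(s, a)] + k / (N[(s, a)] + 1)

def q_update_with_exploration(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q-update that values the successor state with f(s', a') instead of Q(s', a')."""
    N[(s, a)] += 1
    sample = r + gamma * max(exploration_value(s_next, a2) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```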