Markov Decision Processes
-
Basic
-
A set of states \(s \in S\)
-
A set of actions \(a \in A\)
-
A transition function \(T(s, a, s')\)
-
A reward function \(R(s, a,s')\)
-
A start state
-
Maybe a terminal state
-
Markov property: given the present state, the future is independent of the past; i.e. history doesn't matter
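-
A minimal sketch of how these pieces could be encoded, assuming a made-up two-state toy problem (all state names, probabilities, and rewards below are illustrative, not from the notes):
```python
# Hypothetical toy MDP: states, actions, transition function T(s,a,s') and
# reward function R(s,a,s').  Every number here is invented for illustration.
states = ['A', 'B', 'exit']
actions = ['left', 'right']

# T[(s, a)] lists the possible next states s' with their probabilities.
T = {
    ('A', 'left'):  [('A', 0.9), ('B', 0.1)],
    ('A', 'right'): [('B', 0.8), ('A', 0.2)],
    ('B', 'left'):  [('A', 0.8), ('B', 0.2)],
    ('B', 'right'): [('exit', 1.0)],
}

# R[(s, a, s')] is the reward for that transition; unlisted transitions give 0.
R = {
    ('B', 'right', 'exit'): 10.0,
    ('A', 'left', 'A'): -1.0,
}

start_state = 'A'
terminal_states = {'exit'}   # no actions are available from a terminal state
```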
-
Policies
-
Gives an action for each state \(\pi(s) = a\)
-
Optimal policy maximizes expected utility
-
An explicit policy defines a reflex agent
e.g. move west at \((0,1)\), move north at \((2,2)\) etc.
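-
As a small sketch, an explicit policy is just a lookup table from states to actions (the grid coordinates are the ones from the example above; any other entries would be hypothetical):
```python
# Explicit policy: one action per state, as in the reflex-agent example.
policy = {
    (0, 1): 'west',
    (2, 2): 'north',
}

def act(state):
    # A reflex agent simply reads its action off the policy table.
    return policy[state]
```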
-
Values
-
\(V^*(s)\) expected utility of starting from \(s\) and acting optimally
-
\(Q^*(s, a)\) expected utility of starting from \(s\), taking action \(a\), and acting optimally
from then on, i.e. the value of the q-state \((s, a)\)
-
\(\pi^*(s)\) optimal policy from \(s\)
-
Recursive definitions:
\(V^*(s) = \max_a Q^*(s,a)\)
\(Q^*(s, a) = \sum \limits_{s'}T(s,a,s')[R(s,a,s') + \gamma V^*(s')]\)
\(V^*(s) = \max_a \sum \limits_{s'}T(s,a,s')[R(s,a,s') + \gamma V^*(s')]\)
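-
Worked one-step example (all numbers made up for illustration): suppose from \(s\), action \(a\) reaches \(s_1\) with probability \(0.8\) and reward \(5\), and \(s_2\) with probability \(0.2\) and reward \(0\); with \(\gamma = 0.9\), \(V^*(s_1) = 10\), \(V^*(s_2) = 2\):
\(Q^*(s, a) = 0.8\,[5 + 0.9 \cdot 10] + 0.2\,[0 + 0.9 \cdot 2] = 11.2 + 0.36 = 11.56\)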
-
Value Iteration
-
Start with \(V_0(s) = 0\) for all \(s\)
-
Iterate on values with \(V_{k+1}(s) = \max_a \sum \limits_{s'}T(s,a,s')[R(s,a,s') + \gamma
V_k(s')]\)
-
Policy extraction: \(\forall s \in S, \pi^*(s) = \arg\max_a Q^*(s,a)\)
i.e. choose the action with the maximum Q-value
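-
A minimal value iteration sketch in Python, reusing the illustrative toy MDP encoding from above (the iteration cap and convergence threshold are assumptions for the example):
```python
# Toy MDP from the earlier sketch; all probabilities and rewards are invented.
states = ['A', 'B', 'exit']
actions = ['left', 'right']
T = {('A', 'left'):  [('A', 0.9), ('B', 0.1)],
     ('A', 'right'): [('B', 0.8), ('A', 0.2)],
     ('B', 'left'):  [('A', 0.8), ('B', 0.2)],
     ('B', 'right'): [('exit', 1.0)]}
R = {('B', 'right', 'exit'): 10.0, ('A', 'left', 'A'): -1.0}
gamma = 0.9

def q_value(s, a, V):
    # Q(s,a) = sum over s' of T(s,a,s') [R(s,a,s') + gamma V(s')]
    return sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2]) for s2, p in T[(s, a)])

# Start with V_0 = 0 and apply the Bellman update until values stop changing.
V = {s: 0.0 for s in states}
for _ in range(100):
    V_new = {s: 0.0 if s == 'exit' else max(q_value(s, a, V) for a in actions)
             for s in states}
    if max(abs(V_new[s] - V[s]) for s in states) < 1e-6:
        V = V_new
        break
    V = V_new

# Policy extraction: choose the action with the maximum Q-value in each state.
policy = {s: max(actions, key=lambda a: q_value(s, a, V))
          for s in states if s != 'exit'}
print(V, policy)
```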
-
Policy Iteration
-
Start with \(V^\pi_0(s) = 0\) for all \(s\)
-
Iterate on values for a fixed policy \(\pi\): \(V^\pi_{k+1}(s) = \sum
\limits_{s'}T(s,\pi(s),s')[R(s,\pi(s),s') + \gamma V^\pi_k(s')]\)
-
Finding optimal policy:
1. Policy evaluation;
2. Policy improvement
-
Policy evaluation: \(V^{\pi_i}_{k+1}(s) = \sum \limits_{s'} T(s, \pi_i(s), s') [R(s, \pi_i(s), s') +
\gamma V^{\pi_i}_{k}(s')]\)
Calculate utilities for acting according to the current policy \(\pi_i\)
-
Policy improvement: \(\pi_{i+1}(s) = \arg\max_a \sum \limits_{s'} T(s, a, s') [R(s, a, s') + \gamma
V^{\pi_i}(s')]\)
Choose the best action via a one-step look-ahead on the current values
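-
A minimal policy iteration sketch on the same illustrative toy MDP, alternating policy evaluation and policy improvement until the policy stops changing (the initial policy and the 100 evaluation sweeps are arbitrary choices for the example):
```python
# Toy MDP as in the value iteration sketch; all numbers are invented.
states = ['A', 'B', 'exit']
actions = ['left', 'right']
T = {('A', 'left'):  [('A', 0.9), ('B', 0.1)],
     ('A', 'right'): [('B', 0.8), ('A', 0.2)],
     ('B', 'left'):  [('A', 0.8), ('B', 0.2)],
     ('B', 'right'): [('exit', 1.0)]}
R = {('B', 'right', 'exit'): 10.0, ('A', 'left', 'A'): -1.0}
gamma = 0.9

def q_value(s, a, V):
    return sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2]) for s2, p in T[(s, a)])

policy = {'A': 'left', 'B': 'left'}           # arbitrary initial policy
while True:
    # 1. Policy evaluation: iterate V^pi for the fixed policy (100 sweeps is
    #    plenty for this tiny example).
    V = {s: 0.0 for s in states}
    for _ in range(100):
        V = {s: 0.0 if s == 'exit' else q_value(s, policy[s], V) for s in states}
    # 2. Policy improvement: one-step look-ahead with the evaluated values.
    new_policy = {s: max(actions, key=lambda a: q_value(s, a, V)) for s in policy}
    if new_policy == policy:                  # policy is stable, so it's optimal
        break
    policy = new_policy
print(policy, V)
```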
-
Summary
-
Value iteration: finding optimal values
-
Policy evaluation: finding values under a certain policy
-
Policy extraction: finding a policy given a set of values
If the state values are optimal (\(V^*\)), the extracted policy is optimal
-
Policy iteration: policy evaluation + policy extraction