Reinforcement Learning
    
        - 
            Basics
 
                - 
                    Still assume the environment is an MDP
                
- 
                    \(T(s, a, s')\) and \(R(s, a, s')\) are unknown now
                
 
- 
            Model-Based Learning
 
                - 
                    Build an empirical model based on experiences
                
- 
                    \(\hat T(s, a, s')\) and \(\hat R(s, a, s')\) are estimated from experience
                    (\(\hat T\) from normalized outcome counts, \(\hat R\) by averaging observed rewards); see the
                    sketch at the end of this section
                
- 
                    Solve the estimated MDP with \(\hat T(s, a, s')\) and \(\hat R(s, a, s')\), e.g. with value
                    iteration
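- 
                    A minimal sketch of the counting/averaging step, assuming experience is available as a list of
                    \((s, a, r, s')\) transitions; the name `estimate_model` and the dictionary layout are
                    illustrative, not part of the notes. The estimated MDP would then be solved as usual.

```python
from collections import defaultdict

def estimate_model(experience):
    """Estimate T-hat and R-hat from observed (s, a, r, s') transitions.

    T-hat(s, a, s') = (# times s' followed (s, a)) / (# times (s, a) was taken)
    R-hat(s, a, s') = average reward observed on the transition (s, a, s')
    """
    next_counts = defaultdict(lambda: defaultdict(int))  # next_counts[(s, a)][s'] = count
    reward_sums = defaultdict(float)                     # reward_sums[(s, a, s')] = total reward

    for s, a, r, s_next in experience:
        next_counts[(s, a)][s_next] += 1
        reward_sums[(s, a, s_next)] += r

    T_hat, R_hat = {}, {}
    for (s, a), counts in next_counts.items():
        total = sum(counts.values())
        for s_next, n in counts.items():
            T_hat[(s, a, s_next)] = n / total
            R_hat[(s, a, s_next)] = reward_sums[(s, a, s_next)] / n
    return T_hat, R_hat
```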
                
 
- 
            Model-Free Learning
 
                - 
                    Instead of using \(T\) and \(R\), take actions and use the actual observed outcomes to estimate
                    expected values
                
- 
                    A method is model-free if it never uses \(T\) and \(R\)
                
 
- 
            Passive Reinforcement Learning
 
                - 
                    Takes in a policy \(\pi\) to follow and learns the values of states under that policy
                
- 
                    Does not update the policy
                
- 
                    Direct Evaluation: for each state, add up the discounted rewards experienced from that state
                    onward, and average across episodes (see the first sketch at the end of this section)
                
- 
                    Direct evaluation discards information about state transitions and estimates each state's value
                    independently, so it can take a long time to converge
                
- 
                    Temporal Difference Learning (see the second sketch at the end of this section): 
 \(\text{sample} = R(s, \pi(s), s') + \gamma V^{\pi}(s')\)
 \(V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha (\text{sample} - V^{\pi}(s))\)
- 
                    \(\alpha\) is the learning rate, i.e. how much weight new information gets relative to old
                    information;
 e.g. if \(\alpha=0\) all old information is kept (nothing is learned), if \(\alpha = 1\) old
                    information is completely overwritten by the new sample
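- 
                    A sketch of direct evaluation, assuming each episode is recorded as a list of \((s, a, r, s')\)
                    transitions generated by following the fixed policy \(\pi\), and that every visit to a state
                    contributes one return to its average; `direct_evaluation` is an illustrative name.

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma):
    """Average the observed discounted returns from each visited state."""
    return_sums = defaultdict(float)  # sum of returns observed from each state
    visit_counts = defaultdict(int)   # number of visits to each state

    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so G is the discounted return from each state onward.
        for s, a, r, s_next in reversed(episode):
            G = r + gamma * G
            return_sums[s] += G
            visit_counts[s] += 1

    return {s: return_sums[s] / visit_counts[s] for s in return_sums}
```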
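- 
                    A sketch of a single temporal-difference update, assuming the value estimates live in a dict
                    where missing states default to 0; `td_update` is an illustrative name. Each observed transition
                    \((s, \pi(s), r, s')\) triggers one such update, and no model of \(T\) or \(R\) is ever built.

```python
def td_update(V, s, r, s_next, gamma, alpha):
    """Move V[s] toward the one-step sample r + gamma * V[s'] observed while following pi."""
    sample = r + gamma * V.get(s_next, 0.0)
    V[s] = V.get(s, 0.0) + alpha * (sample - V.get(s, 0.0))
```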
 
- 
            Active Reinforcement Learning
 
                - 
                    Policy and values are learned
                
- 
                    Q-Learning: sample-based Q-value iteration (see the sketch at the end of this section)
                
- 
                    \(\text{sample} = R(s,a,s') + \gamma \max \limits_{a'} Q(s',a')\) 
 \(Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha \cdot \text{sample}\)
- 
                    \(T(s,a,s')\) is implicitly learned, since \((s,a,s')\) triples with higher transition
                    probability are seen more often and hence carry more weight in the running average
                
- 
                    Q-learning converges to the optimal Q-values if every state-action pair is visited infinitely
                    often (and the learning rate is decayed appropriately)
                
- 
                    \(\pi(s) = \text{argmax}_a\ Q(s,a)\)
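- 
                    A sketch of the Q-learning update and greedy policy extraction, assuming Q is a
                    `defaultdict(float)` keyed by \((s, a)\) and `actions(s)` returns the legal actions in a state;
                    all names here are illustrative.

```python
from collections import defaultdict

Q = defaultdict(float)  # Q[(s, a)]; unseen q-states default to 0

def q_update(Q, s, a, r, s_next, actions, gamma, alpha):
    """One Q-learning update from the observed transition (s, a, r, s')."""
    # Assumes s_next has at least one legal action; terminal states need a special case.
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions(s_next))
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

def greedy_policy(Q, s, actions):
    """pi(s) = argmax_a Q(s, a)."""
    return max(actions(s), key=lambda a: Q[(s, a)])
```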
                
 
- 
            Approximate Q-Learning
 
                - 
                    Exact Q-learning keeps a separate value for every q-state (it captures no relations between
                    similar states) and needs to store all of these values, which does not scale
                
- 
                    idea: describe states (or q-states) with a vector of features \(f_i(s,a)\)
                
- 
                    a feature-based representation lets us generalize from experienced states to similar, unseen
                    states
                
- 
                    keep a trainable weight vector \(w\), with \(Q(s,a) = \sum_i w_i f_i(s,a)\)
                
- 
                    weight update (see the sketch at the end of this section): 
 \(\text{difference}=\left[R\left(s, a, s^{\prime}\right)+\gamma \max _{a^{\prime}}
                    Q\left(s^{\prime}, a^{\prime}\right)\right]-Q(s, a)\)
 difference: the gap between the sample and the current predicted q-value
 
- 
                    \(w_{i} \leftarrow w_{i}+\alpha \cdot \text {difference} \cdot f_{i}(s, a)\) 
 update weights after each sample
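- 
                    A sketch of the linear approximate Q-learning update, assuming `features(s, a)` returns the
                    feature vector as a dict of named values and the weights live in a dict `w`; the function names
                    are illustrative.

```python
def q_value(w, features, s, a):
    """Q(s, a) = sum_i w_i * f_i(s, a) under a linear feature representation."""
    return sum(w.get(name, 0.0) * value for name, value in features(s, a).items())

def approx_q_update(w, features, s, a, r, s_next, actions, gamma, alpha):
    """Shift every weight in the direction that reduces the TD difference."""
    best_next = max(q_value(w, features, s_next, a2) for a2 in actions(s_next))
    difference = (r + gamma * best_next) - q_value(w, features, s, a)
    for name, value in features(s, a).items():
        w[name] = w.get(name, 0.0) + alpha * difference * value
```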
 
- 
            Exploration and Exploitation
 
                - 
                    \(\epsilon\)-greedy: choose a random action with probability \(\epsilon\), otherwise act
                    greedily with respect to the current Q-values (see the sketch at the end of this section)
                
- 
                    exploration functions: give a bonus to less-visited q-states by replacing \(Q\left(s^{\prime},
                    a^{\prime}\right)\) with \(f\left(s^{\prime}, a^{\prime}\right)\) in the update: 
 \(Q(s, a) \leftarrow(1-\alpha) Q(s, a)+\alpha \cdot\left[R\left(s, a, s^{\prime}\right)+\gamma \max
                    _{a^{\prime}} f\left(s^{\prime}, a^{\prime}\right)\right]\)
- 
                    usually, \(f(s, a)=Q(s, a)+\frac{k}{N(s, a)}\) where \(N(s, a)\) is the number of times \((s,a)\)
                    has been seen 
 exploration term gets smaller as \(N(s,a)\) increases
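- 
                    A sketch of \(\epsilon\)-greedy action selection and the exploration function above, assuming Q
                    and the visit counts N are dicts keyed by \((s, a)\) with defaults of 0; treating a never-tried
                    pair as maximally attractive is an added convention, not part of the notes.

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # Q[(s, a)]
N = defaultdict(int)    # N[(s, a)] = number of times (s, a) has been taken

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon explore (random action); otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.choice(actions(s))
    return max(actions(s), key=lambda a: Q[(s, a)])

def exploration_value(Q, N, s, a, k):
    """f(s, a) = Q(s, a) + k / N(s, a); the bonus shrinks as (s, a) is visited more often."""
    if N[(s, a)] == 0:
        return float("inf")  # added convention: untried actions look maximally attractive
    return Q[(s, a)] + k / N[(s, a)]
```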