My (rough) notes on the Deep Reinforcement Learning Through Policy Optimization NIPS tutorial on 12/15/2016.

  • Reinforcement learning:

    • Environment is in some state at time t
    • Agent chooses an action
    • As a consequence, the environment changes state and a reward is emitted.
  • Policy Optimization

    • Find policy parameters that maximize expected reward
    • Policy is stochastic
  • Often \(\pi\) is simpler than Q or V

  • Value function V doesn't prescribe actions (you need a dynamics model to back an action out of it)
  • Q: acting requires solving a potentially tough maximization over actions (hard for continuous or high-dimensional action spaces)

  • Many policy optimization success stories.

  • Cross-entropy method

    • Views U (the utility as a function of the policy parameters) as a black box
    • Ignores all information other than the value of U (a code sketch follows this list)
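
A minimal sketch of the cross-entropy method as I understood it, in Python; the Gaussian parameterization and the evaluate_policy callback are my own placeholders, not from the tutorial:

```python
import numpy as np

def cross_entropy_method(evaluate_policy, dim, n_iters=50, pop_size=100, n_elite=20):
    """Black-box search over policy parameters theta.

    evaluate_policy(theta) -> estimated expected return U(theta) from rollouts.
    """
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(n_iters):
        # Sample a population of parameter vectors from the current Gaussian.
        thetas = mu + sigma * np.random.randn(pop_size, dim)
        returns = np.array([evaluate_policy(theta) for theta in thetas])
        # Keep the top-performing ("elite") samples and refit the Gaussian to them.
        elite = thetas[np.argsort(returns)[-n_elite:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu
```
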
  • Related methods:

    • Reward-weighted regression
    • weight by reward?
    • Policy improvement with path integrals
    • Covariance matrix adaptation evolution strategy (CMA-ES)
    • PoWER
  • PoWER can solve the ball-in-a-cup task in 100 trials

  • Likelihood ratio gradient

    • Utility \(U(\theta)\) is the expected total reward under the policy
    • Valid even if the reward is discontinuous and/or unknown, or the sample space is discrete.
    • Intuition: shift from bad paths to good paths
    • Can compute the likelihood-ratio gradient even without a dynamics model (identity below)
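
For reference, the likelihood-ratio identity the above relies on (my reconstruction, with \(U(\theta)\) the expected total reward over trajectories \(\tau \sim \pi_\theta\)):

\[
\nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau)\, \nabla_\theta \log P(\tau; \theta) \right]
= \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right].
\]

The dynamics terms in \(\log P(\tau; \theta)\) do not depend on \(\theta\), so they drop out of the gradient; that is why no dynamics model is needed, and why only samples of \(R(\tau)\) are required even when the reward is discontinuous or unknown.
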
  • Variance reduction is important in practice; several approaches were discussed (two sketched below).
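
Two standard tricks, from my recollection rather than verbatim from the slides: use only the reward-to-go for each action and subtract a state-dependent baseline \(b(s_t)\); both keep the estimator unbiased while lowering variance:

\[
\nabla_\theta U(\theta) = \mathbb{E}\!\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( \sum_{t' \ge t} r_{t'} - b(s_t) \right) \right].
\]
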

  • Desiderata for policy optimization method:

    • Stable monotonic improvement
    • Good sample complexity
  • Why is step size a big deal in RL?

    • Supervised learning: step too far -> the next update will fix it
    • RL: step too far -> bad policy; the next batch of data is then collected under that bad policy, so it is hard to recover
  • We care about \(\eta(\pi)\), the expected return of \(\pi\)

  • Collect data with \(\pi_{old}\); then optimize a new objective to get a new policy \(\pi\)
  • Define a local approximation to the expected return of \(\pi\) (roughly reconstructed below).
  • The 'local approximation to expected return' comes with a bound relating it to the true expected return.
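
My rough reconstruction of the local approximation and its bound, following the TRPO paper's notation rather than the slides:

\[
L_{\pi_{old}}(\pi) = \eta(\pi_{old}) + \mathbb{E}_{s,\,a \sim \pi_{old}}\!\left[ \frac{\pi(a \mid s)}{\pi_{old}(a \mid s)}\, A^{\pi_{old}}(s, a) \right],
\qquad
\eta(\pi) \ge L_{\pi_{old}}(\pi) - C \cdot \max_s \mathrm{KL}\!\left( \pi_{old}(\cdot \mid s) \,\|\, \pi(\cdot \mid s) \right),
\]

where \(C\) is a constant depending on \(\gamma\) and the maximum advantage. Maximizing the right-hand side therefore guarantees monotonic improvement in \(\eta\).
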

  • TRPO provides an approximation to the previous algorithm that is nicer in practice.
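
Concretely, my understanding of the approximation TRPO makes: replace the max-KL penalty with a trust-region constraint on the average KL over visited states,

\[
\max_\theta \; L_{\theta_{old}}(\theta) \quad \text{s.t.} \quad \overline{\mathrm{KL}}\!\left( \pi_{\theta_{old}} \,\|\, \pi_\theta \right) \le \delta,
\]

which is easier to tune (a single fixed \(\delta\)) and less conservative than the penalty implied by the bound.
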

  • Proximal policy optimization

    • Use penalty instead of constraint
    • Roughly the same performance as TRPO
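
As I understood it, the penalized form is roughly

\[
\max_\theta \; L_{\theta_{old}}(\theta) - \beta\, \overline{\mathrm{KL}}\!\left( \pi_{\theta_{old}} \,\|\, \pi_\theta \right),
\]

with the coefficient \(\beta\) adjusted between iterations (raised when the observed KL comes out too large, lowered when it is too small).
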
  • Variance reduction using value functions

    • Problem: confounding the effect of multiple actions (mixes effect of \(a_t\), \(a_{t+1}\), ...)
  • Variance reduction using discounts

    • Take a discounted sum of rewards instead of the undiscounted return (formula below).
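
I.e., for each step use the discounted reward-to-go

\[
\hat{R}_t = \sum_{k \ge 0} \gamma^{k}\, r_{t+k}, \qquad \gamma \in (0, 1),
\]

which down-weights distant (high-variance) rewards at the cost of a little bias.
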
  • Advantage-Actor-Critic uses a fixed-horizon advantage estimator (sketched below).
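
The fixed-horizon (k-step) advantage estimator, as I understand it, bootstraps from the learned value function after k steps:

\[
\hat{A}^{(k)}_t = r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^{k} V(s_{t+k}) - V(s_t).
\]
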

  • Can navigate around and collect apples (LSTM policy, conv net input)

  • Temporal difference methods: generalized advantage estimation

    • Takes an exponentially-weighted average of finite-horizon estimates (formula below).
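
My reconstruction of generalized advantage estimation (from the GAE paper, not verbatim from the slides): with TD residuals \(\delta_t\), the estimator is an exponentially-weighted average of the k-step estimators above,

\[
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad
\hat{A}^{\mathrm{GAE}(\gamma, \lambda)}_t = \sum_{l \ge 0} (\gamma \lambda)^l\, \delta_{t+l},
\]

where \(\lambda = 0\) recovers the one-step TD estimate and \(\lambda = 1\) the Monte Carlo estimate.
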