My (rough) notes on the Deep Reinforcement Learning Through Policy Optimization NIPS tutorial on 12/15/2016.

Reinforcement learning:
 The environment is in a state at time t
 The agent chooses an action
 As a consequence, the environment transitions to a new state and a reward is emitted.

Policy Optimization
 Find policy parameters that maximize expected reward
 Policy is stochastic

Often \(\pi\) is simpler than Q or V
 Value function V doesn't prescribe actions

Q: to act you need to solve \(\max_a Q(s, a)\), which is itself a tough optimization problem for large or continuous action spaces

Many policy optimization success stories.

Cross-entropy method
 Views the utility U as a black box
 Ignores all information other than U
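
A minimal sketch of the cross-entropy method in Python, assuming a generic black-box utility f over parameter vectors (the function names and hyperparameters here are illustrative, not from the tutorial); in policy optimization, f(theta) would be the return of a rollout with the policy parameterized by theta:

    import numpy as np

    def cross_entropy_method(f, dim, n_iters=50, pop_size=100, elite_frac=0.2):
        # Maximize a black-box utility f(theta): repeatedly sample candidates,
        # keep the top "elite" fraction, and refit a diagonal Gaussian to them.
        mean, std = np.zeros(dim), np.ones(dim)
        n_elite = int(pop_size * elite_frac)
        for _ in range(n_iters):
            # Sample a population of parameter vectors from the current Gaussian
            thetas = mean + std * np.random.randn(pop_size, dim)
            # Evaluate utility only; all other structure of the problem is ignored
            scores = np.array([f(theta) for theta in thetas])
            # Select the elite set with the highest utility
            elites = thetas[np.argsort(scores)[-n_elite:]]
            # Refit the sampling distribution to the elites
            mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
        return mean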

Related methods:
 Reward-weighted regression
 weight samples by reward?
 Policy improvement with path integrals
 Covariance matrix adaptation evolution strategy (CMA-ES)
 PoWER

PoWER can solve the ball-in-a-cup task in about 100 trials

Likelihood ratio gradient
 Utility \(U(\theta)\) is the expected total reward
 Valid even if the reward is discontinuous or unknown, or the sample space is discrete.
 Intuition: shift probability mass away from bad paths and toward good paths
 Can compute the likelihood ratio gradient even without a dynamics model
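
For reference (the standard identity, not transcribed from the slides), for a trajectory \(\tau\) with return \(R(\tau)\):
\[ \nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\big] = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\textstyle\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big] \]
The dynamics terms cancel inside \(\nabla_\theta \log \pi_\theta(\tau)\), which is why no dynamics model is needed.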

Variance reduction is important in practice; several approaches were discussed.
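
One standard trick (well known, not a quote from the slides) is to subtract a baseline \(b\), which leaves the estimator unbiased while reducing variance:
\[ \nabla_\theta U(\theta) = \mathbb{E}\Big[\textstyle\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(R(\tau) - b\big)\Big] \]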

Desiderata for a policy optimization method:
 Stable, monotonic improvement
 Good sample complexity

Why is step size a big deal in RL?
 Supervised learning
  Step too far -> the next update will fix it
 RL
  Step too far -> bad policy -> the next batch of data is collected under that bad policy, so it is hard to recover

We care about \(\eta(\pi)\), the expected return of \(\pi\)
 Collect data with \(\pi_{old}\); want to optimize some objective to get a new policy \(\pi\)
 Define a local approximation to the expected return of \(\pi\) (the form from the TRPO paper is quoted below)

The 'local approximation to the expected return' comes with a lower bound on the true expected return.
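
For reference, the local approximation and the bound as given in the TRPO paper (Schulman et al., 2015) are
\[ L_{\pi_{old}}(\pi) = \eta(\pi_{old}) + \sum_s \rho_{\pi_{old}}(s) \sum_a \pi(a \mid s)\, A_{\pi_{old}}(s, a) \]
\[ \eta(\pi) \ge L_{\pi_{old}}(\pi) - C\, D_{KL}^{\max}(\pi_{old}, \pi), \qquad C = \frac{4\epsilon\gamma}{(1-\gamma)^2}, \quad \epsilon = \max_{s,a} |A_{\pi_{old}}(s,a)| \]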

TRPO approximates the theoretically-justified algorithm above in a way that is nicer in practice.
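
Concretely (standard TRPO formulation, not transcribed from the slides): instead of penalizing with the theoretical coefficient \(C\), TRPO maximizes the surrogate subject to a trust-region constraint on the average KL divergence,
\[ \max_\theta\ L_{\theta_{old}}(\theta) \quad \text{s.t.} \quad \bar{D}_{KL}(\theta_{old}, \theta) \le \delta \]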

Proximal policy optimization
 Use penalty instead of constraint
 Roughly the same performance as TRPO
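
The penalized objective meant here (in the 2016-era, KL-penalty form of proximal policy optimization, which predates the later clipped-objective PPO) is roughly
\[ \max_\theta\ L_{\theta_{old}}(\theta) - \beta\, \bar{D}_{KL}(\theta_{old}, \theta) \]
with the penalty coefficient \(\beta\) typically adjusted adaptively.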

Variance reduction uses value functions
 Problem: confounding the effect of multiple actions (the plain estimator mixes the effects of \(a_t\), \(a_{t+1}\), ...)
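
The standard fix (common knowledge, not transcribed): credit each action only with the rewards that come after it, and use a learned value function as a baseline:
\[ \hat{g} = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\Big(\sum_{t' \ge t} r_{t'} - V(s_t)\Big) \]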

Variance reduction using discounts
 Use a discounted sum of rewards instead of the undiscounted return.
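
That is, replace the reward-to-go with a discounted version, accepting a small bias in exchange for lower variance:
\[ \sum_{t' \ge t} \gamma^{\,t'-t}\, r_{t'}, \qquad 0 < \gamma < 1 \]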

Advantage Actor-Critic uses a fixed-horizon advantage estimator.
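
A k-step (fixed-horizon) advantage estimator of this kind looks like (standard form, not quoted from the slides)
\[ \hat{A}^{(k)}_t = r_t + \gamma r_{t+1} + \dots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k}) - V(s_t) \]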

Example: an agent can navigate around and collect apples (LSTM policy, conv net input)

Temporal difference methods: generalized advantage estimation (GAE)
 Takes an exponentially-weighted average of finite-horizon estimates.
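
The GAE estimator (Schulman et al., 2016) is an exponentially-weighted average of the k-step estimates above:
\[ \hat{A}^{GAE(\gamma,\lambda)}_t = \sum_{l \ge 0} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \]
Setting \(\lambda = 0\) gives the one-step TD estimate; \(\lambda = 1\) recovers the Monte Carlo return minus a baseline.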