Notes on today's opening tutorial on deep learning with Geoff Hinton, Yoshua Bengio, and Yann LeCun.

- Geoff was unable to join today.
- Breakthrough: deep learning is about learning multiple layers of representation / abstraction.
- Four key ingredients for ML towards AI:
  - lots of data
  - flexible models
  - enough computing power
  - powerful priors

- Connectionism: concepts are represented by patterns of activation rather than symbols in the brain.
- (Stacked?) neural nets are exponentially more statistically powerful than clustering / nearest-neighbor models.
- With neural nets, features can be discovered from a linear number of examples (rather than an exponential number).
- Difference between deep learning and prior work is the idea of composing features upon features.
- 'Deep learning is not a completely general solution to everything.'
- 'There is no magic.'
- Backprop: training method for practical deep learning.
- ReLUs 'not exactly differentiable but close enough to work.'
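A NumPy sketch of the ReLU and its subgradient, behind that quote; assigning gradient 0 at exactly x == 0 is the usual convention, assumed here:

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x) elementwise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Subgradient of ReLU: 1 where x > 0, else 0.
    ReLU is not differentiable at x == 0; using 0 there
    is 'close enough to work' in practice."""
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
y = relu(x)        # array([0., 0., 3.])
g = relu_grad(x)   # array([0., 0., 1.])
```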
- Convolutional Feature Map: scan an array of coefficients over the image; output given by dot product with region.
- Pooling: aggregates the outputs of neighboring units in the previous layer.
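The two operations above, sketched naively in NumPy ('valid' convolution and non-overlapping max pooling; loop-based for clarity, not speed):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Scan the kernel over the image; each output is the dot
    product of the kernel with the underlying region
    ('valid' mode: no padding, stride 1)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Aggregate non-overlapping size x size neighborhoods by max."""
    H, W = fmap.shape
    out = np.empty((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

img = np.arange(25, dtype=float).reshape(5, 5)
feat = conv2d_valid(img, np.ones((2, 2)) / 4)  # 4x4 feature map (2x2 averaging kernel)
pooled = max_pool(feat)                        # 2x2 after pooling
```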
- Recurrent neural nets: selectively summarize an input sequence in a fixed-size state vector via a recursive update.
- RNNs are rolled out / unfolded so that backprop can be applied through time.
- An RNN can represent a fully-connected directed generative model: every variable predicted from all previous ones.
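The recursive update can be sketched as follows; the tanh recurrence and the weight shapes are illustrative assumptions, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_state = 3, 4
W_x = rng.normal(size=(d_state, d_in)) * 0.1     # input weights (illustrative)
W_h = rng.normal(size=(d_state, d_state)) * 0.1  # recurrent weights (illustrative)

def rnn_summarize(xs):
    """Fold a variable-length sequence into a fixed-size state
    vector via the recursive update h_t = tanh(W_x x_t + W_h h_{t-1})."""
    h = np.zeros(d_state)
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
    return h

seq = rng.normal(size=(7, d_in))  # 7 time steps
h_final = rnn_summarize(seq)      # fixed-size summary regardless of sequence length
```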
- *RNNs can be thought of as directed probabilistic graphical models.*
- RNNs struggle with long-term dependencies. Tricks to help:
  - Gradient Clipping
  - Leaky Integration
  - Momentum
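Gradient clipping, the first trick above, sketched as rescaling by the global L2 norm (the `max_norm` value is an arbitrary illustration):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale the gradient if its L2 norm exceeds max_norm.
    Limits exploding gradients when backpropagating through time."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])  # norm 50
clipped = clip_gradient(g)  # rescaled to norm 5, direction unchanged
```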

- LSTMs are a popular architecture for dealing with long-term dependencies.
  - They create paths in the forward / backward pass that allow information to be copied.
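One LSTM step can be sketched as below, assuming the standard gate formulation (the bundled weight matrix `W` and the dimensions are illustrative); the additive cell-state update is the path that lets information be copied forward:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W):
    """One LSTM step; W bundles all four gate weight blocks.
    The cell state updates additively: c_t = f * c_{t-1} + i * g,
    so information is copied forward largely unchanged when the
    forget gate f is near 1."""
    z = W @ np.concatenate([x, h])
    d = h.size
    f = sigmoid(z[0*d:1*d])  # forget gate
    i = sigmoid(z[1*d:2*d])  # input gate
    o = sigmoid(z[2*d:3*d])  # output gate
    g = np.tanh(z[3*d:4*d])  # candidate values
    c_new = f * c + i * g    # additive 'copy' path
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(1)
d_in, d_h = 3, 4
W = rng.normal(size=(4 * d_h, d_in + d_h)) * 0.1
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c, W)
```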

- Normalize inputs 'to avoid ill conditioning' when using backprop.
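The normalization advice, sketched as per-feature standardization (the `eps` guard against zero variance is an added assumption):

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Zero-mean, unit-variance normalization per feature (column).
    Keeps the input scales comparable, avoiding ill conditioning
    during gradient descent."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / (sigma + eps)

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
Xn = standardize(X)  # each column now has mean 0 and std 1
```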
- Multilayer neural net objective is nonconvex.
  - Local minima aren't a problem, though(?)
  - 'Almost all of the local minima are equivalent.'
  - Most local minima are close to global minimum error.
  - Error of trained nets tends to be sharply concentrated.
  - From results from Yoshua's and LeCun's groups in the last few years.

- ReLU was one of the key tricks in the ImageNet result.
- Dropout can be thought of as regularization.
  - 'Brutal, murderous, genocidal regularization.'
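Dropout, sketched in the common 'inverted' form (the keep-probability rescaling is the usual convention, assumed here):

```python
import numpy as np

def dropout(activations, p_drop=0.5, rng=None):
    """Inverted dropout: zero each unit with probability p_drop and
    rescale the survivors so the expected activation is unchanged.
    Regularizes by preventing units from co-adapting."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

a_train = dropout(np.ones(10))  # each unit is either 0.0 or 2.0
```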

- Early stopping is a 'beautiful free lunch.'
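A sketch of the early-stopping loop; `train_step` and `val_error` are hypothetical callables standing in for a real training setup:

```python
def train_with_early_stopping(train_step, val_error, patience=5, max_epochs=100):
    """Stop training once validation error fails to improve for
    `patience` consecutive epochs; return the best epoch and error."""
    best_err, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_step()
        err = val_error()
        if err < best_err:
            best_err, best_epoch, bad_epochs = err, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch, best_err

# Toy validation curve: improves, then 'overfits'.
errors = iter([0.9, 0.5, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6])
epoch, err = train_with_early_stopping(lambda: None, lambda: next(errors), patience=3)
# Stops after 3 non-improving epochs; best was epoch 2 with error 0.3.
```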
- Should use random search when tuning hyperparameters.
- There is some neat stuff using Gaussian processes (GPs) to approximate the performance surface.
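Random hyperparameter search can be sketched as uniform sampling over a box; the search space and the toy objective below are made up for illustration:

```python
import random

def random_search(objective, space, n_trials=20, seed=0):
    """Sample hyperparameters uniformly at random from `space`
    (a dict of name -> (low, high)) and keep the lowest-scoring trial."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
        score = objective(params)
        if best is None or score < best[0]:
            best = (score, params)
    return best

# Toy objective: quadratic bowl with optimum at lr=0.1, momentum=0.9.
obj = lambda p: (p["lr"] - 0.1) ** 2 + (p["momentum"] - 0.9) ** 2
best_score, best_params = random_search(obj, {"lr": (0.0, 1.0), "momentum": (0.0, 1.0)})
```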

- Training can be distributed using asynchronous SGD, but bottlenecks on sharing weights / updates.
- New parallel method: EASGD.
- 700 million photos uploaded to Facebook every day. Each goes through two conv nets: one for object recognition and one for face recognition.
- Tesla autopilot uses a convolutional net.
- Graphical models and conv nets can be jointly trained.
- Attention mechanism - add a layer that can 'learn where to look' in a sequence.
- Recent models add 'memory' to recurrent neural nets:
  - LSTM
  - Memory Networks
  - Neural Turing Machine
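The attention mechanism noted above can be sketched as dot-product attention, a softmax-weighted sum over sequence positions (a simplified, assumed formulation, not the exact one from the talk):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def attend(query, keys, values):
    """Dot-product attention: score each position against the query,
    softmax the scores into weights, and return the weighted sum of
    values -- the layer 'learns where to look' via these weights."""
    scores = keys @ query  # one score per sequence position
    weights = softmax(scores)
    return weights @ values, weights

rng = np.random.default_rng(2)
T, d = 5, 4
keys = rng.normal(size=(T, d))
values = rng.normal(size=(T, d))
query = rng.normal(size=d)
context, weights = attend(query, keys, values)  # fixed-size context vector
```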

- Relational learning was plugged!