Notes on today's opening tutorial on deep learning with Geoff Hinton, Yoshua Bengio, and Yann LeCun.

  • Geoff was unable to join today.
  • Breakthrough: deep learning is about learning multiple layers of representation / abstraction.
  • Four key ingredients for ML towards AI
    • lots of data
    • flexible models
    • enough computing power
    • powerful priors
  • Connectionism: concepts are represented in the brain by patterns of activation rather than by symbols.
  • (Stacked?) neural nets are exponentially more statistically powerful than clustering / nearest-neighbor models.
  • With neural nets, features can be discovered from a linear number of examples (rather than an exponential number).
  • Difference between deep learning and prior work is the idea of composing features upon features.
  • 'Deep learning is not a completely general solution to everything.'
  • 'There is no magic.'
  • Backprop: training method for practical deep learning.
  • ReLUs are 'not exactly differentiable but close enough to work.'
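
Not from the tutorial, but a minimal numpy sketch of ReLU and the 'close enough' gradient backprop uses for it (function names are mine):

```python
import numpy as np

def relu(x):
    # Forward pass: max(0, x), applied elementwise.
    return np.maximum(0.0, x)

def relu_backward(x, upstream_grad):
    # Backward pass: let the gradient through where x > 0, zero it elsewhere.
    # ReLU is not differentiable at x == 0; using 0 there is the usual
    # convention ("close enough to work").
    return upstream_grad * (x > 0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))                            # [0., 0., 0., 0.5, 2.]
print(relu_backward(x, np.ones_like(x)))  # [0., 0., 0., 1., 1.]
```
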
  • Convolutional feature map: scan an array of coefficients (a small filter) over the image; each output is the dot product of the coefficients with the image region they cover.
  • Pooling: computes an aggregate (e.g. the max) of neighboring outputs of the previous layer.
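
Not from the tutorial, but a minimal numpy sketch of the two ideas above: one convolutional feature map (no padding, stride 1) and 2x2 max pooling; names and sizes are mine, loops kept for clarity:

```python
import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image; each output is the dot product of the
    # kernel with the image patch it currently covers (no padding, stride 1).
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

def max_pool(feature_map, size=2):
    # Aggregate each non-overlapping size x size neighborhood by its maximum.
    H, W = feature_map.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = feature_map[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

image = np.random.randn(8, 8)
kernel = np.random.randn(3, 3)
fmap = conv2d(image, kernel)   # 6x6 feature map
pooled = max_pool(fmap)        # 3x3 after 2x2 max pooling
print(fmap.shape, pooled.shape)
```
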
  • Recurrent neural nets: selectively summarize an input sequence in a fixed-size state vector via a recursive update (sketch below).
  • RNNs are rolled out / unfolded so that backprop can be applied through time.
  • An RNN can represent a fully-connected directed generative model: every variable predicted from all previous ones.
  • RNNs can be thought of as directed probabilistic graphical models.
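
A minimal numpy sketch of the recursive state update and its unrolling over time (names and sizes are mine, not from the tutorial):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    # The recursive update: the new fixed-size state summarizes the inputs so far.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b)

def rnn_unroll(xs, h0, W_xh, W_hh, b):
    # "Unrolling": the same step (same weights) is applied at every time step;
    # backprop through time runs backward over this unrolled chain of states.
    h, states = h0, []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b)
        states.append(h)
    return states

input_dim, hidden_dim, T = 4, 8, 5
rng = np.random.default_rng(0)
xs = [rng.standard_normal(input_dim) for _ in range(T)]
W_xh = 0.1 * rng.standard_normal((input_dim, hidden_dim))
W_hh = 0.1 * rng.standard_normal((hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)
states = rnn_unroll(xs, np.zeros(hidden_dim), W_xh, W_hh, b)
print(len(states), states[-1].shape)  # 5 (8,)
```
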
  • RNNs struggle with long-term dependencies. Tricks to help:
    • Gradient Clipping (sketch after this list)
    • Leaky Integration
    • Momentum
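
A minimal numpy sketch of gradient clipping by global norm (names are mine; the threshold is arbitrary):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # If the overall gradient norm exceeds max_norm, rescale every gradient by
    # the same factor; this tames exploding gradients in backprop through time.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
print(clip_by_global_norm(grads, max_norm=5.0))    # rescaled to global norm 5
```
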
  • LSTMs are a popular architecture for dealing with long-term dependencies.
    • They create paths in the forward / backward pass along which information can be copied unchanged.
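
A minimal numpy sketch of one LSTM step; the additive cell-state update is the path along which information can be copied (names and sizes are mine, not from the tutorial):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # Compute input (i), forget (f), output (o) gates and candidate (g)
    # from the current input and the previous hidden state.
    z = np.concatenate([x_t, h_prev]) @ W + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    # Additive cell-state update: with f ~ 1 and i ~ 0 the cell is copied
    # through unchanged, which is how information survives many time steps.
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

input_dim, hidden_dim = 3, 5
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((input_dim + hidden_dim, 4 * hidden_dim))
b = np.zeros(4 * hidden_dim)
h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for t in range(10):  # run a short sequence through the cell
    h, c = lstm_step(rng.standard_normal(input_dim), h, c, W, b)
print(h.shape, c.shape)  # (5,) (5,)
```
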
  • Normalize inputs 'to avoid ill conditioning' when using backprop.
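
A minimal numpy sketch of that normalization, standardizing each feature with training-set statistics (names are mine):

```python
import numpy as np

def standardize(X_train, X_test, eps=1e-8):
    # Zero mean, unit variance per feature, using training-set statistics only;
    # this is the normalization that avoids ill conditioning for gradient descent.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    return (X_train - mean) / (std + eps), (X_test - mean) / (std + eps)

rng = np.random.default_rng(0)
scale, shift = np.array([1.0, 50.0, 0.01]), np.array([0.0, 200.0, -5.0])
X_train = rng.standard_normal((100, 3)) * scale + shift   # wildly different scales
X_test = rng.standard_normal((10, 3)) * scale + shift
X_train_n, X_test_n = standardize(X_train, X_test)
print(X_train_n.mean(axis=0).round(3), X_train_n.std(axis=0).round(3))  # ~0 and ~1
```
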
  • Multilayer neural net objective is nonconvex.
  • Local minima aren't a problem, though(?)
  • 'Almost all of the local minima are equivalent.'
    • Most local minima are close to global minimum error.
    • Error of trained nets tends to be sharply concentrated.
    • Based on results from Yoshua's and LeCun's groups in the last few years.
  • ReLU was one of the key tricks in the ImageNet result.
  • Dropout can be thought of as regularization.
    • 'Brutal, murderous, genocidal regularization.'
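
A minimal numpy sketch of inverted dropout at training time (names and details of the keep/drop convention are mine):

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    # Training: zero each activation with probability p and scale survivors by
    # 1 / (1 - p) ("inverted" dropout), so the expected value is unchanged.
    # Test time: identity, so no rescaling is needed.
    if not training or p == 0.0:
        return activations
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

h = np.ones(10)
print(dropout(h, p=0.5, rng=np.random.default_rng(0)))  # ~half zeros, rest 2.0
```
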
  • Early stopping is a 'beautiful free lunch.'
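
A minimal sketch of early stopping with a patience counter; train_epoch and validate are stand-ins for whatever training code is used (all names are mine):

```python
import copy

def train_with_early_stopping(model, train_epoch, validate, max_epochs=100, patience=10):
    # train_epoch(model) runs one epoch in place; validate(model) returns a
    # validation loss. Both are placeholders for real training code.
    best_loss, best_model, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_epoch(model)
        val_loss = validate(model)
        if val_loss < best_loss:
            best_loss, best_model, bad_epochs = val_loss, copy.deepcopy(model), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # no improvement on the validation set for `patience` epochs
    return best_model, best_loss

# Toy demo: validation loss improves, then plateaus; training stops early.
losses = iter([5.0, 3.0, 2.5, 2.6, 2.7, 2.7, 2.7, 2.7, 2.7, 2.7])
best, best_loss = train_with_early_stopping(
    model=[0.0], train_epoch=lambda m: None, validate=lambda m: next(losses),
    max_epochs=20, patience=3)
print(best_loss)  # 2.5
```
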
  • Use random search when tuning hyperparameters (sketch below).
    • There is some neat work using Gaussian processes (GPs) to approximate the performance surface.
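
A minimal sketch of random hyperparameter search; the search space and the stand-in objective are made up. A GP-based approach would instead fit a model to the (hyperparameters, score) pairs collected so far and use it to pick the next trial:

```python
import math
import random

def sample_hyperparams(rng):
    # Sample each hyperparameter independently; log-uniform for scale parameters.
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),
        "hidden_units": rng.choice([64, 128, 256, 512]),
        "dropout": rng.uniform(0.0, 0.7),
    }

def random_search(evaluate, n_trials=20, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        hp = sample_hyperparams(rng)
        score = evaluate(hp)  # e.g. validation accuracy of a net trained with hp
        if best is None or score > best[0]:
            best = (score, hp)
    return best

# Stand-in objective just to make the sketch runnable.
def fake_evaluate(hp):
    return -abs(math.log10(hp["learning_rate"]) + 3) - hp["dropout"]

print(random_search(fake_evaluate))
```
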
  • Training can be distributed using asynchronous SGD, but it bottlenecks on sharing weights / updates.
  • New parallel method: EASGD (elastic averaging SGD).
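
As I understand EASGD, each worker takes local SGD steps while being elastically pulled toward a shared center variable, and the center is pulled back toward the workers; a serial numpy sketch of one round (hyperparameter names and values are mine):

```python
import numpy as np

def easgd_round(workers, center, grad_fn, lr=0.1, elasticity=0.05):
    # Each worker: an SGD step plus an elastic pull toward the center.
    # Center: pulled toward the (old) worker positions.
    new_workers = []
    center_update = np.zeros_like(center)
    for x in workers:
        pull = elasticity * (x - center)
        center_update += lr * pull
        new_workers.append(x - lr * (grad_fn(x) + pull))
    return new_workers, center + center_update

# Toy objective f(x) = ||x||^2 / 2, so the gradient is just x.
rng = np.random.default_rng(0)
workers = [rng.standard_normal(3) for _ in range(4)]
center = np.ones(3) * 5.0
for _ in range(500):
    workers, center = easgd_round(workers, center, grad_fn=lambda x: x)
print(center.round(3))  # workers and the center converge toward the optimum at 0
```
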
  • 700 million photos are uploaded to Facebook every day. Each goes through two conv nets: one for object recognition and one for face recognition.
  • Tesla's autopilot uses a convolutional net.
  • Graphical models and conv nets can be jointly trained.
  • Attention mechanism - add a layer that can 'learn where to look' in a sequence.
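
A minimal numpy sketch of that idea: score each position in the sequence against a query, softmax the scores into weights, and return the weighted sum (names are mine; this is the generic content-based flavor, not any specific model from the talk):

```python
import numpy as np

def attention(query, keys, values):
    # Score each position against the query, softmax into weights, and return
    # the weighted sum of the values (the "context" the net looks at).
    scores = keys @ query                  # one score per position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over positions
    return weights @ values, weights

T, d = 6, 4
rng = np.random.default_rng(0)
keys = rng.standard_normal((T, d))
values = rng.standard_normal((T, d))
query = keys[2] * 3.0                      # a query resembling position 2's key
context, weights = attention(query, keys, values)
print(weights.round(2), context.shape)     # attention weights over the 6 positions, (4,)
```
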
  • Recent models add 'memory' to recurrent neural nets:
    • LSTM
    • Memory Networks
    • Neural Turing Machine
  • Relational learning was plugged!