• What is the fundamental task for a hiring committee?
  • Utility: classify candidates accurately (the running 'prof or hobo' example)
  • Privacy: data from prior candidates is used in training
  • Differential privacy and its connection to generalization and statistical validity

Differential Privacy

  • Data privacy has been studied since the 1960s
  • Current approaches: ad hoc anonymization of released datasets
    • Re-identification attacks: individuals can be identified in supposedly anonymized datasets
    • These are serious failures
  • Lack of rigor leads to unforeseen breaks
  • New line of work: principled approach to privacy
    • Definitions, algorithms, lower bounds, relations to other areas
  • Differential privacy: an algorithm's output distribution is nearly the same with or without any particular individual's record
  • Two parameters: \(\epsilon\) and \(\delta\)
  • \(\epsilon\) measures 'leakage' or 'harm'
  • \(\delta\) generally (cryptographically) small
  • Differential privacy gives an automatic opt-out, plausible deniability, and ensures that 'rare bad events remain rare'
  • Basic properties:
    • Post-processing: if A is differentially private, then any data-independent transformation of A's output is just as private
    • Composition: if two differentially private algorithms are run on the same data, the privacy loss of the composition is bounded (at worst, the parameters add)
  • Post-processing is important because processing the outcome of a privacy-preserving analysis should not lead to a privacy breach
  • Composition is important because real workflows run many analyses on the same data, and the overall privacy loss must still be accounted for
  • Of course, an algorithm that ignores its input is trivially private, but has no utility! The challenge is privacy together with utility
  • How is differential privacy achieved?
    • By adding carefully calibrated random noise
    • Laplace mechanism: add Laplace-distributed noise with scale (query sensitivity)/\(\epsilon\)
    • General goal: achieve the same privacy with less noise
  • Through post-processing and composition, many tasks can be tackled
  • Examples: learning, network analysis, ...
  • 'Any finite concept class \(C\) can be learned privately with sample complexity \(\sim \log|C|\)'
  • Google deploys a differentially private mechanism called RAPPOR (based on randomized response) in Chrome
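The informal statement above ('nearly the same with or without any one record') has a standard formal version, written out here for reference (not verbatim from the notes): an algorithm \(A\) is \((\epsilon, \delta)\)-differentially private if for all datasets \(D, D'\) differing in one record and all sets \(S\) of outputs,

```latex
\Pr[A(D) \in S] \;\le\; e^{\epsilon} \, \Pr[A(D') \in S] + \delta
```

The two closure properties then read: if \(A\) is \((\epsilon, \delta)\)-DP, so is \(f \circ A\) for any data-independent \(f\) (post-processing); and running an \((\epsilon_1, \delta_1)\)-DP algorithm and an \((\epsilon_2, \delta_2)\)-DP algorithm on the same data is \((\epsilon_1 + \epsilon_2, \delta_1 + \delta_2)\)-DP (basic composition).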
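As a concrete illustration of the Laplace mechanism, here is a minimal Python sketch (the function name and toy data are my own, not from the notes): noise with scale sensitivity/\(\epsilon\) is added to a counting query, which has sensitivity 1.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value plus Laplace(sensitivity / epsilon) noise.

    For a query whose value changes by at most `sensitivity` when one
    record is added or removed, this release is epsilon-differentially
    private.
    """
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Toy example: privately count records satisfying a predicate.
# A counting query has sensitivity 1: one person changes it by at most 1.
ages = [21, 35, 44, 29, 63, 51]
true_count = sum(1 for a in ages if a >= 30)
private_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
```

Smaller \(\epsilon\) means larger noise scale, i.e., more privacy at the cost of accuracy; this is the noise/utility trade-off the notes refer to.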

Generalization and Statistical Validity

  • Learning over the sample vs over the distribution
  • The dread: what we learn doesn't generalize
    • Not a good predictor for fresh samples
    • in other words: overfitting
  • An overfit model behaves differently on records it was trained on -> overfitting distinguishes who is in the dataset, which is itself a privacy issue
  • Want to be able to adaptively query without overfitting
  • Central result: if all queries to the data are answered with differential privacy, statistical validity is preserved even under adaptive querying
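The central result can be illustrated with a toy sketch (a deliberate simplification with invented names; practical mechanisms for adaptive analysis are more refined): every statistical query, even one chosen adaptively based on earlier answers, is answered only through a differentially private channel.

```python
import numpy as np

def make_private_answerer(sample, epsilon_per_query, rng=None):
    """Answer [0,1]-valued statistical queries on `sample` with Laplace noise.

    The mean of a [0,1]-valued query has sensitivity 1/n, so each answer is
    epsilon_per_query-differentially private; by composition, k answers cost
    k * epsilon_per_query in total. Privacy of the answers is what prevents
    an adaptive analyst from overfitting the sample.
    """
    rng = rng or np.random.default_rng()
    n = len(sample)

    def answer(query):  # query maps one record to a value in [0, 1]
        true_mean = np.mean([query(x) for x in sample])
        return true_mean + rng.laplace(scale=1.0 / (n * epsilon_per_query))

    return answer

# Adaptive use: the second query may depend on the first noisy answer.
sample = list(np.random.default_rng(0).random(1000))
answer = make_private_answerer(sample, epsilon_per_query=0.1)
a1 = answer(lambda x: x)              # noisy mean of the sample
a2 = answer(lambda x: float(x > a1))  # query chosen adaptively from a1
```

Because the analyst only ever sees noisy, private answers, in-sample answers stay close to their distributional values even across many adaptively chosen queries.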