February 25, 2016

Denying the Service of a Differentially Private Database

Attention conservation notice: A half-clever dig at one of the more serious and constructive attempts to do something about an important problem that won't go away on its own. It doesn't even explain the idea it tries to undermine.

Jerzy's "cursory overview of differential privacy" post brings back to mind an idea which I doubt is original, but whose source I can't remember. (It's not Bambauer et al.'s "Fool's Gold: an Illustrated Critique of Differential Privacy" [ssrn/2326746], though they do make a related point about multiple queries.)

The point of differential privacy is to guarantee that adding or removing any one person from the data base can't change the probability of any query answer by more than a factor of $e^{\epsilon}$; that is, the log-likelihood changes by at most $\pm \epsilon$. This is achieved by adding noise with a Laplace (double-exponential) distribution to the output of any query from the data base, with the scale of the noise inversely proportional to the required bound $\epsilon$. (Tighter privacy bounds require more noise.)
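The Laplace mechanism is simple enough to sketch in a few lines. This is a minimal illustration, not anyone's production code; the function names are my own, and I sample the Laplace distribution by inverting its CDF since the standard library has no Laplace sampler:

```python
import math
import random

def laplace_noise(scale, rng=random):
    """One draw from a Laplace(0, scale) distribution, by inverse-CDF
    sampling (the stdlib `random` module has no Laplace sampler)."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=random):
    """A differentially private answer: the truth plus Laplace noise
    with scale sensitivity/epsilon.  A smaller epsilon (a tighter
    privacy guarantee) forces a larger noise scale."""
    return true_answer + laplace_noise(sensitivity / epsilon, rng)

# A counting query has sensitivity 1: adding or removing one person
# changes the true count by at most 1.
random.seed(1)
print(laplace_mechanism(1000, sensitivity=1, epsilon=0.1))
```

With $\epsilon = 0.1$ the noise scale is 10, so the reported count will typically be off by around ten people in either direction.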

The tricky bit is that these $\epsilon$s are additive across queries. If the $i^{\mathrm{th}}$ query can change the log-likelihood by up to $\pm \epsilon_i$, a series of queries can change the log-likelihood by up to $\sum_{i}{\epsilon_i}$. If the data-base owner allows a constant $\epsilon$ per query, we can then break the privacy guarantee by making lots of queries. Conversely, if the total privacy loss is to stay within a reasonable bound, only a small number of constant-$\epsilon$ queries can be allowed. A final option is to gradually ramp down the $\epsilon_i$ so that their sum remains finite, e.g., $\epsilon_i \propto i^{-2}$. This would mean that early queries were subject to little distortion, but later ones were more and more noisy.

One side effect of any of these schemes, which is what I want to bring out, is that they offer a way to make the database unusable, or nearly unusable, for everyone else. I make the queries I want (if any), and then flood the server with random, pointless queries about the number of cars driven by left-handed dentists in Albuquerque (or whatever). Either the server has a fixed $\epsilon$ per query, and so a fixed upper limit on the number of queries, or the per-query $\epsilon$ shrinks, and the noise grows, with each successive query. In the first case, the server has to stop answering others' queries; in the second, eventually they get only noise. Or --- more plausibly --- whoever runs the server has to abandon their differential privacy guarantee.
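The denial-of-service attack is easy to simulate against the fixed-budget variant. The `PrivateDatabase` class below is a hypothetical toy server of my own devising, not any real system's API: it spends a constant $\epsilon$ per answered counting query and returns `None` (NA) once the budget is gone.

```python
import math
import random

class PrivateDatabase:
    """Hypothetical toy server with a fixed total privacy budget.
    Each answered query spends a constant epsilon; once the budget
    runs out, the server must refuse to keep its guarantee."""

    def __init__(self, data, total_budget=1.0, eps_per_query=0.02):
        self.data = data
        self.remaining = total_budget
        self.eps = eps_per_query

    def count(self, predicate):
        if self.remaining < self.eps:
            return None                      # budget exhausted: NA
        self.remaining -= self.eps
        true_count = sum(1 for x in self.data if predicate(x))
        # Laplace noise with scale 1/eps (counting queries have sensitivity 1).
        u = random.random() - 0.5
        noise = -(1 / self.eps) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
        return true_count + noise

random.seed(2)
db = PrivateDatabase(range(1000), total_budget=1.0, eps_per_query=0.02)

# The attacker floods the server with pointless queries
# (the budget only covers 1.0 / 0.02 = 50 of them)...
junk_answers = [db.count(lambda x: x % 7 == 0) for _ in range(60)]

# ...so every later, honest query is denied.
print(db.count(lambda x: x < 500))           # None
```

The same exhaustion works against the re-usable holdout discussed below: the "server" there is the testing set, and each junk submission burns one of the finite number of answers it can ever give.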

This same attack would also work, by the way, against the "re-usable holdout". That paper (not surprisingly, given the authors) is basically about creating a testing set, and then answering predictive models' queries about it while guaranteeing differential privacy. To keep the distortion from blowing up, only a limited number of queries can be asked of the testing-set server. That is, the server is explicitly allowed to return NA, rather than a proper answer, and it will always do so after enough questions. In the situation they imagine, though, of the server being a "leaderboard" in a competition among models, the simple way to win is to put in a model early (even a decent model, for form's sake), and then keep putting trivial variants of it in, as often as possible, as quickly as possible. This is because each time I submit a model, I deprive all my possible opponents of one use of the testing set, and if I'm fast enough I can keep them from ever having their models tested at all.

Enigmas of Chance

Posted at February 25, 2016 11:09 | permanent link

Three-Toed Sloth