## Thursday, November 10, 2016

### Study Group - All Your Queries Are Belong to Us: The Power of File-Injection Attacks on Searchable Encryption

Published at USENIX 2016 and written by Yupeng Zhang, Jonathan Katz, and Charalampos Papamanthou from University of Maryland this paper proves again how difficult is to get proper security in searchable encryption.

Some of you may wonder why I chose this paper. Reason is that I wanted to get a grasp of how research looks like outside of the MPC area at a top security conference. And what other conference to pick than this year's USENIX?

After I finished reading this paper I realised a cool thing: that you need little to none a priori knowledge about Searchable Encryption (SE) [1]. Fortunately I didn't find myself diving into the literature of SE in order to fully understand the ideas presented there - how many crypto papers have you read and still be able to say this once you're done with them?

Let's introduce first the notion of SE. The setup consists in 2 parties, a client and an encrypted mail server. The client wants to obtain all e-mails which have 'alice' and 'work' as tags. He then computes a token in a deterministic manner, then the server replies with the e-mails  corresponding to the query:

To ease the presentation we can assume that instead of e-mails the server will respond with identifiers (urls).

An adversary wins if he breaks the token privacy, namely it recovers keywords from tokens.

A file injection attack is when an attacker has encrypted e-mails of his choice on the server and can see the tokens which are queried by the client. More simple, the server behaves in a honest-but-curious way and stores encrypted e-mails of his choice by spamming the client.

From a first sight this seems harmless. But if you combine it with some leaked e-mails (lots of them these days) the authors managed to have a 70% recovery rate for a keyword from a token with only 20% of leaked files. This was tested using an Enron email dataset of ~30000 emails. [2]

To understand how this works it's important to look at a simpler attack:

Binary search attack.

Suppose an adversary has a keyword universe $K$, with size $|K|$ a power for $2$. He can inject $\log_2{|K|}$ files and then recover keywords from every token with $100\%$ success.

Each file $F_i$, $i \in [0, \log_2{|K|}]$ contains keyword $k_j$ for which $j$ has $i$'th most significative bit equal to $1$. To see why this works let's look at an example inspired from the paper.

In this case we have $|K| = 8$. Black cells are the keywords inserted in file $i$. Since the search result is $F_1$, $F_3$ ($101$) we conclude that the token was derived from the keyword $k_5$.
A server which inserts one email per day can execute an attack in $2$ weeks time for $10,000$ keyword universe.

Countermeasure.

One crucial observation is that we insert $|K|/2$ keywords per file. So one can think that limiting the keywords per file to some threshold $T << |K|/2$ will fix the problem.

The threshold countermeasure turns out to be ineffective. The authors proved this with an attack which works almost like the first one after splitting the keywords in chunks of size $2T$. (denoted as hierarchical search attack).

For $|K| = 5,000$ and $T=200$ an adversary should can inject $131$ files in order to recover keywords for every token.

Attacks with partial leakage.

Sometimes e-mails leak. And when that happens...SE is almost useless against adaptive injection attacks as the authors prove. We say adaptive because it needs a forward insecure SE scheme.

What if we have want to recover keywords from multiple tokens?

The idea is to compute the distribution for each keyword $f^*(k)$ from the leaked files. Then the assumption that $f^*(k)$ is close to the queried tokens distribution $f(t)$ is made - which will turn true latter.
Next, a small candidate set of keywords is chosen (also called ground truth) and files are injected using the binary search attack. The rest of the tokens are recovered by taking joint distributions with previous tokens.

For more attacks and countermeasures which fail I strongly recommend reading the article [3].

At the end of the article one can wonder if building secure SE schemes would really help against these access pattern attacks...The authors suggest that an interesting direction is to look into lower bounds on these attacks but this seems a far from trivial task.

[1] http://crypto.stanford.edu/~dabo/pubs/papers/encsearch.pdf
[2] https://www.cs.cmu.edu/~./enron/
[3] https://eprint.iacr.org/2016/172