As more corporate and private users outsource their data to cloud storage, recent data breach incidents make end-to-end encryption increasingly desirable. Unfortunately, semantically secure encryption renders various cost-effective storage optimization techniques, such as data deduplication, ineffective. On this ground Stanek et al. introduced the concept of "data popularity", arguing that data known/owned by many users do not require as strong protection as unpopular data; based on this, Stanek et al. presented an encryption scheme where the initially semantically secure ciphertext of a file is transparently downgraded to a convergent ciphertext that allows for deduplication as soon as the file becomes popular. In this paper we propose an enhanced version of the original scheme. Focusing on practicality, we modify the original scheme to improve its efficiency and emphasize clear functionality. We analyze the efficiency based on popularity properties of real datasets and provide a detailed performance evaluation, including a comparison to alternative schemes in realistic settings. Importantly, the new scheme moves the handling of sensitive decryption shares and popularity state information out of the cloud storage, allowing for an improved security notion, simpler security proofs and easier adoption. We show that the new scheme is secure under the Symmetric External Diffie-Hellman assumption in the random oracle model.
E is a special-purpose threshold cryptosystem that allows all users to encrypt arbitrary messages m of fixed length associated with some label ℓ in such a way that once enough of the users (more than some threshold) provide their decryption shares (created using the same label ℓ), all the messages associated with ℓ can be decrypted. Differently from classic threshold cryptosystems, the Encrypt interface now includes the added label ℓ and the decryption process is designed to be non-interactive. Non-interactivity requires a modification of the DShare interface: the decryption share is created using the label ℓ instead of the ciphertext and is stored in some repository until required for decryption.
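To make the label-based, non-interactive flavor of threshold decryption concrete, the following is a toy ElGamal-style sketch over a deliberately tiny group: the master secret is Shamir-shared, a decryption share depends only on the label (via a hash-to-group base), and Lagrange interpolation in the exponent recombines the shares. All names (`shamir_shares`, `dshare`, `combine_and_decrypt`), the deterministic per-label encryption, and the parameters are illustrative simplifications, not the paper's actual construction:

```python
import hashlib

# Toy parameters: safe prime p = 2q + 1, generator g of the order-q subgroup.
# (Illustration only -- a real deployment needs cryptographically large groups.)
p, q, g = 23, 11, 4

def hash_to_group(label: str) -> int:
    """Map a label to a non-identity element of the subgroup (per-label base h)."""
    e = int.from_bytes(hashlib.sha256(label.encode()).digest(), "big") % (q - 1) + 1
    return pow(g, e, p)

def shamir_shares(coeffs, n):
    """Shares s_i = f(i) over Z_q; the master secret is f(0) = coeffs[0]."""
    return {i: sum(c * pow(i, k, q) for k, c in enumerate(coeffs)) % q
            for i in range(1, n + 1)}

def dshare(label, s_i):
    """Non-interactive decryption share: depends only on the label, not a ciphertext."""
    return pow(hash_to_group(label), s_i, p)

def encrypt(m, label, s):
    """Toy deterministic per-label encryption: c = m * h^s mod p."""
    return m * pow(hash_to_group(label), s, p) % p

def combine_and_decrypt(c, shares):
    """Lagrange interpolation in the exponent recovers h^s, then m = c / h^s."""
    ids = list(shares)
    hs = 1
    for i in ids:
        lam = 1
        for j in ids:
            if j != i:
                lam = lam * j * pow(j - i, -1, q) % q
        hs = hs * pow(shares[i], lam, p) % p
    return c * pow(hs, -1, p) % p
```

Any subset of shares of at least threshold size yields the same h^s, which is what makes the stored shares reusable for every ciphertext under the same label.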
Apart from the typical deduplication scheme participants (the user wanting to store his data remotely, and the storage provider offering his service and wanting to benefit from deduplication), our scheme requires the presence of two additional entities: the identity provider IdP and the index repository service IRS. IdP serves as the identity authority as well as the trusted dealer of the secret shares of the E master secret. Upon scheme deployment, IdP is responsible for the execution of E.Setup. Each user interacts with IdP only once, when joining the scheme, and the interaction only includes IdP ensuring that the user is new (i.e. has not participated in the scheme yet) and providing the identity credentials and secret share. Security-wise, IdP is used to hinder exploitation of the threshold cryptosystem E by means of Sybil attacks and master secret leakage.
Data deduplication is a process by which a storage provider only stores a single copy of a file (or of its part) owned by multiple users. There are four different deduplication strategies, depending on whether deduplication happens at the client side (i.e. before the upload) or at the server side, and whether it happens at a block level or at a file level. Client-side data deduplication is more beneficial than server-side, since it ensures that multiple uploads of the same content only consume the network bandwidth and storage space of a single upload.
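Client-side, file-level deduplication can be sketched in a few lines: the client sends only a content hash first, and transfers the payload only if the server has never seen it. The `DedupStore` class and the protocol below are a hypothetical minimal model for exposition, not any provider's actual API:

```python
import hashlib

class DedupStore:
    """Minimal server-side model: keeps one blob per content hash."""
    def __init__(self):
        self.blobs = {}    # content hash -> stored bytes
        self.owners = {}   # content hash -> set of owning user ids

    def has(self, digest):
        return digest in self.blobs

    def put(self, digest, data, user):
        self.blobs[digest] = data
        self.owners.setdefault(digest, set()).add(user)

    def claim(self, digest, user):
        """Register an additional owner without a new upload."""
        self.owners[digest].add(user)

def client_upload(store, user, data):
    """Client-side dedup: hash first, upload the payload only if unknown."""
    digest = hashlib.sha256(data).hexdigest()
    if store.has(digest):
        store.claim(digest, user)   # no bandwidth spent on the payload
        return "deduplicated"
    store.put(digest, data, user)
    return "uploaded"
```

Server-side deduplication would instead always transfer the payload and discard duplicates after receipt, which is why the client-side variant also saves bandwidth.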
While security was not part of the early deduplication designs, its need soon became apparent as users required protection for their data. Convergent encryption, in which the encryption key is derived from the plaintext, seemed like a simple and secure solution allowing deduplication. Unfortunately, it was proven insecure. Moreover, a general impossibility result holds, stating that classic semantic security is not achievable for schemes implementing plain convergent encryption.
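The core mechanism of convergent encryption is simply that the key is a hash of the content, so identical plaintexts produce identical ciphertexts and remain deduplicable. The sketch below uses a toy SHA-256-based keystream for the symmetric step purely for self-containment (real convergent encryption would use a standard block cipher); it also illustrates the weakness: anyone who can guess the plaintext can derive the key.

```python
import hashlib

def convergent_key(plaintext: bytes) -> bytes:
    """Key derived from the content itself: identical files -> identical key."""
    return hashlib.sha256(plaintext).digest()

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy counter-mode keystream cipher (illustration only, not a vetted cipher).
    XOR is its own inverse, so the same function encrypts and decrypts."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))

def convergent_encrypt(plaintext: bytes):
    """Deterministic encryption under a content-derived key."""
    key = convergent_key(plaintext)
    return key, keystream_xor(key, plaintext)
```

Because encryption is fully deterministic in the plaintext, an attacker holding a candidate file can confirm whether a victim stored it, which is exactly why plain convergent encryption cannot be semantically secure.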
We present a scheme that permits a more fine-grained security-to-efficiency trade-off. The intuition is that outsourced data may require different levels of protection, depending on how popular the datum is: content shared by many users, such as a popular song or video, arguably requires less protection than a personal document, the copy of a payslip or the draft of an unsubmitted scientific paper.
This differentiation based on popularity also partly alleviates the user's need to manually sort data as common (deduplicable) and potentially sensitive (non-deduplicable), as every file is first treated as potentially sensitive and thus not deduplicated. Deduplication occurs only when the file becomes popular (i.e. is shared by many users). Note that strictly confidential files that must never be deduplicated should be explicitly marked as non-deduplicable (e.g. by the user or policy) and handled separately.
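The popularity transition itself reduces to counting distinct uploaders per file and flipping the file's state once a threshold is crossed. The `PopularityTracker` below is a hypothetical sketch of that bookkeeping (in the actual scheme this state is kept outside the storage provider); counting distinct users rather than raw uploads matters, since a single user re-uploading a file should not make it "popular":

```python
class PopularityTracker:
    """Counts distinct uploaders per file id; a file becomes 'popular'
    (and thus deduplicable) once the threshold of distinct users is reached."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.uploaders = {}   # file id -> set of distinct user ids

    def record_upload(self, file_id, user):
        # A set ensures repeated uploads by the same user count only once,
        # which is also why Sybil identities must be prevented by the IdP.
        self.uploaders.setdefault(file_id, set()).add(user)
        return self.state(file_id)

    def state(self, file_id):
        n = len(self.uploaders.get(file_id, set()))
        return "popular" if n >= self.threshold else "unpopular"
```

Files explicitly marked as strictly confidential would simply bypass this tracker and never transition.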
This work deals with the inherent tension between storage optimization methods and end-to-end encryption. In our previous work we introduced a novel approach that allows varying the security level of a file based on how popular that file is among the users of the system, and presented an encryption scheme that implements this approach. In this work we modify and significantly enhance the originally proposed scheme, focusing on its practicality.
Specifically, the relocated storage of decryption shares and the removal of file popularity information from the storage provider allowed us to strengthen the security of unpopular files and simplify the security proofs, while the modification of the scheme's internal processes led to decreased complexity and a clear functional isolation of the individual algorithms. The resulting scheme prevents deduplication of unpopular data and allows their automatic transition to a popular state (once they are uploaded by enough users), where deduplication can be applied.