Secure Data Deduplication with Dynamic Ownership Management in Cloud Storage

In cloud storage services, deduplication technology is commonly used to reduce the space and bandwidth requirements of services by eliminating redundant data and storing only a single copy of them. Deduplication is most effective when multiple users outsource the same data to the cloud storage, but it raises issues relating to security and ownership. Proof-of-ownership schemes allow any owner of the same data to prove to the cloud storage server that he owns the data in a robust way. However, many users are likely to encrypt their data before outsourcing them to the cloud storage to preserve privacy, but this hampers deduplication because of the randomization property of encryption. Recently, several deduplication schemes have been proposed to solve this problem by allowing each owner to share the same encryption key for the same data. However, most of the schemes suffer from security flaws, since they do not consider the dynamic changes in the ownership of outsourced data that occur frequently in a practical cloud storage service. In this paper, we propose a novel server-side deduplication scheme for encrypted data. It allows the cloud server to control access to outsourced data even when the ownership changes dynamically, by exploiting randomized convergent encryption and secure ownership group key distribution. This prevents data leakage not only to revoked users, even though they previously owned the data, but also to an honest-but-curious cloud storage server. In addition, the proposed scheme guarantees data integrity against any tag inconsistency attack. Thus, security is enhanced in the proposed scheme. The efficiency analysis results demonstrate that the proposed scheme is almost as efficient as the previous schemes, while the additional computational overhead is negligible.


I. INTRODUCTION
Cloud computing provides scalable, low-cost, and location-independent online services ranging from simple backup services to cloud storage infrastructures. The fast growth of data volumes stored in the cloud storage has led to an increased demand for techniques for saving disk space and network bandwidth. To reduce resource consumption, many cloud storage services, such as Dropbox, Wuala, Mozy, and Google Drive, employ a deduplication technique, where the cloud server stores only a single copy of redundant data and provides links to the copy instead of storing other actual copies of that data, regardless of how many clients ask to store the data. The savings are significant, and reportedly, business applications can achieve disk and bandwidth savings of more than 90%. However, from a security perspective, the shared usage of users' data raises a new challenge. As customers are concerned about their private data, they may encrypt their data before outsourcing in order to protect data privacy from unauthorized outside adversaries, as well as from the cloud service provider. This is justified by current security trends and numerous industry regulations such as PCI DSS. However, conventional encryption makes deduplication impossible for the following reason. Deduplication techniques take advantage of data similarity to identify identical data and reduce the storage space. In contrast, encryption algorithms randomize the encrypted files in order to make the ciphertext indistinguishable from theoretically random data. Encryptions of the same data by different users with different encryption keys result in different ciphertexts, which makes it difficult for the cloud server to determine whether the plain data are the same and deduplicate them. Say a user Alice encrypts a file M under her secret key skA and stores its corresponding ciphertext CA. Bob would store CB, which is the encryption of M under his secret key skB.
Then, two issues arise: how can the cloud server detect that the underlying file M is the same, and even if it can detect this, how can it allow both parties to recover the stored data, based on their separate secret keys?
Straightforward client-side encryption that is secure against a chosen-plaintext attack with randomly chosen encryption keys prevents deduplication. One naive solution is to allow each client to encrypt the data with the public key of the cloud storage server. Then, the server is able to deduplicate the identified data by decrypting it with its private key. However, this solution allows the cloud storage server to obtain the outsourced plain data, which may violate the privacy of the data if the cloud server cannot be fully trusted. Convergent encryption resolves this problem effectively. A convergent encryption algorithm encrypts an input file with the hash value of the input file as the encryption key. The ciphertext is given to the server and the user retains the encryption key. Since convergent encryption is deterministic, identical files are always encrypted into identical ciphertexts, regardless of who encrypts them. Thus, the cloud storage server can perform deduplication over the ciphertext, and all owners of the file can download the ciphertext (optionally after the proof-of-ownership (PoW) process) and decrypt it later, since they have the same encryption key for the file. Convergent encryption has long been studied in commercial systems and has different encryption variants for secure deduplication, which was later formalized as message-locked encryption. However, convergent encryption suffers from security flaws with regard to tag consistency and ownership revocation.
As an example of the tag consistency attack, suppose Alice and Bob have the same data M. Alice generates ciphertext CA from M, and then maliciously generates another ciphertext CA′ from M′ (≠ M). Next, she uploads CA′ with an honestly generated tag T(CA) = H(M) for a cryptographic hash function H, which plays the role of the data index. When Bob generates ciphertext CB from M and tries to upload CB, the cloud server checks that T(CA) = T(CB). Then, it deletes CB and keeps only CA′. Afterwards, when Bob downloads and decrypts it, the data would be M′, not M, which means the integrity of his data has been compromised. Recently, message-locked encryption (MLE) and leakage-resilient deduplication schemes have been proposed to solve this problem by introducing an additional integrity check phase for decrypted data.
In the case of ownership revocation, suppose multiple users have ownership of a ciphertext outsourced in cloud storage. As time elapses, some of these users may request the cloud server to delete or modify their data, and then, the server deletes the ownership information of those users from the ownership list for the corresponding data. The revoked users should then be prevented from accessing the data stored in the cloud storage after the deletion or modification request (forward secrecy). On the other hand, when a user uploads data that already exist in the cloud storage, the user should be deterred from accessing the data that were stored before he obtained the ownership by uploading it (backward secrecy). These dynamic ownership changes may occur very frequently in a practical cloud system, and thus, they should be properly managed in order to avoid the security degradation of the cloud service. However, the previous deduplication schemes could not achieve secure access control under a dynamically changing ownership environment, in spite of its importance to secure deduplication, because the encryption key is derived deterministically and rarely updated after the initial key derivation. Therefore, for as long as revoked users keep the encryption key, they can access the corresponding data in the cloud storage at any time, regardless of the validity of their ownership. This is the problem we attempt to solve in this study.

1.1 Contribution
We propose a deduplication scheme over encrypted data. The proposed scheme ensures that only authorized access to the shared data is possible, which is considered to be the most important challenge for efficient and secure cloud storage services in an environment where ownership changes dynamically. This is achieved by exploiting a group key management mechanism in each ownership group. As compared to the previous deduplication schemes over encrypted data, the proposed scheme has the following advantages in terms of security and efficiency.
First, dynamic ownership management guarantees the backward and forward secrecy of deduplicated data upon any ownership change. As opposed to the previous schemes, the data encryption key is updated and selectively distributed to valid owners upon any ownership change of the data through a stateless group key distribution mechanism using a binary tree. The ownership and key management for each user can be conducted by the semi-trusted cloud server deployed in the system. Thus, the proposed scheme delegates the most laborious tasks of ownership management to the cloud server, rather than to the users, without leaking any confidential information to it. Second, the proposed scheme ensures security in the setting of PoW by introducing a re-encryption mechanism that uses an additional group key for the dynamic ownership group. Thus, although the encryption key (that is, the hash value of the file) is revealed in the setting of PoW, the privacy of the outsourced data is still preserved against outside adversaries, while deduplication over encrypted data is still enabled and data integrity against poison attacks is guaranteed.

1.2 Organization
The rest of the paper is organized as follows. In Section 2, related work is reviewed. In Section 3, the system architecture and security requirements are described. In Section 4, the cryptographic background is provided and the general framework of deduplication over encrypted data is defined. In Section 5, we propose our scheme's construction. We analyze the efficiency and security of the proposed scheme in Sections 6 and 7, respectively. In Section 8, we conclude the paper.

II. RELATED WORK
Deduplication techniques can be categorized into two different approaches: deduplication over unencrypted data and deduplication over encrypted data. In the former approach, most of the existing schemes have been proposed in order to perform a PoW process in an efficient and robust manner, since the hash of the file, which is treated as a "proof" for the entire file, is vulnerable to being leaked to outside adversaries because of its relatively small size. In the latter approach, by contrast, data privacy is the primary security requirement to protect against not only outside adversaries but also the inside cloud server. Thus, most of the schemes have been proposed to provide data encryption, while still benefiting from a deduplication technique, by enabling data owners to share the encryption keys in the presence of the inside and outside adversaries.
Since encrypted data are given to a user, data access control can additionally be implemented by selective key distribution after the PoW process. However, not much work has yet been done to address dynamic ownership management and its related security problems. A similar attack scenario arises on cloud storage that uses deduplication across multiple users. Specifically, when an attacker temporarily compromises a server and obtains the hash values for data in the cloud storage, he is able to download all these data. This is because only a small piece of information about the data, namely, its hash value, serves not only as an index to locate the data among a huge number of files, but also as a "proof" that anyone who knows the hash value owns the corresponding data. Therefore, any user who can obtain the short hash value for specific data is able to access all the data stored in the cloud storage.
Harnik et al. proposed a randomized threshold to avoid such an attack on cloud storage services that use server-side data deduplication by stopping data deduplication. However, their method did not employ client-side data possession proofs to prevent hash manipulation attacks. Mulazzani et al. demonstrated the hash manipulation attack and conducted a practical evaluation of such an attack in Dropbox, which is one of the biggest cloud storage providers. Specifically, the authors showed that spoofing the hash value of a file chunk added to the local Dropbox folder allows a malicious user to access the files of other Dropbox users, given that the SHA-256 hash values of the file's chunks are known to the attacker. To overcome these attacks, Halevi et al. introduced and formalized the notion of proof-of-ownership (PoW), where a user proves to a server that he holds a file using Merkle trees, rather than only a short hash value for it. Specifically, Halevi et al.'s scheme encodes a file using an erasure code that is resilient to the erasure of up to a fraction of the bits, and then builds a Merkle tree over the encoded file. Then, a challenge-response protocol between the server and the client verifies the ownership. PoW is closely related to proof of retrievability and proof of data possession. However, proof of retrievability and data possession often use a pre-processing step that cannot be used in the data deduplication procedure.
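The Merkle-tree challenge-response idea behind PoW can be illustrated with a small sketch. This is our own simplified toy (it omits the erasure-coding step, and all function names are ours, not Halevi et al.'s): the server keeps only the Merkle root, challenges a random leaf, and the client must answer with the block and its authentication path.

```python
import hashlib, secrets

H = lambda b: hashlib.sha256(b).digest()

def build_tree(blocks):
    """Return all levels of the Merkle tree; levels[0] are leaf hashes, levels[-1] = [root]."""
    level = [H(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last hash on odd-sized levels
            level = level + [level[-1]]
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def auth_path(levels, idx):
    """Sibling hashes from leaf idx up to (but excluding) the root."""
    path = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sib = idx ^ 1
        path.append((level[sib], sib < idx))  # (sibling hash, sibling-on-left?)
        idx //= 2
    return path

def verify(root, block, path):
    h = H(block)
    for sib, sib_is_left in path:
        h = H(sib + h) if sib_is_left else H(h + sib)
    return h == root

# Client and server both derive the tree from the file; the server keeps only the root.
blocks = [b"file-block-%d" % i for i in range(8)]
root = build_tree(blocks)[-1][0]

# Challenge-response: the server picks a random leaf index; a client holding the
# file answers correctly, while a client guessing the block content fails.
idx = secrets.randbelow(len(blocks))
assert verify(root, blocks[idx], auth_path(build_tree(blocks), idx))
assert not verify(root, b"guessed block", auth_path(build_tree(blocks), idx))
```

Note how the proof requires possession of the actual block contents, not merely the short root hash, which is exactly what defeats the hash manipulation attacks above.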
Despite their significant benefits in terms of saving resources, these deduplication schemes may cause another security vulnerability and reveal users' private data, in particular, when partial information of users' data has already been leaked. Additionally, all of the above deduplication schemes allow the cloud server to store the data in plaintext form and send plain data to users on receipt of request messages after the PoW procedure. Thus, the cloud server should be fully trusted by all the users in the system, which constitutes a significant security threat in practical cloud storage services, where the cloud server may learn every customer's private information and maliciously exploit it.

2.2 Deduplication over Encrypted Data
In order to preserve data privacy against the inside cloud server as well as outside adversaries, users may want their data encrypted. However, conventional encryption under different users' keys makes cross-user deduplication impossible, since the cloud server would always see different ciphertexts, even if the data are the same, regardless of whether the encryption algorithm is deterministic. Convergent encryption, introduced by Douceur et al., is a promising solution to this problem. In convergent encryption, a data owner derives an encryption key K = H(M), where M is the data or file to be encrypted and H is a cryptographic hash function. Then, he computes the ciphertext C = E(K, M) via a block cipher E, deletes M, and keeps only K after uploading C to the cloud storage. If another user encrypts the same message, the same ciphertext C is produced, since the encryption is deterministic. Thus, on receipt of C from other users after the initial upload, the server does not store the file again but instead updates metadata to indicate that it has an additional owner. If any legitimate owners request and download C later, they can decrypt it with K.
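The convergent encryption flow can be sketched as follows. This is a toy illustration under our own assumptions: a SHA-256 XOR keystream stands in for the block cipher E (it must not be read as a production cipher), and the function names are ours.

```python
import hashlib

def _cipher(key: bytes, data: bytes) -> bytes:
    """Toy XOR keystream standing in for block cipher E (the same call encrypts and decrypts)."""
    stream, ctr = b"", 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def ce_encrypt(message: bytes):
    key = hashlib.sha256(message).digest()   # K = H(M)
    return key, _cipher(key, message)        # C = E(K, M)

def ce_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    return _cipher(key, ciphertext)

# Two independent users encrypting the same file derive the same key and the
# same ciphertext, so the server can deduplicate across users.
k1, c1 = ce_encrypt(b"identical file contents")
k2, c2 = ce_encrypt(b"identical file contents")
assert k1 == k2 and c1 == c2
assert ce_decrypt(k1, c1) == b"identical file contents"
```

The determinism that enables deduplication here is exactly what the following attack abuses.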
However, convergent encryption suffers from the following security flaw. Suppose a user ua has data Ma and a user ub has other data Mb (≠ Ma). ua uploads a maliciously generated ciphertext Ca = E(H(Mb), Ma) and its tag (or index) T(Ca) = H(E(H(Mb), Mb)). Then, when ub tries to upload Cb = E(H(Mb), Mb) and its tag, the server sees a tag match T(Ca) = T(Cb). Thus, the server deletes Cb and keeps only Ca. Later, when ub downloads it, the decryption yields Ma, not Mb, meaning the integrity of the data has been compromised. This is referred to as the tag consistency problem [20]. Xu et al. [19] also introduced a similar data integrity attack in the cloud storage service, called a poison attack.
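The poison attack above can be reproduced concretely with a toy stand-in cipher (our own helper names; the tag is computed as T(C) = H(E(H(M), M)) as in the text):

```python
import hashlib

def _cipher(key, data):
    # toy XOR keystream standing in for the block cipher E (self-inverse)
    stream, ctr = b"", 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return bytes(a ^ b for a, b in zip(data, stream))

H = lambda m: hashlib.sha256(m).digest()

Ma, Mb = b"attacker's data", b"victim's data"

# u_a maliciously encrypts Ma under Mb's key, but tags it as if it encrypted Mb.
Ca = _cipher(H(Mb), Ma)
Ta = H(_cipher(H(Mb), Mb))           # forged tag equals the honest tag of Mb

# u_b honestly encrypts Mb; the server compares tags only.
Cb = _cipher(H(Mb), Mb)
Tb = H(Cb)
assert Ta == Tb                      # server "deduplicates": deletes Cb, keeps Ca

# Later u_b downloads "his" file and decrypts with K = H(Mb):
recovered = _cipher(H(Mb), Ca)
assert recovered == Ma and recovered != Mb   # integrity compromised
```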
In order to solve this problem, Bellare et al. introduced the message-locked encryption (MLE) concept and its security notion, and proposed randomized convergent encryption as one implementation of MLE. In randomized convergent encryption, an initial uploader encrypts a message and generates C1 = E(L, M), where L is a randomly chosen key, and then encrypts the message encryption key L and generates C2 = L ⊕ K, where K is a key-encrypting key (KEK) that is derived from the message (K = H(P, M), where P is a set of public parameters). Then, the message tag T is generated from the KEK, not from the ciphertext (T = H(P, K)). When any legitimate owner receives C1, C2, and T from the server later, he computes L = C2 ⊕ K, decrypts C1 with L, and obtains M. Then, he generates a tag T′ = H(P, H(P, M)) and checks whether T = T′. If T = T′, he accepts it; otherwise, he rejects it, since the data have been compromised. In the scheme, C2 is used to distribute the message encryption key, where K is used as a group KEK shared among owners of the same data. Since a tag is generated from the KEK, not from the ciphertext, even different ciphertexts encrypted under the different keys of each owner can be deduplicated, provided that the plaintext is the same. Xu et al. [19] also proposed a leakage-resilient deduplication scheme to resolve the data integrity problem. This scheme also enables the data owner to encrypt data with a randomly selected key. Then, the data encryption key is encrypted under a KEK derived from the data and distributed to the other data owners after the PoW process. If a legitimate owner receives a ciphertext, he can check the integrity of the data by decrypting the data encryption key with the same KEK.
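Randomized convergent encryption can be sketched as follows, again with a toy XOR keystream standing in for E and hypothetical public parameters P (both our assumptions, not Bellare et al.'s concrete construction). The tag is derived from the KEK, and decryption ends with the integrity check T = H(P, H(P, M)).

```python
import hashlib, secrets, hmac

def _cipher(key, data):
    # toy XOR keystream standing in for the block cipher E (self-inverse)
    stream, ctr = b"", 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return bytes(a ^ b for a, b in zip(data, stream))

H = lambda *parts: hashlib.sha256(b"|".join(parts)).digest()
P = b"public-params"                          # hypothetical public parameters

def rce_encrypt(M):
    L = secrets.token_bytes(32)               # randomly chosen message key
    K = H(P, M)                               # KEK derived from the message
    C1 = _cipher(L, M)                        # C1 = E(L, M)
    C2 = bytes(a ^ b for a, b in zip(L, K))   # C2 = L xor K
    T = H(P, K)                               # tag from the KEK, not the ciphertext
    return C1, C2, T

def rce_decrypt(C1, C2, T, K):
    L = bytes(a ^ b for a, b in zip(C2, K))
    M = _cipher(L, C1)
    # integrity check: T' = H(P, H(P, M)) must equal T, else the data is poisoned
    if not hmac.compare_digest(T, H(P, H(P, M))):
        raise ValueError("tag inconsistency: data has been poisoned")
    return M

M = b"shared file contents"
C1, C2, T = rce_encrypt(M)
assert rce_decrypt(C1, C2, T, H(P, M)) == M
```

A second owner of M derives the same K and tag, so the server deduplicates on T even though each owner's C1 differs.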
Convergent encryption is insecure in the setting of PoW, where the hash value of the file (that is, a deterministic encryption key) may be leaked. Unfortunately, this is also the case in the MLE and Xu et al.'s schemes. Since the hash value of the file is used as the KEK in both schemes, if the KEK is revealed, adversaries who obtain it are able to decrypt the encrypted key material and obtain the encryption key, even if the encryption key is not deterministic. Another drawback of both schemes is the lack of dynamic ownership management among the data owners. For example, suppose a group of users share data in the cloud storage. Some users may request data deletion or modification in the storage. Then, they should be prevented from accessing the original data after this time instance (forward secrecy). Likewise, when a user subsequently uploads the data, access rights to the previous data should not be given to him before that time instance (backward secrecy). However, in both schemes, this unauthorized data access cannot be controlled, since the data encryption key cannot be updated at all after its initial selection and distribution by an initial uploader.
Recently, Li et al. proposed a convergent key management scheme in which users distribute the convergent key shares across multiple servers by exploiting the Ramp secret sharing scheme. Li et al. also proposed an authorized deduplication scheme in which differential privileges of users, as well as the data, are considered in the deduplication procedure in a hybrid cloud environment. Jin et al. proposed an anonymous deduplication scheme over encrypted data that exploits a proxy re-encryption algorithm. Bellare et al. proposed a server-aided MLE scheme that is secure against brute-force attacks, which was recently extended to interactive MLE to provide privacy for messages that are both correlated and dependent on the public system parameters. However, these schemes do not handle the dynamic ownership management issues involved in secure deduplication for shared outsourced data.
Shin et al. proposed a deduplication scheme over encrypted data that uses predicate encryption. This approach allows deduplication only of files that belong to the same user, which severely reduces the effect of deduplication. Thus, in this paper, we focus on deduplication across different users such that identical files from different users are detected and deduplicated safely to provide more storage savings.

III. DATA DEDUPLICATION ARCHITECTURE
In this section, we describe the data deduplication architecture and define the security model. According to the granularity of deduplication, deduplication schemes are categorized into (coarse-grained) file-level or (fine-grained) block-level schemes. Since block-level deduplication can easily be deduced from file-level deduplication, we consider only file-level deduplication for simplicity's sake. Thus, a data copy refers to a whole file in this paper. Fig. 1 shows the architecture of the data deduplication system, which consists of the following entities.
1) Data owner: This is a client who owns data and wishes to upload them into the cloud storage to save costs. A data owner encrypts the data and outsources them to the cloud storage with their index information, that is, a tag. If a data owner uploads data that do not already exist in the cloud storage, he is called an initial uploader; if the data already exist, which implies that other owners may have uploaded the same data previously, he is called a subsequent uploader. Hereafter, we refer to a set of data owners who share the same data in the cloud storage as an ownership group.

2) Cloud service provider: This is an entity that provides cloud storage services. It consists of a cloud server and cloud storage. The cloud server deduplicates the outsourced data from users if necessary and stores the deduplicated data in the cloud storage. The cloud server maintains ownership lists for stored data, which are composed of a tag for the stored data and the identities of its owners. The cloud server controls access to the stored data based on the ownership lists and manages (e.g., issues, revokes, and updates) group keys for each ownership group as a group key authority. The cloud server is assumed to be honest-but-curious. That is, it will honestly execute the assigned tasks in the system; however, it would like to learn as much information about the encrypted contents as possible. Thus, it should be deterred from accessing the plaintext of the encrypted data even though it is honest.

3.2 Threat Model and Security Requirements
1) Data privacy: Unauthorized users who cannot prove ownership should not be able to decrypt the ciphertext stored in the cloud storage. Additionally, the cloud server is no longer fully trusted in the system. Thus, unauthorized access by the cloud server to the plaintext of the encrypted data in the cloud storage should be prevented.

2) Data integrity: The deduplication algorithm should guarantee tag consistency against any poison attack. That is, the deduplication algorithm should allow the valid owners to verify that the data downloaded from the cloud storage have not been altered.

3) Backward and forward secrecy: In the context of deduplication, backward secrecy means that any user should be prevented from accessing the plaintext of the outsourced data stored before he uploaded the data. Conversely, forward secrecy means that any user who deletes or modifies the data in the cloud storage should be prevented from accessing the outsourced data after its deletion or modification.

4) Collusion resistance: Unauthorized users who do not have valid ownership of data in the cloud storage should not be able to decrypt it, even if they collude.

IV. PRELIMINARIES AND DEFINITION
4.1 Notations
In this paper, x ←$ S denotes the operation of selecting an element at random and uniformly from a finite set S and assigning it to x. For an algorithm A, y ← A(x1, …) denotes running A on inputs x1, … and assigning the output to the variable y. 1^λ denotes a string of λ ones, where λ ∈ N is the security parameter. For two bit-strings a and b, we denote by a||b their concatenation. Let U = {u1, …, un} be the universe of users. Let IDt be the identity of a user ut. Let Gi ⊆ U be the set of users that own the data Mi, which is referred to as an ownership group. Let Li = (Ti, Gi) be the ownership list for Mi, maintained by the cloud server, which consists of a tag Ti and Gi for Mi. Let GKi be the ownership group key that is shared among the valid owners in Gi.

4.2 Definitions
In this section, we define a secure deduplication framework for encrypted data with ownership management capability. (In secure group communication, backward secrecy implies that when a member newly joins a multicast group, he should be prevented from learning group communications exchanged before he joins the group; forward secrecy implies that when a member leaves a multicast group, he should be prevented from learning group communications exchanged after he leaves the group. To be consistent with the standard convention in algorithms, where the running time of an algorithm is measured as a function of the length of its input, we provide the adversary and the honest parties with the security parameter in unary as 1^λ.) The scheme consists of the following algorithms:
1) KEK ←$ KEKGen(U): The KEK generation algorithm takes a set of users U as input, and outputs KEKs for each user in U for secure ownership group key distribution.

Fig. 2. Scheme overview and corresponding security goals
2) C ←$ Encrypt(M, 1^λ): The encryption algorithm is a randomized algorithm that takes as input data M and a security parameter λ, and outputs a ciphertext C of the data. C consists of the encrypted message and its tag information for indexing.

3) C′ ←$ ReEncrypt(C, G): The re-encryption algorithm is a randomized algorithm that takes a ciphertext C and an ownership group G, and outputs a re-encrypted ciphertext C′. Specifically, it outputs a re-encrypted ciphertext such that only valid owners in G can decrypt the message.
4) M ← Decrypt(C′, K, PK): The decryption algorithm is a deterministic algorithm that takes as input C′, a message encryption key K, and a set of KEKs PK for decrypting an ownership group key GK, and outputs a message M iff K is derived from M and GK is not revoked for the ownership group.

V. PROPOSED DEDUPLICATION SCHEME
In this section, we propose a secure deduplication scheme for encrypted data that has dynamic ownership management capability. The proposed scheme is constructed partially based on a randomized convergent encryption scheme in order to randomize the encrypted data, which renders the proposed scheme secure against the chosen-plaintext attack while still allowing deduplication over the data. The proposed scheme is further integrated with a re-encryption protocol for owner revocation. The owner revocation is executed by re-encrypting the outsourced ciphertext and selectively distributing the re-encryption key to valid (that is, not revoked) owners by the cloud server. Fig. 2 shows the overview of the proposed scheme and its corresponding security goals.
To handle dynamic ownership management, the cloud server must obtain the ownership list for each data item, since otherwise revocation cannot take effect. This setting, where the cloud server knows the ownership list, does not violate the security requirements, because the server is allowed only to re-encrypt the ciphertexts and can by no means obtain any information about the data encryption key of users. The simplest implementation is to make E_K : {0,1}^k → {0,1}^k a block cipher, where k is the length of the key K. We additionally employ a cryptographic hash function H : {0,1}^* → {0,1}^k to generate an encryption key and a tag from a message.
5.1.1 Key Generation
The cloud server runs KEKGen(U) and generates KEKs for users in U. First, the cloud server sets up a binary KEK tree for the universe of users U, as in Fig. 3, which will be used to distribute the ownership group keys to users in U. In the tree, each node vj holds a KEK, denoted by KEKj. A user is represented by a leaf, and each user maintains the KEKs on the path nodes from its leaf to the root. These are called path keys. For instance, in Fig. 3, u2 stores KEK9, KEK4, KEK2, and KEK1 as its path keys PK2. For ut ∈ U, PKt denotes the set of path keys of ut. The KEK tree is constructed by the cloud server as follows:
1) Every member in U is assigned to a leaf node of the tree. Random keys are generated and assigned to each leaf node and internal node.

2) Each member ut ∈ U securely receives the path keys PKt from its leaf node to the root node of the tree. Then, the path keys are used as KEKs to encrypt the ownership group keys by the cloud server in the data re-encryption phase. The key assignments in this method are conducted randomly and independently of each other.
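The KEK tree and path-key assignment can be sketched in a heap-style array layout (our own indexing convention, chosen so that it reproduces the Fig. 3 example: node 1 is the root, and for n users the leaves are nodes n..2n-1, hosting u1..un).

```python
import secrets

def build_kek_tree(n_users):
    # complete binary tree in heap layout: node 1 is the root,
    # leaves n .. 2n-1 host users u1 .. un (n must be a power of two)
    assert n_users & (n_users - 1) == 0
    return {v: secrets.token_bytes(16) for v in range(1, 2 * n_users)}

def path_keys(tree, n_users, t):
    """PK_t: the KEKs on the path from u_t's leaf up to the root (t is 1-based)."""
    v = n_users + t - 1                 # leaf node of u_t
    nodes = []
    while v >= 1:
        nodes.append(v)
        v //= 2
    return {v: tree[v] for v in nodes}

tree = build_kek_tree(8)
pk2 = path_keys(tree, 8, 2)
# u2's path keys sit at nodes v9, v4, v2, v1, matching the Fig. 3 example
assert sorted(pk2) == [1, 2, 4, 9]
```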

5.1.2 Data Encryption
Without loss of generality, we suppose a data owner ut wants to upload his data Mi to the cloud storage. ut encrypts the data by running the Encrypt(Mi, 1^λ) algorithm. The algorithm chooses a random data encryption key L ←$ {0,1}^k(λ), where k(λ) is an algorithm that determines the size of the encryption key under the security parameter. It also computes a key Ki = H(Mi) from Mi, which will be used as a KEK for encrypting the message encryption key L, and a tag Ti = H(Ki), which is the index information for the data. Then, the algorithm encrypts the data and the encryption key as Ci1 = E_L(Mi) and Ci2 = L ⊕ Ki, and constructs the ciphertext Ci = Ci1 || Ci2. After the construction of Ci, the data owner ut sends upload||Ti||Ci||IDt to the cloud storage. Then, the owner deletes Mi and retains only Ki for storage saving. On its receipt, the cloud server inserts IDt into Gi, creates Li = (Ti, Gi), and stores Ci in the cloud storage, if ut is the first uploader for Mi. If Li already exists (which means ut is a subsequent uploader), then it inserts only IDt into Gi without storing Ci.
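The Encrypt algorithm can be sketched as follows (a toy illustration under our own assumptions: a SHA-256 XOR keystream stands in for the block cipher E, SHA-256 for H, and a fixed 32-byte key size for k(λ)):

```python
import hashlib, secrets

def _cipher(key, data):
    # toy XOR keystream standing in for the block cipher E (self-inverse)
    stream, ctr = b"", 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def encrypt(Mi):
    L = secrets.token_bytes(32)                 # random data encryption key
    Ki = hashlib.sha256(Mi).digest()            # KEK: Ki = H(Mi)
    Ti = hashlib.sha256(Ki).digest()            # tag: Ti = H(Ki)
    Ci1 = _cipher(L, Mi)                        # Ci1 = E_L(Mi)
    Ci2 = bytes(a ^ b for a, b in zip(L, Ki))   # Ci2 = L xor Ki
    return Ki, Ti, (Ci1, Ci2)                   # owner keeps Ki; sends Ti and Ci

M = b"outsourced file"
Ki1, Ti1, Ci = encrypt(M)
Ki2, Ti2, Ci_prime = encrypt(M)
# Owners of the same data derive the same tag and KEK, so the server can
# deduplicate on Ti, while each upload's ciphertext differs because L is random.
assert Ti1 == Ti2 and Ki1 == Ki2
assert Ci != Ci_prime
```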

5.1.3 Data Re-encryption
Before distributing the ciphertext Ci, the cloud server re-encrypts it by running ReEncrypt(Ci, Gi) using the ownership group information for the ciphertext. The re-encryption algorithm enforces access control of dynamically changing owners to the outsourced data. The algorithm progresses as follows:
1) For Gi, choose a random ownership group key GKi. Then, re-encrypt Ci1 and generate Ci1′ = E_GKi(Ci1).
2) Select the root nodes of the minimum cover sets in the KEK tree that can cover all of the leaf nodes associated with users in Gi. We denote by KEK(Gi) the set of KEKs that such root nodes of subtrees for Gi hold. For example, if Gi = {u1, u2, u3, u4, u7, u8} in Fig. 3, then KEK(Gi) = {KEK2, KEK7}, because v2 and v7 are the root nodes of the minimum cover sets that can cover all of the members in Gi. It follows that this collection covers all users in Gi and only them, and any user u ∉ Gi can by no means know any KEK in KEK(Gi).
3) Generate Ci3 = {E_K(GKi)}_{K ∈ KEK(Gi)}.
The key-indistinguishability property follows from the fact that no u ∉ Gi is contained in any of the subsets whose root node holds a KEK in KEK(Gi). This means that, for every KEK in KEK(Gi), the KEK is indistinguishable from a random key, given all the information of all users not in Gi [32]. Thus, any user u ∉ Gi can by no means decrypt GKi, even if he colludes with other users outside Gi, which makes the proposed scheme secure against such a collusion attack, as we will analyze in Section 7.4. This encryption is employed as the method for delivering the ownership group keys to valid owners.
On receiving any data request query Ti||IDj from a user uj, the cloud server looks up Li and responds with Ti||Ci′ to the user, where Ci′ = Ci1′||Ci2||Ci3, if IDj ∈ Gi; otherwise, it does nothing. The former indicates the case where the user has uploaded the data and has not been revoked, while the latter indicates the case where the user has not uploaded the data or has been revoked at this moment. It is important to note that the ownership group key distribution protocol through Ci3 is a stateless approach. Thus, even if users cannot update their key states constantly (e.g., when offline), they are able to decrypt the group key from it at any time they receive it once they become online, provided that they are not revoked from the ownership groups.

5.1.4 Data Decryption
When a user ut receives a ciphertext Ci′ from the cloud server, he can decrypt the message by running Decrypt(Ci′, Ki, PKt), if ut ∈ Gi. The data decryption phase consists of ownership group key decryption followed by message decryption.
Ownership Group Key Decrypt. When a user sends a data request query and receives Ti||Ci′ from the cloud server, he first parses it as Ti, Ci1′, Ci2, and Ci3, and obtains the ownership group key from Ci3. If the user ut has valid ownership (that is, ut ∈ Gi at this time instance), he can decrypt the ownership group key GKi using a KEK ∈ KEK(Gi) ∩ PKt as GKi = D_KEK(Ci3). The user ut may belong to at most one subset rooted by only one such KEK in KEK(Gi); thus, there can be only one such KEK.
Message Decrypt. After that, the user ut decrypts the ciphertext and obtains the message as Ci1 = D_GKi(Ci1′), L = Ci2 ⊕ Ki, and Mi = D_L(Ci1). If the decrypted data are not consistent with the retained key Ki, the subsequent user would report the data inconsistency to the cloud server, which may help the cloud server to find and revoke the malicious user, and delete the polluted data from the cloud storage.
(Note: in server-side deduplication approaches, the data owner may send only Ti, and the cloud server requests the owner to upload the encrypted data only when there are no data indexed by Ti. This approach can save network bandwidth; however, it can be used as a side channel that reveals information about the contents of other users' files, which may violate their privacy. Thus, in the proposed scheme, we assume that the data owner sends Ci as well as Ti in order to preserve privacy.)
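The minimum cover set selection over the KEK tree (e.g., KEK(Gi) = {KEK2, KEK7} for Gi = {u1, u2, u3, u4, u7, u8} in Fig. 3) can be sketched with a complete-subtree recursion. The heap-style node numbering (node 1 is the root, leaves n..2n-1 host u1..un) is our own convention, chosen to match the figure's labels.

```python
def min_cover(n_users, member_indices):
    """Roots of the minimal set of subtrees whose leaves are exactly the members' leaves."""
    leaves = {n_users + t - 1 for t in member_indices}

    def leaves_under(v):
        lo, hi = v, v
        while lo < n_users:            # descend to the range of leaves under v
            lo, hi = 2 * lo, 2 * hi + 1
        return set(range(lo, hi + 1))

    def cover(v):
        under = leaves_under(v)
        if under <= leaves:
            return [v]                 # every leaf under v belongs to the group
        if not (under & leaves):
            return []                  # no member below v
        return cover(2 * v) + cover(2 * v + 1)

    return cover(1)

# Fig. 3 example: Gi = {u1, u2, u3, u4, u7, u8} is covered by v2 and v7
assert min_cover(8, [1, 2, 3, 4, 7, 8]) == [2, 7]
```

The server would then encrypt GKi under the KEK held at each cover root to form Ci3; a valid owner's path keys intersect the cover in exactly one node, giving him the single KEK needed to recover GKi.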

5.2
Key Update
When subsequent users upload data identical to the data previously uploaded by the initial uploader, the corresponding ciphertext should be re-encrypted to prevent the subsequent users from accessing the previously encrypted data, in order to provide backward secrecy. In contrast, when users who hold valid ownership request the cloud server to delete or modify the data in the cloud storage, they should be revoked from the ownership list and deterred from accessing the data after the deletion or modification, in order to provide forward secrecy.

Subsequent Upload. Suppose a user us wants to upload data Mi to the cloud storage, and its corresponding ownership list oi = (Ti, Gi) and ciphertext Ci = Ci1||Ci2 already exist in the cloud storage (Ci might have been encrypted and uploaded by the initial uploader, and re-encrypted by the cloud server such that Ci1 = EGKi(C̄i1)). Then, the user us encrypts the data by running the Encrypt(Mi, 1^k) algorithm and generates a ciphertext, say C'i. With overwhelming probability, it holds that C'i ≠ Ci, since the encryption key L is selected uniformly at random from {0,1}^k by different users. After the construction of C'i, the data owner us sends upload||T'i||C'i||IDs to the cloud storage7. Then, the key update and re-encryption processes proceed as follows.

1) If T'i = Ti, the cloud server puts IDs into Gi.

2)
The cloud server decrypts the ciphertext component Ci1 = EGKi(C̄i1) in Ci with the current ownership group key GKi. Then it selects a new random ownership group key GK'i (≠ GKi) and runs the ReEncrypt(Ci, Gi) algorithm described in Section 5.1.3 with the updated ownership group information Gi and GK'i to guarantee backward secrecy, which updates the ciphertext component as Ci1 : EGKi(C̄i1) → EGK'i(C̄i1).

If there is no tag Ti in the cloud storage such that T'i = Ti, this implies the first upload of the data; the cloud server then creates a new ownership list for the data, inserts IDs into the newly generated ownership group, and stores the uploaded data in the cloud storage following the same procedures described in Section 5.1.2.

6. Even if the proposed scheme can detect any data modification or loss caused by a malicious user or CSP, it cannot recover the original data under a data loss attack, because all of the redundant data would have been deduplicated. One possible mitigation is to encode the data with some redundancy and disperse the shares across multiple CSPs, which is out of the scope of this paper.

7. The subsequent uploader sends C'i to prevent the side-channel attack, as in the initial upload; however, if the communication channel is secure in the presence of eavesdroppers, C'i does not need to be uploaded, which reduces the communication cost, as in client-side deduplication.
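The server-side re-encryption step on a subsequent upload can be sketched as follows. This is a minimal Python sketch under toy assumptions: a SHA-256 counter-mode XOR keystream stands in for the scheme's symmetric cipher E(·), and the helper names are illustrative, not from the paper.

```python
import hashlib
import secrets

def keystream(key: bytes, n: int) -> bytes:
    """SHA-256 counter-mode keystream; a toy stand-in for E(.)."""
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def E(key: bytes, msg: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(msg, keystream(key, len(msg))))

D = E  # an XOR stream cipher is its own inverse

# State after the initial upload: C_i1 = E_GK(inner), with inner = E_L(M_i)
inner = E(secrets.token_bytes(32), b"outsourced file M_i")
gk_old = secrets.token_bytes(32)
c1 = E(gk_old, inner)

def reencrypt(c1: bytes, gk: bytes):
    """Decrypt with the current group key, pick a fresh GK', re-encrypt."""
    cbar = D(gk, c1)
    gk_new = secrets.token_bytes(32)   # GK' != GK with overwhelming probability
    return E(gk_new, cbar), gk_new

c1_new, gk_new = reencrypt(c1, gk_old)
assert D(gk_new, c1_new) == inner     # valid owners recover the inner ciphertext
assert D(gk_old, c1_new) != inner     # the previous group key no longer works
```

Only the outer layer changes; the inner ciphertext E_L(M_i) is never decrypted by the server, which is why the honest-but-curious server learns nothing about M_i.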
Data Deletion. When a user us wants to delete data Mi from the cloud storage, the user sends a data deletion request message delete||Ti||IDs to the cloud server. Then, the cloud server performs the following procedures.

1)
If IDs ∈ Gi, it deletes IDs from Gi. Then, it selects a new random ownership group key and runs the ReEncrypt(Ci, Gi) algorithm described in Section 5.1.3 with the updated ownership group information Gi to guarantee forward secrecy.

2)
Else, it does nothing.

Data Modification. When a user us wants to modify the data Mi to Mj, the user encrypts the data and constructs the ciphertext Cj and its corresponding tag Tj by running the Encrypt(Mj, 1^k) algorithm described in Section 5.1.2, and sends a data modification request message modify||Ti||Tj||Cj||IDs to the cloud server. The cloud server then performs the data deletion procedure, followed by the data upload procedure, as follows.
- Data Deletion (steps 1-2): 1) If IDs ∈ Gi, it deletes IDs from Gi. Then, it selects a new random ownership group key and runs the ReEncrypt(Ci, Gi) algorithm described in Section 5.1.3 with the updated ownership group information Gi to guarantee forward secrecy.

3)
If there exists oj = (Tj, Gj) for the tag Tj in the cloud storage, it performs the subsequent upload procedures described above.

4)
Else, if there does not exist such an oj in the cloud storage, it performs the initial upload procedures described in Section 5.1.2. When multiple users upload or delete the same file at the same time, the requests are handled in a batch. Specifically, the ownership list for the file is updated according to all of the ownership changes, and the corresponding ownership group key and ciphertext are updated once; the new key is then securely delivered following the proposed algorithms. We note that this can be handled straightforwardly without any security degradation.
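The server-side bookkeeping described above, that is, a rekey on every ownership change with modification treated as deletion followed by upload, can be sketched as follows. The class and method names are ours for illustration, and the actual re-encryption and selective key distribution are elided behind the rekey step.

```python
import secrets

class OwnershipList:
    """Toy bookkeeping of o_i = (T_i, G_i) and the group key GK_i."""

    def __init__(self):
        self.store = {}  # tag -> {"owners": set of IDs, "gk": group key}

    def _rekey(self, rec):
        # Stands in for: pick GK' and run ReEncrypt, then distribute GK'
        # to the (updated) owners via their KEKs.
        rec["gk"] = secrets.token_bytes(32)

    def upload(self, tag, uid):
        rec = self.store.setdefault(
            tag, {"owners": set(), "gk": secrets.token_bytes(32)})
        if uid not in rec["owners"]:
            rec["owners"].add(uid)
            self._rekey(rec)          # backward secrecy

    def delete(self, tag, uid):
        rec = self.store.get(tag)
        if rec and uid in rec["owners"]:
            rec["owners"].discard(uid)
            self._rekey(rec)          # forward secrecy

    def modify(self, old_tag, new_tag, uid):
        self.delete(old_tag, uid)     # deletion, then upload, as in Section 5.2
        self.upload(new_tag, uid)
```

A batch of simultaneous requests would simply apply all membership changes to `owners` first and call `_rekey` once, matching the batched handling described above.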

5.3
Comparison
Table 1 shows the comparison results for the secure data deduplication schemes, that is, convergent encryption (CE) [15], leakage-resilient (LR) deduplication [19], and randomized convergent encryption (RCE) [20], in terms of deduplication over encrypted data. Since all of the schemes allow data owners to encrypt their data and enable deduplication over them, they can guarantee data confidentiality and privacy against the cloud server and unauthorized outside adversaries. With regard to data integrity, convergent encryption cannot guarantee the integrity of deduplicated data in the face of a poison attack, whereas the other schemes preserve it by adopting an additional mechanism that enables data owners to check the tag consistency of the received data.
In the proposed scheme, upon every membership change in the ownership list (e.g., a subsequent upload of the same data, or modification/deletion of existing data), access to the corresponding data is permitted only to owners, and only for the time windows during which they maintain valid ownership of the data, by re-encrypting the data with an updated ownership group key and selectively distributing that key.
This resolves the dynamic ownership management problem, in contrast to the other schemes. The rekeying in the proposed scheme can be done immediately upon any ownership change, which enhances the security of the outsourced data in terms of backward/forward secrecy by reducing the windows of vulnerability. A more rigorous security analysis is given in Section 7.

VI. SCHEME ANALYSIS
In this section, we analyze the efficiency of the proposed scheme and compare it with the previous deduplication schemes over encrypted data in terms of both theoretical and practical aspects. The efficiency of the proposed scheme is demonstrated by a network simulation in terms of communication cost. We also discuss its computation cost when implemented with specific parameters.

6.1 Efficiency
The comparative results for the theoretical efficiency of the schemes are summarized in Table 2, which shows the analysis of each scheme in terms of communication and storage overhead. For communication overhead, "upload message size" represents the communication cost required for the data outsourcing process; "download message size" represents the communication cost required for the ciphertext downloading and tag checking processes; and "rekeying message size" represents the communication cost required for rekeying the data encryption key. For storage overhead, "key size" and "tag size" represent the size of the keys and tag information that each owner needs to store, respectively. The notations used in the table are as follows.

- CM: size of a data file
- CC: size of an encrypted data (= output length of E(·))
- Ck: size of a key (= output length of k(·) on input 1^k)
- CT: size of a tag
- CID: size of an identity of a user (≈ log n)
- Ch: size of a node value in the Merkle hash tree
- CPoW: size of the exchanged messages for PoW on inputs the file size and 1^k (= u · log CM, where u is the smallest integer such that (1 − δ)^u < ε for some constant fraction δ > 0) [21]
- n: number of users in the system
- m: number of owners in an ownership list for a file

For the upload and download message sizes, the proposed scheme is the same as the basic RCE [20] scheme. In LR [19], the communication overhead for verifying PoW is additionally included in the download message. In that scheme, the PoW verification and tag checking processes are done during the data upload phase by subsequent owners.
However, they can be executed during the data download phase without loss of functionality or efficiency. Thus, for the sake of a fair comparison, we suppose they are executed during the download phase, as in RCE [20] and the proposed scheme.
With regard to the rekeying message size, only the proposed scheme supports key updates upon ownership changes for data. In the proposed scheme, the rekeying message size (i.e., the size of Ci3) would be (n − m) log(n/(n − m)) · Ck. This additional message plays an important role in enhancing backward and forward secrecy, and enforces fine-grained user access control over the outsourced data, in contrast to the other schemes. In CE [15], the encryption key is determined by the message itself; in LR [19] and RCE [20], it is selected by the initial uploader and never updated during the lifetime of the data in the system. Thus, even though the other schemes do not need the additional rekeying messages, they cannot guarantee data privacy during the windows of vulnerability in a practical cloud environment, where ownership changes dynamically as time elapses. With regard to storage overhead, in the LR scheme each owner stores the leaf node values of the Merkle tree for PoW in addition to the data encryption key and KEK, and this storage increases in proportion to the data size. In the proposed scheme, each data owner stores log n additional KEKs compared to the original RCE scheme. These KEKs allow the secure and selective distribution of the dynamically updated data encryption key, which supports fine-grained access control on the basis of the valid ownership of each user with little storage overhead.
6.2 Simulation
In this simulation, we measure the communication cost of the deduplication schemes. We consider online cloud storage systems connected to the Internet. Almeroth et al. [33] analyzed group behavior in the Internet's multicast backbone network (MBone). They showed that the number of users joining a multicast group follows a Poisson distribution with rate λ, and that the membership duration time follows an exponential distribution with mean duration 1/μ. Since each owner group of data or files can be seen as an independent network group in which the owners share common outsourced data, we show the simulation results following this probabilistic behavior distribution [33]. We suppose that user join and leave events are independently and identically distributed in each ownership group following a Poisson distribution. The ownership duration time for outsourced data (that is, from data upload time to deletion time) is assumed to follow an exponential distribution. We set the inter-upload time between users to 12 hours (λ = 2 per day) and the average ownership duration time to 10 days (1/μ = 10).
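Under the stated model (Poisson joins at λ = 2 per day, exponentially distributed ownership durations with mean 10 days, and a 100-day horizon), the ownership-change events that drive rekeying can be generated with a few lines of Python. The variable names and the fixed seed are ours; this sketches the event process only, not the per-scheme byte counts of Figs. 4 and 5.

```python
import random

random.seed(2024)            # fixed seed for reproducibility
LAM = 2.0                    # join rate: 2 uploads/day (12 h inter-upload time)
MEAN_DURATION = 10.0         # mean ownership duration, days
HORIZON = 100.0              # simulated period, days

events, t = [], 0.0
while True:
    t += random.expovariate(LAM)                          # Poisson join process
    if t >= HORIZON:
        break
    events.append((t, "join"))
    leave = t + random.expovariate(1.0 / MEAN_DURATION)   # exponential duration
    if leave < HORIZON:
        events.append((leave, "leave"))
events.sort()

# Every join or leave changes the ownership list and triggers one rekeying.
joins = sum(1 for _, kind in events if kind == "join")
print(f"{joins} joins, {len(events) - joins} leaves, "
      f"{len(events)} rekeyings over {HORIZON:.0f} days")
```

With these parameters one expects about 200 joins over the horizon, so the rekeying frequency, and hence the rekeying traffic, scales directly with the churn of the ownership group.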

6.3
Implementation
Figs. 4 and 5 show the total communication costs that the cloud server incurs in responding to data requests from owners in the secure deduplication schemes that support tag consistency (that is, LR [19], RCE [20], and the proposed scheme) in a single ownership group during 100 days. In this simulation, we suppose that the owners send data request messages immediately after ownership changes in the ownership group, in order to measure the communication cost incurred by dynamic ownership changes for a fair comparison from the security perspective. The communication cost, which is measured in bytes, includes those of the ciphertext and of the rekeying messages for non-revoked owners.

Fig. 5. Communication cost in RCE and the proposed scheme
In this simulation, we set CM = CC = 10 MB for typical multimedia data (e.g., a music file), which is one of the most commonly used types of data in the cloud. To achieve a 128-bit security level, we set Ck = 128 bits and CT = 128 bits. Fig. 4 shows the communication costs in LR, RCE, and the proposed scheme. In the LR scheme, each owner performs a PoW process on every data request procedure and receives an updated encryption key and ciphertext by unicast, which incurs much more communication overhead than the other schemes. RCE and the proposed scheme thus have almost the same, and relatively negligible, communication overhead, as shown in Fig. 4. To provide a detailed comparison of the two schemes, Fig. 5 shows the communication costs of RCE and the proposed scheme. In RCE, we assume that only the rekeying component C2 = L ⊕ K is selectively distributed to valid owners securely by unicast in order to realize dynamic ownership management similar to that of the proposed scheme, which is the most efficient rekeying scenario in RCE. In this scenario, the communication overhead of the proposed scheme is less than that of RCE by about 300 bytes, with the same level of ownership management capability8.

Next, we analyze and measure the computation cost incurred when a data owner encrypts and decrypts data during the upload and download phases, respectively. The computation cost is shown in Table 3 (comparison of computation cost) in terms of the computation of a cryptographic hash function for key generation and tag generation (the hash function is also used for key encryption/decryption in LR [19]), data encryption/decryption, and key decryption. The comparatively negligible bitwise exclusive-or operations are ignored in the analysis. For each operation, we include a benchmark timing. Each cryptographic operation was implemented using the Crypto++ library ver. 5.6.2 [34] on a PC with a 3.4 GHz processor. The key parameters were selected to provide a 128-bit security level; the implementation uses MD5 as a cryptographic hash function. The encryption and decryption times in Table 3 are measured for different data sizes, as shown in Fig. 6, and increase in proportion to the size of the data.
On the basis of the encryption and decryption times, we measured the total computation cost for the upload and download procedures of each scheme, as shown in Fig. 7 and Fig. 8, respectively. For the upload procedure, the proposed scheme requires the same computations as the CE and RCE schemes. For the download procedure, the proposed scheme needs one more key decryption operation than the basic RCE scheme. However, since the symmetric key size is much smaller than the typical data size in the cloud (e.g., a document file or multimedia data), the additional 128-bit key decryption time (i.e., 0.129 ms) in the proposed scheme is relatively negligible compared to the data decryption time in a pragmatic cloud computing system, as depicted in Fig. 8.

8. Each detailed communication cost can be found in the supplementary file for this paper.
The measured computation times for upload and download are given in Table 4. More experimental results with diverse file sizes (100 KB to 1000 MB) can be found in the supplementary file for this paper.

VII. SECURITY
In this section, we prove the security of the proposed scheme in terms of the security requirements discussed in Section 3.2, that is, data privacy, data integrity, backward and forward secrecy, and collusion resistance.

7.2 Data Integrity
In a deduplication scheme, data integrity may be threatened by a poison attack on tag consistency. Without loss of generality, we suppose an attacker and another user u have the same data M. The attacker maliciously generates a ciphertext C' from M' (≠ M), and initially uploads it with a tag T generated from M. In the proposed scheme, the poison attack on tag consistency is easily detected, as in the basic RCE scheme. When the user u subsequently requests the data and receives the corresponding ciphertext C1||C2||C3 with T, he can obtain the ownership group key GK and the data encryption key L from C3 and C2, respectively, if he has valid ownership of the data. Then, the user decrypts C1 and obtains the data M' using the two encryption keys GK and L, generates the tag T' from M', and checks whether T = T'. If these tags are not consistent, the user drops the message, since this implies that the data may have been modified during the previous outsourcing procedure. Therefore, the proposed scheme guarantees data integrity against a poison attack on tag consistency.
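The RCE-style construction and the tag consistency check can be sketched as follows. This is a toy Python illustration in which SHA-256 stands in for the scheme's hash and tag functions and a hash-based XOR keystream stands in for E(·); a real deployment would use a proper cipher, and the helper names are ours, not the paper's.

```python
import hashlib
import secrets

def H(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def keystream(key: bytes, n: int) -> bytes:
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def E(key: bytes, msg: bytes) -> bytes:   # toy XOR stream cipher; D = E
    return bytes(a ^ b for a, b in zip(msg, keystream(key, len(msg))))

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def rce_encrypt(M: bytes):
    K = H(M)                              # convergent key K = H(M)
    L = secrets.token_bytes(32)           # random data encryption key L
    return H(K), E(L, M), xor(L, K)       # tag T, C1 = E_L(M), C2 = L xor K

def tag_consistent(T: bytes, C1: bytes, C2: bytes, K: bytes) -> bool:
    """A subsequent owner recovers L and M, then regenerates and checks the tag."""
    L = xor(C2, K)
    M = E(L, C1)
    return H(H(M)) == T

M = b"the common file M"
T, C1, C2 = rce_encrypt(M)
assert tag_consistent(T, C1, C2, H(M))    # honest upload passes the check

# Poison attack: the tag is derived from M, but the ciphertext encrypts M' != M
L = secrets.token_bytes(32)
C1p, C2p = E(L, b"some poisoned data"), xor(L, H(M))
assert not tag_consistent(T, C1p, C2p, H(M))   # inconsistency is detected
```

Because the tag is deterministically recomputable from the recovered plaintext, a subsequent owner detects the mismatch without any interaction beyond the normal download.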

7.3
Backward and Forward Secrecy
When a user holds data that have already been uploaded previously and tries to upload them into the cloud storage at some time instance, the corresponding ownership group key is updated independently and randomly, say from GK to GK', and delivered securely to the valid owners of the data (including the user) immediately. In addition, the ciphertext component C1 = EGK(C̄1), which was encrypted with GK, is re-encrypted by the cloud server with the updated ownership group key GK' at the same time, as C'1 = EGK'(C̄1). Even if the user stored the previous ciphertext before he held the data, he cannot decrypt it: even though he is able to obtain the encryption keys L and GK' from the current ciphertext, they are of no use for recovering the desired component C̄1 from the previous ciphertext, since it was encrypted with the previous GK. Therefore, the backward secrecy of the outsourced data is guaranteed in the proposed scheme.

On the other hand, when a user deletes or modifies the data at some time instance, the corresponding ownership group key is also updated independently and randomly, say from GK to GK', and delivered securely to the valid ownership group members (excluding the user) immediately. Then, the ciphertext component C1 = EGK(C̄1) that was encrypted with GK is re-encrypted by the cloud server with the new ownership group key GK' at the same time, as C'1 = EGK'(C̄1). The user cannot decrypt the current ciphertext after his revocation, since he can by no means obtain GK'. Even if the user recovered the data encryption key L before he was revoked from the ownership group and stored it, it would be of no use for obtaining the desired data from C'1 in the subsequent ciphertext, since it is re-encrypted with the new random GK'. Therefore, the forward secrecy of the outsourced data is also guaranteed in the proposed scheme.

7.4
Collusion Resistance
To provide collusion resistance, unauthorized users who do not have valid ownership of cloud data should not be able to decrypt the data even if they collude. In the proposed scheme, in order to decrypt the ciphertext and obtain the plain data, users must know both the data encryption key L and the ownership group key GK. Even if some unauthorized users are able to obtain the data encryption key9, it is impossible for them to obtain the ownership group key GK. This is because the KEK assignment for ownership group key distribution in the binary KEK tree is information-theoretic; that is, the KEKs are assigned randomly and independently of each other. Whenever any ownership change occurs in an ownership group, the ownership group key is rekeyed immediately and the data are re-encrypted using the updated group key. Even if the unauthorized users collude with each other, they cannot obtain the current ownership group key, since none of the KEKs in their path keys in the KEK tree is used to encrypt and distribute it. Therefore, the proposed scheme is secure against a collusion attack by unauthorized users.

VIII. CONCLUSION
Dynamic ownership management is an important and challenging issue in secure deduplication over encrypted data in cloud storage. In this study, we proposed a novel secure data deduplication scheme that enhances fine-grained ownership management by exploiting the characteristics of the cloud data management system. The proposed scheme features a re-encryption technique that enables dynamic updates upon any ownership change in the cloud storage. Whenever an ownership change occurs in the ownership group of outsourced data, the data are re-encrypted with an immediately updated ownership group key, which is securely delivered only to the valid owners. Thus, the proposed scheme enhances data privacy and confidentiality in cloud storage against any users who do not have valid ownership of the data, as well as against an honest-but-curious cloud server. Tag consistency is also guaranteed, while the scheme still takes full advantage of efficient data deduplication over encrypted data. In terms of the communication cost, the proposed scheme is more efficient than the previous schemes, while in terms of the computation cost it takes an additional 0.1 to 0.2 ms compared to the RCE scheme, which is negligible in practice. Therefore, the proposed scheme achieves more secure and fine-grained ownership management in cloud storage for secure and efficient data deduplication.

9. This may happen when the unauthorized users possessed the data at some time instance and stored the derived key until the moment of the request; or, they could receive it from other colluders.