Chapter 2Literature Survey and Problem DefinitionThis chapter provides the related work done so far in this section giving some preliminary definitions that will be used during this dissertation and focusing on conceptual aspects of distributed association rule mining systems. Section 2.1 gives a description of existing techniques and their methodology for distributed data mining. Section 2.2 describes the problem definition and objectives for secure distributed data mining system.2.1 Literature SurveyHere surveys of various existing approaches for secure distributed data mining are observed. Data mining technology has tremendous applications in the field of identifying patterns among tremendous quantities of data.
Data mining techniques and data ware- housing are used in almost every application of database, most of the data mining tools operate by collecting all the data into a centralize site, then applying data mining algorithm on that data. However, privacy concerns can prevent building a centralized warehouse, In case of distributed system data may be distributed among several custodians, none of which are allowed to transfer their data to another site.
Here homogeneous databases are assumed. All sites have the same schema, but each site has distributed information. The goal is to produce association rules that hold globally, while limiting the information shared from each site.Previous work in data mining has considered two settings. In first setting the data owner and the miner are two different entities, and second, which consist of data distribution among several sites or players, these players or sites jointly performs mining on the data held by those sites or players. Kantarcioglu and Clifton [8] have proposed the protocol for securely computing union of each private subsets held by the different sites. The private subset of a given site includes the itemsets which are s-frequent in his own database. This implementation of is costly and its implementation depends upon cryptographic techniques such as commutative encryption, oblivious transfer.Yao [9] proposed the protocol for securely computing the union of private sub- sets at each site. The authors proposed a multi-party computation, which is the costly part of the system and in its implementation cryptographic techniques like encryption, decryption, commutative encryption, and hash functions are used. The use of such cryptographic techniques improves communication cost and computation cost. In the existing systems discussed so far these techniques causes some leakage of information. Thus there is need for preventing information leakage. The proposed system overcomes this problem of information leakage.In the existing systems [1], [5] the protocol for securely computing the union of private subsets at each site in the transaction is suggested. Here a multi party computation is considered and in it’s implementation cryptographic techniques like encryption, decryption, commutative encryption, and hash functions are used. In these systems it is hard to mine association rules through security assumptions in addition it reveals the data during the mining process. It is not possible to mine globally valid results from distributed data without revealing private information. Secure distributed association rule mining is costly in terms of computational cost and communication.In UNIFY-KC algorithm the fake itemset is added and then removed from item- sets [1] . It adds overhead in computation, where as this overhead is reduced in AES algorithm [6-8]. In this paper for experimentation the data has been partitioned horizon- tally so that it can be distributed on different sites. Data partitioning techniques are suitable for dealing with the problems in handling large data set. Round robin partitioning, range partitioning and hash partitioning are some of available horizontal data partitioning techniques [14]. Round robin is the partitioning strategy that partitions data set with balanced class distribution. Using RR data skew is also reduced.Kantarcioglu and Clifton [6] implemented the sub-protocol for the secure computation of the combination or union of private subsets which are held by the different players or sites. (The private subset of a given player or site, includes the itemsets that are s-frequent in his partial database.)The use of such cryptographic techniques improves communication cost and computation cost. In the existing system these cryptographic techniques are used it causes some leakage of information, therefore it is not perfectly secure. Thus the union of private subsets is not perfectly calculated, so the system is proposed to overcome with this problem.In the existing system the protocol for securely computing the union of private subsets at each site in the transaction is implemented. Thus knowledge discovery on this high volume data becomes slow. In these systems it is hard to mine association rules through security assumptions in addition it reveals the data during the mining process. As discussed earlier it is impossible to mine globally valid results from distributed data without disclosing private information. Secure distributed association rule mining is costly in terms of computational cost and communication.As per Alex and Freitas [7] the speedup of data mining system improves form O(n) to O(n/k) through implementation of distributed approach. Also due to use of distributed or parallel association rule mining techniques the speedup can be achieved. As Rakesh Agrawal [8] proposed mining of association rules. 2.2 Problem Definition2.2.1 Problem statement The aim is to design an algorithm to enable handling of large data sets using available computing facilities. To design the algorithms to accelerate a mining process on large data sets. Implementation of secure Distributed data mining algorithm. To design secure multi-party algorithms that computes the union of private subsets that each of the interacting sites hold, and to design algorithm to test the inclusion of an element held by one site in a subset held by another.2.2.2 Objective Implementation of Distributed data mining algorithm for mining on large data set. Acquiring speedup in computation process by utilizing resources available in distributed system. Provide security in distributed computing environment. To design a system to improve simplicity, efficiency and privacy. To design a system that does not depend on commutative encryption and oblivious transfer.2.2.3 Proposed SystemIn the proposed system the problem of secure computation of union of private subsets of sites is addressed. In the proposed system the database is distributed horizontally among various sites in transaction. Round robin technique is used for Horizontal distribution of Data sets to reduce the data skew. 2.2.4 Statement of the Scope Advances in computers and their use in all walks of life has resulted in tremendous growth of data. This data is available from transactions due to computerization of banking, businesses, hospitals and other government organizations. This explosion of data needs new techniques to transfer this data into valuable knowledge. The business need for tools to handle the data flood has risen piercingly in the last years to extract useful data patterns that can be used to assist doctors in better understanding effects of a treatment, analyze a scientific experimentation; improve a business process with data mining to reduce costs, enhance research and increase sales. It also helps businesses in detecting fraud, measuring and improving program performance. The proposed system is concerned with maintaining security while mining of association rules in distributed database, where input is transaction database and output will be set of association rules. Figure 2.1: Proposed system including AES encryption and decryption techniques