Abstract: Data mining is an emerging research area whose goal is to discover potentially useful information embedded in databases. Due to the wide availability of huge amounts of data and the pressing need to turn such data into useful knowledge, data mining has attracted a great deal of attention in recent years.
Frequent pattern mining has been a focal topic in data mining research. Its goal is to discover the patterns whose frequency of occurrence in a dataset exceeds a predefined threshold. Depending on how a pattern is defined, frequent pattern mining encompasses a variety of problems, such as frequent itemset mining and sequential pattern mining. It has numerous applications, including the analysis of customer purchase patterns, web access patterns, natural disasters or alarm sequences, disease treatments and DNA sequences. Many algorithms have been proposed for mining frequent patterns since the problem was introduced. Most of them are sequential algorithms, and their execution time is constrained by the computing power, I/O speed and/or memory space of the uni-processor system on which they run. The irregular nature of the datasets and the need to reach global decisions make it challenging to mine frequent patterns efficiently in parallel.
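As a minimal illustration of the frequent-itemset variant of the problem, the following sketch counts itemsets in a toy transaction database and keeps those meeting an absolute support threshold. The dataset and names are hypothetical; this naive enumeration is for illustration only and is not one of the algorithms studied in this dissertation.

```python
from itertools import combinations
from collections import Counter

# Toy transaction database (hypothetical example).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
min_support = 3  # absolute support threshold

def frequent_itemsets(db, min_sup, max_size=2):
    """Count every itemset up to max_size and keep those whose
    support (number of containing transactions) meets min_sup."""
    counts = Counter()
    for t in db:
        for k in range(1, max_size + 1):
            for itemset in combinations(sorted(t), k):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_sup}

# {bread} and {milk} occur in 4 transactions, {bread, milk} in 3,
# so all three survive a threshold of 3.
print(frequent_itemsets(transactions, min_support))
```

Efficient miners avoid this brute-force enumeration (e.g. by pruning via the property that every subset of a frequent itemset is itself frequent), but the support-threshold semantics are the same.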
In this dissertation, we propose a framework for parallel mining of frequent patterns that targets distributed-memory systems. We applied it to the parallelization of frequent itemset mining, sequential pattern mining and closed-sequential pattern mining, basing our parallel algorithms on the most efficient serial algorithms. We devised a partitioning strategy that distributes the work so as to avoid unnecessary inter-processor communication. Our task partition exploits the divide-and-conquer property of the frequent pattern mining problem, so the processors can work asynchronously and independently during mining. We discuss the load imbalance inherent in this divide-and-conquer decomposition and present a sampling technique, called selective sampling, to address it. Selective sampling allows us to predict the relative time required to mine each projection, enabling us to identify large tasks, decompose them, and distribute them evenly across processors to achieve load balance.
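The partition-and-balance idea can be sketched schematically: given estimated per-task mining costs (in practice predicted from a sample), oversized tasks are decomposed and the pieces are assigned greedily, largest first, to the least-loaded processor. This is an illustrative stand-in under assumed inputs, not the dissertation's actual implementation; all names are hypothetical.

```python
import heapq

def balance_tasks(task_costs, num_procs, split_threshold=None):
    """Decompose tasks whose estimated cost exceeds a threshold
    (a schematic stand-in for decomposing a large projection),
    then assign pieces largest-first to the least-loaded processor."""
    if split_threshold is None:
        split_threshold = sum(task_costs) / num_procs
    # Step 1: split any task larger than the threshold into pieces.
    pieces = []
    for c in task_costs:
        while c > split_threshold:
            pieces.append(split_threshold)
            c -= split_threshold
        if c > 0:
            pieces.append(c)
    # Step 2: greedy longest-processing-time-first assignment,
    # tracking processor loads in a min-heap.
    loads = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(loads)
    assignment = {p: [] for p in range(num_procs)}
    for c in sorted(pieces, reverse=True):
        load, p = heapq.heappop(loads)
        assignment[p].append(c)
        heapq.heappush(loads, (load + c, p))
    return assignment

# One dominant task of cost 10 next to several small ones still
# yields two processors with equal total load of 10 each.
print(balance_tasks([10, 3, 3, 2, 2], num_procs=2))
```

The key property mirrored here is that cost estimates let the scheduler break up dominant tasks before assignment, rather than discovering the imbalance at run time.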
We implemented parallel algorithms for mining frequent itemsets, sequential patterns and closed-sequential patterns following our framework, and conducted a comprehensive performance study on both synthetic and real-world datasets. The experimental results show that our parallel algorithms achieve good speedups on various datasets, and that the speedups scale up to 64 processors on our 64-processor system.