Files in this item



application/pdfOptimization Problems in Data Mining.pdf (928kB)
(no description provided)PDF


Title:Optimization Problems in Data Mining
Author(s):Heeren, Cinda
data mining
Abstract:One natural, yet unusual, source of data is the set of queries that are performed on a database. We consider such queries to be reflective of data access patterns and we use them to create indices on the data that are likely to be useful in minimizing the cost of answering future queries. We formalize the problem of finding these optimal indices under a constraint on the total amount of space available for storing them, we give strong negative and positive performance bounds, and we quantify the error in performance introduced by running the algorithm on a sample drawn from an unknown query distribution. We investigate the problem of finding optimized support association rules for a single numerical attribute, where the optimized region is a union of k disjoint intervals from the range of the attribute. We give the first polynomial time algorithm for the problem of finding such a region maximizing support and meeting a cumulative confidence threshold. Experiments demonstrate that the best algorithm for a more constrained version of the problem has performance degradation on both synthetic and real world data. We prove theoretical bounds on sufficient sample size to achieve a given performance level, and we validate convergence on synthetic and real-world data experimentally. We propose a natural greedy algorithm, and analyze its performance. We introduce a novel type of rule, wherein claims of the form ``our object ranked r or better in x of the last t time units,'' are formalized, and where maximal claims of this form are defined under two natural partial orders. For the first, we give an efficient and optimal algorithm for finding all such claims. For the second, we give an algorithm whose running time is significantly more efficient than that of a na\"ive one. Finally, we connect this boasting problem to that of finding a sequence of optimized confidence association rules, and give an efficient algorithm for solving a simplification of the problem.
Issue Date:2004-05
Genre:Technical Report
Other Identifier(s):UIUCDCS-R-2004-2401
Rights Information:You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format, BUT this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the University of Illinois at Urbana-Champaign Computer Science Department under terms that include this permission. All other rights are reserved by the author(s).
Date Available in IDEALS:2009-04-17

This item appears in the following Collection(s)

Item Statistics