Abstract: Effective management of moving object data, originating in supply chain operations, road network monitoring, and other RFID applications, is a major challenge facing society today, with important implications for business optimization, city planning, privacy, and national security. Toward a solution to this problem, I have developed a comprehensive framework for warehousing, mining, and cleaning large moving object data sets.
The proposed framework addresses the following key challenges present in object tracking applications: (1) Data sets are massive: a single large retailer may generate terabytes of moving object data per day. (2) Data is usually dirty: many tags are not detected at all, or are detected at the wrong location. (3) Dimensionality is very large: there are spatio-temporal dimensions defined by object trajectories, sensor-related dimensions such as temperature or humidity recorded at different locations, and item-level dimensions describing the attributes of each object. (4) Data analysis and mining need to navigate and discover interesting patterns at different levels of abstraction, involving a large number of interrelated records in multiple data sets.
At the core of my dissertation is the RFID data warehousing engine. It receives clean data from the cleaning engine and provides highly compressed data, at multiple levels of abstraction, to the mining engine. The mining engine is composed of three modules. The first mines commodity flow patterns that identify general flow trends and significant flow exceptions in a large supply chain operation. The second makes route recommendations based on observed driving behavior and traffic conditions. The third discovers and characterizes a wide variety of traffic anomalies on a road network.
RFID Data Warehousing. A data warehouse is an enterprise-level data repository that collects and integrates organizational data in order to provide decision support analysis. At the core of the data warehouse is the data cube, which computes an aggregate measure (e.g., sum, avg, count) for all possible combinations of dimensions of a fact table (e.g., sales for 2004, in the northeast). Online analytical processing (OLAP) operations provide the means for exploration and analysis of the data cube. My research in this direction has extended the data cube to handle moving object data sets, by significantly compressing such data and proposing a new aggregation mechanism that preserves its path structure. The RFID warehouse is built around the concept of the movement graph, which records both spatio-temporal and item-level information in a compact model. We show that compression and query processing efficiency can be significantly improved by partitioning the movement graph around gateway nodes, which are special locations connecting different spatial regions in the graph.
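To make the conventional data cube concrete, the following minimal Python sketch sums a measure over every combination of dimensions of a small fact table. It illustrates only the classical cube described above, not the path-preserving aggregation or the movement graph of the dissertation; the function name `data_cube` and the sample `sales` rows are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def data_cube(rows, dimensions, measure):
    """Sum `measure` for every subset of `dimensions` (one cuboid per subset)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            cuboid = defaultdict(float)
            for row in rows:
                # Group rows by their values on the chosen dimensions.
                key = tuple(row[d] for d in dims)
                cuboid[key] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


# Hypothetical fact table, invented for this example.
sales = [
    {"year": 2004, "region": "northeast", "amount": 10.0},
    {"year": 2004, "region": "south", "amount": 5.0},
    {"year": 2005, "region": "northeast", "amount": 7.0},
]
cube = data_cube(sales, ["year", "region"], "amount")
```

Here `cube[("year", "region")][(2004, "northeast")]` holds the sales for 2004 in the northeast, while `cube[()][()]` holds the grand total. The number of cuboids grows exponentially with the number of dimensions, which is one reason compression matters for high-dimensional moving object data.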
RFID Data Cleaning. Efficient and accurate data cleaning is essential to the successful deployment of RFID-based applications such as object tracking and inventory management systems. Most existing data cleaning approaches do not consider the overall cost of cleaning in an environment that possibly includes thousands of readers and millions of tags. We propose a cleaning framework that takes an RFID data set and a collection of cleaning methods, with associated costs, and induces a cleaning plan that optimizes the overall accuracy-adjusted cleaning costs. The cleaning plan determines the conditions under which inexpensive cleaning methods can be safely applied, the conditions under which more expensive methods are absolutely necessary, and the cases in which a combination of several methods is the optimal policy. Through a variety of experiments we show that our framework can achieve better accuracy than any single technique, at a fraction of the cost.
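One way to picture a cost-aware cleaning plan is the sketch below, which picks, for each data context, the method with the lowest accuracy-adjusted cost. The cost formula (method cost plus a penalty proportional to the expected error rate), the context labels, the method names, and all numbers are assumptions made for illustration; the abstract does not specify how the framework defines or optimizes this quantity, and combinations of methods are not modeled here.

```python
def accuracy_adjusted_cost(cost, accuracy, error_penalty):
    """Assumed formula: method cost plus a penalty for expected errors."""
    return cost + error_penalty * (1.0 - accuracy)


def induce_plan(profiles, error_penalty):
    """Map each context to the method with the lowest adjusted cost.

    `profiles` maps context -> {method name: (cost, accuracy)}.
    """
    plan = {}
    for context, methods in profiles.items():
        plan[context] = min(
            methods,
            key=lambda m: accuracy_adjusted_cost(
                methods[m][0], methods[m][1], error_penalty
            ),
        )
    return plan


# Hypothetical per-context (cost, accuracy) profiles for two methods.
profiles = {
    "low_noise": {"smoothing": (1.0, 0.98), "bayesian": (10.0, 0.99)},
    "high_noise": {"smoothing": (1.0, 0.60), "bayesian": (10.0, 0.95)},
}
plan = induce_plan(profiles, error_penalty=100.0)
```

With these made-up numbers the cheap method is chosen where it is already accurate and the expensive method only where the cheap one fails, which mirrors the trade-off described above.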
Mining Flow Trends. An important application of moving objects is mining movement patterns of objects in supply chain operations. In this context, one may ask questions regarding correlations between time spent at quality control locations and laptop return rates, salient characteristics of dairy products discarded from stores, or ships that spent abnormally long periods at intermediate ports before arrival. The gigantic size of such data and the diversity of queries over flow patterns pose great challenges to traditional workflow induction and analysis technologies, since processing may involve retrieval and reasoning over a large number of interrelated tuples through different stages of object movements. Creating a complete workflow that records all possible commodity movements and that incorporates time would be prohibitively expensive, since there can be billions of different location and time combinations. I propose the FlowGraph, a compressed probabilistic workflow that captures the general flow trends and significant exceptions of a data set. The FlowGraph achieves compression by recording the set of major flow trends and the set of non-redundant flow exceptions (i.e., abnormal transitions or durations) present in the data. I extended the concept of the FlowGraph to incorporate multiple levels of abstraction of object and path characteristics, and defined the FlowCube, which is a data cube that records FlowGraphs as measures and allows OLAP reasoning on object flows.
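The core idea of a probabilistic workflow can be sketched as a table of transition probabilities between locations, estimated from observed paths. The sketch below is a simplification under stated assumptions: it ignores durations and flow exceptions, which the FlowGraph also records, and the function name `transition_probabilities` and the sample paths are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(paths):
    """Estimate P(next location | current location) from observed paths."""
    counts = defaultdict(Counter)
    for path in paths:
        # Count each consecutive pair of locations along the path.
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }


# Hypothetical item paths through a supply chain.
paths = [
    ["factory", "warehouse", "store"],
    ["factory", "warehouse", "store"],
    ["factory", "warehouse", "returns"],
    ["factory", "store"],
]
probs = transition_probabilities(paths)
```

In this toy data, three of the four items move from the factory to the warehouse, so that transition is the general trend; a path whose transitions deviate sharply from these probabilities would be a candidate exception.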
Mining Route Recommendations. Modern highway networks provide several mechanisms for automatic vehicle identification. The most common are toll collection transponders, which detect vehicles at multiple points in the network, and cameras that automatically identify license plates. Such information reveals patterns that are valuable to online navigation systems and route planning applications. Most existing route planning applications use a fastest path algorithm based on static or dynamic models of road speeds, but such models in general disregard observed driver behavior and other important factors such as weather, car-pool availability, or vehicle type. Existing solutions may, for example, provide a route that is the fastest one but that goes through a high-crime area and is thus avoided by experienced drivers. We propose a traffic-mining-based path-finding method that mines speed and driving models from historical traffic data, and uses them to compute fast routes that are well supported by historical driving behavior under the set of relevant driving and traffic conditions.
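As a rough illustration of combining speed with observed driving behavior, the sketch below runs Dijkstra's algorithm over only those road segments whose historical usage share meets a minimum support threshold. This is a swapped-in simplification, not the dissertation's path-finding method: the support filter, the function name `fastest_supported_route`, and the sample road segments are all assumptions.

```python
import heapq


def fastest_supported_route(edges, source, target, min_support):
    """Dijkstra over edges whose historical support is at least `min_support`.

    `edges` maps (u, v) -> (travel minutes, share of drivers observed using it).
    Returns (total minutes, route) or None if no supported route exists.
    """
    graph = {}
    for (u, v), (minutes, support) in edges.items():
        if support >= min_support:
            graph.setdefault(u, []).append((v, minutes))

    queue = [(0.0, source, [source])]
    settled = set()
    while queue:
        total, node, route = heapq.heappop(queue)
        if node == target:
            return total, route
        if node in settled:
            continue
        settled.add(node)
        for nxt, minutes in graph.get(node, []):
            if nxt not in settled:
                heapq.heappush(queue, (total + minutes, nxt, route + [nxt]))
    return None


# Hypothetical segments: A-B-D is fastest but rarely driven.
edges = {
    ("A", "B"): (5.0, 0.02),
    ("B", "D"): (5.0, 0.02),
    ("A", "C"): (7.0, 0.60),
    ("C", "D"): (6.0, 0.55),
}
unfiltered = fastest_supported_route(edges, "A", "D", min_support=0.0)
supported = fastest_supported_route(edges, "A", "D", min_support=0.1)
```

Without the filter the search returns the nominally fastest route through B; with a support threshold of 0.1 it returns the slightly slower route through C that drivers actually use.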
Mining Traffic Anomalies. Identification and characterization of traffic anomalies on massive road networks is a vital component of traffic monitoring. Anomaly identification can be used to reduce congestion, increase safety, and provide transportation engineers with better information for traffic forecasting and road network design. However, due to the size, complexity and dynamics of such transportation networks, it is challenging to automate the process. We propose a multi-dimensional mining framework that can be used to identify a concise set of anomalies from massive traffic monitoring data, and further overlay, contrast, and explore such anomalies in multi-dimensional space.