Files in this item



application/pdf: getafix.pdf (971kB)


Title: Getafix: Workload-aware distributed interactive analytics
Author(s): Ghosh, Mainak; Xu, Le; Qian, Xiaoyao; Kao, Thomas; Gupta, Indranil; Gupta, Himanshu
Subject(s): Data management; Workload aware; Lookback processing
Abstract: Distributed interactive analytics engines (Druid, Redshift, Pinot) need to achieve low query latency while using the least storage space. This paper presents a solution to the problem of replicating data blocks and routing queries. Our techniques decide the replication level of individual data blocks (based on popularity, i.e., access counts) and output optimal placement patterns for those data blocks. For the static version of the problem (given a set of queries accessing some segments), our techniques are provably optimal in both storage and query latency. For the dynamic version of the problem, we build a system called Getafix that dynamically tracks data block popularity, adjusts replication levels, dynamically routes queries, and garbage collects less useful data blocks. We implemented Getafix in Druid, the most popular open-source interactive analytics engine. Our experiments use both synthetic traces and production traces from Yahoo! Inc.’s production Druid cluster. Compared to existing techniques, Getafix either reduces the storage space used by up to 3.5x while achieving comparable query latency, or improves query latency by up to 60% while using comparable storage.
Issue Date: 2016-03-08
Genre: Technical Report
Date Available in IDEALS: 2016-03-08
