Files in this item



application/pdfHASHEMI-DISSERTATION-2020.pdf (4MB)
(no description provided)PDF


Title:Timed execution in distributed machine learning
Author(s):Hashemi, Hadi
Director of Research:Campbell, Roy H
Doctoral Committee Chair(s):Campbell, Roy H
Doctoral Committee Member(s):Kindratenko, Volodymyr; Gropp, William D; Godfrey, Philip B; Diamos, Gregory F
Department / Program:Computer Science
Discipline:Computer Science
Degree Granting Institution:University of Illinois at Urbana-Champaign
Subject(s):Machine Learning Systems, Distributed Systems, Deep Learning, Deep Neural Networks, TensorFlow, Scheduling
Abstract:Deep learning powers many transformative core technologies including Autonomous Driving, Natural Language Translation, and Automatic Medical Diagnosis. Its exceptional ability to extract intricate structures from high-dimensional data takes the credit for major advances in machine learning. Essential ingredients that make Deep Learning possible include: the availability of a massive curated data, a well-designed model, and readily available high-performance computation. The computation used in training deep neural networks has doubled every 3.4 months since 2012, five times faster than Moore's law. Fulfilling this massive computational demand that has long outgrown the capability of a single high-end node is vital to keep extending the flow of innovations. For example, in 2018, the AlphaGoZero model trained with 1.89 ExaFlops/s times a day. The state-of-the-art GPU at the time, NVidia V-100, could only deliver 125 TeraFlops. In a meanwhile, Summit, the fastest supercomputer in the world, could sustain 1 ExaFlops/s using 27,360 NVidia V-100 GPUs through distributed computation. This dissertation studies the challenges of scaling out an ML job to keep up with the computational demand, specifically the problems stemmed from the complex interplay of various resources in the distributed ML. In the first stage of this research, we developed methods and systems to properly observe a distributed ML environment by tracing a distributed execution, visualizing the results, and expanding the observability to the production infrastructure. Later we developed three systems to address scalability challenges using these methods and systems based on a precise execution timing of the spectrum of resources: Network: TicTac reduces internal computation delays by enforcing a near-optimal order on network transfers, which results in up to 37.7% throughput increases. Computation: Caramel increases the network utilization and decreases network congestion by modifying the order of computation and choosing the most fitted collective primitive for the workload. This result in cutting the training time up by a factor of up to $3.8$. While computation and network scheduling suggest an order of execution, TimedRPC addresses the issue of correctly enforcing this order by implementing a priority-based scheduling through pre-emption where an on-going transfer can be paused when a transfer with higher priority is requested. I/O: Diot maximizes the I/O throughput by tuning knobs such as number of concurrent I/O requests and read size on I/O pipeline. Additionally it detects the I/O delivery unfairness which may causes a struggling worker due to slow I/O through. Thesis Statement: Heuristic timing of distributed machine learning execution leads to utilization optimizations for computation, network, and storage, which in turn improves the overall throughput. In general, any multi-resource optimization involving parallelism is at least NP-Hard.
Issue Date:2020-05-08
Rights Information:Copyright 2020 Sayed Hadi Hashemi
Date Available in IDEALS:2020-08-26
Date Deposited:2020-05

This item appears in the following Collection(s)

Item Statistics