|Abstract:||The world sees a proliferation of deep learning (DL) models and their wide adoption in different application domains. This has made the performance benchmarking, understanding, and optimization of DL inference an increasingly pressing task for both hardware designers and system providers, as they would like to offer the best possible computing system to serve DL models with the desired latency, throughput, and energy requirements while maximizing resource utilization. However, DL faces the following challenges in performance engineering.
Benchmarking — While there have been significant efforts to develop benchmark suites that evaluate widely used DL models, developing, maintaining, and running benchmarks takes a non-trivial amount of effort, and DL benchmarking has been hampered in part due to the lack of representative and up-to-date benchmarking suites.
Performance Understanding — Understanding the performance of DL workloads is challenging as their characteristics depend on the interplay between the models, frameworks, system libraries, and the hardware (or the HW/SW stack). Existing profiling tools are disjoint, however, and only focus on profiling within a particular level of the stack. This largely limits the types of analysis that can be performed on model execution.
Optimization Advising — The current DL optimization process is manual and ad-hoc that requires a lot of effort and expertise. Existing tools lack the highly desired abilities to characterize ideal performance, identify sources of inefficiency, and quantify the benefits of potential optimizations. Such deficiencies have led to slow DL characterization/optimization cycles that cannot keep up with the fast pace at which new DL innovations are introduced.
Evaluation and Comparison — The current DL landscape is fast-paced and is rife with non-uniform models, hardware/software (HW/SW) stacks, but lacks a DL benchmarking platform to facilitate evaluation and comparison of DL innovations, be it models, frameworks, libraries, or hardware. Due to the lack of a benchmarking platform, the current practice of evaluating the benefits of proposed DL innovations is both arduous and error-prone — stifling the adoption of the innovations.
This thesis addresses the above challenges in DL performance engineering. First we introduce DLBricks, a composable benchmark generation design that reduces the effort of developing, maintaining, and running DL benchmarks. DLBricks decomposes DL models into a set of unique runnable networks and constructs the original model’s performance using the performance of the generated benchmarks. Then, we present XSP, an across-stack profiling design that correlates profiles from different sources to obtain a holistic and hierarchical view of DL model execution. XSP innovatively leverages distributed tracing and accurately capture the profiles at each level of the HW/SW stack in spite of profiling overhead. Next, we propose Benanza, a systematic DL benchmarking and analysis design that guides researchers to potential optimization opportunities and assesses hypothetical execution scenarios on GPUs. Finally, we design MLModelScope, a consistent, reproducible, and scalable DL benchmarking platform to facilitate evaluation and comparison of DL innovations. This thesis also briefly discusses TrIMS, TOPS, and CommScope which are developed based on the needs observed from the performance benchmarking and optimization work to solve relevant problems in the DL domain.