Automatic Software Performance Optimization on Modern Architectures
- Jiang, Changhao
- Issue Date
- computer architecture
- As computer architectures become more complex, writing efficient programs that make the best use of the underlying architecture becomes increasingly difficult and expensive. The traditional approach of expert manual tuning becomes infeasible as both software and hardware complexity grow. To make matters worse, the cost of human labor relative to that of machine computation is increasing rapidly. One approach to attacking the problem is automatic library generation via empirical evaluation. The essential idea is to have a meta-program automatically generate other high-performance programs via empirical evaluation and intelligent search. The methodology has been successfully applied in several application domains, such as numerical computing, signal processing, and sorting. This dissertation extends the automatic library generation methodology to emerging nontraditional computer architectures and to a more complex application domain. Specifically, it consists of two parts. First, it studies and implements an automatic matrix multiply library generator for graphics hardware, a specialized architecture with enormous computing power for graphics applications. Second, it uses machine learning techniques to automatically select the best algorithm for frequent pattern mining problems according to input characteristics.
To utilize the tremendous computing power of graphics hardware and to adapt automatically to the fast and frequent changes in its architecture and performance characteristics, this dissertation implements an automatic tuning system that generates high-performance matrix multiplication implementations on graphics hardware. The system uses a parameterized code generator to generate multiple versions of matrix multiplication, whose performance is empirically evaluated by actual execution on the target platform. An ad-hoc search engine is employed to search the implementation space for the version that yields the best performance. In contrast to similar systems on CPUs, which rely on tuning strategies such as cache blocking, register tiling, and instruction scheduling, the system identifies and exploits several tuning strategies that are unique to graphics hardware. These include optimizing for multiple render targets, using SIMD instructions with data packing, and overcoming limitations on instruction count and dynamic branch instructions. The generated implementations have performance comparable to expert manually tuned versions, in spite of the significant overhead incurred by the use of the high-level BrookGPU language.
Frequent pattern mining is a fundamental problem in data mining, and a large number of distinct algorithms have been proposed to solve it efficiently. However, no single algorithm outperforms all the others, since their relative performance depends heavily on the characteristics of the input data. In the dissertation, we present a machine-learning-based approach that selects the best frequent pattern mining algorithm based on the input characteristics. Three of the fastest publicly available algorithms, FP_Growth, LCM, and Eclat, were extensively evaluated using synthetic data sets. The results of these evaluations were used to train a support vector machine (SVM) prediction system, which is then used at runtime to predict the best mining algorithm for real-world data sets. Our experiments show that the runtime prediction overhead is negligible and that the trained SVM prediction system usually identifies the best algorithm. In cases of misprediction, the selected algorithm is still competitive in performance.
- Type of Resource
- Copyright and License Information
- You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format, BUT this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the University of Illinois at Urbana-Champaign Computer Science Department under terms that include this permission. All other rights are reserved by the author(s).
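
The empirical tuning loop summarized in the abstract (a parameterized code generator, timing of each generated version on the target machine, and a search for the fastest) can be illustrated with a minimal sketch. This is not the dissertation's system: it runs on the CPU in plain Python rather than on graphics hardware with BrookGPU, it tunes only a single tile-size parameter, and it replaces the ad-hoc search engine with an exhaustive scan. The names `generate_matmul`, `time_variant`, and `autotune` are hypothetical and chosen only for illustration.

```python
import random
import time


def generate_matmul(tile):
    """Hypothetical code generator: emit and compile a matmul for one tile size."""
    src = f'''
def matmul(a, b, n):
    c = [[0.0] * n for _ in range(n)]
    # Tile the i and k loops; every (i, k) pair is visited exactly once.
    for ii in range(0, n, {tile}):
        for kk in range(0, n, {tile}):
            for i in range(ii, min(ii + {tile}, n)):
                row_a, row_c = a[i], c[i]
                for k in range(kk, min(kk + {tile}, n)):
                    aik, row_b = row_a[k], b[k]
                    for j in range(n):
                        row_c[j] += aik * row_b[j]
    return c
'''
    namespace = {}
    exec(src, namespace)  # compile the generated source into a callable
    return namespace["matmul"]


def time_variant(fn, a, b, n, repeats=3):
    """Empirical evaluation: best wall-clock time over a few real runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(a, b, n)
        best = min(best, time.perf_counter() - start)
    return best


def autotune(n=48, tiles=(4, 8, 16, 48)):
    """Exhaustive search (an assumption; the dissertation uses an ad-hoc search)."""
    rng = random.Random(0)
    a = [[rng.random() for _ in range(n)] for _ in range(n)]
    b = [[rng.random() for _ in range(n)] for _ in range(n)]
    timings = {t: time_variant(generate_matmul(t), a, b, n) for t in tiles}
    best_tile = min(timings, key=timings.get)
    return best_tile, timings


if __name__ == "__main__":
    best_tile, timings = autotune()
    for tile, seconds in sorted(timings.items()):
        print(f"tile={tile:3d}  {seconds * 1e3:8.2f} ms")
    print("selected tile size:", best_tile)
```

The selected tile size depends on the machine the script runs on, which is the point of empirical tuning: the choice is measured on the target platform rather than predicted from a model of it.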