Files in this item

So Youn_Lee.pdf (application/pdf, 1MB)


Title: Analysis of the impact of sequencing errors on BLAST using fault injection
Author(s): Lee, So Youn
Advisor(s): Iyer, Ravishankar K.
Department / Program: Electrical & Computer Eng
Discipline: Electrical & Computer Engr
Degree Granting Institution: University of Illinois at Urbana-Champaign
Subject(s): sequencing error; fault injection; sequence alignment; Smith-Waterman algorithm
Abstract: This thesis investigates the impact of sequencing errors on post-sequencing computational analyses, including local alignment search and multiple sequence alignment. While the error rates of sequencing technology are commonly reported, the significance of these numbers cannot be fully grasped without putting them in the perspective of their impact on the downstream analyses used for biological research, forensics, disease diagnosis, and other applications. I quantified this impact using fault injection: faults were injected into the input sequence data, the analyses were run, and any change in the analysis output was interpreted as the impact of the faults, that is, as errors. Three commonly used tools were studied: BLAST, SSEARCH, and ProbCons. The main contributions of this work are the application of fault injection to reliability analysis in bioinformatics and the quantitative demonstration that a small error rate in the sequence data can significantly alter the output of the analysis.

BLAST and SSEARCH are both local alignment search tools, but BLAST is a heuristic implementation, while SSEARCH is based on the optimal Smith-Waterman algorithm. For these tools, the error rates were larger than the corresponding fault rates by one to two orders of magnitude, indicating that a small error rate in the sequence can drastically change the analysis output. False negative (FN) error rates were much larger than false positive (FP) rates. FNs are the more harmful of the two, because FPs can be controlled by more selective subsequent filtering, whereas a hit that is missed cannot be recovered by filtering. SSEARCH had a smaller overall standard deviation in its error rates. A small standard deviation matters because it makes the confidence in the output predictable from the quality of the input. As the cost of running optimal algorithms like SSEARCH has decreased with advances in computing technology, their use should be increasingly encouraged in order to obtain accurate results.

ProbCons is a multiple sequence alignment algorithm. Its errors were measured with the sum-of-pairs (SP) and true column (TC) scores and were defined with respect to BAliBASE, a benchmark for multiple sequence alignment algorithms. The results showed no significant correlation between the fault and error rates. Errors measured with SP scores remained of the same order of magnitude as the fault rate; errors measured with TC scores tended to be larger, but varied without correlation to the fault rate. Such randomness makes systematic improvement in multiple sequence alignment difficult. Optimizing the alignment with a single objective function, while the benchmark itself is aligned largely with human intervention, may therefore be a counterproductive approach to multiple sequence alignment.
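The fault-injection procedure described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: the substitution-only fault model over a DNA alphabet, the function names `inject_substitutions` and `hit_error_rates`, and the definition of FN and FP rates as set differences between the hit lists returned for the original and the fault-injected query. Running the search tool itself (BLAST or SSEARCH) is left out; the hit lists are placeholders.

```python
import random

DNA = "ACGT"  # assumed alphabet; the thesis may use other fault models or alphabets


def inject_substitutions(seq, fault_rate, rng):
    """Substitute each base independently with probability fault_rate.

    Returns the mutated sequence and the number of injected faults.
    """
    out = []
    n_faults = 0
    for base in seq:
        if rng.random() < fault_rate:
            # Pick a replacement that differs from the original base.
            out.append(rng.choice([b for b in DNA if b != base]))
            n_faults += 1
        else:
            out.append(base)
    return "".join(out), n_faults


def hit_error_rates(reference_hits, faulty_hits):
    """Compare hit lists from a fault-free run and a fault-injected run.

    FN rate: fraction of reference hits missing from the faulty run.
    FP rate: fraction of faulty-run hits absent from the reference run.
    (Hypothetical definitions chosen for this sketch.)
    """
    ref, faulty = set(reference_hits), set(faulty_hits)
    fn_rate = len(ref - faulty) / len(ref) if ref else 0.0
    fp_rate = len(faulty - ref) / len(faulty) if faulty else 0.0
    return fn_rate, fp_rate


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so an experiment can be repeated
    query = "ACGT" * 250
    mutated, n = inject_substitutions(query, fault_rate=0.01, rng=rng)
    print(f"injected {n} faults into {len(query)} bases")

    # Placeholder hit identifiers standing in for search-tool output.
    fn, fp = hit_error_rates(["a", "b", "c", "d"], ["a", "b", "e"])
    print(f"FN rate = {fn:.2f}, FP rate = {fp:.2f}")  # FN rate = 0.50, FP rate = 0.33
```

In a sketch like this, repeating the injection with different seeds at the same fault rate would give the spread of error rates that a standard-deviation comparison between tools relies on.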
Issue Date: 2013-08-22
Rights Information: Copyright 2013 So Youn Lee
Date Available in IDEALS: 2013-08-22
Date Deposited: 2013-08
