Files in this item

Files:SP21-ECE499-Thesis-Jin, Xin.pdf (872kB), restricted to U of Illinois
Description:(no description provided)
Format:PDF (application/pdf)

Description

Title:Speech interruption detection for live-streaming audio
Author(s):Xin, Jin
Contributor(s):Patel, Sanjay
Degree:B.S. (bachelor's)
Genre:Thesis
Subject(s):speech interruption detection
live streaming
support vector machine
k-nearest neighbor
multilayer perceptron
mean opinion score
Abstract:Conversation is an important human activity in which multiple people naturally take turns speaking. An interruption occurs when one speaker talks over another, intentionally or unintentionally. Frequent interruptions can significantly degrade the experience and greatly reduce the efficiency of a conversation, and they can happen more often in live-streamed audio calls with significant internet delays. Detecting interruptions can help live-streaming companies that care about their quality of service. It can also support audio preprocessing and labeling for speech-to-text models and the estimation of conflict level in a debate. This project aims to assess the quality of interrupted speech in live-streamed audio. The task of interruption detection was divided into two steps: generating a simulated dataset of interrupted speech audio and building machine learning models for interruption detection. The dataset was created synthetically by concatenating and overlapping speech recordings with different interruption times and latency times. Interruption detection performance was evaluated with a k-nearest neighbor classifier, a support vector machine (SVM) classifier, and a multilayer perceptron model. Each model takes a 0.5 s audio segment as an input array and predicts whether that segment contains interrupted speech. The results show that the SVM model appears effective at detecting interrupted speech in conversational audio, with an accuracy of 92.61% in cross-validation on the training data and 72.62% on unseen data.
Issue Date:2021-05
Genre:Dissertation / Thesis
Type:Text
Language:English
URI:http://hdl.handle.net/2142/110308
Date Available in IDEALS:2021-08-11
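
The two steps named in the abstract (overlapping speech recordings at a chosen interruption time, then classifying 0.5 s segments) can be illustrated with a minimal sketch. This is not the thesis's code: the 16 kHz sampling rate, the function names `overlap_speakers` and `segment_and_label`, the use of random noise in place of real speech, and the rule that a segment is "interrupted" if any sample of it falls in the overlap region are all assumptions made for illustration. Network latency, which the abstract also varies, is not modeled here.

```python
import numpy as np

SAMPLE_RATE = 16_000     # assumed sampling rate in Hz (not stated in the abstract)
SEGMENT_SECONDS = 0.5    # segment length stated in the abstract


def overlap_speakers(first, second, interrupt_at_s, sample_rate=SAMPLE_RATE):
    """Mix `second` into `first`, starting `interrupt_at_s` seconds in.

    Returns the mixed signal and a per-sample boolean mask that is True
    where the two clips overlap in time (hypothetical labeling rule).
    """
    start = int(round(interrupt_at_s * sample_rate))
    total = max(len(first), start + len(second))
    mixed = np.zeros(total, dtype=np.float64)
    mixed[:len(first)] += first
    mixed[start:start + len(second)] += second

    overlap = np.zeros(total, dtype=bool)
    # Empty slice (no overlap) if the second clip starts after the first ends.
    overlap[start:min(len(first), start + len(second))] = True
    return mixed, overlap


def segment_and_label(mixed, overlap, sample_rate=SAMPLE_RATE,
                      seconds=SEGMENT_SECONDS):
    """Cut the signal into fixed-length segments, one binary label each.

    Trailing samples that do not fill a whole segment are dropped.
    """
    seg_len = int(sample_rate * seconds)
    n_segments = len(mixed) // seg_len
    usable = n_segments * seg_len
    X = mixed[:usable].reshape(n_segments, seg_len)
    y = overlap[:usable].reshape(n_segments, seg_len).any(axis=1).astype(int)
    return X, y


# Stand-in "speech": 2 s and 1 s of random noise instead of real recordings.
rng = np.random.default_rng(0)
speaker_a = rng.normal(size=2 * SAMPLE_RATE)
speaker_b = rng.normal(size=1 * SAMPLE_RATE)

# Speaker B starts talking 1.5 s into speaker A's turn.
mixed, overlap = overlap_speakers(speaker_a, speaker_b, interrupt_at_s=1.5)
X, y = segment_and_label(mixed, overlap)
print(X.shape, y.tolist())
```

Each row of `X` is one 0.5 s segment and `y` marks whether it contains overlapped speech, which is the input/label shape a k-nearest neighbor, SVM, or multilayer perceptron classifier would be trained on.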

