Enhancing mid-level fusion with attention-based dual labeling for multimodal emotion recognition
Vasudeva, Sachit
This item's files can only be accessed by the System Administrators group.
Permalink
https://hdl.handle.net/2142/129791
Description
- Title
- Enhancing mid-level fusion with attention-based dual labeling for multimodal emotion recognition
- Author(s)
- Vasudeva, Sachit
- Issue Date
- 2025-05-09
- Director of Research (if dissertation) or Advisor (if thesis)
- Kim, Inki
- Department of Study
- Industrial and Enterprise Systems Engineering
- Discipline
- Industrial Engineering
- Degree Granting Institution
- University of Illinois Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Affective Computing, Cross-Modal Attention, Emotion Recognition
- Abstract
- Emotion recognition plays a pivotal role in human-computer interaction by enabling systems to interpret and adapt to users’ affective states. Traditional models typically rely on discrete categorical labels, which oversimplify the ambiguity and subjectivity inherent in human emotional expressions. This rigid labeling constrains model generalizability, especially in applications such as healthcare, social robotics, and virtual assistants, where nuanced understanding is essential. To address these limitations, this study introduces a probabilistic multimodal emotion recognition framework that combines attention-based mid-level fusion with dual-label learning. The model integrates audio embeddings from wav2vec2.0 and facial features from ResNet50-Face using a cross-modal attention mechanism that dynamically reweights modality contributions, enabling richer cross-modal interactions compared to early or late fusion strategies. Crucially, the framework incorporates a dual-label learning paradigm to jointly model self-reported (actor-intended) and observer-perceived emotions using soft probabilistic labels. A temperature-scaled softmax formulation captures uncertainty in emotion perception and improves interpretability by modeling distributions over emotion categories instead of committing to hard labels. Evaluation on the RAVDESS dataset, augmented through techniques such as pitch shifting, noise injection, and occlusion simulation, demonstrates the framework’s effectiveness. The model achieves 80.1% accuracy and 0.79 macro-F1 score, outperforming unimodal and simple fusion baselines. Additionally, it achieves a substantially lower KL divergence (0.51) and improved Expected Calibration Error (4.7%), indicating better alignment with human-perceived emotion distributions and improved prediction confidence calibration. These results highlight the promise of probabilistic multimodal learning for affective computing. By explicitly modeling uncertainty and leveraging both actor and observer perspectives, the proposed framework enables more robust, interpretable, and empathetic emotion-aware AI—paving the way for adaptive, emotionally intelligent systems in high-stakes domains.
- Graduation Semester
- 2025-05
- Type of Resource
- Thesis
- Handle URL
- https://hdl.handle.net/2142/129791
- Copyright and License Information
- Copyright 2025 Sachit Vasudeva
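
To make the pipeline described in the abstract more concrete, the sketch below shows one way the attention-based mid-level fusion and the temperature-scaled, dual-label (actor-intended vs. observer-perceived) outputs could be wired together in PyTorch. This is not the author's implementation: the feature dimensions (768 for wav2vec 2.0, 2048 for ResNet50-Face), head counts, pooling, equal loss weighting, and all names (`CrossModalFusionClassifier`, `dual_label_loss`) are assumptions made purely for illustration.

```python
# Illustrative sketch only; the thesis's actual architecture is not reproduced here.
# Dimensions, head counts, and the dual-head layout are assumptions chosen to mirror
# the abstract: wav2vec 2.0 audio features, ResNet50-Face visual features,
# cross-modal attention fusion, and temperature-scaled soft-label outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalFusionClassifier(nn.Module):
    def __init__(self, audio_dim=768, face_dim=2048, fused_dim=256,
                 num_emotions=8, temperature=2.0):
        super().__init__()
        # Project each modality into a shared space before attention (mid-level fusion).
        self.audio_proj = nn.Linear(audio_dim, fused_dim)
        self.face_proj = nn.Linear(face_dim, fused_dim)
        # Cross-modal attention: audio tokens query facial tokens and vice versa.
        self.audio_to_face = nn.MultiheadAttention(fused_dim, num_heads=4, batch_first=True)
        self.face_to_audio = nn.MultiheadAttention(fused_dim, num_heads=4, batch_first=True)
        # Two heads: self-reported (actor-intended) and observer-perceived emotions.
        self.actor_head = nn.Linear(2 * fused_dim, num_emotions)
        self.observer_head = nn.Linear(2 * fused_dim, num_emotions)
        self.temperature = temperature

    def forward(self, audio_feats, face_feats):
        # audio_feats: (batch, T_audio, audio_dim); face_feats: (batch, T_face, face_dim)
        a = self.audio_proj(audio_feats)
        f = self.face_proj(face_feats)
        # Each modality attends to the other, dynamically re-weighting contributions.
        a_attn, _ = self.audio_to_face(query=a, key=f, value=f)
        f_attn, _ = self.face_to_audio(query=f, key=a, value=a)
        # Pool over time and concatenate the attended representations.
        fused = torch.cat([a_attn.mean(dim=1), f_attn.mean(dim=1)], dim=-1)
        # Temperature-scaled softmax yields soft distributions over emotion categories.
        actor_probs = F.softmax(self.actor_head(fused) / self.temperature, dim=-1)
        observer_probs = F.softmax(self.observer_head(fused) / self.temperature, dim=-1)
        return actor_probs, observer_probs


def dual_label_loss(actor_probs, observer_probs, actor_soft, observer_soft):
    # KL divergence against soft (probabilistic) labels for both perspectives;
    # equal weighting of the two terms is an assumption made for this sketch.
    kl_actor = F.kl_div(actor_probs.log(), actor_soft, reduction="batchmean")
    kl_observer = F.kl_div(observer_probs.log(), observer_soft, reduction="batchmean")
    return kl_actor + kl_observer
```

A training step under these assumptions would pass batched audio and facial feature sequences through the model and minimize `dual_label_loss` against the soft actor and observer label distributions; the temperature controls how sharply the predicted distributions commit to a single emotion category.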
Owning Collections
Graduate Dissertations and Theses at Illinois (Primary)