Multimodal emotion recognition has garnered significant attention for its ability to integrate complementary information from multiple modalities and thereby improve recognition performance. However, physiological signals such as the electroencephalogram (EEG) are more challenging to acquire than visual data due to higher collection costs and complexity, which limits the practical application of multimodal networks. To address this issue, this paper proposes a cross-modal knowledge distillation framework for emotion recognition. The framework leverages the strengths of a multimodal teacher network to enhance the performance of a unimodal student network that uses only the visual modality as input. Specifically, we design a prototype-based modality rebalancing strategy, which dynamically adjusts the convergence rates of the different modalities to mitigate modality imbalance, enabling the teacher network to integrate multimodal information more effectively. Building upon this, we develop a Cross-Modal Densely Guided Knowledge Distillation (CDGKD) method, which transfers the knowledge extracted by the multimodal teacher network to the unimodal student network: multi-level teacher assistant networks bridge the capacity gap between teacher and student, and dense guidance reduces error accumulation during knowledge transfer. Experimental results demonstrate that the proposed framework outperforms existing methods on two public emotion datasets, providing an effective solution for emotion recognition in modality-constrained scenarios.
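
To make the dense guidance idea concrete, the following minimal PyTorch sketch shows one plausible form of the distillation objective, in which the student is softened toward the outputs of the multimodal teacher and every intermediate teacher assistant at once. The function name `dense_kd_loss`, the temperature `T`, and the weighting `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of densely guided distillation (assumed form, not the
# authors' reference implementation). The student receives soft targets
# from the multimodal teacher and from all teacher assistants.
import torch
import torch.nn.functional as F

def dense_kd_loss(student_logits, guide_logits_list, labels, T=4.0, alpha=0.5):
    """Cross-entropy on hard labels plus averaged KL terms against every
    higher-capacity guide (teacher and all teacher assistants)."""
    ce = F.cross_entropy(student_logits, labels)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    kd = 0.0
    for g in guide_logits_list:
        p_g = F.softmax(g.detach() / T, dim=1)  # soft targets from one guide
        kd = kd + F.kl_div(log_p_s, p_g, reduction="batchmean") * (T * T)
    kd = kd / len(guide_logits_list)
    return alpha * ce + (1.0 - alpha) * kd

# Example: a visual-only student guided by a multimodal teacher and two TAs,
# given precomputed logits of shape [batch, num_classes]:
#   loss = dense_kd_loss(logits_s, [logits_teacher, logits_ta1, logits_ta2], labels)
```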