A Cross-Modal Densely Guided Knowledge Distillation Based on Modality Rebalancing Strategy for Enhanced Unimodal Emotion Recognition

Shuang Wu1, Heng Liang2, Yong Zhang3*, Yanlin Chen4, Ziyu Jia5*
1South China University of Technology, 2The University of Hong Kong, 3Huzhou University, 4New York University, 5Institute of Automation, Chinese Academy of Sciences
IJCAI 2025
*Corresponding Author
Overall Framework

Our proposed framework enhances unimodal (visual-only) emotion recognition by distilling knowledge from a multimodal (visual + EEG) teacher network. We introduce a modality rebalancing strategy to improve the teacher network and a densely guided distillation method that transfers knowledge effectively while minimizing error accumulation.

Abstract

Multimodal emotion recognition has garnered significant attention for its ability to integrate data from multiple modalities to enhance performance. However, physiological signals such as electroencephalography (EEG) are more challenging to acquire than visual data due to higher collection costs and complexity, which limits the practical application of multimodal networks. To address this issue, this paper proposes a cross-modal knowledge distillation framework for emotion recognition. The framework leverages the strengths of a multimodal teacher network to enhance the performance of a unimodal student network that uses only the visual modality as input. Specifically, we design a prototype-based modality rebalancing strategy that dynamically adjusts the convergence rates of different modalities to mitigate the modality imbalance issue, enabling the teacher network to better integrate multimodal information. Building upon this, we develop a Cross-Modal Densely Guided Knowledge Distillation (CDGKD) method, which effectively transfers knowledge extracted by the multimodal teacher network to the unimodal student network. CDGKD uses multi-level teacher assistant networks to bridge the teacher-student gap and employs dense guidance to reduce error accumulation during knowledge transfer. Experimental results demonstrate that the proposed framework outperforms existing methods on two public emotion datasets, providing an effective solution for emotion recognition in modality-constrained scenarios.

Methodology

Figure 1: The training process of our multimodal teacher network with the prototype-based modality rebalancing strategy.

1. Prototype-Based Modality Rebalancing Strategy

Motivation: When fusing data from different sources like video and EEG, networks often face "modality imbalance," where one modality dominates and suppresses the other. This limits the overall performance of the fusion network.

Our Approach: To solve this, we introduce a modality rebalancing strategy using prototype loss. A "prototype" serves as a central feature vector for each emotion class. By measuring how closely each modality's features cluster around these prototypes, we can estimate its convergence rate. Our method then dynamically adjusts the loss weights to boost the slower-learning modality, ensuring it contributes effectively. This allows the teacher network to learn a more balanced and powerful multimodal representation for later distillation.
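
To make the strategy concrete, the following PyTorch sketch shows one way to implement it. This is our reading of the idea, not the authors' released code; the function names, the squared-distance prototype loss, and the ratio-based weighting rule are illustrative assumptions.

import torch

def class_prototypes(feats, labels, num_classes):
    # Prototype = mean feature vector of each emotion class.
    protos = torch.zeros(num_classes, feats.size(1), device=feats.device)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = feats[mask].mean(dim=0)
    return protos

def prototype_loss(feats, labels, protos):
    # Mean squared distance of each sample to its class prototype;
    # tighter clustering (smaller loss) signals faster convergence.
    return (feats - protos[labels]).pow(2).sum(dim=1).mean()

def rebalanced_loss(feats_v, feats_e, labels, num_classes,
                    task_loss_v, task_loss_e):
    # Per-modality prototype losses act as a convergence proxy.
    lp_v = prototype_loss(feats_v, labels,
                          class_prototypes(feats_v, labels, num_classes))
    lp_e = prototype_loss(feats_e, labels,
                          class_prototypes(feats_e, labels, num_classes))
    # Up-weight the slower-converging modality (an assumed rule): the
    # modality with the larger prototype loss gets a loss weight > 1,
    # while the other keeps weight 1.
    ratio = (lp_v / (lp_e + 1e-8)).detach()
    w_v = ratio.clamp(min=1.0)
    w_e = (1.0 / ratio).clamp(min=1.0)
    return w_v * task_loss_v + w_e * task_loss_e + lp_v + lp_e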

2. Cross-Modal Densely Guided Knowledge Distillation (CDGKD)

Motivation: Transferring knowledge from a complex multimodal "teacher" to a simple unimodal "student" is difficult due to the large structural gap between them. Furthermore, traditional step-by-step distillation can suffer from "error accumulation," where mistakes from one stage are amplified in the next.

Our Approach: We propose CDGKD, which uses multiple Teacher Assistant (TA) networks to bridge this gap. The core innovation is dense guidance: the student learns not just from the immediately preceding TA, but from a combined set of all higher-level networks, including the original teacher. We also employ a stochastic learning strategy, which randomly selects knowledge sources during training. This prevents the network from relying on a single knowledge path, reduces the risk of overfitting, and effectively mitigates error accumulation.
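
The sketch below illustrates dense guidance with stochastic source selection in PyTorch. It is a minimal interpretation under stated assumptions: networks[0] is the multimodal teacher, intermediate entries are progressively smaller teacher assistants that (we assume) take only the visual input, the final entry is the student, and the temperature-scaled KL distillation loss is the standard soft-label formulation rather than anything confirmed by the paper.

import random
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=4.0):
    # Standard temperature-scaled soft-label distillation loss.
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

def dense_guided_step(networks, level, x_visual, x_eeg, labels):
    # Train networks[level] (a TA or the final student) with dense
    # guidance from ALL higher-level networks, not just level - 1.
    logits = networks[level](x_visual)          # unimodal input only
    loss = F.cross_entropy(logits, labels)      # hard-label supervision
    # Stochastic learning: randomly sample a subset of the higher-level
    # networks as knowledge sources for this step, so that no single
    # knowledge path dominates and stage-wise errors are not amplified.
    k = random.randint(1, level)
    sources = random.sample(range(level), k)
    for s in sources:
        with torch.no_grad():
            t_logits = (networks[s](x_visual, x_eeg) if s == 0
                        else networks[s](x_visual))
        loss = loss + distill_loss(logits, t_logits) / k
    return loss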

Figure 2: Overview of the Cross-Modal Densely Guided Knowledge Distillation (CDGKD) framework.

Experimental Results

Student Network Performance (CDGKD)

Table 1: Comparison of student network performance with knowledge distillation baselines.

Teacher Network Performance (Prototype Loss)

Table 2: Comparison of teacher network performance with baselines.

Video Presentation

Poster

BibTeX


@inproceedings{wu_cross-modal_2025,
	title = {A Cross-Modal Densely Guided Knowledge Distillation Based on Modality Rebalancing Strategy for Enhanced Unimodal Emotion Recognition},
	url = {https://doi.org/10.24963/ijcai.2025/472},
	doi = {10.24963/ijcai.2025/472},
	pages = {4236--4244},
	booktitle = {Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, {IJCAI}-25},
	publisher = {International Joint Conferences on Artificial Intelligence Organization},
	author = {Wu, Shuang and Liang, Heng and Zhang, Yong and Chen, Yanlin and Jia, Ziyu},
	editor = {Kwok, James},
	date = {2025-08},
}