Models#

Currently available models can manage text and audio modalities. For each model, we summarize its text and audio encoding modules and its fusion strategy. Fusions strategies are divided into three categories:

Concat: Concatenation of the text and audio embeddings.
Avg: Average of the text and audio embeddings.
Cross: Crossmodal Attention between the text and audio embeddings.

**Models Information**#
Model	Text Encoding	Audio Encoding	Fusion
BiLSTM	GloVe + BiLSTM	(Wav2Vec2 or MFCCs) + BiLSTM	Concat-Late
MM-BERT	BERT	(Wav2Vec2 or HuBERT or WavLM) + BiLSTM	Concat-Late
MM-RoBERTa	RoBERTa	(Wav2Vec2 or HuBERT or WavLM) + BiLSTM	Concat-Late
CSA	BERT	(Wav2Vec2 or HuBERT or WavLM) + Transformer	Concat-Early
Ensemble	BERT	(Wav2Vec2 or HuBERT or WavLM) + Transformer	Avg-Late
Mul-TA	BERT	(Wav2Vec2 or HuBERT or WavLM) + Transformer	Cross