Datasets#
Overview of the currently available datasets in MAMKit.
Datasets |
Tasks |
Modalities |
Description |
---|---|---|---|
UKDebates |
ASD |
Text, Audio |
The first MAM dataset. It contains transcriptions and audio sequences of three candidates for UK Prime Ministerial elections of 2015 in a two-hour debate aired by Sky News. The candidates are David Cameron, Nick Clegg, and Ed Miliband. The dataset contains 386 sentences and corresponding audio samples. |
MArg γ |
ARC |
Text, Audio |
A multimodal dataset built around the 2020 US Presidential elections. The dataset contains transcriptions and audio sequences of four candidates and a debate moderator concerning 18 topics. The authors design a controlled crowdsourcing data annotation process whereby each crowd worker labels sentence pairs as describing support, attack, or no relation. In total, the dataset contains 4,104 sentence pairs with corresponding aligned audio samples. A high-quality subset of the M-Arg, M-Arg γ , containing 2,443 sentence pairs with high agreement confidence γ ≥ 85% is commonly considered for model evaluation. |
MM-USED |
ASD, AFC |
Text, Audio |
It contains presidential candidates’ debate transcripts and corresponding audio recordings aired from 1960 to 2016. This dataset consists of 26,781 labeled sentences and corresponding audio samples covering 39 debates and 26 different speakers. |
MM-USED-fallacy |
AFC |
Text, Audio |
The dataset contains 1,891 sentences labeled as argumentative fallacies belonging to six distinct categories. Sentences are taken from presidential candidates’ debates aired from 1960 to 2016. |