MultiMediate Grand Challenge 2021
-
MultiMediate: Multi-modal Group Behaviour Analysis for Artificial Mediation
Proc. ACM Multimedia (MM), pp. 4878–4882, 2021.
Eye Contact Detection Sub-challenge
This sub-challenge focuses on eye contact detection in group interactions from ambient RGB cameras. We define eye contact as a discrete indication of whether a participant is looking at another participant's face, and if so, who this other participant is. Video and audio recordings over a 10-second context window are provided as input to give temporal context for the classification decision. Eye contact has to be detected for the last frame of this context window, making the task formulation also applicable to an online prediction scenario as encountered by artificial mediators.
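To make this formulation concrete, the following minimal Python sketch shows one way an eye contact sample could be structured; the frame rate, the field names, and the EyeContactSample class are illustrative assumptions, not part of the official data format.

from dataclasses import dataclass
import numpy as np

FPS = 30           # assumed camera frame rate (illustrative)
CONTEXT_SEC = 10   # 10-second context window as defined by the challenge

@dataclass
class EyeContactSample:
    frames: np.ndarray  # (CONTEXT_SEC * FPS, H, W, 3) video context from an ambient camera
    audio: np.ndarray   # audio context covering the same 10 seconds
    label: int          # 0 = no eye contact, 1-4 = position of the gazed-at participant

def target_frame(sample: EyeContactSample) -> np.ndarray:
    # Eye contact is annotated for the last frame of the context window.
    return sample.frames[-1]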

Next Speaker Prediction Sub-challenge
In the next speaker prediction sub-challenge, approaches need to predict which members of the group will be speaking at a future point in time. As in the eye contact detection sub-challenge, video and audio recordings over a 10-second context window are provided as input. Based on this information, approaches need to predict the speaking status of each participant one second after the end of the context window.
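As an illustration of this formulation, the sketch below pairs each 10-second context window with the per-participant speaking status one second after the window ends. The frame rate, the array shapes, and the make_pairs helper are assumptions made for this example, not the official preprocessing pipeline.

import numpy as np

FPS = 30             # assumed frame rate (illustrative)
CONTEXT = 10 * FPS   # 10-second context window
HORIZON = 1 * FPS    # prediction target lies one second after the window ends

def make_pairs(video: np.ndarray, speaking_status: np.ndarray):
    # video: (T, H, W, 3) frames; speaking_status: (T, num_participants) binary matrix.
    # Yields (context window, per-participant speaking labels one second later).
    for end in range(CONTEXT, len(video) - HORIZON):
        context = video[end - CONTEXT:end]
        labels = speaking_status[end + HORIZON]
        yield context, labels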
Evaluation of Participants’ Approaches
For the purpose of this challenge, we model next speaker prediction as a multi-label problem. Hence, a model for this task should predict a binary value (speaking = 1, not speaking = 0) for each participant in a given sample. As the metric to compare submitted models we will use the unweighted average recall over all samples (see the scikit-learn recall_score(y_true, y_pred, average='macro') function).
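The following snippet illustrates how this metric can be computed with scikit-learn; the toy labels and predictions are made up for illustration.

from sklearn.metrics import recall_score

y_true = [1, 0, 1, 1, 0, 0]  # ground-truth speaking status across participants and samples
y_pred = [1, 0, 0, 1, 0, 1]  # model predictions

uar = recall_score(y_true, y_pred, average='macro')  # unweighted average recall
print(f"UAR: {uar:.3f}")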
The eye contact detection task is modeled as a multi-class problem. Given a specific participant, a submitted model should predict with which other participant they are making eye contact. The task uses five classes: one for each participant's position (classes 1-4) and an additional class for no eye contact (class 0). To evaluate performance on this task we will use accuracy as the metric (see the scikit-learn accuracy_score(y_true, y_pred) function).
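Analogously, the eye contact metric can be computed with scikit-learn's accuracy_score; the toy labels below are made up for illustration.

from sklearn.metrics import accuracy_score

y_true = [0, 3, 2, 1, 4, 0]  # ground-truth classes (0 = no eye contact, 1-4 = participant positions)
y_pred = [0, 3, 1, 1, 4, 2]  # model predictions

acc = accuracy_score(y_true, y_pred)
print(f"Accuracy: {acc:.3f}")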
Participants will receive training and validation data that can be used to build solutions for each sub-challenge (eye contact detection and next speaker prediction). The evaluation of these approaches will then be performed remotely on our side, on the unpublished test portion of the dataset. For that, participants will create and upload Docker images with their solutions, which are then evaluated on our systems (for more information regarding the process, visit this link).