On the 16th of November, we discussed the results of the 2016 Video Hyperlinking task at the TREC Video Retrieval Evaluation (TRECVid) conference at NIST in Gaithersburg, MD, USA. After a turbulent year, caused by the last-minute withdrawal of the data set we had planned to use, the organising team is very happy to look back on a fruitful workshop, thanks in particular to the hard work of the participating teams. Next year, the video hyperlinking task will be back at TRECVid, hopefully with an even larger number of participants.
The video hyperlinking task asks participants to return relevant video segments of arbitrary length (link targets) given query video segments that we call ‘anchors’. The anchors are extracted from the same data collection. The video hyperlinking (VH) task can be seen as a video-to-video search task: given a video fragment, VH systems provide other video fragments that are topically related to it without being (near) duplicates. The task uses the Blip10000 data set, which contains 14,838 semi-professionally created videos together with speech recognition transcripts (provided by LIMSI/Vocapia and LIUM), shot segmentation (provided by TU Berlin) and features for 1000 visual concepts (AlexNet).
In the 2015 edition we observed that the provided anchors could be ambiguous. Given an anchor video that shows a castle and a Rolls Royce passing by, the question is what this anchor is about: is it about the car, the building, or maybe both? Ambiguity is a real-world problem that should be addressed in video hyperlinking research, but in an evaluation setting ambiguous anchors are difficult for systems to handle and make it hard for the organisers to collect useful relevance assessments.
Therefore, this year the task organisers developed a method to create multimodal anchors: anchors that combine information from both the audio and the visual stream. Anchors were selected from videos that contained verbal cues potentially referring to the visual stream (expressions such as ‘can see’, ‘this looks like’, etc.), and were further filtered to confirm the presence of the visual cues: actions and objects that are crucial to the narrative of the video but are not explicitly named or mentioned.
On the basis of 90 evaluation anchors, participants provided their system’s ranked output of link targets. For each submission, the top 5 anchor-target pairs were manually judged by 3 crowd workers on Amazon Mechanical Turk (MT), i.e. we collected 3 x 7216 judgments in total. The final relevance judgment for each anchor-target pair was the majority decision among its 3 crowd workers.
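The majority decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the organisers' actual pipeline: the identifiers (`anchor_1`, `target_a`), the boolean vote encoding and the function names are all hypothetical.

```python
from collections import Counter

def majority_label(votes):
    """Return True if more workers judged the pair relevant than not.

    `votes` is a list of booleans, one per crowd worker (assumed encoding).
    With an odd number of workers, such as 3, ties cannot occur.
    """
    counts = Counter(votes)
    return counts[True] > counts[False]

def aggregate(judgments):
    """Map each (anchor, target) pair to its majority relevance label."""
    return {pair: majority_label(votes) for pair, votes in judgments.items()}

# Hypothetical judgments: three worker votes per anchor-target pair.
judgments = {
    ("anchor_1", "target_a"): [True, True, False],
    ("anchor_1", "target_b"): [False, True, False],
}
relevance = aggregate(judgments)
print(relevance[("anchor_1", "target_a")])  # True
print(relevance[("anchor_1", "target_b")])  # False
```

Using an odd number of workers per pair keeps the aggregation free of ties, so no tie-breaking rule is needed.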
These judgments provided the basis for calculating the Precision@5 (P@5) measure. We also used the output of the MT process as ground truth to judge anchor-target pairs ranked below position 5, which allowed us to compute a Mean Average Precision measure that was slightly adapted to include rewards and penalties for segmentation accuracy (MAiSP).
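As an illustration of the Precision@5 measure, here is a minimal Python sketch. It assumes that a run is a ranked list of target identifiers per anchor and that the ground truth is a set of relevant targets per anchor. All names and data are hypothetical, and the segmentation-aware MAiSP measure is deliberately not covered.

```python
def precision_at_k(ranked_targets, relevant, k=5):
    """Fraction of the top-k ranked targets that are judged relevant.

    Divides by k even if fewer than k targets were returned, so missing
    results count as non-relevant (a common convention, assumed here).
    """
    top = ranked_targets[:k]
    hits = sum(1 for target in top if target in relevant)
    return hits / k

def mean_precision_at_k(runs, qrels, k=5):
    """Average precision@k over all anchors in a run."""
    scores = [precision_at_k(ranked, qrels.get(anchor, set()), k)
              for anchor, ranked in runs.items()]
    return sum(scores) / len(scores)

# Hypothetical run: ranked link targets per anchor.
runs = {
    "anchor_1": ["t1", "t2", "t3", "t4", "t5", "t6"],
    "anchor_2": ["u1", "u2", "u3", "u4", "u5"],
}
# Hypothetical ground truth: relevant targets per anchor.
qrels = {
    "anchor_1": {"t1", "t4", "t6"},
    "anchor_2": {"u1", "u2", "u3", "u5"},
}
print(precision_at_k(runs["anchor_1"], qrels["anchor_1"]))  # 0.4
print(mean_precision_at_k(runs, qrels))
```

Note that `t6` is relevant but sits at rank 6, so it does not contribute to P@5 for `anchor_1`.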
Click the links below to see the graphs for the P@5 and MAiSP measure.
As we had to change our data set with respect to last year’s evaluation, we cannot directly compare this year’s results with last year’s. However, the 2016 results indicate a performance level that provides a good baseline for improvement over the coming years and leaves much room for interesting experiments with feature combinations, segmentation and query formulation.
Participants’ papers and presentations
- EURECOM: Bernard Merialdo, Paul Pidou, Maria Eskevich, Benoit Huet. EURECOM at TRECVID 2016: The Adhoc Video Search and Video Hyperlinking Tasks.
- EURECOM.POLITO: Benoit Huet, Elena Baralis, Paolo Garza, and Mohammad Reza Kavoosifar. Eurecom-Polito at TRECVID 2016: Hyperlinking Task.
- FX PALO ALTO LABORATORY, INC (FXPAL): Chidansh Bhatt, Matthew Cooper. FXPAL Experiments for TRECVID 2016 Video Hyperlinking.
- Informedia (INF): Xuanchong Li, Alexander Hauptmann. INF@TRECVID 2016 Video Hyperlinking.
- CNRS, IRISA, INSA, Universite de Rennes 1 (IRISA): Remi Bois, Vedran Vukotic, Ronan Sicre, Guillaume Gravier, Pascale Sébillot, Christian Raymond. IRISA at TrecVid2016: Crossmodality, Multimodality and Monomodality for Video Hyperlinking.