Video hyperlinking is attracting growing interest in the multimedia retrieval community. In video hyperlinking the goal is to apply the concept of linking that we know from the text domain to videos: enabling users to browse from one video to another. The assumption is that video hyperlinking can help users explore large video repositories more effectively. Links are created based on an automatically derived, topical relationship between video segments. The question, however, is how we identify which video segments in these repositories are good candidates for linking. And once we have such candidates, how do we make sure that the links to video targets are really interesting for a user? Five research groups presented their views on this today, at a special session at the International Conference on Multimedia Retrieval (ICMR2017) in Bucharest.
Hubs and false links
Chong-Wah Ngo from City University of Hong Kong presented a paper that introduces measures for selecting video fragments that are suitable for video hyperlinking. Whereas in video hyperlinking we often talk about anchors (the starting point of a link) and targets (where a link points to), the authors approach the problem of selecting anchors and targets in terms of “hubs” (information corners) and “authorities” (providing explanation or context). They propose one measure to identify such hubs (called Hubness) and another to assess the risk of creating false links (called Local Intrinsic Dimensionality). Both measures can be used to fine-tune video hyperlinking algorithms. The experiment described in the paper demonstrates that fragments quantified as hubs with low local intrinsic dimensionality are likely to be good anchors or targets. (On the Selection of Anchors and Targets for Video Hyperlinking)
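To make the two measures concrete, here is a small sketch using common definitions from the hubness and intrinsic-dimensionality literature, not necessarily the authors' exact formulation: hubness as the k-occurrence count of a point (how often it appears in other points' nearest-neighbor lists), and Local Intrinsic Dimensionality via the standard maximum-likelihood estimator over nearest-neighbor distances. The Euclidean metric and the parameter values are illustrative assumptions.

```python
import numpy as np

def k_occurrence(X, k=5):
    """For each point, count how often it appears in the k-NN lists
    of the other points. High counts indicate 'hub' points."""
    # pairwise Euclidean distances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-matches
    knn = np.argsort(d, axis=1)[:, :k]   # each point's k nearest neighbors
    counts = np.zeros(len(X), dtype=int)
    for row in knn:
        counts[row] += 1                 # tally every k-NN appearance
    return counts

def lid_mle(X, k=10):
    """MLE estimate of Local Intrinsic Dimensionality per point:
    LID = -(mean of log(r_i / r_k))^-1 over the k nearest-neighbor
    distances r_1 <= ... <= r_k. Low LID suggests a lower risk of
    false links in the local neighborhood."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    r = np.sort(d, axis=1)[:, :k]        # k smallest distances per point
    return -1.0 / np.mean(np.log(r / r[:, -1:]), axis=1)
```

In this reading, a fragment whose feature vector has a high k-occurrence count and a low LID estimate would be flagged as a promising anchor or target.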
One of the key questions in video hyperlinking is how information from different modalities (audio, video, text) in the anchors and targets should be combined to link them successfully. Petra Galuščáková from Charles University Prague presented a paper focusing on the effect of various combinations of visual and textual information on video hyperlinking. The paper reports on experiments using several methods for visual data processing: Feature Signatures, convolutional neural networks (CNN), concept detection and face recognition. The results show that visual features can provide an improvement over text-only retrieval, but their impact in this study appeared to be somewhat limited. The results of deploying face recognition in video hyperlinking were not encouraging enough to push this direction further.
(Visual Descriptors in Methods for Video Hyperlinking)
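The simplest way to combine modalities of the kind the paper studies is late fusion: score anchor–target pairs separately per modality and take a weighted sum. The sketch below assumes precomputed text and visual feature vectors; the cosine measure and the 0.7/0.3 weights are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fused_score(anchor, target, w_text=0.7, w_visual=0.3):
    """anchor/target: dicts with 'text' and 'visual' feature vectors.
    Late fusion: weighted sum of per-modality similarities."""
    s_text = cosine(anchor["text"], target["text"])
    s_vis = cosine(anchor["visual"], target["visual"])
    return w_text * s_text + w_visual * s_vis

def rank_targets(anchor, candidates, **weights):
    """Return candidate indices sorted by fused score, best first."""
    scores = [fused_score(anchor, c, **weights) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: -scores[i])
```

Sweeping the text/visual weights in such a setup is one way to quantify how much the visual modality actually contributes over text-only retrieval.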
Rémi Bois from CNRS/IRISA/INRIA presented an end-to-end study of multimedia linking in the news domain, aimed at enabling journalists to explore related multimedia content sources (press articles, videos, radio podcasts). The focus of this study is on an exploration scenario without a precise information need, where one typically has to get a comprehensive view of a topic or event in a limited amount of time. In this context, the authors present a method for generating ‘graphs’ that link heterogeneous multimedia news sources in such a way that users can easily explore relevant information on a topic. Evaluation of the approach with journalism students indicated that the graph representations can support professionals in gathering information more efficiently.
(Linking Multimedia Content for Efficient News Browsing)
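A minimal version of such a linking graph can be sketched as follows: nodes are news items (any modality, represented here only by precomputed feature vectors), and an edge is added whenever pairwise similarity exceeds a threshold. The cosine measure and the threshold are illustrative assumptions, not the authors' method.

```python
import numpy as np

def build_link_graph(items, threshold=0.5):
    """items: dict mapping item name -> feature vector.
    Returns an undirected adjacency map: name -> set of linked names."""
    names = list(items)
    graph = {n: set() for n in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            va, vb = items[a], items[b]
            sim = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
            if sim >= threshold:   # link sufficiently similar items
                graph[a].add(b)
                graph[b].add(a)
    return graph
```

Starting from any node, a journalist could then follow edges to neighboring articles, videos, or podcasts on the same topic, which is the exploration pattern the study evaluates.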
The question of how visual features can be used to predict the interestingness of a video sequence was addressed in a study presented by Yang Liu from Hong Kong Baptist University. The study exploits the fact that video frames can be taken from different angles or described by various kinds of visual features. A special machine learning approach that learns features from these different “views” of the data was applied to predict interestingness. The approach was tested successfully on data extracted from trailers of Hollywood-like movies. The presentation was followed by a short discussion on the definition of “interestingness” versus “popularity”. The audience suggested that interestingness should be regarded as a content-dependent feature, whereas popularity is an observed feature.
(Multi-view Manifold Learning for Media Interestingness Prediction)
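The multi-view idea can be reduced to a deliberately simple stand-in: each clip is described by several feature “views” (e.g. color and motion), and a naive baseline concatenates the standardized views and fits a ridge regressor to interestingness scores. This is not the manifold-learning method of the paper, just an illustration of learning a predictor from multiple views of the same data.

```python
import numpy as np

def standardise(X):
    """Zero-mean, unit-variance scaling per feature dimension."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

def fit_multiview_ridge(views, y, lam=1.0):
    """views: list of (n_clips, d_i) feature arrays, one per view.
    y: per-clip interestingness scores.
    Returns ridge-regression weights for the concatenated views,
    via the closed-form solution (X'X + lam*I)^-1 X'y."""
    X = np.hstack([standardise(V) for V in views])
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

Methods like the one presented improve on this baseline by learning a shared low-dimensional manifold across views rather than simply concatenating them.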
The final paper of the special session addressed the identification of interesting parts of videos, in this case academic presentations (the International Speech Conference Multimodal corpus). Keith Curtis from Dublin City University presented a method for generating summaries of such presentations by deploying audiovisual features to classify areas of high audience engagement, audience comprehension, and intentional or unintentional emphasis by the speaker. These areas are then used to create summaries, which are evaluated by observing user behavior (via eye-tracking) in an enhanced digital video browser for quickly skimming presentations.
(Utilising High-Level Features in Summarisation of Academic Presentations)
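The final summarization step can be sketched as a simple selection problem: given per-segment scores (assumed already produced by upstream audiovisual classifiers of engagement and emphasis), pick the highest-scoring segments until a time budget is filled. The greedy strategy and the data layout are illustrative assumptions, not the paper's algorithm.

```python
def select_summary(segments, budget_s=60.0):
    """segments: list of (start_s, duration_s, score) tuples, where
    score reflects predicted engagement/emphasis for that segment.
    Greedily keep the best-scoring segments within the time budget,
    then return them in playback order."""
    chosen, used = [], 0.0
    for seg in sorted(segments, key=lambda s: -s[2]):  # best score first
        if used + seg[1] <= budget_s:
            chosen.append(seg)
            used += seg[1]
    return sorted(chosen, key=lambda s: s[0])  # restore time order
```

The resulting segment list is the kind of skimmable summary whose usefulness the study evaluates with eye-tracking in the video browser.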
The special session was organized by Maria Eskevich, Roeland Ordelman, Benoit Huet, Gareth Jones, Claire-Hélène Demarty, Duong Quang-Khanh-Ngoc, and Mats Sjöberg.