Linking Images to Spoken Content

One of the topics investigated in audiovisual hyperlinking is the linking of spoken content to images; ‘multimedia hyperlinking’ might be the more appropriate term here. We envision scenarios such as ‘visual radio’ or ‘enriching spoken word collections’ that aim either to increase entertainment value by adding visuals to the audio track, or to improve understanding of the spoken content by providing visual context. The goal, of course, is to do this fully automatically.

A recent study that investigated how users perceive the relevance of linked images demonstrated that a relatively simple approach won’t do the trick. Also, the relevance of an image depends to a large extent on what a user thinks the image represents, and not on what the metadata says it represents. These observations do not really come as a surprise, but it is good that they are now documented in a paper.

The research was done in the context of developing novel approaches for engaging with radio archives in The Netherlands via a tablet-friendly platform. The starting point was that we wanted to see if we could increase the entertainment value of listening to this radio archive by automatically adding images from a large open image repository of the Dutch National Archive.

Technical set-up

The radio programs, consisting of interviews on various topics, were fed into a speech recognition system that produced time-labeled transcripts of the spoken word. An entity-extraction algorithm provided us with keywords (names of people and places) that were used to search an index of the image metadata (captions). The top-ranked images in the search results were linked to the speech segments from which the entities were extracted.
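The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the system used in the study: the `Segment` type, the `KNOWN_ENTITIES` gazetteer (a stand-in for a real entity extractor), the term-count ranking, and all example data are assumptions made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One time-labeled piece of a speech transcript (times in seconds)."""
    start: float
    end: float
    text: str


# Hypothetical gazetteer standing in for a real entity-extraction algorithm.
KNOWN_ENTITIES = {"lisbon", "amsterdam"}


def extract_entities(text):
    """Return the known entity names that occur in the text."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return [t for t in tokens if t in KNOWN_ENTITIES]


def search_captions(entity, captions, top_k=1):
    """Rank images by how often the entity occurs in their caption."""
    scored = []
    for image_id, caption in captions.items():
        words = [w.strip(".,").lower() for w in caption.split()]
        score = words.count(entity)
        if score > 0:
            scored.append((score, image_id))
    # Highest score first; ties broken by image id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [image_id for _, image_id in scored[:top_k]]


def link_images(segments, captions, top_k=1):
    """Link the top-ranked images to the segment the entity came from."""
    links = []
    for seg in segments:
        for entity in extract_entities(seg.text):
            for image_id in search_captions(entity, captions, top_k):
                links.append((seg.start, seg.end, entity, image_id))
    return links


# Invented example data, for illustration only.
segments = [
    Segment(0.0, 12.5, "We travelled to Lisbon in the spring."),
    Segment(12.5, 30.0, "Later the interview moved on to other topics."),
]
captions = {
    "img-001": "Harbour of Lisbon seen from the river",
    "img-002": "Market square in Amsterdam",
}
links = link_images(segments, captions)
```

Here only the first segment mentions a known entity, so `links` contains a single link from that segment to `img-001`. Note that the match relies entirely on the caption text, which is exactly where the problems described below come from.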

User study

A total of 43 participants in the study were presented with 4 audio fragments ranging from 2 to 5 minutes in length. Each audio fragment had more than one linked image. For each image, the users were asked to rate the match between the image and the contents of the speech fragment on a 5-point scale (from strongly disagree to strongly agree). In 90% of the cases the users (strongly) disagreed with the match; in only 5% of the cases did they (strongly) agree with it.
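To make the percentages concrete, here is a small sketch of how 5-point ratings can be collapsed into (strongly) disagree, neutral, and (strongly) agree shares. The `summarise` helper and the ratings list are invented for illustration; they are not the data from the study.

```python
from collections import Counter


def summarise(ratings):
    """Collapse 5-point ratings (1 = strongly disagree, 5 = strongly agree)
    into the share of disagree, neutral, and agree responses."""
    counts = Counter(ratings)
    total = len(ratings)
    return {
        "disagree": (counts[1] + counts[2]) / total,
        "neutral": counts[3] / total,
        "agree": (counts[4] + counts[5]) / total,
    }


# Invented ratings for one image, not taken from the study.
example_ratings = [1, 1, 2, 2, 1, 2, 1, 3, 4, 2]
shares = summarise(example_ratings)
```

With these made-up ratings, eight of the ten responses fall in the disagree group and one each in neutral and agree.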


These are not really promising results. Was the technology failing? Well, yes and no. Next to some obvious failures due to imperfect speech transcripts or ambiguity in the metadata (for example, Native Americans, called “Indians” in Dutch, were mistakenly matched with people from India), a number of the non-matching cases did in fact show people or places that were mentioned in the speech content. The problem was simply that the participants in the study did not recognise them as such. To illustrate, consider three images that could all be labeled ‘Lisbon’: only if you know Lisbon well might you notice that the third image was indeed taken in Lisbon (because of the landmark it shows). We tested this observation in a small follow-up study in which we asked participants to rate the relevance of images with and without captions. Indeed, the images with captions received a better relevance rating than the ones without.



Clearly the technical set-up was not appropriate for reaching acceptable performance levels. One way we hope to obtain better results is by taking the context of the entities into account (e.g., by adding context words to the query with a certain weight) or by treating the speech segment as a ‘long query’ that could be decomposed into a structured query.
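As a rough illustration of the first idea, the sketch below adds weighted context words to an entity query. The scoring scheme, the `context_weight` value of 0.3, and the example captions are all assumptions chosen for this example, not the approach that will eventually be evaluated.

```python
def tokenize(text):
    """Lowercase the text and strip simple punctuation from each token."""
    return [t.strip(".,!?'\"").lower() for t in text.split()]


def score_caption(caption, entity, context_words, context_weight=0.3):
    """Score a caption: 1.0 per entity occurrence, plus a smaller bonus
    for each distinct context word that also appears in the caption."""
    words = tokenize(caption)
    score = 1.0 * words.count(entity.lower())
    if score == 0:
        return 0.0  # the entity itself must match; context alone is not enough
    for word in set(w.lower() for w in context_words):
        if word in words:
            score += context_weight
    return score


def rank_images(captions, entity, context_words, context_weight=0.3):
    """Return image ids that match the entity, best score first."""
    scored = [
        (score_caption(caption, entity, context_words, context_weight), image_id)
        for image_id, caption in captions.items()
    ]
    scored = [(s, i) for s, i in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [image_id for _, image_id in scored]


# Invented captions that share an ambiguous entity.
captions = {
    "img-010": "Indians at a market in Bombay",
    "img-011": "Indians on horseback on the prairie",
}

# Without context both captions tie; with context from the speech segment
# the caption that shares those words moves to the top.
without_context = rank_images(captions, "indians", [])
with_context = rank_images(captions, "indians", ["prairie", "horseback"])
```

In this toy case the entity alone cannot separate the two images, while context words taken from the surrounding speech push `img-011` ahead of `img-010`. How to choose the context words and their weight is left open here.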

The perceived relevance and added value of linked images is an area for further research. One way to proceed here is to study speech-to-image linking in a scenario that allows us to compare task accomplishment in conditions with and without linked images.


The research in the project was funded by the Dutch national program COMMIT/.

Danish Nadeem, Mariet Theune, and Roeland J.F. Ordelman. Towards Audio Enrichment through Images: A User Evaluation on Image Relevance with Spoken Content. In Proceedings of the Eighth International Conference on Advances in Multimedia (MMedia2016), Lisbon, Portugal, 2016.
