Information Retrieval for Mixed-Media Collections

Principal Investigator: Dr Gareth Jones

2000-2001, University of Exeter, U.K.

EPSRC REF: GR/N04034

1 Project Background

The Information Retrieval for Mixed-Media Collections (IRMMC) project investigates information retrieval for Mixed-Media collections, in which linguistic information sources originally realised as electronic text, spoken data or printed text are contained in a single Mixed-Media document collection. The indexing and retrieval characteristics of spoken and printed media make this a non-trivial task requiring careful analysis.

The relevance of this project can be seen in the context of the increasing availability of digital media, as shown for example by the rapid expansion of the World Wide Web (WWW). This is creating the potential for instant access to hitherto unimagined amounts of information. Current technology for the WWW focuses on searching and retrieval from online text sources. However, digital technology potentially enables text-based searching of any language-based content in any medium. An example application is the news research group at the BBC. Until recently the BBC manually maintained a newspaper archive in a printed cuttings room. This has now been replaced by a system making use of the online electronic versions of various daily newspapers such as The Times and The Guardian. BBC researchers now have a system with rapid, efficient access to recent news material and slower manual access to their existing paper cuttings archive. The technology developed in the IRMMC project would enable them to fully digitise their information searching by integrating scanned versions of the existing paper cuttings into their online searching environment. This system could naturally be extended in the near future, as streaming of multimedia data becomes economic, to include archive video and audio footage in the online research infrastructure, the retrieval problems having already been addressed in this project. This is just one example; many organisations hold recent documents in electronic form while maintaining important legacy paper archives which it would be uneconomic to retype.

2 Existing Work

There have been a number of notable research efforts in spoken document retrieval (SDR) in recent years, including: the Informedia project at Carnegie Mellon University, the Video Mail Retrieval and Multimedia Retrieval projects at Cambridge University, the SCAN system from AT&T Research, the phone-based system developed at ETH, Zurich, and THISL at Sheffield University. Most of these groups are active participants in the annual NIST TREC SDR track, from which it is possible to assess the relative strengths of their methods.

There are fewer examples of work in retrieval from scanned documents. A significant study was carried out at the University of Nevada (Taghva et al. 96). Their work reports a study using their own retrieval collection and a simulated OCR collection generated from standard information retrieval test collections. They report results for various retrieval models. However, further work is required. The IRMMC project will carry out an investigation into the application of the probabilistic information retrieval model (Robertson and Walker 1994). The main previous study of OCR text retrieval was the TREC-5 Confusion Track (Harman and Voorhees 97). This operated in a similar manner to the current TREC SDR tracks. Part of the initial OCR text investigation in the IRMMC project will review the methods introduced by participants in the TREC-5 Confusion Track, in particular investigating their use with the probabilistic retrieval model.

The main focus of this project, Information Retrieval for Mixed-Media Collections, is a new topic requiring the combination of a variety of existing techniques and the development of novel methods to develop an optimal system. Some of the relevant issues are described in an experiment by Sanderson and Crestani (Sanderson and Crestani 1997). In their experiment the correct text and the recognised speech text from the NIST TREC-6 SDR track were combined into a single retrieval collection. They observed that text source documents were favoured over the transcribed spoken documents. The goal of the Mixed-Media Information Retrieval research is to explore this and other effects, and to develop a suitable retrieval model to address these problems.
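One simple way to quantify the kind of bias observed in such a merged collection is to measure how the top of a ranked list is divided between source media. The following is a minimal sketch, not the measurement used by Sanderson and Crestani; the function name `media_share_at_k` and the media labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def media_share_at_k(ranked_media, k):
    """Return the fraction of the top-k ranked documents from each source medium.

    `ranked_media` is a list of media labels (e.g. "text", "speech", "ocr"),
    one per retrieved document, in rank order. The labels are illustrative.
    """
    top = ranked_media[:k]
    counts = Counter(top)
    return {medium: n / len(top) for medium, n in counts.items()}


# Hypothetical ranked list for one query over a mixed collection.
ranking = ["text", "text", "speech", "text", "ocr", "speech"]
shares = media_share_at_k(ranking, 4)
```

Comparing these shares with the proportion of each medium in the collection as a whole gives a rough indication of whether one medium is systematically favoured.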

The development of a full-scale demonstration system is beyond the scope of a project of this size, and the proposed work will focus on the development of core technology for Mixed-Media Information Retrieval. However, we will construct a working text-based demonstration system illustrating the underlying retrieval from the Mixed-Media data. We anticipate that the technology implemented in this system will be easily portable to a multimedia environment.

In conclusion, separate technologies have been developed and tested for text, spoken and OCR text retrieval. A large research effort continues to be focused on text retrieval and there is currently significant interest in SDR. Retrieval of OCR text requires further investigation to explore effective retrieval models and techniques. Retrieval from collections where index information is stored for documents in different media is a new research area with considerable practical application.

3 Aims and Objectives

3.1 Project Aims

The aims of the project are to develop techniques for Mixed-Media Information Retrieval for text, spoken and OCR text documents, and to carry out an investigation into the use of a probabilistic information retrieval model for OCR text retrieval. The work will bring together existing methods in OCR text retrieval to investigate their applicability with a probabilistic retrieval model, and will develop new methods to improve retrieval with imperfect OCR indexing. The Mixed-Media retrieval research will develop techniques to optimise retrieval performance from these collections. Techniques will be explored to discover and exploit identifiable semantic and lexical relationships between document indexes derived from different media, with the aim of improving overall retrieval performance. Further investigation will explore the use of relevance feedback mechanisms both for OCR text independently and in Mixed-Media collections. The results of the work will be demonstrated in an interactive demonstration system based on Mixed-Media document index files.

3.2 Project Objectives

The objectives against which the success of the project may be judged are as follows:

4 Methodology

4.1 Information Retrieval

The probabilistic retrieval model has been shown to work effectively for large text collections (Walker et al. 98) and spoken document collections (Jones et al. 96) (Johnson et al. 98). Motivated by its success in this existing work and by its theoretical foundation (Robertson and Walker 94), the IRMMC project will take the probabilistic retrieval model as the starting point for our investigation of OCR text and Mixed-Media Retrieval.
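As an illustration of term weighting in this family of models, the sketch below scores tokenised documents with a BM25-style function, a widely used weighting derived from the probabilistic model. It is an assumption that this particular form stands in for the model used in the project; the function name `bm25_scores`, the parameter values `k1=1.2` and `b=0.75`, and the smoothed inverse document frequency are illustrative choices, not details taken from the project.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document in `docs` against the tokenised `query`.

    k1 controls term-frequency saturation and b controls document length
    normalisation; both defaults are illustrative assumptions.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always positive).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation relative to the average document length.
            norm = k1 * ((1 - b) + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


# Hypothetical toy collection of tokenised documents.
docs = [
    ["news", "archive", "bbc"],
    ["speech", "recognition"],
    ["news", "news", "paper"],
]
scores = bm25_scores(["news"], docs)
```

In this toy collection the second document does not contain the query term and scores zero, while the third document scores above the first because it has the same length but a higher term frequency.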

4.2 OCR Image Retrieval

The principal difference between retrieval of text originated in electronic form and that maintained only in physical documents is the need to index the contents in the latter case. The reliability with which content can be extracted from documents depends on a number of factors, but a fundamental issue is that indexing is error-prone. Because of this imprecise indexing, OCR text retrieval has many parallels with SDR. The problems are different, but in both cases recognition errors degrade retrieval performance relative to perfect text transcriptions, and the retrieval strategies must be optimised for the source media.

4.3 Spoken Document Retrieval

A formal investigation of SDR is beyond the scope of this project. Experiments will be carried out to establish retrieval performance, and standard SDR methods, such as feedback, will be used to seek optimal performance.

4.4 SDR vs OCR Image Retrieval

An important difference between OCR indexing and spoken document indexing is that in the latter case the indexing vocabulary is limited to the active recognition vocabulary of the speech recognition system. In the OCR case the system attempts to match the input character strings to dictionary entries, but where this match is poor the character string itself is output. This has the effect of significantly increasing the number of different terms in the vocabulary of the document file. A few of these new elements will be acronyms, but many more are erroneous near matches with existing vocabulary items.
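The vocabulary growth described above can be illustrated by mapping out-of-dictionary OCR strings back to their closest dictionary entries. The sketch below uses the similarity matching in Python's standard `difflib` module; this is an assumption made for illustration rather than a technique proposed by the project, and the function name `normalise_ocr_terms`, the `cutoff` value and the example terms are hypothetical.

```python
import difflib


def normalise_ocr_terms(ocr_terms, dictionary, cutoff=0.8):
    """Map each OCR output term to a dictionary entry where a close match exists.

    Terms already in the dictionary are kept. Other terms are replaced by the
    most similar dictionary entry if its similarity ratio reaches `cutoff`;
    otherwise they are kept unchanged (for example, acronyms).
    """
    vocab = sorted(dictionary)
    mapped = []
    for term in ocr_terms:
        if term in dictionary:
            mapped.append(term)
            continue
        close = difflib.get_close_matches(term, vocab, n=1, cutoff=cutoff)
        mapped.append(close[0] if close else term)
    return mapped


# Hypothetical dictionary and OCR output containing two recognition errors.
dictionary = {"retrieval", "document", "archive"}
ocr_terms = ["retrieval", "retrieva1", "docurnent", "BBC"]
mapped = normalise_ocr_terms(ocr_terms, dictionary)
```

Here the four distinct OCR strings reduce to three index terms: the two misrecognised strings are folded back into existing vocabulary items, while the acronym has no close match and is left as it is. A threshold that is too permissive would wrongly merge genuinely different terms, so the cutoff would need tuning on real data.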

The IRMMC project will not seek to develop new OCR methods, but rather to work with existing recognition software and focus on the types of errors that occur, how these affect retrieval behaviour, and the development of techniques to improve retrieval performance above this initial level.

The work will focus on printed text, rather than the more challenging handwritten text. Indexing and retrieval of handwritten material is an important area for further work, but the OCR error rate is much higher and hence retrieval is a harder task.

4.5 Mixed-Media Retrieval Demonstration System

In order to demonstrate the achievements of the project, a text-based demonstrator system for Mixed-Media retrieval will be constructed. Development of a full-scale multimedia demonstration is beyond the scope of this project, and the intention of our demonstrator will be to illustrate the techniques developed in the project.

Bibliography

Publications




Dr Gareth Jones
School of Computing, Dublin City University
Glasnevin, Dublin 9, Ireland
Tel: +353 (0)1 700 5559 Fax: +353 (0)1 7005442
email: Gareth.Jones @ computing.dcu.ie