Information Retrieval: Data Structures & Algorithms

Foreword

Udi Manber

Dept. of Computer Science, University of Arizona


In the not-so-long past, information retrieval meant going to the town's library and asking the librarian for help. The librarian usually knew all the books in his possession, and could give one a definite, although often negative, answer. As the number of books grew - and with them the number of libraries and librarians - it became impossible for one person or any group of persons to possess so much information. Tools for information retrieval had to be devised. The most important of these tools is the index - a collection of terms with pointers to places where information about them can be found. The terms can be subject matters, author names, call numbers, etc., but the structure of the index is essentially the same. Indexes are usually placed at the end of a book, or in another form, implemented as card catalogs in a library. The Sumerian literary catalogue, of c. 2000 B.C., is probably the first list of books ever written. Book indexes had appeared in a primitive form in the 16th century, and by the 18th century some were similar to today's indexes. Given the incredible technology advances in the last 200 years, it is quite surprising that today, for the vast majority of people, an index, or a hierarchy of indexes, is still the only available tool for information retrieval! Furthermore, at least from my experience, many book indexes are not of high quality. Writing a good index is still more a matter of experience and art than a precise science.

Why do most people still use 18th century technology today? It is not because there are no other methods or no new technology. I believe that the main reason is simple: Indexes work. They are extremely simple and effective to use for small to medium-size data. As President Reagan was fond of saying "if it ain't broke, don't fix it." We read books in essentially the same way we did in the 18th century, we walk the same way (most people don't use small wheels, for example, for walking, although it is technologically feasible), and some people argue that we teach our students in the same way. There is a great comfort in not having to learn something new to perform an old task. However, with the information explosion just upon us, "it" is about to be broken. We not only have an immensely greater amount of information from which to retrieve, we also have much more complicated needs. Faster computers, larger capacity high-speed data storage devices, and higher bandwidth networks will all come along, but they will not be enough. We will need better techniques for storing, accessing, querying, and manipulating information.

It is doubtful that in our lifetime most people will read books, say, from a notebook computer, that people will have rockets attached to their backs, or that teaching will take a radical new form (I dare not even venture what form), but it is likely that information will be retrieved in many new ways, by many more people, and on a grander scale.

I exaggerated, of course, when I said that we are still using ancient technology for information retrieval. The basic concept of indexes - searching by keywords - may be the same, but the implementation is a world apart from the Sumerian clay tablets. And information retrieval of today, aided by computers, is not limited to search by keywords. Numerous techniques have been developed in the last 30 years, many of which are described in this book. There are efficient data structures to store indexes, sophisticated query algorithms to search quickly, data compression methods, and special hardware, to name just a few areas of extraordinary advances. Considerable progress has been made for even seemingly elementary problems, such as how to find a given pattern in a large text with or without preprocessing the text. Although most people do not yet enjoy the power of computerized search, and those who do cry for better and more powerful methods, we expect major changes in the next 10 years or even sooner. The wonderful mix of issues presented in this collection, from theory to practice, from software to hardware, is sure to be of great help to anyone with interest in information retrieval.

An editorial in the Australian Library Journal in 1974 states that "the history of cataloging is exceptional in that it is endlessly repetitive. Each generation rethinks and reformulates the same basic problems, reframing them in new contexts and restating them in new terminology." The history of computerized cataloging is still too young to be in a cycle, and the problems it faces may be old in origin but new in scale and complexity. Information retrieval, as is evident from this book, has grown into a broad area of study. I dare to predict that it will prosper. Oliver Wendell Holmes wrote in 1872 that "It is the province of knowledge to speak and it is the privilege of wisdom to listen." Maybe, just maybe, we will also be able to say in the future that it is the province of knowledge to write and it is the privilege of wisdom to query.