Probabilistic Document Retrieval for Full-Text Document Collections Using Logistic Regression

Fredric C. Gey
U.C. Data Archive & Technical Assistance
University of California, Berkeley

Contact Information

2538 Channing Way, #5100
University of California
Berkeley, CA 94720
Phone: (510) 642-6571
Fax: (510) 643-8292
Email: gey@ucdata.berkeley.edu
Home Page: Fred Gey's Home Page

WWW PAGE

Probabilistic Document Retrieval

Keywords:

Information retrieval, Probabilistic document retrieval, Search Engines, digital libraries, cross-language retrieval, Chinese retrieval, logistic regression

Project Award Information

Project Summary:

This project develops and tests new probabilistic approaches to text retrieval. Using the statistical technique of logistic regression, documents are ranked in order of their estimated probability of relevance to a query. The methods are subjected to rigorous performance experiments with the document and query collections of the TREC (Text REtrieval Conference) series of conferences sponsored by NIST (National Institute of Standards and Technology) and DARPA (Defense Advanced Research Projects Agency). This project will advance modern text and document retrieval by developing sound theoretical models of the retrieval process, models that achieve high performance in experimental tests on millions of documents. The research will contribute to the understanding of the mechanisms of multilingual retrieval by applying its methodologies to queries and document collections in Chinese and Spanish. The project is also researching methods for cross-language retrieval.
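The ranking idea described above can be illustrated with a minimal Python sketch. It assumes a logistic model in which the log-odds of relevance are a linear function of a few query-document features. The feature definitions, coefficient values, and function names below are hypothetical and chosen only for illustration; they are not the project's fitted model. In practice the coefficients would be estimated from relevance judgments such as those supplied with the TREC collections.

```python
import math


def relevance_probability(features, coefficients, intercept):
    """Estimate P(relevant | query, document) with a logistic model.

    The log-odds of relevance are modeled as a linear combination
    of the query-document features.
    """
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-log_odds))


def rank_documents(doc_features, coefficients, intercept):
    """Return (doc_id, probability) pairs, most probably relevant first."""
    scored = [
        (doc_id, relevance_probability(feats, coefficients, intercept))
        for doc_id, feats in doc_features.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical features per document for one query:
# [normalized query-term frequency, number of matching query terms]
doc_features = {
    "d1": [0.8, 3.0],
    "d2": [0.1, 1.0],
    "d3": [0.5, 2.0],
}

# Illustrative coefficients only; not values estimated by the project.
coefficients = [1.5, 0.4]
intercept = -3.0

ranking = rank_documents(doc_features, coefficients, intercept)
for doc_id, prob in ranking:
    print(f"{doc_id}: {prob:.3f}")
```

Because the logistic function is monotonic in the log-odds, ranking by probability gives the same order as ranking by the linear score; the probability form matters mainly when scores must be compared against a threshold.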

Goals, Objectives, and Targeted Activities

The current year's research is targeted in three directions:

Future research will also include discovery of threshold mechanisms for bridging between ranked retrieval and information filtering (unranked retrieval).

Indication of Success

This project has demonstrated that logistic regression retrieval algorithms work well not only for English document collections but also for foreign language retrieval. Performance (in terms of TREC conference measures) has been excellent for monolingual retrieval on Chinese, Spanish, and German collections. The project has developed and applied techniques from computational linguistics for phrase discovery in cross-language retrieval. Performance has been good in tests of English-to-German cross-language retrieval on the TREC-6 collections.

Project Impact and Outcome

This project has supported the research activities of three PhD candidates of the School of Information Management and Systems. One student, who is fully supported, is pursuing dissertation research into probabilistic neural networks for information retrieval. The other students, who were partially supported for Chinese document retrieval, have left the project for positions at Sun Microsystems and Excite, Inc., where one is now developing the Japanese internet search engine of Excite. Two undergraduate computer science students were employed to write software during 1997-1998. One logistic regression algorithm forms the fundamental search method for the Hotbot internet search engine and the UC Berkeley digital library retrieval engine.

Project References

1. Gey, F, and A Chen, "Term Importance in Routing Retrieval," submitted to Information Retrieval, February 1998.

2. Gey, F, and A Chen, "Intelligent Boolean Filtering for Routing Retrieval," in preparation for publication submittal.

3. Gey, F, and A Chen, "Phrase Discovery for English and Cross-Language Retrieval," in Proceedings of TREC-6, the Sixth NIST-DARPA Text REtrieval Conference, National Institute of Standards and Technology, Washington, DC (November 19-21, 1997).

4. Chen, A, L Xu, J He, F Gey, and J Meggs, "Chinese Text Retrieval Without Using a Dictionary," in Proceedings of SIGIR97, the 20th annual ACM conference on Research and Development in Information Retrieval, Philadelphia, PA, July 26-31, 1997, pages 42-49.

5. Gey, F, A Chen, J He, J Meggs, and L Xu, "Term Importance, Boolean Conjunct Training, Negative Terms, and Foreign Language Retrieval, Probabilistic Algorithms at TREC-5," in Proceedings of TREC-5, the Fifth NIST-DARPA Text REtrieval Conference, National Institute of Standards and Technology, Washington, DC (November 20-22, 1996), pages 181-190.

6. He, J, L Xu, A Chen, J Meggs, and F Gey, "Berkeley Chinese Information Retrieval at TREC-5: Technical Report," in Proceedings of TREC-5, the Fifth NIST-DARPA Text REtrieval Conference, National Institute of Standards and Technology, Washington, DC (November 20-22, 1996), pages 191-195.

Area Background

Information retrieval algorithms support the computerized search of large document collections (millions of documents) to retrieve small subsets of documents relevant to a user's information need. Such algorithms are the basis for internet search engines and digital library catalogues. The fundamental models for retrieval are Boolean/logic (including fuzzy logic), geometric/vector space similarity, and probabilistic document retrieval. Application areas include foreign language and cross-language retrieval, text categorization and automatic classification, and speech and broadcast retrieval. Performance can be subjected to unbiased, objective testing against test collections of hundreds of queries matched to millions of documents.
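As an illustration of how a ranked result can be scored against a test collection, the following Python sketch computes non-interpolated average precision for a single query, a measure commonly reported in TREC-style evaluations. The text above does not say which measures this project reports, so the choice of measure, the function name, and the sample data are assumptions made for illustration.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Non-interpolated average precision for one query.

    Precision is taken at the rank of each relevant document that is
    retrieved, then averaged over all known relevant documents, so
    relevant documents that are never retrieved contribute zero.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# Hypothetical ranked output and relevance judgments for one query.
ranked = ["d1", "d3", "d2", "d4"]
judged_relevant = {"d1", "d2"}

ap = average_precision(ranked, judged_relevant)
print(f"average precision = {ap:.4f}")
```

Here the relevant documents appear at ranks 1 and 3, giving precisions of 1/1 and 2/3, so the average precision is 5/6. Averaging this value over all queries in a test collection yields a single summary score for a retrieval method.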

Area Reference

Readings in Information Retrieval, K. Sparck-Jones and P. Willett, eds., Morgan Kaufmann, 1997.

Potential Related Projects

I am co-principal investigator of the ARPA research contract "Search Support of Unfamiliar Metadata," which attempts to bridge from ordinary natural language search expressions to domain-specific classification indexing languages. My particular area is searching numeric databases such as the U.S. Foreign Trade import and export time series, which are specified by the 16,000 classifications of the International Harmonized Commodity Classification Scheme. The project is also working on cross-language retrieval using multilingual thesauri.