Probabilistic Document Retrieval for Full-Text Document Collections Using
Logistic Regression
Fredric C. Gey
U.C. Data Archive & Technical Assistance
University of California, Berkeley
Contact Information
2538 Channing Way, #5100
University of California
Berkeley, CA 94720
Phone: (510) 642-6571
Fax: (510) 643-8292
Email: gey@ucdata.berkeley.edu
Home Page: Fred Gey's Home Page
WWW Page: Probabilistic Document Retrieval
Keywords:
Information retrieval, Probabilistic document retrieval, Search Engines,
digital libraries, cross-language retrieval, Chinese retrieval, logistic
regression
Project Award Information
NSF Award: #9630765
Program Manager: Maria Zemankova
Dates: August 15, 1996; expires July 31, 1999 (estimated)
Principal Investigator: Fredric C Gey, gey@ucdata.berkeley.edu
Sponsor: U of Cal Berkeley, Berkeley, CA 94720, 415/642-6000
NSF Program: INFORMATION & DATA MANAGEMENT
Project Summary:
This project develops and tests new probabilistic approaches to text retrieval.
Using the statistical technique of logistic regression, documents are ranked
in order of estimated probability of relevance with respect to a query.
The methods are subjected to rigorous performance experiments with the
collections of documents and queries of the TREC (Text REtrieval Conference)
series of conferences sponsored by NIST (National Institute of Standards
and Technology) and DARPA (Defense Advanced Research Projects Agency).
This project will advance the progress in modern text and document retrieval
by developing sound theoretical models of the retrieval process, models
which achieve high performance in experimental tests on millions of documents.
The research will contribute to understanding of the mechanisms of multilingual
retrieval by applying its methodologies to queries and document collections
in Chinese and Spanish. The project is also researching methods for cross-language
retrieval.
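As a rough illustration of ranking by estimated probability of relevance, the sketch below applies the logistic function to a weighted sum of query-document features and sorts the candidates by the result. The feature names (`query_tf`, `doc_tf`, `idf`), the coefficient values, and the function names are hypothetical placeholders chosen for the example; they are not the fitted model used in this project.

```python
import math

# Hypothetical coefficients. A real system would fit these by logistic
# regression on query-document pairs with known relevance judgments.
INTERCEPT = -3.5
WEIGHTS = {"query_tf": 0.4, "doc_tf": 0.9, "idf": 0.3}


def relevance_probability(features):
    """Map a feature dict to an estimated probability of relevance."""
    log_odds = INTERCEPT + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-log_odds))


def rank(candidates):
    """Sort (doc_id, features) pairs by decreasing estimated probability."""
    scored = [(doc_id, relevance_probability(f)) for doc_id, f in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up feature values for three candidate documents.
candidates = [
    ("doc_a", {"query_tf": 1.0, "doc_tf": 0.5, "idf": 2.0}),
    ("doc_b", {"query_tf": 1.0, "doc_tf": 3.0, "idf": 2.0}),
    ("doc_c", {"query_tf": 1.0, "doc_tf": 1.5, "idf": 2.0}),
]
ranking = rank(candidates)
for doc_id, p in ranking:
    print(f"{doc_id}: {p:.3f}")
```

Because the logistic function is monotonic, sorting by probability gives the same order as sorting by log-odds; the probability form matters mainly when scores must be compared across queries or against a cutoff.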
Goals, Objectives, and Targeted Activities
The current year's research is targeted in three directions:
- Further development of cross-language retrieval, including phrase discovery
  aids
- Research into the equivalence of neural network and logistic regression
  methods of information retrieval
- Experiments to assess the performance of different regression models
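One well-known sense in which the two approaches coincide is that a network with no hidden layer and a single sigmoid output unit computes the same function as a logistic regression model with the same coefficients. The sketch below checks this numerically. It is an illustration of that textbook correspondence only and may not be the form of equivalence studied in this project; all parameter values are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_regression(x, coef, intercept):
    """P(relevant | x) from the log-odds form: log(p / (1 - p)) = intercept + coef . x"""
    log_odds = intercept + sum(c * xi for c, xi in zip(coef, x))
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


def single_unit_network(x, weights, bias):
    """Forward pass of a network with no hidden layer and one sigmoid output unit."""
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(activation)


params = [0.8, -0.3, 1.2]  # hypothetical values, shared by both models
offset = -1.0
x = [1.0, 2.0, 0.5]

p_lr = logistic_regression(x, params, offset)
p_nn = single_unit_network(x, params, offset)
print(p_lr, p_nn)
```

Adding hidden layers breaks this identity, since the network can then represent nonlinear functions of the input features.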
Future research will also include discovery of threshold mechanisms for
bridging between ranked retrieval and information filtering (unranked retrieval).
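A minimal sketch of what such a threshold mechanism could look like: documents scored with an estimated probability of relevance are accepted or rejected by a cutoff, and the cutoff is chosen to maximize a utility on judged training data. The linear utility (a gain for each relevant document accepted, a cost for each non-relevant one), its weights, and the function names are assumptions made for this example, not a description of the project's method.

```python
def filter_by_threshold(scored_docs, threshold):
    """Return the unordered set of doc ids whose estimated probability meets the threshold."""
    return {doc_id for doc_id, p in scored_docs if p >= threshold}


def utility(accepted, relevant, gain=2.0, cost=1.0):
    """Hypothetical linear utility: reward relevant hits, penalize false alarms."""
    hits = len(accepted & relevant)
    return gain * hits - cost * (len(accepted) - hits)


def pick_threshold(scored_docs, relevant):
    """Try each observed score as a cutoff and keep the one with the highest utility."""
    cutoffs = sorted({p for _, p in scored_docs})
    return max(
        cutoffs,
        key=lambda t: utility(filter_by_threshold(scored_docs, t), relevant),
    )


# Made-up training scores and relevance judgments.
training_scores = [("d1", 0.9), ("d2", 0.7), ("d3", 0.4), ("d4", 0.2)]
training_relevant = {"d1", "d2"}

best = pick_threshold(training_scores, training_relevant)
accepted = filter_by_threshold(training_scores, best)
print(best, sorted(accepted))
```

The output of the filter is a set, not a ranking: once the cutoff is fixed, the order of the accepted documents no longer matters.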
Indication of Success
This project has demonstrated that logistic regression retrieval algorithms
work well not only for English document collections but also for foreign
language retrieval. Performance (in terms of TREC conference measures)
has been excellent for Chinese, Spanish, and German collections for monolingual
retrieval. The project has developed and applied techniques from computational
linguistics for phrase discovery for cross-language retrieval. Performance
has been good in tests of English-to-German cross-language retrieval for
the TREC-6 collections.
Project Impact and Outcome
This project has supported the research activities of three PhD candidates
of the School of Information Management and Systems. One student, who is
fully supported, is pursuing dissertation research into probabilistic neural
networks for information retrieval. The other students, who were partially
supported for Chinese document retrieval, have left the project for positions
at Sun Microsystems and Excite, Inc., where one is now developing the Japanese
internet search engine of Excite. Two undergraduate computer science students
were employed writing software during 1997-1998. One logistic regression
algorithm forms the fundamental search method for the Hotbot internet search
engine and the UC Berkeley digital library retrieval engine.
Project References
1. Gey, F, and A Chen, "Term Importance in Routing Retrieval," submitted
to Information Retrieval, February 1998.
2. Gey, F, and A Chen, "Intelligent Boolean Filtering for Routing Retrieval,"
in preparation for publication submittal.
3. Gey, F, and A Chen, "Phrase Discovery for English and Cross-Language
Retrieval," in Proceedings of TREC-6, the Sixth NIST-DARPA Text REtrieval
Conference, National Institute of Standards and Technology, Washington,
DC (November 19-21, 1997).
4. Chen, A, L Xu, J He, F Gey, and J Meggs, "Chinese Text Retrieval
Without Using a Dictionary," in Proceedings of SIGIR97, the 20th annual
ACM conference on Research and Development in Information Retrieval, Philadelphia,
PA, July 26-31, 1997, pages 42-49.
5. Gey, F, A Chen, J He, J Meggs, and L Xu, "Term Importance, Boolean
Conjunct Training, Negative Terms, and Foreign Language Retrieval: Probabilistic
Algorithms at TREC-5," in Proceedings of TREC-5, the Fifth NIST-DARPA Text
REtrieval Conference, National Institute of Standards and Technology,
Washington, DC (November 20-22, 1996), pages 181-190.
6. He, J, L Xu, A Chen, J Meggs, and F Gey, "Berkeley Chinese
Information Retrieval at TREC-5: Technical Report," in Proceedings of TREC-5,
the Fifth NIST-DARPA Text REtrieval Conference, National Institute of
Standards and Technology, Washington, DC (November 20-22, 1996), pages
191-195.
Area Background
Information Retrieval algorithms support the computerized search
of large document collections (millions of documents) to retrieve small
subsets of documents relevant to a user's information need. Such algorithms
are the basis for internet search engines and digital library catalogues.
The fundamental models for retrieval are Boolean/logic (including fuzzy
logic), geometric/vector space similarity, and probabilistic document retrieval.
Application areas include foreign language and cross-language retrieval,
text categorization and automatic classification, and speech and broadcast
retrieval. Performance can be subjected to unbiased, objective testing against
test collections of hundreds of queries matched to millions of documents.
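To make the idea of objective testing concrete, the sketch below computes average precision, one measure commonly reported in TREC-style evaluations, for a single query. The ranking and the relevance judgments are made-up data, and the function name is chosen for this example.

```python
def average_precision(ranked_doc_ids, relevant):
    """Mean of the precision values at each rank where a relevant document appears.

    Relevant documents that are never retrieved contribute zero, so the
    sum is divided by the total number of relevant documents.
    """
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


# Made-up system output and relevance judgments for one query.
ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}

ap = average_precision(ranking, relevant)
print(f"average precision = {ap:.4f}")
```

Averaging this value over all queries in a test collection gives a single figure that can be compared across retrieval systems run against the same documents and judgments.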
Area Reference
Readings in Information Retrieval, K. Sparck Jones and P. Willett,
eds., Morgan Kaufmann, 1997.
Potential Related Projects
I am co-principal investigator of the ARPA research contract "Search Support
of Unfamiliar Metadata," which attempts to bridge from ordinary natural
language search expressions to domain-specific classification indexing
languages. My particular area is searching numeric databases such as U.S.
Foreign Trade imports and exports time series which are specified by the
16,000 classifications of the International Harmonized Commodity Classification
Scheme. The project is also working on cross-language retrieval using multi-lingual
thesauri.