DIR: Distributed Information Retrieval and Web Search


This project formed the basis of my PhD research, which investigated advanced search engine infrastructure for information source selection in distributed information environments. The project consisted of the following main parts:


The identification of selection cases in distributed textual databases

In this part of the work, we first distinguish the potential selection cases in distributed textual databases (DTDs) and categorise the types of DTDs. Based on these results, we identify the relationships between selection cases and DTD types. We then discuss the necessary constraints on database selection methods in each selection case, which can serve as a guideline for developing more effective and suitable selection algorithms.

Keywords: distributed textual databases (DTDs), DTD types, selection cases, necessary constraints

Hierarchical classification for multiple, distributed textual databases

This work describes an alternative hierarchical categorisation of textual databases based on a Bayesian network learning algorithm. The proposed approach, which is based on automatic textual analysis of the subject content of textual databases, addresses the database selection problem by first classifying textual databases into a hierarchy of topic categories. In addition, a new category assignment strategy called "possibility-window" is described, which allows more appropriate categories to be chosen for the content of each database.
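
The following is only an illustrative sketch of content-based category assignment, not the method from the publication. It swaps in a plain multinomial naive Bayes classifier for the Bayesian network learner, and it reads "possibility-window" as "keep every category whose log-posterior lies within a fixed window of the best one". That reading, the `window` parameter, the function names and the toy training data are all assumptions made for illustration. In a topic hierarchy, the same step could be repeated at each level.

```python
import math
from collections import Counter


def train(labelled_docs):
    """Collect per-category term counts from (category, terms) pairs."""
    term_counts = {}
    doc_counts = Counter()
    vocab = set()
    for category, terms in labelled_docs:
        term_counts.setdefault(category, Counter()).update(terms)
        doc_counts[category] += 1
        vocab.update(terms)
    return term_counts, doc_counts, vocab


def log_posteriors(terms, model):
    """Unnormalised log-posterior of each category (add-one smoothing)."""
    term_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for category, counts in term_counts.items():
        total_terms = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)
        for term in terms:
            score += math.log((counts[term] + 1) / (total_terms + len(vocab)))
        scores[category] = score
    return scores


def assign_categories(terms, model, window=1.0):
    """Hypothetical window rule: keep categories close to the best score."""
    scores = log_posteriors(terms, model)
    best = max(scores.values())
    return sorted(c for c, s in scores.items() if best - s <= window)


# Toy training data (invented for this example).
training = [
    ("medicine", ["heart", "drug", "surgery"]),
    ("medicine", ["drug", "trial"]),
    ("computing", ["network", "search", "query"]),
    ("biology", ["cell", "gene", "drug"]),
]
model = train(training)

# Terms sampled from a database whose content spans two topics.
sample_terms = ["drug", "gene", "trial"]
print(assign_categories(sample_terms, model, window=1.0))
print(assign_categories(sample_terms, model, window=0.0))
```

With a window of zero the rule reduces to ordinary single-label classification; a wider window lets a database that covers several subjects appear under more than one category.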

Keywords: hierarchical classification, Bayesian classifiers, multiple web databases

Two-stage statistical language models for textual database selection

For databases that have been categorised into a topic hierarchy, we present a two-stage database selection approach based on statistical language modelling. In our approach, the task of database selection is divided into two distinct steps: (a) First, at the category-specific search stage, a class-based language model restricts the search to the databases in a confined domain, e.g., a specific subject area that the user is interested in. (b) Second, at the term-specific search stage, the selection system computes the likelihood of each database chosen at the first stage using a term-based language model, and then selects the best databases for the query. Furthermore, a query expansion method based on a query translation model is proposed to alleviate the problem of query ambiguity.
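
The two stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the models from the publication: both stages use unigram query likelihood with linear interpolation (Jelinek-Mercer) smoothing, the class-based model is simply the pooled term counts of a category, and the function names, the `lam` weight and the toy databases are invented for the example. Query expansion is not covered.

```python
import math
from collections import Counter


def log_likelihood(query, model, background, lam=0.5):
    """Query log-likelihood under a unigram model smoothed with a background model."""
    model_total = sum(model.values())
    background_total = sum(background.values())
    score = 0.0
    for term in query:
        p_model = model[term] / model_total if model_total else 0.0
        # Add-one on the background keeps every probability above zero.
        p_background = (background[term] + 1) / (background_total + len(background) + 1)
        score += math.log(lam * p_model + (1 - lam) * p_background)
    return score


def two_stage_select(query, databases, top_categories=1, top_databases=2):
    """databases maps a name to (category, Counter of term counts)."""
    global_model = Counter()
    category_models = {}
    for category, counts in databases.values():
        global_model.update(counts)
        category_models.setdefault(category, Counter()).update(counts)

    # Stage 1 (category-specific): rank categories with the class-based model.
    ranked_categories = sorted(
        category_models,
        key=lambda c: log_likelihood(query, category_models[c], global_model),
        reverse=True,
    )
    chosen = set(ranked_categories[:top_categories])

    # Stage 2 (term-specific): rank only the databases in the chosen categories.
    candidates = [name for name, (cat, _) in databases.items() if cat in chosen]
    ranked = sorted(
        candidates,
        key=lambda n: log_likelihood(
            query, databases[n][1], category_models[databases[n][0]]
        ),
        reverse=True,
    )
    return ranked[:top_databases]


# Toy databases (invented for this example).
databases = {
    "med_a": ("medicine", Counter({"heart": 8, "surgery": 5, "drug": 2})),
    "med_b": ("medicine", Counter({"drug": 9, "trial": 4, "heart": 1})),
    "cs_a": ("computing", Counter({"network": 7, "search": 6, "query": 3})),
}
print(two_stage_select(["heart", "surgery"], databases))
```

Restricting the second stage to the categories chosen in the first means that term-level scores are computed only for a small subset of databases, which is the main practical benefit of the split.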

Keywords: database language model, text database selection, hierarchical topics, statistical language modeling, query expansion

Searching distributed textual databases using association rules

In this work, we introduce a data mining method that assists database selection by extracting potentially interesting association rules between textual databases from a collection of previous selection results. With a topic hierarchy, we exploit intra-class and inter-class associations between the databases, and use the discovered knowledge to fine-tune the original selection results so as to improve the effectiveness of database selection. This association-rule approach can be regarded as a step towards the post-processing of database selection.
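
As a rough illustration of the idea, the sketch below mines simple pairwise rules of the form "if database A was selected, B was usually selected too" from past selection results and uses them to extend a new result. It is a simplification under stated assumptions: only one-to-one rules are mined, the support and confidence thresholds are arbitrary, and the function names, the category map and the selection history are invented for the example.

```python
from collections import Counter
from itertools import permutations


def mine_pair_rules(history, min_support=0.3, min_confidence=0.7):
    """Return {(a, b): confidence} for rules a -> b over past selections."""
    single = Counter()
    pair = Counter()
    for selected in history:
        single.update(selected)
        pair.update(permutations(sorted(selected), 2))
    rules = {}
    for (a, b), count in pair.items():
        support = count / len(history)   # share of selections containing both
        confidence = count / single[a]   # share of a's selections that include b
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules


def rule_kind(rule, category):
    """Label a rule by whether both databases share a topic category."""
    a, b = rule
    return "intra-class" if category[a] == category[b] else "inter-class"


def refine(selection, rules):
    """Append databases implied by rules whose antecedent was selected."""
    refined = list(selection)
    for (a, b), _ in sorted(rules.items(), key=lambda kv: -kv[1]):
        if a in selection and b not in refined:
            refined.append(b)
    return refined


# Past selection results and topic categories (invented for this example).
history = [
    {"med_a", "med_b"},
    {"med_a", "med_b", "bio_a"},
    {"med_a", "med_b"},
    {"cs_a"},
    {"med_b", "bio_a"},
]
category = {"med_a": "medicine", "med_b": "medicine",
            "bio_a": "biology", "cs_a": "computing"}

rules = mine_pair_rules(history)
for rule in sorted(rules):
    print(rule, rule_kind(rule, category), rules[rule])
print(refine(["bio_a"], rules))
```

Because `refine` only fires rules whose antecedent is in the original selection, it acts as a single post-processing pass and does not chain rules transitively.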

Keywords: selection result refinement, data mining, association rules, inter-class association, intra-class association

Searching distributed web databases using domain-specific ontologies

In this work, a concept-based resource description model is developed, which uses domain-specific ontologies to extract concept-related information from information sources. A context-based query disambiguation approach is presented, which makes use of semantic relationships among concepts in the ontology to help articulate the information needs of users. With the description logic feature of axioms in the ontology, an axiom-based query matching method is described to discover the implicit, useful concepts in resource descriptions with respect to the query.
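
To give a feel for how ontology axioms can surface implicit concepts, the sketch below uses only subclass axioms: a resource described by a narrow concept is treated as implicitly covering every broader concept above it. This is a deliberately small stand-in for the axiom-based matching described above, not the method from the publications; the tiny ontology, the resource descriptions and the function names are assumptions made for illustration.

```python
def ancestors(concept, subclass_of):
    """All concepts reachable from `concept` by following subclass axioms upwards."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def matching_resources(query_concept, descriptions, subclass_of):
    """Resources whose explicit or implied concepts include the query concept."""
    matches = []
    for name, concepts in descriptions.items():
        implied = set(concepts)
        for concept in concepts:
            implied |= ancestors(concept, subclass_of)
        if query_concept in implied:
            matches.append(name)
    return sorted(matches)


# A toy ontology: each concept maps to its direct superclasses (invented).
subclass_of = {
    "cardiology": ["medicine"],
    "oncology": ["medicine"],
    "medicine": ["science"],
    "networking": ["computing"],
}

# Concept-based resource descriptions (invented).
descriptions = {
    "heart_db": {"cardiology"},
    "cancer_db": {"oncology"},
    "net_db": {"networking"},
}

print(matching_resources("medicine", descriptions, subclass_of))
```

Neither resource description mentions "medicine" explicitly, yet both medical databases match the query because the concept is implied by the subclass axioms.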

Keywords: domain-specific ontologies, concept-based resource description model, context-based query disambiguation, concept-based search

Related Publications

Yang H. and Zhang M. (2006) Two-Stage Statistical Language Models for Text Database Selection. Information Retrieval, 9 (1): 5-31.

Yang H. and Zhang M. (2005) Ontology-based Resource Descriptions for Distributed Information Sources. In: Proceedings of IEEE International Conference of Information Technology and Applications. Sydney, Australia, pp.143-149.

Yang H. and Zhang M. (2004) Hierarchical Classification for Multiple, Distributed Web Databases. International Journal of Computers and Their Applications, 11 (2): 118-130.

Yang H. and Zhang M. (2004) Association-Rule Based Information Source Selection. In: Proceedings of the 8th Pacific Rim International Conference on Artificial Intelligence. Lecture Notes in Artificial Intelligence. Springer Verlag Publishers. Auckland, New Zealand, pp.563-574.

Yang H. and Zhang M. (2004) An Ontology-Based Approach for Resource Discovery. In: Proceedings of International Conference on Intelligent Agents, Web Technology and Internet Commerce-IAWTIC'2004. Gold Coast, Australia, pp.306-317.

Yang H. and Zhang M. (2003) Necessary Constraints for Database Selection in a Distributed Text Database Environment. In: Proceedings of the International Conference on Computational Intelligence for Modeling Control and Automation (CIMCA 2003). Vienna, Austria, pp.25-34.

Yang H. and Zhang M. (2003) A Language Modeling Approach to Search Distributed Text Databases. In: Proceedings of the 16th Australian Joint Conference on Artificial Intelligence. Lecture Notes in Artificial Intelligence. Springer Verlag Publishers. Perth, Australia, pp.196-207.

Yang H. and Zhang M. (2002) Legal Aspects of the Application of Agent-Based Information Retrieval on the Internet. Journal of Law and Information Science, 12 (1): 57-69.

Yang H. and Zhang M. (2002) An Evolutionary Approach for Searching Distributed Collections. In: Proceedings of Australian World Wide Web Conference. Sunshine Coast, Australia, pp.531-545.

Yang H. and Zhang M. (2002) Query Expansion with Naive Bayes for Searching Distributed Collections. In: Proceedings of the 11th International Conference on Intelligent Systems. Boston, US, pp.17-23.

Yang H., Zhang M. and Yang X. (2001) IISS: A Framework for Intelligent Information Source Selection on the Web. In: Proceedings of International Conference on Artificial Neural Networks and Expert Systems. Dunedin, New Zealand, pp.190-195.

Yang H. and Zhang M. (2001) Legal Aspects of the Application of Agent-based Information Retrieval on the Internet. In: Proceedings of International Conference on Information Technology and the Emerging Law. Wollongong, Australia, pp.1-4.

Yang X., Tan M., Lui Z., Zhang M. and Yang H. (2001) Predicting the Goodness of Database Groups Based on Vector Space Model. In: Proceedings of Asia Pacific Web Conference. Changsha, China, pp.21-26.