Bio-MITA Project

Bio-MITA: Mining Term Associations from Literature to Support Knowledge Discovery in Biology

Bio-MITA is a BBSRC-funded project that supports biological knowledge discovery through text-based term association mining, with the aim of easing access to information and increasing research productivity. We have investigated how various text mining approaches can be combined to tackle different biomedical scenarios. So far, we have developed two Bio-NLP systems:

Transcription Factor Identification System
Protein/Gene Name Recognition and Normalization from Biomedical Literature

Transcription Factor Identification System

A text mining system based on machine learning (ML) has been developed to automatically recognise transcription factors (TFs), a class of proteins involved in gene regulation, in biomedical literature. Three machine learning approaches, Naive Bayes (NB), Support Vector Machine (SVM), and Maximum Entropy (ME), are integrated into a single learning model that identifies TF contexts in the text. A phrase-based Conditional Random Fields (CRFs) model, which captures the content and context features of candidate entities, is then built to distinguish transcription factors from the other biological entities (e.g., target genes and DNA-binding sites) contained in the TF contexts.

Keywords: transcription factors (TFs), TF context sentence extraction, TF identification, machine learning approaches (NB, SVM, and ME), CRFs
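
The description above does not say how the NB, SVM, and ME outputs are integrated, so the following is only a minimal sketch that assumes simple majority voting over three sentence-level classifiers. The three stub functions and their keyword rules are hypothetical stand-ins for trained models, not the project's actual classifiers.

```python
from collections import Counter
from typing import Callable, List

# A classifier maps a sentence to True (TF context) or False.
Classifier = Callable[[str], bool]


def majority_vote(sentence: str, classifiers: List[Classifier]) -> bool:
    """Return True when more classifiers vote True than False."""
    votes = Counter(clf(sentence) for clf in classifiers)
    return votes[True] > votes[False]


# Hypothetical stand-ins for trained NB, SVM and ME models (assumption:
# a real system would call each model's prediction function here).
def nb_stub(sentence: str) -> bool:
    return "transcription factor" in sentence.lower()


def svm_stub(sentence: str) -> bool:
    text = sentence.lower()
    return "binds" in text or "promoter" in text


def me_stub(sentence: str) -> bool:
    text = sentence.lower()
    return "activates" in text or "represses" in text


CLASSIFIERS: List[Classifier] = [nb_stub, svm_stub, me_stub]


def is_tf_context(sentence: str) -> bool:
    """Label a sentence as a TF context by majority vote of the three stubs."""
    return majority_vote(sentence, CLASSIFIERS)


if __name__ == "__main__":
    print(is_tf_context(
        "The transcription factor GATA1 binds the promoter of its target gene."
    ))
```

With an odd number of voters there are no ties, which is one reason a three-classifier vote is a convenient baseline. Weighted voting or stacking would be equally plausible ways to integrate the models.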

Protein/Gene Name Recognition and Normalization from Biomedical Literature

The system was designed to automatically extract protein/gene mentions from biomedical text and link them to database identifiers (e.g., NCBI EntrezGene, EBI UniProtKB/Swiss-Prot Gene) in support of more sophisticated information extraction tasks. Several protein NER tools were run separately to detect genes and gene products in the text, and an integrated approach was explored to merge their outputs into a single, final gene list. Gene mentions were mapped to database identifiers with a cascaded approach that combines exact, exact-like, and token-based approximate matching, using a flexible rule-based representation of both the gene synonyms in the databases and the gene mentions in text.

Keywords: gene name normalisation, gene name mapping, lexical variability
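
The cascade described above can be sketched as three matching stages tried in order of decreasing strictness. This is an illustration under stated assumptions, not the system's actual rules: the lexicon entries, the placeholder identifiers `ID_A` and `ID_B`, the `squash` normalisation, and the Jaccard threshold of 0.6 are all hypothetical.

```python
import re
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical lexicon: placeholder identifiers mapped to gene synonyms.
LEXICON: Dict[str, List[str]] = {
    "ID_A": ["IL-2", "interleukin 2"],
    "ID_B": ["p53", "tumor protein p53"],
}


def squash(name: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def tokens(name: str) -> Set[str]:
    """Split a name into lowercase alphabetic and numeric tokens."""
    return set(re.findall(r"[a-z]+|[0-9]+", name.lower()))


def normalise(
    mention: str,
    lexicon: Dict[str, List[str]] = LEXICON,
    min_overlap: float = 0.6,  # assumed threshold for approximate matches
) -> Optional[Tuple[str, str]]:
    """Return (identifier, stage) for the first stage that matches, else None."""
    # Stage 1: exact string match against a synonym.
    for gene_id, synonyms in lexicon.items():
        if mention in synonyms:
            return gene_id, "exact"

    # Stage 2: exact-like match, ignoring case, spaces and punctuation.
    key = squash(mention)
    for gene_id, synonyms in lexicon.items():
        if any(squash(s) == key for s in synonyms):
            return gene_id, "exact-like"

    # Stage 3: token-based approximate match using Jaccard overlap.
    mention_tokens = tokens(mention)
    best_id, best_score = None, 0.0
    for gene_id, synonyms in lexicon.items():
        for synonym in synonyms:
            synonym_tokens = tokens(synonym)
            union = mention_tokens | synonym_tokens
            score = len(mention_tokens & synonym_tokens) / len(union) if union else 0.0
            if score > best_score:
                best_id, best_score = gene_id, score
    if best_id is not None and best_score >= min_overlap:
        return best_id, "approximate"
    return None


if __name__ == "__main__":
    for text in ["IL-2", "IL 2", "p53 tumor protein", "actin"]:
        print(text, "->", normalise(text))
```

Ordering the stages from strict to loose means a mention is only passed to the more error-prone approximate stage when the cheaper, more precise stages have failed.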

Related Publications

Yang, H., Keane, J., Bergman, C., and Nenadic, G. (2009) Assigning Roles to Protein Mentions: the Case of Transcription Factors. Journal of Biomedical Informatics 42(5): 887-894.

Yang, H., Nenadic, G., and Keane, J. (2008) Identification of Transcription Factor Contexts in Literature Using Machine Learning Approaches. BMC Bioinformatics 9(Suppl 3): S11.

Yang, H., Nenadic, G., and Keane, J. (2007) A Cascaded Approach to Normalizing Gene Mentions in Biomedical Literature. Bioinformation 2(5): 197-206.