MIP: Healthcare and Medical Information Processing


This project stems from my recent research on medical information processing. Its objective is to extract medical information of interest from textual documents in the Electronic Health Record (EHR), through tasks such as disease status prediction, medication information extraction, and medical relation extraction, in order to support knowledge discovery and put it to practical use in disease diagnosis, prevention, and treatment.

We explored several NLP approaches, including pattern- or rule-based matching and machine learning, combined with syntactic and semantic features of medical contexts, to identify medical entities (e.g., diseases, medications) and to extract medical events and relations.


Emotion recognition in Suicide Notes

We developed a system to identify, at the sentence level, affective text expressing any of 15 specific emotions in suicide notes. The system uses a hybrid model that incorporates several natural language processing techniques: lexicon-based keyword spotting, CRF-based emotion cue identification, and machine learning-based emotion classification. The outputs of the individual techniques are integrated using a range of vote-based merging strategies.
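
The vote-based merging of component outputs can be illustrated with a minimal sketch. It assumes each component returns a set of emotion labels for a sentence and that a label is kept when at least `min_votes` components agree; the function name, the threshold, and the example labels are hypothetical and are not taken from the published system.

```python
from collections import Counter


def merge_by_vote(predictions, min_votes=2):
    """Keep every label proposed by at least `min_votes` components.

    `predictions` is a list of label sets, one set per component.
    """
    votes = Counter(label for labels in predictions for label in set(labels))
    return {label for label, count in votes.items() if count >= min_votes}


# Hypothetical outputs of the three components for one sentence.
keyword_labels = {"love", "guilt"}
crf_labels = {"love"}
classifier_labels = {"love", "hopelessness", "guilt"}

components = [keyword_labels, crf_labels, classifier_labels]
majority = merge_by_vote(components, min_votes=2)  # {"love", "guilt"}
union = merge_by_vote(components, min_votes=1)     # every proposed label
```

Setting `min_votes=1` gives a recall-oriented union, while requiring all components to agree gives a precision-oriented intersection, which is one way to read "different vote-based merging strategies".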

The automated system performed well against the manually annotated gold standard, achieving a micro-averaged F-measure of 61.39% in textual emotion recognition. It ranked 1st out of 24 participating teams in Track 2 (the shared task for sentiment analysis) of the 2011 i2b2/VA/Cincinnati Medical Natural Language Processing Challenge. The results demonstrate that effective emotion recognition by an automated system is possible when a large annotated corpus is available.

Keywords: Emotion recognition, lexicon-based keyword spotting model, machine-learning-based model, hybrid model, result integration
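
For readers unfamiliar with the metric, the following sketch shows how a micro-averaged F-measure can be computed for sentence-level, multi-label output. It pools true positives, false positives, and false negatives over all sentences and labels before computing precision and recall. The data are invented for illustration, and this is not the official challenge scorer.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F-measure over parallel lists of label sets."""
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, predicted):
        tp += len(gold_labels & pred_labels)   # labels correctly predicted
        fp += len(pred_labels - gold_labels)   # spurious labels
        fn += len(gold_labels - pred_labels)   # missed labels
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented example: three sentences, the last one carries no gold emotion.
gold = [{"love"}, {"guilt", "sorrow"}, set()]
predicted = [{"love"}, {"guilt"}, {"anger"}]
score = micro_f1(gold, predicted)  # tp=2, fp=1, fn=1, so F = 2/3
```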

Coreference resolution in clinical documents

In this work, we present a heuristics-based approach to coreference resolution of concepts in clinical text. The task covers five concept classes: Pronoun, Person, Problem, Treatment, and Test. Our approach exploits a set of heuristics that capture a wide variety of linguistic and domain-specific features of the different concept classes in clinical text.

We evaluated our approach on the i2b2/VA test data and obtained encouraging results, with an unweighted average F-measure of 0.901 across three coreference evaluation metrics: BCUBED, MUC, and CEAF. Our system ranked 3rd among the 20 participating teams in the 2011 i2b2/VA/Cincinnati Medical Natural Language Processing Challenge. The results indicate that the heuristics-based approach holds promise and achieves performance comparable to that of state-of-the-art learning-based approaches.
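
Of the three metrics, BCUBED is the simplest to illustrate. The sketch below computes mention-level precision and recall between gold ("key") and system ("response") coreference chains, assuming both partition exactly the same set of mentions, with singletons included. The mention identifiers are invented, and the official challenge scorer may differ in detail.

```python
def b_cubed(key_chains, response_chains):
    """Return (precision, recall, F) for two partitions of the same mentions."""
    key_of = {m: chain for chain in key_chains for m in chain}
    response_of = {m: chain for chain in response_chains for m in chain}
    mentions = list(key_of)
    precision = recall = 0.0
    for m in mentions:
        # Mentions that share a chain with m in both key and response.
        overlap = len(key_of[m] & response_of[m])
        precision += overlap / len(response_of[m])
        recall += overlap / len(key_of[m])
    precision /= len(mentions)
    recall /= len(mentions)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented example with four mentions.
key = [{"m1", "m2", "m3"}, {"m4"}]
response = [{"m1", "m2"}, {"m3", "m4"}]
p, r, f = b_cubed(key, response)  # p = 0.75, r = 2/3, f = 12/17
```

The unweighted average reported above is then the arithmetic mean of the F-measures obtained under each of the three metrics.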

Keywords: Coreference resolution, heuristics-based approach, anaphoric concept determination, coreference pairing, coreference chain generation

Identification of Medical Concepts, Assertion Annotation of Medical Problems, and Extraction of Medical Relations

An information extraction (IE) system was developed for three IE tasks: the identification of medical concepts (Task 1), assertion classification on medical problems (Task 2), and the extraction of relations between medical concepts (Task 3). We defined a wide variety of linguistic and semantic features for CRF-based medical concept recognition. A context-pattern-based approach was explored for assertion annotation and relation extraction; in particular, it made extensive use of the hierarchical structure of dependency parse trees to capture the complex syntactic relations underlying assertion and relation statements.

In the shared task evaluation, the system achieved an F-score of 80.82% on medical concept identification, 91.83% on assertion annotation, and 60.41% on relation extraction.
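
To give a flavour of the features a CRF-based concept recognizer consumes, here is a minimal sketch of a per-token feature function. The specific features (lowercased form, suffix, capitalization, digits, neighbouring words) are common choices assumed for illustration; they are not the feature set used in the system, which also drew on semantic resources.

```python
def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "suffix3": word[-3:].lower(),
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        # Context features; sentinel values mark the sentence boundaries.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


sentence = "Patient denies chest pain".split()
features = [token_features(sentence, i) for i in range(len(sentence))]
```

A sequence labeller would pair each feature dictionary with a tag such as B-problem, I-problem, or O and learn which feature combinations signal the start or continuation of a concept.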

Keywords: medical concepts (i.e. problems, tests, and treatments), CRF, context-based patterns, dependency parse trees

Medication Information Extraction from Medical Discharge Summaries

We developed a system to extract medication information from hospital discharge summaries. The system explored several linguistic natural language processing techniques (e.g., term-based and token-based rule matching) to identify medication-related information in the narrative text. A number of lexical resources were constructed to profile the lexical or morphological features of different categories of medication constituents.

The automated system performed well, achieving a micro-averaged F-measure of 79.55% at the term level and 81.16% at the token level.
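
The rule-matching idea can be sketched as follows, assuming a tiny drug lexicon and a few regular expressions for dosage, route, and frequency. The lexicon entries and patterns are hypothetical stand-ins for the much larger lexical resources the system relied on, and duration and reason are omitted for brevity.

```python
import re

# Hypothetical miniature resources, for illustration only.
DRUG_LEXICON = {"aspirin", "metformin", "lisinopril"}
DOSAGE = re.compile(r"\b\d+(?:\.\d+)?\s?(?:mg|mcg|g|ml)\b", re.IGNORECASE)
ROUTE = re.compile(r"\b(?:po|iv|orally|by mouth)\b", re.IGNORECASE)
FREQUENCY = re.compile(
    r"\b(?:once daily|twice daily|daily|bid|tid|qid|prn)\b", re.IGNORECASE
)


def first_match(pattern, text):
    """Return the first matching substring, or None if there is no match."""
    match = pattern.search(text)
    return match.group(0) if match else None


def extract_medication(sentence):
    """Extract one drug mention and its attributes from a sentence."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    drug = next((t for t in tokens if t.lower() in DRUG_LEXICON), None)
    return {
        "drug": drug,
        "dosage": first_match(DOSAGE, sentence),
        "route": first_match(ROUTE, sentence),
        "frequency": first_match(FREQUENCY, sentence),
    }


entry = extract_medication("Continue metformin 500 mg po bid for diabetes.")
# {'drug': 'metformin', 'dosage': '500 mg', 'route': 'po', 'frequency': 'bid'}
```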

Keywords: medication information (i.e. drug, dosage, frequency, mode/route, duration, and reason), rule-based approach, token-based lexical resources

Multi-disease Classification and Multi-status Tagging on Medical Discharge Records

A natural language processing (NLP) system was constructed for the multi-class, multi-label classification challenge in clinical data. The system automatically identifies obesity and 15 of its best-represented co-morbidities (e.g., asthma, diabetes) from the narrative patient record. For obesity and each co-morbidity, two types of document-level judgments are provided, textual and intuitive, each assigning one of four disease statuses: present, absent, questionable, or unmentioned. Textual judgments are based strictly on explicit text; intuitive judgments are based on implicit information that is not explicitly asserted in direct relation to obesity or its co-morbidities. The system predicts the two types of judgment separately.

This complex system combines various text mining approaches, drawing on methods from computational linguistics (e.g., term-based exact and approximate matching, and context- or rule-based pattern matching) and machine learning algorithms (e.g., support vector machines (SVM), naive Bayes (NB), and maximum entropy (ME)) to represent linguistic, syntactic, and semantic information about obesity and related co-morbidities at the individual judgment stages.

The implemented method achieved a macro-averaged F-measure of 81% on the textual task (the highest in the challenge) and 63% on the intuitive task (ranked 7th out of 28 teams; the highest was 66%).
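
A textual judgment based on term matching and context patterns can be sketched as below. The sketch assumes a hypothetical term list per disease, sentence-wide negation and uncertainty cues, and a fixed priority among statuses when a disease is mentioned more than once. All of these are simplifications for illustration, not the rules used in the system.

```python
import re

# Hypothetical cue patterns; scope is the whole sentence for simplicity.
NEGATION = re.compile(r"\b(?:no|denies|without|negative for)\b", re.IGNORECASE)
UNCERTAIN = re.compile(
    r"\b(?:possible|probable|questionable|rule out)\b", re.IGNORECASE
)
# Assumed priority: the strongest evidence in the document wins.
PRIORITY = {"unmentioned": 0, "absent": 1, "questionable": 2, "present": 3}


def textual_status(record, terms):
    """Assign a document-level status for one disease from explicit text."""
    status = "unmentioned"
    for sentence in re.split(r"(?<=[.!?])\s+", record):
        lowered = sentence.lower()
        if not any(re.search(rf"\b{re.escape(t)}\b", lowered) for t in terms):
            continue
        if NEGATION.search(sentence):
            found = "absent"
        elif UNCERTAIN.search(sentence):
            found = "questionable"
        else:
            found = "present"
        status = max(status, found, key=PRIORITY.get)
    return status


record = "History of asthma. Patient denies diabetes. Possible sleep apnea."
statuses = {
    "asthma": textual_status(record, ["asthma"]),
    "diabetes": textual_status(record, ["diabetes", "dm"]),
    "sleep apnea": textual_status(record, ["sleep apnea", "osa"]),
    "gout": textual_status(record, ["gout"]),
}
```

Intuitive judgments cannot be read off the text this way, which is where features fed to a learned classifier become useful.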
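
Unlike micro-averaging, macro-averaging computes an F-measure per class and then takes the plain mean, so rare statuses count as much as frequent ones. The sketch below assumes one status label per document for a single disease; the labels are invented, and the challenge's official scoring may aggregate differently.

```python
def macro_f1(gold, predicted, classes):
    """Mean of per-class F-measures over parallel lists of labels."""
    scores = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, predicted))
        fp = sum(g != c and p == c for g, p in zip(gold, predicted))
        fn = sum(g == c and p != c for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Invented document-level statuses for one disease.
gold_status = ["present", "present", "absent", "unmentioned"]
pred_status = ["present", "absent", "absent", "unmentioned"]
macro = macro_f1(gold_status, pred_status, ["present", "absent", "unmentioned"])
# Per-class F: present 2/3, absent 2/3, unmentioned 1, so the mean is 7/9.
```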

Keywords: disease classification, disease status tagging, term-based matching, rule-based and pattern matching, machine learning algorithms (NB, SVM, ME)

Related Publications

Yang H., Willis A., De Roeck A., and Nuseibeh B. (2012) A Hybrid Model for Automatic Emotion Recognition in Suicide Notes. Biomedical Informatics Insights. 2012:5(Suppl. 1): 17-30. (Best research paper award in 2011 i2b2/VA/Cincinnati Medical NLP Challenge)

Yang H., Willis A., De Roeck A., and Nuseibeh B. (2011) A System for Coreference Resolution in Clinical Documents. In 2011 i2b2/VA/Cincinnati Medical NLP Challenge workshop. Washington DC, USA.

Yang H. and De Roeck A. (2010) Extraction of Medical Information Using CRFs, Context Patterns, and Dependency Parse Trees. In 2010 i2b2/VA/Cincinnati Medical NLP Challenge workshop. Washington DC, USA.

Yang H. (2010) Automatic Extraction of Medication Information from Medical Discharge Summaries. Journal of the American Medical Informatics Association (JAMIA), 17 (5): 545-548.

Yang H., Spasic I., Keane J., and Nenadic G. (2009) A Text Mining Approach to the Prediction of Disease Status from Clinical Discharge Summaries. Journal of the American Medical Informatics Association (JAMIA), 16 (4): 596-600.