FasTag: Automatic text classification of unstructured medical narratives.


Title	FasTag: Automatic text classification of unstructured medical narratives.
Publication Type	Journal Article
Year of Publication	2020
Authors	Venkataraman GR, Pineda AL, Bear Don't Walk OJ IV, Zehnder AM, Ayyar S, Page RL, Bustamante CD, Rivas MA
Journal	PLoS One
Volume	15
Issue	6
Pagination	e0234647
Date Published	2020
ISSN	1932-6203
Keywords	Animals, Automation, Data Mining, Databases as Topic, Humans, Narrative Medicine, Reproducibility of Results, Software, Species Specificity
Abstract	Unstructured clinical narratives are continuously being recorded as part of delivery of care in electronic health records, and dedicated tagging staff spend considerable effort manually assigning clinical codes for billing purposes. Despite these efforts, however, label availability and accuracy are both suboptimal. In this retrospective study, we aimed to automate the assignment of top-level International Classification of Diseases version 9 (ICD-9) codes to clinical records from human and veterinary data stores using minimal manual labor and feature curation. Automating top-level annotations could in turn enable rapid cohort identification, especially in a veterinary setting. To this end, we trained long short-term memory (LSTM) recurrent neural networks (RNNs) on 52,722 human and 89,591 veterinary records. We investigated the accuracy of both separate-domain and combined-domain models and probed model portability. We established relevant baseline classification performances by training Decision Trees (DT) and Random Forests (RF). We also investigated whether transforming the data using MetaMap Lite, a clinical natural language processing tool, affected classification performance. We showed that the LSTM-RNNs accurately classify veterinary and human text narratives into top-level categories with an average weighted macro F1 score of 0.74 and 0.68 respectively. In the "neoplasia" category, the model trained on veterinary data had a high validation accuracy in veterinary data and moderate accuracy in human data, with F1 scores of 0.91 and 0.70 respectively. Our LSTM method scored slightly higher than that of the DT and RF models. The use of LSTM-RNN models represents a scalable structure that could prove useful in cohort identification for comparative oncology studies. Digitization of human and veterinary health information will continue to be a reality, particularly in the form of unstructured narratives. Our approach is a step forward for these two domains to learn from and inform one another.
DOI	10.1371/journal.pone.0234647
Alternate Journal	PLoS One
PubMed ID	32569327
PubMed Central ID	PMC7307763
Grant List	T15 LM007033 / LM / NLM NIH HHS / United States; R01 HG010140 / HG / NHGRI NIH HHS / United States; U01 HG009080 / HG / NHGRI NIH HHS / United States
