Biomedical text natural language processing (BioNLP) using scispaCy
How to identify diseases, drugs, and dosages from medical record transcriptions
Biomedical text mining and natural language processing (BioNLP) is an interesting research domain that deals with processing data from journals, medical records, and other biomedical documents. Considering the availability of biomedical literature, there has been an increasing interest in extracting information, relationships, and insights from text data. However, the unstructured organization and the domain complexity of biomedical documents make these tasks hard. Fortunately, some cool NLP Python packages can help us with that!
scispaCy is a Python package containing spaCy models for processing biomedical, scientific or clinical text. spaCy’s most mindblowing features are neural network models for tagging, parsing, named entity recognition (NER), text classification, and more. Add scispaCy models on top of it and we can do all that in the biomedical domain!
Here we are going to see how to use scispaCy NER models to identify drug and disease names mentioned in a medical transcription dataset. Moreover, we are going to combine NER and rule-based matching to extract the drug names and dosages reported in each transcription.
Requirements
- Python 3
- pandas
- spacy>=3.0
- scispacy
You can simply pip install all of them.
We also need to download and install the NER model from scispaCy. To install the en_ner_bc5cdr_md model, use the following command:
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.4.0/en_ner_bc5cdr_md-0.4.0.tar.gz
For updated versions or other models, please check the scispaCy page.
Dataset
Unstructured medical data, like medical transcriptions, are pretty hard to find. Here we are using a medical transcription dataset scraped from the MTSamples website by Tara Boyle and made available at Kaggle.
import pandas as pd
med_transcript = pd.read_csv("mtsamples.csv", index_col=0)
med_transcript.info()
med_transcript.head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4999 entries, 0 to 4998
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 description 4999 non-null object
1 medical_specialty 4999 non-null object
2 sample_name 4999 non-null object
3 transcription 4966 non-null object
4 keywords 3931 non-null object
dtypes: object(5)
memory usage: 234.3+ KB
 | description | medical_specialty | sample_name | transcription | keywords |
---|---|---|---|---|---|
0 | A 23-year-old white female presents with comp... | Allergy / Immunology | Allergic Rhinitis | SUBJECTIVE:, This 23-year-old white female pr... | allergy / immunology, allergic rhinitis, aller... |
1 | Consult for laparoscopic gastric bypass. | Bariatrics | Laparoscopic Gastric Bypass Consult - 2 | PAST MEDICAL HISTORY:, He has difficulty climb... | bariatrics, laparoscopic gastric bypass, weigh... |
2 | Consult for laparoscopic gastric bypass. | Bariatrics | Laparoscopic Gastric Bypass Consult - 1 | HISTORY OF PRESENT ILLNESS: , I have seen ABC ... | bariatrics, laparoscopic gastric bypass, heart... |
3 | 2-D M-Mode. Doppler. | Cardiovascular / Pulmonary | 2-D Echocardiogram - 1 | 2-D M-MODE: , ,1. Left atrial enlargement wit... | cardiovascular / pulmonary, 2-d m-mode, dopple... |
4 | 2-D Echocardiogram | Cardiovascular / Pulmonary | 2-D Echocardiogram - 2 | 1. The left ventricular cavity size and wall ... | cardiovascular / pulmonary, 2-d, doppler, echo... |
The dataset has almost 5000 records, but let’s work with a small random subsample so it doesn’t take too long to process. We also have to drop any rows whose transcriptions are missing.
med_transcript.dropna(subset=['transcription'], inplace=True)
med_transcript_small = med_transcript.sample(n=100, replace=False, random_state=42)
med_transcript_small.info()
med_transcript_small.head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 3162 to 3581
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 description 100 non-null object
1 medical_specialty 100 non-null object
2 sample_name 100 non-null object
3 transcription 100 non-null object
4 keywords 78 non-null object
dtypes: object(5)
memory usage: 4.7+ KB
 | description | medical_specialty | sample_name | transcription | keywords |
---|---|---|---|---|---|
3162 | Markedly elevated PT INR despite stopping Cou... | Hematology - Oncology | Hematology Consult - 1 | HISTORY OF PRESENT ILLNESS:, The patient is w... | NaN |
1981 | Intercostal block from fourth to tenth interc... | Pain Management | Intercostal block - 1 | PREPROCEDURE DIAGNOSIS:, Chest pain secondary... | pain management, xylocaine, marcaine, intercos... |
1361 | The patient is a 65-year-old female who under... | SOAP / Chart / Progress Notes | Lobectomy - Followup | HISTORY OF PRESENT ILLNESS: , The patient is a... | soap / chart / progress notes, non-small cell ... |
3008 | Construction of right upper arm hemodialysis ... | Nephrology | Hemodialysis Fistula Construction | PREOPERATIVE DIAGNOSIS: , End-stage renal dise... | nephrology, end-stage renal disease, av dialys... |
4943 | Bronchoscopy with brush biopsies. Persistent... | Cardiovascular / Pulmonary | Bronchoscopy - 8 | PREOPERATIVE DIAGNOSIS: , Persistent pneumonia... | cardiovascular / pulmonary, persistent pneumon... |
Let’s take one transcription to see how we can work with NER:
sample_transcription = med_transcript_small['transcription'].iloc[0]
print(sample_transcription[:1000]) # prints just the first 1000 characters
HISTORY OF PRESENT ILLNESS:, The patient is well known to me for a history of iron-deficiency anemia due to chronic blood loss from colitis. We corrected her hematocrit last year with intravenous (IV) iron. Ultimately, she had a total proctocolectomy done on 03/14/2007 to treat her colitis. Her course has been very complicated since then with needing multiple surgeries for removal of hematoma. This is partly because she was on anticoagulation for a right arm deep venous thrombosis (DVT) she had early this year, complicated by septic phlebitis.,Chart was reviewed, and I will not reiterate her complex history.,I am asked to see the patient again because of concerns for coagulopathy.,She had surgery again last month to evacuate a pelvic hematoma, and was found to have vancomycin resistant enterococcus, for which she is on multiple antibiotics and followed by infectious disease now.,She is on total parenteral nutrition (TPN) as well.,LABORATORY DATA:, Labs today showed a white blood
So, we can see a lot of entities in this transcription. There are drug, disease, and exam names for example. The text was scraped from a web page and we can identify the different sections from the medical record like “HISTORY OF PRESENT ILLNESS” and “LABORATORY DATA”, but this varies from record to record.
Named entity recognition
Named entity recognition (NER) is a subtask of natural language processing used to identify and classify named entities mentioned in unstructured text into pre-defined categories. scispaCy has different models to identify different entity types; they are listed on the scispaCy page.
We are going to use the NER model trained on the BC5CDR corpus (en_ner_bc5cdr_md). This corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases, and 3116 chemical-disease interactions. Don’t forget to download and install the model.
import scispacy
import spacy
nlp = spacy.load("en_ner_bc5cdr_md")
spacy.load will return a Language object containing all components and data needed to process text. This object is usually called nlp in the documentation and tutorials. Calling the nlp object on a string of text will return a processed Doc object with the text split into words and annotated.
Let’s get all identified entities and print their text, start position, end position, and type:
doc = nlp(sample_transcription)
print("TEXT", "START", "END", "ENTITY TYPE")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
TEXT START END ENTITY TYPE
iron-deficiency anemia 79 101 DISEASE
chronic blood loss 109 127 DISEASE
colitis 133 140 DISEASE
iron 203 207 CHEMICAL
...
vancomycin 781 791 CHEMICAL
infectious disease 873 891 DISEASE
improved.,PT 1348 1360 CHEMICAL
vitamin K 1503 1512 CHEMICAL
uric acid 1830 1839 CHEMICAL
bilirubin 1853 1862 CHEMICAL
Creatinine 1911 1921 CHEMICAL
...
Compazine 2474 2483 CHEMICAL
Zofran 2487 2493 CHEMICAL
epistaxis 2629 2638 DISEASE
bleeding 3057 3065 DISEASE
edema.,CARDIAC 3109 3123 CHEMICAL
adenopathy 3156 3166 DISEASE
...
We can see the model correctly identified and labeled diseases such as iron-deficiency anemia, chronic blood loss, and many more. Lots of drugs were also identified, like vancomycin, Compazine, and Zofran. The model can also identify molecules commonly measured in laboratory tests, such as creatinine, iron, bilirubin, and uric acid.
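To get an overview instead of a long listing, the entities can be grouped by label. The snippet below is a minimal sketch: group_entities is a hypothetical helper (it is not part of spaCy or scispaCy) that takes (text, label) pairs, and the pairs used here are copied by hand from the output above.

```python
from collections import defaultdict


def group_entities(pairs):
    """Group (text, label) pairs into {label: sorted unique lowercase texts}."""
    grouped = defaultdict(set)
    for text, label in pairs:
        grouped[label].add(text.lower())
    return {label: sorted(texts) for label, texts in grouped.items()}


# A few (text, label) pairs copied from the NER output above.
pairs = [
    ("iron-deficiency anemia", "DISEASE"),
    ("colitis", "DISEASE"),
    ("iron", "CHEMICAL"),
    ("vancomycin", "CHEMICAL"),
    ("Creatinine", "CHEMICAL"),
]

summary = group_entities(pairs)
print(summary["CHEMICAL"])
print(summary["DISEASE"])
```

With a real document, the same helper could be fed with ((ent.text, ent.label_) for ent in doc.ents).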
Not everything is perfect though. See how some tokens are weirdly classified as chemicals, possibly due to punctuation marks:
- improved.,PT 1348 1360 CHEMICAL
- edema.,CARDIAC 3109 3123 CHEMICAL
Punctuation marks are usually removed in NLP preprocessing steps, but we can’t remove all of them here: doing so could make us miss chemical names and would corrupt quantities like drug dosages. However, we can solve this problem by replacing the “.,” marks that appear to separate some sections of the transcription with a period and a space. It is important to know your data and its domain to better understand your results.
import re
med_transcript_small['transcription'] = med_transcript_small['transcription'].apply(lambda x: re.sub(r'\.,', '. ', x))  # raw string avoids the invalid-escape warning for "\."
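To check what this substitution does and does not touch, here is a small standalone sketch. The function name clean_section_breaks and the sample string are made up for illustration; the string just combines fragments that appear in the transcriptions above.

```python
import re


def clean_section_breaks(text):
    """Replace the '.,' section separator with '. ' and leave other punctuation alone."""
    return re.sub(r"\.,", ". ", text)


# Made-up string combining fragments seen in the transcriptions above.
raw = "septic phlebitis.,Chart was reviewed, and Xanax 0.25 mg was continued."
cleaned = clean_section_breaks(raw)
print(cleaned)
```

Only the “.,” sequence is rewritten, so decimal dosages such as 0.25 mg and ordinary commas stay intact.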
We can also check the entities using the displacy visualizer:
from spacy import displacy
displacy.render(doc[:100], style='ent', jupyter=True) # here I am printing just the first 100 tokens
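The jupyter=True argument only makes sense inside a notebook. Outside one, displacy.render can return the markup as a string instead, which can then be saved to a file. The sketch below assumes you want a standalone HTML page; it uses displaCy's manual mode with hand-written entity offsets on a fragment of the sample transcription, so it runs without the NER model. The file name entities.html and the helper make_ent are arbitrary choices for this example.

```python
from pathlib import Path

from spacy import displacy

# Fragment of the sample transcription shown earlier.
text = "history of iron-deficiency anemia due to chronic blood loss from colitis"


def make_ent(term, label):
    """Build an entity dict with character offsets for displaCy's manual mode."""
    start = text.find(term)
    return {"start": start, "end": start + len(term), "label": label}


# manual=True renders pre-computed entities given as plain dicts (sorted by start).
example = {
    "text": text,
    "ents": [
        make_ent("iron-deficiency anemia", "DISEASE"),
        make_ent("chronic blood loss", "DISEASE"),
        make_ent("colitis", "DISEASE"),
    ],
    "title": None,
}

# page=True wraps the markup in a full HTML page; jupyter=False returns a string.
html = displacy.render(example, style="ent", manual=True, page=True, jupyter=False)
Path("entities.html").write_text(html, encoding="utf-8")
```

With a document processed by the real pipeline, the equivalent call would be displacy.render(doc, style='ent', page=True, jupyter=False).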
Rule-based matching
Rule-based matching is similar to regular expressions, but spaCy’s rule-based matcher engines and components give you access to the tokens within the document and their relationships. We can combine this with the NER models to identify patterns that include our entities.
Let’s extract from the text the drug names and their reported dosages. This could help identify possible medication errors by checking whether the dosages agree with standards and guidelines.
from spacy.matcher import Matcher
pattern = [{'ENT_TYPE':'CHEMICAL'}, {'LIKE_NUM': True}, {'IS_ASCII': True}]
matcher = Matcher(nlp.vocab)
matcher.add("DRUG_DOSE", [pattern])
The code above creates a pattern to identify a sequence of three tokens:
- A token whose entity type is CHEMICAL (drug name)
- A token that resembles a number (dosage)
- A token that consists of ASCII characters (units, like mg or mL)
Then we initialize the Matcher with a vocabulary. The matcher must always share the same vocab with the documents it will operate on, so we use the nlp object vocab. We then add this pattern to the matcher and give it an ID.
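To see the pattern at work in isolation, here is a toy sketch that does not need the NER model. It assumes a blank English pipeline (tokenizer only) and labels the drug name by hand, which is what en_ner_bc5cdr_md would normally do for us. The sentence and the toy_ names are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# Blank English pipeline: tokenizer only, a stand-in for the real NER model.
toy_nlp = spacy.blank("en")
toy_doc = toy_nlp("She takes Aspirin 81 mg daily.")

# Hand-label "Aspirin" (token index 2) as CHEMICAL.
toy_doc.ents = [Span(toy_doc, 2, 3, label="CHEMICAL")]

# Same three-token pattern as above: CHEMICAL entity, number-like token, ASCII token.
toy_pattern = [{"ENT_TYPE": "CHEMICAL"}, {"LIKE_NUM": True}, {"IS_ASCII": True}]
toy_matcher = Matcher(toy_nlp.vocab)
toy_matcher.add("DRUG_DOSE", [toy_pattern])

matched = [toy_doc[start:end].text for _, start, end in toy_matcher(toy_doc)]
print(matched)
```

Note that the last token only has to be ASCII, so the pattern would accept a comma or a period in that position as well.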
Now we can loop through all transcriptions and extract the text matching this pattern:
for transcription in med_transcript_small['transcription']:
    doc = nlp(transcription)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id] # get string representation
        span = doc[start:end] # the matched span
        print(string_id, start, end, span.text)
DRUG_DOSE 137 140 Xylocaine 20 mL
DRUG_DOSE 141 144 Marcaine 0.25%
DRUG_DOSE 208 211 Aspirin 81 mg
DRUG_DOSE 216 219 Spiriva 10 mcg
DRUG_DOSE 399 402 nifedipine 10 mg
DRUG_DOSE 226 229 aspirin one tablet
DRUG_DOSE 245 248 Warfarin 2.5 mg
DRUG_DOSE 67 70 Topamax 100 mg
...
DRUG_DOSE 193 196 Metamucil one pack
DRUG_DOSE 207 210 Nexium 40 mg
DRUG_DOSE 1133 1136 Naprosyn one p.o
DRUG_DOSE 290 293 Lidocaine 1%
DRUG_DOSE 37 40 Altrua 60,
...
DRUG_DOSE 74 77 Lidocaine 1.5%
DRUG_DOSE 209 212 Dilantin 300 mg
DRUG_DOSE 217 220 Haloperidol 1 mg
DRUG_DOSE 225 228 Dexamethasone 4 mg
DRUG_DOSE 234 237 Docusate 100 mg
DRUG_DOSE 250 253 Ibuprofen 600 mg
DRUG_DOSE 258 261 Zantac 150 mg
...
DRUG_DOSE 204 207 epinephrine 7 ml
DRUG_DOSE 214 217 Percocet 5,
DRUG_DOSE 55 58 . 4.
DRUG_DOSE 146 149 . 4.
DRUG_DOSE 2409 2412 Naprosyn 375 mg
DRUG_DOSE 141 144 Wellbutrin 300 mg
DRUG_DOSE 146 149 Xanax 0.25 mg
DRUG_DOSE 158 161 omeprazole 20 mg
...
Nice, we did it!
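We successfully extracted drugs and dosages, including different kinds of units like mg, mL, %, and packs. A few matches, such as “Altrua 60,” and “. 4.”, are not real dosages: the third token only has to be ASCII, so punctuation matches too.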
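Matches such as “Altrua 60,” or “. 4.” in the output above can be filtered out after the fact. The sketch below is one possible post-filter, not part of spaCy: parse_dose is a hypothetical helper, and KNOWN_UNITS is an assumed, non-exhaustive list of unit tokens that you would need to adapt to your data. The token triples are copied from the output above.

```python
# Assumed, non-exhaustive set of unit tokens; extend it for your own data.
KNOWN_UNITS = {"mg", "mcg", "g", "ml", "%", "tablet", "tablets", "pack"}


def parse_dose(tokens):
    """Turn the three token texts of a DRUG_DOSE match into a record, or None."""
    drug, amount, unit = tokens
    if not any(ch.isalpha() for ch in drug):
        return None  # e.g. "." wrongly labeled as CHEMICAL
    if unit.lower() not in KNOWN_UNITS:
        return None  # e.g. "," or "." following the number
    return {"drug": drug, "amount": amount, "unit": unit}


# Token texts of a few matches from the output above.
examples = [
    ["Aspirin", "81", "mg"],
    ["Marcaine", "0.25", "%"],
    ["Altrua", "60", ","],
    [".", "4", "."],
]

records = [rec for rec in map(parse_dose, examples) if rec is not None]
for rec in records:
    print(rec)
```

With real matches, the input would be [token.text for token in doc[start:end]]. A similar restriction could also be written directly into the Matcher pattern by replacing the IS_ASCII token with something like {'LOWER': {'IN': ['mg', 'ml', 'mcg', '%']}}.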
Conclusions
Here we learned how to use some features of scispaCy and spaCy, like NER and rule-based matching. We used one NER model, but there are lots of others and you should totally check them out. For instance, the en_ner_bionlp13cg_md model can identify anatomical parts, tissues, cell types, and more. Imagine what else you could do with that!
We also didn’t focus too much on preprocessing steps, but they are fundamental to getting better results. Don’t forget to explore your data and adapt the preprocessing steps to the NLP tasks you want to do.
References
Neumann, M., King, D., Beltagy, I., & Ammar, W. (2019). Scispacy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669.
Honnibal, M., Montani, I., Van Landeghem, S., & Boyd, A. (2020). spaCy: Industrial-strength Natural Language Processing in Python.