Introduction

Data acquisition for observational clinical studies is a time-consuming and costly process that requires medical expertise. Despite the widespread adoption of electronic health records (EHRs) in Poland, this process has not been significantly accelerated, because only a fraction of the data collected in EHRs are structured (eg, billing codes, such as the International Classification of Diseases, 10th Revision [ICD-10] classifications, or laboratory examinations); the rest are unstructured, for example, textual reports. Despite the known limitations of Polish and other health care systems,1-3 billing codes are widely used for large population studies in cardiology and internal medicine due to a lack of alternative solutions for time-efficient large population data collection.

Natural language processing (NLP) technologies allow computers to interpret, manipulate, and comprehend human language. The public release of ChatGPT, an NLP-based chatbot, sparked interest in text processing among researchers, showing a potential to facilitate the scientific process. Globally, considerable effort has been made to develop NLP tools that unlock the potential of textual data in EHRs. Most of these efforts have traditionally focused on English, due to the availability of datasets and more advanced text-processing tools for this language. The availability of good-quality EHR-derived data is vital for clinical research progress and essential for conducting meaningful projects that utilize artificial intelligence in medicine. This aligns with the objectives of the European Health Data Space regulation proposal,4 which aims for better health care data availability for patients, researchers, and industry in the European Union; therefore, progress in efficient NLP application in cardiology is desired.

Our manuscript presents a practical use case of NLP in the AssistMED project, during which we developed a set of NLP-based tools for EHR data analysis in Polish. We demonstrate how NLP techniques can be used to continue the CRAFT registry (Multicenter Experience in Atrial Fibrillation Patients Treated with Oral Anticoagulants; NCT02987062), with fully automatic data retrieval and manual validation by humans. Herein, we present a complete workflow of data acquisition with the AssistMED tool, briefly describe the design of our solution, and provide a concise discussion of NLP applied to medical documentation for research data acquisition.

Patients and methods

Rationale for and functioning of AssistMED

The idea of the project stemmed from an observation that discharge reports in Poland follow a typical format and contain equivalent textual fields. The data types for analysis are, therefore, 1) descriptive discharge diagnoses, 2) discharge recommendations, and 3) echocardiography reports (if present). Such data are stored in the EHR system in an organized form, and may be acquired by a clinical researcher in cooperation with the hospital information technology department. Data acquisition requires a legal analysis and consent of the institutional executive and data protection offices. Our study followed both these steps. The National Health Fund central electronic documentation platform could be a data source for even larger-scale applications in Poland.

Our algorithm accepts data in Excel spreadsheet format (.xlsx) (data examples in Polish are presented in Supplementary material, Table S1). Anonymized data can be uploaded to an online or offline computer application, and data analysis can be initiated. Currently, the algorithm can detect 72 clinical conditions related to cardiology and internal medicine, medications from 22 drug classes, and 15 numeric echocardiographic parameters. New diagnoses and medications can be added. The analyzed data are presented in an Excel spreadsheet, and a basic statistical report can be provided. These outputs represent the basic clinical characteristics of a patient cohort required for any clinical research.

Algorithm implementation

The algorithm itself utilizes NLP for entity recognition. For diagnosis detection, we systematically established a database of possible expressions related to each clinical condition in Polish, as proposed by clinicians (KO, PB, PL, MG, AC, and CM). The diagnoses are appropriately structured, recognizing that a specific diagnosis often signifies the presence of another, more general disease (eg, a history of coronary artery bypass grafting indicates the presence of coronary artery disease, and a diagnosis of carotid artery disease means that a patient has atherosclerosis). The NLP techniques we adopted allow for flexibility in recognition: a condition is recognized despite minor typos or different word ordering, owing to similarity calculations against the examples in the database.
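As an illustrative sketch only (not the actual AssistMED implementation, whose dictionary is in Polish and covers 72 conditions), such typo-tolerant recognition can be approximated by sliding a window over the text and scoring its similarity to dictionary expressions; the condition names and the threshold below are assumptions:

```python
import difflib

# Hypothetical mini-dictionary mapping canonical conditions to example expressions.
CONDITION_EXPRESSIONS = {
    "atrial fibrillation": ["atrial fibrillation", "paroxysmal atrial fibrillation"],
    "heart failure": ["heart failure", "chronic heart failure"],
}

def detect_conditions(text, threshold=0.85):
    """Return conditions whose example expressions fuzzily match a span of text."""
    words = text.lower().split()
    found = set()
    for condition, expressions in CONDITION_EXPRESSIONS.items():
        for expr in expressions:
            n = len(expr.split())
            # slide a window of the expression's word length over the text
            for i in range(len(words) - n + 1):
                window = " ".join(words[i:i + n])
                ratio = difflib.SequenceMatcher(None, window, expr).ratio()
                if ratio >= threshold:  # tolerate minor typos
                    found.add(condition)
    return found
```

With this threshold, a misspelled "atrial fibrilation" still scores high enough against the dictionary entry to be recognized, which is the kind of flexibility described above.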

For medication detection, we created a database of medicinal products available on the Polish market by retrieving information from the Office for Registration of Medicinal Products, Medical Devices, and Biocidal Products in Poland. However, some corrections were necessary, as the Anatomical Therapeutic Chemical Classification System utilized in such databases does not always reflect the clinically used categorization of medical substances. Additional ordering and unification of the terminology on substances, guided by a consensus reached within our clinical team, was required. Data are structured and can be retrieved at 3 levels of detail: drug class, active substance, and dosing. For dose detection, our clinicians (KO, PB, PL, MG, AC, and CM) developed a list of the most common patterns of drug dosing description in discharge recommendations. Based on this, we designed our dosage detection rules, supported by the publicly available Med7 machine-learning module for better accuracy. Dosing detection is not possible for compound drugs.
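A minimal sketch of rule-based dose extraction is shown below; the patterns and the daily-dose arithmetic are illustrative assumptions (the actual AssistMED rule set was clinician-derived, operates on Polish text, and is additionally supported by Med7):

```python
import re

# Illustrative dosing patterns, eg "5 mg 1-0-1" (morning-noon-evening) or "500 mg 2 x 1".
DOSE_RE = re.compile(
    r"(?P<strength>\d+(?:[.,]\d+)?)\s*(?P<unit>mg|g|ml)"      # e.g. "5 mg"
    r"(?:\s+(?P<scheme>\d+(?:-\d+){2}|\d+\s*x\s*\d+))?",      # e.g. "1-0-1" or "2 x 1"
    re.IGNORECASE,
)

def daily_intake_count(scheme):
    """Translate a dosing scheme into the number of unit doses per day."""
    if scheme is None:
        return None
    if "-" in scheme:                      # "1-0-1" -> 2 doses per day
        return sum(int(p) for p in scheme.split("-"))
    times, per_dose = re.split(r"\s*x\s*", scheme, flags=re.IGNORECASE)
    return int(times) * int(per_dose)      # "2 x 1" -> 2 doses per day

def parse_dose(text):
    m = DOSE_RE.search(text)
    if not m:
        return None
    strength = float(m.group("strength").replace(",", "."))
    count = daily_intake_count(m.group("scheme"))
    daily = strength * count if count is not None else None
    return {"strength": strength, "unit": m.group("unit").lower(), "daily": daily}
```

When no recognized scheme follows the strength, the daily dose stays undetected, which mirrors the missing-dosage behavior discussed in the Results.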

As for echocardiographic analysis, the clinicians suggested a list of required echocardiographic numeric parameters and their possible expressions. Based on this set of rules, we designed a system to detect the parameters and their values. Each read parameter value is subsequently normalized to a universal unit of measurement and checked for plausibility against upper and lower boundaries derived from the range of clinically possible values for that parameter. If a read value falls outside this range, an appropriate note is included in the final output.
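The normalization and plausibility check can be sketched as follows; the parameter names, conversion factors, and reference ranges below are illustrative assumptions, not the actual AssistMED reference table:

```python
# Hypothetical reference table: target unit, plausible range, and unit conversion factors.
PARAMS = {
    "LVEF":  {"unit": "%",  "range": (5, 80), "factors": {"%": 1.0}},
    "TAPSE": {"unit": "mm", "range": (2, 50), "factors": {"mm": 1.0, "cm": 10.0}},
}

def normalize(param, value, unit):
    """Convert a read value to the parameter's universal unit and flag implausible results."""
    spec = PARAMS[param]
    converted = value * spec["factors"][unit]   # normalize to the target unit
    lo, hi = spec["range"]
    plausible = lo <= converted <= hi
    note = None if plausible else (
        f"{param} value {converted} {spec['unit']} outside plausible range ({lo}-{hi})"
    )
    return {"value": converted, "unit": spec["unit"], "plausible": plausible, "note": note}
```

For example, a TAPSE reported in centimeters is converted to millimeters, while a value outside the configured range is passed through with a warning note rather than silently dropped.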

The algorithm knowledge base, rules, and parameters were iteratively modified on a random sample of 200 manually annotated records from our cardiology department to reach its current performance. A technical description of the implementation of the AssistMED algorithms has been documented and published in research literature.5

Patients

The study material consisted of anonymized documentation of 10 314 consecutive patients discharged from a single tertiary cardiology center between January 1, 2016 and July 15, 2019. For patients hospitalized more than once (n = 2598), only the latest hospitalization was considered for analysis, to retrieve the most current available data. Therefore, we had 7716 individual patient records available for analysis.

The main inclusion criteria of the retrospective CRAFT registry were a diagnosis of atrial fibrillation (AF) and anticoagulation with oral anticoagulants (OACs); therefore, we focused on comprehensive characterization of patients with AF diagnosis.

The entire available database (7716 records) was subjected to an automatic NLP-based analysis by the AssistMED algorithms to extract data on clinical conditions, medications with dosage, and echocardiographic parameters if echocardiography was performed. Each annotation item suggested by the algorithm was subsequently reviewed by a human (patients with an AF diagnosis confirmed by human assessment had all their characteristics manually verified; patients without confirmed AF were not verified for other medical conditions, drugs, etc). Inaccurate suggestions of the algorithm were corrected through a single round of human verification in a dedicated data review and correction module (Supplementary material, Figure S1), with the task split between 2 annotators who had to accept, reject, or correct each algorithm suggestion (diagnosis, drug, dose) or add potentially missing data.

Additionally, we conducted a separate analysis of a sample of 100 records randomly selected from the entire dataset of 7716 patients (with and without AF diagnosis), in which both annotators (AB, MC) independently analyzed the data without suggestions from the AssistMED analysis. The results from the annotators were compared to check the interannotator agreement. Data classification by the annotators was also compared with the automatic classification, to determine whether the results differed when the annotators were blinded to the algorithm suggestions.

The annotators did not participate in the algorithm design and knowledge base preparation to limit annotation bias.

Statistical analysis

The Cohen κ coefficient was used to assess inter-rater reliability between the annotators, as well as the AssistMED algorithm performance in comparison with human judgment. Importantly, for numeric variables (echo parameter values, daily dosage of medications), only complete agreement between the parameter values of the 2 compared classifications was treated as an agreement for the κ coefficient calculation (unavailable data or minor differences in values between the 2 classifications were treated as disagreements). The κ coefficient was interpreted as follows: values equal to or below 0 indicated no agreement, those between 0.01 and 0.2 denoted none to slight agreement, between 0.21 and 0.4 poor agreement, between 0.41 and 0.6 moderate agreement, between 0.61 and 0.8 substantial agreement, and between 0.81 and 1 almost perfect agreement. Of note, a P value below 0.05 for the Cohen κ coefficient indicated that the degree of agreement (Cohen κ value) was statistically significant.
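As a minimal sketch, the κ statistic for two paired classifications can be computed as follows (for numeric variables, each exact value, with "missing" as its own label, can be fed in directly so that only complete agreement counts, per the rule above):

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two paired label sequences of equal length."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    ca, cb = Counter(a), Counter(b)
    # chance agreement expected from each rater's marginal label frequencies
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / n**2
    return (observed - expected) / (1 - expected)
```

This is the same quantity reported by standard packages (the study used pyirr); the sketch is only meant to make the chance-correction explicit.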

Continuous variables were presented as median and interquartile range (IQR), while categorical variables were reported as frequency and percentage. The Fisher exact test was used to compare the frequency of categorical variables (eg, diagnosis, drug class), and the Wilcoxon signed-rank test was used for comparing continuous variables (eg, daily drug dosage, echo parameter values). For categorical variables (diagnosis, drug class), point estimates for sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) with human annotation as reference were calculated.
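As an illustrative sketch, the point estimates above can be derived from paired binary classifications, with human annotation as the reference standard:

```python
def binary_metrics(pred, truth):
    """Sensitivity, specificity, PPV, and NPV with `truth` as the reference."""
    tp = sum(p and t for p, t in zip(pred, truth))            # true positives
    tn = sum((not p) and (not t) for p, t in zip(pred, truth))  # true negatives
    fp = sum(p and (not t) for p, t in zip(pred, truth))      # false positives
    fn = sum((not p) and t for p, t in zip(pred, truth))      # false negatives
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }
```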

A P value below 0.05 was considered significant for all tests. All tests were 2-tailed.

Statistical analyses and calculations were performed using Python (scipy.stats v1.9.3, pyirr v0.84.1.2, and NumPy 1.22.3 libraries; Python Software Foundation, Wilmington, Delaware, United States).

Total time required for manual annotation of the data portion was defined as the time spent actively within the annotation module of the AssistMED web application. The time for automatic annotation was defined as the total time from the process initiation in the web application to generation of the final report.

Ethics statement

Due to the observational design of the study, involving anonymized patient data, neither the ethics committee's approval nor the patients' informed consent was required.

Results

In Figure 1 we present the patient flow in the study. First, a separate analysis was performed on a sample of 100 records annotated independently by the 2 annotators. Its results indicated an almost perfect agreement for most items (moderate agreement for peripheral artery disease and cardiac resynchronization therapy), signifying equivalence of the 2 annotators' judgment. Detailed comparisons between the studied items can be found in Supplementary material, Tables S2–S5.

Figure 1. Flowchart presenting the study design

Abbreviations: AF, atrial fibrillation; NLP, natural language processing

In the entire available dataset, the algorithm identified 3029 and the annotators 3030 patients with AF. The accuracy, sensitivity, and specificity of the automatic AF identification were 99.93%, 99.9%, and 99.96%, respectively. The algorithm falsely identified 2 AF diagnoses and missed 3 AF diagnoses. OACs were taken by 2601 of these patients according to AssistMED, and 2624 according to the annotators. The algorithm's accuracy in identifying patients fulfilling both principal CRAFT registry inclusion criteria (presence of AF and OAC prescription) was 99.5%, sensitivity 98.8%, and specificity 99.8%, indicating high agreement. This signifies that, if used for the CRAFT registry, the AssistMED-identified cohort would consist of largely the same patients as the human-identified cohort.6

There were 3032 individual patient records in the combined dataset available for comparison (the 5 AF records inaccurately identified by AssistMED, that is, 2 false positives and 3 false negatives, were included in all further analyses). All analyses were performed on dependent samples; thus, the same records were required in the human- and AssistMED-identified cohorts. Baseline characteristics regarding concomitant diseases, drugs, and echocardiographic parameters were determined for both cohorts according to the identification method.

The major aspects of the baseline characteristics obtained by automated and manual retrieval are presented in Tables 1, 2, and 3 (the rest of the available data are provided in Supplementary material, Tables S6–S7).

Table 1. Diagnosis detection (n = 3032), AssistMED performance in comparison with human annotators

| Diagnosis | Cases detected by a human, n (%) | Cases detected by AssistMED, n (%) | Sensitivity | Specificity | PPV | NPV | P value (Fisher test) | Cohen κ | P value (Cohen κ) |
|---|---|---|---|---|---|---|---|---|---|
| Heart failure | 1393 (45.96) | 1389 (45.83) | 1 | 1 | 1 | 1 | 0.94 | 0.99 | <0.001 |
| Hypertension | 2173 (71.69) | 2179 (71.89) | 1 | 0.99 | 1 | 1 | 0.89 | 0.99 | <0.001 |
| Poorly controlled hypertension | 16 (0.53) | 18 (0.59) | 0.94 | 1 | 0.83 | 1 | 0.86 | 0.88 | <0.001 |
| Diabetes | 893 (29.46) | 895 (29.53) | 1 | 1 | 1 | 1 | 0.98 | 1 | <0.001 |
| Diabetes and glycemic disorders | 1092 (36.03) | 1078 (35.57) | 0.984 | 1 | 1 | 0.99 | 0.73 | 0.98 | <0.001 |
| Ischemic stroke | 250 (8.25) | 250 (8.25) | 0.99 | 1 | 0.99 | 1 | 1 | 0.99 | <0.001 |
| Ischemic stroke or TIA | 303 (10) | 302 (9.96) | 0.99 | 1 | 0.99 | 1 | >0.99 | 0.99 | <0.001 |
| Ischemic stroke or TIA or systemic embolism | 310 (10.23) | 322 (10.62) | 0.99 | 1 | 0.95 | 1 | 0.64 | 0.97 | <0.001 |
| Atherosclerosis (any evidence) | 1401 (46.22) | 1416 (46.72) | 1 | 1 | 0.99 | 1 | 0.72 | 0.98 | <0.001 |
| Carotid artery disease | 113 (3.73) | 109 (3.6) | 0.94 | 1 | 0.97 | 1 | 0.84 | 0.95 | <0.001 |
| PCI | 653 (21.54) | 646 (21.31) | 0.99 | 1 | 1 | 1 | 0.85 | 0.99 | <0.001 |
| CABG | 227 (7.49) | 229 (7.56) | 1 | 1 | 0.99 | 1 | 0.96 | 1 | <0.001 |
| STEMI | 122 (4.03) | 102 (3.37) | 0.84 | 1 | 1 | 0.99 | 0.2 | 0.91 | <0.001 |
| NSTEMI | 282 (9.3) | 278 (9.17) | 0.99 | 1 | 1 | 1 | 0.89 | 0.99 | <0.001 |
| MI (any type) | 745 (24.58) | 742 (24.48) | 1 | 1 | 1 | 1 | 0.95 | 1 | <0.001 |
| Coronary artery disease | 1348 (44.47) | 1350 (44.54) | 1 | 1 | 1 | 1 | 0.98 | 1 | <0.001 |
| Gastrointestinal bleeding | 123 (4.06) | 119 (3.93) | 0.97 | 1 | 1 | 1 | 0.84 | 0.98 | <0.001 |
| Intracranial bleeding | 31 (1.02) | 30 (0.99) | 0.94 | 1 | 0.97 | 1 | >0.99 | 0.95 | <0.001 |
| Labile INR | 12 (0.4) | 12 (0.4) | 1 | 1 | 1 | 1 | 1 | 1 | <0.001 |
| Liver disease | 12 (0.4) | 9 (0.3) | 0.75 | 1 | 1 | 1 | 0.66 | 0.86 | <0.001 |
| Chronic kidney disease | 872 (28.77) | 873 (28.8) | 0.99 | 1 | 0.99 | 1 | 0.98 | 0.99 | <0.001 |
| Alcoholism | 27 (0.89) | 27 (0.89) | 0.96 | 1 | 0.96 | 1 | 1 | 0.96 | <0.001 |

Abbreviations: CABG, coronary artery bypass graft; INR, international normalized ratio; MI, myocardial infarction; NPV, negative predictive value; NSTEMI, non–ST-segment elevation myocardial infarction; PCI, percutaneous coronary intervention; PPV, positive predictive value; STEMI, ST-segment elevation myocardial infarction; TIA, transient ischemic attack

Table 2. Drug groups, active substances, and dosage detection (n = 3032), AssistMED performance in comparison with human annotators

| Drug class | Cases detected by a human, n (%) | Cases detected by AssistMED, n (%) | Sensitivity | Specificity | PPV | NPV | P value (Fisher test) | Cohen κ (drug group agreement) | P value (Cohen κ, drug group agreement) | Cohen κ (active substance agreement) | P value (Cohen κ, active substance agreement) | Cohen κ (active substance and dose agreement) | P value (Cohen κ, active substance and dose agreement) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ACEI | 1574 (51.91) | 1564 (51.58) | 0.99 | 0.997 | 0.997 | 0.990 | 0.82 | 0.987 | <0.001 | 0.986 | <0.001 | 0.907 | <0.001 |
| Amiodarone | 220 (7.26) | 221 (7.29) | 0.991 | 0.999 | 0.986 | 0.999 | >0.99 | 0.988 | <0.001 | 0.988 | <0.001 | 0.604 | <0.001 |
| Antiarrhythmic 1c | 144 (4.75) | 132 (4.35) | 0.91 | 1 | 0.992 | 0.996 | 0.5 | 0.947 | <0.001 | 0.947 | <0.001 | 0.699 | <0.001 |
| ASA | 390 (12.86) | 405 (13.36) | 0.997 | 0.994 | 0.960 | 1 | 0.59 | 0.975 | <0.001 | 0.975 | <0.001 | 0.271 | <0.001 |
| β-Blocker | 2504 (82.59) | 2474 (81.6) | 0.986 | 0.989 | 0.998 | 0.935 | 0.33 | 0.953 | <0.001 | 0.953 | <0.001 | 0.906 | <0.001 |
| CCBs (dihydropyridine) | 724 (23.88) | 724 (23.88) | 0.986 | 0.996 | 0.986 | 0.996 | 1 | 0.982 | <0.001 | 0.983 | <0.001 | 0.897 | <0.001 |
| CCBs (non-dihydropyridine) | 20 (0.66) | 20 (0.66) | 1 | 1 | 1 | 1 | 1 | 1 | <0.001 | 1 | <0.001 | 0.933 | <0.001 |
| Digoxin | 297 (9.8) | 298 (9.83) | 0.997 | 0.999 | 0.993 | 1 | >0.99 | 0.994 | <0.001 | 0.994 | <0.001 | 0.656 | <0.001 |
| SGLT2i | 18 (0.59) | 19 (0.63) | 1 | 1 | 0.947 | 1 | >0.99 | 0.973 | <0.001 | 0.973 | <0.001 | 0.336 | <0.003 |
| Gliptin | 34 (1.12) | 32 (1.06) | 0.941 | 1 | 1 | 0.999 | 0.9 | 0.969 | <0.001 | 0.969 | <0.001 | 0.801 | <0.001 |
| GLP-1 agonist | 2 (0.07) | 3 (0.1) | 1 | 1 | 0.667 | 1 | >0.99 | 0.8 | <0.001 | 0.8 | <0.001 | 0.571 | <0.03 |
| Metformin | 498 (16.43) | 493 (16.26) | 0.990 | 1 | 1 | 0.998 | 0.89 | 0.994 | <0.001 | 0.994 | <0.001 | 0.840 | <0.001 |
| MRA | 827 (27.28) | 810 (26.72) | 0.976 | 0.999 | 0.996 | 0.991 | 0.64 | 0.981 | <0.001 | 0.981 | <0.001 | 0.902 | <0.001 |
| NSAID | 5 (0.17) | 6 (0.2) | 0.8 | 0.999 | 0.667 | 1 | >0.99 | 0.727 | <0.001 | 0.727 | <0.001 | 0.282 | <0.11 |
| Heparin | 245 (8.08) | 283 (9.33) | 0.992 | 0.986 | 0.859 | 0.999 | 0.09 | 0.913 | <0.001 | 0.901 | <0.001 | 0.448 | <0.001 |
| NOAC | 1879 (61.97) | 1860 (61.35) | 0.988 | 0.997 | 0.998 | 0.981 | 0.63 | 0.983 | <0.001 | 0.983 | <0.001 | 0.743 | <0.001 |
| VKA | 747 (24.64) | 753 (24.84) | 0.992 | 0.995 | 0.984 | 0.997 | 0.88 | 0.984 | <0.001 | 0.981 | <0.001 | 0.282 | <0.001 |
| Antiplatelet | 336 (11.08) | 340 (11.21) | 1 | 0.999 | 0.988 | 1 | 0.9 | 0.993 | <0.001 | 0.993 | <0.001 | 0.062 | <0.001 |
| ARB | 590 (19.46) | 572 (18.87) | 0.968 | 1 | 0.998 | 0.992 | 0.58 | 0.979 | <0.001 | 0.979 | <0.001 | 0.84 | <0.001 |
| Sotalol | 53 (1.75) | 57 (1.88) | 1 | 0.999 | 0.930 | 1 | 0.77 | 0.963 | <0.001 | 0.963 | <0.001 | 0.784 | <0.001 |
| Statin | 1931 (63.69) | 1911 (63.03) | 0.982 | 0.987 | 0.993 | 0.97 | 0.61 | 0.966 | <0.001 | 0.966 | <0.001 | 0.92 | <0.001 |
| Sulfonylurea | 239 (7.88) | 222 (7.32) | 0.925 | 1 | 0.996 | 0.994 | 0.44 | 0.955 | <0.001 | 0.955 | <0.001 | 0.858 | <0.001 |

Abbreviations: ACEI, angiotensin-converting enzyme inhibitor; ARB, angiotensin II receptor blocker; ASA, acetylsalicylic acid; CCB, calcium channel blocker; GLP-1, glucagon-like peptide 1 agonist; MRA, mineralocorticoid receptor antagonist; NSAID, nonsteroidal anti-inflammatory drug; NOAC, non–vitamin K antagonist oral anticoagulant; SGLT2i, sodium glucose cotransporter 2 inhibitor; VKA, vitamin K antagonist; others, see Table 1

Table 3. Echocardiographic parameters (n = 3032), AssistMED performance in comparison with human annotators

| Echocardiographic parameter | Cases detected by a human, n (%) | Median (IQR) (human) | Cases detected by AssistMED, n (%) | Median (IQR) (AssistMED) | Cohen κ (full agreement in parameter value) | P value (Cohen κ) | P value (Wilcoxon test) |
|---|---|---|---|---|---|---|---|
| AVA | 126 (11.55) | 1.56 (1.1–1.8) | 126 (11.55) | 1.6 (1.1–1.8) | 0.96 | <0.001 | 0.69 |
| AVAi | 73 (6.69) | 0.73 (0.4–0.9) | 67 (6.14) | 0.73 (0.41–0.9) | 0.92 | <0.001 | >0.99 |
| AcT | 556 (50.96) | 102 (92–120) | 557 (51.05) | 102 (92–120) | 0.99 | <0.001 | >0.99 |
| Ao | 908 (83.23) | 3.5 (3.2–3.8) | 905 (82.95) | 3.5 (3.2–3.8) | 0.99 | <0.001 | >0.99 |
| IVS | 939 (86.07) | 1.2 (1.1–1.3) | 941 (86.25) | 1.2 (1.1–1.3) | 0.99 | <0.001 | >0.99 |
| LA | 930 (85.24) | 4.7 (4.2–5.1) | 928 (85.06) | 4.7 (4.2–5.1) | 0.98 | <0.001 | 0.95 |
| LAA | 602 (55.18) | 30 (26–36) | 599 (54.9) | 30 (26–35.6) | 0.96 | <0.001 | 0.74 |
| LVDD | 948 (86.89) | 5.1 (4.6–5.7) | 950 (87.08) | 5.1 (4.6–5.7) | 0.99 | <0.001 | 0.95 |
| LVEF | 1009 (92.48) | 54 (44–60) | 1010 (92.58) | 54 (44–60) | 0.99 | <0.001 | 0.95 |
| PWD | 917 (84.05) | 1.1 (1–1.2) | 914 (83.78) | 1.1 (1–1.2) | 0.99 | <0.001 | 0.95 |
| RAA | 508 (46.56) | 26 (22–31) | 509 (46.65) | 26 (22–31) | 0.97 | <0.001 | 0.86 |
| RV | 915 (83.87) | 3 (2.8–3.3) | 911 (83.5) | 3 (2.8–3.3) | 0.97 | <0.001 | 0.95 |
| SPAP | 167 (15.31) | 46 (39–55) | 170 (15.58) | 46 (39–55) | 0.97 | <0.001 | >0.99 |
| TAPSE | 563 (51.6) | 21 (18–24) | 561 (51.42) | 21 (18–24) | 0.98 | <0.001 | >0.99 |
| TRPG | 590 (54.08) | 27 (22–35) | 599 (54.9) | 27 (22–35) | 0.97 | <0.001 | 0.93 |

Abbreviations: AVA, aortic valve area; AVAi, indexed aortic valve area; AcT, pulmonary acceleration time; Ao, aortic diameter; IQR, interquartile range; IVS, interventricular septum diameter; LA, left atrial anteroposterior diameter; LAA, left atrial area; LVDD, left ventricular diastolic diameter; LVEF, left ventricular ejection fraction; PWD, posterior wall diameter; RAA, right atrial area; RV, right ventricle; SPAP, estimated systolic pulmonary arterial pressure; TAPSE, tricuspid annulus plane systolic excursion; TRPG, tricuspid regurgitation pressure gradient

For diagnosis detection, in most cases, there was an almost perfect agreement between the AssistMED and the annotators; the lowest (moderate) agreement was identified for type 1 diabetes, a history of valvuloplasty, and a history of systemic embolism.

For medication detection, there was an almost perfect agreement in drug group and active substance identification. Accurate identification of dosage proved most challenging (Table 2; Supplementary material, Table S7), as the algorithm failed to detect the dosage more frequently than the annotators (reflected as more missing dosage data). Most drugs showed a substantial agreement in dose detection (the lowest, that is, slight agreement was identified for antiplatelet dose detection). The disagreements were primarily attributed to a lack of dose detection (more missing data on dosage for the automatic data retrieval method; Supplementary material, Table S7), and not to incorrect dosage recognition. This was reflected by a low Cohen κ (low agreement) but no significant differences in the identified dosages (the paired Wilcoxon test omits cases with no available dosage). Such a situation was observed for antiplatelets and acetylsalicylic acid (Table 2; Supplementary material, Table S7). For vitamin K antagonists, however, there was a significant difference in dose detection.

For echocardiography, we found an almost perfect agreement for all detected parameters.

The calculated CHA2DS2VASc and HAS-BLED scores based on both classifications (human vs algorithm) yielded the following results: median (IQR), 3 (2–5) vs 3 (2–5); P = 0.74 and 1 (1–2) vs 1 (1–2); P = 0.63, respectively, indicating equivalent assessment of thrombosis and bleeding risk for both methods.
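The CHA2DS2VASc score compared above follows its standard additive definition and can be computed directly from the extracted characteristics; a minimal sketch (argument names are illustrative):

```python
def cha2ds2_vasc(age, female, chf, hypertension, diabetes,
                 stroke_tia_te, vascular_disease):
    """Standard CHA2DS2-VASc score from boolean patient characteristics and age."""
    score = 0
    score += 2 if age >= 75 else (1 if age >= 65 else 0)  # A2 / A components
    score += 1 if female else 0                            # Sc: sex category
    score += 1 if chf else 0                               # C: congestive heart failure
    score += 1 if hypertension else 0                      # H
    score += 1 if diabetes else 0                          # D
    score += 2 if stroke_tia_te else 0                     # S2: stroke/TIA/thromboembolism
    score += 1 if vascular_disease else 0                  # V
    return score
```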

A complete automatic detection took 3 hours and 15 minutes (about 6.5 min per 100 records), while human verification of the algorithm's work took 71 hours and 12 minutes (about 2 h and 22 min per 100 records). The analysis of 100 records blinded to the algorithm indications revealed that the first annotator spent 5 hours and 50 minutes on the task, while the second spent 4 hours and 44 minutes. Therefore, the mean manual-only annotation time was 5 hours and 17 minutes per 100 patient records analyzed. Based on this, human-only collection of the entire database would take an estimated 159 hours. This signifies that automated retrieval was 20 times faster than human verification and 50 times faster than fully manual retrieval. Human verification of the algorithm suggestions was 2.2 times faster than fully manual retrieval.

The separate sample of 100 records annotated blinded to the AssistMED classification was additionally evaluated. The results achieved for this dataset by the algorithm and by the annotators were compared and are presented in Supplementary material, Tables S8–S15. The results indicated an almost perfect agreement between the automatic and manual analysis for diagnoses, medications (with the same reduced agreement in drug dose identification as in the main cohort), and echocardiographic parameters, indicating that working on the algorithm suggestions did not bias the annotators' judgment.

Discussion

The results indicated that NLP-based cohort acquisition yielded a cohort highly similar to that retrieved by human annotators. Unsurprisingly, automatic detection was more rapid. Our findings indicate that utilization of NLP may enable a comprehensive assessment of multiple cardiovascular and internal diseases, medications, dosing, and numeric echocardiographic parameters. Dosage detection was, unsurprisingly, the most challenging for the algorithm, as it is the most granular feature. This algorithm behavior was documented in our prior publication,5 which qualitatively described the sources of such errors.

To present the results in the context of conventional and widely used automatic cohort retrieval methodologies, we may compare them with the accuracy of the ICD-10 diagnostic codes from the National Health Fund database. Disease characterization of the cohort utilizing AssistMED was more accurate than the analysis of administrative ICD-10 codes, according to our previous study.2 In that study, comparing manually gathered data in the CRAFT registry with the ICD-10–based data, we demonstrated a sensitivity of 83% for AF detection, 82% for heart failure, 89% for hypertension, and 69% for thromboembolic events, to mention a few. Specificity was generally acceptable but varied depending on the condition (32% for hypertension and 40% for atherosclerosis). Ultimately, all these inaccuracies translated into final patient baseline characteristics that looked substantially different from those observed in the CRAFT registry, with significant differences in the estimated CHA2DS2VASc and HAS-BLED scores.7

According to a systematic review on text processing in medicine,8 NLP applications are popular in the cardiovascular field, likely due to the need for large patient cohorts and a higher percentage of unstructured data than in other medical specialties. The AssistMED project is one of the first in Poland, with a vast spectrum of data recovered simultaneously, as compared with other studies, and with a large validation cohort.9 Output data are tailored to the needs of clinical researchers in an inpatient setting and, therefore, have a potential for broader application. The proposed design facilitates acquisition of output data that are appropriately structured, allowing for rapid analysis and comprehensibility for clinical researchers. Furthermore, an automatic summary statistics generation module may facilitate the initial stage of a research project, for example, by providing information on the number of patients with specific characteristics in the past to assess the feasibility of future study recruitment.

The following section will briefly describe various text-processing solutions used in electronic medical documentation from a technical perspective and will provide examples of vital research problems they solve.

Landscape of text processing of medical documentation for clinical data retrieval

Rule-based and dictionary-based algorithms

The simplest text-processing methods from a technical perspective are rule-based and dictionary-based algorithms that enable detection of specified patterns in the presented text. Their development includes establishing a database of the terms, expressions, or patterns in the data that need to be recognized. This process often requires cooperation with text-processing experts. An advantage of this approach is its predictability, that is, errors are easily tracked, and the algorithm can be gradually improved. Its main drawbacks are a lack of flexibility (even a typo in the text can make it unrecognizable to the algorithm) and generalizability issues (the developed dictionaries and patterns are most pertinent to the data for which the algorithm was developed, and thus may perform unsatisfactorily at other institutions, and are only applicable to the same language). Despite these limitations, there are multiple successful examples in the literature.
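A minimal dictionary-based detector illustrates both the approach and its brittleness; the patterns below are illustrative, not taken from any cited system:

```python
import re

# Minimal dictionary: each condition maps to an exact (regex) pattern.
RULES = {
    "hypertension": re.compile(r"\b(arterial\s+)?hypertension\b", re.IGNORECASE),
    "diabetes": re.compile(r"\bdiabetes(\s+mellitus)?\b", re.IGNORECASE),
}

def rule_based_detect(text):
    """Return the set of conditions whose pattern matches the text exactly."""
    return {name for name, pattern in RULES.items() if pattern.search(text)}
```

Note that a single typo ("hypertenison") defeats the exact pattern, which is precisely the lack of flexibility described above.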

One of the largest and most clinically sound examples is a study by van Dijk et al.3 The authors utilized a text mining technique based on designed regular expressions for the entire EHR documentation during the prescreening and data collection phases of the LoDoCo2 trial (Low Dose Colchicine for Secondary Prevention of Cardiovascular Disease).10 Mean accuracy, sensitivity, and specificity of the automatically extracted data were 88%, 81%, and 83%, respectively. The lowest accuracy was found for hypertension (62.6%), antiplatelet therapy (68.8%), and β-blocker use (73.3%). Despite these limitations, the tool narrowed manual screening to only 20.1% of the original 92 466 patients considered for trial inclusion, and 82.4% of the final participants were recruited through this prescreening method, making it a remarkably time-efficient solution.

Another study, by Karystianis et al,11 extracted mentions of 5 diseases, smoking status, family history, and medications from clinical notes. The project was an ambitious attempt to use less stereotypical textual data, that is, daily clinical notes. The authors utilized a rule-based approach with dictionaries developed especially for that purpose. Average sensitivity reached 90%. Errors were predominantly due to a lack of context analysis and unforeseen abbreviations frequently used in clinical notes. Additionally, coronary artery disease was deemed the most challenging to identify, causing many undetected cases. A diagnosis of coronary artery disease, even if not stated directly, is implied by other diagnoses, such as a history of coronary artery bypass grafting, percutaneous coronary intervention, or myocardial infarction. Taking this into account, such clinically meaningful hierarchies have been implemented in the AssistMED tool. The authors discussed the problems of analyzing clinical notes, such as their less predictable structure, complex context analysis, and frequent jargon usage.
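The hierarchy idea can be sketched as a transitive closure over implication rules; the relations below mirror the examples given in the text, while the rule format itself is an illustrative assumption:

```python
# Illustrative diagnosis hierarchy: a specific finding implies a more general disease.
IMPLIES = {
    "CABG": {"coronary artery disease"},
    "PCI": {"coronary artery disease"},
    "myocardial infarction": {"coronary artery disease"},
    "coronary artery disease": {"atherosclerosis"},
    "carotid artery disease": {"atherosclerosis"},
}

def propagate(diagnoses):
    """Close a set of detected diagnoses under the implication hierarchy."""
    result = set(diagnoses)
    queue = list(diagnoses)
    while queue:
        dx = queue.pop()
        for implied in IMPLIES.get(dx, ()):
            if implied not in result:
                result.add(implied)
                queue.append(implied)
    return result
```

A record mentioning only a CABG history thus still contributes to the counts of coronary artery disease and atherosclerosis.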

An NLP tool, EchoInfer, was developed to automatically extract cardiovascular structure and function data from echocardiographic reports.12 EchoInfer achieved comparable results, with an average sensitivity of 92.21% in a single-institution evaluation.

Supervised machine learning

Supervised machine learning methods are flexible, adaptable to new language patterns, and can achieve high accuracy with proper training. Supervised learning means that a computer needs to know the final answer, that is, whether a particular diagnosis / drug / echocardiography parameter is associated with an individual patient. In this technique, the text is preprocessed and then treated as a signal for machine learning. The computer learns to associate certain constellations of text chunks with the presence or absence of a disease. An advantage of the approach is that once the algorithm is trained, it usually works well at other institutions, meaning it is portable. However, this technique requires large samples of high-quality, labeled data to become functional. Examples available in the literature identify a few diseases at a time.
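As an illustrative sketch of the supervised approach (not the method of any study cited here), a minimal bag-of-words Naive Bayes classifier can learn to associate text chunks with a disease label; the toy vocabulary and labels are assumptions:

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Minimal bag-of-words Naive Bayes classifier (illustrative only; real
    systems use richer preprocessing and far larger labeled corpora)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        words = text.lower().split()
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.class_counts:
            # log prior + log likelihoods with add-one (Laplace) smoothing
            score = math.log(self.class_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

The classifier needs only labeled examples, not hand-written rules, which is the key trade-off discussed above: portability at the price of annotation effort.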

Weissler et al13 presented an elegant text-processing solution based on various textual data types from EHRs to recognize patients with peripheral artery disease. A majority of the texts available in the EHRs (including progress notes, consults, etc) contributed to training of the machine learning model. Based on a cutoff selected by the authors, the algorithm achieved a sensitivity of 90% and a specificity of 62%. The approach is exciting, as it goes beyond classifying specific text fields in the EHR and aggregates all available textual data.

Deep learning methods: large language models

Large language models (LLMs) excel in language comprehension and context awareness, and because they are pretrained on openly available online data, they also encode medical information. Thanks to pretraining, such models perform decently out of the box; fine-tuning is required for further improvement. A major step forward here is that fine-tuning requires far less annotated data, which is the main limitation of machine learning approaches. Additionally, context awareness and familiarity with clinical abbreviations may enable them to be effectively applied for tasks such as diagnosis and drug detection in heterogeneous data, such as daily clinical notes. This may help correctly identify drug intake status (taken, discontinued, halted, considered for introduction, or allergy reported) or disease status (confirmed, included in differential diagnosis but not yet confirmed, or excluded). Our algorithm incorporated only simple negation detection, limiting its capabilities in this matter, and the high accuracy of our approach is partly attributable to processing of more organized textual data types in the EHRs.
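To make the contrast concrete, the kind of simple negation detection mentioned above can be sketched as follows (an illustration only; the cue words and the fixed look-back window are assumptions, not the actual AssistMED rules): a term is treated as negated if a negation cue appears within a few tokens before it, which handles "please do not take bisoprolol" but none of the subtler status distinctions listed above.

```python
# Hedged sketch of rule-based negation detection with a fixed look-back window.
# Cue list and window size are illustrative assumptions.
NEGATION_CUES = {"no", "not", "without", "denies", "excluded"}

def is_negated(sentence: str, term: str) -> bool:
    """Return True if `term` is preceded by a negation cue within 5 tokens."""
    s = sentence.lower()
    idx = s.find(term.lower())
    if idx == -1:
        raise ValueError("term not found in sentence")
    window = s[:idx].split()[-5:]  # look back up to 5 tokens before the term
    return any(tok in NEGATION_CUES for tok in window)

print(is_negated("Please do not take bisoprolol.", "bisoprolol"))  # True
print(is_negated("Continue bisoprolol 2.5 mg daily.", "bisoprolol"))  # False
```

Such a rule cannot distinguish a discontinued drug from one merely considered for introduction, which is precisely where the contextual understanding of LLMs is expected to help.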

Even ChatGPT shows potential for proposing ICD-10 codes from provided fragments of anonymized medical documentation and for categorizing findings from echocardiographic reports, which may aid in structuring medical data. There are also LLMs pretrained on real-world medical data (Med-PaLM 2, https://arxiv.org/pdf/2212.13138.pdf; Clinical BERT, https://arxiv.org/abs/1904.03323; MedBERT).14 Med-PaLM 2, for example, achieved a passing score on the United States Medical Licensing Examination.15

With fine-tuning, LLMs could effectively perform tasks such as parameter value retrieval from magnetic resonance imaging reports, as demonstrated by Singh et al,14 which is a task similar to that attempted in our echocardiographic parameters retrieval. Notably, the authors developed the model with just 370 human annotations by fine-tuning the BERT-LARGE model, thus overcoming a typical limitation of such projects utilizing artificial intelligence, that is, the requirement for large datasets with high-quality annotated data.

There are, however, other limitations of LLMs, for example, hallucinations. This is a well-known problem of ChatGPT, as the chatbot tends to give invented, false information rather than no answer at all, which is undesirable in research data acquisition. Furthermore, although LLMs excel in language translation, most of their training data are in English. This might compromise the results of medical data understanding and categorization for other, less-represented languages, including Polish. Suwała et al16 already reported unsatisfactory performance of ChatGPT on the Polish board certification examination in internal medicine, even though the chatbot has passed multiple other international exams, for example, the European Exam in Core Cardiology.17

In summary, dictionary-based and rule-based approaches have so far been the most popular in the medical field, but more advanced text-processing techniques are being adopted. Table 4 summarizes subjective advantages and disadvantages of the discussed NLP approaches. The approach used to develop the AssistMED tool could be described as hybrid, as we utilized dictionaries, rules, and some machine learning to achieve the presented results. Therefore, more portability can be expected than with traditional dictionary- and rule-based algorithms, although this requires future testing on data from other institutions.

Table 4. Advantages and disadvantages of current natural language processing (NLP) approaches

Rule-based and dictionary-based

Advantages:
  • Full control of the algorithm performance
  • Predictable results
  • Large annotated datasets needed only for validation of the results, not at the project onset
  • Possibility to design a decently performing NLP tool categorizing multiple conditions at a time
  • Low development costs

Disadvantages:
  • Less favorable accuracy at other institutions (portability concerns)
  • Health care professional involvement often needed during development
  • Unsatisfactory results in disorganized text data, eg, progress notes
  • Favorable results only in selected textual data types, not the entirety of electronic health records
  • Pertinent only to the language the tool is developed in

Supervised machine learning

Advantages:
  • Usually portable; can be used at other institutions with good results
  • Predictable results
  • Ability to analyze various textual data types at once
  • Can be language-agnostic, hence potentially widely applicable

Disadvantages:
  • Requirement of large, high-quality annotated datasets for training at the project onset
  • Risk of overtraining during development (a model too accustomed to the training data, which performs badly on new data)
  • High development costs
  • Practical use cases in the literature identify only 1 or a few conditions at a time due to the discussed limitations

Deep learning, large language models

Advantages:
  • Potentially allows for accurate identification in less organized texts, such as progress notes
  • Lower amounts of annotated data required than in conventional machine learning due to pretraining on vast amounts of openly available data
  • Solutions developed with these techniques will potentially be applicable in different languages

Disadvantages:
  • Black-box approach, ie, lack of understanding why a certain decision was made
  • Lack of predictability
  • Model hallucinations
  • High development costs

Study limitations

Despite optimistic results, our approach has significant limitations that must be addressed.

First, the AssistMED algorithm has only been validated in a single tertiary cardiology center, and its applicability to data from other centers is still to be determined.

Second, we limited our analysis to specific textual data types of the discharge documentation, not the entire textual data available in the EHRs. However, choosing these specific textual data types takes advantage of natural tendencies ingrained in everyday clinical practice in Poland, that is, listing clinical diagnoses descriptively and including medication dosing in discharge recommendations. These data types are more homogeneous and more straightforward to analyze than disorganized progress notes, and processing such texts brought satisfactory results. As discussed before, analysis of other data types is challenging, and published results indicate relatively moderate retrieval accuracy,11,18 despite demonstrated research utility of such data.

Third, the AssistMED algorithm lacks advanced context analysis; its only contextual capability is negation detection in the diagnosis recognition module. This largely explains the inaccuracies reflected in the results. Sources of error include detection of a condition that is not yet established (eg, "patient qualified for elective PCI [percutaneous coronary intervention] scheduled on…": there was no PCI yet, but the algorithm recognized the condition), significant typos that precluded identification, identification of a drug that is not taken (eg, "please do not take bisoprolol": the patient is not taking bisoprolol), and random algorithm errors. As discussed, LLMs are the most likely to resolve such problems, as these require advanced language understanding.

Conclusions

NLP tools implemented in the AssistMED project could quickly and accurately characterize patients with AF, as compared with human-based retrieval. Further improvements are likely to arise from adopting LLMs for similar tasks.