Introduction

Acute pancreatitis (AP) is one of the most common gastrointestinal diseases requiring hospitalization, and it is associated with high hospitalization costs in many countries. The global incidence of AP has risen over the last decades, with an average annual percent change of 3.07%,1 resulting in an increased burden on health care systems. Although most patients with AP experience a mild, self-limited disease that lasts approximately a week, about 20% to 30% progress to severe acute pancreatitis (SAP), with mortality rates ranging between 13% and 35%.2 A majority of patients with SAP require acute care and nutritional support in an intensive care unit (ICU).3 Early and accurate identification of SAP is crucial to reduce mortality rates and improve clinical outcomes.4 Therefore, it is important to recognize prognostic factors and establish a prediction model (prognostic scoring system) with high discriminatory efficiency for SAP.

Several prognostic scoring systems, such as the Ranson score, Acute Physiology and Chronic Health Evaluation (APACHE) II, Bedside Index of Severity in Acute Pancreatitis (BISAP), and Japanese severity score (JSS), have been commonly used for the prediction of AP severity in clinical practice.3 However, each of them has certain limitations. For example, some variables included in the Ranson score are assessed only at 48 hours after hospital admission, which carries a high risk of missing the optimal timing of treatment.5 APACHE II is cumbersome to apply widely in clinical practice, as it comprises 12 mandatory variables that are not routinely obtained in patients who are not critically ill.5 Owing to its simplicity, BISAP is useful for early prediction of severity in patients with AP; however, its accuracy is relatively low.5 Therefore, there is still no gold standard prognostic score for predicting SAP.

Nowadays, artificial intelligence (AI) methods are being widely utilized to determine prognosis of various diseases, and play an important role in clinical settings, as they can assist in clinical decision-making.6-8 Artificial neural networks (ANNs) are a subset of traditional machine learning methods, belonging to the field of AI. Their structure and function are designed to resemble biologic nervous systems, with powerful learning algorithms and training capabilities to perform simulations with high accuracy.6 Using high-performance computer clusters, Andersson et al9 established an ANN model for prediction of SAP which outperformed a logistic regression model and APACHE II (area under the curve [AUC] values of 0.92, 0.84, and 0.63, respectively, for the ANN model, logistic regression model, and APACHE II). ANN models have relatively high sensitivity and specificity, but their interpretability is low because of the black box effect, which limits their clinical application.10

The extreme gradient boosting (XGBoost) algorithm has remarkable features that enable flexible and efficient processing of missing data. Additionally, it assembles weak prediction models to construct an accurate one, and has been used in clinical practice to predict the severity and outcomes of AP.11,12 Thapa et al11 developed an XGBoost algorithm to identify patients who would benefit from treatment of SAP. Their study was limited by the fact that persistent systemic inflammatory response syndrome (SIRS), rather than persistent organ failure, was considered a gold standard for establishing the SAP diagnosis.11 In addition, local individualized prediction was not accounted for. Kui et al12 developed an early achievable severity index using the XGBoost machine learning algorithm for prediction of severe AP within 24 hours of hospital admission. However, the authors did not exclude patients with organ failure on admission, nor did they provide a comparison between XGBoost and ANN models.12 Therefore, the aim of the present study was to develop and validate an interpretable XGBoost model, and to compare its performance with that of the traditional ANN model for predicting SAP.

Patients and methods

Inclusion and exclusion criteria

This study was a post-hoc analysis of our previous cohort studies, which included 648 consecutive, eligible patients with AP, treated at the First Affiliated Hospital of Wenzhou Medical University, a tertiary referral center in mainland China.4,13 Patients with AP admitted to the hospital within 72 hours of the symptom onset were enrolled in the study between April 1, 2012 and December 31, 2015.13 For the diagnosis of AP, at least 2 of the following features were required: characteristic abdominal pain consistent with AP, laboratory investigations with amylase and / or lipase levels more than 3 times the upper limit of normal, and typical abdominal findings on cross-sectional imaging.4 The exclusion criteria were described in detail previously, and comprised pancreatitis induced by trauma or endoscopy, concomitant pancreatic cancer, acute exacerbation of chronic pancreatitis, a history of surgery or treatment with lipid-lowering agents, malnutrition, and liver or kidney disease.13

Data collection

The collected data included age, sex, body mass index (BMI), duration of symptoms, presence of SIRS, etiology of AP, and selected laboratory parameters. Duration of symptoms was defined as the time from the onset of symptoms to admission. The analyzed symptoms included abdominal pain and other gastrointestinal symptoms related to AP. SIRS was defined as the presence of at least 2 of the following criteria: 1) body temperature greater than 38 °C or lower than 36 °C; 2) respiratory rate greater than 20/min or partial pressure of carbon dioxide lower than 32 mm Hg; 3) heart rate greater than 90 bpm; 4) leukocyte count greater than 12 × 10⁹/l or lower than 4 × 10⁹/l, or more than 10% of immature forms.14 Various blood biochemical indicators, including liver and kidney function parameters, blood glucose, lipids, coagulation parameters, serum calcium, and C-reactive protein (CRP), were collected according to the previously described data regarding predictive scores, such as APACHE II and BISAP.4 Imaging examinations (computed tomography or ultrasonography) were performed to determine the presence of pleural effusion.13
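As an illustration, the SIRS definition above can be written as a small function. This is a minimal Python sketch, not the code used in the study; the `Vitals` record and its field names are hypothetical, and only the thresholds come from the definition given above.

```python
from dataclasses import dataclass


@dataclass
class Vitals:
    """Hypothetical record of the measurements needed for the SIRS criteria."""
    temperature_c: float        # body temperature, °C
    respiratory_rate: float     # breaths per minute
    paco2_mmhg: float           # partial pressure of carbon dioxide, mm Hg
    heart_rate: float           # beats per minute
    leukocytes_e9_per_l: float  # leukocyte count, x 10^9/l
    immature_forms_pct: float   # immature forms, %


def sirs_criteria_met(v: Vitals) -> int:
    """Count how many of the 4 SIRS criteria are fulfilled."""
    criteria = [
        v.temperature_c > 38 or v.temperature_c < 36,
        v.respiratory_rate > 20 or v.paco2_mmhg < 32,
        v.heart_rate > 90,
        v.leukocytes_e9_per_l > 12
        or v.leukocytes_e9_per_l < 4
        or v.immature_forms_pct > 10,
    ]
    return sum(criteria)


def has_sirs(v: Vitals) -> bool:
    """SIRS is present when at least 2 criteria are fulfilled."""
    return sirs_criteria_met(v) >= 2
```

For example, a patient with a temperature of 38.5 °C and a heart rate of 95 bpm, with all other values in range, fulfills 2 criteria and is classified as having SIRS.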

Definition of severity and study end point

The criterion for SAP diagnosis was persistent organ failure lasting more than 48 hours.13 The definition of organ failure was based on a modified Marshall score greater than or equal to 2, which means that at least 1 organ system, including the respiratory, cardiovascular, and renal systems, is functionally impaired.4 The primary end point of the study was occurrence of SAP during hospitalization.
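The end point definition can be sketched in a few lines of Python. The function names and inputs are hypothetical, and the sketch assumes that organ failure means a modified Marshall score of 2 or more in at least 1 of the 3 organ systems named above; it is not the code used in the study.

```python
def has_organ_failure(respiratory: int, cardiovascular: int, renal: int) -> bool:
    """Organ failure: modified Marshall score >= 2 in at least 1 organ system
    (assumed interpretation of the definition in the text)."""
    return max(respiratory, cardiovascular, renal) >= 2


def is_sap(respiratory: int, cardiovascular: int, renal: int,
           failure_duration_h: float) -> bool:
    """SAP: organ failure that persists for more than 48 hours."""
    return (has_organ_failure(respiratory, cardiovascular, renal)
            and failure_duration_h > 48)
```

Under these assumptions, organ failure that resolves within 48 hours does not meet the SAP criterion.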

Sample size and missing values

We calculated the sample size of this study based on data from our previous paper.4 Data on serum calcium and CRP values were partially missing in our cohort. To address this issue, we used the multiple imputation by chained equations (MICE) method to preserve the full sample size, reduce bias in parameter estimates, and increase the statistical power of the XGBoost and ANN analyses.15 MICE is one of the most common and flexible imputation algorithms; it iteratively fits a predictive model for each variable with missing values and creates a “complete” dataset.15,16
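The chained-equations idea can be illustrated with a deliberately simplified Python sketch. It is not the imputation code used in the study: it handles only 2 variables, uses a single-predictor linear regression, and fills in deterministic predictions, whereas MICE proper adds random draws and produces several completed datasets. All names in the sketch are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for y ~ x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx


def chained_imputation(rows, n_iter=10):
    """rows: list of [a, b] pairs with None marking a missing value.
    Returns a completed copy; the input is left unchanged."""
    n_cols = 2
    missing = [[r[j] is None for r in rows] for j in range(n_cols)]
    data = [list(r) for r in rows]

    # Step 1: start from the observed mean of each variable.
    for j in range(n_cols):
        col_mean = mean(r[j] for r in rows if r[j] is not None)
        for i, r in enumerate(data):
            if missing[j][i]:
                r[j] = col_mean

    # Step 2: cycle through the variables, re-fitting a model on the rows
    # where the target was observed and re-predicting its missing entries.
    for _ in range(n_iter):
        for target in range(n_cols):
            predictor = 1 - target
            observed = [i for i in range(len(data)) if not missing[target][i]]
            slope, intercept = fit_line(
                [data[i][predictor] for i in observed],
                [data[i][target] for i in observed],
            )
            for i in range(len(data)):
                if missing[target][i]:
                    data[i][target] = intercept + slope * data[i][predictor]
    return data
```

In the study, the variables with missing values were serum calcium and CRP; the sketch uses generic columns to avoid implying any particular data.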

Ethics statement

The Ethics Committee of the First Affiliated Hospital of Wenzhou Medical University approved this study protocol (KY2023-R270). Written informed consent from the participants was not required because their data were analyzed retrospectively and anonymously.

Statistical analysis

Categorical variables were presented as numbers and percentages and compared by the Fisher exact test or the χ² test. According to the results of the Shapiro–Wilk test, continuous variables were expressed as mean (SD) when they were normally distributed, or as median with interquartile range (IQR) when their distribution was non-normal. Continuous variables were compared by the t test or the Wilcoxon rank-sum test, as appropriate.

An exploratory variable importance analysis was performed to evaluate the role of different variables in SAP prediction by both XGBoost and ANN models. For the XGBoost model, variable importance was quantified with Shapley additive explanations (SHAP) values and visualized in a SHAP summary plot, and the individual predictions were explained by a SHAP force plot.17 For the ANN model, the importance of each variable was determined by the mean decrease in accuracy, that is, by evaluating how much the accuracy of the ANN model decreases when the information carried by that variable is removed.6
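SHAP values are based on Shapley values from cooperative game theory. The following Python sketch computes exact Shapley values for a toy model by enumerating all feature subsets; it is meant only to show what the attribution represents. The toy risk function, its coefficients, and the baseline are hypothetical, and SHAP implementations for tree models use far more efficient algorithms than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature of x relative to a baseline.

    Features outside the coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[j] if j in coalition else baseline[j] for j in range(n)]
        return predict(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def toy_risk(v):
    """Hypothetical risk score with one interaction term (illustration only)."""
    return 0.3 * v[0] + 0.2 * v[1] + 0.1 * v[0] * v[2]
```

The attributions sum to the difference between the prediction for the patient and the prediction for the baseline, which is the property that makes SHAP force plots additive.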

For model evaluation, the entire cohort of 648 patients was randomly divided into 5 subsets of approximately equal size. One of these subsets was randomly selected as the test set (129 patients), while the remaining ones formed the training set (519 patients). The XGBoost and ANN models were developed on the training set (n = 519) and independently validated on the test set (n = 129) using the “caret” package.18 To build and tune the XGBoost and ANN models on the training set, we used 5-fold cross-validation as the resampling method to limit overfitting and to estimate model performance on new data.18 The training set (n = 519) was divided into 5 subsamples of approximately equal size, of which 4 subsamples (n = 415) served for training and the remaining one (n = 104) for testing, across all possible permutations. The cross-validation procedure was repeated 3 times.17 The mean area under the receiver operating characteristic (ROC) curve with 95% CI and the area under the precision-recall curve (AUC-PR) were used to evaluate the discriminatory power of the models.17 Comparison of the AUC values was performed using the method proposed by Cleves et al.19
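The partitioning scheme described above can be sketched as follows. This is an illustrative Python version using only the standard library, not the “caret” workflow used in the study; the function names and seeds are hypothetical, and the split is not stratified by outcome.

```python
import random


def holdout_split(n, n_test, seed):
    """Randomly split indices 0..n-1 into a training and a test part."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx[n_test:], idx[:n_test]  # training indices, test indices


def k_folds(indices, k, seed):
    """Shuffle the indices and deal them into k folds of near-equal size."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def repeated_cv(indices, k=5, repeats=3, seed=0):
    """Yield (fit, held_out) index lists for repeated k-fold cross-validation."""
    for r in range(repeats):
        for held_out in k_folds(indices, k, seed + r):
            held = set(held_out)
            fit = [i for i in indices if i not in held]
            yield fit, held_out


train_idx, test_idx = holdout_split(648, 129, seed=1)
splits = list(repeated_cv(train_idx, k=5, repeats=3, seed=0))
```

With 519 training patients, 4 folds contain 104 patients and 1 contains 103, so the fitting part comprises 415 or 416 patients, in line with the approximate sizes given above. Repeating 5-fold cross-validation 3 times yields 15 fit and held-out pairs.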

Sensitivity, specificity, and diagnostic accuracy of the XGBoost and ANN models were calculated, and the optimal cutoff value was selected according to the maximum value of the Youden index (sensitivity + specificity – 1). The local interpretable model-agnostic explanation (LIME) plot was used to explain the individual prediction to overcome the black box effect of the XGBoost output and improve its interpretability.17 With this novel explanation technique, classifier predictions were interpreted and reliably explained by learning interpretable models locally around the prediction.20
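The discrimination measures used here (the AUC and the cutoff that maximizes the Youden index) can be illustrated with a short Python sketch. The function names are hypothetical and the code is not the implementation used in the study; the AUC is computed from its rank interpretation, that is, the probability that a randomly chosen SAP patient receives a higher predicted score than a randomly chosen non-SAP patient.

```python
def auc(labels, scores):
    """AUC as the share of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sens_spec(labels, scores, cutoff):
    """Sensitivity and specificity when scores >= cutoff are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < cutoff)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def youden_cutoff(labels, scores):
    """Return (cutoff, Youden index, sensitivity, specificity) at the maximum index."""
    best = None
    for cutoff in sorted(set(scores)):
        se, sp = sens_spec(labels, scores, cutoff)
        j = se + sp - 1
        if best is None or j > best[1]:
            best = (cutoff, j, se, sp)
    return best
```

Each observed score is tried as a candidate cutoff, and the first cutoff reaching the maximum index is kept.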

A data flow diagram of our study is shown in Supplementary material, Figure S1. All statistical analyses were performed with R software, version 4.1.1 (R Foundation for Statistical Computing, Vienna, Austria) and STATA software, version 10.0 (StataCorp LP, College Station, Texas, United States). A 2-tailed P value below 0.05 was considered significant.

Results

Baseline characteristics

The 3 most common etiologies of AP in the study population were biliary abnormalities (42.4%), excessive use of alcohol (16.4%), and hypertriglyceridemia (5.6%). The median (IQR) length of hospital stay for patients with and without SAP was 16 (10–31) and 10 (7–13) days, respectively. The incidence of SAP and mortality during hospitalization were 10% and 1.54%, respectively. There was no difference between the training and test sets with respect to most laboratory and clinical characteristics. However, the patients in the training set had higher BMI than those included in the test set, whereas the SIRS rate and serum creatinine level were higher in the test set than in the training set (P <0.05). Baseline characteristics of patients included in the training and test sets are presented in Supplementary material, Table S1.

Univariable analysis of the training sample

A total of 23 variables were included in the univariable analysis (Table 1). Of those, 8 did not differ significantly between the patients with and without SAP. The patients with SAP more often presented with SIRS, pleural effusion, and abnormal serum total cholesterol (TC) concentration, as compared with the patients without SAP. Similarly, the SAP group had higher hematocrit, aspartate aminotransferase (AST), glucose, serum creatinine, blood urea nitrogen (BUN), CRP, and triglyceride levels and longer prothrombin time, as compared with the non-SAP group. The individuals with SAP also had lower levels of serum albumin, high-density lipoprotein cholesterol (HDL-C), low-density lipoprotein cholesterol (LDL-C), and serum calcium.

Table 1. Clinical and laboratory characteristics of patients with and without severe acute pancreatitis included in the training sample (n = 519)

| Variable | No SAP (n = 467) | SAP (n = 52) | P value |
|---|---|---|---|
| Age, y | 47 (37–62) | 52.5 (38–68) | 0.06 |
| Male sex | 288 (61.7) | 32 (61.5) | 0.99 |
| Duration of symptoms, d, mean (SD) | 1.8 (0.8) | 1.9 (0.8) | 0.42 |
| BMI, kg/m² | 23.8 (20.3–26.3) | 23.9 (22–25.9) | 0.94 |
| SIRS | 160 (34.3) | 35 (67.3) | <0.001 |
| Etiology of AP | | | 0.23 |
| Biliary etiology | 205 (43.9) | 19 (36.5) | |
| Hypertriglyceridemia | 23 (4.9) | 5 (9.6) | |
| Alcohol | 61 (13.1) | 4 (7.7) | |
| Other | 178 (38.1) | 24 (46.2) | |
| Pleural effusion | 60 (12.9) | 35 (67.3) | <0.001 |
| Laboratory findings | | | |
| Hematocrit, l/l | 0.42 (0.38–0.45) | 0.44 (0.4–0.47) | 0.01 |
| Platelets, × 10⁹/l | 199 (164–233) | 190 (142–233) | 0.1 |
| Prothrombin time, s | 13.8 (13.1–14.6) | 14.6 (13.2–15.3) | 0.002 |
| Albumin, g/l | 36.9 (33.6–40.1) | 31.5 (27.7–35) | <0.001 |
| Total bilirubin, μmol/l | 20 (14–33) | 19 (14–26.5) | 0.36 |
| ALT, U/l | 40 (18–119) | 48 (24–75) | 0.61 |
| AST, U/l | 33 (20–88) | 63 (36–89) | 0.003 |
| Glucose, mmol/l | 7.8 (6.5–10.6) | 10.3 (8.4–14.7) | <0.001 |
| Serum creatinine, μmol/l | 63 (53–75) | 79 (58–128) | <0.001 |
| BUN, mmol/l | 4.5 (3.5–5.9) | 7.9 (5.3–11.4) | <0.001 |
| Total cholesterol | | | 0.002 |
| <4.2 mmol/l | 145 (31.3) | 26 (50) | |
| 4.2–6.2 mmol/l | 205 (43.9) | 10 (19.2) | |
| >6.2 mmol/l | 117 (25.1) | 16 (30.8) | |
| HDL-C, mmol/l | 1.1 (0.8–1.3) | 0.6 (0.4–1) | <0.001 |
| LDL-C, mmol/l | 2.4 (1.9–3.2) | 1.7 (1.3–2.7) | <0.001 |
| Triglycerides, mmol/l | 1.3 (0.8–3.4) | 2.4 (1.3–7.1) | <0.001 |
| Serum calcium, mmol/l | 2.2 (2.1–2.3) | 2 (1.6–2.2) | <0.001 |
| CRP, mg/l | 31 (9.6–84.9) | 76.1 (26.4–90) | 0.009 |

Data are shown as numbers and percentages or median (interquartile range) unless indicated otherwise. For etiology of AP and total cholesterol, a single P value refers to the comparison across all categories.

SI conversion factors: to convert ALT and AST to μkat/l, multiply by 0.0167; CRP to nmol/l, by 9.524.

Abbreviations: ALT, alanine aminotransferase; AST, aspartate aminotransferase; BMI, body mass index; BUN, blood urea nitrogen; CRP, C-reactive protein; HDL-C, high-density lipoprotein cholesterol; LDL-C, low-density lipoprotein cholesterol; SAP, severe acute pancreatitis; SIRS, systemic inflammatory response syndrome

Exploratory variable importance analysis of the training sample

The 15 variables (SIRS, hematocrit, prothrombin time, albumin, AST, glucose, serum creatinine, BUN, TC, HDL-C, LDL-C, triglycerides, serum calcium, CRP, and pleural effusion) that were significant in the univariable analysis were used to build the XGBoost and ANN machine learning models. In the ANN model, glucose was found to be the most important predictor of SAP, followed by albumin and presence of pleural effusion (Figure 1). The SHAP summary plot visualized the relative importance of each variable included in the XGBoost model. The 3 most important variables were BUN, presence of pleural effusion, and HDL-C (Figure 2).

Figure 1. Variable importance plot of the artificial neural network (ANN) model for severe acute pancreatitis

Abbreviations: see Table 1

Figure 2. Variable importance plot of the extreme gradient boosting model for severe acute pancreatitis. The Shapley additive explanation (SHAP) value (x axis) reflects the predictive ability of each parameter.

Abbreviations: see Table 1

Model development, 5-fold cross-validation, and calibration on the training sample

The results of 5-fold cross-validation indicated that the XGBoost model achieved a greater mean AUC than the ANN model (mean AUC = 0.92; 95% CI, 0.87–0.97 vs mean AUC = 0.86; 95% CI, 0.78–0.92, respectively; P <0.001) (Figure 3). A greater AUC-PR was also observed for the XGBoost model than for the ANN model (0.63 vs 0.48) (Figure 4). The calibration plots indicated adequate agreement between the predicted probabilities and the observed proportions of SAP for both the XGBoost and ANN models (Supplementary material, Figure S2).

Figure 3. Receiver operator characteristic curves of the extreme gradient boosting (XGBoost) and artificial neural network (ANN) models for 5-fold cross-validation on the training set

Abbreviations: AUC, area under the curve

Figure 4. Precision-recall (PR) curves for the extreme gradient boosting (XGBoost) and artificial neural network (ANN) models for 5-fold cross-validation in the training set. The vertical line represents precision of the XGBoost and ANN models when the recall (sensitivity) equals 1.

Abbreviations: see Figure 3

Validation, comparison, and calibration of the prediction models on the test samples

The ROC curves for the XGBoost model, the ANN model, and the BISAP score for the prediction of SAP are shown in Supplementary material, Figure S3. The XGBoost model achieved the highest AUC (AUC = 0.93; 95% CI, 0.85–1), followed by the ANN model (AUC = 0.87; 95% CI, 0.79–0.96), and the BISAP score (AUC = 0.74; 95% CI, 0.58–0.89; P <0.001). The AUC-PR of the XGBoost model was higher than that of the ANN model (0.59 vs 0.49) (Supplementary material, Figure S4).

Based on the maximum value of the Youden index, the optimal cutoff values of the XGBoost model and the ANN model were 0.24 and 0.05, respectively. The XGBoost model achieved sensitivity of 92.3%, specificity of 92.2%, and diagnostic accuracy of 92.2%. In comparison, the ANN model achieved similar sensitivity (92.3%), lower specificity (73.2%), and lower diagnostic accuracy (75.2%).

The calibration plots visualizing the predicted probabilities against the observed proportions of SAP for the XGBoost and ANN models are shown in Supplementary material, Figure S5.

Explanation: individual prediction on the test sample

To clarify the model prediction for individual patients, the LIME plot was used to visualize 2 typical predictions made by the XGBoost model; 1 for a non-SAP and 1 for a SAP patient (Figure 5).

Figure 5. Local interpretable model-agnostic explanation plot for the individual likelihood of 2 typical predictions, showing the main contributing features behind the model prediction. The length of the color bar represents the degree of contribution. A – a correctly identified case of a non-SAP patient: a 76-year-old woman with no SIRS, hematocrit of 0.3 l/l, prothrombin time of 14.2 s, albumin of 31.1 g/l, AST of 58 U/l, glucose of 4.6 mmol/l, serum creatinine of 78 μmol/l, BUN of 6.1 mmol/l, total cholesterol of 5.9 mmol/l, HDL-C of 1.17 mmol/l, LDL-C of 3.73 mmol/l, triglycerides of 1.64 mmol/l, calcium of 1.99 mmol/l, CRP of 137 mg/l, and no pleural effusion. The absence of pleural effusion and normal glucose values were the main reasons for classification in the non-SAP group, outweighing other factors, such as increased BUN and AST values and decreased calcium levels. B – a correctly identified case of a SAP patient: a 41-year-old woman with SIRS, hematocrit of 0.41 l/l, prothrombin time of 13.9 s, albumin of 28.6 g/l, AST of 50 U/l, glucose of 10.2 mmol/l, serum creatinine of 55 μmol/l, BUN of 5.8 mmol/l, total cholesterol of 18.3 mmol/l, HDL-C of 0.62 mmol/l, LDL-C of 2.01 mmol/l, triglycerides of 48.2 mmol/l, calcium of 0.87 mmol/l, CRP of 90 mg/l, and pleural effusion. The presence of pleural effusion and low HDL-C values were the main reasons for classification in the SAP group, outweighing other factors, such as normal BUN and LDL-C values.

Abbreviations: see Table 1

For example, the first correctly classified case (case 222) was a non-SAP patient, a 76-year-old woman. The absence of pleural effusion and a normal glucose level were the main reasons for classifying the patient as non-SAP, outweighing other factors, such as increased BUN and AST levels and a decreased calcium concentration.

The second correctly classified case (case 224) was a SAP patient, a 41-year-old woman. The presence of pleural effusion and a low HDL-C level were the main reasons for classifying the patient into the SAP group, outweighing other factors, such as normal BUN and LDL-C levels.

Discussion

SAP is characterized by persistent organ failure and high mortality.2 To improve patient prognosis, early identification of this condition is very important. Our study developed the XGBoost and ANN models and compared their efficiency for SAP prediction. The results of 5-fold cross-validation showed that the XGBoost model outperformed the ANN model on the training set, with AUC values of 0.92 and 0.86, respectively. A greater AUC-PR was also observed for the XGBoost model than for the ANN model (0.63 vs 0.48). We validated the results on the test set and utilized a LIME plot to explain individual predictions made by the XGBoost model. Finally, we identified important predictors of SAP, including BUN, pleural effusion, and HDL-C, which were the 3 most important parameters in the XGBoost model.

Increased serum glucose levels are a common early characteristic of AP, and have been generally considered a transient phenomenon throughout AP.21 Their occurrence leads to damage of various pancreatic cells and activation of the neuroendocrine system, which causes exocrine and endocrine dysfunction and affects glucose homeostasis.3,22 According to a cross-sectional study, almost 40% of patients without a history of diabetes mellitus showed altered glucose metabolism (AGM) after an episode of AP.23 Rekeneire et al,24 using tumor necrosis factor α, CRP, and interleukin 6 (IL-6) levels as indicators of the inflammatory event, concluded that dysglycemia was associated with inflammation and showed that this relationship also extended to hyperglycemia. Moreover, several previous longitudinal studies have shown that inflammation may be a predictive factor for the onset of diabetes.25-27 Therefore, the association between diabetes and inflammation could be explained by a reciprocal interaction, suggesting that an incipient rise in serum IL-6 during AP may lead to AGM. High blood sugar levels and abnormal glucose metabolism could be an indication of more serious AP,28 and have been used in prognostic models to predict SAP.13,29 We showed that the glucose level was the most significant variable for forecasting SAP in the ANN model (Figure 1), and ranked fourth in the XGBoost model (Figure 2); therefore, it plays an important role in predicting SAP.

Albumin, an indispensable liver protein responsible for the maintenance of osmolar balance, generation of antioxidative compounds, and trapping free radicals, has also been long considered a negative acute phase protein whose production is reduced in inflammation, opening the way for proinflammatory cytokines.30 Ocskay et al31 showed that a low albumin level on admission was an independent risk factor for both mortality and severe disease in patients with AP, with an odds ratio of 5.256 and 3.62, respectively, in the groups with albumin levels lower than 25 g/l. In our study, albumin was found to be an impactful indicator of SAP according to the variable relevance assessment (Figure 1) in the ANN model, which is consistent with previous studies.30,31 However, it did not show good discriminatory performance for predicting SAP in the SHAP analysis for the XGBoost model (Figure 2).

Pleural effusion is common in AP patients, and it usually resolves as pancreatitis attenuates. Various mechanisms have been suggested to explain the development of pleural effusion in the setting of pancreatitis, for example, transdiaphragmatic lymphatic blockage, disruption of the pancreatic duct, pancreatic pseudocyst, and anatomic factors (certain anatomic tracts between the chest and abdominal spaces).32 Generally, left-sided effusions that show normal levels of amylase in the fluid are caused by chemical or sympathetic factors.32 If pleural effusion occurs on the right side, a pseudocyst of the pancreas or a fistula between the pancreas and the pleura may be involved in its genesis.33 Pleural effusion has been integrated into several clinical grading scores for predicting SAP, such as the BISAP score,3 and its volume is considered to be a valid imaging biomarker for assessing the severity and clinical course of AP.34,35 In our study, pleural effusion was among the 3 most important variables in both the ANN model (Figure 1) and the SHAP summary plot for the XGBoost model (Figure 2).

BUN reflects urea, the main end product of protein metabolism, which is excreted mainly by the kidneys. Acute renal failure is a common organ injury in patients with SAP that increases the risk of mortality.36 Elevated BUN levels can be explained by several factors contributing to the loss of intravascular volume.37 On the one hand, endothelial dysfunction has been found in patients with AP,37,38 which manifests as increased capillary permeability resulting in decreased blood volume. On the other hand, premature activation of pancreatic enzymes during AP leads to autodigestion of surrounding tissues, directly leading to renal injury.3 In addition, cytokines, including IL-1β, IL-8, and IL-6, interact with endothelial cells, leading to renal ischemia and secretion of free oxygen radicals.39 For these reasons, a decrease in splanchnic perfusion, followed by impaired renal blood flow, leads to renal impairment and acute necrosis of the tubules.37 As expected, based on the importance analysis of the ANN model variables (Figure 1), we showed that BUN was a major predictor of SAP. The SHAP analysis indicated that an increased BUN level on admission was the most salient parameter in the XGBoost model (Figure 2).

Cholesterol homeostasis, which requires a complex balance between biosynthesis, absorption, excretion, and esterification, is important for maintaining adequate cellular and systemic responses.40 Elevated HDL-C levels are strongly associated with a decreased risk of cardiovascular disease, and HDL-C is therefore regarded as a protective factor.41 Recently, studies have indicated that HDL-C may be of vital importance to the immune system, and decreased HDL-C concentration correlates with elevated serum CRP levels.42 HDL-C concentration is significantly reduced during the acute phase of inflammation.43 Studies have further demonstrated that low levels of HDL-C correlate with a worse prognosis in septic patients.44 A possible explanation of this association could be the ability of HDL-C, which carries a lipopolysaccharide-binding protein, to neutralize and clear proinflammatory endotoxins as a component of innate immunity. HDL-C has also been shown to exhibit antioxidant and anti-inflammatory effects,45 whereas free radicals and oxidative stress, in relation to the severity of pancreatitis, are implicated in causing AP.46 In addition, HDL-C inhibits bone marrow–derived hematopoietic stem cell proliferation; this way, the development of immune cells is controlled and inappropriate leukopoiesis is avoided.47 Li et al48 found that serum concentrations of HDL-C correlated negatively with SAP. In our study, patients with SAP presented with lower serum HDL-C and calcium values, as compared with those without this condition. The SHAP analysis showed that HDL-C was a useful parameter for predicting SAP in the XGBoost model (Figure 2).

Machine learning has been widely applied to predict the severity or complications of AP.6 ANN models, one of the AI methods designed to mimic the structure and performance of biological nervous systems, have been applied to predict SAP. However, previous research has been hampered by a lack of individual predictability of the model tested.

XGBoost is designed to be a highly scalable, end-to-end tree boosting system. It introduces a sparsity-aware algorithm for parallel tree learning, which assigns a default split direction to missing values, and an effective cache structure that increases training efficiency.49 As with other machine learning methods, interpreting its results remains a challenge. The contribution of each variable can be quantified and displayed using a SHAP summary plot, which improves interpretability of the model.17 This plot displays the relationship between feature values and SHAP values in the training set, and can also be used to learn how individual patient characteristics affect the output of the prediction model.17 We found that, as compared with the BISAP score and the ANN model, the XGBoost model showed a greater discriminatory ability to predict SAP in both training and test sets (Figures 3 and 4; Supplementary material, Figures S3 and S4). With the help of the XGBoost algorithm, we could identify key parameters and build a prediction model capable of identifying individuals at risk for SAP with high accuracy. The LIME plot offered a visual representation of the individual variable importance, which might help clinicians better interpret the results of the ANN and XGBoost models (Figure 5). With respect to accuracy, the XGBoost model showed the highest discriminatory performance on the test samples (AUC = 0.93), followed by the ANN model (AUC = 0.87), and the BISAP score (AUC = 0.74) (Supplementary material, Figure S3). The AUC-PR analysis confirmed that the XGBoost model performed better than the other models, both on the training and the test samples (Figure 4; Supplementary material, Figure S4).
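The default split direction for missing values mentioned above can be illustrated with a simplified Python sketch. For a single candidate split, the missing values are sent to the left and then to the right, and the direction giving the lower loss is kept. The sketch uses the plain sum of squared errors as the loss, which is an assumption made for brevity; XGBoost itself scores splits with gradient statistics and regularization. All names are hypothetical.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty group)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_default_direction(xs, ys, threshold):
    """Choose where rows with a missing feature value (None) should go.

    Returns ("left" or "right", loss) for the split x < threshold.
    """
    left = [y for x, y in zip(xs, ys) if x is not None and x < threshold]
    right = [y for x, y in zip(xs, ys) if x is not None and x >= threshold]
    missing = [y for x, y in zip(xs, ys) if x is None]

    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    if loss_if_left <= loss_if_right:
        return "left", loss_if_left
    return "right", loss_if_right
```

In this way, the direction for missing values is learned from the data rather than fixed in advance, so rows with missing values do not have to be discarded.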

To the best of our knowledge, this is the first study to present an interpretable XGBoost model with LIME diagrams for predicting SAP development. The strengths of this study include a large number of patients, which ensures strong statistical power. Both the patients admitted to the ICU and those treated in the general ward were enrolled, thus reducing the selection bias. However, certain limitations need to be acknowledged. Firstly, it was a single-center analysis; thus, the applicability of our model to other cohorts is unknown. Secondly, we did not further subdivide the non-SAP group into mild and moderately severe groups when developing the prediction models. This may have affected the accuracy of the established models to a certain extent. Thirdly, the failure to compare the XGBoost model with other prediction scores used in clinical practice, such as APACHE II and JSS, may be another shortcoming. Additionally, although our XGBoost model was validated internally through the 5-fold cross-validation technique, its performance still needs to be tested on an independent external sample. Lastly, XGBoost models are very sophisticated and difficult to understand even if proven to be effective, thus becoming comparable to a kind of “black box.” Consequently, we demonstrated how, by using LIME graphs, the results can be interpreted more easily.

In conclusion, as compared with the ANN model, an interpretable XGBoost model showed higher discriminatory efficiency for prediction of SAP. Interpretation of the model using a LIME diagram has certain application value in the field of precision medicine.