Development and validation of prediction models for gestational diabetes treatment modality using supervised machine learning: a population-based cohort study | BMC Medicine

Study population and design

The study population was drawn from the membership of Kaiser Permanente Northern California (KPNC), an integrated health care delivery system serving 4.5 million members. KPNC membership accounts for approximately 30% of the surrounding population and is socio-demographically representative of the population residing in the geographic areas it serves [11, 12]. The integrated electronic health information system allows ascertainment of predictors and outcomes across the continuum of pregnancy care. Individuals with GDM were identified by searching the KPNC Pregnancy Glucose Tolerance and GDM Registry, an active surveillance registry that downloads laboratory data to ascertain screening for and diagnosis of GDM; individuals with recognized pre-existing type 1 or type 2 diabetes are automatically excluded. Specifically, KPNC pregnant individuals receive nearly universal (98%) screening for GDM with a 50-g, 1-hour glucose challenge test (GCT) at 24-28 weeks' gestation [1]. If the screening test is abnormal, a 100-g, 3-hour oral glucose tolerance test (OGTT) is performed after an 8-12-hour fast. GDM was ascertained by meeting either of the following criteria: (1) ≥ 2 OGTT plasma glucose values meeting or exceeding the Carpenter-Coustan thresholds: fasting 95 mg/dL, 1-hour 180 mg/dL, 2-hour 155 mg/dL, and 3-hour 140 mg/dL; or (2) a 1-hour GCT ≥ 180 mg/dL and a fasting glucose ≥ 95 mg/dL performed alone or during an OGTT [13, 14]. Plasma glucose was measured by the hexokinase method at the KPNC Regional Laboratory, which participates in the College of American Pathologists Accreditation and Monitoring Program [15]. This data-only project was approved by the KPNC Institutional Review Board, which waived the requirement for informed consent from participants.
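The two diagnostic criteria above amount to a simple decision rule. As an illustration only (the function names are hypothetical and not from the paper; plasma glucose values are in mg/dL), they could be encoded as:

```python
# Carpenter-Coustan thresholds for the 100-g, 3-h OGTT (plasma glucose, mg/dL).
CARPENTER_COUSTAN = {"fasting": 95, "1h": 180, "2h": 155, "3h": 140}

def gdm_by_ogtt(ogtt):
    """Criterion 1: >= 2 OGTT values meeting or exceeding the thresholds.

    `ogtt` maps time points ("fasting", "1h", "2h", "3h") to glucose values.
    """
    n_abnormal = sum(1 for t, v in ogtt.items() if v >= CARPENTER_COUSTAN[t])
    return n_abnormal >= 2

def gdm_by_gct_and_fasting(gct_1h, fasting):
    """Criterion 2: 1-h GCT >= 180 mg/dL plus fasting glucose >= 95 mg/dL."""
    return gct_1h >= 180 and fasting >= 95
```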

Among the 405,557 pregnancies with a gestational age at delivery of at least 24 weeks delivered at 21 KPNC hospitals from 1 January 2007 to 31 December 2017, we excluded 375,041 (92.5%) individuals without GDM. Among the 30,516 GDM pregnancies, we further excluded individuals with GDM diagnosed before the universal GDM screening (n = 42), yielding an analytical sample of 30,474 GDM-complicated pregnancies. We then derived a discovery set of 27,240 GDM-complicated pregnancies from 2007 to 2016 and a temporal/prospective validation set of 3,234 GDM-complicated pregnancies from 2017 (Fig. 1).

Fig. 1

Flowchart for the development of the gestational diabetes pregnancy cohort, 2007-2017. GDM: gestational diabetes mellitus

Outcome ascertainment

Individuals diagnosed with GDM received universal referral to the KPNC Regional Perinatal Service Center for a supplemental care program beyond standard antenatal care. Medical nutrition therapy (MNT) was the first-line treatment. If glycemic control goals were not achieved with MNT alone, pharmacotherapy was initiated. After counseling on the risks and benefits of oral antidiabetic agents versus insulin, pharmacotherapy was selected through shared patient-physician decision making: (i) oral antidiabetic agents such as glyburide or metformin added to MNT, with escalation to insulin therapy if optimal glycemic control was still not achieved; or (ii) insulin therapy initiated directly after MNT (an additional table explains this in more detail [see Additional file 1]). We searched the Pharmacy Information Management database for prescriptions of oral agents (glyburide 97.9%, metformin or others) and insulin after GDM diagnosis. Treatment modality was classified as MNT only versus pharmacotherapy (oral agents and/or insulin) after MNT. Notably, despite the large overall sample size, we pooled oral agents (32.6% of the total population) and insulin (6.2%) into a single pharmacotherapy category because of insufficient power to predict insulin separately as an outcome.

Candidate predictors

Based on risk factors associated with GDM treatment modality and clinicians' input, we selected 176 (64 continuous, 112 categorical) sociodemographic, behavioral, and clinical predictors from electronic health records for model development. Candidate predictors were grouped into four levels based on their availability at different stages of pregnancy (an additional table explains this in more detail [see Additional file 2]). Level 1 predictors (n = 68) were available at the onset of pregnancy, tracing back to 1 year before the index pregnancy; level 2 predictors (n = 26) were measured from the last menstrual period to before GDM diagnosis; level 3 predictors (n = 12) were available at GDM diagnosis; and level 4 predictors (n = 70) included self-monitoring of blood glucose (SMBG), the primary measure of glycemic control during pregnancy recommended by the American Diabetes Association [5], measured in the first week after GDM diagnosis. All predictors, levels 1 through 4, were measured before the outcome of interest (that is, the final GDM treatment modality). Pregnant individuals with GDM in our study population had, on average, 11.8 weeks (standard deviation: 6.6 weeks) of SMBG measurements between GDM diagnosis and delivery. We included data from the first week after GDM diagnosis to enable early prediction, since pharmacotherapy, when offered, was initiated an average of 5.6 weeks after GDM diagnosis. Of note, individuals with GDM were offered enrollment in a supplemental GDM care program administered by nurses and dietitians via telemedicine from the KPNC Regional Perinatal Service Center [16]. All individuals with GDM were instructed to self-monitor and record glucose measurements four times a day: fasting before breakfast and 1 hour after the start of each meal. SMBG measurements were reported to registered nurses or dietitians during weekly telephone counseling calls from enrollment until delivery, and the data were recorded in the patient-reported clinical glucose database.

Statistical analysis


We imputed missing values using the random forest algorithm, because the algorithm requires no parametric model assumptions whose violation could reduce predictive efficiency (an additional table explains this in more detail [see Additional file 2]). We evaluated the estimated true imputation error using the root mean squared error for continuous variables and the proportion of falsely classified entries for categorical variables. Both values were close to 0, indicating good imputation performance (an additional table explains this in more detail [see Additional file 3]). After pre-processing, we employed Student's t-test and Pearson's chi-squared test to compare participant characteristics between the discovery and temporal/prospective validation sets. We conducted the Mann-Kendall test to examine secular trends in GDM treatment modality across calendar years. The discovery set (2007-2016) was stratified by calendar year and treatment modality for tenfold cross-validation. The temporal/prospective validation set (2017) was stratified by treatment modality to assess validated prediction performance.
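A minimal sketch of this imputation-and-evaluation step, assuming scikit-learn's iterative imputer with random forest base learners as a stand-in for the missForest-style procedure (the data, mask rate, and tuning values here are illustrative, not the paper's):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 4))
X_true[:, 1] = 0.8 * X_true[:, 0] + 0.2 * X_true[:, 1]  # a correlated column

# Mask ~10% of entries at random to create "missing" values with known truth.
mask = rng.random(X_true.shape) < 0.10
X_miss = X_true.copy()
X_miss[mask] = np.nan

# Random-forest-based imputation (one forest per column, iterated).
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=30, random_state=0),
    max_iter=5, random_state=0,
)
X_imp = imputer.fit_transform(X_miss)

# Root mean squared error on the masked continuous entries; for categorical
# variables one would instead compute the proportion of falsely classified entries.
rmse = float(np.sqrt(np.mean((X_imp[mask] - X_true[mask]) ** 2)))
```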

Variable selection, full model development and comparison

We performed prediction by classification and regression tree (CART), least absolute shrinkage and selection operator (LASSO) regression, and the super learner (SL) at predictor levels 1, 1-2, 1-3, and 1-4, respectively. CART and LASSO regression were chosen as simpler prediction methods for comparison with the SL. The SL specifies a set of candidate machine learning algorithms, the library, and combines their predictions through cross-validated meta-learning [17]. The SL has the oracle property of performing at least as well (with respect to risk, defined here by the negative log-likelihood) as the best-fitting algorithm in the library [17]. Although the individual contributions of the variables included in the final SL cannot be easily interpreted, the SL can be used to obtain optimal prediction performance and to benchmark simpler, less adaptive methods [17].
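The SL idea above can be sketched with scikit-learn's stacking classifier, which likewise fits a meta-learner on cross-validated predictions from a library of candidate algorithms. This is an assumption-laden illustration (simulated data; the paper's actual library, tuning, and software differ):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Candidate library: LASSO-penalized logistic regression, CART, random forest.
library = [
    ("lasso", LogisticRegression(penalty="l1", solver="liblinear", C=1.0)),
    ("cart", DecisionTreeClassifier(max_depth=6, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# Meta-learner combines the candidates' cross-validated predicted probabilities.
sl = StackingClassifier(
    estimators=library,
    final_estimator=LogisticRegression(),
    cv=10,                        # tenfold cross-validation
    stack_method="predict_proba",
)
sl.fit(X, y)
auc = roc_auc_score(y, sl.predict_proba(X)[:, 1])
```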

We tuned the prediction methods as follows. In CART, the Gini index measured subgroup heterogeneity with respect to the outcome, and a maximum depth of 6 was defined as the stopping criterion. To account for potential error in estimating the risk curve, the regularization parameter in the LASSO regression was chosen as the most parsimonious value whose cross-validated error was within one standard error of the minimum [18]. For the SL, we considered a simple and a complex library for comparison. The simple library included the mean response, LASSO regression, and CART; the complex library expanded it with random forest and extreme gradient boosting (XGBoost). Multiple XGBoost configurations were considered, with tuning parameters set to 10, 20, or 50 trees; maximum depths of 1 to 6; and shrinkage (learning rate) of 0.001, 0.01, or 0.1.
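The one-standard-error rule for the LASSO penalty can be sketched as follows, assuming scikit-learn in place of the paper's software: compute the cross-validated error over a penalty grid, then take the strongest penalty whose mean error stays within one standard error of the minimum.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=1)

# Candidate inverse penalties, ascending: small C = strong regularization.
Cs = np.logspace(-2, 1, 10)
means, ses = [], []
for C in Cs:
    scores = cross_val_score(
        LogisticRegression(penalty="l1", solver="liblinear", C=C),
        X, y, cv=10, scoring="neg_log_loss",
    )
    errors = -scores  # cross-validated negative log-likelihood per fold
    means.append(errors.mean())
    ses.append(errors.std(ddof=1) / np.sqrt(len(errors)))

means, ses = np.array(means), np.array(ses)
best = int(np.argmin(means))
threshold = means[best] + ses[best]
# Strongest penalty (smallest C) whose mean CV error is within one SE of the minimum.
C_1se = float(Cs[np.flatnonzero(means <= threshold)[0]])
```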

For models using predictors at each level, prediction performance was evaluated using receiver operating characteristic (ROC) curves and the tenfold cross-validated area under the ROC curve (AUC) in the discovery set and the AUC in the temporal/prospective validation set. We used the DeLong test to compare AUCs between different prediction algorithms at the same predictor level and within the same prediction algorithm across levels, respectively [19]. We used permutation-based variable importance, computed on the AUC scale with 5 permutations, to obtain the top 10 most important features. Permuting one variable at a time, the method calculated the difference in AUC before and after permutation to assign an importance score [20]. The model with the highest AUC in the validation set was selected as the final full model.
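The permutation-importance step described above can be sketched directly (simulated data and a random forest stand in for the fitted model; 5 permutations per variable, as in the text):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           random_state=2)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

rng = np.random.default_rng(0)
base_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

# For each variable: permute it, re-score, and record the drop in AUC.
importance = []
for j in range(X.shape[1]):
    drops = []
    for _ in range(5):  # 5 permutations per variable
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        drops.append(base_auc - roc_auc_score(y, model.predict_proba(Xp)[:, 1]))
    importance.append(float(np.mean(drops)))

top_features = np.argsort(importance)[::-1]  # most important first
```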

Development of simpler models

To improve interpretability and potential clinical uptake, we used logistic regression with tenfold cross-validation to develop simpler models in the discovery set based on a small set of the most important features at each level, as opposed to the full set of features used in the complex SL. We also selected interaction term(s), considering all cross-products, through stepwise forward and backward selection based on the Akaike information criterion. We evaluated the predictive performance (e.g., cross-validated and validation AUCs) of these simpler models in the validation set. Furthermore, calibration was examined by assessing the degree of miscalibration via the integrated calibration index, which captures the average absolute difference between predicted and observed probabilities, along with calibration plots. Recalibration (i.e., via isotonic regression) was carried out if over- or under-prediction was observed.
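The recalibration step can be sketched as follows, under stated assumptions: simulated over-confident predictions, isotonic regression for recalibration, and an isotonic fit of outcome on prediction as a crude stand-in for the smoothed observed rate used by the integrated calibration index (ICI).

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(3)
# Simulated outcome with known event probabilities, and a miscalibrated
# (over-confident) predictor that pushes probabilities toward 0 and 1.
p_true = rng.uniform(0.1, 0.9, size=1000)
y = (rng.random(1000) < p_true).astype(int)
p_raw = p_true ** 2 / (p_true ** 2 + (1 - p_true) ** 2)

# Recalibrate: isotonic regression of the outcome on the raw predictions.
iso = IsotonicRegression(out_of_bounds="clip")
p_cal = iso.fit_transform(p_raw, y)

def ici(pred, truth):
    """Crude ICI: mean |predicted - observed|, with observed rates from an
    isotonic fit of outcome on prediction (a stand-in for loess smoothing)."""
    smooth = IsotonicRegression(out_of_bounds="clip").fit(pred, truth)
    return float(np.mean(np.abs(pred - smooth.predict(pred))))

ici_raw, ici_cal = ici(p_raw, y), ici(p_cal, y)  # lower is better calibrated
```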