School of Public Health

Kristen E. Gray

Prediction modeling for live birth in in vitro fertilization


Approximately 150,000 women undergo in vitro fertilization (IVF) each year to treat infertility. The success of IVF is limited, and the procedure is costly and time-consuming and poses physical and emotional health risks to the patient. Personalized probabilities of live birth may therefore assist patients and clinicians in decision-making. We sought to examine the ability of individual biomarkers, including anti-Müllerian hormone (AMH, a biomarker of ovarian reserve), and of multivariable models to predict the probability of live birth prior to initiating stimulation for IVF.


We included fresh, autologous IVF cycles initiated between 2005 and 2011 from five U.S. infertility clinics. We developed and validated multivariable models predicting the probability of live birth in 23,154 first IVF cycles and in 8,184 second IVF cycles following a single failed cycle, using three levels of model complexity: (a) backward stepwise logistic regression (p>0.2) with main effects only, (b) backward stepwise logistic regression with main effects and interactions, and (c) boosted regression trees. For first cycles, eligible predictors were those obtained at the baseline infertility evaluation (e.g., demographics, anthropometrics, pregnancy history, infertility diagnosis, stimulation protocol). For second cycles, eligible predictors were these baseline variables plus the treatment response in the previous failed cycle (e.g., dose of gonadotropins, egg and embryo characteristics, cycle outcome). For comparison, we fit age category and linear age models. Because of missing data, we imputed 15 datasets using multiple imputation by chained equations. In the 20% of data reserved for validation, we estimated the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), and the difference in AUCs between models, along with bootstrapped 95% confidence intervals (CIs). In a subsample of data from a single clinic, we evaluated the ability of AMH to predict live birth in all fresh, autologous IVF cycles from 2010-2011 (N=834) and compared it with widely collected biomarkers of ovarian reserve, including age, antral follicle count (AFC), and follicle-stimulating hormone (FSH). We estimated the ROC curves, AUCs, and differences in AUCs between biomarkers, along with bootstrapped 95% CIs. We also evaluated the performance of AMH within subgroups based on age, body mass index (BMI), polycystic ovary syndrome (PCOS) status, and infertility diagnosis.
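
The comparison of AUCs with bootstrapped 95% CIs described above can be illustrated with a minimal sketch. It is not the analysis code used in this work: the function names (`auc`, `percentile`, `bootstrap_auc_difference`), the percentile-bootstrap approach, the number of resamples, and the simulated marker values are all assumptions made for illustration. The AUC is computed as the probability that a randomly chosen cycle with a live birth has a higher predicted score than a randomly chosen cycle without one, and both predictors are evaluated on the same resampled cycles so that the difference is paired.

```python
import random


def auc(labels, scores):
    """AUC as the Mann-Whitney probability that a positive outranks a negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def percentile(sorted_vals, q):
    """Quantile q (0..1) of an ascending list, with linear interpolation."""
    idx = q * (len(sorted_vals) - 1)
    lo = int(idx)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = idx - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def bootstrap_auc_difference(labels, scores_a, scores_b, n_boot=1000, seed=0):
    """Point estimate and percentile 95% CI for AUC(a) - AUC(b).

    Observations are resampled with replacement, and both predictors are
    scored on the same resample (a paired comparison).
    """
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # a resample with a single outcome class has no AUC
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(auc(y, a) - auc(y, b))
    diffs.sort()
    point = auc(labels, scores_a) - auc(labels, scores_b)
    return point, percentile(diffs, 0.025), percentile(diffs, 0.975)


if __name__ == "__main__":
    # Simulated data for illustration only; these are not study values.
    rng = random.Random(42)
    labels = [1 if rng.random() < 0.35 else 0 for _ in range(150)]
    marker_a = [rng.gauss(0.6 * y, 1.0) for y in labels]  # more informative
    marker_b = [rng.gauss(0.1 * y, 1.0) for y in labels]  # less informative
    diff, lower, upper = bootstrap_auc_difference(
        labels, marker_a, marker_b, n_boot=200
    )
    print(f"AUC difference = {diff:.2f} (95% CI = {lower:.2f}, {upper:.2f})")
```

A CI for the difference that excludes zero would indicate that the two predictors discriminate differently in the validation data. The pairwise AUC computation here is quadratic in the number of cycles and is meant only for small examples.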


In first IVF cycles, all predictors were included in the main effects and interactions model. All multivariable models performed similarly (AUCs=0.67, 95% CIs=0.66, 0.69) and only slightly better than age-based models (age category AUC=0.64, 95% CI=0.63, 0.65; linear age AUC=0.65, 95% CI=0.64, 0.67). In second IVF cycles, many variables from the failed first cycle were included as predictors in addition to most baseline variables. Multivariable models performed only slightly better than age-based models (AUCs=0.63), with AUCs ranging from 0.67 (main effects, 95% CI=0.65, 0.70) to 0.72 (boosted regression trees, 95% CI=0.68, 0.77). When we examined individual biomarkers of ovarian reserve, AMH, age, and AFC had similar performance, with AUCs ranging from 0.63 (95% CI=0.59, 0.67) to 0.67 (95% CI=0.64, 0.71); FSH had the poorest performance (AUC=0.55, 95% CI=0.51, 0.59). Only FSH had an AUC significantly different from that of AMH (difference=0.08, 95% CI=0.04, 0.13). No substantial differences in AMH performance were observed within subgroups.


Multivariable models performed only slightly better than simple age-based models or models based on other single biomarkers of ovarian reserve. Accuracy improved very little with increasing model complexity, with small or no differences between boosted regression trees and stepwise techniques. All models and individual predictors had only modest performance, with AUCs of 0.72 or lower. The minimal improvements in performance for multivariable models are likely not substantial enough to warrant widespread clinical application, which would necessitate software development for calculating individualized probabilities. Despite the modest performance overall, there may be subgroups of women in whom the predictors and the chance of live birth differ. Future investigations should examine whether models developed within relevant subgroups, such as those based on age, race, and diagnosis, perform better.