Cancer survival prediction of six cancer types with germline whole-exome sequencing data from the UK Biobank

Tongqiu Iris Jia | 2022

Advisor: Sara Lindstroem


Background: Predicting cancer prognosis from genetic signatures is one of the most important tasks in cancer genomics. We aimed to apply machine learning approaches to germline whole-exome sequencing (WES) data to predict survival for breast cancer, colorectal cancer, kidney cancer, lung cancer, lymphoma, and prostate cancer. Methods: We analyzed the UK Biobank exome sequencing data and survival status of 10,721 incident cancer cases. We annotated WES variants with Combined Annotation Dependent Depletion (CADD) and generated gene-level CADD scores, which indicate the deleterious impact of mutations. Using the gene-level CADD scores, we performed unsupervised feature selection with six dimension reduction methods: principal component analysis (PCA), independent component analysis (ICA), random projection (RP), autoencoder (AE), denoising autoencoder (DAE), and variational autoencoder (VAE). For each cancer type, we trained logistic regression models to predict cancer survival status from the selected features. Results: For every cancer type, all models had a prediction accuracy of around 0.5. Overall, the accuracies of the deep learning models were comparable to those of the linear models. When comparing VAE models with latent spaces of different sizes, we did not observe an increase in accuracy as the latent space grew. Conclusion: Across all six cancer types, survival prediction accuracy was similar for all models, indicating that the more complex deep learning methods did not improve prediction performance. The features or embeddings derived from the six tested dimension reduction methods had limited predictive ability for cancer survival.
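The linear part of the workflow described in the Methods (dimension reduction followed by logistic regression) can be illustrated with a minimal sketch. This is not the thesis code: it assumes scikit-learn, uses synthetic random data in place of gene-level CADD scores, and the function name `evaluate_reducers`, the number of components, the standardization step, and the 70/30 train/test split are all hypothetical choices made for illustration. The autoencoder-based methods (AE, DAE, VAE) are omitted.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.random_projection import GaussianRandomProjection


def evaluate_reducers(X, y, n_components=10, seed=0):
    """Fit reducer + logistic regression pipelines; return test accuracy per reducer.

    Hypothetical helper: X stands in for a samples-by-genes matrix of
    gene-level scores, y for a binary survival status.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    reducers = {
        "PCA": PCA(n_components=n_components, random_state=seed),
        "ICA": FastICA(n_components=n_components, random_state=seed, max_iter=1000),
        "RP": GaussianRandomProjection(n_components=n_components, random_state=seed),
    }
    accuracies = {}
    for name, reducer in reducers.items():
        # The reducer is fitted on training data only, without using the labels,
        # and the classifier is trained on the reduced features.
        model = make_pipeline(
            StandardScaler(), reducer, LogisticRegression(max_iter=1000)
        )
        model.fit(X_train, y_train)
        accuracies[name] = model.score(X_test, y_test)
    return accuracies


# Synthetic stand-in data (assumption): 300 samples, 50 "genes", random labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
y = rng.integers(0, 2, size=300)

accuracies = evaluate_reducers(X, y)
for name, acc in accuracies.items():
    print(f"{name}: accuracy = {acc:.3f}")
```

Because the synthetic labels here are independent of the features, the printed accuracies carry no information about the thesis results; the sketch only shows how the reducers and the classifier fit together.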