Machine Learning for Early Lung Cancer Identification Using Routine Clinical and Laboratory Data

Authors:

Michael K. Gould

Brian Z. Huang

Martin C. Tammemagi

Yaron Kinar

Ron Shiff

Abstract

Rationale

Most lung cancers are diagnosed at an advanced stage. Pre-symptomatic identification of high-risk individuals can prompt earlier intervention and improve long-term outcomes.

Objectives

To develop a machine-learning model that predicts a future diagnosis of lung cancer from routine clinical and laboratory data.

Methods

We assembled 6,505 non-small cell lung cancer (NSCLC) cases and 189,597 contemporaneous controls. We compared the accuracy of a novel machine-learning model with that of a modified version of the well-validated PLCOm2012 risk model, using the area under the receiver operating characteristic curve (AUC), sensitivity, and diagnostic odds ratio (OR) as measures of model performance.
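
As an illustration of the performance measures named above, the following is a minimal sketch, not the authors' code, of how an AUC and a sensitivity at a fixed specificity can be computed from risk scores. The function names and the toy scores are hypothetical; the abstract does not describe the software or the exact procedure that was used.

```python
import math


def auc(case_scores, control_scores):
    """AUC as the probability that a randomly chosen case scores higher
    than a randomly chosen control (ties count as one half)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def sensitivity_at_specificity(case_scores, control_scores, specificity=0.95):
    """Sensitivity when the cut-off is set so that at least `specificity`
    of the controls fall at or below it (score > cut-off means positive)."""
    ranked = sorted(control_scores)
    index = math.ceil(specificity * len(ranked)) - 1
    threshold = ranked[index]
    flagged = sum(1 for score in case_scores if score > threshold)
    return flagged / len(case_scores)


# Toy, made-up risk scores (assumption: higher score means higher risk).
toy_controls = list(range(1, 21))   # 20 controls with scores 1..20
toy_cases = [8, 15, 25, 30, 40]     # 5 cases

toy_auc = auc(toy_cases, toy_controls)
toy_sensitivity = sensitivity_at_specificity(toy_cases, toy_controls, 0.95)
print(toy_auc, toy_sensitivity)
```

Fixing the specificity first and then reading off the sensitivity mirrors the "pre-defined specificity of 95%" wording: the cut-off is chosen from the controls alone, and the cases are only used to measure how many are caught at that cut-off.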

Results

Among ever-smokers in the test set, the machine-learning model was more accurate than the modified PLCOm2012 for identifying NSCLC 9-12 months before clinical diagnosis (P<0.00001), with an AUC of 0.86, a diagnostic OR of 12.83, and a sensitivity of 40.31% at a pre-defined specificity of 95%. In comparison, the modified PLCOm2012 had an AUC of 0.79, an OR of 7.4, and a sensitivity of 27.9% at the same specificity. The machine-learning model was also more accurate than standard eligibility criteria for lung cancer screening, and more accurate than the modified PLCOm2012 model when applied to a screening-eligible population. Influential model variables included known risk factors and novel predictors such as white blood cell and platelet counts.
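
The diagnostic odds ratio follows arithmetically from sensitivity and specificity: OR = [sensitivity / (1 - sensitivity)] / [(1 - specificity) / specificity]. The short check below assumes this standard definition is the one that was used (the abstract does not say) and applies it to the sensitivities reported at 95% specificity; it yields roughly 12.8 for the machine-learning model and 7.4 for the modified PLCOm2012. The function name is hypothetical.

```python
def diagnostic_odds_ratio(sensitivity, specificity):
    """Odds of a positive result among cases divided by the odds of a
    positive result among controls."""
    odds_positive_in_cases = sensitivity / (1.0 - sensitivity)
    odds_positive_in_controls = (1.0 - specificity) / specificity
    return odds_positive_in_cases / odds_positive_in_controls


# Sensitivities at 95% specificity as reported in the Results.
or_ml = diagnostic_odds_ratio(0.4031, 0.95)    # machine-learning model
or_plco = diagnostic_odds_ratio(0.279, 0.95)   # modified PLCOm2012

print(round(or_ml, 2))    # 12.83
print(round(or_plco, 1))  # 7.4
```

Because both models are evaluated at the same specificity, the denominator is identical, so the difference in OR comes entirely from the difference in sensitivity.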

Conclusions

A machine-learning model was more accurate for early diagnosis of NSCLC than either standard eligibility criteria for screening or the modified PLCOm2012, demonstrating the potential to help prevent lung cancer deaths through early detection.