Colombine Verzat: Machine Learning for Adverse Event Detection in Latent Tuberculosis Infection Treatment
Posted on Wed 15 July 2020 in theses
One quarter of the world’s population has latent tuberculosis infection (LTBI). In this form of the disease, the bacteria has a 10 to 15% chance to start replicating and cause the patient to develop active tuberculosis. In those cases, preventive therapy is thus essential to limit the spread of the disease. Unfortunately, the treatment for LTBI can cause severe adverse events which discourages patients. Predicting which patients are most at risk of developing adverse events could thus improve treatment efficacy and help achieving WHO goals of TB elimination by 2050.
The goal of this study is to identify whether it is possible to predict the occurrence of adverse events in patients based on their clinical data.
To address this, we disposed of a clinical dataset of 6485 patients who had LTBI and went through treatment. A small part of these patients developed adverse events associated to the treatment. First, we reproduced a study by Campbell et al. performed on this dataset using a logistic regression model. We then investigated the predictive power of this model using generalization and established a baseline from the resulting model. Finally, we explored how non-linear machine learning models could improve the performance compared to the baseline.
We found that multivariate logistic regression yielded a classifier with the following performance: AUC= 0.65 ± 0.04. Although non-linear techniques matched the baseline performance, they failed to significantly improve the prediction further.
These findings suggest that part of the data is linearly separable, while some isolated points in the dataset cannot be easily generalized. Patients with and without adverse events seem to overlap in the variable space, which suggests that an efficient detection of adverse events is difficult to achieve with this dataset. The improvement of the model may require a larger and less imbalanced dataset, possibly along further explanatory variables permitting a better characterization of the patient.
Available Materials