
Using Artificial Intelligence and Novel Polynomials to Predict Subjective Refraction

Abstract

This work aimed to use artificial intelligence to predict subjective refraction from wavefront aberrometry data processed with a novel polynomial decomposition basis. Subjective refraction was converted to power vectors (M, J0, J45). Three gradient-boosted tree (XGBoost) algorithms were trained to predict each power vector using data from 3729 eyes. The model was validated by predicting the subjective refraction power vectors of 350 other eyes, unknown to the model. The machine learning models were significantly better than the paraxial matching method for producing a spectacle correction, resulting in a mean absolute error of 0.301 ± 0.252 Diopters (D) for the M vector, 0.120 ± 0.094 D for the J0 vector and 0.094 ± 0.084 D for the J45 vector. Our results suggest that subjective refraction can be accurately and precisely predicted from novel polynomial wavefront data using machine learning algorithms. We anticipate that the combination of machine learning and aberrometry based on this novel wavefront decomposition basis will aid the development of refined algorithms which could become a new gold standard for predicting refraction objectively.

Introduction

Globally, an estimated 153 million people aged 5 or above are visually impaired due to uncorrected refractive errors1. The ability to automatically refract a patient and provide a spectacle prescription equivalent to the time-consuming current gold standard of subjective refraction is an elusive goal that has intrigued many ophthalmic clinicians and researchers2,3,4.

One such automated and objective method is optical wavefront sensing using aberrometry, which allows mathematical reconstruction and analysis of the lower and higher order monochromatic aberrations of the eye. This has led many to believe that this objective method has the potential to become the new standard for optimizing the correction of refractive errors by converting aberrometry data to accurate sphero-cylindrical refractions5,6,7.

Though several small-sample studies have shown promising results in terms of accuracy and precision of objective refraction from several methods related to wavefront analysis5,6,7,8,9,10,11,12,13, to date no study has found a validated method that can be used to prescribe a spectacle correction. One study found that results from the aberrometer, autorefractor and subjective refraction, though comparable with each other, were not accurate enough to prescribe spectacles directly from either instrument14. A recent publication found that a visual image quality metric could predict subjective refraction in myopic eyes but not in habitually undercorrected hyperopic eyes, though the data set was again small15. Variability in the gold standard subjective refraction measurements themselves was also thought to be a source of poor precision7.

The ocular wavefront error is most commonly described by the Zernike polynomials16. To satisfy orthogonality constraints with low order modes, some higher order Zernike polynomials contain low order terms in their analytical expression, leading to a lack of accuracy when predicting the sphero-cylindrical refraction17. It is known that conventional therapies such as spectacles or contact lenses correct just the lower-order aberrations, but the presence of higher-order aberrations influences the prescription itself12. An important finding by Cheng et al.6 showed that subjective judgment of best focus does not minimize RMS wavefront error (Zernike defocus = 0), nor create paraxial focus (Seidel defocus = 0), but makes the retina conjugate to a plane between these two. The levels of spherical aberration (\(Z_4^0\)) and secondary astigmatism (\(Z_4^{\pm 2}\)) influenced the levels of defocus and primary astigmatism that produced the best visual performance. These objective metrics were tested on the assumption that the Zernike polynomial decomposition produces a clear distinction between the low and high order components of the wavefront error. This assumption could explain the poor correlation between subjective and objective refraction, especially in the presence of large amounts of higher order aberration5,6,7,9,11,12,18,19.

A new series of polynomials, labeled LD/HD (Low Degree/High Degree), has been proposed to provide a more mathematically relevant separation of the higher and lower modes20. In this decomposition, the normalized higher order modes are devoid of low order terms and are mutually orthogonal among themselves, though not orthogonal to the lower order aberrations. With this approach, the low order wavefront component is equal to the paraxial curvature matching of the wavefront map.

Machine learning is already in use in ophthalmology for image analysis in medical retina21,22,23 and glaucoma (visual fields and disc photos)24,25, as well as in recent developments for diagnoses including retinopathy of prematurity26. It is also used in regression tasks, notably in IOL calculations27. Deep learning has been applied to predict refractive error from fundus images and via other image analysis techniques28,29. Attempts to predict subjective refraction from Zernike polynomials have also been made using a multilayer perceptron with two hidden layers30.

Our aim was to build and evaluate a set of predictive machine learning models to objectively refract a patient accurately and precisely using wavefront aberrometry with the LD/HD polynomial decomposition, and to evaluate the relative importance of each polynomial in the prediction process for each vector.

Results

Patient demographics are presented in Table 1. The training and test sets were comparable in terms of patient age, sex ratio, laterality, mean refractive spherical equivalent and mean refractive cylinder.

Table 1 Patient demographics in the training set and test set. A t-test was performed to test for group comparability. Abbreviations used: Spherical Equivalent (SE), Cylinder (Cyl) and number (n). SE and Cyl are in Diopters (D).


The performance of each prediction method is presented in Table 2. Figure 1 illustrates the prediction performance of the different approaches for the M vector and Fig. 2 illustrates that for the J0 and J45 vectors. Statistical testing for differences between the various prediction methods is presented in Table 3. The XGBoost models using all the polynomials resulted in a Mean Absolute Error of 0.30 ± 0.25 Diopters (D) for the M vector, 0.12 ± 0.09 D for the J0 vector and 0.09 ± 0.08 D for the J45 vector, whilst the paraxial matching method resulted in a Mean Absolute Error of 0.40 ± 0.35 D for the M vector, 0.17 ± 0.14 D for the J0 vector and 0.14 ± 0.1 D for the J45 vector. The XGBoost models using only the low-degree polynomials resulted in a Mean Absolute Error of 0.35 ± 0.29 D for the M vector, 0.16 ± 0.14 D for the J0 vector and 0.12 ± 0.10 D for the J45 vector. Bland-Altman plots showed good agreement between subjective refraction and the predictions obtained with the machine learning models, with no systematic error depending on the degree of refractive error (Fig. 4). Paired t-tests were not significant.

Table 2 Absolute Prediction Error, Prediction Error and Precision of the three methods evaluated (Paraxial Matching, XGBoost using low order polynomials only, XGBoost using all available polynomials) for each vector M, J0 and J45. P values for the comparisons between methods are presented in Table 3. Abbreviations used: Low Order (LO), Low and High Order (LO/HO) and Paraxial Matching (PM).


Figure 1

Probability density function (Gaussian kernel density estimate) of the spherical prediction error for the three methods studied: paraxial fitting with the low degree LD/HD polynomials (red), the XGBoost model using low degree polynomials only (green) and the XGBoost model using all aberrations (blue). The density of accurate predictions is highest with the latter.


Figure 2

Scatter plot showing the J0 vector prediction error on the X-axis and the J45 vector prediction error on the Y-axis, with corresponding 95% confidence ellipses for the three methods studied: paraxial fitting with the low degree LD/HD polynomials (red), the XGBoost model using low degree polynomials only (green) and the XGBoost model using all aberrations (blue). The black cross marks the (0,0) coordinate. Precision is best with the latter method.


Table 3 Pairwise statistical comparison of the different prediction methods. Mean Absolute Errors and Mean Errors were compared using a Wilcoxon signed-rank test and Precision differences were compared using the Levene test for equal variances. Abbreviations used: Low (L), High (H), Paraxial Matching (PM).


The XGBoost models using all the polynomials performed statistically better than Paraxial matching for every vector and every metric, except for accuracy for the J45 vector prediction. They also performed better than the XGBoost models trained with low-degree polynomials only, although the difference was not significant for precision in predicting the M vector and accuracy in predicting the J0 and J45 vectors.

SHAP value analysis for the three XGBoost models trained with the full set of polynomials is presented in Fig. 3. It showed that \(G_2^0\) (defocus) was by far the most influential feature for predicting the M vector, with \(G_4^0\) (primary spherical aberration) being the second most important feature. The bottom two graphs demonstrate that \(G_2^2\) (vertical astigmatism) and \(G_4^{-2}\) (oblique secondary astigmatism) were the most important features for predicting the J0 vector, while \(G_2^{-2}\) (oblique astigmatism) and \(G_4^2\) (vertical secondary astigmatism) were the most important features for predicting the J45 vector.

Discussion

The machine learning approach using the LD/HD polynomials was more effective than the paraxial matching method previously used by Thibos et al.7 for predicting the results of conventional sphero-cylindrical refraction from wavefront aberrations. Interestingly, the XGBoost models trained using low-order aberrations only proved more accurate than paraxial matching. This could suggest that those low-order polynomials interact, in some circumstances, in a more complex way than previously thought. The best precision and accuracy were obtained when all the novel polynomial coefficients were used as predictors, demonstrating the significant influence of the higher order aberrations on the spectacle correction.

Gradient boosting creates new models that predict the residual errors of prior models during the training process; the models are then used together to predict the target value. XGBoost is an implementation of gradient boosted trees focused on performance and computational efficiency. It can perform both regression and classification tasks. It was chosen because of its recognized performance and its resistance to overfitting31.

Regarding feature importance, \(G_2^0\) (defocus) was unsurprisingly the most influential feature for predicting the M vector, with \(G_4^0\) (primary spherical aberration) being the second most important feature. One interesting finding was that \(G_4^{-2}\) (oblique secondary astigmatism) was the second most important feature for predicting J0, and \(G_4^2\) (vertical secondary astigmatism) the second most important feature for predicting J45, while the inverse would be more intuitive. This demonstrates the value of the machine learning approach, which allows us to discover new patterns and relationships between predictors by disregarding previous assumptions.

Our results confirm the prevalence of 4th order aberrations among the higher order coefficients influencing the sphero-cylindrical refraction, as previously shown6. Because the LD/HD higher order modes are devoid of defocus terms (radial degree 2), they unambiguously confirm the influence of the radial degree 4 component of the wavefront error on sphero-cylindrical refraction.

The predictive influence of the variables used in the model does not explain their exact role, and that is a weakness of such machine learning algorithms, as interpretability and model comprehension are limited by the large number of decision trees, their complexity and their depth.

Of note, we did not test our method for repeatability. However, it relies solely on the OPD-Scan III output, and this device has already shown very good repeatability32,33,34,35.

Our study had some unavoidable limitations, among which is accommodation. We designed the study around undilated refraction, mirroring the real-life clinical environment in which spectacle correction is provided to adults, as well as preserving data volume. We did not test children or elderly patients, so we cannot generalize to these groups. By virtue of the technique, it is not possible to objectively refract patients with strabismus, corneal scarring, cataracts or vitreous opacity that would preclude clear wavefront analysis.

Precision may be masked by the imprecision of the gold standard of subjective refraction. Of note, the examiner was aware of the autorefraction. We hope our study results will enable future development of machine learning algorithms based on the LD/HD polynomials and objective refraction techniques, to prescribe glasses efficiently not only for adults but also for children and vulnerable adults, without the need for their input or prolonged cooperation.

Methods

This study was approved by the Institutional Review Board at the Rothschild Foundation and followed the tenets of the Declaration of Helsinki. Informed consent was obtained from all participants. A total of 2890 electronic medical records of patients (6397 eyes) evaluated for refractive surgery at the Laser Vision Institute Noémie de Rothschild (Foundation Adolphe de Rothschild Hospital, Paris) were retrieved and consenting patients' data were analyzed. We excluded patients with strabismus and any other ocular abnormalities except ametropia. After data cleaning, eyes with a subjective refraction and a valid wavefront aberrometry examination were randomly split into a 350-eye test set and a training set, with no crossover of data from the same patient. Medical records of eyes in the test set were manually reviewed to ensure the quality of the data, leaving 3729 eyes for the training set.

Wavefront analysis was obtained using the OPD-Scan III (Nidek, Gamagori, Japan). The aberrometer was specially configured to run beta software incorporating the new series of LD/HD polynomials, noted \(G_n^m\) using the same double index scheme as the Zernike polynomials. The wavefronts were decomposed up to the 6th order. This cut-off is beyond the number of polynomials determined in 2000 by the members of the Vision Science and its Applications (VSIA) standards taskforce to be necessary to describe the HOA of the human eye with sufficient accuracy36. It applied to the paraxial matching analysis as well as the machine learning approach. The first three polynomials (piston, tilt, tip) were removed from the features because of their low relevance to this work. Defocus, vertical astigmatism and oblique astigmatism constituted the low order polynomial group, and all the others constituted the high order polynomial group. A 4 mm pupil disk diameter was chosen to obtain the coefficients, and any pupil smaller than 4 mm during the acquisition of the wavefront with the OPD-Scan III was an exclusion criterion. The 4 mm pupil diameter analysis cut-off was used because it is close to the mean physiological photopic pupil diameter reported in different studies37,38,39. Our results may not reflect the results that could be found using very large or very small pupils.

Corresponding non-cycloplegic subjective refractions conducted on the same day by an experienced optometrist were analyzed. The maximum plus rule was used to the nearest 0.25 D to minimize accommodation and maximize the depth of focus7.

Each refraction in Sphere S, Cylinder C, and axis α format was transformed into 3D dioptric vector space (M, J0, J45) where the three components are orthogonal. Refraction data sets were vectorized using standard power vector analysis with the components M, J0 and J4540 using Eqs. (1), (2) and (3).

$${\rm{M}}=S+\frac{C}{2}$$

(1)

$${\rm{J0}}=-\frac{C}{2}\times\cos(2\alpha)$$

(2)

$${\rm{J45}}=-\frac{C}{2}\times\sin(2\alpha)$$

(3)
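As a concrete illustration, the conversion of Eqs. (1)–(3) is straightforward to implement. The sketch below is our own minimal Python version (the function and variable names are illustrative, not from the original study); it assumes the axis is given in degrees, as in clinical notation.

```python
import numpy as np

def to_power_vectors(sphere, cylinder, axis_deg):
    """Convert a sphero-cylindrical refraction (S, C, axis) into the
    power vector components (M, J0, J45) of Eqs. (1)-(3)."""
    alpha = np.deg2rad(axis_deg)  # the trigonometric terms expect radians
    M = sphere + cylinder / 2.0
    J0 = -(cylinder / 2.0) * np.cos(2.0 * alpha)
    J45 = -(cylinder / 2.0) * np.sin(2.0 * alpha)
    return M, J0, J45

# Example: -2.00 D sphere, -0.50 D cylinder, axis 180 degrees
print(to_power_vectors(-2.00, -0.50, 180))
```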

Three machine learning models were separately trained to predict each vector component from the new series of polynomials. We used a gradient boosted trees algorithm (XGBoost)41. Parameter tuning was performed using 5-fold randomized-search cross-validation. Mean squared error regression loss was chosen as the evaluation metric. We used Python 3.6.8 with the following libraries: Jupyter 4.4.0, Pandas 0.23.4, Scikit-learn 0.20.2, Matplotlib 3.0.2, Seaborn 0.9.0 and XGBoost 0.81.
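The exact hyperparameter grid is not reported; the sketch below shows one plausible way to reproduce this training setup with scikit-learn's RandomizedSearchCV. The file names and parameter ranges are hypothetical.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical arrays: rows are eyes, columns are LD/HD coefficients,
# and y_M holds the subjective M component for each eye.
X = np.load("ldhd_coefficients.npy")
y_M = np.load("subjective_M.npy")

param_distributions = {          # illustrative ranges, not the authors' grid
    "n_estimators": [100, 300, 500, 1000],
    "max_depth": [3, 4, 5, 6],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 0.8, 1.0],
}

search = RandomizedSearchCV(
    xgb.XGBRegressor(),
    param_distributions,
    n_iter=50,
    cv=5,                                # 5-fold cross-validation
    scoring="neg_mean_squared_error",    # MSE regression loss
    random_state=0,
)
search.fit(X, y_M)
model_M = search.best_estimator_  # train analogous models for J0 and J45
```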

SHAP (SHapley Additive exPlanations) values were calculated for each model in order to determine the most influential polynomials (Fig. 3).

Figure 3

SHAP feature importance for each XGBoost model using the all-aberrations approach. The top graph (a) displays the most important features for M prediction: \(G_2^0\) (defocus) and \(G_4^0\) (primary spherical aberration) were the most influential. The bottom two graphs (b,c) demonstrate that \(G_2^2\) (vertical astigmatism) and \(G_4^{-2}\) (oblique secondary astigmatism) were the most important features for predicting the J0 vector, while \(G_2^{-2}\) (oblique astigmatism) and \(G_4^2\) (vertical secondary astigmatism) were the most important features for predicting the J45 vector.


Figure 4

Bland-Altman diagrams showing the agreement between the predictions made using the XGBoost model trained with all available aberrations and subjective refraction, for M (a), J0 (b) and J45 (c). No statistical difference was found using a one-sample t-test for each vector prediction.


The SHAP value is a recently described tool that aims to increase the interpretability of machine learning models42. It allows us to understand how a specific feature contributes, negatively or positively, to the prediction of the target variable by computing the contribution of each feature towards the prediction. This allows better estimation of the importance of each feature in the prediction process. A random variable was introduced as a predictive feature during training in order to help differentiate useful features from the others.
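For tree ensembles such as XGBoost, SHAP values can be computed exactly and efficiently. A minimal sketch of the analysis described above, assuming model_M and the feature matrix X (including the random control column) from the training step:

```python
import numpy as np
import shap

# model_M: the trained XGBoost model for the M vector; X: feature matrix
# whose columns are the LD/HD coefficients plus one random control column.
explainer = shap.TreeExplainer(model_M)   # exact SHAP values for tree models
shap_values = explainer.shap_values(X)    # shape: (n_eyes, n_features)

# Global importance: mean absolute SHAP value per feature. Features ranking
# below the random control column contribute little useful signal.
importance = np.abs(shap_values).mean(axis=0)
ranking = np.argsort(importance)[::-1]
```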

The performance of the machine learning models was evaluated on the test set, which was never seen by the models nor used for hyperparameter selection, to avoid overfitting. For each machine learning approach (using low order polynomials only, and using every polynomial), the three vectors of the refraction were predicted one by one using the three machine learning models. Paraxial matching predictions were calculated using Eqs. (4)–(6)

$${\rm{M}}=\frac{-4\sqrt{3}\,{G}_{2}^{0}}{{r}^{2}}$$

(4)

$${\rm{J0}}=\frac{-2\sqrt{6}\,{G}_{2}^{2}}{{r}^{2}}$$

(5)

$${\rm{J45}}=\frac{-2\sqrt{6}\,{G}_{2}^{-2}}{{r}^{2}}$$

(6)

where \(G_n^m\) is the nth order LD/HD coefficient of meridional frequency m, and r is the pupillary radius. It is important to note that, because the high order LD/HD coefficients are devoid of low-order aberrations, this calculation is equivalent to the paraxial curvature matching obtained by computing the curvature at the origin of the Zernike expansion of the Seidel formulae for defocus and astigmatism, as described by Thibos et al.7.
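A minimal Python version of Eqs. (4)–(6) follows; under the common convention of coefficients in micrometers and pupil radius in millimeters, the outputs are in diopters (the function name and unit handling are our own):

```python
import math

def paraxial_matching(g20, g22, g2m2, pupil_radius_mm):
    """Paraxial matching refraction (Eqs. (4)-(6)) from the three
    low-degree LD/HD coefficients, given in micrometers, and the
    pupil radius in millimeters; returns (M, J0, J45) in diopters."""
    r2 = pupil_radius_mm ** 2
    M = -4.0 * math.sqrt(3.0) * g20 / r2
    J0 = -2.0 * math.sqrt(6.0) * g22 / r2
    J45 = -2.0 * math.sqrt(6.0) * g2m2 / r2
    return M, J0, J45
```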

Mean absolute errors were calculated for each prediction method. Accuracy of the predictions for each vector was defined as the mean value of the prediction error. Precision was defined as two times the standard deviation (SD) of the prediction error7. Each prediction method was evaluated against the others. Mean absolute prediction errors and mean prediction errors were compared using the Wilcoxon signed-rank test with Bonferroni correction. Differences in precision were evaluated using the Levene test with Bonferroni correction. We used a confidence ellipse similar to that of Thibos et al. to graphically report our results7,43. Bland-Altman plots and paired t-tests were conducted to study the agreement between subjective refraction and the machine learning model predictions. A p-value less than 0.05 was considered significant.
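These statistical comparisons map directly onto SciPy. A sketch under the assumption that err_ml and err_pm are placeholder arrays holding the signed per-eye prediction errors of two methods on the same test eyes:

```python
import numpy as np
from scipy.stats import wilcoxon, levene

# err_ml, err_pm: signed prediction errors for the machine learning and
# paraxial matching methods (same eyes, same order); placeholder names.
abs_ml, abs_pm = np.abs(err_ml), np.abs(err_pm)

n_tests = 3                               # Bonferroni over the method pairs
_, p_accuracy = wilcoxon(abs_ml, abs_pm)  # paired test on absolute errors
_, p_precision = levene(err_ml, err_pm)   # equality of variances (precision)

p_accuracy_adj = min(1.0, p_accuracy * n_tests)
p_precision_adj = min(1.0, p_precision * n_tests)
```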

Data availability

The datasets generated during and analyzed during the current study are available from the corresponding author on reasonable request.

References

1. Resnikoff, S., Pascolini, D., Mariotti, S. P. & Pokharel, G. P. Global magnitude of visual impairment caused by uncorrected refractive errors in 2004. Bull. World Health Organ. 86, 63–70 (2008).

2. Bühren, J., Martin, T., Kühne, A. & Kohnen, T. Correlation of aberrometry, contrast sensitivity, and subjective symptoms with quality of vision after LASIK. J. Refract. Surg. 25, 559–568 (2009).

3. Pesudovs, K., Parker, K. E., Cheng, H. & Applegate, R. A. The precision of wavefront refraction compared to subjective refraction and autorefraction. Optom. Vis. Sci. 84, 387–392 (2007).

4. Bullimore, M. A., Fusaro, R. E. & Adams, C. W. The repeatability of automated and clinician refraction. Optom. Vis. Sci. 75, 617–622 (1998).

5. Watson, A. B. & Ahumada, A. J. Jr. Predicting visual acuity from wavefront aberrations. J. Vis. 8(17), 1–19 (2008).

6. Cheng, X., Bradley, A. & Thibos, L. N. Predicting subjective judgment of best focus with objective image quality metrics. J. Vis. 4, 310–321 (2004).

7. Thibos, L. N., Hong, X., Bradley, A. & Applegate, R. A. Accuracy and precision of objective refraction from wavefront aberrations. J. Vis. 4, 329–351 (2004).

8. Maeda, N. Clinical applications of wavefront aberrometry – a review. Clin. Experiment. Ophthalmol. 37, 118–129 (2009).

9. Kilintari, M., Pallikaris, A., Tsiklis, N. & Ginis, H. S. Evaluation of image quality metrics for the prediction of subjective best focus. Optom. Vis. Sci. 87, 183–189 (2010).

10. Applegate, R. A., Marsack, J. D., Ramos, R. & Sarver, E. J. Interaction between aberrations to improve or reduce visual performance. J. Cataract Refract. Surg. 29, 1487–1495 (2003).

11. Jaskulski, M., Martínez-Finkelshtein, A. & López-Gil, N. New Objective Refraction Metric Based on Sphere Fitting to the Wavefront. J. Ophthalmol. 2017, 1909348 (2017).

12. Thibos, L. N. Unresolved issues in the prediction of subjective refraction from wavefront aberration maps. J. Refract. Surg. 20, S533–S536 (2004).

13. Marsack, J. D., Thibos, L. N. & Applegate, R. A. Metrics of optical quality derived from wave aberrations predict visual performance. J. Vis. 4, 8 (2004).

14. Bennett, J. R., Stalboerger, G. M., Hodge, D. O. & Schornack, M. M. Comparison of refractive assessment by wavefront aberrometry, autorefraction, and subjective refraction. J. Optom. 8, 109–115 (2015).

15. Hastings, G. D., Marsack, J. D., Nguyen, L. C., Cheng, H. & Applegate, R. A. Is an objective refraction optimized using the visual Strehl ratio better than a subjective refraction? Ophthalmic Physiol. Opt. 37, 317–325 (2017).

16. Lakshminarayanan, V. & Fleck, A. Zernike polynomials: a guide. J. Mod. Opt. 58, 1678 (2011).

17. Klyce, S. D., Karon, M. D. & Smolek, M. K. Advantages and disadvantages of the Zernike expansion for representing wave aberration of the normal and aberrated eye. J. Refract. Surg. 20, S537–S541 (2004).

18. Guirao, A. & Williams, D. R. A method to predict refractive errors from wave aberration data. Optom. Vis. Sci. 80, 36–42 (2003).

19. Applegate, R. A., Ballentine, C., Gross, H., Sarver, E. J. & Sarver, C. A. Visual acuity as a function of Zernike mode and level of root mean square error. Optom. Vis. Sci. 80, 97–105 (2003).

20. Gatinel, D., Malet, J. & Dumas, L. Polynomial decomposition method for ocular wavefront analysis. J. Opt. Soc. Am. A 35, 2035 (2018).

21. Choi, J. Y. et al. Multi-categorical deep learning neural network to classify retinal images: A pilot study employing small database. PLoS One 12, e0187336 (2017).

22. Tufail, A. et al. An observational study to assess if automated diabetic retinopathy image assessment software can replace one or more steps of manual imaging grading and to determine their cost-effectiveness. Health Technol. Assess. 20, 1–72 (2016).

23. Ting, D. S. W. et al. Artificial intelligence and deep learning in ophthalmology. Br. J. Ophthalmol. 103, 167–175 (2019).

24. Asaoka, R., Murata, H., Iwase, A. & Araie, M. Detecting Preperimetric Glaucoma with Standard Automated Perimetry Using a Deep Learning Classifier. Ophthalmology 123, 1974–1980 (2016).

25. Zhu, H., Poostchi, A., Vernon, S. A. & Crabb, D. P. Detecting abnormality in optic nerve head images using a feature extraction analysis. Biomed. Opt. Express 5, 2215 (2014).

26. Gupta, K. et al. A Quantitative Severity Scale for Retinopathy of Prematurity Using Deep Learning to Monitor Disease Regression After Treatment. JAMA Ophthalmol. 137, 1029 (2019).

27. Lee, A., Taylor, P., Kalpathy-Cramer, J. & Tufail, A. Machine Learning Has Arrived! Ophthalmology 124, 1726–1728 (2017).

28. Varadarajan, A. V. et al. Deep Learning for Predicting Refractive Error From Retinal Fundus Images. Invest. Ophthalmol. Vis. Sci. 59, 2861 (2018).

29. Libralao, G., Almeida, O. & Carvalho, A. Classification of ophthalmologic images using an ensemble of classifiers. Innov. Appl. Artif. Intell., 6–13 (2005).

30. Ohlendorf, A., Leube, A., Leibig, C. & Wahl, S. A machine learning approach to determine refractive errors of the eye. Invest. Ophthalmol. Vis. Sci. 58, 1136 (2017).

31. Reinstein, I. XGBoost, a Top Machine Learning Method on Kaggle, Explained. https://www.kdnuggets.com/2017/10/xgboost-top-machine-learning-method-kaggle-explained.html. Last accessed 2/1/2020 (2017).

32. Asgari, S. et al. OPD-Scan III: a repeatability and inter-device agreement study of a multifunctional device in emmetropia, ametropia, and keratoconus. Int. Ophthalmol. 36, 697–705 (2016).

33. Hamer, C. A. et al. Comparison of reliability and repeatability of corneal curvature assessment with six keratometers. Clin. Exp. Optom. 99, 583–589 (2016).

34. Guilbert, E. et al. Repeatability of Keratometry Measurements Obtained With Three Topographers in Keratoconic and Normal Corneas. J. Refract. Surg. 32, 187–192 (2016).

35. McGinnigle, S., Naroo, S. A. & Eperjesi, F. Evaluation of the auto-refraction function of the Nidek OPD-Scan III. Clin. Exp. Optom. 97, 160–163 (2014).

36. Thibos, L. N., Applegate, R. A., Schwiegerling, J. T., Webb, R. & VSIA Standards Taskforce Members. Standards for Reporting the Optical Aberrations of Eyes. Vision Science and its Applications (2000).

37. Sanchis-Gimeno, J. A., Sanchez-Zuriaga, D. & Martinez-Soriano, F. White-to-white corneal diameter, pupil diameter, central corneal thickness and thinnest corneal thickness values of emmetropic subjects. Surg. Radiol. Anat. 34, 167–170 (2012).

38. Hashemi, H. et al. Distribution of Photopic Pupil Diameter in the Tehran Eye Study. Curr. Eye Res. 34, 378–385 (2009).

39. Oshika, T. et al. Influence of Pupil Diameter on the Relation between Ocular Higher-Order Aberration and Contrast Sensitivity after Laser In Situ Keratomileusis. Invest. Ophthalmol. Vis. Sci. 47, 1334 (2006).

40. Thibos, L. N., Wheeler, W. & Horner, D. Power vectors: an application of Fourier analysis to the description and statistical analysis of refractive error. Optom. Vis. Sci. 74, 367–375 (1997).

41. Chen, T. & Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794 (ACM, 2016).

42. Lundberg, S. M. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30 (eds. Guyon, I. et al.) 4765–4774 (Curran Associates, Inc., 2017).

43. Bland, J. M. & Altman, D. G. Regression Analysis. Lancet 327, 908–909 (1986).


Author information

Author notes

R.R. and G.D. are joint first authors. R.R. and G.D. completed the data analysis, wrote the main manuscript text and prepared the tables and figures. D.G. and J.M. designed the study and developed the polynomial basis. All authors reviewed the manuscript.

Correspondence to
Damien Gatinel.

Ethics declarations

D.G. is a consultant for Nidek but received no funding for this study. G.D., R.R. and J.M. declare no potential conflict of interest.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.


Limited-View and Sparse Photoacoustic Tomography for Neuroimaging with Deep Learning

Abstract

Photoacoustic tomography (PAT) is a non-ionizing imaging modality capable of acquiring high contrast and resolution images of optical absorption at depths greater than traditional optical imaging techniques. Practical considerations with instrumentation and geometry limit the number of available acoustic sensors and their "view" of the imaging target, which results in image reconstruction artifacts that degrade image quality. Iterative reconstruction methods can be used to reduce artifacts but are computationally expensive. In this work, we propose a novel deep learning approach termed pixel-wise deep learning (Pixel-DL) that first employs pixel-wise interpolation governed by the physics of photoacoustic wave propagation and then uses a convolutional neural network to reconstruct an image. Simulated photoacoustic data from synthetic, mouse-brain, lung, and fundus vasculature phantoms were used for training and testing. Results demonstrated that Pixel-DL achieved comparable or better performance than iterative methods and consistently outperformed other CNN-based approaches for correcting artifacts. Pixel-DL is a computationally efficient approach that enables real-time PAT rendering and improved image reconstruction quality for limited-view and sparse PAT.

Introduction

Neuroimaging in small animals has played an essential role in preclinical research by providing physiological, pathological, and functional insights that are key for understanding and treating neurological diseases. Over the past few decades, there have been significant advances in magnetic resonance imaging (MRI) and optical imaging techniques for structural and functional neuroimaging. For example, MRI can acquire high resolution images of brain structures over large volumes, 3D connectivity and diffusivity information using diffusion tensor imaging, and brain activity using functional MRI1,2,3. However, MRI has poor temporal resolution and cannot be used to study fast hemodynamic mechanisms and responses. Optical imaging techniques can exploit the diverse biological molecules (e.g. hemoglobin, melanin, and lipids) present in biological tissues, each possessing different optical properties, to provide contrast for structural and functional imaging4,5,6. However, strong optical scattering limits the imaging depth of optical techniques to approximately 1–2 mm into the brain7.

Photoacoustic tomography (PAT) is an emerging non-invasive hybrid technique that has recently seen substantial growth in numerous preclinical biomedical applications and as a powerful clinical diagnostic tool8,9,10,11. In particular, there is a strong interest in PAT for preclinical structural and functional neuroimaging12,13,14,15,16. Given its unique use of light and sound, PAT combines the high contrast and molecular specificity of optical imaging with the high spatial resolution and centimeter-penetration depth of ultrasound imaging17,18,19. PAT has been demonstrated capable of kilohertz volumetric imaging rates, far exceeding the performance of other modalities, which enables new insights into previously obscure biological phenomena20. There are diverse contrast agents available such as chemical dyes, fluorescent proteins, and nanoparticles that can be used to further enhance the imaging capabilities of PAT21,22.

PAT involves irradiating the biological tissue with a short-pulsed laser. Optical absorbers within the tissue are excited by the laser and undergo thermoelastic expansion, which results in the generation of acoustic waves23. A sensor array surrounding the tissue is then used to detect the acoustic waves, and an image is formed from the measured sensor data. PAT image reconstruction is a well-studied inverse problem that can be solved using analytical solutions, numerical methods (e.g. time reversal), and model-based iterative methods24,25,26,27,28. In general, a high-quality image can be reconstructed if the sensor array has a sufficiently large number of sensor elements and completely encloses the tissue. However, building an imaging system with these specifications is often prohibitively expensive, and in many in vivo applications such as neuroimaging, the sensor array typically can only partially enclose the tissue29,30. These practical limitations result in sparse spatial sampling and a limited view of the photoacoustic waves emanating from the medium. Reconstructing from sub-optimally acquired data causes streaking artifacts in the reconstructed PAT image that inhibit image interpretation and quantification31.

To address these issues, iterative methods are commonly employed to remove artifacts and improve image quality. These methods use an explicit model of photoacoustic wave propagation and seek to minimize a penalty function that incorporates prior information32,33,34. However, they are computationally expensive due to the need for repeated evaluations of the forward and adjoint operators, and resulting image quality is dependent on the constraints imposed35,36.

Given the wide success of deep learning in computer vision, there is a strong interest in applying similar methods for tomographic image reconstruction problems37,38,39. Deep learning has the potential to be an effective and computationally efficient alternative to state-of-the-art iterative methods. Having such a method would enable improved image quality, real-time PAT image rendering, and more accurate image interpretation and quantification.

Among the many deep learning approaches for image reconstruction, post-processing reconstruction (Post-DL) is the most widely used and has been demonstrated for improving image reconstruction quality in CT40,41, MRI42, and PAT43,44,45,46,47,48. It was shown capable of achieving comparable or better performance than iterative methods for limited-view and sparse PAT image reconstruction45,49,50,51. In Post-DL, an initial inversion is used to reconstruct an image with artifacts from the sensor data. A convolutional neural network (CNN) is then applied as a post-processing step to remove artifacts and improve image quality. The main drawback of Post-DL is that the initial inversion does not properly address the issues of limited-view and sparse sampling, which results in an initial image with artifacts. Image features (e.g. small vessels) that are missing or obscured by artifacts are unlikely to be recovered by the CNN.

Previous works attempted to improve upon Post-DL by removing the need for an initial inversion step50,52. One approach termed direct reconstruction (Direct-DL) used a CNN to reconstruct an image directly from the sensor data52. The main challenge in using Direct-DL is the need to carefully select parameters (e.g. stride and kernel size) for each convolutional layer in order to transform the sensor data into the desired image dimensions. Changing either the dimensions of the input (e.g. using a different number of sensors) or output would require a new set of convolution parameters and the CNN architecture to be modified. Direct-DL was shown capable of reconstructing an image but underperformed compared to Post-DL. Interestingly, a hybrid approach using a combination of Post-DL and Direct-DL, where an initial inversion and the sensor data are given as inputs to the CNN, was shown to provide an improvement over using Post-DL alone53,54.

Another approach termed "model-based learning" similarly does not require an initial inversion step and achieves state-of-the-art image reconstruction quality50,55,56,57. This approach is similar to iterative reconstruction and uses an explicit model of photoacoustic wave propagation for image reconstruction. However, the prior constraints are not handcrafted and instead are learned by a CNN from training data. The improved performance does come at the cost of requiring more time to train the CNN and reconstruct an image50. Thus, the choice between model-based learning and direct learned approaches (e.g. Post-DL and Direct-DL) depends on whether the application prioritizes image reconstruction speed or quality.

In this work, we propose a novel approach termed pixel-wise deep learning (Pixel-DL) for limited-view and sparse PAT image reconstruction. Pixel-DL is a direct learned approach that employs pixel-wise interpolation to window relevant information, based on the physics of photoacoustic wave propagation, from the sensor data on a pixel-basis. The pixel-interpolated data is provided as an input to the CNN for image reconstruction. This strategy removes the need for an initial inversion and enables the CNN to utilize more information from the sensor data to reconstruct a higher quality image. The pixel-interpolated data has similar dimensions to the desired output image which simplifies CNN implementation. We compare Pixel-DL to conventional PAT image reconstruction methods (time reversal and iterative reconstruction) and direct learned approaches (Post-DL and a modified implementation of Direct-DL) with in silico experiments using several vasculature phantoms for training and testing.

Methods

The photoacoustic signal is generated by irradiating the tissue with a nanosecond laser pulse δ(t). Light absorbing molecules in the tissue undergo thermoelastic expansion and generate photoacoustic pressure waves23. Assuming negligible thermal diffusion and volume expansion during illumination, the initial photoacoustic pressure x can be defined as

$$x(r)=\Gamma(r)A(r)$$

(1)

where A(r) is the spatial absorption function and \(\Gamma(r)\) is the Grüneisen coefficient describing the conversion efficiency from heat to pressure58. The photoacoustic pressure wave p(r, t) at position r and time t can be modeled as an initial value problem for the wave equation, in which \(c_0\) is the speed of sound59.

$$({\partial}_{tt}-{c}_{0}^{2}\Delta)\,p(r,t)=0,\quad p(r,t=0)=x,\quad {\partial}_{t}p(r,t=0)=0$$

(2)

Sensors located along a measurement surface \(S_0\) measure a time-dependent signal. The linear operator \(\mathcal{M}\) acts on \(p(r,t)\) restricted to the boundary of the computational domain \(\Omega\) over a finite time T, providing a linear mapping from the initial pressure x to the measured time-dependent signal y.

$$y={\mathcal{M}}\,p{|}_{\partial\Omega\times(0,T)}=Ax$$

(3)

Time reversal is a robust reconstruction method that works well for homogeneous and heterogeneous media and for any arbitrary detection geometry27,28. A PAT image is formed by running a numerical model of the forward problem backwards in time. This involves transmitting the measured sensor data in time-reversed order into the medium. Time reversal can reconstruct a high-quality image if the acoustic properties of the medium are known a priori and if the sensor array has enough detectors and fully encloses the tissue.

In this work, iterative reconstruction is used to recover the PAT image x from the measured signal y by solving the following optimization problem using the isotropic total variation (TV) constraint

$$x=\mathop{{\rm{argmin}}}\limits_{x'}{\Vert y-Ax'\Vert}^{2}+\lambda{|x'|}_{TV}$$

where \(\lambda > 0\) is a regularization parameter32,36,60. The TV constraint is a widely employed regularization functional for reducing noise while preserving edges. Iterative reconstruction with a TV constraint works well in the case of simple numerical or experimental phantoms but often leads to sub-optimal reconstructions for images with more complex structures43.
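To make the procedure concrete, the sketch below shows one common way to solve this kind of TV-regularized problem with proximal gradient steps, using an off-the-shelf TV denoiser as the proximal operator. It is an illustrative solver under our own assumptions (the operator wrappers A and At, step size, and iteration count are hypothetical), not the implementation used in this work:

```python
import numpy as np
from skimage.restoration import denoise_tv_chambolle

def iterative_tv_reconstruction(y, A, At, lam=0.01, step=0.5, n_iters=50):
    """TV-regularized PAT reconstruction by proximal gradient descent.

    A and At are caller-supplied functions applying the acoustic forward
    operator and its adjoint (e.g. wrapping a wave simulator); y is the
    measured sensor data.
    """
    x = np.zeros_like(At(y))
    for _ in range(n_iters):
        grad = At(A(x) - y)            # gradient of the data-fidelity term
        x = x - step * grad            # gradient step
        # proximal step for the TV penalty, approximated by TV denoising
        x = denoise_tv_chambolle(x, weight=lam * step)
        x = np.clip(x, 0.0, None)      # initial pressure is non-negative
    return x
```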

In this work, three different CNN-based deep learning approaches were used for limited-view and sparse PAT image reconstruction (Fig. 1). These direct learned approaches all began by applying an initial processing step to the PAT sensor data and then recovering the final PAT image using a CNN. The primary difference among these approaches was the processing step used to initially transform the PAT sensor data. In Post-DL, the sensor data was initially reconstructed into an image containing artifacts using time reversal, and the CNN was applied as a post-processing step for artifact removal and image enhancement. In Pixel-DL, pixel-wise interpolation was applied to window relevant information in the sensor data and to map that information into the image space. In the modified Direct-DL implementation (mDirect-DL), a combination of linear interpolation and downsampling was applied so that the interpolated sensor data had the same dimensions as the final PAT image.

Figure 1

Summary of CNN-based deep learning approaches for PAT image reconstruction. The primary task is to reconstruct an essentially artifact-free PAT image from the acquired PAT sensor data. (a) PAT sensor data acquired using a sensor array with 32 sensors and semi-circle limited-view. (b) Initial image reconstruction with sparse and limited-view artifacts using time reversal for Post-DL. (c) 3D data array acquired after applying pixel-wise interpolation for Pixel-DL. (d) Sensor data interpolated to have matching dimensions as the final PAT image for mDirect-DL. (e) Desired artifact-free PAT image reconstruction from the CNN-based deep learning approaches.


After the sensor data was transformed, the final PAT image was recovered using the Fully Dense UNet (FD-UNet) CNN architecture (Fig. 2). The FD-UNet builds upon the UNet, a widely used CNN for biomedical imaging tasks, by incorporating dense connectivity into the contracting and expanding paths of the network61. This connectivity pattern enhances information flow between convolutional layers, mitigating the learning of redundant features and reducing overfitting62. The FD-UNet was demonstrated to be superior to the UNet for artifact removal and image enhancement in 2D sparse PAT47.

Figure 2

FD-UNet CNN Architecture. The FD-UNet CNN with hyperparameters of initial growth rate \(k_1=16\) and initial feature-maps learned \(f_1=128\) is used for PAT image reconstruction. Essentially the same CNN architecture was used for each deep learning approach except for minor modifications. (a) Inputs into the CNN for each deep learning approach. The Post-DL CNN implementation used residual learning, which included a skip connection between the input and the final addition operation. The initial Pixel-DL input contains N feature-maps corresponding to the number of sensors in the imaging system. (b) The FD-UNet is comprised of a contracting and an expanding path with concatenation connections. (c) The output of the CNN is the desired PAT image. In Post-DL, residual learning is used to acquire the final PAT image.


Pixel-wise interpolation uses a model of photoacoustic wave propagation to map the measured time series pressure in the sensor data to the pixel position within the image reconstruction grid that the signal likely originated from. In this work, we chose to apply pixel-wise interpolation using a linear model of photoacoustic wave propagation, since the in silico experiments were performed using a homogeneous medium (e.g. uniform density and speed of sound). The linear model assumes the acoustic waves propagate spherically and travel at a constant speed of sound. Based on these assumptions, the time-of-flight can be easily calculated for a pressure source originating at some position in the medium and traveling to a sensor located on the medium boundary.

Reconstructing an image begins by defining an image reconstruction grid that spans the region of interest in the imaging system (Fig. 3a). The goal of pixel-wise interpolation is to map the time series pressure measurements of each sensor to the defined reconstruction grid on a pixel basis, which results in a 3D data array with dimensions corresponding to the 2D image space and the sensor number (Fig. 3b,c). This is achieved by repeating the following interpolation process for each sensor in the sensor array (Fig. 3d–f). The time-of-flight for a signal originating from each pixel position and traveling to the selected sensor is calculated based on a model of photoacoustic wave propagation. In the case of a linear model, the time-of-flight is proportional to the distance between the selected pixel and the sensor (Fig. 3e). Pressure measurements in the sensor data are then interpolated onto the reconstruction grid using the calculated time-of-flight for each pixel (Fig. 3f).
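Under the linear, homogeneous-medium model described above, pixel-wise interpolation reduces to a per-sensor lookup of the time series at each pixel's time-of-flight. A minimal NumPy sketch follows; the array layout and names are our own:

```python
import numpy as np

def pixelwise_interpolation(sensor_data, sensor_positions, grid_xy, c, dt):
    """Map each sensor's time series onto the reconstruction grid.

    sensor_data:      (n_sensors, n_time) measured pressure traces
    sensor_positions: (n_sensors, 2) sensor coordinates
    grid_xy:          (H, W, 2) pixel coordinates of the reconstruction grid
    c, dt:            speed of sound and temporal sampling interval

    Returns an (H, W, n_sensors) array: one interpolated channel per
    sensor, stacked as the CNN input.
    """
    n_sensors, n_time = sensor_data.shape
    H, W, _ = grid_xy.shape
    t_axis = np.arange(n_time) * dt
    out = np.zeros((H, W, n_sensors), dtype=sensor_data.dtype)
    for s in range(n_sensors):
        # time-of-flight from every pixel to sensor s (linear model)
        tof = np.linalg.norm(grid_xy - sensor_positions[s], axis=-1) / c
        # sample sensor s's trace at each pixel's time-of-flight
        out[:, :, s] = np.interp(tof.ravel(), t_axis,
                                 sensor_data[s]).reshape(H, W)
    return out
```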

Figure 3

Pixel-Wise Interpolation Process. (a) Schematic of the PAT system for imaging the vasculature phantom. The red semi-circle represents the sensor array, and the gray grid represents the defined reconstruction grid. The first sensor (S1) is circled and used as an example for applying pixel-wise interpolation to a single sensor. (b) The PAT time series pressure sensor data measured by the sensor array. (c) Resulting pixel-interpolated data after applying pixel-wise interpolation to each sensor based on the reconstruction grid. (d) Sensor data for S1. Color represents the time at which a pressure measurement was taken and is included to highlight the use of time-of-flight to map the sensor data to the reconstruction grid. (e) Calculated time-of-flight for a signal originating at each pixel position and traveling to S1. (f) Pressure measurements are mapped from the S1 sensor data to the reconstruction grid based on the calculated time-of-flight for each pixel.


The CNNs were implemented in Python 3.6 with TensorFlow v1.7, an open source library for deep learning63. Training and evaluation of the networks were performed on a GTX 1080Ti NVIDIA GPU. The CNNs were trained using the Adam optimizer to minimize the mean squared error loss, with an initial learning rate of 1e-4 and a batch size of three images for 40 epochs. Training each CNN required approximately one hour to complete. Pairs of training data \(\{x_i, y_i\}\) were provided to the CNN during training, where \(x_i\) represents the input data (e.g. initial time reversal reconstruction, pixel-interpolated sensor data, or interpolated sensor data) and \(y_i\) represents the corresponding artifact-free ground truth image. A separate CNN was trained for each CNN-based approach, imaging system configuration, and training dataset.
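The stated training configuration translates directly into Keras-style code. The sketch below uses a trivial stand-in network, since reproducing the full FD-UNet is beyond this example; x_train and y_train are assumed to hold the network inputs and ground-truth images:

```python
import tensorflow as tf

# Stand-in for the FD-UNet of ref. 47: any Keras model mapping the network
# input (here 128 x 128 with N sensor channels) to a 128 x 128 image fits.
inputs = tf.keras.Input(shape=(128, 128, 32))
x = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
outputs = tf.keras.layers.Conv2D(1, 1, padding="same")(x)
model = tf.keras.Model(inputs, outputs)

# Training configuration as stated in the text: Adam optimizer, mean
# squared error loss, initial learning rate 1e-4, batch size 3, 40 epochs.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="mse")
model.fit(x_train, y_train, batch_size=3, epochs=40)
```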

Training data were procedurally generated using data augmentation, where new images were created based on a 340 × 340 pixel image of a synthetic vasculature phantom generated in MATLAB (Fig. 3a). First, scaling and rotation were applied to the initial phantom image with a randomly chosen scaling factor (0.5 to 2) and rotation angle (0–359 degrees). Then a 128 × 128 pixel sub-image was randomly chosen from the transformed image and translated by a random vertical and horizontal shift (0–10 pixels) via zero-padding. Outputs from multiple iterations (up to five) of the data augmentation process were summed together to create a training image. The synthetic vasculature phantom dataset was comprised of 500 training images. Testing data were generated from a 3D micro-CT mouse brain vasculature volume64 with a size of 260 × 336 × 438 pixels. The Frangi vesselness filter was applied to suppress background noise and enhance vessel-like features65. A new image was created from the filtered volume by generating a maximum-intensity projection of a randomly chosen 128 × 128 × 128 pixel sub-volume. The mouse brain vasculature dataset was comprised of 50 testing images.
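The augmentation recipe can be sketched as follows; details such as the padding mode and the exact cropping logic are our own assumptions:

```python
import numpy as np
from scipy.ndimage import rotate, zoom

def augment(phantom, rng, out_size=128, max_layers=5):
    """One training image: up to five random scale/rotate/crop/shift
    draws from the phantom, summed together (per the recipe above)."""
    img = np.zeros((out_size, out_size))
    for _ in range(rng.integers(1, max_layers + 1)):
        t = zoom(phantom, rng.uniform(0.5, 2.0))           # random scaling
        t = rotate(t, rng.uniform(0, 359), reshape=False)  # random rotation
        r0 = rng.integers(0, t.shape[0] - out_size)        # random crop
        c0 = rng.integers(0, t.shape[1] - out_size)
        sub = t[r0:r0 + out_size, c0:c0 + out_size]
        dy, dx = rng.integers(0, 11, size=2)               # 0-10 pixel shift
        sub = np.pad(sub, ((dy, 0), (dx, 0)))[:out_size, :out_size]
        img += sub
    return img

# phantom: the 340 x 340 synthetic vasculature image
training_set = [augment(phantom, np.random.default_rng(i)) for i in range(500)]
```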

The "High-Resolution Fundus Image Database" is a public database that contains 45 fundus images from human subjects who were either healthy, had glaucoma, or had diabetic retinopathy. The images have corresponding vessel segmentation maps created by a group of experts and clinicians within the field of retinal image analysis66. The 45 fundus images were split into a separate training set (N = 15) and testing set (N = 30). The training dataset was procedurally generated using data augmentation based on the images within the training set and was comprised of 500 training images. The testing dataset was comprised of the original 30 images and 20 additional images, generated using data augmentation based on images from the testing set, for a total of 50 testing images.

The "ELCAP Public Lung Image Database" is a public database that contains 50 low-dose whole-lung CT scans obtained within a single breath hold67. The whole-lung volumes were split into a training set (N = 15) and a testing set (N = 35). Vessel-like structures were segmented from the whole-lung CT volumes using the Frangi vesselness filter65. The training dataset was then generated by taking maximum intensity projection (MIP) images of randomly sampled sub-volumes from the filtered volumes in the training set. Data augmentation was also applied to the MIPs to generate a training dataset comprised of 500 training images. With the same procedure, MIPs were taken from the filtered volumes in the testing set to create a testing dataset comprised of 50 images.

In all three cases (mouse-brain vasculature, fundus image database, and ELCAP Lung database), training and testing data were completely segregated. In the latter two experiments, significant variations were present between the training and testing datasets due to patient-to-patient variability and innate differences in vascular morphology between healthy subjects and patients with varying degrees of disease.

A MATLAB toolbox, k-Wave, was used to simulate photoacoustic data acquisition using an array of acoustic sensors68. Photoacoustic simulations in the k-Wave toolbox are implemented using a pseudospectral approach69. Each training and testing image was normalized (values between 0 and 1) and treated as a photoacoustic source distribution on a computational grid of 128 × 128 pixels. The medium was assumed to be non-absorbing and homogeneous with a speed of sound of 1500 m/s and a density of 1000 kg/m3. The sensor array had 16, 32, or 64 sensor elements equally spaced on a semi-circle with a diameter of 120 pixels. The time reversal method in the k-Wave toolbox was also used for reconstructing an image from the simulated photoacoustic time series data.

Reconstructed images were compared against the ground truth using the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) as metrics for image quality. PSNR provides a global measurement of image quality, while SSIM provides a local measurement that accounts for similarities in contrast, luminance, and structure70.
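Both metrics are available in scikit-image; a minimal sketch, assuming recon and truth are reconstructed and ground-truth images normalized to [0, 1]:

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# recon, truth: 2D arrays normalized to [0, 1]
psnr = peak_signal_noise_ratio(truth, recon, data_range=1.0)
ssim = structural_similarity(truth, recon, data_range=1.0)
```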

Author S.G.'s affiliation with The MITRE Corporation is provided for identification purposes only and is not intended to convey or imply MITRE's concurrence with, or support for, the positions, opinions or viewpoints expressed by the author. Approved for Public Release; Distribution Unlimited. Case Number 18-4405. © 2019 The MITRE Corporation. All Rights Reserved.

Results

Conventional PAT image reconstruction techniques (e.g. time reversal and iterative reconstruction) and CNN-based approaches (Post-DL, Pixel-DL, and mDirect-DL) were compared across several in silico experiments in terms of reconstructed image quality and reconstruction time. CNN-based approaches were all implemented using the FD-UNet CNN architecture. Reconstructed images were compared to the ground truth image using PSNR and SSIM as quantitative metrics for image reconstruction quality.

In the first experiment, the CNNs were trained on the synthetic vasculature phantom dataset and tested on the mouse brain vasculature dataset. Although both datasets contained images of vasculature, they were non-matched, meaning the testing dataset likely contained image features (e.g. vessel connectivity patterns) not present in the training dataset. In addition to evaluating the CNNs' performance, this experiment sought to determine whether the CNNs could generalize when trained on the synthetic vasculature phantom and tested on the mouse brain dataset.

The time reversal reconstructed images had severe artifacts blurring the image and the lowest average PSNR and SSIM for all sparsity levels (Fig. 4 and Table 1). Images reconstructed with an iterative or a CNN-based method had fewer artifacts and a higher average PSNR and SSIM. Vessels obscured by artifacts in the time reversal reconstructed images were more visible in the other reconstructed images. As expected, increasing the number of sensors resulted in fewer artifacts and improved image quality for all PAT image reconstruction methods. Pixel-DL consistently had a higher average PSNR and SSIM than Post-DL for all sparsity levels, and scores similar to iterative reconstruction.

Figure 4

Limited-view and sparse PAT image reconstruction of mouse brain vasculature. PAT sensor data acquired with a semi-circle limited-view sensor array at varying sparsity levels. (a) Ground truth image used to simulate PAT sensor data. (b) PAT reconstructions with 16 sensors. Vessels are difficult to identify in time reversal reconstruction as a result of artifacts. (c) PAT reconstructions with 32 sensors. Vessels can be clearly seen in CNN-based and iterative reconstructions. (d) PAT reconstructions with 64 sensors. Larger vessels are identifiable in all reconstructed images.


Table 1 Average PSNR and SSIM for Micro-CT Mouse Brain Vasculature Testing Dataset (N = 50 testing images).


In the case of sparse sampling (especially with 16 sensors), Post-DL often introduced additional vessels that were not originally in the ground truth image (Fig. 4a,b). This was likely due to the CNN misinterpreting strong artifacts in the input image as real vessels. Pixel-DL exhibited a similar behavior but typically had fewer false additional vessels. This issue was not as prevalent in images reconstructed using the iterative method. However, images reconstructed using iterative reconstruction had an overly smoothed appearance compared to the deep learning-based reconstructed images, a pattern commonly observed when using the total variation constraint. Moreover, iterative reconstruction was far more computationally expensive than the deep learning approaches (reconstruction times are reported below).

Pixel-DL consistently outperformed time reversal in reconstructing images of the synthetic vasculature and mouse brain vasculature (Fig. 5). Interestingly, mDirect-DL only outperformed time reversal in reconstructing the synthetic vasculature images, which were used to train the CNN. The mDirect-DL reconstructed image of mouse brain vasculature resembled the ground truth image but was substantially worse than the time reversal reconstruction. This indicated that the CNN learned a mapping from the PAT sensor data to the image space but severely overfitted to the training data. During training, the CNNs for Pixel-DL and mDirect-DL both converged to a minimum mean squared error, but the Pixel-DL CNN converged to a lower error.

Figure 5

Limited-view and sparse Pixel-DL and mDirect-DL PAT image reconstructions. PAT sensor data acquired with 32 sensors and a semi-circle view. (a) CNNs were trained and tested on images of the synthetic vasculature phantom. Both CNN-based approaches successfully reconstructed the example synthetic vasculature phantom image. (b) CNNs were trained on images of the synthetic vasculature phantom but tested on mouse brain vasculature images. mDirect-DL failed to reconstruct the example mouse brain vasculature image and performed worse than time reversal.


In the second experiment, the CNNs were trained and tested on the lung vasculature and fundus vasculature datasets. This experiment represented a scenario in which the training and testing datasets were derived from segregated anatomical image data. There were natural differences between the training and testing datasets, since the original images were acquired from healthy patients and from patients with varying disease severity.

As expected, the time reversal reconstructed images of lung and fundus vasculature had the most artifacts and the lowest average PSNR and SSIM at all sparsity levels (Fig. 6 and Table 2). Images reconstructed with a CNN-based method or iterative reconstruction had fewer artifacts and a higher average PSNR and SSIM. Pixel-DL consistently outperformed Post-DL for both vasculature phantoms at all sparsity levels. Relative to iterative reconstruction, Pixel-DL performed similarly on the fundus vasculature dataset and outperformed it on the lung vasculature dataset. For images reconstructed from PAT sensor data acquired with 16 sensors, Pixel-DL reconstructions appeared sharper and were qualitatively superior to iteratively reconstructed images despite similar SSIM and PSNR values.

Figure 6

Limited-view and sparse PAT image reconstructions of fundus and lung vasculature. PAT sensor data acquired with 32 sensors and a semi-circle view. (a) CNNs were trained and tested on images of lung vasculature. (b) CNNs were trained and tested on images of fundus vasculature. Testing images were derived from a separate set of patients' lung and fundus images than the training images.


Table 2 Average PSNR and SSIM for Lung and Fundus Vasculature Testing Dataset (N = 50 testing images).


The average reconstruction time reported for each method is the time to reconstruct a single image from the PAT sensor data. Time reversal is a robust and computationally inexpensive reconstruction method (~2.57 seconds per image). Iterative reconstruction removed most artifacts and improved image quality but had a much longer average reconstruction time (~491.21 seconds per image). Pixel-DL reconstructed images with quality similar to iterative reconstruction and was faster by over a factor of 1,000 (~7.9 milliseconds per image). The average reconstruction time for Post-DL depends on the initial inversion used, since the computational cost of a forward pass through a CNN is essentially negligible. Because time reversal was used as the initial inversion, Post-DL had a longer average reconstruction time than Pixel-DL (~2.58 seconds per image).

Discussion

In this work, we propose a novel deep learning approach termed Pixel-DL for limited-view and sparse PAT image reconstruction. We performed in silico experiments using training and testing data derived from multiple vasculature phantoms to compare Pixel-DL with conventional PAT image reconstruction methods (time reversal and iterative reconstruction) and direct learned approaches (Post-DL and mDirect-DL). Results showed that Pixel-DL consistently outperformed time reversal, Post-DL, and mDirect-DL in all experiments. Pixel-DL generalized well, as evidenced by its performance, comparable to iterative reconstruction, on the mouse brain vasculature phantom despite having been trained only on images generated from a synthetic vasculature phantom with data augmentation. A more varied training dataset may further improve CNN generalization and performance. When the training and testing data were derived from segregated anatomical data, Pixel-DL performed similarly to iterative reconstruction on the fundus vasculature phantom and outperformed it on the lung vasculature phantom. The total variation constraint used for iterative reconstruction was likely suboptimal for reconstructing lung vasculature images, since the lung vessels were small and closely grouped.

The CNN architecture and hyperparameters were essentially the same for all deep learning approaches implemented. Thus, discrepancies in performance between the approaches were primarily due to their respective inputs into the CNN (Fig. 4). In Post-DL, the input was an image initially reconstructed from the sensor data using time reversal. The input and output of the CNN are then conveniently images of the same dimensions. This removed the need for the CNN to learn the physics required to map the sensor data into the image space. However, the initial inversion did not properly address the issues of limited-view and sparse sampling, which resulted in an initial image with artifacts. Moreover, the CNN no longer had access to the sensor data and could only use information contained in the image to remove artifacts. The sensor data likely contained useful information for more accurately reconstructing the PAT image, which this approach ignored.

In Pixel-DL, the initial inversion is replaced with pixel-wise interpolation, which similarly provides a mapping from the sensor data to the image space. Relevant sensor data is windowed on a per-pixel basis using a linear model of acoustic wave propagation. This gives the CNN a richer information source from which to reconstruct higher quality images. Furthermore, there is no initial inversion introducing artifacts, so the CNN does not have the additional task of learning to remove them.
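
To make this mapping concrete, below is a minimal numpy sketch of pixel-wise interpolation under the stated linear model; the function name, array shapes, and parameter values (speed of sound, sampling rate) are illustrative assumptions, not details taken from the paper.

```python
# Sketch of pixel-wise interpolation assuming spherical propagation at a
# constant speed of sound. Shapes and defaults are illustrative only.
import numpy as np

def pixel_interpolate(sensor_data, sensor_pos, grid_xy, c=1500.0, fs=40e6):
    """Window PAT sensor data on a per-pixel basis.

    sensor_data: (n_sensors, n_samples) recorded pressure traces
    sensor_pos:  (n_sensors, 2) sensor coordinates in metres
    grid_xy:     (n_pixels, 2) image-pixel coordinates in metres
    Returns (n_pixels, n_sensors): for every pixel, the sample recorded by
    each sensor at that pixel's acoustic time of flight.
    """
    n_sensors, n_samples = sensor_data.shape
    # Time of flight from each pixel to each sensor -> sample index.
    dist = np.linalg.norm(grid_xy[:, None, :] - sensor_pos[None, :, :], axis=-1)
    idx = np.clip(np.round(dist / c * fs).astype(int), 0, n_samples - 1)
    # Gather the relevant sample from each sensor trace for every pixel.
    return sensor_data[np.arange(n_sensors)[None, :], idx]
```

Reshaped to the image grid, the result forms a multi-channel (one channel per sensor) input that the CNN can then combine into the final image.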

mDirect-DL similarly did not require an initial inversion and instead used the full sensor data as the input to the CNN to reconstruct an image. The potential advantage of mDirect-DL is that the CNN had full access to the information available in the sensor data to reconstruct a high-quality image. However, reconstructing directly from the sensor data was also a more difficult task, because the CNN additionally needed to learn a mapping from the sensor data into the image space. Results showed that the CNN had difficulty learning a generalizable mapping and overfitted to the training data (Fig. 5). The FD-UNet was likely not an optimal architecture for this task, since it was designed assuming the input is an image. A neural network architecture designed for multidimensional time-series input would be better suited.

A limitation of Post-DL and Pixel-DL for sparse and limited-view PAT is that the reconstructed image can contain additional vessels that are not in the ground truth image. This can be problematic depending on the requirements of the application. Large vessels and structures are often reliably reconstructed, but some small vessels could be false additions. This limitation primarily occurred at the sparsest sampling level and could be addressed by increasing the number of sensors used for imaging. The loss function could also be modified to penalize the CNN for reconstructing false additional vessels, but this could lead the CNN to preferentially omit small vessels. Alternatively, a model-based learning approach could be used for better image quality if computational cost is not a limitation.

A key challenge in applying deep learning for in vivo PAT image reconstruction is that a large training dataset is required for the CNN to learn to remove artifacts and improve image quality. The training data can be acquired experimentally using a PAT imaging system that has a sufficient number of sensors and a full view of the imaging target. However, this process is often infeasible because it is prohibitively expensive, time-consuming, and must be repeated whenever the imaging system configuration or imaging target is changed. Alternatively, synthetic training data can be generated using numerical phantoms or images from other modalities. In combination with data augmentation techniques, this approach enables arbitrarily large synthetic training datasets to be created. However, CNN image reconstruction quality is largely dependent on the degree to which the simulations used to generate the training data match actual experimental conditions. Properly matching the simulation is a non-trivial task that requires the PAT imaging system to be well characterized and understood. Factors to consider when creating the simulations include sensor properties (e.g. aperture size, sensitivity, and directivity), sensor configuration, laser illumination, and medium heterogeneities. Generally, it is preferable to closely match the simulation to the experimental conditions, but post-processing (e.g. filtering and denoising) can also be applied to the experimental data. It is beyond the scope of this work to discuss the impact of each factor in detail, but the issue of medium heterogeneities, specifically in the speed of sound, is examined.

In this work, Pixel-DL was applied using a linear model of acoustic wave propagation that assumes the acoustic waves propagate spherically and travel at a constant speed of sound throughout the medium. Although this model was sufficient for a homogeneous medium, a different model would be needed if the medium were heterogeneous (e.g. in speed of sound and density), as in in vivo imaging. Naively reconstructing with these assumptions in heterogeneous media would produce additional artifacts that degrade image quality and potentially impact CNN performance. The severity of the artifacts would depend on the degree of mismatch between the actual heterogeneity and the assumed value. If the distribution of the heterogeneities or acoustically reflective surfaces is known, it can be accounted for in the time-of-flight calculations when applying pixel-wise interpolation. If it is not known, the CNN should instead be trained with data containing examples of heterogeneous media similar to those anticipated during image reconstruction. This would enable the CNN to learn to compensate for the artifacts that arise from applying pixel-wise interpolation with a linear model of acoustic wave propagation when the medium is not actually homogeneous.

The proposed Pixel-DL approach is a computationally efficient method for improving PAT image quality under limited-view and sparse sampling conditions. It can be readily applied to a wide variety of PAT imaging applications and configurations. Pixel-DL also enables the development of more efficient data acquisition approaches. For example, PAT imaging systems could be built with fewer sensors without sacrificing image quality, making the technology more affordable. Pixel-DL achieved similar or better performance than iterative reconstruction and was faster by over a factor of 1,000, which would allow real-time PAT image rendering and provide valuable feedback during image acquisition.

In this work we have demonstrated in silico the feasibility of Pixel-DL for PAT imaging of vasculature-like targets. The approach can also be readily applied to ultrasound imaging: image reconstruction for both PAT and ultrasound largely relies on time-of-flight calculations to determine where a signal originated, so a similar linear model of acoustic wave propagation can be used to apply Pixel-DL to ultrasound image reconstruction problems. Pixel-DL can also be adapted to other imaging modalities if a model mapping the sensor data to the image space is available.

Data availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.


Author information

Steven Guan was responsible for the data analysis, wrote the main manuscript, and prepared the figures. Amir A. Khan and Siddhartha Sikdar were advisors who reviewed and edited the manuscript, in addition to playing a critical role in defining the research plan. Parag V. Chitnis was a key contributor in planning and completing the research plan and also reviewed and made substantial edits to the manuscript.

Correspondence to Steven Guan or Parag V. Chitnis.

Ethics declarations

The authors declare no competing interests.




Applying Machine Learning to Kinematic and Eye Movement Features of a Movement Imitation Task to Predict Autism Diagnosis

Abstract

Autism is a developmental condition currently identified by experts using observation, interview, and questionnaire techniques, primarily assessing social and communication deficits. Motor function and movement imitation are also altered in autism and can be measured more objectively. In this study, motion and eye tracking data from a movement imitation task were combined with supervised machine learning methods to classify 22 autistic and 22 non-autistic adults. The focus was on reliable machine learning application: we used nested validation to develop models and further tested them on an independent data sample, and feature selection was aimed at selection stability to ensure result interpretability. Our models predicted diagnosis with 73% accuracy from kinematic features, 70% accuracy from eye movement features, and 78% accuracy from combined features. We further explored the features most important for predictions to better understand movement imitation differences in autism. Consistent with the behavioural results, the most discriminative features came from the experimental condition in which non-autistic individuals tended to successfully imitate unusual movement kinematics while autistic individuals tended to fail. The machine learning results show promise that future work could aid the diagnostic process by providing quantitative tests to supplement current qualitative ones.

Introduction

Autism is a group of complex developmental conditions characterised by deficits in social skills, verbal and non-verbal communication, and restrictive, repetitive behaviours. However, the precise expression of symptoms can vary considerably and there are no universal biomarkers. It is one of the most prevalent developmental disorders, affecting approximately 1% of the population, resulting in ~700,000 individuals living with autism in the UK1. Currently, its diagnosis relies on clinical experts using observation, interview, and questionnaire techniques, which depend on interpretative coding. The diagnostic process is complex, long and expensive, and the average waiting time between recognising initial concerns and actual clinical diagnosis is more than 3 years in the UK2. Thus, valuable time is lost, because early identification and intervention are associated with better outcomes3. Although the majority of autistic individuals receive a diagnosis in childhood, many remain undiagnosed until adulthood or are never diagnosed. The diagnostic process for the adult population is complicated, as current diagnostic instruments have only been validated for use with children4. With adults, clinicians rarely rely on standardised diagnostic methods, making diagnosis less accurate, more subjective and lengthier5. Thus, researching relevant diagnostic criteria for the adult population is critical and is listed as one of the top ten priorities by the UK's leading autism research charity, Autistica4.

In addition to social and communication deficits, current diagnostic criteria recognize repetitive behaviours and movements as core symptoms of autism6. However, a broad range of other motor functions is also implicated in autism. Even in the earliest characterization of the disorder, Kanner7 recognized unusual motor behaviours and described affected children as "clumsy". Only in the last two decades, however, have motor deficits in autism received more attention and increasingly become recognised as important symptoms. In a recent large meta-analysis of studies investigating gait and balance, arm motor function and movement planning, a large and highly significant overall effect size was found, showing gross motor impairments in autistic individuals in all examined domains8. Motor function deficits are likely to be a good autism biomarker, as they occur in the majority of autistic individuals9,10,11,12, are present from the first year of life and persist into adulthood13,14, and can be measured more directly and objectively than social or communication deficits.

The ability to imitate or copy movements performed by others is also altered in autism. Imitation is a common everyday behaviour important for learning, social interaction and language skills. Meta-analyses of imitation studies show deficits in autistic individuals15,16, with consistent findings of a reduced rate of spontaneous imitation17 and poorer imitation of non-meaningful actions or of the manner and style of actions18,19,20,21. Building on that, several studies have used tasks requiring participants to imitate hand aiming movements while the style of the movement, such as the speed or size, was manipulated22,23,24. These studies used uniform movements, allowing precise kinematic measures (e.g. velocity, amplitude) to be calculated using motion tracking and the performance of autistic and non-autistic individuals to be compared objectively. The results consistently showed that imitation precision of the style of the movement is lower in autistic than in non-autistic adults, and also that imitation is not solely an automatic behaviour driven by bottom-up processes: top-down attentional processes also play a role22,23,24,25,26,27. For example, group differences in imitation of the style of an action are removed if participants are explicitly asked to attend to the kinematics of the action24. Because reduced imitation of movement style by autistic individuals is reported consistently, it is likely to be more universally present and more specific to autism than other imitation and movement deficits. This suggests that kinematic data from such imitation tasks could offer good discriminability between autistic and non-autistic groups, as well as potential for good machine learning (ML) classification performance.

ML methods are well suited to the investigation of heterogeneous and multifaceted conditions such as autism because, in contrast to the more frequently used traditional univariate methods, they make use of complex interactions between multiple variables and classes. However, ML has only recently become more widely used in clinical fields, including autism research. Most studies that have used ML for autism prediction have used brain imaging data28, but studies from other domains also exist, including some that have used movement data.

The studies that applied ML methods to kinematic data used various tasks: tracking gameplay with sensors on a tablet screen surface29, tracking reach-and-throw-a-ball movements30,31, and tracking a simple movement imitation task32. The studies had small sample sizes of 20 to 82 (mean 40.5) and achieved high classification accuracy rates of 86.7% to 100%. However, they used result validation methods that do not necessarily sufficiently control for fitting random noise in the data33,34 and did not test the models with new unseen data. The studies also did not assess whether classification performance was statistically significantly different from random guessing.

The ideal ML model would approximate only the regularities, but not the noise, inherent in the training data and then generalise well when tested with new unseen data. However, if the model is not sufficiently validated/tested, it is unclear how much of its performance depends on noise fitting rather than on the regularities in the data. Recent ML study surveys suggest that avoiding fitting noise is particularly important when available sample sizes are small – surveyed studies with smaller sample sizes tended to report higher performance estimates28,35,36, while theoretically the opposite should be the case37,38,39. In our recent study40, in which we asked participants to perform a very simple and short pointing task, we used nested cross-validation, which fully separates training and validation data and has been shown to provide an "almost unbiased estimate [of performance]"33, even with small sample sizes36. In that study the sample size was small (N = 46) and the 71% classification accuracy was modest compared to other studies which used kinematic data to predict autism diagnosis.

In this study we investigated whether a simple imitation task could discriminate between autistic and non-autistic individuals and characterise autism-specific motor differences. 22 autistic and 22 non-autistic adults performed simple point-to-point movement sequences after observing them on a screen (Fig. 1a). A motion tracker was employed to collect kinematic data, and we also tracked eye movements while participants observed the movements to be imitated. The style of the pointing movements was manipulated so that movements were performed either in a low or a high trajectory and either slow or fast. The behavioural results from this experiment, described in detail in Gowen et al.24, showed that autistic individuals imitated the style of the movement to a lesser extent than non-autistic individuals, consistent with earlier work using this task22. Eye tracking also showed reduced visual attention to the movement when it was presented on the screen. These differences, however, diminished when participants were instructed to pay close attention to movement kinematics. Thus, in this study, we predicted that features from the experimental block in which participants were instructed to simply copy what they saw would be more discriminative between groups than features from the block in which they were asked to pay attention to the movement kinematics.

Figure 1

(a) Four pointing locations for movement sequences. For kinematic and eye movement data analysis only movements between the two locations indicated with yellow arrows were included. Visual targets are for illustrative purposes only and were not displayed during the video clips or on the table in front of participants. (b) Description of kinematic features. (c) Kinematic feature structure. (d) Eye movement behaviour feature structure. SD – standard deviation. Panel (a) was adapted from24, creative commons license CC BY 4.0.


In contrast to previous ML studies in autism that used movement measures, we also included eye movement behaviour measures, using them separately and in combination. Developing ML models with combined data from different modalities is likely to provide complementary information for predictions and improve classification performance41.

A key consideration for the ML work was the reliability of the methods. We aimed to avoid overfitting at the model development stage so that the models would reliably predict labels for unseen testing data. This is not an easy task with datasets such as ours, which have a large number of measures (features) and a small number of observations (samples). Both validation to avoid noise fitting33,35 and selection of consistent, discriminative feature sets used for predictions42,43,44 have to be considered.

To alleviate overfitting, in addition to using nested validation we also tested the models with independent data. This approach should both provide reliable performance estimates and show whether, at the model development stage, nested validation sufficiently controls noise fitting. The 44-participant sample was split into two parts: data from 30 participants (equally balanced between groups) was used for model development and the remaining data from 14 participants for independent testing of the developed models. As an additional safeguard, we also assessed whether the classification performance of our models was statistically significantly different from random guessing.

To reduce data dimensionality, we used several traditional feature elimination/selection methods. These methods are designed to retain the features most relevant for the classification task, and some also consider feature redundancy. However, the selected feature sets are not always consistent – different feature sets tend to be retained when traditional selection methods are applied to different subsets of the data42,43 – especially when the data is high dimensional and the sample size is small44. This reduces result interpretability, as consistently selected features may aid in understanding and visualisation of the problem. To overcome this issue, we designed methods aimed at selection stability, with a new "Wrapped t-test" method showing good results. This allowed a more meaningful exploration of discriminative features to shed light on autism-specific motor patterns.

In sum, we used a movement imitation task which in previous studies has consistently shown differences between autistic and non-autistic individuals. For ML classification we combined motion and eye tracking data, and the main aim was reliable ML application. Nested validation was used at the model development stage to assure good generalisability when the models were tested with hold-out data. To aid understanding of which features were important for classifying individuals as autistic or non-autistic, we used feature selection methods aimed at selection stability.

Methods

We used separate model development and independent model testing datasets. In the development set, the sample consisted of 15 autistic and 15 non-autistic adults, matched for age, gender, handedness, and IQ. A hold-out testing set, also equivalently matched, consisted of 7 autistic and 7 non-autistic participants (Table 1). The experimental protocol and stimuli are fully described in Gowen et al.24. During the experiment, all participants were asked to imitate sequences of simple hand movements. Participants first watched and then imitated a video shown on a screen while their eye and hand movements were recorded using an eye tracker and a motion tracker. Movement sequences were simple and consisted of two point-to-point movements between three out of four possible locations 15 cm apart on a horizontal straight line (Fig. 1a). All movements were performed with the dominant hand, and only data collected between the locations indicated with arrows in Fig. 1a were used. In the videos, the vertical amplitude and speed of the movements were manipulated, resulting in three conditions: direct, elevated, and direct-fast. Another manipulation was related to attention. Participants first performed a block of imitation trials with a general instruction to simply copy what they saw. Then, in a second block, they were explicitly instructed to attend closely to the kinematic characteristics of the movement (the speed and height). This resulted in six different types of trials: condition (3) × block (2). The experimental procedures involving human subjects described in this paper were carried out in accordance with the Declaration of Helsinki and approved by the University of Manchester research ethics committee, ref: 2017-2541-4204. Informed consent was obtained from all participants.

Table 1 Characteristics of training and test samples, p – two-sample t-test p-value (α level 0.05, two-tailed), f – female, m – male, lh – left-handed, rh – right-handed.


A Polhemus Fastrak motion tracker was used for kinematic data collection, with a single motion sensor attached to the distal phalange of the index finger. The movement was sampled at 120 Hz in X, Y, Z coordinates and filtered with a 120 Hz Butterworth filter, and features based on velocity, acceleration, jerk and amplitude were calculated for each pointing movement (Fig. 1b). Features were based on the mean and variability (standard deviation (SD)) of each of those measures. In total there were 120 kinematic features per block, 40 per condition (Fig. 1c).
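
As an illustration of how such features can be derived, the sketch below differentiates a position trace numerically; the placeholder trajectory, the choice of Z as the vertical axis, and the feature names are assumptions for demonstration only.

```python
# Sketch: velocity, acceleration and jerk features from a 120 Hz position trace.
import numpy as np

fs = 120.0                                                   # sampling rate in Hz
rng = np.random.default_rng(0)
pos = np.cumsum(rng.standard_normal((240, 3)), axis=0) / fs  # placeholder X, Y, Z trace

vel = np.gradient(pos, 1.0 / fs, axis=0)    # first derivative: velocity
acc = np.gradient(vel, 1.0 / fs, axis=0)    # second derivative: acceleration
jerk = np.gradient(acc, 1.0 / fs, axis=0)   # third derivative: jerk

speed = np.linalg.norm(vel, axis=1)
features = {
    "peak_speed": speed.max(),
    "mean_acceleration": np.linalg.norm(acc, axis=1).mean(),
    "mean_jerk": np.linalg.norm(jerk, axis=1).mean(),
    "vertical_amplitude": pos[:, 2].max() - pos[:, 2].min(),  # assumes Z is vertical
}
```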

An EyeLink 1000 Plus eye tracker (SR Research) was used to collect eye movement behaviour data while participants watched the videos to be imitated. Features were calculated using Data Viewer (SR Research) and MATLAB. The features were based on saccade measures and on visual attention to the finger performing the movement sequences (a description is given in the Supplementary Methods). Both means and variability measures (SDs) were calculated for each measure, resulting in 48 features per block (Fig. 1d).

For the combined data, the kinematic and eye movement behaviour features were merged into a single feature set.

For both datasets, individual trial outliers were removed at the participant level and group-level outliers were replaced with group means. Outliers were identified using the non-recursive procedure recommended by Van Selst and Jolicoeur45. In the eye movement dataset, we also removed trials in which missing data (blinks, pupil/corneal reflection loss) exceeded one third of the total trial duration. Outlier removal and trials with missing data resulted in 2.3% missing values in the kinematic dataset and 7.0% in the eye movement dataset. Missing values were replaced with group means. Features were normalised using a standard score (z-score) transformation. Normalisation parameters (means and SDs) were calculated from training data only and used to transform the hold-out data and the validation data in each cross-validation (CV) fold during model development.
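
A minimal scikit-learn sketch of this train-only normalisation, with placeholder data:

```python
# Sketch: z-score parameters are estimated on training data only and reused,
# unchanged, on held-out (or validation-fold) data.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.standard_normal((30, 120))   # placeholder: 30 participants x 120 features
X_test = rng.standard_normal((14, 120))

scaler = StandardScaler().fit(X_train)     # means and SDs from training data only
X_train_z = scaler.transform(X_train)
X_test_z = scaler.transform(X_test)        # no peeking at test-set statistics
```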

For classification, a Support Vector Machine (SVM) algorithm46 was used. It separates the classes by maximising the gap between training examples from each class; examples in the test data are assigned a label based on which side of the gap they fall on. In this study an SVM with a radial basis function (RBF) kernel was used. The regularisation parameters C and γ were optimised using a grid search, with grid parameters set to C = 2^j, where j = 1, 2, …, 7, and γ = 2^i, where i = −1, −2, …, −7, using 10-fold CV. The SVM and grid search were implemented with the Libsvm47 and Scikit-learn48 libraries.

At the model development stage, nested cross-validation (CV)49 (bounded by the dashed line in Fig. 2) was used for result validation. Nested CV, similarly to the commonly used K-fold CV50 approach, validates results iteratively in CV folds, using all of the available data for training and reusing all of it for validation. Both validation methods are thus economical and well suited when the available data is small, as is the case in this study. Nested CV, however, differs from K-fold CV in a significant respect – it avoids pooling training and validation data. When nested CV is performed, a portion of the data is split off at the beginning of each CV fold for validation, and a model is then developed on the reduced training set, including data normalisation, feature selection and parameter tuning. This is repeated iteratively, splitting off a different portion of the data for validation and each time developing a new model from scratch, until all of the data has been used. Under nested CV the validation data is thus separate from model development, and in that respect the approach is similar to train/test split testing. Varma and Simon33 have demonstrated that nested CV produces almost unbiased performance estimates, while the K-fold CV approach, which pools training and test data, can produce significantly over-optimistic results. In this study 10-fold nested CV was used, and the performance of the model was calculated as the mean performance over ten CV folds. Developed models were tested with independent data (yellow in Fig. 2) by testing each of the 10 developed models separately and averaging the results. The process of model development and testing with an independent sample was repeated 50 times to obtain performance distributions, represented by confidence intervals in the graphs; both model development and testing classification results are reported in the Results section.
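
The sketch below illustrates the nested pattern in scikit-learn, using the hyperparameter grid quoted above; it is a simplified stand-in for the authors' pipeline (whose folds also repeat normalisation and feature selection), with placeholder data.

```python
# Sketch: nested CV as a grid search wrapped inside an outer validation loop.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10))
y = np.array([0, 1] * 15)

param_grid = {"C": [2.0 ** j for j in range(1, 8)],       # C = 2^1 ... 2^7
              "gamma": [2.0 ** -i for i in range(1, 8)]}  # gamma = 2^-1 ... 2^-7
inner = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)  # inner tuning loop
scores = cross_val_score(inner, X, y, cv=10)                # outer validation loop
print(f"nested-CV accuracy: {scores.mean():.2f}")
```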

Figure 2

Nested validation for model development and additional testing with independent data. Independent testing (yellow) was applied to each developed model and results averaged. Devel. – development, ACC – overall model accuracy, ACCi. – accuracy in a single CV fold, n – sub-sample size, k – number of CV folds.


In this study, we used several traditional feature selection methods and developed our own methods aimed at feature selection stability. Datasets in which the number of features exceeds the number of observations are problematic for pattern recognition34,51,52. However, such datasets are common in neuroimaging, gene expression, behaviour tracking and many other technology-based research areas. In this study both the kinematic and eye movement datasets have more features than samples. Commonly, to overcome this issue, a portion of the features is eliminated, and in recent years various techniques have been developed to accomplish this. Feature elimination/selection achieves several objectives: reliable classifier performance, elimination of irrelevant and redundant features, and selection of stable feature sets.

Using too many features can increase classification error, and this effect is exacerbated when the sample size is small. Hua et al.53 performed simulations to find an optimal number of features as a function of sample size for different classifiers and found that, with training sample sizes comparable to the ones used in this study, the optimal number of features is approximately 10. Accordingly, with each feature selection method we reduced the feature number to 10.

Feature selection/elimination addresses not only classification robustness but also allows elimination of features that are redundant or irrelevant for a classification task. The feature elimination methods used in this study can be subdivided into filter and wrapper methods. Filter methods simply apply a mathematical rule addressing feature relevance, redundancy, or both to rank features, and highly ranked features are selected for predictions. In wrapper methods, the performance of a classifier is the criterion for feature selection/elimination. This way, feature selection is wrapped around a classification model and finds a feature subset giving the highest classification performance.

In this study, we used several traditional filter feature selection methods. These include Student's t-test, which considers only feature relevance by simply ranking features based on how different the feature means are between the two classes, as well as two methods which, in addition to feature relevance, also consider feature redundancy. ReliefF weights features by considering their interactions54: it uses the K-nearest-neighbour method to up-weight features which best discriminate an instance from its neighbours of the other class. mRMR (minimum redundancy maximum relevance) selects features which discriminate the categories well but are dissimilar to each other55; both the minimum redundancy and maximum relevance criteria are based on mutual information. We also used a wrapper method, SVM-RFE56, which over a number of iterations eliminates a set number of features deemed least important for separating the classes by an SVM algorithm.
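
As a minimal illustration, t-test ranking can be written in a few lines; this sketch (function name included) is ours, not the study's code.

```python
# Sketch: rank features by absolute two-sample t statistic, keep the top k.
import numpy as np
from scipy.stats import ttest_ind

def ttest_select(X, y, k=10):
    t, _ = ttest_ind(X[y == 0], X[y == 1], axis=0)
    return np.argsort(-np.abs(t))[:k]   # indices of the k top-ranked features
```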

The selected most relevant features may aid in understanding and visualisation of a particular problem and may be useful for biomarker discovery. However, an issue of feature selection stability exists. Frequently, when different feature selection methods or different subsets of the data in a training sample (e.g. in different CV folds) are used, the selected features do not match, although classification performance may be comparable42,43. A major contributor to feature selection instability is small-sample/high-dimensional data44. To measure feature selection stability we used Kuncheva's index (KI)57. It quantifies the similarity of feature sets as follows:

$$KI=\frac{2}{m(m-1)}\sum_{i=1}^{m-1}\sum_{j=i+1}^{m}\frac{|S_{i}\cap S_{j}|\,l-k^{2}}{k(l-k)},$$

(1)

where m is the number of feature subsets used for the similarity calculation, l is the number of features in the full dataset, and k is the number of features in each subset (all subsets must be of equal cardinality). KI takes values between −1 and 1, with −1 meaning no overlap between feature subsets and 1 meaning that all feature subsets are identical.
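
A direct Python transcription of Eq. (1), assuming each feature subset is represented as a Python set of feature indices:

```python
# Sketch: Kuncheva's index over m feature subsets of equal cardinality k,
# drawn from a full set of l features. Averaging over all pairs is
# equivalent to the 2/(m(m-1)) normalisation in Eq. (1).
from itertools import combinations

def kuncheva_index(subsets, l):
    k = len(subsets[0])
    pairs = list(combinations(subsets, 2))
    return sum((len(a & b) * l - k ** 2) / (k * (l - k)) for a, b in pairs) / len(pairs)

print(kuncheva_index([{0, 1, 2}, {0, 1, 3}, {0, 1, 2}], l=120))
```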

Traditional feature selection methods consider feature relevance and/or redundancy, but selection stability is rarely explicitly considered. Therefore, below we present three approaches which aim to increase feature selection stability.

One way to improve the generalization of classifier predictions is to aggregate the predictions of multiple classifiers58. We applied a similar approach to feature selection by combining different feature selectors: the SVM-RFE, t-test, mRMR and ReliefF rankings were combined by simple voting and the 10 highest ranked features were selected (Fig. 3a).

Figure 3

Overview of our three feature selection methods. (a) Ensemble – rankings of four feature selectors were combined and the 10 highest ranked features were selected. (b) t-test with bagging – t-test feature selection was performed 100 times on random subsamples of size n/2 and the 10 features most frequently selected across the subsamples were retained. (c) Wrapped t-test – 10 features were selected using t-test ranking and their classification performance was assessed with an SVM classifier. If the accuracy was >50%, those features were made more likely to be selected in future iterations by adjusting their t-statistics up (or adjusted down if accuracy was <50%, making them less likely to be selected). Adjustments were accumulated until Kuncheva's index (KI) was equal to 1 in 100 consecutive iterations. Blank box – initial feature set; blue/white box – ranked features; red/white box – combined ranks; red box – final feature set.


Meinshausen and Bühlmann59 proposed "stability selection" as a technique to improve feature selection stability. The general idea of the technique is that, instead of applying a selection algorithm to the whole dataset to select a feature set, feature selection is performed multiple times on random subsamples of the data. In this study, we combined this approach with t-test ranking. Features were ranked 100 times on random subsamples of the data of size n/2, and the final feature set comprised the 10 features most frequently selected across all of the subsamples (Fig. 3b).
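
A sketch of this bagged t-test selection follows; the stratified subsampling step (so both classes are represented in every bag) is an implementation detail assumed here, not stated in the text.

```python
# Sketch: t-test feature selection with bagging (Fig. 3b).
import numpy as np
from scipy.stats import ttest_ind

def bagged_ttest_select(X, y, k=10, n_bags=100, seed=0):
    rng = np.random.default_rng(seed)
    votes = np.zeros(X.shape[1])
    c0, c1 = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    for _ in range(n_bags):
        # Stratified half-size subsample of the data.
        idx = np.concatenate([rng.choice(c0, len(c0) // 2, replace=False),
                              rng.choice(c1, len(c1) // 2, replace=False)])
        t, _ = ttest_ind(X[idx][y[idx] == 0], X[idx][y[idx] == 1], axis=0)
        votes[np.argsort(-np.abs(t))[:k]] += 1   # vote for this bag's top k
    return np.argsort(-votes)[:k]                # most frequently selected features
```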

In this work, we introduce a new method which is also centred on feature selection stability and combines aspects of both wrapper and filter approaches. Instead of removing or adding features based on classification performance, as in traditional wrapper methods, we adjusted the ranking statistics of a filter selector by a small magnitude over multiple iterations until the ranking algorithm consistently selected identical feature sets (Fig. 3c). We used the absolute value of Student's two-sample t-test statistic for ranking. In the first iteration, we selected the 10 features with the highest t-statistics; in subsequent iterations, we adjusted the t-statistics of those features based on classification performance in the outer 10-fold nested CV loop. t-statistics were adjusted up or down if classification accuracy was above or below 50% (the random guessing level for a balanced two-class dataset). Adjustments from all iterations were summed until the ranking selected identical feature sets in 100 consecutive iterations. An adjustment magnitude of 0.0001 was chosen because it allowed the algorithm to converge in a manageable number of iterations (<100,000). Although this algorithm is computationally demanding, in the end it provides a single consistent feature set. This is advantageous compared to other methods, as a single consistent set allows a clear interpretation of which measures were important for separating the classes.
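
The sketch below captures the core loop of the Wrapped t-test; cv_accuracy is a hypothetical helper returning nested-CV accuracy for a candidate feature set, and the KI-based stopping rule is simplified here to a fixed iteration budget.

```python
# Sketch of the Wrapped t-test (Fig. 3c): nudge the ranking statistics of
# selected features up or down according to classification performance.
import numpy as np
from scipy.stats import ttest_ind

def wrapped_ttest_select(X, y, cv_accuracy, k=10, step=1e-4, n_iter=1000):
    t, _ = ttest_ind(X[y == 0], X[y == 1], axis=0)
    score = np.abs(t)                              # initial ranking statistic
    adjust = np.zeros_like(score)
    for _ in range(n_iter):                        # paper: run until KI == 1 for
        feats = np.argsort(-(score + adjust))[:k]  # 100 consecutive iterations
        acc = cv_accuracy(X, y, feats)             # hypothetical helper
        # Above-chance accuracy (> 0.5) pushes these features up the ranking;
        # below-chance accuracy pushes them down.
        adjust[feats] += step * np.sign(acc - 0.5)
    return np.argsort(-(score + adjust))[:k]
```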

Statistical significance of the results was assessed with permutation testing. The labels of the data samples were randomly permuted 100 times and the empirical p-value calculated as in Ojala and Garriga60. A significance level of 0.05 was used.
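
A sketch of this permutation test, where fit_and_score is a hypothetical function returning model accuracy for a given labelling:

```python
# Sketch: empirical p-value following Ojala and Garriga's formulation.
import numpy as np

def permutation_p(X, y, fit_and_score, n_perm=100, seed=0):
    rng = np.random.default_rng(seed)
    true_score = fit_and_score(X, y)
    perm_scores = [fit_and_score(X, rng.permutation(y)) for _ in range(n_perm)]
    # Fraction of permutations scoring at least as well as the true labelling.
    return (1 + sum(s >= true_score for s in perm_scores)) / (n_perm + 1)
```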

Results

Classification performance of all feature selection algorithms, followed by the SVM-RBF classifier, was higher in the general instruction block than in the attention instruction block. The difference was moderate for the kinematic data (Fig. 4a and Table 2) and more marked for the eye movement behaviour (Fig. 4b and Table 3) and combined data (Fig. 4c and Table 4). Based on these results, further analyses used only data from the general instruction block.

Figure 4

A box and whisker plot showing accuracy distributions of four algorithms with different feature selection using (a) kinematic, (b) eye movement behaviour, and (c) combined data. Green – general instruction block, yellow – attention instruction block.


Table 2 Kinematic data classification results. Acc. – accuracy, n.s. – not significant.


Table 3 Eye movement behaviour data classification results. Acc. – accuracy, n.s. – not significant.


Table 4 Combined data classification results. Acc. – accuracy, n.s. – not significant.


Overall, classification accuracies at the model development stage (Development columns) were comparable to accuracies when those models were used to predict labels in the independent test dataset (Testing columns), Tables 2, 3, and 4. Nested cross-validation was sufficient to control overfitting and produced results which generalised well to the independent test sample.

The algorithm using the simplest t-test feature selection outperformed all other algorithms in terms of classification accuracy and feature selection stability (KI), and this was the case for all data types (Fig. 4). In our previous study, where we used kinematic features from a simple movement task40, we obtained similar results – t-test feature selection outperformed the other algorithms. Additionally, in other studies with high-dimensional/small-sample datasets, the t-test consistently outperformed other algorithms in terms of selection stability and classification accuracy61,62. Taking this into consideration, in the next section we report the results of two variations of t-test feature selection – t-test with bagging and the t-test combined with a wrapper selection approach – as well as an ensemble of feature selectors which includes the t-test.

The idea behind an ensemble method is that a combination of the outputs produced by multiple algorithms is potentially better than the output of a single algorithm. Ensembles have been shown to produce less variable and more robust results, especially with high-dimensional/small-sample data63. With our datasets, the ensemble method outperformed SVM-RFE, ReliefF, and mRMR in both classification accuracy and feature selection stability. The t-test alone, however, produced very similar results to the ensemble (Figs. 4 and 5).
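
As an illustration of the idea, a minimal rank-aggregation ensemble might combine the rankings produced by several selectors by mean rank; the exact aggregation rule used in the study is not specified here, so this is an assumption:

```python
import numpy as np

def ensemble_select(rankings, k=10):
    """rankings: list of arrays of feature indices, best feature first."""
    n = len(rankings[0])
    rank_sum = np.zeros(n)
    for r in rankings:
        rank_sum[np.asarray(r)] += np.arange(n)      # add each feature's rank
    return np.argsort(rank_sum / len(rankings))[:k]  # lowest mean rank wins
```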

Figure 5

A box and whisker plot showing accuracy distributions of three algorithms with different feature selection using kinematic, eye movement behaviour, and combined data. Dashed lines show classification performance using t-test feature selection alone.

The t-test consistently produced the most stable feature sets and nearly always outperformed the other algorithms, so we further explored two variations of the t-test algorithm with the aim of improving feature selection stability. First, we used the t-test with bagging. Instead of applying the t-test algorithm to the whole dataset, feature selection was performed 100 times on random subsamples of size n/2. The t-test with bagging, however, improved neither the stability nor the performance of the t-test alone (Figs. 4 and 5).
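
A minimal sketch of that bagging scheme, aggregating |t|-based ranks over 100 half-size subsamples (aggregation by rank sum is an assumption):

```python
import numpy as np
from scipy import stats

def bagged_t_test(X, y, k=10, n_bags=100, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    rank_sum = np.zeros(p)
    for _ in range(n_bags):
        idx = rng.choice(n, size=n // 2, replace=False)  # subsample of size n/2
        Xs, ys = X[idx], y[idx]
        t, _ = stats.ttest_ind(Xs[ys == 0], Xs[ys == 1])
        rank_sum[np.argsort(-np.abs(t))] += np.arange(p)
    return np.argsort(rank_sum)[:k]   # features with the best average rank
```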

Finally, we used our new "Wrapped t-test" method, which combines features of both filter and wrapper methods. Figure 6 shows that cumulative adjustments of the t-statistics progressively led to more stable feature selection, demonstrated by an increasing KI, and also to progressively better classification accuracy at the model development stage. Importantly, classification accuracy also increased on the independent dataset for the kinematic data (Fig. 6a) and combined data (Fig. 6c); there was no such effect with the eye data (Fig. 6b). With kinematic data, classification accuracy was 73% (sensitivity 88%, specificity 59%, p < 0.01); with eye data, 70% (sensitivity 43%, specificity 97%, p = 0.02); with combined data, 78% (sensitivity 57%, specificity 99%, p < 0.01). This approach produced the best classification accuracy on the kinematic and combined datasets (Table 5) but showed no improvement on the eye dataset. This was likely because the eye features were more inter-correlated (\(\bar{r}=0.48\)) than the kinematic (\(\bar{r}=0.20\)) or combined (\(\bar{r}=0.20\)) features, and the number of features was considerably lower.

Figure 6

Illustration of Wrapped t-test algorithm performance over successive iterations with (a) kinematic, (b) eye movement behaviour, and (c) combined datasets. In each model development iteration, feature selection was performed 10 times, once in each nested CV fold. Iterations were performed until KI was equal to 1 for 100 consecutive iterations. Thick dash-dot lines show a fitted 5th-order polynomial trend.

Table 5 Classification results with ensemble, t-test with bagging, and Wrapped t-test feature selection.


Here we interpret why the selected kinematic features were salient for classification; for a similar interpretation of the discriminative features selected in the eye movement and combined datasets, see the Supplementary Note. Wrapped t-test feature selection was stable: we repeated feature selection on randomly sub-sampled data ten times and consistent feature sets were selected each time. Among the selected features in the kinematic dataset, 7 out of 10 were from the elevated condition (Fig. 7c). This corresponds with the behavioural results of previous studies, which used imitation tasks and showed reduced vertical amplitude modulation in individuals with autism22,23,24. Our data also show a significant difference in movement vertical amplitude between autistic and non-autistic individuals when they were asked to imitate elevated trials, t(42) = 3.0, p = 0.004, d = 0.93 (Fig. 7a). There were also differences in the acceleration profile between autistic and non-autistic individuals in elevated trials (Fig. 7b). Autistic individuals reached peak acceleration earlier in the movement, t(42) = 2.5, p = 0.017, d = 0.75, and peak deceleration later in the movement, t(29.4) = 3.0, p = 0.006, d = 0.90. This corresponds with the selected discriminative feature set, as six out of ten features were acceleration/deceleration measures in the elevated condition. Overall, both the discriminative features and the statistical differences suggest that non-autistic individuals reduced acceleration and deceleration and increased vertical amplitude in order to copy the unusual elevated movement kinematics, while autistic individuals tended to retain their usual style of movement.

Figure 7

(a) Movement vertical amplitude and (b) acceleration averaged for autistic and non-autistic participants in the elevated experimental condition, general instruction block. Shaded areas show the difference between groups. (c) Features selected with the Wrapped t-test selection method. The mean difference column shows whether the mean for a particular feature was greater for the autistic (A) or non-autistic (N) class and gives the p-value of a two-sample t-test.

Discussion

Differences between autistic and non-autistic individuals have been shown across a broad range of movement and movement imitation tasks8,15,16. Among the more consistent findings is reduced imitation of movement style/manner by autistic individuals18,19,20,22,23. In this study, we used an imitation task in which movement style was manipulated and explored whether kinematic and eye movement behaviour measures could predict autism diagnosis. The developed models achieved a classification accuracy of 73% with kinematic data and 70% with eye movement data. We also combined data from the kinematic and eye movement behaviour modalities; this provided complementary information for prediction, and models on the combined data gave the highest classification performance of 78%.

To date, only a handful of ML studies have used kinematic data for autism prediction. All had small sample sizes yet achieved high classification accuracy. Models developed with small-sample/high-dimensional data are prone to fitting noise rather than an underlying pattern separating the classes51,64, whereas completely separating training and validation data (treating validation data as unseen) is sufficient to control overfitting and produce reliable performance estimates33,36. To the best of our knowledge, those studies did not avoid pooling training and validation data while developing their ML models and did not take other steps to control for the fitting of random noise in the data. Classification results were not tested with independent data, and the researchers did not assess the statistical significance of their results.

Our focus for the ML work was the reliability of the methods used. We aimed to avoid fitting noise in the data during the model development stage, to help ensure that the models would reliably predict labels for independent/unseen data during the testing stage. This was not an easy task because of the characteristics of our datasets, which had a small number of samples and a high number of features. Such datasets are problematic for pattern recognition34,35,51,52, and both result validation and consistent feature selection required careful consideration.

For result validation at the model development stage, we used nested CV because it has been shown to produce reliable results33; moreover, we have shown that nested CV produces reliable results regardless of sample size36. In addition, we further verified that our models were not overfitted by testing them on an independent dataset. Classification performance at the model development stage was comparable to independent validation performance, showing that nested validation was sufficient to control overfitting. This was the case even though the training and independent testing datasets differed significantly in gender composition: females made up 20% of the model development dataset and 57% of the independent testing dataset (Table 1). There are more males than females with an autism diagnosis, and recent studies suggest that there may be phenotypic gender differences65,66.
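
A minimal sketch of such a nested CV scheme with an SVM-RBF classifier (scikit-learn names; the hyperparameter grid shown is an assumption, not the one used in the study):

```python
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

def nested_cv_accuracy(X, y):
    inner = KFold(n_splits=10, shuffle=True, random_state=0)
    outer = KFold(n_splits=10, shuffle=True, random_state=1)
    grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.001]}
    model = GridSearchCV(SVC(kernel="rbf"), grid, cv=inner)
    # hyperparameters are re-tuned within each outer training fold,
    # so the outer-fold estimates never see the tuning data
    return cross_val_score(model, X, y, cv=outer).mean()
```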

Reliable feature selection was also a difficult issue to overcome. Traditional filter and wrapper feature selection methods produced only modest classification results; however, ranking feature sets with the t-test consistently outperformed other methods in terms of feature selection stability and classification performance. Therefore, we developed two feature selection variations based on the t-test, with the main goal of improving feature selection stability. We used the t-test with bagging, randomly subsampling the data and aggregating feature ranks from multiple iterations. This method, however, did not improve on using t-test feature selection alone. The new "Wrapped t-test" algorithm combined aspects of the filter and wrapper approaches: we adjusted the t-statistics used for feature ranking by a small magnitude over multiple iterations based on classifier performance, and ran the algorithm until the t-test consistently ranked identical feature sets. As feature selection stability increased, the algorithm fitted the training data increasingly well; importantly, it also produced better performance on independent data. In addition to good classification performance, this method provided a stable final set of 10 features, which helped to illustrate movement imitation differences in autism.

In the kinematic feature set selected using the Wrapped t-test, seven out of ten features were from trials where movements were performed at an elevated amplitude, and the same condition showed significant differences between autistic and non-autistic groups using parametric statistical tests. After watching videos of elevated movements to imitate, autistic individuals performed movements with a lower vertical amplitude than non-autistic individuals. As a consequence, autistic individuals also reached peak acceleration earlier in the movement and peak deceleration later in the movement. These results suggest that autistic individuals tended to retain their usual style of movement when the movement to imitate had unusual kinematics. This is consistent with a number of studies which have shown that while autistic individuals are able to imitate the goals of an action, they are less proficient at imitating its style or kinematics18,19,20,21,22,24,67.

In a previous study32, which applied ML methods to data from a similar imitation experiment with different participants, the researchers selected exclusively variability measures as the most discriminative kinematic features, although feature selection was not fully algorithmic (several feature selection algorithms were combined with selection decisions made by the researchers). This was not the case in our study, where both means and SDs were selected (Fig. 7c). However, across the full feature sets we found that autistic individuals tended both to perform movements with greater variability and to pay visual attention to the observed movement more variably. In the kinematic dataset, autistic individuals showed higher variability than non-autistic individuals in 73% of 120 variability features (9% statistically significant at the 0.05 α level, two-tailed), and in the eye movement behaviour dataset in 90% of 48 variability features (42% statistically significant at the 0.05 α level, two-tailed). Increased variability is a common finding in autism, reported for reaching movements68, hand aiming movements69, sustained force70, precision grip71, walking72 and saccadic eye movements73. Increased variability suggests differences in sensorimotor control and is especially apparent with challenging tasks and those requiring precision70.

Conclusion

In this study, we used a movement imitation task which, based on previous evidence, promised good discriminability between autistic and non-autistic groups. ML classified autistic and non-autistic individuals with 73% accuracy using kinematic measures and with 70% accuracy using eye movement behaviour measures. Moreover, combining measures from both modalities provided complementary information for prediction and gave a classification accuracy of 78%. We overcame the overfitting and feature selection stability issues by using nested validation and feature selection aimed at selection stability, and we show that even small-sample studies can achieve statistically significant predictions which generalise to unseen data. The results show promise that future work could aid the diagnostic process by reliably applying ML methods and by combining features from several modalities.

Data availability

The datasets generated and analysed during the current study are available at The University of Manchester repository: https://doi.org/10.17632/fnt6jtc5np.4.

References

1. Brugha, T. S. et al. Epidemiology of autism spectrum disorders in adults in the community in England. Arch. Gen. Psychiatry 68, 459 (2011).

2. Crane, L., Chester, J. W., Goddard, L., Henry, L. A. & Hill, E. Experiences of autism diagnosis: A survey of over 1000 parents in the United Kingdom. Autism 20, 153–162 (2016).

3. Bradshaw, J., Steiner, A. M., Gengoux, G. & Koegel, L. K. Feasibility and effectiveness of very early intervention for infants at-risk for autism spectrum disorder: A systematic review. J. Autism Dev. Disord. 45, 778–794 (2015).

4. Cusack, J. & Sterry, R. Your questions: shaping future autism research (Autistica, London, 2016).

5. Rutherford, M. et al. A national study to investigate the clinical use of standardised instruments in autism spectrum disorder assessment of children and adults in Scotland. Res. Autism Spectr. Disord. 29, 93–100 (2016).

6. American Psychiatric Association. Diagnostic and statistical manual of mental disorders: DSM-5 (American Psychiatric Pub, Washington, DC, 2013).

7. Kanner, L. et al. Autistic disturbances of affective contact. Nerv. Child 2, 217–250 (1943).

8. Fournier, K. A., Hass, C. J., Naik, S. K., Lodha, N. & Cauraugh, J. H. Motor coordination in autism spectrum disorders: A synthesis and meta-analysis. J. Autism Dev. Disord. 40, 1227–1240 (2010).

9. Green, D. et al. The severity and nature of motor impairment in Asperger's syndrome: a comparison with specific developmental disorder of motor function. J. Child Psychol. Psychiatry 43, 655–668 (2002).

10. Green, D. et al. Impairment in movement skills of children with autistic spectrum disorders. Dev. Med. Child Neurol. 51, 311–316 (2009).

11. Hilton, C. et al. Relationship between motor skill impairment and severity in children with Asperger syndrome. Res. Autism Spectr. Disord. 1, 339–349 (2007).

12. Miyahara, M. et al. Brief report: motor incoordination in children with Asperger syndrome and learning disabilities. J. Autism Dev. Disord. 27, 595–603 (1997).

13. Abu-Dahab, S. M. N., Skidmore, E. R., Holm, M. B., Rogers, J. C. & Minshew, N. J. Motor and tactile-perceptual skill differences between individuals with high-functioning autism and typically developing individuals ages 5–21. J. Autism Dev. Disord. 43, 2241–2248 (2013).

14. Biscaldi, M. et al. Deficits in motor abilities and developmental fractionation of imitation performance in high-functioning autism spectrum disorders. Eur. Child Adolesc. Psychiatry 23, 599–610 (2014).

15. Edwards, L. A. A meta-analysis of imitation abilities in individuals with autism spectrum disorders. Autism Res. 7, 363–380 (2014).

16. Williams, J. H. G., Whiten, A. & Singh, T. A systematic review of action imitation in autistic spectrum disorder. J. Autism Dev. Disord. 34, 285–299 (2004).

17. Ingersoll, B. The effect of context on imitation skills in children with autism. Res. Autism Spectr. Disord. 2, 332–340 (2008).

18. Rogers, S. J., Bennetto, L., McEvoy, R. & Pennington, B. F. Imitation and pantomime in high-functioning adolescents with autism spectrum disorders. Child Dev. 67, 2060 (1996).

19. Vanvuchelen, M., Roeyers, H. & De Weerdt, W. Nature of motor imitation problems in school-aged males with autism: how congruent are the error types? Dev. Med. Child Neurol. 49, 6–12 (2007).

20. Vivanti, G., Nadig, A., Ozonoff, S. & Rogers, S. J. What do children with autism attend to during imitation tasks? J. Exp. Child Psychol. 101, 186–205 (2008).

21. Vivanti, G., Trembath, D. & Dissanayake, C. Mechanisms of imitation impairment in autism spectrum disorder. J. Abnorm. Child Psychol. 42, 1395–1405 (2014).

22. Wild, K. S., Poliakoff, E., Jerrison, A. & Gowen, E. Goal-directed and goal-less imitation in autism spectrum disorder. J. Autism Dev. Disord. 42, 1739–1749 (2012).

23. Forbes, P. A. G., Pan, X. & Hamilton, A. F. Reduced mimicry to virtual reality avatars in autism spectrum disorder. J. Autism Dev. Disord. 46, 3788–3797 (2016).

24. Gowen, E., Vabalas, A., Casson, A. J. & Poliakoff, E. Instructions to attend to an observed action increases imitation in autistic adults. Autism, https://doi.org/10.1177/1362361319882810 (2019).

25. Hayes, S. J., Dutoy, C. A., Elliott, D., Gowen, E. & Bennett, S. J. Atypical biological motion kinematics are represented by complementary lower-level and top-down processes during imitation learning. Acta Psychol. 163, 10–16 (2016).

26. Hayes, S. J., Roberts, J. W., Elliott, D. & Bennett, S. J. Top-down attentional processes modulate the coding of atypical biological motion kinematics in the absence of motor signals. J. Exp. Psychol. Hum. Percept. Perform. 40, 1641 (2014).

27. Bek, J., Poliakoff, E., Marshall, H., Trueman, S. & Gowen, E. Enhancing voluntary imitation through attention and motor imagery. Exp. Brain Res. 234, 1819–1828 (2016).

28. Arbabshirani, M. R., Plis, S., Sui, J. & Calhoun, V. D. Single subject prediction of brain disorders in neuroimaging: Promises and pitfalls. NeuroImage 145, 137–165 (2016).

29. Anzulewicz, A., Sobota, K. & Delafield-Butt, J. T. Toward the Autism Motor Signature: Gesture patterns during smart tablet gameplay identify children with autism. Sci. Rep. 6, 1–13 (2016).

30. Crippa, A. et al. Use of machine learning to identify children with autism and their motor abnormalities. J. Autism Dev. Disord. 45, 2146–2156 (2015).

31. Perego, P., Forti, S., Crippa, A., Valli, A. & Reni, G. Reach and throw movement analysis with support vector machines in early diagnosis of autism. In 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2555–2558 (IEEE, 2009).

32. Li, B., Sharma, A., Meng, J., Purushwalkam, S. & Gowen, E. Applying machine learning to identify autistic adults using imitation: An exploratory study. PLoS One 12, e0182652 (2017).

33. Varma, S. & Simon, R. Bias in error estimation when using cross-validation for model selection. BMC Bioinforma. 7, 1–8 (2006).

34. Combrisson, E. & Jerbi, K. Exceeding chance level by chance: The caveat of theoretical chance levels in brain signal classification and statistical assessment of decoding accuracy. J. Neurosci. Methods 250, 126–136 (2015).

35. Varoquaux, G. Cross-validation failure: Small sample sizes lead to large error bars. NeuroImage 180, 68–77 (2018).

36. Vabalas, A., Gowen, E., Poliakoff, E. & Casson, A. J. Machine learning algorithm validation with a limited sample size. PLoS One 14, 1–20, https://doi.org/10.1371/journal.pone.0224365 (2019).

37. Figueroa, R. L., Zeng-Treitler, Q., Kandula, S. & Ngo, L. H. Predicting sample size required for classification performance. BMC Med. Inform. Decis. Mak. 12 (2012).

38. Mukherjee, S. et al. Estimating dataset size requirements for classifying DNA microarray data. J. Comput. Biol. 10, 119–142 (2003).

39. Beleites, C., Neugebauer, U., Bocklitz, T., Krafft, C. & Popp, J. Sample size planning for classification models. Anal. Chim. Acta 760, 25–33 (2013).

40. Vabalas, A., Gowen, E., Poliakoff, E. & Casson, A. J. Kinematic features of a simple and short movement task to predict autism diagnosis. In 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 1421–1424 (IEEE, 2019).

41. Zhang, D. et al. Multimodal classification of Alzheimer's disease and mild cognitive impairment. NeuroImage 55, 856–867 (2011).

42. Bolón-Canedo, V., Sánchez-Maroño, N. & Alonso-Betanzos, A. A review of feature selection methods on synthetic data. Knowl. Inf. Syst. 34, 483–519 (2013).

43. Kalousis, A., Prados, J. & Hilario, M. Stability of feature selection algorithms: A study on high-dimensional spaces. Knowl. Inf. Syst. 12, 95–116 (2007).

44. Loscalzo, S., Yu, L. & Ding, C. Consensus group stable feature selection. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 567–576 (ACM, 2009).

45. Van Selst, M. & Jolicoeur, P. A solution to the effect of sample size on outlier elimination. Q. J. Exp. Psychol. Sect. A 47, 631–650 (1994).

46. Boser, B. E., Guyon, I. M. & Vapnik, V. N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 144–152 (1992).

47. Chang, C.-C. & Lin, C.-J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 1–39 (2013).

48. Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

49. Stone, M. Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. 36, 111–147 (1974).

50. James, G., Witten, D., Hastie, T. & Tibshirani, R. An Introduction to Statistical Learning: with Applications in R (Springer, New York, 2013).

51. Raudys, S. & Jain, A. Small sample size effects in statistical pattern recognition: recommendations for practitioners. IEEE Trans. Pattern Anal. Mach. Intell. 13, 252–264 (1991).

52. Kanal, L. & Chandrasekaran, B. On dimensionality and sample size in statistical pattern classification. Pattern Recognit. 3, 225–234 (1971).

53. Hua, J., Xiong, Z., Lowey, J., Suh, E. & Dougherty, E. R. Optimal number of features as a function of sample size for various classification rules. Bioinformatics 21, 1509–1515 (2005).

54. Urbanowicz, R. J., Olson, R. S., Schmitt, P., Meeker, M. & Moore, J. H. Benchmarking relief-based feature selection methods for bioinformatics data mining. J. Biomed. Inform. 85, 168–188 (2018).

55. Peng, H., Long, F. & Ding, C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1226–1238 (2005).

56. Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002).

57. Kuncheva, L. I. A stability index for feature selection. In Int. Multi-Conference: Artificial Intelligence and Applications, 390–395 (2007).

58. Bauer, E. & Kohavi, R. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Mach. Learn. 36, 105–139 (1999).

59. Meinshausen, N. & Bühlmann, P. Stability selection. J. R. Stat. Soc. Ser. B Stat. Methodol. 72, 417–473 (2010).

60. Ojala, M. & Garriga, G. C. Permutation tests for studying classifier performance. J. Mach. Learn. Res. 11, 1833–1863 (2010).

61. Haury, A. C., Gestraud, P. & Vert, J. P. The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLoS One 6, 1–12 (2011).

62. Dernoncourt, D., Hanczar, B. & Zucker, J. D. Analysis of feature selection stability on high dimension and small sample data. Comput. Stat. Data Anal. 71, 681–693 (2014).

63. Bolón-Canedo, V., Sánchez-Maroño, N. & Alonso-Betanzos, A. Data classification using an ensemble of filters. Neurocomputing 135 (2014).

64. Jain, A. K. & Chandrasekaran, B. Dimensionality and sample size considerations in pattern recognition practice. Handb. Stat. 2, 835–855 (1982).

65. Krishnan, M. C. Sex differences in autism spectrum disorder. The Complex. Autism Spectr. Disord. 26, 69–86 (2018).

66. Lai, M. C. et al. A behavioral comparison of male and female adults with high functioning autism spectrum conditions. PLoS ONE 6, e20835 (2011).

67. Hayes, S. J., Andrew, M., Elliott, D., Gowen, E. & Bennett, S. J. Low fidelity imitation of atypical biological kinematics in autism spectrum disorders is modulated by self-generated selective attention. J. Autism Dev. Disord. 46, 502–513 (2016).

68. Mari, M., Castiello, U., Marks, D., Marraffa, C. & Prior, M. The reach-to-grasp movement in children with autism spectrum disorder. Philos. Trans. R. Soc. Lond. B Biol. Sci. 358, 393–403 (2003).

69. Glazebrook, C. M., Gonzalez, D., Hansen, S. & Elliott, D. The role of vision for online control of manual aiming movements in persons with autism spectrum disorders. Autism 13, 411–433 (2009).

70. Mosconi, M. W. et al. Feedforward and feedback motor control abnormalities implicate cerebellar dysfunctions in autism spectrum disorder. J. Neurosci. 35, 2015–2025 (2015).

71. David, F. J., Baranek, G. T., Wiesen, C., Miao, A. F. & Thorpe, D. E. Coordination of precision grip in 2–6 years-old children with autism spectrum disorders compared to children developing typically and children with developmental disabilities. Front. Integr. Neurosci. 6, 122 (2012).

72. Vernazza-Martin, S. et al. Goal directed locomotion and balance control in autistic children. J. Autism Dev. Disord. 35, 91–102 (2005).

73. Schmitt, L. M., Cook, E. H., Sweeney, J. A. & Mosconi, M. W. Saccadic eye movement abnormalities in autism spectrum disorder indicate dysfunctions in cerebellum and brainstem. Mol. Autism 5, 1–13 (2014).

Download references

Acknowledgements

E.G., E.P., and A.J.C. hold academic positions at the University of Manchester and are funded centrally. A.V. was supported by the UK Engineering and Physical Sciences Research Council (website: https://epsrc.ukri.org/) and its Doctoral Training Partnership with the University of Manchester (ref.: EP/M507969/1). The funders did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. No additional internal or external funding was received for this study.

Author information

A.V., E.G., E.P., and A.J.C. conceptualised/designed the study, developed methodology, and reviewed the manuscript. A.V. conducted the experiment, collected data, performed formal analyses, visualised the results, and wrote the main manuscript text. A.J.C. acquired funding.

Correspondence to Andrius Vabalas.

Ethics declarations

The authors declare no competing interests.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.


Categories
Nature RSS Feed

Construction of a web-based nanomaterial database by big data curation and modeling friendly nanostructure annotations

Abstract

Modern nanotechnology research has generated numerous experimental data for various nanomaterials. However, the few nanomaterial databases available are not suitable for modeling studies due to the way they are curated. Here, we report the construction of a large nanomaterial database containing annotated nanostructures suited for modeling research. The database, which is publicly available through http://www.pubvinas.com/, contains 705 unique nanomaterials covering 11 material types. Each nanomaterial has up to six physicochemical properties and/or bioactivities, resulting in more than ten endpoints in the database. All the nanostructures are annotated and transformed into protein data bank files, which are downloadable by researchers worldwide. Furthermore, the nanostructure annotation procedure generates 2142 nanodescriptors for all nanomaterials for machine learning purposes, which are also available through the portal. This database provides a public resource for data-driven nanoinformatics modeling research aimed at rational nanomaterial design and other areas of modern computational nanotechnology.

Introduction

The global market value of nanotechnology is expected to reach $90.5 billion by 20211 as commercial and consumer nano-products continue to proliferate2,3,4. Increased production, use and environmental accumulation of these nanomaterials present important toxicology concerns5,6,7. A variety of in vitro and in vivo assays evaluating their potential environmental and human health effects have generated vast quantities of experimental data8,9, requiring data extraction, analysis, and sharing to guide the safe design of next-generation nanomaterials10,11. This urgency is echoed in the recent Nanoinformatics Roadmap 2030 in the USA and Europe, which aims to promote the capture, preservation, and dissemination of publicly available data on nanomaterials. The Roadmap, which outlined the importance of coordinating research efforts and charted the challenges in nanoinformatics as a set of milestones, envisages the flow of data from experimentalists into structured databases that can be used by computational modelers to predict nanomaterial properties, exposure and hazard values that will support regulatory actions12.

Two large databases for chemicals and proteins have already impacted different areas of science. As a small-molecule database, PubChem provides structural annotation (e.g., chemical structures, SMILES, and InChIKeys), physicochemical properties (e.g., logP and molecular weight) and available bioactivities (e.g., EC50 and IC50)13. Since its launch in 2004, PubChem has served various scientific communities including cheminformatics, chemical biology, medicinal chemistry, and drug discovery. Another crucial database for the scientific community is the Protein Data Bank (PDB)14, which provides three-dimensional structures of biological macromolecules (e.g., proteins and nucleic acids) as PDB files for researchers in fields such as molecular biology, structural biology, and computational biology. However, a comparable nanomaterial database is not available. The key to building such a database is nanostructure annotation: a computer-friendly format for encoding structural information.

Several nanomaterial databases serving specific areas are available (Table 1)15,16,17,18,19. For example, the cancer Nanotechnology Laboratory (caNanoLab) database (https://cananolab.nci.nih.gov/), built by the National Cancer Institute in 200715, is designed to expedite and validate the use of nanotechnology in biomedicine. However, it is not fully accessible to the public because it contains proprietary data. While the nanomaterial databases shown in Table 1 share published data and have been used for modeling studies16,20,21, they are limited by the way they are curated. Although new file formats (e.g., JSON17 and ISA-TAB-Nano22) have been specially designed for several nanomaterial databases, such as eNanomapper and NANoREG, to store and manage the curated nanomaterial data, nanomaterial entities (e.g., the composition, physicochemical properties, and biological activities of the nanomaterials) in these databases exist as text extracted directly from publications, ignoring the nanostructure annotations that are critical for modeling studies. As a result, the variables (e.g., physicochemical properties) used in previous modeling studies were mostly experimentally generated. Without nanostructure annotations, diverse structural information cannot be used for predictive modeling or for other research such as nanostructure analysis and visualization.

Table 1 Nanomaterial databases.


Here, we report a publicly available nanomaterial database that contains annotated nanostructures of diverse nanomaterials suitable for immediate modeling research. The database, constructed from thousands of scientific papers, currently contains 705 unique nanomaterials, 1365 physicochemical property data points (e.g., logP, zeta potential, and hydrodynamic diameter) and 2386 bioactivity data points (e.g., cell viability, cellular uptake, and ROS). All experimentally obtained information on the structure of the nanomaterials, such as form, size, shape, and surface ligands, was annotated and stored as PDB files, which are downloadable from the web portal (http://www.pubvinas.com/). The PDB files can be used to generate nanodescriptors, created in-house to quantitatively represent nanostructure diversity. Using these nanodescriptors, we developed predictive models for three critical property/bioactivity endpoints of various nanomaterials using machine learning (k-nearest neighbor) and deep learning (deep neural network) approaches. This is the largest nanomaterial database, and the only one that contains nanostructure annotations to support nanomaterial modeling and rational nanomaterial design. Furthermore, the predictive models developed from this database can be used to predict two critical properties and one bioactivity (i.e., logP, zeta potential, and cellular uptake) of new nanomaterials.

Results

A total of 705 nanomaterials, comprising 414 gold nanoparticles (GNPs), 17 silver nanoparticles (AgNPs), 12 platinum nanoparticles (PtNPs), 12 palladium nanoparticles (PdNPs), 80 carbon nanotubes (CNTs), 48 buckminsterfullerenes (C60), 34 quantum dots (QDs), 32 metal oxide nanoparticles (MONPs), 21 DNA origami nanoparticles (DnaNPs), 11 dendrimers, and 24 cyclic peptide nanotubes (CPNTs), were annotated for the database. Figure 1 shows 16 representative nanostructures covering all nanomaterial types in the database, rendered by visual molecular dynamics (VMD) using the QuickSurf method23. This method uses the positions of the atoms and Monte Carlo simulation to generate the volumetric density maps and isosurface that simulate the electron density and solvent-accessible surface of the input nanostructures. For example, GNP164 represents the 164th gold nanoparticle in the database, which has a core diameter of 5 nm (Fig. 1; see Supplementary Data for other structure information). The nanostructures varied in material type, size, shape, and surface ligand. For example, C60NP42 and AgNP14 are 1 nm and 40 nm in diameter, respectively. Although most nanomaterials are spherical, the database also contains rod-like (e.g., GNP412, CNT80, and CPNT15) and irregular (e.g., Dendrimer6 and DnaNP7) nanomaterials. Different surface chemistries of the nanostructures were rendered in different colors. For example, the nanoparticle PdNP12 (logP = 2.52), with hydrophobic surface ligands, is shown in cyan, while the nanoparticle PtNP8 (logP = −1.47), with hydrophilic surface ligands, is rendered in purple. Other structural details can also be observed; for example, the long surface ligand chains on GNP164 appear as tentacles. These detailed 3D plots of the nanomaterials in the database give direct impressions of the relevant surface chemistry and physicochemical properties.

Fig. 1: Visualization of 16 representative nanomaterials in the database.

The database contains 705 nanomaterials that vary in material type, size, shape, and surface ligand. Most nanomaterials were spherical, but rod-like and irregular ones were also annotated and included in the database. Different surface chemistries of the nanostructures were rendered in different colors by the QuickSurf drawing method in VMD, offering direct impressions of the nanomaterials. Emboldened text represents identifiers that can be used to search for the nanomaterial in the database.

Figure 2 gives an overview of the data curated in this study (see Supplementary Data for details), including physicochemical properties (logP and zeta potential), bioactivities (cell viability, reactive oxidative stress (ROS), and cellular uptake), along with the nanomaterial types and structure information (surface ligands and size). Although the majority of the nanomaterials are GNPs, there are 291 nanomaterials of other types (Fig. 2a). The functions of nanomaterials are affected by surface small molecules (e.g., drugs and peptides), which determine their diverse applications (e.g., drug delivery and tumor diagnosis). As shown in Fig. 2b, the number of surface ligands ranged from 1 (such as C60 nanomaterials) to more than 6000 (such as GNP12), because ligand density is strongly affected by the properties of the surface ligands. For example, similar-sized GNPs (~5.8 nm) can have around 200 ligands per particle for positively charged ligands (e.g., GNP130) and negatively charged ligands (e.g., GNP138), while uncharged ligands can pack at over 700 surface ligands per GNP (e.g., GNP152). Among the 705 nanomaterials, one contained up to four different ligands (GNP392), and there were in total 314 unique surface ligands. The spherical nanomaterials in the database also had a wide size distribution (Fig. 2c). At the lower end are GNPs with diameters of less than 10 nm that are suitable for biomedical applications24,25; other spherical nanoparticles have sizes ranging from 10 to 45 nm.

Fig. 2: Overview of the nanomaterial database.

a–h Distributions of nanomaterials according to (a) nanomaterial type, (b) surface ligand number, (c) nanomaterial size, (d) logP, (e) zeta potential, (f) cell viability, (g) reactive oxidative stress (ROS) and (h) cellular uptake. Nanomaterials in the database show chemical, structural, and biological diversity. The numbers in the brackets of (b) and (c) represent the ranges of the surface ligand number and nanomaterial size, respectively.

The nanomaterials in this database are also biologically diverse (Fig. 2d–h). The logP values of the nanomaterials, which describe their hydrophobicity, ranged from −2.68 to 2.72. The zeta potential, the charge at the interface between the nanomaterial surface and its liquid medium, was tested in three solutions (water, aqueous buffer, and serum) and ranged from −93.73 mV to 86.80 mV (Fig. 2e). Cell viability showed a spread from 2% to 118.05% (Fig. 2f), indicating that the various nanomaterials induced varying degrees of cytotoxicity. ROS level, which is used to evaluate cellular oxidative stress (linked to cancer, diabetes, and aging), also ranged widely, from 0.44 to 4.10 (Fig. 2g). For nanomaterials, cellular uptake is usually a prerequisite for applications such as drug delivery and bioimaging26. In this database, the cellular uptake capacity of all nanomaterials varied from −1.87 g cell−1 to 1.36 g cell−1 after log10 transformation (Fig. 2h).

After annotating and saving the structures of all 705 nanomaterials in our database as PDB files, we calculated 680 nanodescriptors using the Virtual Nanostructure Simulations (VINAS) toolbox27, an in-house cheminformatics program designed to calculate descriptors from the annotated nanomaterial structures. The current descriptors calculated by VINAS are based on Delaunay tessellation, which is a fast way to transform the nano surface geometry into quantitative values as nanodescriptors. Using the 680 calculated nanodescriptors, we performed principal component analysis (PCA) and used the top three principal components, which account for 79% of the total descriptor variance, to show the occupation of all nanomaterials in a 3D chemical space (Fig. 3a). The nanomaterials were structurally diverse and occupied most of this chemical space. Compared to other nanomaterials, MONPs occupied a larger area because the relevant VINAS nanodescriptor values, which are based on atomic properties, varied significantly with the unique atoms (e.g., Zn, Co, and Ce) that make up each MONP.
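
A minimal sketch of this projection, assuming the (705, 680) descriptor matrix has been downloaded from the portal (a random placeholder is used here):

```python
import numpy as np
from sklearn.decomposition import PCA

descriptors = np.random.rand(705, 680)   # placeholder for the portal data
pca = PCA(n_components=3)
coords = pca.fit_transform(descriptors)  # (705, 3) chemical-space coordinates
# the paper reports ~0.43, 0.23 and 0.13 for the real descriptors
print(pca.explained_variance_ratio_)
```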

Fig. 3: Nanostructure diversity analysis.

a Nanomaterial chemical space shown by principal component analysis (PCA) of all 705 nanomaterials. The three principal components (PC1, PC2, PC3) account for 43%, 23%, and 13% of the total descriptor variance, respectively. Different colors denote different nanomaterial classifications. Six nanomaterials are shown with their identifiers (i.e., PtNP1, PdNP1, Dendrimer4, CPNT24, GNP406, and MONP10). b Distribution of the 248,160 Euclidean distances calculated from each pair of nanomaterials in the database. The distribution ranged from 0.004 to 17.31 with an average of 5.3 (black dashed line). The normalized distribution curve is shown as a red dotted line.

Chemical structure is key to determining a molecule's physicochemical properties and biological activities. The notion that structurally similar molecules should exhibit similar bioactivities is the fundamental hypothesis of all quantitative structure-activity relationship (QSAR) and other related modeling studies28,29. To quantitatively study the structural similarity among nanomaterials, we calculated the pairwise Euclidean distance for all nanomaterials. All nanodescriptor values were normalized to a range between 0 and 1 before the calculation. A total of 248,160 distances were generated between each pair of the 705 nanomaterials. The distribution of values ranged from 0.004 to 17.31 with an average of 5.3 (Fig. 3b). Two substances are typically considered similar if their normalized Euclidean distance is less than 0.530,31. In this database, some nanomaterials that belong to different nanomaterial types are also structurally similar. For example, the Euclidean distances between PtNP1 and PdNP1, and between Dendrimer4 and CPNT24, are 0.037 and 0.14, respectively. PtNP1 and PdNP1, with a Euclidean distance near zero, are considered structurally similar because they are about the same size (6 nm and 5.8 nm, respectively) and have the same surface ligand at a similar density (371 and 365 ligands per particle, respectively). Although Dendrimer4 is irregular and CPNT24 is rod-like, they are considered structurally similar because they have similar sizes (2 nm and 1.41 nm × 1.44 nm) and atomic compositions (C, N, O, and H). Some structural outliers, such as GNP406 and MONP10, were also seen. GNP406 is structurally different because it is a rod-like gold nanoparticle (most are spherical) that is also relatively large at 30 nm × 33 nm. MONP10, a La2O3 metal oxide nanoparticle around 24.6 nm in diameter, is structurally different because of the unique properties of the lanthanum (La) atom.
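
A minimal sketch of the distance analysis, with min-max normalization of each descriptor and all 705 × 704 / 2 = 248,160 pairwise Euclidean distances (placeholder data again):

```python
import numpy as np
from scipy.spatial.distance import pdist

X = np.random.rand(705, 680)                        # placeholder descriptors
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
d = pdist(X, metric="euclidean")                    # 248,160 distances
print(len(d), d.min(), d.mean(), d.max())
```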

To share the structurally annotated data, we developed an online database portal (http://www.pubvinas.com/) that can currently be used to download the PDB files, visualize the nanostructures and upload new data (Fig. 4a). A full-time computer systems administrator is responsible for maintaining the portal. Each PDB file can be downloaded by clicking the dropdown bar for the corresponding classification (e.g., gold nanoparticles, silver nanoparticles, and platinum nanoparticles). Users can view the nanostructure online from the corresponding PDB file and open the downloaded PDB file using well-known cheminformatics software (e.g., VMD, RasMol, and MOE). An example PDB file is shown in Fig. 4b. The first part of the file contains the basic information on the structure of the nanomaterial (e.g., form, shape and size); the second part contains information about the atoms (e.g., atom type and coordinates); and the third part contains information on the bonds/connections between atoms. Users may also share new data (e.g., new nanomaterials synthesized and/or tested against new bioassays) by uploading them as a text file (Fig. 4a). After reviewing the uploaded files, the system administrator will generate the PDB files and add the new dataset to the nanomaterial database. We expect to add more functions in the future, such as an online toolbox to calculate nanodescriptors and several trained models to predict the properties of new nanomaterials.

Fig. 4: PubVINAS online portal.

a Screenshot of PubVINAS. The bars above and to the left of the picture show the user functions (e.g., downloading/uploading data, visualizing data, and selecting data by classification). b Example PDB file shown as three parts: (1) the basic information, (2) atom types and coordinates, and (3) the connections between atoms.

Using data from the database, we applied k-nearest neighbor (kNN), a traditional machine learning approach, and a deep neural network (DNN), a representative deep learning algorithm, to build computational models that identify quantitative relationships between the annotated nanostructures and target activities. Two properties and one bioactivity (i.e., logP, zeta potential tested in water at pH = 7, and cellular uptake capacity in A549 cells) were selected for modeling. The logP dataset contains 147 unique nanomaterials, including 123 GNPs, 12 PtNPs and 12 PdNPs. The zeta potential dataset contains 213 unique nanomaterials, including 148 GNPs, 6 AgNPs, 12 PtNPs, 12 PdNPs, 8 MONPs, 24 QDNPs, and 3 dendrimers. The cellular uptake dataset contains 71 GNPs, which were tested against A549 cells. Each model was developed using the kNN and DNN approaches with VINAS nanodescriptors calculated from the associated nanomaterials in the dataset. Model performance was evaluated by both 5-fold cross-validation and the external prediction methods common in modeling studies32,33. For each endpoint, the available data were randomly split into a training set (80% of the data) for developing the model and a test set (20% of the data) for external validation of the model. The training set was further split into five subsets; the model was developed using four of the five subsets and the remaining subset was used for validation, and this procedure was repeated five times until every subset had been used for validation once.
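
A minimal sketch of this validation protocol (scikit-learn's KNeighborsRegressor stands in for the in-house kNN variant, and placeholder data replaces the logP set):

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score, mean_squared_error

X, y = np.random.rand(147, 680), np.random.rand(147)    # placeholder logP data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
knn = KNeighborsRegressor(n_neighbors=5, weights="distance")
pred_cv = cross_val_predict(knn, X_tr, y_tr, cv=5)      # 5-fold CV predictions
knn.fit(X_tr, y_tr)
pred_val = knn.predict(X_te)                            # external validation
print("R2_5CV:", r2_score(y_tr, pred_cv))
print("RMSE_val:", mean_squared_error(y_te, pred_val) ** 0.5)
```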

The correlations between the experimental and predicted values for the six resulting kNN and DNN models are shown in Fig. 5, which also gives the root mean square errors (RMSE) and correlation coefficients (R2). Overall, both R2 and RMSE for 5-fold cross-validation (R2_5CV and RMSE_5CV) and for external prediction (R2_val and RMSE_val) are of the same order of magnitude, indicating that the 5-fold cross-validation process and the external prediction yielded similar results. All correlation coefficients (both R2_5CV and R2_val) were above 0.5, indicating that all six models successfully captured the relationships between the annotated nanostructures and target activities34. Comparing R2_5CV and R2_val, the kNN models (Fig. 5a, c, e) showed better predictivity than the DNN models (Fig. 5b, d, f). Although DNN is a popular modeling tool and has demonstrated high predictivity in recent modeling challenges in drug discovery35,36, it has performed differently in other studies37,38. Here, the lower predictivity of the DNN models is likely due to overfitting caused by too many neurons in the layers relative to the size of the input data. Both the kNN (Fig. 5e) and DNN (Fig. 5f) cellular uptake models performed better (i.e., had higher R2 values) than the logP and zeta potential models.

Fig. 5: Correlations between experimental (Exp) and predicted (Pred) values.

kNN (a, c, e) and DNN (b, d, f) models were developed for predicting logP (a, b), zeta potential (c, d) and cellular uptake (e, f). The logP dataset contains 147 unique nanomaterials, including 123 GNPs, 12 PtNPs and 12 PdNPs. The zeta potential dataset contains 213 unique nanomaterials, including 148 GNPs, 6 AgNPs, 12 PtNPs, 12 PdNPs, 8 MONPs, 24 QDNPs, and 3 dendrimers. The cellular uptake dataset contains 71 GNPs, which were tested against A549 cells. Root mean square errors (RMSE) and correlation coefficients (R2) are also shown. RMSE_5CV and R2_5CV represent the RMSE and R2 values for 5-fold cross-validation, while RMSE_val and R2_val represent the values for external prediction. R2_5CV and R2_val above 0.5 indicate high correlation between Exp and Pred values.

The resulting models, especially the kNN models, can be used to predict new nanomaterials directly from their structures and assist rational nanomaterial design. Because the cellular uptake dataset consists of only one type of nanomaterial (GNPs), the resulting cellular uptake model can be reliably applied only to new GNPs. The logP and zeta potential datasets consist of various types of nanomaterials collected from different sources, so these two models can be used to predict the properties of a wide range of nanomaterials. In addition, based on the same nanostructure annotation method, machine learning models were recently built to predict the inflammatory responses and cytotoxicity of various carbon nanoparticles39. Once a new nanomaterial is virtually designed on a computer, its properties can be assessed with the developed models before chemical synthesis. This procedure will save considerable resources by prioritizing new nanomaterials with desired properties and/or cellular uptake potential.

Discussion

In summary, we constructed a universal nanomaterial database containing structure annotations suitable for direct computational modeling. The database currently contains 705 unique nanomaterials with multiple biological testing results. The structures of these nanomaterials were annotated and stored as PDB files that are retrievable from the online portal, and new data uploaded in the future will rapidly expand the database. We also developed several machine learning models using three property and bioactivity datasets from this database and showed that the models were predictive in both cross-validation and external validation (i.e., R2 > 0.5). The resulting models can be used to predict two critical properties and one bioactivity of new nanomaterials directly from their nanostructures. Some materials, such as alloy nanomaterials40, polymeric micelles41, mesoporous nanomaterials42, and metal-organic framework (MOF)-based nanomaterials43, were tentatively not included in the database because their nanostructures are poorly defined and the related publications currently lack quality-control information on their synthesis. Other nanomaterials that were annotated still lack representative data for some target endpoints, for example cellular uptake. For the database to be more useful, there is still a need to generate more biological data for diverse nanomaterials.

Methods

The database was compiled from in-house data (297 unique nanomaterials) and external data (408 unique nanomaterials). The in-house data were collected from our previously published studies (references provided in the Supplementary References). The external data were collected by manual literature searching, which yielded more than 1000 papers with nanomaterial data for further examination. Data were included in the database when the following conditions were satisfied: (1) the material (e.g., core atoms) and size information were provided in the paper; (2) the surface ligand structures could be annotated and transferred into SMILES; and (3) the nano-bioactivity or physicochemical property data were provided with detailed experimental information. In total, 69 publications were identified as containing useful data fulfilling all criteria (references provided in the Supplementary References). Each publication was manually examined, and the relevant structure information (e.g., core, size, and surface ligands), experimental data, and testing details were extracted. For raw data giving size and shape information for a set of nanoparticles rather than a single molecular entity, the same core was set for all the nanoparticles in that data source. Data were also obtained directly from figures of published papers using PlotDigitizer. The surface ligand structures were converted to SMILES, which are shown in the Supplementary Data.

For nanoparticles, the core atoms were first put together as a nano core based on the particle size information, and the associated surface ligands were then randomly placed on the core surface. For GNPs, AgNPs, PtNPs, PdNPs, MONPs, and QDs, the core of the corresponding nanostructure was generated by replicating the unit cell of the most thermodynamically stable crystal structure and then deleting atoms outside the input diameter. The lattice parameters (e.g., unit cell lengths and angles) were obtained from the Materials Project (https://materialsproject.org/). For CNTs, the Python toolkit scikit-nano (https://scikit-nano.org/) was applied to construct the carbon core (pristine CNTs). All surface ligands were optimized before being grafted onto the nano core. For C60, the SMILES obtained from the paper44 were directly converted to PDB files. The PDB files of DnaNPs were either collected from the corresponding papers45,46,47,48 or generated by Legogen49. The PDB files of dendrimers were collected from the corresponding papers50,51,52,53. For CPNTs, the nanostructures were generated by an in-house program written in C++54: the amino acids were first connected into various cyclic peptides through peptide bonds, and these cyclic peptides were then stacked into CPNTs through H-bonds.

First, 126 tetrahedron fragments were generated for each nanostructure, as in our previous study, by combining Delaunay tessellation and atom types27. In that study, the value of a nanodescriptor was calculated as the electronegativity of each tetrahedron multiplied by its number of occurrences in the nanostructure. As described above, nanomaterial size has a wide distribution in the current database, so the tetrahedron occurrence counts of large and small nanomaterials differ greatly. To resolve this issue, property-based descriptors were also calculated in this study, as follows: (1) the occurrence of each tetrahedron was converted to a frequency (the occurrence of each tetrahedron divided by the total number of tetrahedra in the nanostructure); (2) more atomic properties were introduced, comprising the calculated radius (Rcal), the covalent radius (Rcov), the empirical radius (Remp), the atomic mass (M), the boiling point (Tbol), the density (ρ), the electron affinity (Eea), the electronegativity (χ), the heat of fusion (ΔHfus), the heat of vaporization (ΔHvap), the first ionization energy (IE1), the second ionization energy (IE2), the melting point (Tmel), the molar volume (Vmol), the specific heat (Q), the thermal conductivity (λ) and the valence (q). These 17 property values of each tetrahedron were then multiplied by the corresponding tetrahedron frequency, as described in our previous study27. As a result, 17 descriptor matrices were generated, each containing 126 individual descriptors (the tetrahedron fragments integrated with atomic properties). The calculated nanodescriptors for all nanomaterials are available from the web portal. After removing descriptors with limited information (e.g., with constant values over all nanomaterials), a total of 680 nanodescriptors were used for modeling. The nanostructure annotations and nanodescriptor generation are described in detail in our previous papers27,55.
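
A minimal sketch of a property-weighted Delaunay descriptor of this kind (the electronegativity values and the use of the mean atomic property per tetrahedron are illustrative assumptions, not the exact VINAS scheme):

```python
import numpy as np
from collections import Counter
from scipy.spatial import Delaunay

CHI = {"Au": 2.54, "S": 2.58, "C": 2.55, "N": 3.04, "O": 3.44, "H": 2.20}

def delaunay_descriptors(coords, atom_types):
    """coords: (n_atoms, 3) array; atom_types: list of element symbols."""
    tets = Delaunay(np.asarray(coords)).simplices       # (n_tet, 4) indices
    keys = [tuple(sorted(atom_types[i] for i in tet)) for tet in tets]
    freq = Counter(keys)
    total = len(keys)
    # one descriptor per tetrahedron type: frequency x atomic property
    return {k: (c / total) * np.mean([CHI[a] for a in k])
            for k, c in freq.items()}
```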

The datasets were split into training sets (80% of the original datasets) and test sets (20%). The training sets were used to build models, and the associated test sets were used to evaluate them. The performance of each model was assessed by 5-fold cross-validation within the training set and by external validation on the test set. Two different machine learning approaches were used to develop the computational models. The k-nearest neighbor (kNN) method, developed in-house (also available at http://chembench.mml.unc.edu/), used the weighted average of the nearest neighbors as its prediction and employed a variable selection procedure to define neighbors27,55. The deep neural network (DNN) is a multi-layer feedforward neural network, implemented using the Keras 2.2.4 (https://keras.io/) Python deep learning library with the TensorFlow backend. The DNN architecture used in this study consisted of a sequence of five dense (fully connected) layers, of which three were hidden layers containing 512, 128, and 64 nodes, respectively. ReLU was used as the activation function to perform non-linear transformations. Dropout, with a rate of 0.2, was used to prevent overfitting of the resulting models. RMSprop and mean squared error (MSE) were used as the optimizer and loss function, respectively, to compile the DNN model. The learning rate was set to the default value of the RMSprop optimizer, and each DNN model was trained for 300 epochs.
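For concreteness, a minimal Keras sketch of the DNN described above might look as follows. The hidden-layer sizes, activation, dropout rate, optimizer, loss, and epoch count follow the text; the exact composition of the five dense layers is not fully specified, so the sketch shows the three hidden layers plus a regression output, with the input dimension taken as the 680 filtered nanodescriptors and random arrays standing in for an actual 80% training split.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

n_descriptors = 680  # nanodescriptors remaining after filtering, per the text

model = Sequential([
    Dense(512, activation="relu", input_shape=(n_descriptors,)),  # hidden 1
    Dropout(0.2),
    Dense(128, activation="relu"),                                # hidden 2
    Dropout(0.2),
    Dense(64, activation="relu"),                                 # hidden 3
    Dropout(0.2),
    Dense(1),                                                     # regression output
])
model.compile(optimizer="rmsprop", loss="mse")  # default RMSprop learning rate

# Random stand-ins for a training split of descriptors and endpoint values
X_train = np.random.rand(200, n_descriptors)
y_train = np.random.rand(200)
model.fit(X_train, y_train, epochs=300, batch_size=32, verbose=0)
```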

Data availability

All experimental data can be accessed from the Supplementary Data or from the Experimental data page of the web portal (http://www.pubvinas.com/).

References

  1. McWilliams, A. The Maturing Nanotechnology Market: Products and Applications (BCC Research, Wellesley, MA, 2016).

  2. Quadros, M. E. & Marr, L. C. Silver nanoparticles and total aerosols emitted by nanotechnology-related consumer spray products. Environ. Sci. Technol. 45, 10713–10719 (2011).

  3. Stamm, H., Gibson, N. & Anklam, E. Detection of nanomaterials in food and consumer products: bridging the gap from legislation to enforcement. Food Addit. Contam. 29, 1175–1182 (2012).

  4. Vance, M. E. et al. Nanotechnology in the real world: redeveloping the nanomaterial consumer products inventory. Beilstein J. Nanotechnol. 6, 1769–1780 (2015).

  5. Valsami-Jones, E. & Lynch, I. How safe are nanomaterials? Science 350, 388–389 (2015).

  6. Cao, M., Li, J., Tang, J., Chen, C. & Zhao, Y. Gold nanomaterials in consumer cosmetics nanoproducts: analyses, characterization, and dermal safety assessment. Small 12, 5488–5496 (2016).

  7. Djurišić, A. B. et al. Toxicity of metal oxide nanoparticles: mechanisms, characterization, and avoiding experimental artefacts. Small 11, 26–44 (2015).

  8. Zhang, Y. et al. Perturbation of physiological systems by nanoparticles. Chem. Soc. Rev. 43, 3762–3809 (2014).

  9. Sharifi, S. et al. Toxicity of nanomaterials. Chem. Soc. Rev. 41, 2323–2343 (2012).

  10. Maojo, V. et al. Nanoinformatics: a new area of research in nanomedicine. Int. J. Nanomed. 7, 3867–3890 (2012).

  11. Hendren, C. O., Powers, C. M., Hoover, M. D. & Harper, S. L. The nanomaterial data curation initiative: a collaborative approach to assessing, evaluating, and advancing the state of the field. Beilstein J. Nanotechnol. 6, 1752–1762 (2015).

  12. Haase, A. & Klaessig, F. EU US Roadmap Nanoinformatics 2030 (EU NanoSafety Cluster, 2018).

  13. Kim, S. et al. PubChem substance and compound databases. Nucleic Acids Res. 44, D1202–D1213 (2016).

  14. Rose, P. W. et al. The RCSB protein data bank: integrative view of protein, gene and 3D structural information. Nucleic Acids Res. 45, D271–D281 (2017).

  15. Gaheen, S. et al. caNanoLab: data sharing to expedite the use of nanotechnology in biomedicine. Comput. Sci. Discov. 6, 014010 (2013).

  16. Trinh, T. X., Ha, M. K., Choi, J. S., Byun, H. G. & Yoon, T. H. Curation of datasets, assessment of their quality and completeness, and nanoSAR classification model development for metallic nanoparticles. Environ. Sci. Nano 5, 1902–1910 (2018).

  17. Jeliazkova, N. et al. The eNanoMapper database for nanomaterial safety information. Beilstein J. Nanotechnol. 6, 1609–1634 (2015).

  18. Mills, K. C., Murry, D., Guzan, K. A. & Ostraat, M. L. Nanomaterial registry: database that captures the minimal information about nanomaterial physico-chemical characteristics. J. Nanopart. Res. 16, 2219 (2014).

  19. Miller, A. L., Hoover, M. D., Mitchell, D. M. & Stapleton, B. P. The Nanoparticle Information Library (NIL): a prototype for linking and sharing emerging data. J. Occup. Environ. Hyg. 4, D131–D134 (2007).

  20. Ha, M. K. et al. Toxicity classification of oxide nanomaterials: effects of data gap filling and pchem score-based screening approaches. Sci. Rep. 8, 1–11 (2018).

  21. Choi, J. S., Trinh, T. X., Yoon, T. H., Kim, J. & Byun, H. G. Quasi-QSAR for predicting the cell viability of human lung and skin cells exposed to different metal oxide nanomaterials. Chemosphere 217, 243–249 (2019).

  22. Thomas, D. G. et al. ISA-TAB-Nano: a specification for sharing nanomaterial research data in spreadsheet-based format. BMC Biotechnol. 13, 2 (2013).

  23. Krone, M., Stone, J., Ertl, T. & Schulten, K. Fast visualization of Gaussian density surfaces for molecular dynamics and particle system trajectories. EuroVis (Short Papers) https://doi.org/10.2312/PE/EuroVisShort/EuroVisShort2012/067-071 (2012).

  24. Khlebtsov, N. & Dykman, L. Biodistribution and toxicity of engineered gold nanoparticles: a review of in vitro and in vivo studies. Chem. Soc. Rev. 40, 1647–1671 (2011).

  25. Huo, S. et al. Ultrasmall gold nanoparticles as carriers for nucleus-based gene therapy due to size-dependent nuclear entry. ACS Nano 8, 5852–5862 (2014).

  26. Depan, D. & Misra, R. D. K. Hybrid nanoparticle architecture for cellular uptake and bioimaging: direct crystallization of a polymer immobilized with magnetic nanoparticles on carbon nanotubes. Nanoscale 4, 6325–6335 (2012).

  27. Yan, X. et al. In silico profiling nanoparticles: predictive nanomodeling using universal nanodescriptors and various machine learning approaches. Nanoscale 11, 8352–8362 (2019).

  28. Cherkasov, A. et al. QSAR modeling: where have you been? Where are you going to? J. Med. Chem. 57, 4977–5010 (2014).

  29. Zhu, H. Big data and artificial intelligence modeling for drug discovery. Annu. Rev. Pharmacol. Toxicol. 60, 573–589 (2020).

  30. Dragos, H., Gilles, M. & Alexandre, V. Predicting the predictability: a unified approach to the applicability domain problem of QSAR models. J. Chem. Inf. Model. 49, 1762–1776 (2009).

  31. Shen, M. et al. Quantitative structure-activity relationship analysis of functionalized amino acid anticonvulsant agents using k nearest neighbor and simulated annealing PLS methods. J. Med. Chem. 45, 2811–2823 (2002).

  32. Wang, W., Kim, M. T., Sedykh, A. & Zhu, H. Developing enhanced blood-brain barrier permeability models: integrating external bio-assay data in QSAR modeling. Pharm. Res. 32, 3055–3065 (2015).

  33. Kim, M. T. et al. Mechanism profiling of hepatotoxicity caused by oxidative stress using antioxidant response element reporter gene assay models and big data. Environ. Health Perspect. 124, 634–641 (2016).

  34. Eriksson, L. et al. Methods for reliability and uncertainty assessment and for applicability evaluations of classification- and regression-based QSARs. Environ. Health Perspect. 111, 1361–1375 (2003).

  35. Mayr, A. et al. Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chem. Sci. 9, 5441–5451 (2018).

  36. Feng, C. et al. Gene expression data based deep learning model for accurate prediction of drug-induced liver injury in advance. J. Chem. Inf. Model. 59, 3240–3250 (2019).

  37. Russo, D. P., Zorn, K. M., Clark, A. M., Zhu, H. & Ekins, S. Comparing multiple machine learning algorithms and metrics for estrogen receptor binding prediction. Mol. Pharm. 15, 4361–4370 (2018).

  38. Rodríguez-Pérez, R., Miyao, T., Jasial, S., Vogt, M. & Bajorath, J. Prediction of compound profiling matrices using machine learning. ACS Omega 3, 4713–4723 (2018).

  39. Liu, G. et al. Analysis of model PM2.5-induced inflammation and cytotoxicity by the combination of a virtual carbon nanoparticle library and computational modeling. Ecotoxicol. Environ. Saf. 191, 110216 (2020).

  40. Liu, X., Wang, D. & Li, Y. Synthesis and catalytic properties of bimetallic nanomaterials with various architectures. Nano Today 7, 448–466 (2012).

  41. Movassaghian, S., Merkel, O. M. & Torchilin, V. P. Applications of polymer micelles for imaging and drug delivery. Wiley Interdiscip. Rev. Nanomed. Nanobiotechnol. 7, 691–707 (2015).

  42. Tang, F., Li, L. & Chen, D. Mesoporous silica nanoparticles: synthesis, biocompatibility and drug delivery. Adv. Mater. 24, 1504–1534 (2012).

  43. Dang, S., Zhu, Q. L. & Xu, Q. Nanomaterials derived from metal-organic frameworks. Nat. Rev. Mater. 3, 1–14 (2017).

  44. Toropova, A. P., Toropov, A. A., Benfenati, E., Leszczynska, D. & Leszczynski, J. QSAR modeling of measured binding affinity for fullerene-based HIV-1 PR inhibitors by CORAL. J. Math. Chem. 48, 959–987 (2010).

  45. Bai, X., Martin, T. G., Scheres, S. H. W. & Dietz, H. Cryo-EM structure of a 3D DNA-origami object. Proc. Natl Acad. Sci. USA 109, 20012–20017 (2012).

  46. Nguyen, N. et al. The absence of tertiary interactions in a self-assembled DNA crystal structure. J. Mol. Recognit. 25, 234–237 (2012).

  47. Dong, Y., Chen, S., Zhang, S. & Sodroski, J. Folding DNA into a lipid-conjugated nanobarrel for controlled reconstitution of membrane proteins. Angew. Chem. 130, 2094–2098 (2018).

  48. Pan, K. et al. Lattice-free prediction of three-dimensional structure of programmed DNA assemblies. Nat. Commun. 5, 5578 (2014).

  49. Slone, S. M. Building DNA Brick Structures with LegoGen. Theoretical and Computational Research at the Interface of Physics, Biology, and Nanotechnology, http://bionano.physics.illinois.edu/tutorials/using-legogen-build-dna-brick-structures (2016).

  50. Maingi, V., Jain, V., Bharatam, P. V. & Maiti, P. K. Dendrimer building toolkit: model building and characterization of various dendrimer architectures. J. Comput. Chem. 33, 1997–2011 (2012).

  51. Schilrreff, P., Mundiña-Weilenmann, C., Romero, E. L. & Morilla, M. J. Selective cytotoxicity of PAMAM G5 core–PAMAM G2.5 shell tecto-dendrimers on melanoma cells. Int. J. Nanomed. 7, 4121–4133 (2012).

  52. Maiti, P. K., Çaǧin, T., Wang, G. & Goddard, W. A. Structure of PAMAM dendrimers: generations 1 through 11. Macromolecules 37, 6236–6254 (2004).

  53. Naha, P. C., Davoren, M., Lyng, F. M. & Byrne, H. J. Reactive oxygen species (ROS) induced cytokine production and cytotoxicity of PAMAM dendrimers in J774A.1 cells. Toxicol. Appl. Pharmacol. 246, 91–99 (2010).

  54. Yan, X., Fan, J., Yu, Y., Xu, J. & Zhang, M. Transport behavior of a single Ca2+, K+, and Na+ in a water-filled transmembrane cyclic peptide nanotube. J. Chem. Inf. Model. 55, 998–1011 (2015).

  55. Wang, W. et al. Predicting nano-bio interactions by integrating nanoparticle libraries and quantitative nanostructure activity relationship modeling. ACS Nano 11, 12641–12649 (2017).


Acknowledgements

X.Y. and B.Y. were supported by the National Key R&D Program of China (2016YFA0203103), the National Natural Science Foundation of China (91543204 and 91643204), and the introduced innovative R&D team project under the "Pearl River Talent Recruitment Program" of Guangdong Province (2019ZT08L387). W.W. and H.Z. were partially supported by the National Institute of Environmental Health Sciences (grant numbers R01ES031080, R15ES023148, and P30ES005022). We thank A. L. Chun of Science StoryLab for editorial service.

Author information

H.Z. and B.Y. conceived and designed the study. H.Z. designed the project strategy. X.Y. curated the experimental data, constructed the web portal, simulated the virtual nanomaterials, calculated nanodescriptors, built the models, and performed validation. A.S. designed, wrote and tested codes for constructing the virtual nanomaterials and guided several nanodescriptors calculation. W.W. helped analyze the results. X.Y., B.Y., and H.Z. wrote the paper. All authors have read and approved this paper.

Correspondence to
Bing Yan or Hao Zhu.

Ethics declarations

The authors declare no competing interests.

Additional information

Peer review information Nature Communications thanks Christine Ogilvie Hendren and David Winkler for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.


Categories
Nature RSS Feed

Inferring multimodal latent topics from electronic health records

Abstract

Electronic health records (EHR) are rich heterogeneous collections of patient health information, whose broad adoption provides clinicians and researchers unprecedented opportunities for health informatics, disease-risk prediction, actionable clinical recommendations, and precision medicine. However, EHRs present several modeling challenges, including highly sparse data matrices, noisy irregular clinical notes, arbitrary biases in billing code assignment, diagnosis-driven lab tests, and heterogeneous data types. To address these challenges, we present MixEHR, a multi-view Bayesian topic model. We demonstrate MixEHR on MIMIC-III, Mayo Clinic Bipolar Disorder, and Quebec Congenital Heart Disease EHR datasets. Qualitatively, MixEHR disease topics reveal meaningful combinations of clinical features across heterogeneous data types. Quantitatively, we observe superior prediction accuracy of diagnostic codes and lab test imputations compared to the state-of-the-art methods. We leverage the inferred patient topic mixtures to classify target diseases and predict mortality of patients in critical conditions. In all comparisons, MixEHR confers competitive performance and reveals meaningful disease-related topics.

Introduction

The broad adoption of electronic health record (EHR) systems has created unprecedented resources and opportunities for conducting health informatics research. Hospitals routinely generate EHR data for millions of patients, which are increasingly being standardized by using systematic codes, such as Logical Observation Identifiers Names and Codes (LOINC) for lab tests, RxNorm for prescriptions; and Systematized Nomenclature of Medicine (SNOMED), Diagnosis-Related Groups (DRG), and International Classification of Diseases (ICD-9) for diagnoses (albeit for billing purposes). In the USA, for example, the number of nonfederal acute care hospitals with basic digital systems increased from 9.4% to 96% over the 7-year period between 2008 and 20151,2,3. Furthermore, the amount of comprehensive EHR data recording multiple data types, including clinical notes, increased from only 1.6% in 2008 to 40% in 20153. With the aid of effective computational methods, these EHR data promise to define an encyclopedia of diseases, disorders, injuries, and other related health conditions, uncovering a modular phenotypic network.

Distilling meaningful concepts from the raw EHR data presents several challenges, and it is often infeasible to directly model the joint distribution over the entire EHR feature space. On the other hand, it is possible to formulate a latent topic model over discrete data, in analogy to automatic text categorization4, considering each patient as a document and each disease meta-phenotype as a topic (grouping recurrent combinations of individual phenotypes). Our task then is to learn a set of meaningful disease topics, and the probabilistic mixture memberships of patients for each disease topic, representing the combination of meta-phenotypes inferred for each patient.

In this paper, we formalize this analogy by introducing a latent topic model, MixEHR, designed to meet the challenges intrinsic to heterogeneous EHR data. The main objective of our approach is twofold: (1) distill meaningful disease topics from otherwise highly sparse, biased, and heterogeneous EHR data; and (2) provide clinical recommendations by predicting undiagnosed patient phenotypes based on their disease mixture membership. Importantly, we aim for our model to be interpretable in that it makes not only accurate predictions but also intuitive biological sense. MixEHR can simultaneously model an arbitrary number of EHR categories with separate discrete distributions. For efficient Bayesian learning, we developed a variational inference algorithm that scales to large-scale EHR data.

MixEHR builds on the concepts of collaborative filtering5,6,7,8,9 and latent topic modeling4,10,11,12,13,14. In particular, our method is related to the widely popular text-mining method Latent Dirichlet Allocation (LDA)4. However, LDA accounts for neither missing data nor the not-missing-at-random (NMAR) mechanism8,15. Our method is also related to several EHR phenotyping studies focusing on matrix factorization and model interpretability16,17,18,19,20,21,22,23. Recently developed deep-learning methods primarily focus on the prediction performance of target clinical outcomes24,25,26,27,28,29,30,31. Apart from these methods, the Graph-based Attention Model (GRAM)32 uses an existing knowledge graph to embed ICD codes and predicts diagnostic codes at the next admission using the graph embedding. Deep Patient29 uses a stacked stochastic denoising autoencoder to model each EHR code by its latent embedding, which is used as a feature in a classifier for predicting a target disease. A more detailed review is provided in the Supplementary Discussion.

We apply MixEHR to three real-world EHR datasets: (1) the Medical Information Mart for Intensive Care (MIMIC)-III1, a public dataset containing over 50,000 ICU admissions; (2) a Mayo Clinic EHR dataset containing 187 patients, including 93 bipolar disorder cases and 94 controls; (3) the Régie de l'assurance maladie du Québec Congenital Heart Disease Dataset (Quebec CHD Database) on over 80,000 patients followed for congenital heart disease (CHD) for over 28 years (1983–2010) at the McGill University Health Centre (MUHC) in Montreal. In these applications, we find that the clinical features emerging across EHR categories under common disease topics are biologically meaningful, revealing insights into disease co-morbidities. The inferred topics reveal sub-categories of bipolar disorder, despite the small sample size. The inferred patient topic mixtures can be used to effectively predict diagnostic codes in patients' future hospital visits from outpatient data, impute missing lab results, and predict future next-admission patient mortality risk.

Results

We take a probabilistic joint matrix factorization approach, projecting each patient's high-dimensional and heterogeneous clinical record onto a low-dimensional probabilistic meta-phenotype signature, which reflects the patient's mixed memberships across diverse latent disease topics. The key innovation that enables us to handle highly heterogeneous data types is that we carry out this factorization at two levels. At the lower level, we use data-type-specific topic models, learning a set of basis matrices for each data type. Using the MIMIC-III data as a particular example, we learned seven basis matrices corresponding to clinical notes, ICD-9 billing codes, prescriptions, DRG billing codes, CPT procedural codes, lab tests, and lab results. We link these seven basis matrices at the higher level using a common loading matrix that connects the multiple data types for each patient (Fig. 1a).

Fig. 1: MixEHR model overview.

a Multi-view matrix factorization of multiple data matrices corresponding to different EHR data types, including lab tests, billing code, doctor notes, etc. b Proposed Bayesian model for modeling non-missing-at-random (NMAR) lab tests and other multimodal data. In order to achieve tractable inference, we assign a latent topic hlj to the lab results ylj and missing indicator (rlj) such that they become conditionally independent. c Collapsed variational Bayesian inference of the MixEHR model. The inference and learning can be visualized as marginalizing a three-dimensional tensor that represents the expectations of the latent variables.


To model the lab data, we assign a common latent topic to both the lab test and the lab test result such that they are conditionally independent given their topic assignments (Fig. 1b). We then update the lab topic distributions using both the observed lab test results and the missing/imputed lab test results, weighted by their topic assignment probabilities for each patient. This differs from other latent topic models, which assume that data are either completely observed or missing at random. To learn the model, we use variational Bayes to update the model parameters via coordinate ascent (Fig. 1c). The inference algorithms are detailed in "Methods" and "Supplementary Methods". After learning the model, we can obtain three pieces of clinically relevant information: (1) to impute a missing EHR code for each patient, we take the average over the disease topics; (2) to infer patients' meta-phenotypes as their mixed disease memberships, we average over the phenotype dimension; and (3) to infer the latent disease topic distributions (i.e., distributions over all clinical variables), we average over the patient dimension. These three marginalizations are illustrated in the sketch below.
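As a toy illustration of these three marginalizations, the numpy sketch below treats the latent-variable expectations as a patients × features × topics tensor. This is a deliberate simplification of the collapsed variational statistics in Fig. 1c, and the array names and shapes are ours.

```python
import numpy as np

# gamma[j, w, k]: expected assignment of clinical feature w in patient j to topic k
rng = np.random.default_rng(0)
n_patients, n_features, n_topics = 100, 500, 75
gamma = rng.dirichlet(np.ones(n_topics), size=(n_patients, n_features))

# (2) patient meta-phenotypes: average out the phenotype (feature) dimension
theta = gamma.mean(axis=1)                 # (patients, topics)

# (3) disease topic distributions: average out the patient dimension
phi = gamma.mean(axis=0)                   # (features, topics)
phi /= phi.sum(axis=0, keepdims=True)      # normalize each topic over features

# (1) impute a missing EHR code for patient j: mix topics by theta[j]
j = 0
imputed_code_probs = phi @ theta[j]        # (features,) expected code probabilities
```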

We applied MixEHR to the MIMIC-III data1, which contain ~39,000 patients each with a single admission and ~7500 patients each with multiple admissions (Supplementary Table 1). We used MixEHR to jointly model six data categories, including unstructured text in clinical notes, ICD-9, DRG, current procedural terminology (CPT), prescriptions, and lab tests, together comprising ~53,000 clinical features and ~16 million total clinical observations ("Methods"). The frequency of observed EHR codes over all admissions is low (Supplementary Fig. 27). The majority of the variables, including the lab tests, were observed in fewer than 1% of the 58,903 admissions (i.e., greater than 99% missing rates for most variables). The data are therefore extremely sparse, underscoring the importance of integrating multimodal data to aid the imputation.

To choose the optimal number of topics (K), we performed fivefold cross-validation, evaluating models with different K by the averaged predictive likelihood on the held-out patients (Supplementary Fig. 1a). We set K = 75 for subsequent analyses as it gave the highest predictive likelihood. We then examined the clinical relevance of our learned disease topics by qualitatively assessing the coherence of the five most probable EHR codes for each topic. Taking a column with common index k from the topic distribution matrices for each of the six EHR data types (i.e., notes, ICD-9, CPT, DRG, lab tests, and prescriptions) gives us a single disease topic distribution over clinical terms across those data types. We annotated all 75 topics based on the top EHR codes and found that the majority of the inferred topics are specific to distinct diseases (Supplementary Table 2).

For the purpose of demonstration, we paid special attention to the disease topics exhibiting the highest likelihoods over the broad set of ICD-9 codes with prefixes {250, 296, 331, 042, 205, 415, 571}, representing diabetes, psychoses, neurodegeneration, HIV, myeloid leukemia, acute pulmonary heart disease, and chronic liver disease and cirrhosis, respectively. The top terms are displayed in alternative word cloud representations in Supplementary Fig. 3. We observed salient intra-topic consistency and inter-topic contrast with respect to each topic's highest-scoring EHR features across diverse EHR data categories (colored bars on the right; Fig. 2a). Notably, if we were to use LDA and thereby model observations across all data categories as draws from the same multinomial distribution, the learned disease topics would have been dominated by clinical notes, given that these contain far more features (i.e., words) than the other categories. We illustrate this aspect in the application of mortality prediction below.

Fig. 2: Disease topics from MIMIC-III dataset.

a Top five EHR codes for select disease topics from the 75-topic MixEHR model. The rows are the top features, and columns are the topics. The color intensities indicate the inferred probabilities of these features belonging to each topic. b Age-correlated disease topics. Top three EHR codes were displayed for the top three most and least age-correlated topics. c Top correlated EHR codes with schizophrenia and PTSD.


Moreover, we observed interesting connections between abnormal lab results and specific diseases (Supplementary Fig. 4). For instance, topic M31 is enriched for alcoholic cirrhosis billing codes and ascites-based lab tests typical for cirrhosis patients; in M42, the code for HIV infection is grouped with abnormal counts of immune cells, including CD4 lymphocytes; diabetes codes tend to rank highly alongside high blood acetone in M48; and topic M62 is enriched for opioid abuse along with valproic acid and lithium prescriptions, common treatments for bipolar disorder. Interestingly, topic M59, associated with neurodegenerative diseases such as Alzheimer's disease (AD), is strongly associated with vitamin B12, which was recently shown to exhibit a differential pattern in AD patients33. We also correlated the 75 topic distributions with patients' age. The results clearly show that the top positively age-correlated topics are related to heart failure (M20), cardiovascular disease (M44), and dementia (M59). In contrast, the top three negatively age-correlated topics are all related to newborns (Fig. 2b; Supplementary Fig. 5).

We reasoned that the similarity of the annotated disease (ICD-9 code) distributions over topics might help uncover nontrivial associations between diseases. To infer a disease comorbidity network, we used the 75 learned disease topics and calculated the pairwise correlations between ICD-9 diagnostic codes. We included 424 ICD-9 codes observed in at least 100 patients and not in the external injuries (ICD-9 codes starting with V) or supplementary classification (ICD-9 codes starting with E) sections. For comparison, we calculated the correlations between the same 424 ICD-9 codes using their sparse distributions over patients in the raw data. Despite focusing on the ICD-9 codes with at least 100 patients, on average each code is observed in only 0.8% of patients (min: 0.3%; max: 2%). Consequently, due to the high sparsity of diseases among patients, we observed very weak correlations based on the observed data, ranging between −0.02 and 0.06 (Supplementary Fig. 6a). However, when we correlated the ICD-9 codes using our inferred topic embeddings, we observed much stronger correlations. More importantly, diseases of similar physiology form modules consistent with clinical intuition (Supplementary Fig. 6b). Thus, the projection of multimodal data onto the inferred topic embedding space enables the discovery of strong disease associations not directly measurable from the raw EHR data.

Using these topic embeddings, we can also obtain a comorbidity network centered on a specific disease of interest by correlating the disease's diagnostic code with all other diagnostic codes. In particular, we obtained such comorbidity networks for schizophrenia, post-traumatic stress disorder (PTSD), Alzheimer's disease, and bipolar disorder (Fig. 2c; Supplementary Figs. 7–14). The sizes of the terms are proportional to their Pearson correlation with the disease of interest. To control the false discovery rate, we randomly shuffled the topic probabilities for the target phenotype to calculate the background correlation. Only phenotypes with permutation p-value < 0.01 are displayed. We observed meaningful related phenotypes in all four examples.
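A sketch of this embedding-based comorbidity analysis, including the permutation background, is shown below. The embedding matrix is a random stand-in for the 424 ICD-9 codes' inferred distributions over the 75 topics, and pooling the null correlations across all other codes is our assumption about the shuffling scheme.

```python
import numpy as np

rng = np.random.default_rng(1)
n_codes, n_topics = 424, 75
embeddings = rng.random((n_codes, n_topics))  # stand-in topic embeddings
target = 0  # index of the disease of interest, e.g., schizophrenia

def pearson(a, b):
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

others = [i for i in range(n_codes) if i != target]
observed = np.array([pearson(embeddings[target], embeddings[i]) for i in others])

# Background: shuffle the target code's topic probabilities, re-correlate, and
# pool the null correlations over all other codes
null = np.array([pearson(rng.permutation(embeddings[target]), embeddings[i])
                 for _ in range(200) for i in others])
pvals = np.array([(null >= r).mean() for r in observed])
comorbid = [others[i] for i in np.where(pvals < 0.01)[0]]  # codes kept, as in Fig. 2c
```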

Besides gaining insights from the disease topics (i.e., basis matrices), we can also exploit the disease topic mixture memberships along the patient dimension. As an example, we took the 50 patients with the highest proportions for topics M31, M35, and M50, which as we noted above are associated with alcoholic cirrhosis, pulmonary embolism, and leukemia, respectively (Fig. 3a). We observed a clear enrichment for the expected ICD-9 codes (571, 415, and 205, corresponding to the diagnoses of chronic liver disease and cirrhosis, acute pulmonary heart disease, and myeloid leukemia, respectively) among these patients (Fig. 3b). However, not all of the patients were classified by their expected diagnostic codes.

Fig. 3: Prioritize patients by disease mixture from MIMIC-III dataset.

a Word clouds of the three topics with colors indicating different data types and font size proportional to the probabilities of the EHR codes under that topic. b We selected the top 50 patients under three of the learned topics for illustration. These topics are displayed as word clouds where the size of each word is proportional to its probability under the topic. The heatmap indicates patient mixture memberships for the topics. The color bar on the right of the heatmap indicates the actual diagnoses of the diseases related to each of the topics, namely leukemia, pulmonary embolism, and cirrhosis. c Top EHR codes of the high-risk patients. The top ten EHR codes (rows) from each topic are displayed for the top 50 patients (columns) under each topic. We highlighted some evidence (asterisks) as to why some of the patients lacking ICD-9 codes were prioritized as high risk by our method, based on their relevance to the diseases of interest (i.e., leukemia, pulmonary embolism, and cirrhosis).


We then hypothesized that these unclassified patients perhaps exhibit other markers of the relevant diseases. To further investigate these high-risk patients, we examined the 10 highest-scoring EHR codes under these topics (Fig. 3c). We highlighted some evidence as to why the patients lacking the ICD-9 code were prioritized as high risk by our model. Most patients under the leukemia topic M31, including those missing ICD-9 code 205, received bone marrow biopsies (CPT code 4131) and infusion of a cancer chemotherapeutic substance (CPT code 9925), clear indicators of leukemia diagnosis and treatment. Likewise, several patients under the pulmonary embolism topic (M35) that are missing ICD-9 code 415 nevertheless have the DRG billing code for pulmonary embolism, a prescription for enoxaparin sodium (which is used to treat pulmonary embolism), or the related ICD-9 code 453 for other venous embolism and thrombosis. Moreover, all patients under the cirrhosis topic M50 have the keyword cirrhosis mentioned in their clinical notes. Notably, although the procedural code for esophageal variceal banding is not particularly prevalent, MixEHR predicts a reasonably high probability of undergoing this procedure among cirrhosis-topic patients, a clinically reasonable prediction. This suggests that our model is not only intuitively interpretable but also potentially useful for suggesting patients' future medical interventions.

To further demonstrate the utility of our approach in discovering meaningful multimodal topics, we applied MixEHR to a separate dataset containing 187 patients, including 93 bipolar disorder cases and 94 age- and sex-matched controls from Mayo Clinic. Despite the small sample size, the patients were deeply phenotyped: there are in total 7731 heterogeneous EHR features across five different data categories, including ICD-9 codes, procedure codes, patient provided information (PPI), lab tests, and prescription codes. In total, there are 108,390 observations.

We evaluated our model by fivefold cross-validation. Here, we trained MixEHR or the unsupervised baseline models and a logistic regression (LR) classifier that uses the patient topic mixtures derived from MixEHR, or the embeddings derived from other methods, to predict the bipolar disorder label (Fig. 4a; "Methods"). We observed superior performance of MixEHR+LR compared with LDA+LR and RBM+LR in terms of both the area under the ROC curve and the area under the precision–recall curve (Fig. 4b). We quantified the predictive information of the 20 topics based on the linear coefficients of the LR classifier. We observed that topics M19 and M7 have the highest positive coefficients for the bipolar disorder label (Fig. 4c). To confirm the finding, we visualized the 187 patient topic mixtures side-by-side with their BD diagnosis ICD-9 296 code (Fig. 4d). Indeed, we find that M19 and M7 are highly correlated with the BD diagnosis label, and M19 and M7 may represent two distinct subgroups of BD patients. To check whether these two topics were driven by demographic information, we compared the topic mixture memberships with sex and age for each patient. We did not observe a significant difference between sexes for M19 or M7 mixture probabilities (Supplementary Fig. 28). There is also no significant correlation between age and the M19 or M7 topic mixtures (Supplementary Fig. 31).
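The classification step of this workflow can be sketched with scikit-learn as below. The topic mixtures are random stand-ins for the MixEHR output, and the regularization and fold settings are illustrative, since the exact choices of the original pipeline are given in "Methods" rather than here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import roc_auc_score, average_precision_score

# Stand-ins: 20-topic mixtures for 187 patients, 93 cases and 94 controls
rng = np.random.default_rng(2)
theta = rng.dirichlet(np.ones(20), size=187)
labels = np.array([1] * 93 + [0] * 94)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=1000)
scores = cross_val_predict(clf, theta, labels, cv=cv, method="predict_proba")[:, 1]
print("AUROC:", roc_auc_score(labels, scores))
print("AUPRC:", average_precision_score(labels, scores))

# Topic importance: linear coefficients from a fit on the full data,
# analogous to the ranking that singled out M19 and M7 in Fig. 4c
clf.fit(theta, labels)
top_topics = np.argsort(clf.coef_[0])[::-1][:2]
```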

Fig. 4: Classification of bipolar disorder patients using the Mayo Clinic dataset.

a Workflow to evaluate the classification accuracy of bipolar disorder in fivefold cross-validation (CV). b Classification accuracy of fivefold CV in ROC and precision–recall curves. c Linear coefficients of the 20 topics ranked in decreasing order. d Inferred 20-topic mixtures of 187 patients. The patients with and without diagnosed bipolar disorder are colored in red and green, respectively, on the left side of the dendrogram.


We further investigated the differences between the M19 and M7 topics in terms of the underlying 7731 EHR codes. For each EHR code, we grouped the patients based on the presence or absence of that code. We then tested whether the two groups differ significantly in their topic mixture memberships under M19 or M7 using Wilcoxon signed-rank one-sided tests. For ease of interpretation, we tested whether the patients with the code exhibit significantly higher topic mixtures for M19 or M7, to assess positive enrichment for the topic. We note that these phenome-wide association study (PheWAS) results were not corrected for multiple testing, and thus we used them for exploratory purposes only.
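A schematic version of this PheWAS scan follows. One substitution to note: the text reports Wilcoxon one-sided tests, and because the with-code and without-code groups are unpaired, this sketch uses scipy's rank-sum variant (mannwhitneyu) as a stand-in; the presence matrix and topic mixtures are random placeholders.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Placeholders: binary code-presence matrix (patients x codes) and the M19
# topic mixture probability inferred for each patient
rng = np.random.default_rng(3)
n_patients, n_codes = 187, 7731
code_present = rng.random((n_patients, n_codes)) < 0.02
m19 = rng.random(n_patients)

pvals = np.full(n_codes, np.nan)
for c in range(n_codes):
    with_code = m19[code_present[:, c]]
    without_code = m19[~code_present[:, c]]
    if with_code.size < 2 or without_code.size < 2:
        continue  # skip codes too rare to test
    # One-sided: do patients carrying the code show higher M19 membership?
    pvals[c] = mannwhitneyu(with_code, without_code, alternative="greater").pvalue

enriched = np.where(pvals < 0.001)[0]  # exploratory threshold used in Fig. 5
```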

While a large proportion of the significant EHR codes is associated with both the M7 and M19 topics, we also find interesting codes that are unique to each topic (Fig. 5). For instance, the ICD-9 codes for suicidal ideation (V62.84), family history of psychiatric conditions (V17.0), and bipolar type I disorder with the most recent severe depressed episode without psychotic behavior (296.53) are significantly associated with M19 but not M7, although both topics share the 296.80 (bipolar disorder, unspecified) and 296.52 (bipolar I disorder, most recent episode (or current) depressed, moderate) codes for BD (Fig. 5). Certain lab tests (e.g., SLC6A4, HTR2A, and cytochrome P450 enzymes) are also associated with M19, but not M7. The lithium lab test is also associated with high M19 mixture probabilities, but not M7. Thus it is possible that M19 patients may have had more severe symptoms and needed an increased use of pharmacogenomically guided treatment. PheWAS for the PPI questions are displayed in Supplementary Fig. 32. Interpretation of the PPI PheWAS must be taken with caution ("Methods"). With these data, we are able to demonstrate a potential utility of our MixEHR approach to classify BD patients into potentially clinically meaningful categories that require further investigation in larger datasets.

Fig. 5: Phenome-wide association studies (PheWAS) of the 6776 EHR codes on the two bipolar disorder mixture topics (M7 and M19) based on the Mayo Clinic dataset.

For each EHR code, we tested the difference between patient groups with and without that code in terms of M19 and M7 topic membership probabilities. The labeled EHR codes are those with Wilcoxon signed-rank one-sided test p-values < 0.001. The red, blue, and black colors indicate codes that are significant in only M19, only M7, and both M19 and M7, respectively. PheWAS for the PPI questions are displayed in Supplementary Fig. 32.


Retrospective EHR code prediction is valuable for auditing the codes entered in existing EHR patient records and for suggesting potentially missing or incorrectly entered codes based on the expected values of those EHR codes. To predict EHR codes, we formulated a k-nearest neighbor approach (Supplementary Fig. 16a; "Methods"). We evaluated the prediction accuracy based on fivefold cross-validation. The predicted EHR codes are consistent with the observed frequencies of the diagnostic codes (Supplementary Fig. 15). Overall, the median prediction accuracy of MixEHR (multimodal) is 88.9%, the AUROC is 85%, and the AUPRC is 9.6%. In contrast, the baseline model that ignores distinct data types (flattened; i.e., LDA) obtained only 83.3% accuracy, 77.1% AUROC, and 6.7% AUPRC (Supplementary Fig. 16b). The prediction performances vary among distinct data types (Supplementary Fig. 17) and also among different ICD-9 disease groups (Supplementary Fig. 18), which is attributable to the rareness of the diseases, the complexity of the disease codes, and the strengths and limitations of our model-based assumptions. One caveat of this experiment is that we are potentially using highly related codes to predict the current target code. Therefore, this experiment should be distinguished from predicting diagnostic codes in future admissions of inpatient data or future visits in outpatient data.
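One plausible reading of this k-nearest neighbor scheme, sketched below, scores each code for a test patient by a similarity-weighted average over the code vectors of the most similar training patients. The cosine similarity and the neighbor count are our assumptions; the exact formulation is given in the Supplementary materials.

```python
import numpy as np

def knn_code_scores(theta_test, theta_train, codes_train, k=25):
    """Score EHR codes for a test patient from the k training patients whose
    topic mixtures are most similar under cosine similarity.

    theta_test: (K,) topic mixture of the test patient.
    theta_train: (N, K) topic mixtures of training patients.
    codes_train: (N, C) binary code matrix of training patients.
    k: neighbor count (arbitrary here).
    """
    sims = theta_train @ theta_test
    sims /= np.linalg.norm(theta_train, axis=1) * np.linalg.norm(theta_test) + 1e-12
    nearest = np.argsort(sims)[::-1][:k]
    weights = sims[nearest]
    # Weighted average of the neighbors' code vectors gives per-code scores
    return weights @ codes_train[nearest] / (weights.sum() + 1e-12)
```

The resulting per-code scores can then be ranked or thresholded to compute the accuracy, AUROC, and AUPRC figures reported above.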

In the MIMIC-III data, there are 7541 patients who were admitted to the hospital more than once. This allows us to evaluate how well we can predict the diagnostic codes in the next admission based on the information from the previous admission(s) for the same patients. To this end, we developed a pipeline that combines MixEHR topics with a recurrent neural network (RNN) with gated recurrent units (GRU) (Fig. 6a; "Methods"). By leveraging the correlation structure between rarely observed codes and commonly observed codes via the latent topics, we hypothesize that our approach can achieve prediction accuracy comparable to GRAM, which relies on the accuracy of the existing ICD-9 knowledge graph or taxonomy.

Fig. 6: Diagnostic code prediction using the MIMIC-III and Quebec CHD datasets.

a Proposed MixEHR+RNN framework for predicting medical codes from the MIMIC-III data. b Medical code-prediction accuracy. We evaluated the prediction accuracy of the 42 common Clinical Classification Software (CCS) medical codes, comparing MixEHR with Doctor AI and GRAM. For each medical code, the accuracy was calculated as the true positives among the top 20 predicted codes divided by the total number of positives. The dots are individual accuracies for each code predicted by each method. c Proposed MixEHR+RNN framework for predicting the ICD code on the Quebec CHD database. d Prediction accuracies in terms of area under the precision–recall curve (AUPRC) and area under the ROC curve (AUROC) over all of the 1098 3-digit ICD codes. The center line, bounds, and whiskers of the boxplots are the median, first and third quartiles, and outliers, respectively. Wilcoxon signed-rank one-sided tests were performed to calculate the p-values, with the alternative hypothesis that the AUCs from MixEHR+RNN are greater than the AUCs from the baseline RNN.


We observe that MixEHR+RNN confers the highest overall accuracy among the three methods, although the difference between MixEHR+RNN and GRAM32 is small (Fig. 6b). We also generated the prediction accuracy on the codes binned by their frequencies as before (Supplementary Fig. 35). MixEHR+RNN performs the best in four out of the five ranges and falls short by only 2% in the last range (80–100) compared with GRAM. Both MixEHR+RNN and GRAM outperform Doctor AI27 by a large margin. Doctor AI did not do well in this application because of the small sample size and short medical history of the MIMIC-III data (11 years) (Fig. 6b).

We evaluated the code prediction accuracy on the 28-year longitudinal outpatient data from the CHD database using a MixEHR+RNN architecture (Fig. 6c; "Methods"). Comparing our model with the baseline RNN, we observed a significant improvement in terms of both AUPRC (Wilcoxon signed-rank test p-value < 0.0408) and AUROC (Wilcoxon signed-rank test p-value < 5.35e-55) (Fig. 6d). Therefore, adding MixEHR significantly improves the EHR code prediction over the baseline RNN model. We also observed that the learned weights connected to the 50-topic mixture exhibit higher magnitudes than the learned weights connected to the concatenated dense layer embedding (Supplementary Fig. 19). This means that the network relies heavily on the topic mixture to make accurate predictions. Because our dataset is focused on CHD patients, we checked the prediction accuracy on ICD-9 code 428. Both models achieve 93% AUROC and 33% AUPRC for predicting the 428 ICD-9 code.
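The per-code significance test described above can be reproduced with scipy's paired one-sided signed-rank test; the AUC arrays below are placeholders for the per-code values of the two models.

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholders for per-code AUPRCs over the 1098 3-digit ICD codes
rng = np.random.default_rng(4)
auprc_baseline = rng.random(1098)
auprc_mixehr = np.clip(auprc_baseline + rng.normal(0.01, 0.05, 1098), 0, 1)

# One-sided signed-rank test: are the MixEHR+RNN AUCs greater than baseline's?
stat, p = wilcoxon(auprc_mixehr, auprc_baseline, alternative="greater")
print(f"p-value: {p:.3g}")
```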

We discovered two interesting topics, namely M43 and M1, that are highly related to the heart-failure ICD-9 code 428 (Supplementary Fig. 20). Specifically, M43 involves not only ICD-9 code 428.9 for heart failure but also code 518.4 for acute lung edema, code 290.9 for senile psychotic conditions, code 428.0 for congestive heart failure, code 402.9 for hypertensive heart disease without heart failure, and code 782.3 for edema. Indeed, edema is known to be a precursor of heart failure in many patients. Interestingly, topic M1 characterizes a different set of heart-related diseases, such as rheumatic aortic stenosis (code 395.0), secondary cardiomyopathy (code 425.9), and left heart failure (code 428.1).

Procedural codes 1-71, 1-9115, and 1-9112 all indicate billing for more complex care, either by a family doctor or by specialists, in patients who have impaired mobility, prolonged hospitalization beyond 15 days, and/or admission to short-stay units because they are unable to go home, typically requiring diuretics for pulmonary edema. Therefore, most of the top EHR codes under topic M43 are clinically relevant predictors of heart-failure events in clinical practice.

We then evaluated the imputation accuracy for missing lab results. We first performed an extensive simulation by subsampling from the MIMIC-III data to compare lab imputation accuracy between MixEHR and MICE34 (Supplementary Note). Our approach demonstrated superior performance compared with MICE and with the model that assumes missing-at-random (MAR) lab results (Supplementary Fig. 25). We then demonstrated our approach by imputing the real MIMIC-III lab results. Here, we took an approach similar to the k-nearest neighbor approach used in the EHR code prediction (Fig. 7a; "Methods"). We observed that MixEHR achieves significantly higher accuracy than CF-RBM (Fig. 7b, c; Wilcoxon test p-value < 0.00013, Kolmogorov–Smirnov (KS) test p-value < 1.15e-5). This is attributable to two facts: (1) MixEHR accounts for NMAR by jointly modeling the distributions of both the lab missing indicators and the lab results; (2) MixEHR is able to leverage more information than CF-RBM: it models not only the lab data but also other administrative data and clinical notes. We also observed significantly higher accuracy for our MixEHR imputation strategy compared to a simple averaging approach (Supplementary Fig. 34; "Methods").
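Steps 2 and 3 of this imputation workflow (detailed in Fig. 7a below) reduce to a small routine like the following. The value k = 25 follows the figure caption, whereas the Euclidean distance in topic space is our assumption about the similarity measure.

```python
import numpy as np

def impute_lab_result(theta_test, theta_train, lab_values, observed, lab_idx, k=25):
    """Impute one masked lab result from the k most similar patients.

    theta_test: (K,) topic mixture inferred with the target lab result masked.
    theta_train: (N, K) topic mixtures of reference patients.
    lab_values: (N, L) lab result values; observed: (N, L) boolean mask.
    lab_idx: index of the lab test being imputed.
    """
    # Restrict to patients for whom this lab result is actually observed
    has_lab = observed[:, lab_idx]
    candidates = theta_train[has_lab]
    values = lab_values[has_lab, lab_idx]
    # Similarity in topic space; take the k closest patients
    dists = np.linalg.norm(candidates - theta_test, axis=1)
    nearest = np.argsort(dists)[:k]
    return values[nearest].mean()  # average of the k neighbors' results
```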

Fig. 7: Lab results imputation using MIMIC-III dataset.

a Workflow to impute lab results. Step 1: We modeled lab tests, lab test results, and non-lab EHR data (i.e., ICD, notes, prescriptions, treatments) to infer the patient topic mixture. Step 2: For a test patient, we masked each of their observed lab test results t and inferred their topic mixture. Step 3: We then found the k = 25 patients who have the lab test result t observed and exhibit the most similar topic mixture to the test patient, and took the average of the lab result values over the k patients as the prediction of the lab result value for the test patient j′. Steps 1–3 were repeated to evaluate every observed lab test in every test patient. b We compared the imputation accuracy between MixEHR and CF-RBM. We generated the cumulative density function (CDF) of accuracy as well as the boxplot distributions (inset) for each method. The center line, bounds, and whiskers of the boxplots are the median, first and third quartiles, and outliers, respectively. In both cases, MixEHR significantly outperformed CF-RBM based on the KS test (p < 1.15e-5) and the Wilcoxon signed-rank one-sided test (p < 0.00013). c CF-RBM versus MixEHR scatterplot in terms of imputation accuracy.


Predicting a patient's mortality risk in the intensive care unit (ICU) is vital to assessing the severity of illness or trauma and facilitating subsequent urgent care. Because we are able to distill meaningful disease topics from heterogeneous EHR data, we sought to assess the predictive capacity of these disease topics for mortality prediction in future admissions. Here, we predicted the mortality outcome in the last admission based on the second-last admission of the patient within 6 months. We obtained the highest AUPRC of 31.62% from MixEHR (50 topics) among all methods (Fig. 8a) and the second highest AUROC of 65.69% (Supplementary Fig. 22). We also experimented with two different classifiers, namely logistic regression with L2-norm regularization (Supplementary Fig. 29) and random forest (Supplementary Fig. 30). We observed the overall best performance using our MixEHR embedding with both classifiers compared with the other methods.

Fig. 8: Mortality prediction using MIMIC-III dataset.

Each unsupervised embedding method was trained on the patients with only one admission in the MIMIC-III data. The trained model was then applied to embed the second-last admission of patients with at least two admissions that are within 6 months of each other. An elastic net (EN) classifier was trained to predict the mortality outcome in the last admission. This was performed in a fivefold cross-validation setting. a Precision–recall curves (PRC) were generated, and the area under the PRC (AUPRC) is displayed in the legend for each embedding method. EN represents the performance of the elastic net using the raw EHR features. b Linear coefficients for the topics from MixEHR and LDA. The top three and bottom three topics are highlighted. c Topic clinical features for the three most positively predictive topics and the three most negatively predictive topics, based on the elastic net coefficients for the 75 latent disease topics from MixEHR. d Same as c, but for the top predictive topics from LDA.


Furthermore, we designed another experiment that used the earliest/first admission to predict the mortality in the last admission. The value of this application is to identify patients who are discharged too early and therefore to provide a measure on whether the patients should be discharged from the ICU based on their current condition. Once again, MixEHR outperformed other baseline methods (Supplementary Fig. 33). Here, we focused our analysis on the mortality in the last admission based on the information in the second-last admission.

Notably, LDA obtained performance similar to MixEHR, whereas PCA, SiCNMF, and EN performed considerably worse on this task. This suggests that the topic models (i.e., LDA and MixEHR) are generally more suitable for modeling the discrete count data in the EHR. Based on the elastic net linear coefficients, the three most positively predictive mortality topics are enriched for renal failure (M60, with the most positive coefficient), leukemia (M73), and dementia (M26), respectively (Fig. 8c). Interestingly, the three most negatively predictive mortality topics are normal newborn (topic M52, with the most negative coefficient), aneurysm (M8), and drug poisoning (M73), respectively. We note that each MixEHR topic is represented by the top EHR features from diverse data types. In contrast, all of the predictive topics from LDA are represented by a single data type: clinical notes (Fig. 8d). This is because clinical notes contain the most EHR features (i.e., words) among all of the data types. By normalizing the EHR features across all of the data types, the LDA topics are overwhelmed by the category that contains many more features than the other categories. Qualitatively, this greatly reduces the interpretability of the topics compared to MixEHR.

We also ran Deep Patient29 on the MIMIC-III data and did not obtain good performance ("Methods"; Supplementary Fig. 23). This is perhaps due to the small sample size and the lack of a carefully designed preprocessing pipeline29 such as the authors demonstrated on their own EHR data. In addition, previous work has shown the benefits of using demographic information in representing EHR codes and patient visits35. Our preliminary results show little benefit from adding demographic information, including sex, age, religion, ethnicity, and marital status, in the mortality prediction (Supplementary Fig. 24). However, we acknowledge its importance and will continue exploring this in our future work.

Discussion

We present MixEHR as an unsupervised Bayesian framework to efficiently model the distribution of heterogeneous high-dimensional EHR data. We envision MixEHR to be a step toward several promising translational venues, such as (1) facilitating clinicians' decision-making, (2) directing further/deeper phenotyping, (3) enabling personalized medical recommendations, and (4) aiding biomarker discovery. Specifically, MixEHR can infer the expected phenotypes of a patient conditioned on only a subset of clinical variables that are perhaps easier and cheaper to measure. Because new data are constantly entering the EHR systems, our model needs to adapt to incoming data without refitting to the entire dataset (in other words, to go online), a topic for future development.

In the current framework, MixEHR assumes that the input data are a set of two-dimensional matrices of patients by measurements (lab tests, diagnoses, treatments, etc.), as delineated in the structure of our datasets. However, one could envision EHR data being represented as higher-dimensional data objects, for instance as a three-dimensional object of patients by lab tests by diagnoses, patients by diagnoses by treatments, or patients by diagnoses by time. To model such high-dimensional EHR data (as they become available), we can extend MixEHR to a probabilistic tensor-decomposition (TD) framework. Recently developed non-Bayesian models such as Marble36, Rubik37, and non-negative TD21 are TD-based methods that show great promise. MixEHR is an ideal modeling core to serve as the basis for such upcoming high-dimensional modeling challenges.

It is challenging to interpret associations with patient provided information (PPI), compared with the other data types we used here. The presence of PPI data is confounded by factors such as the point of care (e.g., primary care versus psychiatric department), gender, demographics, and other social determinants of health. We will revisit this challenge in our future work.

Our basic MixEHR framework is cross-sectional. Nonetheless, longitudinal EHR learning is an important research area, as it promises to forecast patients' disease outcomes based on their entire medical history while taking into account the temporal trajectory of patient states27,38,39. To this end, we present a pipeline that provides the MixEHR-inferred topic mixture at each admission as input to a recurrent neural network in order to model the longitudinal aspect of the EHR data. It is more challenging to model irregularly sampled time points, as commonly observed in outpatient visits, while modeling high-dimensional variables. In addition, in our lab imputation analysis, we did not model the sequential event of lab test results within the same admission, which requires a more sophisticated model that takes into account NMAR and the irregularly measured lab tests during the patient's stay. We leave these as future extensions of MixEHR.

In the mortality prediction, we used the information at the early admission to predict mortality in the last admission. Meanwhile, we understand the value of using the first 24 or 48 h of data to predict in-hospital mortality within the same admission (e.g., ref. 17). This is different from our application. We did not carry out this experiment because not all of the EHR data in MIMIC-III have a time stamp (i.e., CHARTTIME) within the same admission. In particular, although clinical notes and lab tests have a chart time that records when they were taken during the in-hospital admission, all of the rest of the data, including ICD-9 diagnostic codes, prescriptions, CPT, and DRG codes, do not have a chart time and are usually entered upon the patient's discharge. This makes it difficult to design such an experiment using all of the information. We acknowledge that one caveat in our analysis is that some patients who are terminally ill (e.g., at the metastatic cancer stage) may choose to die at home, and therefore our prediction may reflect not only the patients' health states but also their mental and social states.

In summary, our data-driven approach, using a large number of heterogeneous topics over long observation periods, has the potential to lead to clinically meaningful algorithms that will help clinicians anticipate high disease-burden events in clinical practice. Our findings can inform the growing emphasis on proactive rather than reactive health management as we move into an era of increasing precision medicine.

Methods

We model EHR data using a generative latent topic model (Fig. 1). Notations are summarized in Supplementary Table 3. Suppose there are $K$ latent disease topics; each topic $k \in \{1, \ldots, K\}$ under data type $t \in \{1, \ldots, T\}$ prescribes a vector of unknown weights $\boldsymbol{\phi}_k^{(t)} = [\phi_{wk}^{(t)}]_{W^{(t)}}$ for the $W^{(t)}$ EHR features, which follows a Dirichlet distribution with unknown hyperparameter $\beta_{wt}$. In addition, each topic is also characterized by a set of unknown Dirichlet-distributed weights $\boldsymbol{\eta}_{lk} = [\eta_{lkv}]_{V_l}$ with $V_l$ distinct values (e.g., a lab test $l$ with $V_l$ distinct lab result values). For each patient $j \in \{1, \ldots, D\}$, the disease mixture membership $\boldsymbol{\theta}_j$ is generated from the $K$-dimensional Dirichlet distribution $\mathrm{Dir}(\boldsymbol{\alpha})$ with unknown asymmetric hyperparameters $\alpha_k$. To generate EHR observation $i$ for patient $j$, a latent topic $z_{ij}^{(t)}$ under data type $t$ is first drawn from a multinomial distribution with rate $\boldsymbol{\theta}_j$. Given the latent topic $z_{ij}^{(t)}$, a clinical feature $x_{ij}^{(t)}$ is drawn from a multinomial distribution with rate $\boldsymbol{\phi}_{z_{ij}^{(t)}}^{(t)}$.

For lab data, we use a generative process where each patient has a variable for the result of every lab test, regardless of whether the result is observed. One important assumption we make here is that the lab result ylj and lab observation indicator rlj for patient j are conditionally independent given the latent topic hlj, namely that the probability of taking the test and the value of that test both depend on the latent disease topic, but not on each other. In terms of the generative process, we first sample the latent variable hlj from the multinomial distribution with rate θj. Conditioned on the latent variable, we then concurrently sample (1) the lab result ylj from a multinomial distribution with rate \(\boldsymbol{\eta}_{l h_{lj}}\), which follows a Vl-dimensional Dirichlet distribution with hyperparameters ζl over the Vl values, and (2) the lab observation indicator rlj from a binomial distribution (see Note 1) with rate \(\psi_{l h_{lj}}\), which follows a beta distribution with unknown shape and scale hyperparameters al and bl. The hyperparameters αk, βwt, ζlv, al, and bl follow unknown gamma distributions.

Formally, we first generate global variables as the parameters for the disease topic distribution:

$$\boldsymbol{\phi}_{k}^{(t)}\sim\mathrm{Dir}(\boldsymbol{\beta}_{t}):\frac{\Gamma\left(\sum_{w}\beta_{wt}\right)}{\prod_{w}\Gamma(\beta_{wt})}\prod_{w}\left[\phi_{wk}^{(t)}\right]^{\beta_{wt}-1}$$

$$\boldsymbol{\eta}_{lk}\sim\mathrm{Dir}(\boldsymbol{\zeta}_{l}):\frac{\Gamma(\sum_{v}\zeta_{lv})}{\prod_{v}\Gamma(\zeta_{lv})}\prod_{v}\eta_{lkv}^{\zeta_{lv}-1},\quad \psi_{lk}\sim\mathrm{Bet}(a_{l},b_{l}):\frac{\Gamma(a_{l}+b_{l})}{\Gamma(a_{l})\Gamma(b_{l})}\psi_{lk}^{a_{l}-1}(1-\psi_{lk})^{b_{l}-1}$$

We then generate the local variables for the non-lab data types, including words from clinical notes, ICD-9 diagnostic codes, prescription terms, and Current Procedural Terminology (CPT) codes of each patient:

$$\boldsymbol{\theta}_{j}\sim\mathrm{Dir}(\boldsymbol{\alpha}):\frac{\Gamma(\sum_{k}\alpha_{k})}{\prod_{k}\Gamma(\alpha_{k})}\prod_{k}\theta_{jk}^{\alpha_{k}-1};\quad z_{ij}\sim\mathrm{Mul}(\boldsymbol{\theta}_{j}):\prod_{k}\theta_{jk}^{[z_{ij}=k]};\quad x_{ij}\sim\mathrm{Mul}(\boldsymbol{\phi}_{k}^{(t)}):\prod_{w}\phi_{kw}^{[x_{ij}=w]}$$

and the local variables for the lab data, including latent topic, lab result, and observation indicator:

$$h_{lj}\sim\mathrm{Mul}(\boldsymbol{\theta}_{j}):\prod_{k}\theta_{jk}^{[h_{lj}=k]};\quad y_{lj}\sim\mathrm{Mul}(\boldsymbol{\eta}_{l h_{lj}}):\prod_{v}\eta_{l h_{lj}v}^{y_{ljv}};\quad r_{lj}\sim\mathrm{Bin}(\psi_{l h_{lj}}):\psi_{l h_{lj}}^{r_{lj}}(1-\psi_{l h_{lj}})^{1-r_{lj}}$$

where Gam(⋅), Dir(⋅), Bet(⋅), Mul(⋅), and Bin(⋅) denote the gamma, Dirichlet, beta, multinomial, and binomial distributions, respectively.
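
To make this generative process concrete, the following NumPy sketch simulates a small synthetic EHR dataset from the model. All dimensions and hyperparameter values are illustrative placeholders, not the settings used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

K, T, D, L = 5, 2, 100, 10           # topics, data types, patients, lab tests
W = [200, 50]                        # vocabulary size per non-lab data type
V = 2                                # lab result states (e.g., normal/abnormal)
M = 20                               # EHR observations per patient per type

alpha = rng.gamma(2.0, 0.5, size=K)                # asymmetric Dirichlet prior
phi = [rng.dirichlet(np.full(W[t], 0.1), size=K) for t in range(T)]
eta = rng.dirichlet(np.full(V, 0.5), size=(L, K))  # lab result distributions
psi = rng.beta(1.0, 1.0, size=(L, K))              # lab observation rates

for j in range(D):
    theta = rng.dirichlet(alpha)                   # patient disease mixture
    for t in range(T):                             # non-lab EHR observations
        z = rng.choice(K, size=M, p=theta)         # latent topic per token
        x = [rng.choice(W[t], p=phi[t][k]) for k in z]  # clinical features
    for l in range(L):                             # lab tests
        h = rng.choice(K, p=theta)                 # latent topic of the test
        y = rng.choice(V, p=eta[l, h])             # latent result value
        r = rng.binomial(1, psi[l, h])             # result observed iff r == 1
```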

Treating the latent variables as missing data, we can express the complete joint likelihood based on our model as follows:

$$p(\mathbf{x},\mathbf{y},\mathbf{r},\mathbf{z},\mathbf{h},\boldsymbol{\theta},\boldsymbol{\phi},\boldsymbol{\psi},\boldsymbol{\eta}\mid\boldsymbol{\Theta})=p(\boldsymbol{\theta}\mid\boldsymbol{\alpha})\,p(\mathbf{z},\mathbf{h}\mid\boldsymbol{\theta})\,p(\mathbf{x}\mid\mathbf{z},\boldsymbol{\phi})\,p(\boldsymbol{\phi}\mid\boldsymbol{\beta})\,p(\mathbf{r}\mid\mathbf{h},\boldsymbol{\psi})\,p(\mathbf{y}\mid\mathbf{h},\boldsymbol{\eta})\,p(\boldsymbol{\eta}\mid\boldsymbol{\zeta})\,p(\boldsymbol{\psi}\mid\mathbf{a},\mathbf{b})$$

At this point, we could formulate a variational inference algorithm with mean-field factorization by optimizing an evidence lower bound of the marginal likelihood with respect to the model parameters40. However, that would be inaccurate and computationally expensive for the following reasons. The mean-field variational distribution assumes that all of the latent variables are independent, but we know from our model that these independence assumptions are unrealistic. In particular, the latent topic assignments depend on the patient mixtures, and the disease latent topic distribution and topic assignment are dependent given the observed lab tests or diagnostic codes (i.e., the V-shaped structure in the probabilistic graphical model41). In addition, the variational updates require extensive use of the exponential and digamma functions for every latent topic assignment of every patient, which is computationally expensive for large numbers of patients.

We can achieve more accurate and efficient inference by first analytically integrating out the patient mixtures and latent topic distributions13,42. To do so, we exploit the respective conjugacy of the Dirichlet variables φ, η, θ to the multinomial likelihood variables x, y, {z, h}, and the conjugacy of the beta variable ψ to the binomial lab observation indicator variable r. This way we can simplify the complete joint likelihood distribution of the data and the latent variables {z, h} as follows:

$$p(\mathbf{x},\mathbf{y},\mathbf{r},\mathbf{z},\mathbf{h}\mid\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\zeta},\mathbf{a},\mathbf{b})=\int p(\mathbf{z},\mathbf{h}\mid\boldsymbol{\theta})\,p(\boldsymbol{\theta}\mid\boldsymbol{\alpha})\,d\boldsymbol{\theta}\int p(\mathbf{x}\mid\mathbf{z},\boldsymbol{\phi})\,p(\boldsymbol{\phi}\mid\boldsymbol{\beta})\,d\boldsymbol{\phi}$$

$$\times\int p(\mathbf{y}\mid\mathbf{h},\boldsymbol{\eta})\,p(\boldsymbol{\eta}\mid\boldsymbol{\zeta})\,d\boldsymbol{\eta}\int p(\mathbf{r}\mid\mathbf{h},\boldsymbol{\psi})\,p(\boldsymbol{\psi}\mid\mathbf{a},\mathbf{b})\,d\boldsymbol{\psi}$$

$$=\prod_{j}\frac{\Gamma\left(\sum_{k}\alpha_{k}\right)}{\prod_{k}\Gamma(\alpha_{k})}\frac{\prod_{k}\Gamma\left(\alpha_{k}+n_{.jk}^{(.)}+m_{.jk}\right)}{\Gamma\left(\sum_{k}\alpha_{k}+n_{.jk}^{(.)}+m_{.jk}\right)}\prod_{k}\prod_{t}\frac{\Gamma\left(\sum_{w}\beta_{wt}\right)}{\prod_{w}\Gamma(\beta_{wt})}\frac{\prod_{w}\Gamma\left(\beta_{wt}+n_{w.k}^{(t)}\right)}{\Gamma\left(\sum_{w}\beta_{wt}+n_{w.k}^{(t)}\right)}$$

$$\times\prod_{k}\prod_{l}\frac{\Gamma\left(\sum_{v}\zeta_{lv}\right)}{\prod_{v}\Gamma(\zeta_{lv})}\frac{\prod_{v}\Gamma\left(\zeta_{lv}+m_{l.kv}\right)}{\Gamma\left(\sum_{v}\zeta_{lv}+m_{l.kv}\right)}\prod_{k}\prod_{l}\frac{\Gamma(a_{l}+b_{l})}{\Gamma(a_{l})\Gamma(b_{l})}\frac{\Gamma(a_{l}+p_{lk})\,\Gamma(b_{l}+q_{lk})}{\Gamma(a_{l}+p_{lk}+b_{l}+q_{lk})}$$

where the sufficient statistics are:

$$n_{.jk}^{(.)}=\sum_{t=1}^{T}\sum_{i=1}^{M_{j}^{(t)}}[z_{ij}^{(t)}=k],\quad n_{w.k}^{(t)}=\sum_{j=1}^{D}\sum_{i=1}^{M_{j}^{(t)}}[x_{ij}^{(t)}=w,\,z_{ij}^{(t)}=k]$$

(1)

$$m_{.jk.}=\sum_{l=1}^{L}\sum_{v=1}^{V_{l}}y_{ljv}[h_{lj}=k],\quad m_{l.kv}=\sum_{j=1}^{D}y_{ljv}[h_{lj}=k]$$

(2)

$$p_{lk}=\sum_{j}[r_{lj}=1]\prod_{v}y_{ljv}[h_{lj}=k],\quad q_{lk}=\sum_{j}[r_{lj}=0]\sum_{v}[h_{lj}=k,\,y_{lj}=v]$$

(3)

Note that we use yljv to denote the frequency of observing lab test l with outcome v for patient j, and [ylj = v] as the binary indicator of a single test. Detailed derivations are in Supplementary Methods.

The marginal likelihood can be approximated by the evidence lower bound (ELBO):

$$\log p(\mathbf{x},\mathbf{y},\mathbf{r}\mid\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\zeta},\mathbf{a},\mathbf{b})=\log\sum_{\mathbf{z},\mathbf{h}}\frac{p(\mathbf{x},\mathbf{y},\mathbf{r},\mathbf{z},\mathbf{h}\mid\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\zeta},\mathbf{a},\mathbf{b})}{q(\mathbf{z},\mathbf{h})}\,q(\mathbf{z},\mathbf{h})$$

(4)

$$\ge\sum_{\mathbf{z},\mathbf{h}}q(\mathbf{z},\mathbf{h})\log p(\mathbf{x},\mathbf{y},\mathbf{r},\mathbf{z},\mathbf{h}\mid\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\zeta},\mathbf{a},\mathbf{b})-\sum_{\mathbf{z},\mathbf{h}}q(\mathbf{z},\mathbf{h})\log q(\mathbf{z},\mathbf{h})$$

(5)

$$=\mathbb{E}_{q(\mathbf{z},\mathbf{h})}[\log p(\mathbf{x},\mathbf{y},\mathbf{r},\mathbf{z},\mathbf{h}\mid\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\zeta},\mathbf{a},\mathbf{b})]-\mathbb{E}_{q(\mathbf{z},\mathbf{h})}[\log q(\mathbf{z},\mathbf{h})]\equiv\mathcal{L}_{ELBO}$$

(6)

The step from Eq. (4) to Eq. (5) follows Jensen's inequality. Maximizing \(\mathcal{L}_{ELBO}\) is equivalent to minimizing the Kullback–Leibler (KL) divergence:

$$\mathcal{KL}[q(\mathbf{z},\mathbf{h})\,\|\,p(\mathbf{z},\mathbf{h}\mid\mathbf{x},\mathbf{y},\mathbf{r})]=\mathbb{E}_{q(\mathbf{z},\mathbf{h})}[\log q(\mathbf{z},\mathbf{h})]-\mathbb{E}_{q(\mathbf{z},\mathbf{h})}[\log p(\mathbf{z},\mathbf{h},\mathbf{x},\mathbf{y},\mathbf{r})]+\log p(\mathbf{x},\mathbf{y},\mathbf{r})$$

because \(\log p(\mathbf{x},\mathbf{y},\mathbf{r})\) is constant and \(\mathcal{KL}[q(\mathbf{z},\mathbf{h})\,\|\,p(\mathbf{z},\mathbf{h}\mid\mathbf{x},\mathbf{y},\mathbf{r})]+\mathcal{L}_{ELBO}=\log p(\mathbf{x},\mathbf{y},\mathbf{r})\).

Under the mean-field factorization, the proposal distributions of the latent variables z and h are defined as:

$$\log q(\mathbf{z}\mid\boldsymbol{\gamma})=\sum_{t,i,j,k}[z_{ij}^{(t)}=k]\log\gamma_{ijk}^{(t)},\quad \log q(\mathbf{h}\mid\boldsymbol{\lambda})=\sum_{l,j,k}[h_{lj}=k]\log\lambda_{ljk}$$

(7)

Maximizing (6) with respect to the variational parameters \(\gamma_{ijk}^{(t)}\) and λljk is equivalent to calculating the expectation of \(z_{ij}^{(t)}=k\) and hlj = k with respect to all of the other latent variables13,41:

$$\log q(z_{ij}^{(t)}=k)=\mathbb{E}_{q(\mathbf{z}^{-(i,j)})}[\log p(\mathbf{x},\mathbf{z})],\quad \log q(h_{lj}=k)=\mathbb{E}_{q(\mathbf{h}^{-(l,j)})}[\log p(\mathbf{y},\mathbf{r},\mathbf{h})]$$

(8)

Exponentiating and normalizing the distributions of \(q(z_{ij}^{(t)}=k)\) and \(q(h_{lj}=k)\) gives

$$\gamma_{ijk}^{(t)}=\frac{\exp\left(\mathbb{E}_{q(\mathbf{z}^{-(i,j)})}[\log p(\mathbf{x},\mathbf{z}^{-(i,j)},z_{ij}=k)]\right)}{\sum_{k'}\exp\left(\mathbb{E}_{q(\mathbf{z}^{-(i,j)})}[\log p(\mathbf{x},\mathbf{z}^{-(i,j)},z_{ij}=k')]\right)}$$

(9)

$$\lambda_{ljk}=\frac{\exp\left(\mathbb{E}_{q(\mathbf{h}^{-(l,j)})}[\log p(\mathbf{y},\mathbf{r},\mathbf{h}^{-(l,j)},h_{lj}=k)]\right)}{\sum_{k'}\exp\left(\mathbb{E}_{q(\mathbf{h}^{-(l,j)})}[\log p(\mathbf{y},\mathbf{r},\mathbf{h}^{-(l,j)},h_{lj}=k')]\right)}$$

(10)

We can approximate these expectations by first deriving the conditional distributions \(p(z_{ij}^{(t)}=k\mid\mathbf{x},\mathbf{y},\mathbf{r})\) and \(p(h_{lj}=k\mid\mathbf{x},\mathbf{y},\mathbf{r})\) (Supplementary Methods) and then using Gaussian distributions to approximate the sufficient statistics by the summation of the variational parameters in a zero-order Taylor expansion13,14:

$$\gamma_{ijk}^{(t)}\propto\left(\alpha_{k}+\tilde{n}_{.jk}^{-(i,j)}+\tilde{m}_{.jk}\right)\left(\frac{\beta_{t x_{ij}^{(t)}}+\left[\tilde{n}_{x_{ij}^{(t)}.k}^{(t)}\right]^{-(i,j)}}{\sum_{w}\beta_{wt}+\left[\tilde{n}_{w.k}^{(t)}\right]^{-(i,j)}}\right)$$

(11)

To infer \(p(h_{lj}=k\mid\mathbf{x},\mathbf{y},\mathbf{r})\), we will need to separately consider whether the lab test l is observed or missing. In particular, we can infer the topic distribution \(\lambda_{ljkv}\propto\mathbb{E}_{q}[\log p(h_{lj}=k\mid\mathbf{x},\mathbf{y},\mathbf{r})]\) of an observed lab test ylj at value v as:

$$\lambda_{ljkv}\propto\left(\alpha_{k}+\tilde{n}_{.jk}+\tilde{m}_{.jk}^{-(l,j)}\right)\left(\frac{\zeta_{lv}+\tilde{m}_{lkv}^{-(l,j,v)}}{\sum_{v'}\zeta_{lv'}+\tilde{m}_{lkv'}^{-(l,j,v)}}\right)\left(\frac{a_{l}+\tilde{p}_{lk}^{-(l,j)}}{a_{l}+\tilde{p}_{lk}^{-(l,j)}+b_{l}+\tilde{q}_{lk}}\right)$$

(12)

For unobserved lab tests, we infer the joint distribution of latent topic and lab result \(\pi_{ljkv}\propto\mathbb{E}_{q}[\log p(h_{lj}=k,y_{lj}=v\mid\mathbf{x},\mathbf{y},\mathbf{r})]\):

$$\pi_{ljkv}\propto\left(\alpha_{k}+\tilde{n}_{.jk}+\tilde{m}_{.jk}^{-(l,j)}\right)\left(\frac{\zeta_{lv}+\tilde{m}_{lkv}^{-(l,j,v)}}{\sum_{v'}\zeta_{lv'}+\tilde{m}_{lkv'}^{-(l,j,v)}}\right)\left(\frac{b_{l}+\tilde{q}_{lk}^{-(l,j)}}{a_{l}+\tilde{p}_{lk}+b_{l}+\tilde{q}_{lk}^{-(l,j)}}\right)$$

(13)

where the superscript −(i, j) indicates the exclusion of variable indices i, j. The sufficient statistics in the above inference have closed-form expressions conditioned on the other latent variables:

$$\tilde{n}_{.jk}^{-(i,j)}=\sum_{t=1}^{T}\sum_{i'\ne i}^{M_{j}^{(t)}}\gamma_{i'jk},\quad \left[\tilde{n}_{x_{ij}^{(t)}.k}^{(t)}\right]^{-(i,j)}=\sum_{j'\ne j}^{D}\sum_{i=1}^{M_{j'}^{(t)}}[x_{ij'}^{(t)}=x_{ij}^{(t)}]\,\gamma_{ij'k}^{(t)}$$

(14)

$$\tilde{m}_{.jk.}^{-(l,j)}=\sum_{l'\ne l}^{L}\sum_{v=1}^{V_{l'}}[r_{l'j}=1]\,y_{l'jv}\lambda_{l'jkv}+[r_{l'j}=0]\,\pi_{l'jkv}$$

(15)

$$\tilde{m}_{l.kv}^{-(l,j,v)}=\sum_{j=1}^{D}\sum_{l=1}^{L}\sum_{v'\ne v}^{V_{l}}[r_{lj}=1]\,y_{ljv'}\lambda_{ljkv'}+[r_{lj}=0]\,\pi_{ljkv'}$$

(16)

$$\tilde{p}_{lk}^{-(l,j)}=\sum_{j'\ne j}[r_{lj'}=1]\sum_{v}y_{lj'v}\lambda_{lj'kv},\quad \tilde{q}_{lk}^{-(l,j)}=\sum_{j'\ne j}[r_{lj'}=0]\sum_{v}\pi_{lj'kv}$$

(17)

Furthermore, we update the hyperparameters by maximizing the marginal likelihood under the variational expectations via empirical Bayes fixed-point updates14,43. For example, the update for βwt is

$$\beta_{wt}^{*}\leftarrow\frac{a_{\beta}-1+\beta_{wt}\left(\sum_{k}\Psi(\beta_{wt}+n_{w.k}^{(t)})-K W_{t}\,\Psi(\beta_{wt})\right)}{b_{\beta}+\sum_{k}\Psi(W_{t}\beta_{wt}+\sum_{w}n_{w.k}^{(t)})-K\,\Psi(W_{t}\beta_{wt})}$$

(18)

Other hyperparameter updates are similar. The learning algorithm therefore follows expectation–maximization (EM): the E-step infers the \(\gamma_{ijk}^{(t)}\), \(\lambda_{ljkv}\), and \(\pi_{ljkv}\); the M-step updates the above sufficient statistics of the model parameters and the hyperparameters. Details and a time complexity analysis are described in Supplementary Methods.
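
A minimal sketch of the resulting E-step for one non-lab data type, implementing a CVB0-style pass over Eq. (11) in NumPy, is given below. The function and variable names are our own, and the lab-side counts \(\tilde{m}\) are assumed to be folded into the patient-topic counts for brevity.

```python
import numpy as np

def cvb0_update(x, pat, gamma, alpha, beta, n_jk, n_wk):
    """One CVB0 sweep over the tokens of a single data type t (Eq. 11).

    x: (N,) feature id per token; pat: (N,) patient id per token
    gamma: (N, K) responsibilities; alpha: (K,); beta: (W,)
    n_jk: (D, K) expected patient-topic counts (the n + m term in Eq. 11)
    n_wk: (W, K) expected feature-topic counts
    """
    beta_sum = beta.sum()
    n_k = n_wk.sum(axis=0)                      # (K,) per-topic totals
    for i in range(len(x)):
        g = gamma[i]
        n_jk[pat[i]] -= g                       # exclude token i's own counts
        n_wk[x[i]] -= g                         # (the -(i, j) exclusion)
        n_k -= g
        new_g = (alpha + n_jk[pat[i]]) * (beta[x[i]] + n_wk[x[i]]) \
                / (beta_sum + n_k)
        new_g /= new_g.sum()                    # normalize over the K topics
        n_jk[pat[i]] += new_g                   # add back the updated counts
        n_wk[x[i]] += new_g
        n_k += new_g
        gamma[i] = new_g
    return gamma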

To learn MixEHR from massive-scale EHR data, we propose a stochastic collapsed variational Bayesian (SCVB) algorithm44,45. The main objective of our SCVB algorithm is to avoid keeping track of all of the latent topic assignments \(\gamma_{ijk}^{(t)}\) and \(\lambda_{ljkv}\) while maintaining accurate model inference. Specifically, we first calculate the intermediate variational updates from randomly sampled mini-batches of D′ patients:

$$\hat{n}_{w.k}^{(t)}=\sum_{j'=1}^{D'}\sum_{i=1}^{M_{j'}^{(t)}}[x_{ij'}^{(t)}=w]\,\gamma_{ij'k}^{(t)}$$

(19)

$$\hat{m}_{l.kv}=\sum_{j'=1}^{D'}\sum_{l=1}^{L}[r_{lj'}=1]\,y_{lj'v}\lambda_{lj'kv}+[r_{lj'}=0]\,\pi_{lj'kv}$$

(20)

We then update the latent disease topics using natural gradients, which are the intermediate updates weighted by a fixed learning rate:

$$\tilde{n}_{w.k}^{(t)}\leftarrow(1-\rho)\,\tilde{n}_{w.k}^{(t)}+\rho\,\hat{n}_{w.k}^{(t)}$$

(21)

$$\tilde{m}_{l.kv}\leftarrow(1-\rho)\,\tilde{m}_{l.kv}+\rho\,\hat{m}_{l.kv}$$

(22)

Details are described in Supplementary Methods. We observed that the SCVB algorithm achieves performance comparable to full-batch learning in a fraction of the time and with constant memory (Supplementary Fig. 2).
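
One SCVB step can be sketched as follows. The rescaling of the mini-batch statistic to the full corpus size is standard in stochastic variational inference and is stated here as an assumption, as are the function and variable names.

```python
import numpy as np

def scvb_step(minibatch, gamma_fn, n_wk, D, rho=0.01):
    """One stochastic update of the global feature-topic statistics.

    minibatch: list of (N_j,) token feature-id arrays, one per sampled patient
    gamma_fn : function returning (N_j, K) responsibilities for a patient
    n_wk     : (W, K) global expected feature-topic counts (tilde-n)
    D        : total number of patients; rho: fixed learning rate
    """
    n_hat = np.zeros_like(n_wk)
    for tokens in minibatch:                 # Eq. (19): mini-batch statistic
        g = gamma_fn(tokens)                 # infer topic responsibilities
        np.add.at(n_hat, tokens, g)          # accumulate per-feature counts
    n_hat *= D / len(minibatch)              # rescale to corpus size (assumed)
    return (1.0 - rho) * n_wk + rho * n_hat  # Eq. (21): natural-gradient step
```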

To evaluate model learning and monitor empirical convergence, we performed fivefold cross-validation. For each patient in the validation fold, we randomly selected 50% of their EHR features to infer their disease mixtures and then used the other 50% of the features to evaluate the log predictive likelihood, a common metric for evaluating topic models4,14,44:

$$\sum_{j}\sum_{t,i}\log\sum_{k}\hat{\theta}_{jk}\hat{\phi}_{x_{ij}^{(t)}k}^{(t)}+\sum_{l}[r_{lj}=1]\log\sum_{k}\hat{\theta}_{jk}(\hat{\psi}_{lk}+\hat{\eta}_{lk y_{lj}})$$

(23)

where we inferred the variational expectations of the disease mixture and disease topics as:

$$\hat{\theta}_{jk}=\frac{\alpha_{k}+\tilde{n}_{jk}^{(.)}}{\sum_{k'}\alpha_{k'}+\tilde{n}_{jk'}^{(.)}},\quad \hat{\phi}_{wk}^{(t)}=\frac{\beta_{wt}+\tilde{n}_{wk}^{(t)}}{\sum_{w'}\beta_{w't}+\tilde{n}_{w'k}^{(t)}},\quad \hat{\psi}_{lk}=\frac{a_{l}+\tilde{p}_{lk}}{a_{l}+\tilde{p}_{lk}+b_{l}+\tilde{q}_{lk}},\quad \hat{\eta}_{lkv}=\frac{\zeta_{lv}+\tilde{m}_{lkv}}{\sum_{v'=1}^{V_{l}}\zeta_{lv'}+\tilde{m}_{lkv'}}$$

Sensible models should demonstrate improved predictive likelihood on held-out patients. We evaluated the predictive log likelihood of models with K ∈ {10, 25, 40, 50, 60, 75, 100, 125, 150} using the MIMIC-III data and ultimately set K = 75, as it gave the highest predictive likelihood (Supplementary Fig. 1). Nonetheless, models with K between 50 and 100 have similar performance.
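
Combining Eq. (23) with the variational expectations above, the held-out evaluation for a single patient can be sketched as follows; the variable names and array layouts are our own.

```python
import numpy as np

def predictive_loglik(x_heldout, theta_hat, phi_hat, lab_obs, y_val,
                      psi_hat, eta_hat):
    """Held-out log predictive likelihood of Eq. (23) for one patient j.

    x_heldout: dict {t: array of held-out feature ids for data type t}
    theta_hat: (K,) inferred disease mixture of the patient
    phi_hat  : dict {t: (W_t, K) feature-topic distributions}
    lab_obs  : (L,) 0/1 indicator of observed lab tests; y_val: (L,) result ids
    psi_hat  : (L, K) observation rates; eta_hat: (L, K, V) result distributions
    """
    ll = 0.0
    for t, xs in x_heldout.items():
        # sum_k theta_jk * phi_{x,k} for every held-out token of data type t
        ll += np.log(phi_hat[t][xs] @ theta_hat).sum()
    for l in np.flatnonzero(lab_obs):
        # observed lab tests contribute theta_jk * (psi_lk + eta_{lk,y})
        ll += np.log(theta_hat @ (psi_hat[l] + eta_hat[l, :, y_val[l]]))
    return ll
```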

To impute missing data in an individual-specific way, we here describe a k-nearest-neighbor approach. As illustrated in Supplementary Fig. 16a, the prediction can be divided into three steps:

  • Train MixEHR on the training set to learn the EHR-code-by-disease-topic matrices W across data types and infer the disease topic mixtures θtrain for each training patient;

  • To infer the probability of an unknown EHR code t for a test patient \(j'\), use MixEHR and the learnt disease topic matrices W to infer the disease topic mixture \(\theta_{j'}\) for the test patient;

  • Compare the test patient's disease topic mixture \(\theta_{j'}\) with the training patients' disease mixtures θtrain to find the k most similar training patients \(\mathcal{S}_{j'}\). Here, the patient–patient similarity is calculated based on the Euclidean distance between their disease topic mixtures:

    $$d_{j'j}=\sqrt{\sum_{k=1}^{K}\left(\theta_{j'k}-\theta_{jk}\right)^{2}}$$

    (24)

    where \(\theta_{j'k}\) and θjk are the mixture memberships of the test patient \(j'\) and training patient j, respectively. Finally, we take the average of the EHR code t over these k nearest-neighbor patients as the prediction of the target code t for test patient \(j'\):

    $$x_{tj'}=\frac{1}{k}\sum_{j\in\mathcal{S}_{j'}}x_{tj}$$

    (25)

We empirically determined the number of nearest neighbors k to be 100.
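
The three steps above amount to nearest-neighbor regression in topic space. A compact NumPy sketch, with our own variable names and a brute-force distance computation, is:

```python
import numpy as np

def knn_impute(theta_test, theta_train, x_train, code_t, k=100):
    """Predict EHR code `code_t` for one test patient (Eqs. 24-25).

    theta_test : (K,) inferred topic mixture of the test patient
    theta_train: (D, K) topic mixtures of the training patients
    x_train    : (D, C) binary EHR code matrix of the training patients
    """
    d = np.linalg.norm(theta_train - theta_test, axis=1)  # Eq. (24)
    nearest = np.argsort(d)[:k]                           # k most similar patients
    return x_train[nearest, code_t].mean()                # Eq. (25)
```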

To evaluate the prediction performance, we chose a set of EHR codes from each data category based on the following criteria. For clinical notes, we chose the key words recorded in more than 10 but fewer than 500 patients, to focus on disease-specific words with a sufficient number of patients. For the other data categories, we chose target codes that are observed in at least 100 patients but no more than 50% of the training cohort (i.e., ~19,000 patients). We then further selected the top 100 codes per data type in terms of the number of patients, except for the ICD-9 diagnostic codes, for which we evaluated prediction on each of the 499 codes in order to assess the performance differences by ICD-9 disease group (Supplementary Fig. 18). In total, we obtained 976 target EHR codes (Supplementary Table 4).

For unbiased evaluation, we performed a fivefold cross-validation. For training fold n, we trained MixEHR to learn the disease topic matrices Wtraining fold n and patient disease topic mixtures θtraining fold n. For the corresponding validation fold, we evaluated the prediction of each target code t for each test patient \(j'\) as follows. We first removed the target code t from the test patient \(j'\) if it was observed. We then inferred the disease mixture for the test patient \(j'\) and inferred the probability of target code t using the k-nearest-neighbor approach. We repeated this procedure for the five validation folds to obtain the predictions of each target EHR code.

For comparison, we evaluated the performance of a baseline topic model that ignores the distinct data types in the EHR data. Notably, such a model is essentially equivalent to LDA. To this end, we modified the training and validation datasets to have the same data type for all of the 53,432 EHR codes and gave each code a unique ID (i.e., 1 to 53,432). We then extracted the 976 target codes for fivefold cross-validation predictions. We ran MixEHR on this data-type-flattened data. We evaluated the predictions based on the area under the ROC curve (AUROC) and the area under the precision–recall curve (AUPRC), as well as the overall accuracy obtained by thresholding the prediction probability at 1/k, where k = 100 is the empirically determined number of nearest neighbors.
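
For reference, these metrics can be computed with scikit-learn roughly as follows, using average precision as the usual estimate of AUPRC; y_true and y_score are hypothetical arrays of binary target-code labels and k-NN prediction probabilities.

```python
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             accuracy_score)

def evaluate_code(y_true, y_score, k=100):
    auroc = roc_auc_score(y_true, y_score)              # AUROC
    auprc = average_precision_score(y_true, y_score)    # AUPRC estimate
    # overall accuracy after thresholding the prediction probability at 1/k
    acc = accuracy_score(y_true, (y_score >= 1.0 / k).astype(int))
    return auroc, auprc, acc
```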

For experiments comparing our method with GRAM32 and Doctor AI27, we used the open-source implementations made available by the authors at https://github.com/mp2893/gram and https://github.com/mp2893/doctorai, respectively. GRAM and Doctor AI were both set to predict the medical codes (as defined in their papers) in the next admission for MIMIC-III. This task is also known as sequential diagnosis prediction. The same Clinical Classification Software (CCS) medical codes used in the original implementations of both methods were used in the comparisons. Note that Doctor AI can also be used to predict the time to the next visit, but we did not compare our method with it in this respect. For both Doctor AI and GRAM, we used the default recommended settings provided in their code bases. Both models were implemented in Theano. To compare with Doctor AI and GRAM, we used Accuracy@20, as described in the GRAM paper, for different percentiles of medical code frequency.

In our comparison with Deep Patient, since there was no published code available, we implemented the stacked denoising autoencoder to the best of our abilities, closely following the steps described by the authors. We did not follow the elaborate data preprocessing pipeline used in the paper, as our dataset (MIMIC-III) is different from the one used in ref. 29. Our implementation had three layers of autoencoders with 500 hidden units (as in the original paper) and was coded using the Keras library. We used the Deep Patient representation as input to a classifier (i.e., logistic regression) to predict mortality and compared the accuracy with MixEHR in terms of the area under the precision–recall curve (AUPRC) and the area under the ROC curve (AUROC) (Supplementary Fig. 23).

The Conditioned Factored Restricted Boltzmann Machine (CF-RBM) was implemented by adapting the code from https://github.com/felipecruz/CFRBM. CF-RBM was originally designed for Netflix movie recommendation, where each movie has a five-star rating. We modified the code to work with two-state lab results (i.e., normal and abnormal). We used 100 hidden units, trained the CF-RBM on a randomly sampled 80% of the admissions, and tested it on the remaining 20%. We ensured that the training and test sets were the same for both CF-RBM and MixEHR. We evaluated the trained CF-RBM as follows. We iterated over the admissions in the test set and, for every lab test in each admission, masked its test result and made a prediction based on the remaining lab tests of the same admission. This procedure was repeated for every lab test in every admission. We then recorded the predicted and true lab results for evaluation purposes. The same evaluation procedure was also used for MixEHR.

SiCNMF was implemented using the Python code from the author's GitHub repository https://github.com/sgunasekar/SiCNMF19. The model was trained on the MIMIC-III dataset with the six data types as the six collective matrices until convergence.

Unless mentioned otherwise, all of the other unsupervised learning and classification methods such as PCA, random forest, elastic net, logistic regression, and LDA were implemented using the Scikit-Learn Python library.

The methods were performed in accordance with the PhysioNet data user agreement. All of the MIMIC-III data received simple and minimal preprocessing from the raw comma-separated value (CSV) files provided by the MIMIC-III critical care database (mimic.physionet.org). For all data types, we removed nonspecific EHR codes that were commonly observed among admissions, based on their inflection points (Supplementary Fig. 26). No further processing was performed unless mentioned otherwise. For clinical notes (NOTEEVENTS.csv), we used the R library tm46 with a simple pipeline: we filtered out common stop words, punctuation, numbers, and whitespace, and converted all of the remaining words to lowercase.

For lab test data, we used the FLAG column that indicates normal or abnormal level for the lab results. We counted for each admission the number of times the lab results were normal and abnormal. A patient at the same admission can exhibit both normal and abnormal states zero or more times for the same lab test. In this case, we recorded the frequency at each state for the same admission.

For prescriptions, we concatenated the DRUG, DRUG_NAME_GENERIC, GSN, and NDC columns to form a compound ID for each prescription. For the diagnosis-related group (DRG) codes (DRGCODES.csv), we concatenated several IDs, including DRG_TYPE, DRG_CODE, DESCRIPTION, DRG_SEVERITY, and DRG_MORTALITY, to form a compound ID for each DRG code. The identifiers for ICD-9 codes (ICD-9-CM) (DIAGNOSES_ICD.csv) and treatment procedure codes (PROCEDURES_ICD.csv) (ICD-9-CPT) were kept in their original forms. The resulting data are summarized in Supplementary Table 1.

The 187 patients (94 bipolar disorder cases and 93 controls) were selected from the Mayo Clinic Bipolar Disorder Biobank and the Mayo Clinic Biobank, respectively. They had consented for research, including research using their EHR data47,48. This study was reviewed and approved by the Mayo Clinic Institutional Review Board and by the access committees of both biobanks. The EHR information for each patient covers five categories: ICD-9 codes (2224 ICD-9-CM codes), procedure codes (2200 CPT codes), patient-provided information (955 text questions), lab tests (1742 lab test codes), and medication prescriptions (610 prescription order codes). In total, there are 7731 codes and 108,390 observations (Supplementary Table 5). For each case, one subject from the Mayo Clinic Biobank was matched on age and sex, after excluding subjects with psychiatric conditions in self-reports and/or EHR. Supplementary Table 6 summarizes the age-sex distribution of cases and controls.

The dataset was derived from the EHR-documented Quebec Congenital Heart Disease (CHD) database, with 84,498 patients and 28 years of follow-up from 1983 to 201049,50. The study uses administrative databases as data sources; no participants were enrolled. The study was approved by the Research Ethics Board of the McGill University Health Centre. For each patient, the history can be traced back to the later of birth or 1983, and followed up to the end of the study or the death of the patient, whichever came earlier. The dataset includes two data types: ICD-9 or ICD-10 billing codes and intervention codes. The raw dataset consists of individual records of each patient's medical visits. Each visit lasted less than a day; thus, the visits were not considered admissions and are categorized as outpatient data. Informed consent was obtained from all participants.

Using the time stamps of those visits, we aggregated the diagnosis and intervention codes by month into visit blocks. Each visit block thus represents the record of a patient in that time interval as a set of diagnoses and a set of interventions. Since the patients' visit history was recorded as sequential events over the 28 years, we represented each patient's complete medical history as a time series of diagnosis and intervention sets. The length of the sequence depends on the total number of months. We only kept the months in which a patient visited a medical institution at least once. The diagnoses were represented as four- or five-digit ICD-9 or ICD-10 codes. All of the codes were converted to ICD-9 using a board-approved conversion table. To generate the labels, we took the first three digits of each code. The processed data are summarized in Supplementary Table 3.

Among the five data categories, lab tests and patient-provided information (PPI) have two separate data features, unlike the other three binary data categories. Specifically, they consist of (1) the presence vs. absence of the data in the EHR (e.g., whether a lab test was done or whether a patient answered a particular PPI question) and (2) the actual value of the variable (i.e., the test result for a lab test, or the patient's answer to a PPI question). Given the small sample size, we chose to model the first data feature (whether the data exists in the EHR) to mimic the data structure used in the rest of the data categories, recognizing that we may miss important information by not analyzing the actual lab test results and/or patients' answers to particular PPI questions. We speculate that the missing pattern may be related to the disease classification and disease subgroups, especially for the lab codes. However, we note that the missing pattern for the PPI data might be strongly confounded by other nonclinical characteristics such as age, gender, and social determinants of health.

We used a 20-topic MixEHR to model the distinct distributions of these data types. Our preliminary analysis showed that more than 20 topics produced degenerate models. First, we sought to quantitatively examine whether the patient topic mixtures provide useful information for accurate classification of bipolar disorder. To this end, we divided the 187 patients into five folds (Fig. 4a). We trained MixEHR on four folds and then trained a logistic regression (LR) classifier on the patient topic mixtures of those folds to predict the bipolar disorder label. In this evaluation, we removed ICD-9 code category 296 from all patients, as it codes for bipolar disorder. We then applied the trained MixEHR and LR to predict the BD labels of patients in the validation fold based on their 20-topic MixEHR mixtures. As a comparison, we also applied two other unsupervised learning approaches, namely latent Dirichlet allocation (LDA) with 20 topics and a restricted Boltzmann machine (RBM) with 20 hidden units.

Because our current model is not longitudinal, we developed a pipeline that combines MixEHR topics with a recurrent neural network (RNN) with gated recurrent units (GRUs) (Fig. 6a). We first trained MixEHR on the EHR data of the ~39,000 patients with a single admission in MIMIC-III. We then used the trained MixEHR to infer the topic mixture at each admission for the 7541 patients with multiple admissions. We then used the inferred topic mixture at the current admission (say at time t) as input to the RNN to auto-regressively predict the diagnostic codes at the next admission at time t + 1. Here, MixEHR uses all of the data types from MIMIC-III. Specifically, we used a 128-unit dense layer to embed the input topic mixture before passing it to two 128-unit GRU layers, followed by a dense layer connected to the CCS medical code output classes. We set the L2 weight penalty to 0.001 and the dropout rate to 0.5 based on preliminary results and the existing literature.
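
A Keras sketch of this architecture is shown below. Only the layer sizes, L2 penalty, and dropout rate come from the description above; the topic dimension, sequence padding, activations, loss, and optimizer are our assumptions.

```python
from tensorflow.keras import layers, models, regularizers

K_TOPICS = 75   # MixEHR topic dimension (assumed; K = 75 was used for MIMIC-III)
N_CODES = 283   # CCS diagnosis groups used as output classes
T_MAX = 10      # padded number of admissions per patient (assumed)

reg = regularizers.l2(0.001)                 # L2 weight penalty from the text
model = models.Sequential([
    # embed the per-admission topic mixture with a 128-unit dense layer
    layers.Dense(128, activation='relu', kernel_regularizer=reg,
                 input_shape=(T_MAX, K_TOPICS)),
    layers.GRU(128, return_sequences=True, kernel_regularizer=reg),
    layers.Dropout(0.5),                     # dropout rate from the text
    layers.GRU(128, return_sequences=True, kernel_regularizer=reg),
    layers.Dropout(0.5),
    # per-time-step multi-label prediction of next-admission CCS codes
    layers.Dense(N_CODES, activation='sigmoid', kernel_regularizer=reg),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
```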

As a comparison, we ran one of the recently developed state-of-the-art deep-learning frameworks, the Graph-based Attention Model (GRAM)32. GRAM requires a knowledge graph, such as the CCS multi-level diagnosis hierarchy or the ICD-9 taxonomy, and at least two admissions for each patient. Therefore, we only trained GRAM on the 7541 patients' admissions using only their ICD-9 codes. For MixEHR+RNN and GRAM, we randomly split the multi-admission patients into 80% for training and 20% for testing. Following the evaluation performed by Choi et al., we prepared the true labels by grouping the ICD-9-CM codes into 283 groups using the Clinical Classification Software (CCS)32.

In terms of evaluating the prediction accuracy of specific medical codes, we found that all of the models, including GRAM and Doctor AI, do not perform well on low-frequency medical codes (median accuracies of 0 and mean accuracies below 25%), and the accuracy estimates are unstable for rare medical codes, depending on the training and testing split. Therefore, we focused on evaluating the predictions of the 42 medical codes each observed in at least 1000 training admissions. We then calculated the accuracy of the top 20 predictions for each CCS diagnosis code. For each medical code, we counted the number of true positives among the top 20 predicted codes for each admission and then divided it by the number of total positives.

In this database, each patient had on average 28 years of medical follow-up history (1983–2010). The input data have two data types, namely ICD-9 or ICD-10 codes and procedure intervention codes. We designed an RNN that takes as input both the raw EHR data and the multimodal topic mixture (over the two ICD and procedure data types) inferred by MixEHR at each visit block t, and predicts as outputs the first three digits of the diagnostic ICD codes at the next visit block t + 1 (Fig. 6c). Similar to the design of Doctor AI27, we defined a visit block as a 1-month interval that contains at least one visit from the patient. For instance, suppose a patient visits multiple times in January and does not visit again until May. We pool the data over those visits in January and ignore the months with no visits by that patient (i.e., February to April). We predict the diagnostic codes for the next visit block t + 1 in the subsequent month that contains at least one visit by that patient (i.e., May in this hypothetical example).
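
The visit-block construction can be sketched with pandas as below; the file name and the column names ('patient_id', 'date', 'icd9', 'proc') are hypothetical, and the exact aggregation used in this work may differ.

```python
import pandas as pd

# one row per medical visit, with hypothetical columns
visits = pd.read_csv('chd_visits.csv', parse_dates=['date'])

# pool all visits within the same calendar month into one visit block;
# months with no visits simply never appear in the grouped result
visits['block'] = visits['date'].dt.to_period('M')
blocks = (visits.groupby(['patient_id', 'block'])
                .agg(diagnoses=('icd9', lambda s: set(s.dropna())),
                     interventions=('proc', lambda s: set(s.dropna())))
                .reset_index()
                .sort_values(['patient_id', 'block']))

# labels: first three digits of each ICD-9 code in the *next* visit block
blocks['label'] = (blocks.groupby('patient_id')['diagnoses'].shift(-1)
                   .map(lambda d: {c[:3] for c in d}
                        if isinstance(d, set) else None))
```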

Here, we sought to assess whether MixEHR provides additional information, not captured by the fully connected dense layer of the RNN, in terms of code prediction improvement. For ease of reference, we call this approach MixEHR+RNN to distinguish it from the baseline RNN that takes only the raw EHR as input (i.e., Doctor AI). For the baseline RNN, we used a 100-dimensional hidden embedding (i.e., a fully connected dense layer) to encode the observed EHR codes in the input. For the MixEHR+RNN model, we used a 50-dimensional hidden embedding to encode the observed EHR codes, augmented with a 50-topic MixEHR mixture embedding (Supplementary Fig. 21). Therefore, both models have 100-dimensional latent embeddings to encode the input features. The rest of the architecture stays the same for both RNN models. Details are described in Supplementary Note.

For evaluation, we randomly split the 80,000 patients into 80% of the patients for training and 20% of the patients for testing. For MixEHR+RNN, we first trained MixEHR on all of the visit blocks of the training patients. Then, we trained an RNN that takes the raw input concatenated with the MixEHR-inferred 50-topic mixture at each visit block of the training patients to predict the diagnostic code at the next visit block. For testing, we applied the trained MixEHR+RNN and the baseline RNN to predict the diagnostic code in the next visit block of the test patients based on all of the previous visit block(s). For MixEHR+RNN and the baseline RNN models, we recorded the prediction accuracy across all visit blocks (excluding the first visit block) for each patient in terms of AUROC and AUPRC for each diagnostic code.

In total, there are 53,432 unique features/variables over the six data types in MIMIC-III. Among these, 564 variables are lab tests. We used all of the variables (notes, ICD, lab, prescriptions, DRG, CPT) across all six data types to learn the topic distribution from the training set and performed imputation of lab results on the testing set (Fig. 7a). For the observed lab tests, we again used the FLAG column to record the per-admission frequencies of normal and abnormal results, as described above.

The goal of our analysis is to impute missing lab test results based on all of the other information in the patient's admission. Therefore, we did not distinguish whether the admissions came from the same patients and focused on imputing lab results within the same admission. Most patients have only one admission in the MIMIC-III dataset: among the 46,522 patients, only 7541 (16%) have more than one admission. Therefore, the conclusion would likely remain the same if we were to aggregate information from multiple admissions of the same patient.

We randomly split the total 58,903 admissions into 80% for training and 20% for testing. For the training set, we trained our MixEHR to learn the topic distribution and the 50-topic mixture memberships for each training admission (Fig. 7a, step 1). We then iterated over the admissions in the test set. For every lab test observed in the test admission, we first masked its test result and imputed the test result as follows. We first inferred the 50-topic mixture memberships for each test admission using the topic distribution learned from the training set (Fig. 7a, step 2). We then found the 25 most similar admissions to the test admission from the training set based on the Euclidean distance of their topic mixture memberships (Fig. 7a, step 3). These 25 admissions must have the target lab test observed. We then took the average of the frequency over the 25 most similar training admissions as our imputed lab result for the target lab test in the test admission.

This procedure was repeated for every lab test in every test admission. We recorded the predicted and true lab results for evaluation purposes. The same procedure was used to evaluate the CF-RBM method. As a comparison, we trained a conditional factored restricted Boltzmann machine (CF-RBM)5 with 100 hidden units on the lab tests only. For MixEHR and CF-RBM, we used 80% of the admissions for training and 20% for testing. Here, again, we did not distinguish whether the admissions came from the same patients but rather focused on imputing lab results within the same admission.

To further compare our proposed approach, we took a simpler imputation strategy by averaging. For each lab test, we took the average of the observed lab results over the training admissions. We then used the average frequency as the predictions for the lab results on the 20% testing admissions.

We first trained MixEHR on the ~39,000 patients with single admissions to learn the disease topics. We used the trained MixEHR to infer the topic mixture of each admission for the patients who had more than one admission. We took the last two admissions that were within 6 months of each other, which gave us around 4000 multi-admission patients. We used the topic mixture inferred for the second-last admission as input to a classifier to predict the mortality outcome at the discharge of the last admission (i.e., as indicated by the EXPIRED flag in the discharge location of the last admission). To realistically evaluate our model, we filtered out the Discharge Summary from the clinical notes in all of the data (under the CATEGORY field of the notes) because discharge summaries usually contain conclusions about patients' critical conditions. We then performed fivefold cross-validation to evaluate the prediction accuracy. We experimented with K = 50 and 75 topics. As baseline models, we tested LDA on a flattened EHR matrix over all six data types, principal component analysis (PCA) with 50 principal components (PCs), and Sparsity-inducing Collective Non-negative Matrix Factorization (SiCNMF) on the same input data19. We ran SiCNMF with the default settings, with the lower rank set to 20. Lastly, we trained an elastic net directly on the raw input data from the ~4000 patients' second-last admissions to predict mortality in their last admissions upon discharge.
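
The supervised step can be sketched with scikit-learn as follows, using elastic-net-penalized logistic regression as the classifier on the topic mixtures; the input file names and hyperparameter values are hypothetical placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# theta: (N, K) MixEHR topic mixtures of each patient's second-last admission
# died : (N,) 1 if the EXPIRED flag is set at discharge of the last admission
theta = np.load('second_last_admission_topics.npy')   # hypothetical file
died = np.load('last_admission_mortality.npy')        # hypothetical file

# elastic-net penalty requires the saga solver in scikit-learn
clf = LogisticRegression(penalty='elasticnet', solver='saga',
                         l1_ratio=0.5, C=1.0, max_iter=5000)
auroc = cross_val_score(clf, theta, died, cv=5, scoring='roc_auc')
print(auroc.mean())
```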

Moreover, we designed another experiment that used the earliest/first admission to predict mortality in the last admission. We removed patients whose earliest admission lasted longer than 48 h, based on the difference between the admission and discharge times within the same admission. This gave us 4040 patients whose first in-hospital stay was shorter than 2 days. As above, we performed fivefold cross-validation on these patients, using the unsupervised models pre-trained on the single-admission patients' data to generate topic mixtures for the first admissions as input to a supervised classifier (i.e., elastic net), which then predicts mortality in the last admission upon discharge.

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The MIMIC-III data analyzed in this paper are publicly available through PhysioNet (http://mimic.physionet.org). Mayo Clinic and Quebec CHD data are not publicly accessible due to restricted user agreement.

Code availability

We implemented MixEHR in C++ as a standalone Unix command-line software using OpenMP for multi-threaded inference. It allows an arbitrary number of discrete data types and discrete states per EHR feature. The software is available as Supplementary Software 1 or at https://github.com/li-lab-mcgill/mixehr.

Notes

  1. A binomial distribution is used because each lab test can be performed multiple times on the same patient, which is not uncommon in real EHR data.

References

  1. Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3, 160035 (2016).

  2. Charles, D., Gabriel, M. & Furukawa, M. F. Adoption of electronic health record systems among US non-federal acute care hospitals: 2008-2012. ONC Data Brief 9, 1–9 (2013).

  3. Henry, J., Pylypchuk, Y., Searcy, T. & Patel, V. Adoption of electronic health record systems among US non-federal acute care hospitals: 2008-2015. ONC Data Brief 35, 1–9 (2016).

  4. Blei, D. M., Ng, A. Y. & Jordan, M. I. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003).

  5. Salakhutdinov, R., Mnih, A. & Hinton, G. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning (ed. Ghahramani, Z.) 791–798 (ACM Press, New York, 2007).

  6. Mnih, A. & Salakhutdinov, R. R. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems (eds Platt, J. C., Koller, D., Singer, Y. & Roweis, S. T.) 1257–1264 (MIT Press, 2008).

  7. Hernandez-Lobato, J. M., Houlsby, N. & Ghahramani, Z. Probabilistic matrix factorization with non-random missing data. In Proceedings of the 31st International Conference on Machine Learning (eds Xing, E. P. & Jebara, T. S.) 32, 1512–1520 (JMLR, 2014).

  8. Marlin, B. M. & Zemel, R. S. Collaborative prediction and ranking with non-random missing data. In Proceedings of the Third ACM Conference on Recommender Systems (eds Burke, R., Felfernig, A. & Schmidt-Thieme, L.) 5–12 (ACM Press, New York, 2009).

  9. Fraser, G. & Yan, R. Collaborative filtering and the missing at random assumption. Epidemiology 18, 1–9 (2016).

  10. Mcauliffe, J. D. & Blei, D. M. Supervised topic models. In Advances in Neural Information Processing Systems (eds Platt, J. C., Koller, D., Singer, Y. & Roweis, S. T.) Vol. 20, 121–128 (Curran Associates, Inc., 2008).

  11. Blei, D. M. Probabilistic topic models. Commun. ACM 55, 77–84 (2012).

  12. Griffiths, T. L. & Steyvers, M. Finding scientific topics. Proc. Natl Acad. Sci. USA 101 (Suppl 1), 5228–5235 (2004).

  13. Teh, Y. W., Newman, D. & Welling, M. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems (eds Schölkopf, B., Platt, J. C. & Hoffman, T.) Vol. 19, 1353–1360 (MIT Press, 2007).

  14. Asuncion, A., Welling, M., Smyth, P. & Teh, Y. W. On smoothing and inference for topic models. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI '09) (eds Bilmes, J. & Ng, A.) 27–34 (AUAI Press, Arlington, VA, 2009).

  15. Little, R. J. A. & Rubin, D. B. Statistical Analysis with Missing Data (John Wiley & Sons, 2014).

  16. Halpern, Y., Horng, S. & Sontag, D. In Proceedings of the 1st Machine Learning for Healthcare Conference (eds Doshi-Velez, F., Fackler, J., Kale, D., Wallace, B. & Wiens, J.) PMLR 56, 209–225 (2016).

  17. Joshi, S., Gunasekar, S., Sontag, D. & Ghosh, J. Identifiable phenotyping using constrained non-negative matrix factorization. In Proceedings of the 1st Machine Learning for Healthcare Conference (eds Doshi-Velez, F., Fackler, J., Kale, D., Wallace, B. & Wiens, J.) PMLR 56, 17–41 (2016).

  18. Pivovarov, R. et al. Learning probabilistic phenotypes from heterogeneous EHR data. J. Biomed. Inform. 58, 156–165 (2015).

  19. Gunasekar, S. et al. Phenotyping using structured collective matrix factorization of multi-source EHR data. Preprint at https://arxiv.org/abs/1609.04466 (2016).

  20. Flaherty, P., Giaever, G., Kumm, J., Jordan, M. I. & Arkin, A. P. A latent variable model for chemogenomic profiling. Bioinformatics 21, 3286–3293 (2005).

  21. Zhao, J. et al. Detecting time-evolving phenotypic topics via tensor factorization on electronic health records: cardiovascular disease case study. J. Biomed. Inform. 98, 103270 (2019).

  22. Wang, Y. et al. Unsupervised machine learning for the discovery of latent disease clusters and patient subgroups using electronic health records. J. Biomed. Inform. 102, 103364 (2020).

  23. Wang, L., Tong, L., Davis, D., Arnold, T. & Esposito, T. The application of unsupervised deep learning in predictive models using electronic health records. BMC Med. Res. Methodol. 20, 1–9 (2020).

  24. Razavian, N. & Sontag, D. Temporal convolutional neural networks for diagnosis from lab tests. Preprint at https://arxiv.org/abs/1511.07938 (2015).

  25. Cheng, Y., Wang, F., Zhang, P. & Hu, J. Risk prediction with electronic health records: a deep learning approach. In 2016 SIAM International Conference on Data Mining (eds Venkatasubramanian, S. & Meira, W.) 432–440 (SIAM, 2016).

  26. Lipton, Z. C., Kale, D. C., Elkan, C. & Wetzel, R. R. Learning to diagnose with LSTM recurrent neural networks. Preprint at https://arxiv.org/abs/1511.03677 (2015).

  27. Choi, E., Bahadori, M. T., Schuetz, A., Stewart, W. F. & Sun, J. Doctor AI: predicting clinical events via recurrent neural networks. JMLR Workshop Conf. Proc. 56, 301–318 (2016).

  28. Nguyen, P., Tran, T., Wickramasinghe, N. & Venkatesh, S. Deepr: a convolutional net for medical records. IEEE J. Biomed. Health Inform. 21, 22–30 (2017).

  29. Miotto, R., Li, L., Kidd, B. A. & Dudley, J. T. Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Sci. Rep. 6, 1–10 (2016).

  30. Suresh, H., Szolovits, P. & Ghassemi, M. The use of autoencoders for discovering patient phenotypes. Preprint at https://arxiv.org/abs/1703.07004 (2017).

  31. Rajkomar, A. et al. Scalable and accurate deep learning with electronic health records. npj Digital Med. 1, 18 (2018).

  32. Choi, E., Bahadori, M. T., Song, L., Stewart, W. F. & Sun, J. GRAM: graph-based attention model for healthcare representation learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (eds Matwin, S., Yu, S. & Farooq, F.) 787–795 (ACM Press, New York, NY, 2017).

  33. Osimani, A., Berger, A., Friedman, J., Porat-Katz, B. S. & Abarbanel, J. M. Neuropsychology of vitamin B12 deficiency in elderly dementia patients and control subjects. J. Geriatr. Psychiatry Neurol. 18, 33–38 (2005).

  34. van Buuren, S. & Groothuis-Oudshoorn, K. mice: multivariate imputation by chained equations in R. J. Stat. Softw. 45, 1–67 (2011).

  35. Choi, E. et al. Multi-layer representation learning for medical concepts. In Proceedings of the 22nd ACM SIGKDD International Conference (eds Aggarwal, C. & Smola, A.) 1495–1504 (ACM Press, New York, NY, 2016).

  36. Ho, J. C., Ghosh, J. & Sun, J. Marble: high-throughput phenotyping from electronic health records via sparse nonnegative tensor factorization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14) (eds Macskassy, S. & Perlich, C.) 115–124 (ACM Press, New York, NY, 2014).

  37. Wang, Y., Chen, R., Ghosh, J., Denny, J. C. & Kho, A. Rubik: knowledge guided tensor factorization and completion for health data analytics. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (eds Cao, L. & Zhang, C.) 1265–1274 (ACM Press, New York, NY, 2015).

  38. Schulam, P. & Saria, S. A framework for individualizing predictions of disease trajectories by exploiting multi-resolution structure. In Advances in Neural Information Processing Systems (eds Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M. & Garnett, R.) Vol. 28, 748–756 (Curran Associates, Inc., 2015).

  39. Schulam, P. & Saria, S. Integrative analysis using coupled latent variable models for individualizing prognoses. J. Mach. Learn. Res. 17, 1–35 (2016).

  40. Jordan, M. I., Ghahramani, Z., Jaakkola, T. S. & Saul, L. K. An introduction to variational methods for graphical models. In Learning in Graphical Models (ed. Heckerman, D.) 105–161 (Springer Netherlands, Dordrecht, 1998).

  41. Bishop, C. M. Pattern Recognition and Machine Learning. Information Science and Statistics (eds Jordan, M., Kleinberg, J. & Schölkopf, B.) 461–474 (Springer, 2006).

  42. Griffiths, T. L. & Steyvers, M. Finding scientific topics. Proc. Natl Acad. Sci. USA 101 (Suppl 1), 5228–5235 (2004).

  43. Minka, T. Estimating a Dirichlet distribution. Technical Report (MIT, 2000).

  44. Hoffman, M. D., Blei, D. M., Wang, C. & Paisley, J. W. Stochastic variational inference. J. Mach. Learn. Res. 14, 1303–1347 (2013).

  45. Foulds, J., Boyles, L., Dubois, C., Smyth, P. & Welling, M. Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (eds Grossman, R. & Uthurusamy, R.) 446–454 (ACM Press, New York, NY, 2013).

  46. Feinerer, I., Hornik, K. & Meyer, D. Text mining infrastructure in R. J. Stat. Softw. 25, 1–54 (2008).

  47. Frye, M. A. et al. Development of a bipolar disorder biobank: differential phenotyping for subsequent biomarker analyses. Int. J. Bipolar Disord. 3, 30 (2015).

  48. Olson, J. E. et al. The Mayo Clinic Biobank: a building block for individualized medicine. Mayo Clin. Proc. 88, 952–962 (2013).

  49. Marelli, A. J., Mackie, A. S., Ionescu-Ittu, R., Rahme, E. & Pilote, L. Congenital heart disease in the general population: changing prevalence and age distribution. Circulation 115, 163–172 (2007).

  50. Marelli, A. J. et al. Lifetime prevalence of congenital heart disease in the general population from 2000 to 2010. Circulation 130, 749–756 (2014).


Acknowledgements

Y.L. is supported by the Natural Sciences and Engineering Research Council (NSERC) Discovery Grant (RGPIN-2019-0621), the Fonds de recherche Nature et technologies (FRQNT) New Career award (NC-268592), and the Canada First Research Excellence Fund Healthy Brains for Healthy Life (HBHL) initiative New Investigator award (G249591). The funding for the Mayo Clinic bipolar disorder study is provided by the Marriott Foundation. Mayo Clinic Center is funded by Individualized Medicine. The Québec CHD Database is funded by Dr. Marelli's Canadian Institutes of Health Research Foundation Grant Award #35223. We thank Mathieu Blanchette for helpful comments on the paper.

Author information

Y.L. and M.K. conceived the initial idea. Y.L. developed and implemented the method, processed and analyzed the data, and wrote the initial draft of the paper. P.N. ran the experiments on EHR code prediction and mortality prediction using the MIMIC-III data. Y.W. and A.A.K.D. helped run part of the lab imputation experiments. T.O., J.M.B., J.E.O., and M.A.F. provided the Mayo Clinic dataset. E.R. performed the data extraction and de-identification of the EHR data for both the bipolar and control cohorts. W.L., Y.M., and Z.W. helped run the Mayo Clinic part of the experiments. T.O., J.M.B., and E.R. helped analyze the results. A.M. provided the data from the Quebec CHD Database. A.L. and L.G. processed the data. X.H.L. implemented the MixEHR+RNN model and ran the experiments on the Quebec CHD Database with help from Z.W. A.L. and A.M. helped interpret the results from the Quebec CHD Database. Y.A. and J.D. performed the initial data processing and helped interpret the initial results. M.K. edited the initial draft of the paper. Y.L. and M.K. supervised the project. All of the authors contributed to the final writing of the paper.

Correspondence to Yue Li or Manolis Kellis.

Ethics declarations

The authors declare no competing interests.

Additional information

Peer review information Nature Communications thanks Yanshan Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
