The initial literature search identified 1530 studies across the 4 databases, and a further 14 studies were identified following iterative review of references (Fig. 1). After removal of duplicates, 1336 studies remained. Of these, 35 studies met the inclusion criteria for downstream analysis (Tables 1, 2 and 3). Four of these studies did not report sensitivity and specificity and were therefore included in the qualitative synthesis only15,16,17,18.
The results of the QUADAS-2 tool are provided in Fig. 2 and Supplemental Fig. S1. Eight studies were found to have a high risk of bias in at least one of the 7 domains2,16,21,22,26,28,30,35. Within domain 1, 11% of studies were found to have high risk of bias, 26% low risk of bias, and 63% unclear risk of bias. Within domain 2, just 1 study was found to have high risk of bias, 43% low risk and 54% unclear risk. Within domain 3, 71% of studies were found to have a low risk of bias and 29% an unclear risk. In domain 4, 69% had low risk and 31% had unclear risk of bias.
Four broad categories of methodologies were identified in POC detection of oral potentially malignant and malignant disorders: (1) classification based on clinical photographs (n = 11)2,19,20,21,22,23,25,26,27,28,29; (2) in vivo imaging using intra-oral optical imaging techniques (n = 18)15,17,30,31,33,34,35,37,38,39,40,41,42,43,44,45,50; (3) thermal imaging (n = 1)16; (4) analysis of volatile organic compounds (VOCs) from breath samples (n = 5)18,46,47,48,49. Just 8 studies were published before 201515,34,37,38,44,48,49,50. The majority of studies provided data on classification of OSCC vs healthy (n = 13)16,18,19,23,31,33,38,42,43,46,47,48,49, 8 studies provided data on OSCC/OPMD vs healthy25,26,28,30,37,39,40,41, 6 on OSCC/OPMD vs benign lesions15,17,21,35,36,50, 3 on OSCC vs benign29,34,44, 2 on OSCC vs other (healthy, benign and OPMD)2,45, 1 on OSCC/OPMD vs benign/healthy20, 1 on OPMD vs healthy27, and 1 on OPMD vs benign22.
Given sample heterogeneity, as indicated by forest plots (Supplementary Fig. S2) of univariate meta-analyses and quantitative measures of heterogeneity (sensitivity: τ² = 0.37, I² = 62%, p < 0.001; specificity: τ² = 0.70, I² = 84%, p < 0.001), a bivariate random-effects model for logit-transformed pairs of sensitivities and false positive rates was used to provide an estimate of diagnostic test performance. Across all studies, the pooled estimates for sensitivity and false positive rates (FPR) were 0.892 [95% CI 0.866–0.913] and 0.140 [95% CI 0.108–0.180], respectively. The AUC was 0.935 (partial AUC restricted to observed FPRs of 0.877), indicating excellent classifier performance (Table 4; Fig. 3, top left panel).
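As a rough illustration of pooling on the logit scale, the sketch below applies a univariate DerSimonian-Laird random-effects model to logit-transformed sensitivities. This is a deliberate simplification and not the model used in this review, which pooled sensitivity and false positive rate jointly in a bivariate model. The function name and the per-study counts in `studies` are hypothetical and are not extracted data.

```python
import math

def inv_logit(x):
    """Map a logit back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))

def pool_logit_sensitivity(studies):
    """DerSimonian-Laird random-effects pooling of logit sensitivities.

    studies: list of (true_positives, false_negatives), one pair per study.
    Returns (pooled_sensitivity, ci_low, ci_high, tau2).
    """
    if len(studies) < 2:
        raise ValueError("at least two studies are needed to estimate tau^2")

    ys, vs = [], []
    for tp, fn in studies:
        # A 0.5 continuity correction keeps the logit finite if a cell is zero.
        tp_c, fn_c = tp + 0.5, fn + 0.5
        ys.append(math.log(tp_c / fn_c))      # logit of sensitivity
        vs.append(1.0 / tp_c + 1.0 / fn_c)    # approximate variance of the logit

    w = [1.0 / v for v in vs]                 # fixed-effect (inverse-variance) weights
    y_fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, ys))  # Cochran's Q
    df = len(ys) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)             # between-study variance estimate

    w_re = [1.0 / (v + tau2) for v in vs]     # random-effects weights
    y_re = sum(wi * yi for wi, yi in zip(w_re, ys)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return (inv_logit(y_re),
            inv_logit(y_re - 1.96 * se),
            inv_logit(y_re + 1.96 * se),
            tau2)

# Hypothetical (TP, FN) counts for five studies; not data from this review.
studies = [(45, 5), (80, 12), (30, 8), (120, 10), (18, 1)]
pooled, lo, hi, tau2 = pool_logit_sensitivity(studies)
print(f"pooled sensitivity {pooled:.3f} [{lo:.3f}, {hi:.3f}], tau^2 = {tau2:.3f}")
```

Pooling on the logit scale keeps the back-transformed estimate and its confidence limits inside the interval (0, 1), which is one reason logit-transformed pairs are commonly used for diagnostic accuracy data.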
Graphic Display of Study Heterogeneity (GOSH) plots were used to further explore causes of heterogeneity in the extracted data through the application of unsupervised clustering algorithms to identify influential outliers (Supplemental Fig. S3). Four studies were found to contribute substantially to between-study heterogeneity with respect to sensitivity27,28,33,40, and a further 6 studies were identified as potentially influential with respect to specificity20,24,25,33,38,43,46. Exclusion of these studies from a univariate random-effects model of sensitivity (N = 27) and specificity (N = 24) reduced Higgins I² to 0.0% [0.0; 42.5] (τ² = 0.27, Q(26) = 24.99, p = 0.52) for sensitivity and to 60.8% [38.9; 74.8] (τ² = 0.39, Q(23) = 58.7, p < 0.0001) for specificity. A sensitivity analysis was thus performed with influential outliers excluded (Table 4). Although these analyses provide an indication of influential outlying studies, they do not inform on the likelihood of small study effects as a contributor to the identified heterogeneity.
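The heterogeneity statistic can be checked directly against the reported Q values, since Higgins' I² is defined from Cochran's Q and its degrees of freedom as max(0, (Q − df)/Q). The short sketch below (the function name is illustrative) applies that definition to the Q statistics reported above for the outlier-excluded models.

```python
def higgins_i2(q, df):
    """Higgins' I^2, as a percentage, from Cochran's Q and its degrees of freedom."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0

# Q statistics and degrees of freedom as reported for the outlier-excluded models
print(round(higgins_i2(24.99, 26), 1))  # sensitivity: 0.0
print(round(higgins_i2(58.7, 23), 1))   # specificity: 60.8
```

Because Q for sensitivity falls below its degrees of freedom, I² is truncated at zero; this means no heterogeneity beyond that expected by chance was detected, not that the studies are identical.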
Funnel plots, both for all studies and by subgroup, were initially used to investigate small study effects (Supplemental Fig. S4). These funnel plots themselves provide an indication of possible publication bias, with a number of studies demonstrating both a large effect size and a large standard error, and contour enhancement does appear to identify a scarcity of studies in zones of low significance. Egger’s linear regression test supported plot asymmetry within studies reporting on classical machine learning methods (Supplemental Table S2). These results should be interpreted with caution, however, as plot asymmetry alone is not pathognomonic of publication bias. To further investigate small study effects as a possible cause for this asymmetry, a bias-corrected estimate of the diagnostic odds ratio was determined using Duval and Tweedie’s Trim and Fill method, which aims to re-establish symmetry of the funnel plot by imputing ‘missing’ effects and so provide an adjusted diagnostic odds ratio that better reflects the true effect when all evidence is considered. This method did identify a reduction in effect size, particularly in studies reporting on classical machine learning methods in classification, in those examining the use of clinical photographs, and in those classifying OSCC vs Healthy. Inspection of the funnel plots for these categories (Supplemental Fig. S4) does appear to show an absence of studies within regions of low significance, supporting a conclusion that reporting bias may contribute to inflation of study effects in some subgroups.
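To make the logic of Egger's test concrete, the following sketch regresses the standardised effect (effect divided by its standard error) on precision (the reciprocal of the standard error) and returns the intercept, which sits near zero when the funnel is symmetric. It is an illustration only: the formal test also needs a t-test on the intercept, which is omitted here. The function name and the log diagnostic odds ratios are hypothetical and were constructed so that smaller studies report larger effects.

```python
def egger_intercept(effects, std_errors):
    """Intercept and slope of Egger's regression.

    Ordinary least squares of (effect / SE) on (1 / SE). An intercept far
    from zero suggests funnel plot asymmetry.
    """
    z = [y / se for y, se in zip(effects, std_errors)]
    prec = [1.0 / se for se in std_errors]
    n = len(z)
    mean_p = sum(prec) / n
    mean_z = sum(z) / n
    sxx = sum((p - mean_p) ** 2 for p in prec)
    sxy = sum((p - mean_p) * (zi - mean_z) for p, zi in zip(prec, z))
    slope = sxy / sxx
    intercept = mean_z - slope * mean_p
    return intercept, slope

# Hypothetical data: a true log DOR of 3.0 plus a bias that grows with the SE.
std_errors = [0.2, 0.3, 0.5, 0.8, 1.0]
effects = [3.0 + 1.5 * se for se in std_errors]

intercept, slope = egger_intercept(effects, std_errors)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

In this constructed example the intercept recovers the 1.5 bias term and the slope recovers the underlying effect of 3.0. With real data the relationship is noisy, which is why asymmetry alone cannot be taken as proof of publication bias.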
A comparison of algorithm performance according to methodology (clinical photographs, thermal imaging or analysis of volatile compounds), AI type (modern and classical), and lesion type (OSCC vs Healthy, OSCC/OPMD vs Benign, OSCC/OPMD vs Healthy) identified no differences in performance, as indicated by overlap in confidence regions on sROC curves (Fig. 3), showing uniformly high performance irrespective of group. Moreover, bivariate meta-regression found no significant differences in classification performance irrespective of methodology, AI type or lesion type (Table 4). A comparison of lesion types undergoing classification was limited to OSCC vs Healthy, OSCC/OPMD vs Benign, OSCC/OPMD vs Healthy, given the limited number of studies reporting results on other comparisons. Classification performance across subgroups was similar following exclusion of those studies identified as potentially influential.
Only 1 study reporting on the use of thermal imaging in oral cancer detection met the inclusion criteria16. In this study, Chakraborty et al. exploited Digital Infrared Thermal Imaging (DITI) as a non-invasive screening modality for oral cancer. Their detection process involves initial identification of left and right regions of interest (ROI) in infrared images captured using a FLIR T 650 SC long infrared camera. Rotation-invariant feature extraction was then performed on each ROI using a Gabor filter, the responses of which were used as input to a non-linear support vector machine (SVM) with a radial basis function (RBF) kernel. Fivefold cross-validation on a dataset of 81 malignant, 59 precancerous and 63 normal subjects yielded an overall accuracy of 84.72% in distinguishing between normal and malignant subjects.
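For readers unfamiliar with the RBF kernel, the sketch below shows the similarity measure it defines, exp(−γ‖x − y‖²), which is what allows an SVM to draw a non-linear decision boundary. The feature vectors and the value of `gamma` are hypothetical stand-ins for Gabor filter responses and are not taken from the study.

```python
import math

def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Hypothetical 3-element feature vectors standing in for Gabor filter responses.
roi_a = [0.12, 0.40, 0.33]
roi_b = [0.10, 0.45, 0.30]   # close to roi_a
roi_c = [0.90, 0.05, 0.70]   # far from roi_a

print(rbf_kernel(roi_a, roi_a, gamma=1.0))  # identical vectors give 1.0
print(rbf_kernel(roi_a, roi_b, gamma=1.0) > rbf_kernel(roi_a, roi_c, gamma=1.0))  # True
```

The kernel value decays from 1 towards 0 as two feature vectors move apart, with `gamma` controlling how quickly; larger values make the classifier more sensitive to local structure in the training data.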
Eighteen studies used various methods of optical imaging for in vivo detection of oral potentially malignant and malignant disorders15,30,31,33,34,35,36,37,38,39,40,41,42,43,44,45,50,51, 16 of which provided sufficient performance metrics for meta-analysis15. All studies were prospective in design. Estimates for sensitivity and false positive rate for this modality were 0.882 [95% CI 0.865–0.896] and 0.118 [0.112–0.197], respectively. AUC for the accompanying sROC curve (Fig. 3) was 0.914 (partial AUC of 0.867), again indicating good classifier performance. The majority of studies exploited perturbation in autofluorescence spectra in oral pathology as the principal method of detection. However, there was variation in the source and wavelengths of excitation (Table 2). With the exception of 11 studies (which used a support vector machine40,45, relevance vector machine38, quadratic discriminant analysis36,39,41,42, Mahalanobis distance43, linear discriminant analysis34,52, and decision tree37), the remaining studies demonstrated best performance using neural networks. In studies utilising ANN, data pre-processing was similar, involving some form of normalisation to standardise contrast and brightness, before introduction of a size-adjusted image according to the base architecture (Supplementary Data S1). The exceptions here were Chan et al., who instead utilised a Gabor filter or wavelet transformation from a redox ratio image of FAD and NADH to ultimately generate a feature map as input; Wang et al., who used partial least squares discriminant analysis on captured spectra to identify features as input; and de Veld et al., who utilised normalised autofluorescence spectra as input. Three studies used augmentation to increase the size of the training dataset for ANN30,33,51. In contrast, studies utilising classical ML techniques for classification were heavily reliant on manual region of interest (ROI) detection and manual feature extraction.
All studies with the exception of James et al. produced a series of spectral intensity-based features following normalisation as input for classification. James et al. instead adopted an ensemble approach, whereby object detection and feature extraction were automated using ANNs, before introduction into a support vector machine for classification. Best overall accuracy within the modern ML group was achieved by Chan et al. using Inception (accuracy of 93.3) to classify OSCC vs healthy, and best performance within the classic group was achieved by Kumar et al. (accuracy 99.3) using Mahalanobis distance in classification of OSCC vs healthy.
Uthoff et al. performed a field-testing study of new hardware developed specifically for intra-oral classification of benign and (pre-)malignant lesions. The device in question, designed to provide POC detection in low- and middle-income countries, comprises an intra-oral probe connecting to a standard, widely available smartphone that utilises six 405 nm LEDs for autofluorescence and four 4000 K LEDs for white light. Classification of autofluorescence spectra using a VGG-M architecture provided an accuracy of 86.88%, and AUC of 0.908. Song et al. also used a custom smartphone-based intra-oral visualisation system, exploiting six 405 nm LEDs for excitation. This approach, using a VGG-M architecture pretrained on ImageNet, yielded an accuracy of 86.9%, with sensitivity of 85.0% and specificity of 88.7%51. Other approaches for achieving autofluorescence in vivo included a xenon lamp with monochromator and spectrograph15, multispectral digital microscopy35, time-domain multispectral endogenous fluorescence lifetime imaging (FLIM)36, N2 laser38, confocal endomicroscopy (CFE)33, portable spectrophotometry37,50, and optical coherence tomography45. Notably, although in vivo and providing a prospect of POC detection, the confocal laser endomicroscopy approach taken by Aubreville et al. does require intravenous administration of fluorescein prior to imaging, and its utility as a POC detection tool may therefore be limited33. Both Huang et al. and Jeng et al. used the commercially available VELscope for autofluorescence imaging, though the two groups used different approaches to classification. Huang et al. determined the average intensity of the red, green and blue (RGB) channels and of the grayscale-converted image as input into quadratic discriminant analysis to distinguish between oral potentially malignant/malignant and healthy tissues, reporting a sensitivity and specificity of 0.92 and 0.98, respectively39.
While feature selection was similar to Huang’s group (extracting average intensity and standard deviation of intensity from grayscale-converted RGB images), Jeng et al. compared the performance of both linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA), reporting an optimal performance using QDA on normalised images of the tongue (sensitivity of 0.92, precision 0.86)41.
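Because the included studies report different combinations of sensitivity, specificity and precision, it may help to recall how each derives from the same 2 × 2 table of test results against the reference standard. The sketch below uses hypothetical counts (not data from any included study) and an illustrative function name.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Common diagnostic accuracy metrics from a 2 x 2 confusion table."""
    return {
        "sensitivity": tp / (tp + fn),          # diseased cases correctly flagged
        "specificity": tn / (tn + fp),          # healthy cases correctly cleared
        "precision": tp / (tp + fp),            # flagged cases that are truly diseased
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
    }

# Hypothetical counts: 50 diseased and 100 healthy subjects.
metrics = diagnostic_metrics(tp=46, fp=10, tn=90, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike sensitivity and specificity, precision depends on the proportion of diseased cases in the sample, so a study reporting sensitivity and precision is not directly comparable with one reporting sensitivity and specificity unless the underlying counts are available.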
Eleven of the 26 identified studies attempted diagnosis of oral potentially malignant or malignant disorders from clinical photographs19,20,21,22,23,24,25,26,27,28,29, all of which utilised deep learning through various neural network architectures for classification and were retrospective in design (Table 1). All studies using clinical photographs provided performance metrics amenable to meta-analysis. Sensitivity and false positive rate were estimated as 0.911 [95% CI 0.848–0.950] and 0.118 [95% CI 0.070–0.192], respectively, and AUROC was 0.952 (partial AUC of 0.90; Fig. 3). The source of images was variable between studies, with 4 studies using smartphone cameras as a potential easily implementable POC source of data20,24,25,26, 2 studies using heterogeneous images from various camera types19,21, 3 studies using images from search engines/repositories22,28,29, and 2 studies using high-resolution single-lens reflex (SLR) cameras23,27. Training and testing sample sizes varied between studies (Fig. 5), though 8 of the 11 studies used augmentation to enhance the size of the training set, including scaling, shearing, rotation, reflection, and translation19,20,23,24,25,26,27,28. With the exception of Fu et al. (who used the Single Shot MultiBox Detector (SSD) as a detection network) and Lin et al.24 (who used the automatic centre-cropping function of a smartphone grid), all remaining studies within this category depended upon manual ROI bounding, thus still requiring a degree of clinical expertise prior to feature extraction and classification. Best overall accuracy, of 99.28, was achieved by Warin et al.23 using DenseNet-161 (pretrained on ImageNet) in classification of OSCC from healthy.
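The augmentation operations mentioned above can be illustrated on a toy image. The sketch below implements reflection, a 90-degree rotation and translation on a small grid of pixel values stored as nested lists. It is a conceptual example only: the included studies used their own pipelines and parameters, and all function names here are hypothetical.

```python
def reflect_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def rotate_90_clockwise(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def translate_right(img, shift, fill=0):
    """Shift pixels right by `shift` columns, padding the left edge with `fill`."""
    return [[fill] * shift + row[:len(row) - shift] for row in img]

def augment(img):
    """Return the original image plus three transformed copies."""
    return [
        img,
        reflect_horizontal(img),
        rotate_90_clockwise(img),
        translate_right(img, 1),
    ]

# Toy 2 x 3 'image' of pixel intensities.
image = [[1, 2, 3],
         [4, 5, 6]]

for variant in augment(image):
    print(variant)
```

Each transformed copy keeps the label of the original photograph, so one annotated image yields several training examples. This is why reported training set sizes can differ substantially before and after augmentation.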
Fu et al. developed a two-stage process of classification, exploiting the Single Shot MultiBox Detector (SSD) as a detection convolutional neural network to initially define the region of interest, before binary classification using DenseNet, pretrained on ImageNet. In addition to demonstrating promising classification performance (AUROC 0.970), the developed deep learning algorithm also demonstrated superior performance in classification from clinical images compared to blinded non-medical professionals and post-graduate medical students majoring in oral and maxillofacial surgery (OMFS). Both identified studies by Welikala et al. adopted a smart phone-based approach, with a view to rapid POC detection of oral cancer in low and middle-income countries, as part of the Mobile Mouth Screening Anywhere (MeMoSA) initiative. A range of convolutional neural networks were trained on provided images, with best classification performance achieved through the VGG-19 architecture (Table 1). Both Tanriver et al. and Jeyaraj et al. attempted multiclass classification of either OSCC vs OPMD vs benign or normal vs benign vs malignant, respectively. Both used search engines and existing data repositories as the source of input data for classification (though Tanriver supplemented these using clinical photography within their unit). Transfer learning, with pretraining on ImageNet, performed best using the EfficientNet-b4 architecture in Tanriver et al., reporting an F1 of 0.86. Jeyaraj modified the Inception v3 architecture, and compared to a support vector machine and deep belief network, reporting a specificity of 0.98 and sensitivity of 0.94.
4 studies provided data on the use of an electronic nose as a POC device to detect malignancy-associated volatile compounds from exhaled breath (Table 3), all with the exception of Mentel et al. providing outcomes amenable to meta-analysis46,47,48,49. All studies were prospective in design. Pooled estimates for sensitivity and false positive rate were 0.863 [95% CI 0.764–0.924] and 0.238 [95% CI 0.142–0.372], and AUC was estimated at 0.889 (partial AUC of 0.827). All 4 studies utilised some form of portable electronic ‘nose’ (eNose) to detect volatile organic compounds in exhaled breath of either patients with a confirmed diagnosis of malignancy or healthy controls. Van der Goor et al. and Mohamed et al. used eNose devices with a combination of micro hotplate metal-oxide sensors to detect changes in conductivity with redox reactions of volatile organic compounds on heating. Leunis et al. instead analysed air samples using 4 sensor types (CH4, CO, NOx and Pt), and Hakim et al. used a device dependent upon spherical gold nanoparticles. Van der Goor et al. and Mohamed et al. both used tensor decomposition (Tucker3) to generate a single input vector for training of a neural network from the 64 × 36 datapoints generated per sensor, achieving sensitivities of 84% and 80%, and specificities of 80% and 77%, in detecting OSCC. Leunis et al. instead used logistic regression in binary classification, using measurements from only the NOx sensor to avoid collinearity. This achieved a sensitivity of 90% and specificity of 80%. Hakim et al. used Principal Component Analysis (PCA) for initial clustering, before training a linear support vector machine on principal components 1 and 2; this method achieved a sensitivity of 100% and specificity of 92%. Mentel et al. used a commercially available BreathSpect device for sample collection, using two-fold separation with gas chromatography and mass spectrometry to detect VOCs.
The output from the affiliated software is a 2-dimensional image representation of both VOC drift time and parts-per-billion. This output was used to train various classical machine learning algorithms (k-nearest neighbours, random forest, logistic regression and linear discriminant analysis), with best performance of an accuracy of 0.89 using logistic regression.
Several approaches to ML were used across the identified studies for the detection of oral potentially malignant and malignant disorders. For clarity, the hierarchical classification presented by Mahmood et al. is adopted here53. ML classification algorithms may be subdivided into modern techniques and classical techniques (Fig. 4). The majority of identified studies used supervised algorithms for classification (following feature selection where necessary), whereby the machine is trained on annotated data. The majority of studies reported best outcomes using various architectures of neural networks. All studies on analysis of photographic images used deep learning (neural networks with more than one hidden layer), with VGG neural networks the most popular architecture17,22,25,26,30,51. This is perhaps unsurprising since VGGNet was developed as an extension of the revolutionary AlexNet54,55.
Several studies compared multiple machine learning methods in classification. Shamim et al. used transfer learning with multiple convolutional neural networks pretrained on ImageNet, including AlexNet, GoogLeNet, VGG19, ResNet50, Inception v3 and SqueezeNet, achieving optimal performance using the VGG19 CNN with a sensitivity of 89% and specificity of 97%22. Welikala et al. compared VGG16, VGG19, Inception v3, ResNet50 and ResNet101, all pretrained on ImageNet and applied through transfer learning; VGG19 again provided the best detection of suspicious lesions from clinical images. Tanriver et al. found optimal performance using the EfficientNet-b4 architecture in clinical image classification.
Fifteen studies used “classical” ML algorithms. Roblyer et al. and Rahman et al. used linear discriminant analysis for classification of features extracted from autofluorescence images. Jo et al. and Huang et al. used quadratic discriminant analysis. Duran-Sierra et al. exploited an ensemble approach of both quadratic discriminant analysis and a support vector machine, demonstrating superior performance in classification of normalised ratios from autofluorescence images compared with either approach alone. Francisco et al. used decision trees, Chakraborty et al. and Hakim et al. used support vector machines, Majumder et al. used a relevance vector machine and Leunis et al. used logistic regression. James et al. also adopted an ensemble approach, employing ANN for feature extraction prior to a support vector machine for classification. Feature selection and reduction for input into classical machine learning algorithms was also achieved through a variety of methods, including Principal Component Analysis49, tensor decomposition46,47, Gabor feature extraction and discrete wavelet transformation31. The only study utilising an unsupervised machine learning approach for classification (rather than feature selection) was Kumar et al., who initially used PCA for dimensionality reduction before Mahalanobis distance classification of the first 11 identified principal components.
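Mahalanobis distance classification, as used after PCA in the last example above, can be sketched in two dimensions. The code below assigns a sample to whichever group it lies closest to, with distance scaled by that group's covariance. It is a simplified illustration and not a reconstruction of the published method: it uses two component scores where the study used 11, the scores are hypothetical, and the way the original study derived its reference distributions is not described here.

```python
import math

def mean_vector(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def covariance_2d(points, mean):
    """Sample covariance of 2-D points, returned as (var_x, cov_xy, var_y)."""
    n = len(points)
    sxx = sum((p[0] - mean[0]) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean[1]) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean[0]) * (p[1] - mean[1]) for p in points) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of x from a 2-D distribution."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, with the 2 x 2 inverse written out.
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

def classify(x, groups):
    """Return the label of the group nearest to x in Mahalanobis distance."""
    best_label, best_dist = None, float("inf")
    for label, points in groups.items():
        mu = mean_vector(points)
        dist = mahalanobis_2d(x, mu, covariance_2d(points, mu))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical scores on the first two principal components.
groups = {
    "healthy": [(1.0, 2.0), (1.5, 2.2), (0.8, 2.4), (1.2, 1.8)],
    "OSCC": [(4.0, 0.5), (4.5, 0.4), (3.8, 0.9), (4.3, 1.0)],
}

print(classify((1.1, 2.0), groups))  # healthy
print(classify((4.2, 0.6), groups))  # OSCC
```

Scaling by the covariance means that a deviation along a direction in which a group varies widely counts for less than the same deviation along a tightly constrained direction, which is the main difference from a plain Euclidean nearest-mean rule.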
Sample sizes for training and validation sets were hugely variable between studies. Test set sample size ranged from 5 per sample31 to 407933. An overview of training and test set sample sizes is provided in Fig. 5. Training sample sizes are estimates only, as some papers did not report total sample size post-augmentation, and so only the initial training sample size was recorded (and may therefore be underestimated). 16 of the 35 included studies did not report on software for implementation of machine learning methods. Of those using modern ML methods, 7 studies used the Keras application programming interface20,21,23,25,27,33,35, 2 used PyTorch, 1 used the Python Scikit-learn machine learning library, 2 studies used proprietary software accompanying the eNose46,47, and 1 study used the Deep Learning Toolbox and Parallel Learning Toolbox within MATLAB22. Within studies using classical ML methods, 3 studies used MATLAB34,43,45, 1 used Scikit-learn (Python), 1 used SPSS Statistics48, and 1 study used WEKA37.