Interpretable automatic detection of incomplete hippocampal inversions using anatomical criteria

Incomplete Hippocampal Inversion (IHI) is an atypical anatomical pattern of the hippocampus that has been associated with several brain disorders (epilepsy, schizophrenia). IHI can be visually detected on coronal T1-weighted MRI images and can be absent, partial or complete (no IHI, partial IHI, IHI). However, visual evaluation is long and tedious, which justifies the need for an automatic method. In this paper, we propose what is, to the best of our knowledge, the first automatic method for IHI detection from T1-weighted MRI. The originality of our approach is that, instead of directly detecting IHI, we predict several anatomical criteria, each of which characterizes a particular anatomical feature of IHI, and which can ultimately be combined for IHI detection. Such individual criteria have the advantage of providing interpretable anatomical information regarding the morphological aspect of a given hippocampus. We relied on a large population of 2,008 participants from the IMAGEN study. The approach is general and can be used with different machine learning models. In this paper, we explored two different backbone models for the prediction: a linear method (ridge regression) and a deep convolutional neural network. We demonstrated that the interpretable, anatomically based prediction was at least as good as directly predicting the presence of IHI, while providing interpretable information to the clinician or neuroscientist. This approach may be applied to other diagnostic tasks which can be characterized radiologically by several anatomical features.


INTRODUCTION
Incomplete Hippocampal Inversions are found in around 20% of the general population and are more commonly observed in the left hemisphere (17.1% left hemisphere and 6.5% right hemisphere). 1 While their origin is not well understood, they are thought to occur during the pre-natal development of the temporal lobe. IHIs have been shown to have a higher prevalence in patients with epilepsy (30-50% of the population), 2-5 or schizophrenia. 6 This suggests that IHI might play a role in the development of several brain disorders, and more research is needed to investigate the association between IHI and other psychiatric or neurodevelopmental disorders.
However, IHIs need to be detected visually by a trained rater. This is a long and tedious task that could greatly benefit from an automated method. We are not aware of any such method. IHIs can be absent, partial or total and they are thus associated with a global three-class score (no IHI, partial IHI, total IHI). However, this global score does not account for the different anatomical characteristics of IHI and is likely to lack reproducibility from one laboratory to another. Thus, different authors have proposed to rate IHI using a set of criteria. 1, 3, 4 Specifically, Cury et al. 1 proposed a rating scale composed of five criteria/dimensions assessing the different characteristics of the hippocampus: verticality/roundness, medial positioning and characteristics of the neighbouring sulci (sulcal depth). Individual criteria can be summed to form an IHI score representing the IHI level of a given hippocampus. The IHI criteria allow for a more specific assessment of individual characteristics and make it possible to design a more interpretable automatic rating method.
The aim of this work was to develop an automatic method to detect IHI from anatomical MRI. More specifically, we propose to predict individual anatomical criteria in order to obtain an intrinsically interpretable rating. The predicted criteria are subsequently combined to detect IHI. We then aimed to assess whether this prediction strategy leads to performances at least on par with those obtained when directly predicting the IHI status.

Materials
We studied 2,008 participants from the IMAGEN study. 7 We included all participants with a T1-weighted anatomical MRI acquired at 3 Tesla. The average age at MRI was 14.5 years (range: 12.9-17.2). 51% of participants were female and 49% male; sex information was missing for one participant. Both the global three-class criterion (denoted as C0) corresponding to IHI detection and the individual interpretable criteria (denoted from C1 to C5) were assessed on all MRI images by trained raters. The sum of individual criteria was called the IHI score, denoted as SC. 1 Each of these criteria and scores was evaluated separately on the left (L) and on the right (R) hemisphere. Local ethics committees approved the study. Participants as well as their parents gave informed written consent.

Further author information: send correspondence to Lisa Hemforth (e-mail: hemforthl@gmail.com).

MRI pre-processing
We processed the MRI using the t1-volume pipeline implemented in Clinica. 8,9 This pipeline is a wrapper of the Segmentation, Run Dartel and Normalise to MNI Space routines implemented in SPM. First, the Unified Segmentation procedure 10 is used to simultaneously perform tissue segmentation, bias correction and spatial normalization of the input image. Next, a group template is created using DARTEL, an algorithm for diffeomorphic image registration, 11 from the participants' tissue probability maps on the native space, usually GM, WM and CSF tissues, obtained at the previous step. The DARTEL to MNI method 11 is then applied, providing a deformable registration of the native space images into the MNI space. We further cropped the gray-matter maps into a box of interest containing both hippocampi and the surrounding sulci.

Split between learning and testing set
We isolated 25% (502) of the participants to form a test set. We performed the split prior to running any analysis and only used the test set to evaluate results. This left a learning dataset of 1,506 participants (also used for model selection and hyperparameter optimization through cross-validation within these 1,506 participants). We stratified the split based on all IHI criteria as well as age, weight, height, sex, handedness and imaging centre. In practice, we performed 200 random splits and selected the one that minimised differences in distributions for all considered variables between the learning and test sets (based on a Kolmogorov-Smirnov test).
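The split selection described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the function names, the use of the Kolmogorov-Smirnov statistic (rather than a p-value) and the choice to aggregate across variables by taking the worst case are all assumptions, since the text does not specify how the per-variable comparisons were combined.

```python
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    for v in sorted(set(a) | set(b)):
        while i < len(a) and a[i] <= v:
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def select_split(covariates, test_fraction=0.25, n_splits=200, seed=0):
    """Draw n_splits random splits and keep the best-matched one.

    covariates: dict mapping a variable name to a list of numeric values,
    one per participant (hypothetical input format).
    Assumption: a split is scored by its largest KS statistic across variables.
    Returns (learn_idx, test_idx, score).
    """
    rng = random.Random(seed)
    n = len(next(iter(covariates.values())))
    n_test = round(n * test_fraction)
    best = None
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        test_idx, learn_idx = idx[:n_test], idx[n_test:]
        score = max(
            ks_statistic([v[k] for k in learn_idx], [v[k] for k in test_idx])
            for v in covariates.values()
        )
        if best is None or score < best[2]:
            best = (sorted(learn_idx), sorted(test_idx), score)
    return best
```

Categorical variables such as sex or imaging centre would need a numeric encoding, or a different test, before being passed to this sketch.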

Proposed approach
The core of our approach is to predict the individual anatomical criteria C1 to C5, in place of the global criterion C0. The individual predicted criteria are then combined to detect IHI. The first criterion (C1) assesses the verticality and roundness of the hippocampal body. The second criterion (C2) evaluates the verticality and depth of the collateral sulcus. The third criterion (C3) quantifies the medial position of the hippocampus. The fourth criterion (C4) is a binary score defining whether the subiculum is bulging upwards or not. However, we did not use this criterion because it leads to difficulties in human annotations and because it is normal in the overwhelming majority of participants (> 97% 1 ). The fifth criterion (C5) assesses whether any sulci of the fusiform gyrus exceed the level of the subiculum. Each of these criteria is rated on a 2-point scale, with 0.5-point steps for criteria 1 to 3 and 1-point steps for criterion 5. A schematic of these criteria, extracted from the paper by Claire Cury et al., can be found in Figure 1. We trained machine learning models to predict each score (C1, C2, C3, C5) separately. Our approach can work with various machine learning models as a backbone. In this work, we compared two models. First, we considered a linear model (ridge regression) and used cross-validation to estimate the best hyper-parameter. Next, we considered a convolutional neural network with five convolutional layers and three fully connected layers, denoted as Conv5-FC3 in the following, and implemented in ClinicaDL. 12,13 It was trained to perform regression using the mean squared error loss. The deep-learning model was trained over 50 epochs and we performed early stopping, i.e. the model with the lowest validation loss was used for further analysis. Note that none of these operations involved the test set, in order not to bias the results. Both the linear and deep learning models used as input the voxel-based gray-matter maps described in Section 2.2.
Finally, we summed the predictions for the individual criteria and the result was denoted as SC^add_L (resp. SC^add_R) for the left (resp. right) hemisphere.
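To make the linear variant of this pipeline concrete, the sketch below fits one ridge regression per criterion, selects the regularisation strength by cross-validation and sums the per-criterion predictions into a single score. It is a toy illustration under stated assumptions: the function names, the fold layout and the grid of candidate strengths are hypothetical, and the closed-form solver is only practical for a handful of features. Voxel-based maps have far more features than participants and would call for a dedicated library solver.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def ridge_fit(X, y, alpha):
    """Closed-form ridge on centred data: w = (X^T X + alpha I)^-1 X^T y."""
    n, d = len(X), len(X[0])
    mx = [sum(row[j] for row in X) / n for j in range(d)]
    my = sum(y) / n
    Xc = [[row[j] - mx[j] for j in range(d)] for row in X]
    yc = [v - my for v in y]
    A = [[sum(r[i] * r[j] for r in Xc) + (alpha if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(d)]
    w = solve(A, b)
    intercept = my - sum(wj * mj for wj, mj in zip(w, mx))
    return w, intercept

def ridge_predict(X, w, intercept):
    return [intercept + sum(wj * xj for wj, xj in zip(w, row)) for row in X]

def select_alpha(X, y, alphas, k=5):
    """Return the alpha with the lowest mean squared error over k interleaved folds."""
    n = len(X)
    folds = [list(range(i, n, k)) for i in range(k)]
    best_alpha, best_mse = None, float("inf")
    for alpha in alphas:
        sq_err = 0.0
        for fold in folds:
            held = set(fold)
            train = [i for i in range(n) if i not in held]
            w, b0 = ridge_fit([X[i] for i in train], [y[i] for i in train], alpha)
            preds = ridge_predict([X[i] for i in fold], w, b0)
            sq_err += sum((p - y[i]) ** 2 for p, i in zip(preds, fold))
        mse = sq_err / n
        if mse < best_mse:
            best_alpha, best_mse = alpha, mse
    return best_alpha

def predict_summed_score(X_train, targets, X_new, alphas):
    """Fit one ridge model per criterion and sum the predictions.

    targets: dict mapping a criterion name (e.g. "C1") to its ratings.
    """
    total = [0.0] * len(X_new)
    for y in targets.values():
        alpha = select_alpha(X_train, y, alphas)
        w, b0 = ridge_fit(X_train, y, alpha)
        for i, p in enumerate(ridge_predict(X_new, w, b0)):
            total[i] += p
    return total
```

In this sketch each hemisphere would be handled by a separate call, with its own ratings for C1, C2, C3 and C5.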

Comparison to the global criterion
For comparison, we trained the same models to predict the global criterion C0. To derive C0 from the continuous IHI score, we searched for the optimal thresholds that discretize this score into the three classes of C0. To that purpose, we iteratively searched for the thresholds on SC that gave the most accurate classification for C0 (as measured by balanced accuracy). Absence of IHI (C0=0) corresponds to IHI scores below 2.25, partial IHI (C0=0.5) to scores between 2.25 and 4.25, and total IHI (C0=1) to scores above 4.25. Note that we only used the learning set and the manually-obtained SC to compute these thresholds, in order not to bias the results. Finally, we also compared our approach to a direct prediction of SC.
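The threshold search can be illustrated with the short sketch below. It is one possible reading of the procedure, not the study's implementation: the exhaustive search over pairs of candidates, the use of midpoints between consecutive observed scores as candidates and the handling of scores exactly equal to a threshold are assumptions.

```python
from itertools import combinations

def discretize(score, t_low, t_high):
    """Map a continuous IHI score to C0: 0 (no IHI), 0.5 (partial), 1 (total).

    Assumption: a score exactly equal to a threshold goes to the higher class.
    """
    if score < t_low:
        return 0.0
    if score < t_high:
        return 0.5
    return 1.0

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def search_thresholds(scores, c0):
    """Return (t_low, t_high, balanced accuracy) for the best threshold pair.

    Candidates are midpoints between consecutive distinct scores (assumption).
    """
    values = sorted(set(scores))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = (None, None, -1.0)
    for t_low, t_high in combinations(candidates, 2):
        pred = [discretize(s, t_low, t_high) for s in scores]
        acc = balanced_accuracy(c0, pred)
        if acc > best[2]:
            best = (t_low, t_high, acc)
    return best
```

Once the thresholds are fixed on the learning set, the same `discretize` call can be applied to a predicted score, whether it comes from summed criteria or from a direct prediction of SC.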

Performance metrics and statistical analysis
To compare the prediction to the ground truth, we used 1) a quadratic weighted kappa score for the discretized SC, the global criterion C0 and the individual criteria C1, C2, C3 because these are discrete ordered variables; 2) a non-weighted kappa for C5 as it is a binary variable; 3) the intraclass correlation coefficient (ICC) for the continuous SC. We performed a bootstrap on the isolated test set, using 1,000 iterations of 502 samples, the same size as the isolated test set. From this, we deduced the mean kappa/weighted kappa/ICC score along with the standard error.
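As an illustration of these metrics, the sketch below computes Cohen's kappa with quadratic weights (or without weights) and estimates its mean and standard error by bootstrap. The function names are hypothetical, and the sketch assumes that predictions have already been mapped to the discrete rating levels, a step the text does not detail. The standard error is taken as the standard deviation of the bootstrap distribution.

```python
import random

def weighted_kappa(y_true, y_pred, labels, quadratic=True):
    """Cohen's kappa with quadratic weights, or unweighted if quadratic=False.

    labels: the ordered rating levels, e.g. [0, 0.5, 1, 1.5, 2].
    """
    k = len(labels)
    pos = {lab: i for i, lab in enumerate(labels)}
    n = len(y_true)
    obs = [[0.0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        obs[pos[t]][pos[p]] += 1
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = float((i - j) ** 2) if quadratic else float(i != j)
            num += w * obs[i][j]               # observed weighted disagreement
            den += w * row[i] * col[j] / n     # disagreement expected by chance
    return 1.0 - num / den if den else float("nan")

def bootstrap_kappa(y_true, y_pred, labels, n_iter=1000, seed=0, quadratic=True):
    """Resample with replacement (same size as the data); return (mean, standard error)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]
        s = weighted_kappa([y_true[i] for i in idx], [y_pred[i] for i in idx],
                           labels, quadratic)
        if s == s:  # skip undefined values from degenerate resamples
            stats.append(s)
    mean = sum(stats) / len(stats)
    se = (sum((s - mean) ** 2 for s in stats) / (len(stats) - 1)) ** 0.5
    return mean, se
```

With a test set of 502 participants, each resample in this sketch also contains 502 draws, matching the setting described above.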

Visual analysis
For the ridge regression, we extracted a weight map, i.e. a 3D image showing the weight attributed to each voxel. We computed a saliency map 14 from the Conv5-FC3 models using the implementation provided in ClinicaDL. 13 To visualise which regions contribute most to the models' decisions, we only show the 100 voxels with the highest values overlaid on a T1 MRI image.

RESULTS
Table 1 presents the performance obtained for the prediction of each individual criterion, when computing the sum of the predictions for C1, C2, C3 and C5 to obtain the IHI score, and when predicting the IHI score directly. The deep learning model systematically achieved higher performance than ridge regression. The difference ranged from small (about 0.05 points) to large (about 0.25 points). Kappa and ICC scores were systematically lower in the right hemisphere, which we attribute to the lower number of IHIs on this side. The fifth criterion was predicted with greater difficulty due to its unbalanced nature 1 (right side: 85%, 6%, 9%; left side: 59%, 20%, 20%). When summed, predictions obtained from individual criteria produced results comparable to those of directly predicting the IHI score. Table 2 displays results for the prediction of C0, either directly, or through thresholding the sum of the predictions of the individual criteria, or through thresholding the prediction of the IHI score (SC). Overall, the performance of the proposed approach (predicting interpretable individual scores) was comparable to the direct prediction of C0. For ridge regression, there were even some cases where the results were substantially better.

Visual interpretation
Weight maps and saliency maps can be seen in Figure 2. We noticed that both models seem to rely on voxels located in the hippocampus and the surrounding gyri. The hippocampus seems to be outlined for C1 and C3 while the gyri are highlighted for C2 and C5. The saliency maps are slightly less clear and exhibit sparser results. However, a similar tendency may be observed.

CONCLUSION
In this paper, we have proposed to automatically detect IHI by predicting anatomically interpretable individual criteria. We showed that this approach does not decrease predictive performance compared to directly predicting the presence of IHI. Predicting individual criteria provides more information about the specific anatomical characteristics underlying the IHI of a given participant, and this information is directly interpretable by the clinician or neuroscientist. This training strategy has the potential to be applied to other diagnostic tasks which can be characterized by individual interpretable criteria.