We divided the 20 subjects into a source domain S, containing 16 subjects, and a target domain T, containing 4 subjects. In both S and T, we used the first half of each training sequence as training data and the second half as validation data.

Model performance

To quantitatively assess model performance and for statistical analysis, we use the following three image-based error measures that express the similarity of a predicted MR slice and the ground truth.


We compute the RMSE of two images, i.e., the predicted slice and the ground truth, as expressed in Eq. (1): we compute the voxel-wise intensity difference \(d_i\) and then take the root of the mean of the squared differences.

$$\begin{aligned} \text {RMSE} = \sqrt{\frac{1}{W \cdot H} \sum _{ i=1 }^{ W \cdot H } d_i^2 }, \end{aligned}$$


where W and H are the width and height of the images. It is common practice to report the RMSE in the evaluation of 4D MRI methods. However, the comparability of the measure across works is limited because different image normalization might be used. Moreover, this similarity measure does not differentiate between the appearance or presence of structures and the displacements of structures.
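Eq. (1) can be sketched in a few lines of NumPy. This is a minimal illustration only; the actual evaluation code and the image normalization applied beforehand are not part of this excerpt:

```python
import numpy as np

def rmse(pred: np.ndarray, label: np.ndarray) -> float:
    """RMSE over all W*H voxels of two 2D slices, as in Eq. (1):
    voxel-wise differences d_i, squared, averaged, then square-rooted."""
    d = pred.astype(np.float64) - label.astype(np.float64)
    return float(np.sqrt(np.mean(d ** 2)))
```

Note that, as discussed above, the value depends on the intensity normalization of the inputs, which limits comparability across works.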


We compute the MDISP by first performing a B-spline deformable registration using SimpleITK24 to obtain a dense deformation field between prediction and label. The resulting dense deformation field was then sampled with a \(16\times 16\) grid (\(8\times 8\) voxel spacing) within the liver to obtain displacement vectors. We then compute the average Euclidean norm of the displacement vectors in mm. We manually segmented all livers in the static volumes and used the segmentations as a mask to sample only within the liver. The parameterization of the deformable registration algorithm was empirically determined as follows.

ANTSNeighborhoodCorrelation (radius \(=\) 2) was used as the similarity measure. It visually yielded better registrations than MeanSquares, MattesMutualInformation, and Correlation. A pyramid scheme with two levels was utilized. In the first level, the images were smoothed with a sigma of 0.25 before halving their resolution using linear interpolation. In the second level, the original image was used with no smoothing. The grid size of the deformation mesh was \(4\times 4\) in the first level. It was doubled to \(8\times 8\) in the second level. A gradient descent optimizer (learning rate \(=\) 0.25, number of iterations \(=\) 20, convergence minimum value \(=\) \(10^{-7}\), convergence window size \(=\) 10, estimate learning rate \(=\) True, maximum step size in physical units \(=\) 0.25) was used.
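Assuming the dense deformation field produced by the registration above has been converted to a NumPy array, the MDISP sampling step can be sketched as follows. The function name, array layout, and the scalar `spacing_mm` are assumptions for illustration; the actual pipeline operates on SimpleITK objects:

```python
import numpy as np

def mdisp(field: np.ndarray, liver_mask: np.ndarray,
          spacing_mm: float = 1.0, step: int = 8) -> float:
    """Mean displacement in mm: sample a dense 2D deformation field
    (H, W, 2) on a regular grid with `step`-voxel spacing, keep only
    samples inside the liver mask, and average the Euclidean norms."""
    ys = np.arange(0, field.shape[0], step)
    xs = np.arange(0, field.shape[1], step)
    gy, gx = np.meshgrid(ys, xs, indexing="ij")
    inside = liver_mask[gy, gx].astype(bool)
    vecs = field[gy[inside], gx[inside]]   # (n, 2) displacement vectors
    norms = np.linalg.norm(vecs, axis=-1)  # Euclidean norm per sample
    return float(norms.mean() * spacing_mm)
```

With `step=8` this reproduces the \(8\times 8\) voxel sampling spacing described above; restricting samples to the liver mask corresponds to the manual liver segmentations.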

The MDISP is a better measure for comparison across works than the RMSE because the displacement of structures is independent of image normalization. However, the displacement field between a generated image and the ground truth is not always well defined, for example, when the prediction contains structures not present in the label or, vice versa, when structures are missing. An extreme example is an empty prediction, which would lead to an MDISP of zero and, of course, would not reflect the actual similarity.


To alleviate some of the shortcomings of RMSE and MDISP, we propose a new measure: the deformation-normalized root mean squared error (DN_RMSE). It computes the RMSE after the prediction is deformably registered to the label. Thus, DN_RMSE measures similarity purely based on appearance rather than deformation or displacement, and it can be used to better interpret small MDISP values. Like MDISP, DN_RMSE taken by itself is not conclusive. However, combined with MDISP, it aids in a better comparison of generated images within one work.
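A minimal sketch of DN_RMSE, assuming the deformation field from the registration step is given as a NumPy array; nearest-neighbour warping is used here purely for brevity, whereas the actual implementation resamples via the registration framework:

```python
import numpy as np

def dn_rmse(pred: np.ndarray, label: np.ndarray, field: np.ndarray) -> float:
    """Deformation-normalized RMSE: warp the prediction with the
    deformation field (H, W, 2) obtained from registering prediction
    to label, then compute the RMSE against the label."""
    h, w = label.shape
    gy, gx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    sy = np.clip(np.rint(gy + field[..., 0]).astype(int), 0, h - 1)
    sx = np.clip(np.rint(gx + field[..., 1]).astype(int), 0, w - 1)
    warped = pred[sy, sx]  # prediction resampled onto the label grid
    d = warped.astype(np.float64) - label.astype(np.float64)
    return float(np.sqrt(np.mean(d ** 2)))
```

Because the deformation is factored out before the intensity comparison, a small DN_RMSE together with a small MDISP indicates agreement in both appearance and displacement.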

Figure 3

Navigators show considerable variance in anatomy and SNR, as illustrated by four source domain subjects (S0 to S3) and four target domain subjects (T0 to T3). The violin plot (right) shows the prediction error of a pre-trained model in the source domain (S) and the target domain (T).

Domain shift

In this study, the term domain shift is used in a general way: it refers to the situation in which the data distribution of the training set differs from that of the test set, leading to a decrease in model performance. We argue that in clinical settings the quantity of available training data is limited and that a new subject is therefore likely not adequately represented by the training set distribution. This inadequate representation of the new subject can be considered a domain shift. In our case, a small training distribution does not faithfully represent the following variations: liver shape and size, body height, abdominal girth and, consequently, SNR (signal-to-noise ratio), body fat, sex, and age. This list might not be exhaustive. A tabular comparison of these aspects between the source and target domains can be found in Supplementary Table S1. To ensure anonymity, only min, max, and mean values are reported. The liver shape is approximated as the extent along the three orthogonal directions SI (superior–inferior), AP (anterior–posterior), and LR (left–right). Most values have a wide range between minimum and maximum. For example, the body height ranges from 160 to 220 cm, the body weight from 54 to 112 kg, and the liver volume from \({1182}\) to \({2435}\,\textrm{cm}^3\). The liver extent also has wide ranges in all three orientations (SI, AP, LR). A comparison of the different liver shapes and apparent SNR between source and target domain is given in Fig. 3. It is likely that the 16 source subjects do not faithfully represent the distribution of all these factors over such wide ranges.

Recall that \({\textbf {M}}^{24}_{pre}\) is a model pre-trained on all 16 subjects from the source domain S, using \({24}\,\textrm{min}\) worth of training samples per subject. Of course, it would be best if it could be applied to a new subject \({\textbf {t}} \in {\textbf {T}}\) directly and without any adaptation. However, this requires that no domain shift is present between S and T. To test this, we compare the domains in two ways. First, the performance of \({\textbf {M}}^{24}_{pre}\) is compared between validation data (from S) and test data (T) using the MDISP and DN_RMSE. To that end, we randomly chose 50% of test samples from the first 10 seconds of the second half of each training sequence, i.e., for each subject (in S and T) and slice position. We then computed both similarity measures for all predictions of the test samples. Second, the anatomical variance was assessed visually using the navigator frames. We visualize the MDISP and DN_RMSE distributions in a violin plot (see Fig. 3). The violin plots show non-normal distributions with different means. Because a Shapiro–Wilk test (n \(=\) 4000) and a Kolmogorov–Smirnov test also showed that the distributions are not normally distributed (p < 0.001), we used a Wilcoxon rank sum test (m \(=\) 3040, n \(=\) 12,352) to test for significance of the distribution shift. The null hypothesis of no shift in error distribution was rejected at a significance level of p < 0.001. The means of MDISP and DN_RMSE are 0.30 and 1.29 in S and 0.49 and 2.06 in T. We quantify the effect size with Cohen’s d (n \(=\) 3040, m \(=\) 12,352) and find that the effect is large, with d \(=\) 2.01 and 1.834. The visual comparison of the navigators shows variability in liver anatomy across subjects concerning the superior–inferior extent of the liver and the number and arrangement of vessels.
This leads us to believe that domain shift is the reason for the significant shift in performance outcome of \({\textbf {M}}^{24}_{pre}\) in S and T.
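The effect sizes reported above can be computed as pooled-variance Cohen's d for two independent samples; a minimal NumPy sketch (the exact estimator variant used for the paper's numbers is an assumption):

```python
import numpy as np

def cohens_d(x: np.ndarray, y: np.ndarray) -> float:
    """Cohen's d for two independent samples using the pooled
    standard deviation (Bessel-corrected variances)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) \
                 / (nx + ny - 2)
    return float((x.mean() - y.mean()) / np.sqrt(pooled_var))
```

By the usual convention, \(|d| \ge 0.8\) is considered a large effect, so the reported d \(=\) 2.01 and 1.834 indicate a substantial performance gap between S and T.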

Figure 4

Top: comparison of no adaptation and TL at different levels of source domain data. Middle: comparison of direct learning and TL at different levels of target domain data. Bottom: comparison of ensemble sizes.

Figure 5

Top row: displacement fields with a composite of (red) labels and (green) predictions as reference. Bottom row: intensity difference images.

Table 2 Comparison of our method with no adaptation (no A) and with TL and different availability of source domain data.

Pre-trained vs. TL and influence of source domain data availability

Because domain shift is a challenge in deep learning-based 4D MRI prediction, we propose to employ TL. We evaluate the effect of TL on our models by comparing \({\textbf {M}}^{j}_{pre}\) (\(j \in [2, 5, 12, 24]\)) with \({\textbf {M}}^{2}_{pre+TL}\) regarding their performance in T, where \({\textbf {M}}^{2}_{pre+TL}\) is the result of fine-tuning \({\textbf {M}}^{j}_{pre}\) with 2 minutes of samples from T (720 samples = \({2}\,\textrm{min}\) acquisition time). By that, we also analyze how the source data amount j influences the effect of TL. For comparison, we use RMSE, MDISP, and DN_RMSE. The top row of box plots in Fig. 4 shows the results of this experiment. Two observations can be made. First, transfer learning improves the model performance in the target domain for all tested measures. All tested measures show a high significance level of p < 0.001. Significance levels were computed using the Wilcoxon rank sum test (m \(=\) 3040, n \(=\) 12,352) after confirming non-normal distributions using the Shapiro–Wilk test (n \(=\) 3040) and Kolmogorov–Smirnov test. We observe high effect sizes with \(|\text {d}| > 1.6\) for RMSE and DN_RMSE and medium effect sizes with \(|\text {d}| > 0.7\) for MDISP. Second, the amount of source domain data (beyond \(\sim {1}\,\textrm{min}/\text {subject}\)) has little to no influence on the effect size. It also does not affect the performance of either \({\textbf {M}}^{j}_{pre}\) or \({\textbf {M}}^{2}_{pre+TL}\) in T. In Table 2 we report means and 95th percentiles.

Table 3 Comparison of our method with direct learning and with TL.

TL vs. direct learning and the influence of target domain data availability

We evaluate whether TL is beneficial compared to directly learning a model from scratch in the target domain. Moreover, we evaluate how target sample availability influences the size of that effect. To that end, we directly train models from scratch on samples from T and compare them with fine-tuned models. Let \({\textbf {M}}^{i}_{direct}\) be a directly learned model and let \({\textbf {M}}^{i}_{pre+TL}\) be a model fine-tuned from \({\textbf {M}}^{2}_{pre}\), where \(i \in [1, 2, 5, 12, 24, 47]\). \({\textbf {M}}^{2}_{pre}\) was chosen as the base model because j showed virtually no influence on model performance in T. Furthermore, acquiring only a few samples to train a base model would be more economical in a real-world scenario. The model performance was tested dependent on the availability of target domain samples from 1 to 47 min (see the bottom row in Fig. 4). For each target data availability level i and target subject t, one model was trained directly and one with TL (in total, 48 models). For target data availability between 1 and 12 min, we observe significant improvements (p < 0.001) when using TL with respect to RMSE, MDISP, and DN_RMSE, and visual assessment reveals detail gain (see Fig. 6). Beyond the level of \({12}\,\textrm{min}\), improvements are not significant. We find that effect sizes are largest (small to medium) between 1 and 12 minutes, when few target samples are available, and become negligible when large amounts of target samples are available. We used the Wilcoxon rank sum test (m \(=\) 3040, n \(=\) 3040) to test for significance after we checked that the distributions are not normally distributed using the Shapiro–Wilk test (n \(=\) 3040) and Kolmogorov–Smirnov test. Effect sizes are reported as Cohen’s d. In Table 3 we report means and 95th percentiles. Figure 5 illustrates the image quality and displacement fields of predictions for increasing MDISP and RMSE values.
We present 4D visualizations in this video: youtu.be/bh8A9SoAXvM. (The video’s visibility will be set to public once the manuscript is accepted. During review the video is provided as Supplementary Material).

TL+Ens vs. TL

We evaluate whether the combination of transfer learning with the ensembling strategy (TL+Ens) enhances the model performance. For that, we compare ensembles of fine-tuned models of different ensemble sizes with regard to RMSE, MDISP, and DN_RMSE, where the ensemble size N \(=\) 1 represents TL only, i.e., no ensembling. A one-factorial analysis of variance (ANOVA) was performed to test for a primary effect of the ensemble size, which revealed a significant effect. A post-hoc pair-wise Tukey’s test was performed for the RMSE, MDISP, and DN_RMSE independently using p-value adjustment. The pair-wise effect size was computed using Cohen’s d. One can see that ensembles (TL+Ens) of size N = 5 and 10 perform significantly better than N = 1 (TL) in all tested metrics. Although ensembling provides some benefits, the effect size is relatively small, suggesting that our TL strategy has reached a saturation point in terms of quantitative result quality. However, in a subjective assessment, our senior radiologists with extensive experience consistently preferred the results of the TL+Ens approach over the TL-only results in all tested cases. The boxplots and all pairwise significances and Cohen’s d are presented in Fig. 4. The mean and 95th percentile are reported in Table 4.
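The TL+Ens strategy can be sketched as simple averaging of the outputs of N fine-tuned models. This is an illustrative assumption; the excerpt does not specify how member outputs are combined (e.g., mean vs. median, image vs. feature space):

```python
import numpy as np

def ensemble_predict(models, x):
    """Ensemble prediction: average the predicted slices of N
    fine-tuned models. `models` is a list of callables, each mapping
    an input (e.g., a navigator frame) to a predicted 2D slice."""
    preds = np.stack([m(x) for m in models])  # (N, H, W)
    return preds.mean(axis=0)
```

With N \(=\) 1 this degenerates to plain TL, matching the baseline in the comparison above.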

Table 4 Comparison of ensemble sizes N.
Figure 6

From top left to bottom right predictions of: \({\textbf {M}}^{2}_{pre}\), \({\textbf {M}}^{2}_{direct}\), \({\textbf {M}}^{2}_{pre+TL}\), ensemble of \(10\times {\textbf {M}}^{2}_{pre+TL}\). Arrows indicate places of detail gain.
