According to clinicians, the human kidney seldom deforms under patient posture changes and respiration. Therefore, we used a rigid transform in the kidney image registration. The proposed solution consists of 3DULBNet36 and a 3D–2D hierarchical registration network. 3DULBNet was trained offline on CT and windowed U/S images for separate binary segmentation tasks. The two feature networks were then connected to the registration network to predict the CT plane that best aligned with the U/S images.
Feature network
The ULBNet is a 5-level U-Net with residual blocks replacing the original convolutional layers (Appendix A). A local binary convolution (LBC) layer37 was added to the skip connections. The dropout rate was set to 0.2. For CT images, the patch size was 160 × 160 × 80, and the batch size was 1. The optimiser was Adam, and the loss function was the negative Dice coefficient. The output layer was a convolutional layer with sigmoid activation, and its output was a kidney feature map of the CT image: a probability map of each pixel/voxel belonging to the kidney, describing the global shape characteristics of the kidney. ULBNet36 can be referenced for details of the method used to process CT images.
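To illustrate the skip-connection design, the sketch below shows one way to build a 3D local binary convolution layer in Keras: a fixed, sparse, non-trainable {−1, 0, +1} convolution followed by a learnable 1 × 1 × 1 convolution, following the LBC idea of ref. 37. The function name, sparsity, and kernel size are assumptions for illustration, not the exact training code.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def lbc_block_3d(x, out_channels, sparsity=0.5, seed=0):
    """Local binary convolution sketch (assumed details, per ref. 37):
    a fixed sparse {-1, 0, +1} 3x3x3 convolution (non-trainable)
    followed by a learnable 1x1x1 convolution."""
    in_channels = int(x.shape[-1])
    rng = np.random.default_rng(seed)
    # Fixed binary anchor weights: zero with probability `sparsity`, else +/-1.
    w = rng.choice([-1.0, 0.0, 1.0],
                   size=(3, 3, 3, in_channels, out_channels),
                   p=[(1 - sparsity) / 2, sparsity, (1 - sparsity) / 2])
    anchor = layers.Conv3D(out_channels, 3, padding="same", use_bias=False,
                           trainable=False,
                           kernel_initializer=tf.constant_initializer(w))
    y = tf.nn.relu(anchor(x))
    # Learnable 1x1x1 convolution mixes the fixed binary responses.
    return layers.Conv3D(out_channels, 1, padding="same")(y)
```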
U/S images are noisy, and it is difficult to delineate the kidney from a single image; kidney motion information is therefore useful. Since a U/S sequence scans the kidney during respiration, the kidney motion cannot be observed in one frame but can be observed in a few consecutive frames. The U/S image sequences were experimentally windowed with a size of 5 (Appendix E). Thus, instead of extracting features from a 2D U/S frame, we extracted them from a 256 × 192 × 5 volume. In the U/S feature net, the input size was 256 × 192 × 5, down/upsampled by 2 at each level; the data were not downsampled in the time dimension. The numbers of feature maps were 16, 32, 64, 128, and 256 in the encoder pathway and 256, 128, 64, 32, and 16 in the decoder pathway. The output layer was a convolutional layer with sigmoid activation that output a U/S kidney feature map. Kidney feature maps of CT and U/S images had the same dimensions as their inputs. The windowed kidney feature images were used to construct the CT–US image pair.
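A minimal sketch of the windowing step described above, assuming the U/S sequence is stored as a (T, H, W) array; the function name is a placeholder:

```python
import numpy as np

def window_us_sequence(frames, window=5):
    """Group a U/S sequence of shape (T, H, W) into overlapping windows
    of `window` consecutive frames, each stacked as an (H, W, window)
    volume whose middle frame is the registration target."""
    half = window // 2
    return np.asarray([
        np.moveaxis(frames[i - half:i + half + 1], 0, -1)  # (H, W, window)
        for i in range(half, frames.shape[0] - half)
    ])  # shape: (T - window + 1, H, W, window)
```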
Registration network
We present kidney registration (Fig. 1) in five aspects: image pair preprocessing, network structure, loss function, training data generation and learning strategy.
Preprocessing
All CT images were converted to RAI orientation and isotropically resampled to 0.8 mm × 0.8 mm × 0.8 mm. CT scans were automatically cropped to 128 × 224 × 288 around the kidney centroid. Each U/S image was resampled to 0.8 mm × 0.8 mm and cropped to 224 × 288. Windowed U/S images were centred in a 128 × 224 × 288 volume with zero padding. Within one U/S window, the middle frame was the registration target, and the other frames contributed to motion regularisation. U/S images were stacked along the time axis, which coincided with the R-L axis in image space. The CT volume was aligned with the U/S volume by matching the kidney centroids computed from the feature maps. Because U/S scanning of the kidney is constrained by the patient's ribs and spine, the variability of the initial position of 3DCT-2DUS pairs can be large. To reduce this variability, we uniformly aligned the kidney on CT along the inferior-superior axis and then aligned the kidney on U/S with the centroid. The inputs to the registration network were CT and U/S window images of 128 × 224 × 288; CT was the moving image, and U/S was the fixed image.
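The centroid-based cropping can be sketched as follows, assuming the kidney probability map from the feature network is available and resampling has already been performed; names and the zero-padding policy are illustrative:

```python
import numpy as np

def crop_around_centroid(volume, prob_map, shape=(128, 224, 288)):
    """Crop `volume` to `shape` centred on the kidney centroid computed
    from the probability feature map, zero-padding where the crop window
    extends beyond the volume border."""
    coords = np.indices(prob_map.shape).reshape(3, -1).astype(float)
    w = prob_map.reshape(-1)
    centroid = np.rint((coords * w).sum(axis=1) / w.sum()).astype(int)

    out = np.zeros(shape, dtype=volume.dtype)
    src, dst = [], []
    for ax in range(3):
        start = centroid[ax] - shape[ax] // 2
        s0 = max(start, 0)
        s1 = min(start + shape[ax], volume.shape[ax])
        src.append(slice(s0, s1))            # region copied from the volume
        dst.append(slice(s0 - start, s1 - start))  # where it lands in the crop
    out[tuple(dst)] = volume[tuple(src)]
    return out
```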
Network structure
The architecture is shown in Fig. 1. The CT and U/S volumes were concatenated before entering the convolution layers. The numbers of feature maps were 8, 16, and 16 in the encoder pathway and 16, 16, 16, 16, 8, and 8 in the decoder pathway. Upsampling was performed with Upsampling3D, and downsampling with strided convolution. Transformation regression used the affine block; at each decoder level, the output (dense) layer used tanh activation for the transformation parameters. The spatial transformer network (STN) layer38 applied a rigid transformation because the kidney seldom deforms during free respiration. The optimiser was Adam with a learning rate of 3e−4. Given a U/S window size of Nw, the output transformation was a set of 6Nw rigid transformation parameters: 3Nw for rotation and 3Nw for translation. Nw was experimentally set to 5 in this work (Appendix G). Each level in the decoder pathway output a set of transformation parameters, and their weighted sum composed the final transformation: rotation was the average of the per-level rotation parameters, and translation was a weighted sum of the per-level translation parameters with weights {8, 4, 2, 1}/4 = {2, 1, 0.5, 0.25}. Rotation is scale invariant and was therefore weighted equally; translation is inversely proportional to image resolution and was therefore weighted proportionally. Hierarchical activation improved prediction and decreased training time. The number of trainable parameters was approximately 282 M.
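The hierarchical composition of per-level predictions might look like the following sketch, assuming four decoder levels each emit an (Nw, 6) array of rigid parameters ordered coarse to fine; this mirrors the weighting rule above rather than the exact network code:

```python
import numpy as np

def compose_hierarchical_transform(level_params):
    """Combine per-level rigid-transform predictions into the final
    transformation. `level_params` is a list of four (Nw, 6) arrays:
    3 rotation + 3 translation parameters per window frame. Rotations
    are scale invariant and simply averaged; translations scale with
    resolution and use weights {8, 4, 2, 1}/4 = {2, 1, 0.5, 0.25}."""
    params = np.stack(level_params)             # (levels, Nw, 6)
    rot = params[..., :3].mean(axis=0)          # equal-weight rotations
    t_weights = np.array([2.0, 1.0, 0.5, 0.25]).reshape(-1, 1, 1)
    trans = (params[..., 3:] * t_weights).sum(axis=0)
    return np.concatenate([rot, trans], axis=-1)  # (Nw, 6)
```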
Loss function
The feature map describes the kidney as a whole, and the MIND feature describes image details; they complement each other to measure alignment accurately. Because registration occurs during breathing, the resulting CT cutting planes should stack together into a smooth time sequence. Given the fixed image \({I}_{fix}\), moving image \({I}_{mov}\), fixed feature map \({M}_{fix}\), moving feature map \({M}_{mov}\), and transformation parameters \(\mathcal{D}\), the feature-image-motion (FIM) loss was defined (Eq. 1) by three measures on the feature map, the original image, and the transformation:
$$\mathcal{L}\left(I_{fix}, I_{mov}, M_{fix}, M_{mov}, \mathcal{D}\right)=\mathcal{L}_{f}\left(M_{fix}, \mathcal{D}\circ M_{mov}\right)+\lambda_{1}\mathcal{L}_{i}\left(I_{fix}, \mathcal{D}\circ \left(I_{mov}\cdot M_{mov}\right)\right)+\lambda_{2}\mathcal{L}_{d}\left(\mathcal{D}\right)$$
(1)
$$\mathcal{L}_{f}\left(M_{fix}, \mathcal{D}\circ M_{mov}\right)=-\frac{1}{N_{w}}\sum_{i\in N_{w}}DICE_{i}\left(M_{fix}, \mathcal{D}_{i}\circ M_{mov}\right)$$
(2)
$$\mathcal{L}_{i}\left(I_{fix}, \mathcal{D}\circ \left(I_{mov}\cdot M_{mov}\right)\right)=\frac{1}{\left|N\right|}\sum \left|MIND\left(I_{fix}\right)-MIND\left(\mathcal{D}\circ \left(I_{mov}\cdot M_{mov}\right)\right)\right|$$
(3)
$$\mathcal{L}_{d}\left(\mathcal{D}\right)=0.01\,\frac{\Vert \mathcal{D}-I_{4\times 4}\Vert }{N_{w}}+0.99\,\mathrm{grad}\,\mathcal{D}, \quad \mathrm{grad}\,\mathcal{D}=\frac{1}{N_{w}-2}\sum_{i-1,i,i+1\in N_{w}}\Vert \mathcal{D}_{i+1}+\mathcal{D}_{i-1}-2\mathcal{D}_{i}\Vert.$$
(4)
The feature loss was the negative Dice coefficient of the fixed and warped kidney feature maps (Eq. 2), where \(DICE\left(x,y\right)= \frac{1}{m}\sum \left(\frac{2x \odot y}{x \oplus y}\right)\), ⊙ and ⊕ denote elementwise multiplication and addition, and m is the number of elements. DICE was calculated between the middle slice along the R-L axis of the warped CT volume and the corresponding slice in the fixed U/S volume, and the feature loss was the average DICE within the time window. The image loss was the MIND feature difference between the fixed and warped original images (Eq. 3)39. A MIND feature was calculated by a Gaussian function of the mean squared difference between a central patch of the image and one of its six neighbouring patches; the neighbourhood was 3D, and the patch size was 3 × 3 × 3. MIND features were extracted from both the CT and U/S volumes to capture 3D local image information as the kidney moved. The U/S volume had 5 frames, and only the feature of its middle frame was valid and used to calculate the image loss. The image loss was the mean absolute difference between the MIND features of the middle R-L slice in the warped CT volume and those in the fixed U/S volume, comparing the two images by their local details. N was the set of displacement vectors. The CT images needed to be masked before calculating MIND because they display extra anatomic structures in addition to the kidney. The transformation loss (Eq. 4) was a weighted sum of the L2-norm divided by the number of parameters and the average transformation difference. \({\lambda }_{1}\) and \({\lambda }_{2}\) were empirically set to 0.01 and 0.001. We assumed that MIND is locally continuous. \({\lambda }_{1}\) was set to 0.01 so that the feature loss dominated weight updates initially, and when approaching the global optimum, the image loss took effect to fine-tune the weights. The transformation loss regularised the motion transformation that moves CT to its optimal position aligned with the U/S images. In summary, the feature loss \({\mathcal{L}}_{f}\) was calculated on kidney feature maps; the image loss \({\mathcal{L}}_{i}\) was computed between U/S images and masked CT images, implemented as the elementwise multiplication of the warped CT images and the warped CT kidney feature map. Both losses were thus calculated on kidney regions, eliminating the influence of the differing fields of view in CT and U/S images.
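For concreteness, the feature loss (Eq. 2) and transformation loss (Eq. 4) can be sketched as follows, assuming feature maps of shape (Nw, H, W) and Nw 4 × 4 rigid transformation matrices; the MIND image loss (Eq. 3) is omitted for brevity:

```python
import numpy as np

def dice_feature_loss(m_fix, m_warped, eps=1e-6):
    """Eq. (2): negative Dice between fixed and warped kidney feature
    maps, using the elementwise form DICE(x, y) = mean(2*x*y / (x + y))
    from the text, averaged over the Nw window frames. Computed here
    over whole frames; the paper uses the middle R-L slice."""
    dice = (2.0 * m_fix * m_warped
            / (m_fix + m_warped + eps)).mean(axis=(1, 2))
    return -dice.mean()

def transform_loss(D, n_w=5):
    """Eq. (4): L2 deviation from the identity plus a second-difference
    motion-smoothness term over the window. `D` holds Nw 4x4 rigid
    transformation matrices (an assumed representation)."""
    deviation = np.linalg.norm(D - np.eye(4), axis=(1, 2)).sum() / n_w
    second_diff = np.linalg.norm(
        D[2:] + D[:-2] - 2.0 * D[1:-1], axis=(1, 2)).sum() / (n_w - 2)
    return 0.01 * deviation + 0.99 * second_diff
```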
Training data generation
The transformation was in 6-dimensional space, while the data size was relatively small, so network training was prone to overfitting. Training dataset generation was therefore necessary; it has also been employed in projective 3D–2D image registration25,40, where dense sampling is commonly used. Unlike projective registration, the transformation parameters, and their distribution, for sliced 3D–2D registration are uncontrollable and not available in advance; we needed to estimate them. First, we verified reference planes with clinicians (Appendix B). The transformations producing those verified alignments were modelled, parameter by parameter, as Gaussian distributions truncated at 2 sigma, from which we randomly sampled Nt transformations (Appendix C). The Nt inverse transformations map the optimal alignment back to Nt different initial kidney positions. We generated Nt training CT–US pairs by applying the Nt inverse transformations to a reference CT volume (Appendix D). This training data generation scheme yielded realistic training datasets and helped overcome overfitting. In the future, if sufficient clinical data are available, data generation can be omitted.
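A sketch of the sampling step, assuming the clinician-verified alignments are summarised as an (n, 6) array of rigid parameters; the fitted distributions themselves are given in Appendix C:

```python
import numpy as np

def sample_training_transforms(ref_params, n_t=1000, seed=0):
    """Sample Nt rigid transformations parameter-by-parameter from
    Gaussians fitted to clinician-verified alignments, truncated at
    2 sigma. `ref_params` is an (n, 6) array of verified parameters
    (3 rotations, 3 translations). Applying the inverse of each
    sample to the reference alignment yields one training pair."""
    rng = np.random.default_rng(seed)
    mu = ref_params.mean(axis=0)
    sigma = ref_params.std(axis=0)
    samples = rng.normal(mu, sigma, size=(n_t, 6))
    # Truncate each parameter to its 2-sigma interval.
    return np.clip(samples, mu - 2 * sigma, mu + 2 * sigma)
```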
Learning strategy
Based on the generated training datasets, unsupervised learning was employed to pretrain a registration model. Our target U/S images are respiration sequences comprising periodic breathing cycles, so the onsite patient-specific data can be used to further improve model performance. We proposed a one-cycle transfer learning strategy, which refined the pretrained model on the first respiration cycle via transfer learning with two epochs and then inferred the transformations for the subsequent frames. On-site patient-specific training without a pretrained model was infeasible because the convergence time was too long to accept, and a long operation preparation time is impractical in clinical applications. We therefore used transfer learning to refine the model, saving time and improving performance (Appendix F).
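The one-cycle strategy might be organised as in the following Keras-style sketch; all names (model, data pipelines, loss) are placeholders supplied by the caller, not the actual training code:

```python
from tensorflow import keras

def one_cycle_refine(pretrained, first_cycle_pairs, remaining_pairs,
                     fim_loss, lr=3e-4, epochs=2):
    """One-cycle transfer learning sketch: refine the pretrained
    registration model on the first respiration cycle for two epochs,
    then infer transformations for the rest of the sequence."""
    pretrained.compile(optimizer=keras.optimizers.Adam(lr), loss=fim_loss)
    pretrained.fit(first_cycle_pairs, epochs=epochs)   # on-site refinement
    return pretrained.predict(remaining_pairs)          # subsequent frames
```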
Evaluation metrics
The Hausdorff distance (HD) and mean contour distance (MCD) between the kidney outlines on the CT and U/S images were used to evaluate whether the CT–US pair was well aligned; they were calculated as
$$D_{Hausdorff}\left(U,C\right)=\max\left\{\max_{c\in C}\left\{\min_{u\in U}\left\{d\left(c,u\right)\right\}\right\},\ \max_{u\in U}\left\{\min_{c\in C}\left\{d\left(u,c\right)\right\}\right\}\right\},\quad D_{mean}=\operatorname*{mean}_{u\in U,\,c\in C}\left\{d\left(u,c\right)\right\}.$$
\(d\left(u,c\right)\) is the absolute value read from the distance map, and c and u are contour points on the CT and U/S images, respectively. HD and MCD are in millimetres. The CT to U/S (CT–US) distance was calculated between kidney boundaries on the CT and U/S images. The CT to CT (CT–CT) distance was calculated between kidney contours on the resulting CT plane and the reference CT plane.
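The two metrics can be computed from contour point sets as follows; this sketch uses pairwise nearest-point distances in place of the distance map, which is equivalent up to discretisation:

```python
import numpy as np
from scipy.spatial.distance import cdist

def contour_distances(us_contour, ct_contour):
    """Hausdorff distance (HD) and mean contour distance (MCD) between
    kidney contours, given as (n, 2) arrays of points in millimetres."""
    d = cdist(us_contour, ct_contour)   # pairwise point-to-point distances
    # HD: maximum of the two directed Hausdorff distances.
    hd = max(d.min(axis=0).max(), d.min(axis=1).max())
    # MCD: symmetric mean of nearest-point distances (assumed variant).
    mcd = 0.5 * (d.min(axis=0).mean() + d.min(axis=1).mean())
    return hd, mcd
```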
Ethics declarations
The in-house datasets were collected from the Fifth Affiliated Hospital of Guangzhou Medical University, and the study was approved by the Institutional Review Board on August 28, 2020, under protocol number L2020-24.