Re: [Rtk-users] Slow CUDA FDK performance

Simon Rit Fri, 19 Nov 2021 02:32:17 -0800

Sounds good to me! Can you make a PR? If not, I'll do it later today.

On Fri, Nov 19, 2021 at 10:18 AM Moritz Schaar <sch...@imt.uni-luebeck.de>
wrote:


> Hi Simon,
>
>
>
> thanks for the clarification.
>
> Just to understand what’s going on here:
>
> The issue is that CudaFDKConeBeamReconstructionFilter inherits properties
> from its superclass FDKConeBeamReconstructionFilter and not explicitly
> setting the value SLAB_SIZE?
>
> In GPUGenerateData()
> <https://github.com/SimonRit/RTK/blob/601715281db1ebb7a8a7e436bb0608814ee30e91/src/rtkCudaFDKConeBeamReconstructionFilter.cxx#L39>
>  the
> CPUSuperclass:GenerateData()
> <https://github.com/SimonRit/RTK/blob/601715281db1ebb7a8a7e436bb0608814ee30e91/src/rtkCudaFDKConeBeamReconstructionFilter.cxx#L41>
>  method
> is called that uses m_ProjectionSubsetSize
> <https://github.com/SimonRit/RTK/blob/46ea6e190965bcc89421075830e365063aa8c51a/include/rtkFDKConeBeamReconstructionFilter.hxx#L135>
> to group projections, which is now defaulted to 2 as per this part
> <https://github.com/SimonRit/RTK/blob/46ea6e190965bcc89421075830e365063aa8c51a/include/rtkFDKConeBeamReconstructionFilter.hxx#L49-L56>
> .
>
> So a simple
>
> CPUSuperclass::SetProjectionSubsetSize(SLAB_SIZE);
>
> In the constructor of CudaFDKConeBeamReconstructionFilter should be enough?
> Or am I approaching the issue from the wrong side and mixing up things?
>
>
>
> Best,
>
> Moritz
>
>
>
> *Von:* Simon Rit <simon....@creatis.insa-lyon.fr>
> *Gesendet:* Freitag, 19. November 2021 07:25
> *An:* Moritz Schaar <sch...@imt.uni-luebeck.de>
> *Cc:* rtk-users@public.kitware.com
> *Betreff:* Re: [Rtk-users] Slow CUDA FDK performance
>
>
>
> Hi,
>
> Thanks for isolating the problem. It's interesting to note that the
> comment
> <https://github.com/SimonRit/RTK/blob/46ea6e190965bcc89421075830e365063aa8c51a/include/rtkFDKConeBeamReconstructionFilter.hxx#L48>
> disagree with the content
> <https://github.com/SimonRit/RTK/blob/46ea6e190965bcc89421075830e365063aa8c51a/include/rtkFDKConeBeamReconstructionFilter.hxx#L49-L56>...
> This commit
> <https://github.com/SimonRit/RTK/commit/42bd0e2557a04e1bea5e6a25f63d251a12743fc3>
> is not sufficiently clear to understand why it's been changed at the time
> (and I don't know if it's still necessary).
>
> This could probably be changed in the Cuda constructor
> <https://github.com/SimonRit/RTK/blob/master/src/rtkCudaFDKConeBeamReconstructionFilter.cxx#L21>.
> Note that the computation of the ramp kernel always uses the CPU filter
>
>
> https://github.com/SimonRit/RTK/blob/46ea6e190965bcc89421075830e365063aa8c51a/include/rtkFFTRampImageFilter.hxx#L84
>
> but this should not be a problem as only one line is computed anyway
> (before being replicated). Feel free to propose a pull request with this
> change if it does solve your issue.
>
> As far as I can see, RTK_CUDA_PROJECTIONS_SLAB_SIZE (which becomes
> SLAB_SIZE) is only used in the Cuda code of the forward and
> backprojections, not for setting the FDK processing, so you have to set it
> manually.
>
> Simon
>
>
>
> On Thu, Nov 18, 2021 at 3:10 PM Moritz Schaar <sch...@imt.uni-luebeck.de>
> wrote:
>
> Hi Simon,
>
>
>
> thank you for looking into it!
>
> So there is no general issue, which is nice to know.
>
> Sadly I cannot run the old version as I end up with the following error:
>
> “ImportError: DLL load failed while importing _RTKPython”
>
> ITK works, RTK doesn’t. And rebuilding also doesn’t work as too many
> things changed (CUDA, VS, ..).
>
>
>
> In the meantime I figured out that modifying “m_ProjectionSubsetSize”
> helps to accelerate everything.
>
> Checking this with “feldkamp.GetProjectionSubsetSize()” I get for the CPU
> and CUDA version of the FDKConeBeamReconstructionFilter a value of “2”.
>
> Regarding your test code I, obviously, get very slow execution times with
> this value. Increasing the number of subsets to 200 I end up with similar
> values you reported.
>
>
>
> Now I am wondering where these defaults come from.
>
> For CPU I assume that this comes from missing FFTW as given here
> https://github.com/SimonRit/RTK/blob/46ea6e190965bcc89421075830e365063aa8c51a/include/rtkFDKConeBeamReconstructionFilter.hxx#L48
>
> However, I do not understand why the CUDA version defaults to 2 instead of
> 16.
>
> In CMake I kept the default: RTK_CUDA_PROJECTIONS_SLAB_SIZE=16
>
> From https://github.com/SimonRit/RTK/blob/master/rtkConfiguration.h.in#L35
> I assume that this gets copied to SLAB_SIZE which will be used by all CUDA
> codes.
>
> My rtkConfiguration.h also reflects this:
>
> #ifndef SLAB_SIZE
>
> #  define SLAB_SIZE 16
>
> #endif
>
>
>
> Do you have an explanation why this gets reduced from 16 to 2? Or a hint
> where I can have a look?
>
>
>
> Best,
>
> Moritz
>
>
>
>
>
> *Von:* Simon Rit <simon....@creatis.insa-lyon.fr>
> *Gesendet:* Donnerstag, 18. November 2021 12:13
> *An:* Moritz Schaar <sch...@imt.uni-luebeck.de>
> *Cc:* rtk-users@public.kitware.com
> *Betreff:* Re: [Rtk-users] Slow CUDA FDK performance
>
>
>
> Hi,
>
> I compiled the python packages with exactly the same configurations and I
> can't reproduce the issue
>
> old: CUDA 10.2, ITK 5.1.2, RTK 2.1.0 -> 0.9 s
>
> 0.019904613494873047
> 0.6475656032562256
> Reconstructing...
>
> 0.9730124473571777
>
>
>
> new: CUDA 11.5, ITK 5.2.1, RTK 2.3.0
>
> 0.017342329025268555
> 0.7650339603424072
> Reconstructing...
> 0.8823671340942383
>
>
>
> The code I ran is the following
>
> #!/usr/bin/env python
> import sys
> import itk
> import time
> from itk import RTK as rtk
>
> if len ( sys.argv ) < 3:
>   print( "Usage: FirstReconstruction <outputimage> <outputgeometry>" )
>   sys.exit ( 1 )
>
> # Defines the image type
> GPUImageType = rtk.CudaImage[itk.F,3]
> CPUImageType = rtk.Image[itk.F,3]
>
> # Defines the RTK geometry object
> geometry = rtk.ThreeDCircularProjectionGeometry.New()
> numberOfProjections = 200
> firstAngle = 0.
> angularArc = 360.
> sid = 600 # source to isocenter distance
> sdd = 1200 # source to detector distance
> for x in range(0,numberOfProjections):
>   angle = firstAngle + x * angularArc / numberOfProjections
>   geometry.AddProjection(sid,sdd,angle)
>
> # Writing the geometry to disk
> xmlWriter = rtk.ThreeDCircularProjectionGeometryXMLFileWriter.New()
> xmlWriter.SetFilename ( sys.argv[2] )
> xmlWriter.SetObject ( geometry );
> xmlWriter.WriteFile();
>
> # Create a stack of empty projection images
> ConstantImageSourceType = rtk.ConstantImageSource[GPUImageType]
> constantImageSource = ConstantImageSourceType.New()
> origin = [ -127.75, -127.75, 0. ]
> sizeOutput = [ 512, 512,  numberOfProjections ]
> spacing = [ 0.5, 0.5, 0.5 ]
> constantImageSource.SetOrigin( origin )
> constantImageSource.SetSpacing( spacing )
> constantImageSource.SetSize( sizeOutput )
> constantImageSource.SetConstant(0.)
>
> REIType = rtk.RayEllipsoidIntersectionImageFilter[CPUImageType,
> CPUImageType]
> rei = REIType.New()
> semiprincipalaxis = [ 50, 50, 50]
> center = [ 0, 0, 10]
> # Set GrayScale value, axes, center...
> rei.SetDensity(2)
> rei.SetAngle(0)
> rei.SetCenter(center)
> rei.SetAxis(semiprincipalaxis)
> rei.SetGeometry( geometry )
> rei.SetInput(constantImageSource.GetOutput())
>
> # Create reconstructed image
> constantImageSource2 = ConstantImageSourceType.New()
> sizeOutput = [ 256 ] * 3
> origin = [ -63.75 ] * 3
> spacing = [ 0.5 ] *  3
> constantImageSource2.SetOrigin( origin )
> constantImageSource2.SetSpacing( spacing )
> constantImageSource2.SetSize( sizeOutput )
> constantImageSource2.SetConstant(0.)
> t0 = time.time()
> constantImageSource2.Update()
> t1 = time.time()
> print(t1-t0)
>
> # Graft the projections to an itk::CudaImage
> projections = GPUImageType.New()
> t0 = time.time()
> rei.Update()
> t1 = time.time()
> print(t1-t0)
> projections.SetPixelContainer(rei.GetOutput().GetPixelContainer())
> projections.CopyInformation(rei.GetOutput())
> projections.SetBufferedRegion(rei.GetOutput().GetBufferedRegion())
> projections.SetRequestedRegion(rei.GetOutput().GetRequestedRegion())
>
> # FDK reconstruction
> print("Reconstructing...")
> FDKGPUType = rtk.CudaFDKConeBeamReconstructionFilter
> feldkamp = FDKGPUType.New()
> feldkamp.SetInput(0, constantImageSource2.GetOutput())
> feldkamp.SetInput(1, projections)
> feldkamp.SetGeometry(geometry)
> feldkamp.GetRampFilter().SetTruncationCorrection(0.0)
> feldkamp.GetRampFilter().SetHannCutFrequency(0.0)
> t0 = time.time()
> feldkamp.Update()
> t1 = time.time()
> print(t1-t0)
>
>
> To be honest I don't see to do at this stage... Can you maybe check the
> same code with your two versions ? Any other suggestion?
> Simon
>
>
>
> On Wed, Nov 10, 2021 at 10:03 AM Moritz Schaar <sch...@imt.uni-luebeck.de>
> wrote:
>
> Hi Simon,
>
>
>
> I completely agree that this is hard to track down. That’s why I am asking
> for directions J
>
> To be more precise about the execution times of my example:
>
> The timings given in pairs of 17.1/1.2 s and 19/7 s are only the required
> times of the reconstruction step itself.
>
> Reading data, pre and post processing are not part of this time
> measurement.
>
> So the 7 s average in python is similar to the 6.41 s I obtained from
> adding everything done in CudaFDKConeBeamReconstructionFilter using
> RTK_PROBE_EACH_FILTER.
>
> The reconstruction step in python simply involves:
>
> -          Instantiation of a simple class, this doesn’t add anything to
> the timings
>
> -          Setting up ConstantImageSource with either rtk.Image or
> rtk.CudaImage
>
> -          Setting up
> FDKConeBeamReconstructionFilter/CudaFDKConeBeamReconstructionFilter
>
> -          Setting inputs, geometry and filter
>
> -          Update() and return result
>
>
>
> Looks like there was a typo in my mail, the versions compared should be:
>
> old: CUDA 10.2, ITK 5.1.2, RTK 2.1.0
>
> new: CUDA 11.5, ITK 5.2.1, RTK 2.3.0
>
>
>
> Sorry for the confusion and thanks for looking into it!
>
>
>
> Best,
>
> Moritz
>
>
>
>
>
> *Von:* Simon Rit <simon....@creatis.insa-lyon.fr>
> *Gesendet:* Mittwoch, 10. November 2021 09:32
> *An:* Moritz Schaar <sch...@imt.uni-luebeck.de>
> *Cc:* rtk-users@public.kitware.com
> *Betreff:* Re: [Rtk-users] Slow CUDA FDK performance
>
>
>
> Hi Moritz,
>
> Thanks for the report. It's a bit hard to be convinced that something is
> wrong without being able to reproduce it. From the RTK_PROBE_EACH_FILTER
> log, most of the time is spent reading the projections which will be the
> same with or without cuda so I wonder if this is not the issue here. I can
> try to reproduce the issue, can you just confirm the two configurations :
> Cuda 10.2, ITK 5.2.1, RTK 2.1.0 vs Cuda 11.5, ITK 5.2.1 RTK 2.3.0 ?
>
> Thanks,
>
> Simon
>
>
>
> On Fri, Nov 5, 2021 at 4:20 PM Moritz Schaar <sch...@imt.uni-luebeck.de>
> wrote:
>
> Hi,
>
>
>
> I recently upgraded my Windows 10 system to ITK 5.2.1 including RTK 2.3.0.
>
> This also involved upgrading CUDA from 10.2 to 11.5, Visual Studio 2019
> and even python update (3.8.5 to 3.8.12).
>
> Using the python wrapping of RTK I implemented own routines that use FDK
> similar to the rtkfdk application.
>
> On the old system (ITK 5.2.1, RTK 2.1.0) I benchmarked the FDK for a
> 512x512x200 dataset reconstructed into 256x256x256 with 1.0 mm isotropic
> voxel size.
>
> The system is equipped with 24 CPU cores and one RTX 2080 Ti, so the CPU
> version took 17.1 and the CUDA version 1.2 seconds.
>
> Running the new software version on the same system results in roughly 19
> s CPU time but more than 7 s for the CUDA version.
>
> I don’t care about the actual timings but the relative increase of the
> CUDA version is what bothers me.
>
>
>
> To dig up some more information I recompiled RTK with
> RTK_PROBE_EACH_FILTER and ran rtkfdk.exe for the same data, this is what I
> got:
>
>
> **************************************************************************************************************
>
> Probe Tag                                    Starts    Stops     Time
> (s)       Memory (kB)    Cuda memory (kB)
>
>
> **************************************************************************************************************
>
> ChangeInformationImageFilter                 200       200
> 0.0211846      0              0
>
> ConstantImageSource                          1         1
> 0.0305991      65668          0
>
> CudaCropImageFilter                          13        13
> 0.0222911      15786.8        15753.8
>
> CudaDisplacedDetectorImageFilter             13        13
>      0.0540568      10719.1        16384
>
> CudaFDKBackProjectionImageFilter             13        13
> 0.0326397      5051.38        5041.23
>
> CudaFDKConeBeamReconstructionFilter          1         1
> 5.72999        552184         211648
>
> CudaFDKWeightProjectionFilter                13        13
> 0.0262806      -13892         630.154
>
> CudaFFTRampImageFilter                       13        13
> 0.148416       43095.4        12499.7
>
> CudaParkerShortScanImageFilter               13        13
>      0.0467202      2525.85        15753.8
>
> ExtractImageFilter                           13        13
> 0.0259726      15812.3        -15753.8
>
> ImageFileReader                              200       200
> 0.0226735      -0.16          0
>
> ImageSeriesReader                            200       200
> 0.066097       6.12           0
>
> ProjectionsReader                            1         1
> 26.0388        208488         0
>
> StreamingImageFilter                         2         2         16.0663
>       547512         191840
>
> VnlRealToHalfHermitianForwardFFTImageFilter  2         2
> 0.0208174      0              0
>
>
>
> Following the conversion on the mailing list,
> https://public.kitware.com/pipermail/rtk-users/2018-July/010617.html, I
> see that the CudaFDKConeBeamReconstructionFilter takes 6.41 s of which
> roughly 1/3 is spent in the CudaFFTRampImageFilter.
>
> Sadly I don’t have these results for the old software version so I can’t
> relate these values.
>
>
>
> However, I also played around with v2.2.0 but it doesn’t make a difference.
>
> Sadly, the version I used before (v2.1.0) won’t compile with CUDA 11.5
> anymore. I tried to add small adjustments e.g. this commit
> https://github.com/SimonRit/RTK/commit/3d3c7506087f5fa98aee75df5af5c30e7e51cbe6
> to make things work but this didn’t work.
>
> The same happens with other errors when trying to setup ITK 5.1.2, so
> getting back the old version for comparison seems impossible.
>
>
>
> Is there any direction you can point me to check what is actually the
> issue here? Or maybe someone has an idea what could be the reason? 
> CUDA/RTK/ITK
> version?
>
> Any help is appreciated.
>
>
>
> *Best,*
>
> *Moritz*
>
>
>
> _______________________________________________
> Rtk-users mailing list
> Rtk-users@public.kitware.com
> https://public.kitware.com/mailman/listinfo/rtk-users
>
>

_______________________________________________
Rtk-users mailing list
Rtk-users@public.kitware.com
https://public.kitware.com/mailman/listinfo/rtk-users

Re: [Rtk-users] Slow CUDA FDK performance

Reply via email to