Sounds good to me! Can you make a PR? If not, I'll do it later today. On Fri, Nov 19, 2021 at 10:18 AM Moritz Schaar <sch...@imt.uni-luebeck.de> wrote:
> Hi Simon, > > > > thanks for the clarification. > > Just to understand what’s going on here: > > The issue is that CudaFDKConeBeamReconstructionFilter inherits properties > from its superclass FDKConeBeamReconstructionFilter and not explicitly > setting the value SLAB_SIZE? > > In GPUGenerateData() > <https://github.com/SimonRit/RTK/blob/601715281db1ebb7a8a7e436bb0608814ee30e91/src/rtkCudaFDKConeBeamReconstructionFilter.cxx#L39> > the > CPUSuperclass:GenerateData() > <https://github.com/SimonRit/RTK/blob/601715281db1ebb7a8a7e436bb0608814ee30e91/src/rtkCudaFDKConeBeamReconstructionFilter.cxx#L41> > method > is called that uses m_ProjectionSubsetSize > <https://github.com/SimonRit/RTK/blob/46ea6e190965bcc89421075830e365063aa8c51a/include/rtkFDKConeBeamReconstructionFilter.hxx#L135> > to group projections, which is now defaulted to 2 as per this part > <https://github.com/SimonRit/RTK/blob/46ea6e190965bcc89421075830e365063aa8c51a/include/rtkFDKConeBeamReconstructionFilter.hxx#L49-L56> > . > > So a simple > > CPUSuperclass::SetProjectionSubsetSize(SLAB_SIZE); > > In the constructor of CudaFDKConeBeamReconstructionFilter should be enough? > Or am I approaching the issue from the wrong side and mixing up things? > > > > Best, > > Moritz > > > > *Von:* Simon Rit <simon....@creatis.insa-lyon.fr> > *Gesendet:* Freitag, 19. November 2021 07:25 > *An:* Moritz Schaar <sch...@imt.uni-luebeck.de> > *Cc:* rtk-users@public.kitware.com > *Betreff:* Re: [Rtk-users] Slow CUDA FDK performance > > > > Hi, > > Thanks for isolating the problem. It's interesting to note that the > comment > <https://github.com/SimonRit/RTK/blob/46ea6e190965bcc89421075830e365063aa8c51a/include/rtkFDKConeBeamReconstructionFilter.hxx#L48> > disagree with the content > <https://github.com/SimonRit/RTK/blob/46ea6e190965bcc89421075830e365063aa8c51a/include/rtkFDKConeBeamReconstructionFilter.hxx#L49-L56>... > This commit > <https://github.com/SimonRit/RTK/commit/42bd0e2557a04e1bea5e6a25f63d251a12743fc3> > is not sufficiently clear to understand why it's been changed at the time > (and I don't know if it's still necessary). > > This could probably be changed in the Cuda constructor > <https://github.com/SimonRit/RTK/blob/master/src/rtkCudaFDKConeBeamReconstructionFilter.cxx#L21>. > Note that the computation of the ramp kernel always uses the CPU filter > > > https://github.com/SimonRit/RTK/blob/46ea6e190965bcc89421075830e365063aa8c51a/include/rtkFFTRampImageFilter.hxx#L84 > > but this should not be a problem as only one line is computed anyway > (before being replicated). Feel free to propose a pull request with this > change if it does solve your issue. > > As far as I can see, RTK_CUDA_PROJECTIONS_SLAB_SIZE (which becomes > SLAB_SIZE) is only used in the Cuda code of the forward and > backprojections, not for setting the FDK processing, so you have to set it > manually. > > Simon > > > > On Thu, Nov 18, 2021 at 3:10 PM Moritz Schaar <sch...@imt.uni-luebeck.de> > wrote: > > Hi Simon, > > > > thank you for looking into it! > > So there is no general issue, which is nice to know. > > Sadly I cannot run the old version as I end up with the following error: > > “ImportError: DLL load failed while importing _RTKPython” > > ITK works, RTK doesn’t. And rebuilding also doesn’t work as too many > things changed (CUDA, VS, ..). > > > > In the meantime I figured out that modifying “m_ProjectionSubsetSize” > helps to accelerate everything. > > Checking this with “feldkamp.GetProjectionSubsetSize()” I get for the CPU > and CUDA version of the FDKConeBeamReconstructionFilter a value of “2”. > > Regarding your test code I, obviously, get very slow execution times with > this value. Increasing the number of subsets to 200 I end up with similar > values you reported. > > > > Now I am wondering where these defaults come from. > > For CPU I assume that this comes from missing FFTW as given here > https://github.com/SimonRit/RTK/blob/46ea6e190965bcc89421075830e365063aa8c51a/include/rtkFDKConeBeamReconstructionFilter.hxx#L48 > > However, I do not understand why the CUDA version defaults to 2 instead of > 16. > > In CMake I kept the default: RTK_CUDA_PROJECTIONS_SLAB_SIZE=16 > > From https://github.com/SimonRit/RTK/blob/master/rtkConfiguration.h.in#L35 > I assume that this gets copied to SLAB_SIZE which will be used by all CUDA > codes. > > My rtkConfiguration.h also reflects this: > > #ifndef SLAB_SIZE > > # define SLAB_SIZE 16 > > #endif > > > > Do you have an explanation why this gets reduced from 16 to 2? Or a hint > where I can have a look? > > > > Best, > > Moritz > > > > > > *Von:* Simon Rit <simon....@creatis.insa-lyon.fr> > *Gesendet:* Donnerstag, 18. November 2021 12:13 > *An:* Moritz Schaar <sch...@imt.uni-luebeck.de> > *Cc:* rtk-users@public.kitware.com > *Betreff:* Re: [Rtk-users] Slow CUDA FDK performance > > > > Hi, > > I compiled the python packages with exactly the same configurations and I > can't reproduce the issue > > old: CUDA 10.2, ITK 5.1.2, RTK 2.1.0 -> 0.9 s > > 0.019904613494873047 > 0.6475656032562256 > Reconstructing... > > 0.9730124473571777 > > > > new: CUDA 11.5, ITK 5.2.1, RTK 2.3.0 > > 0.017342329025268555 > 0.7650339603424072 > Reconstructing... > 0.8823671340942383 > > > > The code I ran is the following > > #!/usr/bin/env python > import sys > import itk > import time > from itk import RTK as rtk > > if len ( sys.argv ) < 3: > print( "Usage: FirstReconstruction <outputimage> <outputgeometry>" ) > sys.exit ( 1 ) > > # Defines the image type > GPUImageType = rtk.CudaImage[itk.F,3] > CPUImageType = rtk.Image[itk.F,3] > > # Defines the RTK geometry object > geometry = rtk.ThreeDCircularProjectionGeometry.New() > numberOfProjections = 200 > firstAngle = 0. > angularArc = 360. > sid = 600 # source to isocenter distance > sdd = 1200 # source to detector distance > for x in range(0,numberOfProjections): > angle = firstAngle + x * angularArc / numberOfProjections > geometry.AddProjection(sid,sdd,angle) > > # Writing the geometry to disk > xmlWriter = rtk.ThreeDCircularProjectionGeometryXMLFileWriter.New() > xmlWriter.SetFilename ( sys.argv[2] ) > xmlWriter.SetObject ( geometry ); > xmlWriter.WriteFile(); > > # Create a stack of empty projection images > ConstantImageSourceType = rtk.ConstantImageSource[GPUImageType] > constantImageSource = ConstantImageSourceType.New() > origin = [ -127.75, -127.75, 0. ] > sizeOutput = [ 512, 512, numberOfProjections ] > spacing = [ 0.5, 0.5, 0.5 ] > constantImageSource.SetOrigin( origin ) > constantImageSource.SetSpacing( spacing ) > constantImageSource.SetSize( sizeOutput ) > constantImageSource.SetConstant(0.) > > REIType = rtk.RayEllipsoidIntersectionImageFilter[CPUImageType, > CPUImageType] > rei = REIType.New() > semiprincipalaxis = [ 50, 50, 50] > center = [ 0, 0, 10] > # Set GrayScale value, axes, center... > rei.SetDensity(2) > rei.SetAngle(0) > rei.SetCenter(center) > rei.SetAxis(semiprincipalaxis) > rei.SetGeometry( geometry ) > rei.SetInput(constantImageSource.GetOutput()) > > # Create reconstructed image > constantImageSource2 = ConstantImageSourceType.New() > sizeOutput = [ 256 ] * 3 > origin = [ -63.75 ] * 3 > spacing = [ 0.5 ] * 3 > constantImageSource2.SetOrigin( origin ) > constantImageSource2.SetSpacing( spacing ) > constantImageSource2.SetSize( sizeOutput ) > constantImageSource2.SetConstant(0.) > t0 = time.time() > constantImageSource2.Update() > t1 = time.time() > print(t1-t0) > > # Graft the projections to an itk::CudaImage > projections = GPUImageType.New() > t0 = time.time() > rei.Update() > t1 = time.time() > print(t1-t0) > projections.SetPixelContainer(rei.GetOutput().GetPixelContainer()) > projections.CopyInformation(rei.GetOutput()) > projections.SetBufferedRegion(rei.GetOutput().GetBufferedRegion()) > projections.SetRequestedRegion(rei.GetOutput().GetRequestedRegion()) > > # FDK reconstruction > print("Reconstructing...") > FDKGPUType = rtk.CudaFDKConeBeamReconstructionFilter > feldkamp = FDKGPUType.New() > feldkamp.SetInput(0, constantImageSource2.GetOutput()) > feldkamp.SetInput(1, projections) > feldkamp.SetGeometry(geometry) > feldkamp.GetRampFilter().SetTruncationCorrection(0.0) > feldkamp.GetRampFilter().SetHannCutFrequency(0.0) > t0 = time.time() > feldkamp.Update() > t1 = time.time() > print(t1-t0) > > > To be honest I don't see to do at this stage... Can you maybe check the > same code with your two versions ? Any other suggestion? > Simon > > > > On Wed, Nov 10, 2021 at 10:03 AM Moritz Schaar <sch...@imt.uni-luebeck.de> > wrote: > > Hi Simon, > > > > I completely agree that this is hard to track down. That’s why I am asking > for directions J > > To be more precise about the execution times of my example: > > The timings given in pairs of 17.1/1.2 s and 19/7 s are only the required > times of the reconstruction step itself. > > Reading data, pre and post processing are not part of this time > measurement. > > So the 7 s average in python is similar to the 6.41 s I obtained from > adding everything done in CudaFDKConeBeamReconstructionFilter using > RTK_PROBE_EACH_FILTER. > > The reconstruction step in python simply involves: > > - Instantiation of a simple class, this doesn’t add anything to > the timings > > - Setting up ConstantImageSource with either rtk.Image or > rtk.CudaImage > > - Setting up > FDKConeBeamReconstructionFilter/CudaFDKConeBeamReconstructionFilter > > - Setting inputs, geometry and filter > > - Update() and return result > > > > Looks like there was a typo in my mail, the versions compared should be: > > old: CUDA 10.2, ITK 5.1.2, RTK 2.1.0 > > new: CUDA 11.5, ITK 5.2.1, RTK 2.3.0 > > > > Sorry for the confusion and thanks for looking into it! > > > > Best, > > Moritz > > > > > > *Von:* Simon Rit <simon....@creatis.insa-lyon.fr> > *Gesendet:* Mittwoch, 10. November 2021 09:32 > *An:* Moritz Schaar <sch...@imt.uni-luebeck.de> > *Cc:* rtk-users@public.kitware.com > *Betreff:* Re: [Rtk-users] Slow CUDA FDK performance > > > > Hi Moritz, > > Thanks for the report. It's a bit hard to be convinced that something is > wrong without being able to reproduce it. From the RTK_PROBE_EACH_FILTER > log, most of the time is spent reading the projections which will be the > same with or without cuda so I wonder if this is not the issue here. I can > try to reproduce the issue, can you just confirm the two configurations : > Cuda 10.2, ITK 5.2.1, RTK 2.1.0 vs Cuda 11.5, ITK 5.2.1 RTK 2.3.0 ? > > Thanks, > > Simon > > > > On Fri, Nov 5, 2021 at 4:20 PM Moritz Schaar <sch...@imt.uni-luebeck.de> > wrote: > > Hi, > > > > I recently upgraded my Windows 10 system to ITK 5.2.1 including RTK 2.3.0. > > This also involved upgrading CUDA from 10.2 to 11.5, Visual Studio 2019 > and even python update (3.8.5 to 3.8.12). > > Using the python wrapping of RTK I implemented own routines that use FDK > similar to the rtkfdk application. > > On the old system (ITK 5.2.1, RTK 2.1.0) I benchmarked the FDK for a > 512x512x200 dataset reconstructed into 256x256x256 with 1.0 mm isotropic > voxel size. > > The system is equipped with 24 CPU cores and one RTX 2080 Ti, so the CPU > version took 17.1 and the CUDA version 1.2 seconds. > > Running the new software version on the same system results in roughly 19 > s CPU time but more than 7 s for the CUDA version. > > I don’t care about the actual timings but the relative increase of the > CUDA version is what bothers me. > > > > To dig up some more information I recompiled RTK with > RTK_PROBE_EACH_FILTER and ran rtkfdk.exe for the same data, this is what I > got: > > > ************************************************************************************************************** > > Probe Tag Starts Stops Time > (s) Memory (kB) Cuda memory (kB) > > > ************************************************************************************************************** > > ChangeInformationImageFilter 200 200 > 0.0211846 0 0 > > ConstantImageSource 1 1 > 0.0305991 65668 0 > > CudaCropImageFilter 13 13 > 0.0222911 15786.8 15753.8 > > CudaDisplacedDetectorImageFilter 13 13 > 0.0540568 10719.1 16384 > > CudaFDKBackProjectionImageFilter 13 13 > 0.0326397 5051.38 5041.23 > > CudaFDKConeBeamReconstructionFilter 1 1 > 5.72999 552184 211648 > > CudaFDKWeightProjectionFilter 13 13 > 0.0262806 -13892 630.154 > > CudaFFTRampImageFilter 13 13 > 0.148416 43095.4 12499.7 > > CudaParkerShortScanImageFilter 13 13 > 0.0467202 2525.85 15753.8 > > ExtractImageFilter 13 13 > 0.0259726 15812.3 -15753.8 > > ImageFileReader 200 200 > 0.0226735 -0.16 0 > > ImageSeriesReader 200 200 > 0.066097 6.12 0 > > ProjectionsReader 1 1 > 26.0388 208488 0 > > StreamingImageFilter 2 2 16.0663 > 547512 191840 > > VnlRealToHalfHermitianForwardFFTImageFilter 2 2 > 0.0208174 0 0 > > > > Following the conversion on the mailing list, > https://public.kitware.com/pipermail/rtk-users/2018-July/010617.html, I > see that the CudaFDKConeBeamReconstructionFilter takes 6.41 s of which > roughly 1/3 is spent in the CudaFFTRampImageFilter. > > Sadly I don’t have these results for the old software version so I can’t > relate these values. > > > > However, I also played around with v2.2.0 but it doesn’t make a difference. > > Sadly, the version I used before (v2.1.0) won’t compile with CUDA 11.5 > anymore. I tried to add small adjustments e.g. this commit > https://github.com/SimonRit/RTK/commit/3d3c7506087f5fa98aee75df5af5c30e7e51cbe6 > to make things work but this didn’t work. > > The same happens with other errors when trying to setup ITK 5.1.2, so > getting back the old version for comparison seems impossible. > > > > Is there any direction you can point me to check what is actually the > issue here? Or maybe someone has an idea what could be the reason? > CUDA/RTK/ITK > version? > > Any help is appreciated. > > > > *Best,* > > *Moritz* > > > > _______________________________________________ > Rtk-users mailing list > Rtk-users@public.kitware.com > https://public.kitware.com/mailman/listinfo/rtk-users > >
_______________________________________________ Rtk-users mailing list Rtk-users@public.kitware.com https://public.kitware.com/mailman/listinfo/rtk-users