Hi David,

raising the 99*99 limit does not yet appear to be needed for a single such 
machine - you'd get better performance by running XDS with, e.g., the following 
in XDS.INP:

MAXIMUM_NUMBER_OF_JOBS=4          ! number of processes spawned
MAXIMUM_NUMBER_OF_PROCESSORS=32   ! number of threads per job

Other combinations e.g. 8/16 or 5/25 or even some overcommitment (e.g. 3/50 or 
6/25) may be even better, depending on OSCILLATION_RANGE, DELPHI and the total 
number of frames. Please see 
https://strucbio.biologie.uni-konstanz.de/xdswiki/index.php/Performance !

With a cluster of 24 such machines, and using the 4/32 combination, you are 
still slightly below the limit of 99 "JOBS" for a single XDS run.
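
As a back-of-the-envelope check (assuming the 64-core/128-thread 3990X mentioned 
below): 4 JOBS x 32 PROCESSORS = 128 threads, i.e. one per hardware thread, and 
24 machines x 4 JOBS = 96 "JOBS", still under the limit of 99.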

I don't foresee that more than 99 threads per "JOB" will be performant in the 
next few years, but once somebody wants to use more than 99 "JOBS" we'll change 
the limits.

If you plan for a Threadripper 3XXX, make sure to buy DDR4-3200 JEDEC memory, 
but don't go overboard with the amount - above certain limits (32 GB, I think; 
I could not find the exact specs for this), the maximum speed with which the 
memory controller can drive the RAM goes down (to 2933 or less). This is - 
together with the limited network capacity of 10Gb ports - why one may prefer 
two 32-core TRs over the 64-core TR.

best wishes,
Kay 

On Wed, 19 Feb 2020 08:38:20 -0500, David Schuller <schul...@cornell.edu> wrote:

>Thank you for the info. Another of the Threadripper 3xxx series, the 3990X, 
>has 64 cores for 128 threads, so perhaps it is time to raise that 99*99 
>limit in XDS.
>
>
>
>On 2020-02-19 07:21, Kay Diederichs wrote:
>> Dear Ana,
>>
>> it is easy to ask the question (and I've been asked several times), but 
>> somewhat difficult to answer. To add to Graeme's excellent explanations:
>>
>> - all developers of MX processing software have seriously considered 
>> implementing their algorithms on GPUs, and have decided that the effort 
>> (which is very significant) is not worth it, in terms of benefit for users 
>> and developers (who should pay them for the effort? - after all, this would 
>> not result in a highly cited publication!). We are aware that GPUs are much 
>> faster than CPUs for specific types of calculations, most of all for images 
>> where each pixel is treated in the same way - but these types of 
>> (potentially highly parallel) calculations do not represent a large fraction 
>> of where an MX data processing program spends its time, and even worse, the 
>> parallel and serial parts of the calculation alternate in fast succession 
>> (XDS has on the order of 10 parallel regions, none of which dominates the 
>> CPU time). Ultimately, it is the serial fraction of a program that 
>> determines its potential speed-up, due to Amdahl's law.
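>>
>> (As a rough illustration of Amdahl's law, with made-up numbers: if a fraction 
>> p = 0.95 of the run time is parallelisable, the speed-up on N cores is 
>> 1 / ((1 - p) + p/N), i.e. about 5.9x on 8 cores, 12.6x on 32 and 17.4x on 
>> 128, and never more than 20x however many cores you add.)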
>>
>> - MX data processing programs (at least XDS and DIALS) already exploit 
>> parallelism by using multiple CPUs at the same time; the current version of 
>> XDS can in principle use up to 99*99=9801 processors, and 60 machines each 
>> running 60 threads (see below) would process a 360° dataset composed of 0.1° 
>> frames within seconds, if DELPHI=6.
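>>
>> (To spell out the arithmetic behind that sketch: 360° in 0.1° frames is 3600 
>> frames; with DELPHI=6 these fall into 360/6 = 60 batches of 6°/0.1° = 60 
>> frames each, so 60 machines with 60 threads can in principle handle one 
>> batch per machine and one frame per thread.)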
>>
>> - the recent Ryzen Threadripper 3XXX series CPUs have a significantly better 
>> cost/performance ratio than other processor families. A TR 3970X workstation 
>> can be bought for less than 5000€, and offers 64 threads. Graeme mentions 
>> AMD Rome; this is the server variant. The data transfer would become the 
>> bottleneck; to me, a cluster of workstations each equipped with two 10Gb 
>> ports looks attractive.
>>
>> Finally, I have the feeling that speed in data collection and processing is 
>> overrated. I get the impression that some (many?) people think they should 
>> collect a data set as quickly as the machine permits. But they may not be 
>> aware that the quality of the data is then not optimal. Going 10 times 
>> slower, and reducing the transmission to 10%, gives a reasonable safety 
>> margin.
>>
>> Further questions arise - does every crap crystal have to be put into the 
>> beam? And does every crap data set have to be processed? Do all of us really 
>> want and need to collect from thousands of crystals every synchrotron day? 
>> Are all of us really producing that many crystals? Who is? (you probably 
>> realize my lack of imagination by now)
>> I know that people who build and run synchrotron beamlines have a different 
>> perspective, concerning these questions, than their users. Some common 
>> sense, and a lot of discussion, would benefit our community more than 
>> resorting to technological "solutions".
>>
>> best wishes,
>> Kay
>>
>> On Wed, 19 Feb 2020 08:08:40 +0000, Winter, Graeme (DLSLtd,RAL,LSCI) 
>> <graeme.win...@diamond.ac.uk> wrote:
>>
>>> Dear Ana,
>>>
>>> To follow up on the contributions from others, there are some particular 
>>> annoyances with MX processing which differentiate it from other “big data” 
>>> or imaging problems.
>>>
>>> In tomographic reconstruction you have a big block of data which needs to 
>>> (as a simplistic approximation) be transformed by a bunch of trigonometric 
>>> functions to another big block of data. The shape of the calculation is the 
>>> same independent of the data itself, and overall this represents a 
>>> massively parallel computationally expensive problem, which makes it worth 
>>> the cost of getting the data in and out of the GPU (this is not cheap) - 
>>> even in this case, the parallelism of modern CPUs means that this is not a 
>>> given. These folks are usually the ones who are making a lot of noise about 
>>> how awesome GPU boards are, and for their use case this is absolutely true.
>>>
>>> In MX we have a particularly annoying problem, as about half of the 
>>> calculations are nicely parallel (spot finding, peak integration) and are 
>>> memory-bandwidth / CPU-breadth limited, while the other half (indexing, 
>>> refinement, scaling) are not very parallel and are CPU-speed bound, so 
>>> finding the best CPU architecture is hard to start with. In terms of GPU, 
>>> the data typically need to pass through main memory three times - for spot 
>>> finding you need to look at every pixel, and integration typically needs to 
>>> load full frames to extract the profiles and then fit them (the shoebox 
>>> regions can be cached between these, but they still need to pass in and out 
>>> of the CPU). Since moving data in and out of memory is expensive and GPU 
>>> memory is expensive, this is a problem. For reference, a typical Eiger 16M 
>>> data set uncompressed needs about half a terabyte of RAM (7,200 * 18 
>>> megapixels * 4 bytes), so in-memory processing presents real challenges. 
>>> The image analysis calculations themselves are typically rather lightweight 
>>> floating-point work (e.g. summed-area-table calculations) without a lot of 
>>> trigonometry.
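>>>
>>> (For a concrete, if simplified, sketch of what such lightweight per-pixel 
>>> work looks like - this is illustrative Python, not actual DIALS code - a 
>>> summed-area table turns any rectangular window sum into four look-ups:
>>>
>>> import numpy as np
>>>
>>> def summed_area_table(image):
>>>     # S[i, j] holds the sum of image[:i+1, :j+1]; two cumulative-sum passes
>>>     return image.astype(np.float64).cumsum(axis=0).cumsum(axis=1)
>>>
>>> def window_sum(sat, r0, c0, r1, c1):
>>>     # sum over the inclusive window image[r0:r1+1, c0:c1+1]
>>>     total = sat[r1, c1]
>>>     if r0 > 0:
>>>         total -= sat[r0 - 1, c1]
>>>     if c0 > 0:
>>>         total -= sat[r1, c0 - 1]
>>>     if r0 > 0 and c0 > 0:
>>>         total += sat[r0 - 1, c0 - 1]
>>>     return total
>>>
>>> Each pixel is touched only a few times, so the cost is dominated by moving 
>>> the data rather than by the arithmetic. And the half-terabyte figure above 
>>> is just 7200 frames * 18e6 pixels * 4 bytes ≈ 518 GB.)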
>>>
>>> All this, combined with the annoying habit of using words like “if” and 
>>> “for” in the code (which kills GPU calculations dead) means that even for 
>>> spot finding it’s not worth the effort of moving the data into a GPU - we 
>>> DIALS folks looked into this a couple of years back with a specialist from 
>>> NVIDIA.
>>>
>>> For what it’s worth, we have spent some time looking at this here at 
>>> Diamond, where we have a certain interest in speedy processing of MX data, 
>>> and the current (2020/02) best bang for buck appears to be AMD Rome.
>>>
>>> We as a community have a challenge with keeping up with high data rate 
>>> beamlines at both synchrotrons and FELs - I feel it is important to keep an 
>>> eye on emerging technology and make best use of it (and share experiences 
>>> of using it!) but we should also keep in mind that the processing done in 
>>> MX is actually rather well defined and mathematical at its heart. It is 
>>> very unlikely that deep learning will help with the mathematical challenges 
>>> we face [1] as we know exactly the calculations we need to do (which are 
>>> very well documented in the literature, thank you to everyone who has 
>>> written these up over the years) and instead a clear focus on making the 
>>> maths fast is needed.
>>>
>>> Up to the point where someone comes up with a completely new way of looking 
>>> at the data, of course. I’m sure someone out there is looking at this :-)
>>>
>>> On the topic of raspberry pi machines ;-) these are fun but I would hate to 
>>> look at the interconnect necessary to get enough boards to work together to 
>>> keep up with a single AMD Rome box…
>>>
>>> best wishes Graeme
>>>
>>> [1] with the possible exception of classifying individual found spots and 
>>> other niche areas
>>>
>>>
>>> On 19 Feb 2020, at 07:04, Leonarski Filip Karol (PSI) 
>>> <filip.leonar...@psi.ch> wrote:
>>>
>>> Dear Ana,
>>>
>>> To benefit from GPU architecture over CPU, the algorithm needs to do quite 
>>> significant number crunching – i.e. at least a certain number of 
>>> floating-point operations (FLOP) per byte of data. It also needs to be 
>>> highly parallel, preferably without conditional (if/else) statements. 
>>> Finally, there is a variety of GPU architectures on the market and it is 
>>> not at all obvious that code written for one GPU will be optimal on another 
>>> one. So if the code is based on a general-purpose library, it is easier to 
>>> make sure that it runs efficiently on all GPU hardware.
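>>>
>>> (For a rough sense of scale, with illustrative numbers rather than 
>>> benchmarks: a spot-finding pass that does ~10 FLOP on each 4-byte pixel has 
>>> an arithmetic intensity of ~2.5 FLOP/byte, while a GPU offering ~10 TFLOP/s 
>>> and ~500 GB/s of memory bandwidth only becomes compute-bound above roughly 
>>> 20 FLOP/byte - such a kernel is therefore limited by data movement, not by 
>>> the GPU's arithmetic.)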
>>>
>>> I believe the combination of these factors makes for the big difference 
>>> between imaging and MX.
>>>
>>> Imaging data processing is limited by FFT performance, which needs 
>>> floating-point performance. FFT libraries for GPUs are standard and 
>>> provided by the hardware vendors, so this is easy to implement.
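>>>
>>> (A minimal sketch of what that looks like in practice, assuming the CuPy 
>>> library - which wraps NVIDIA's cuFFT - is available:
>>>
>>> import numpy as np
>>> import cupy as cp
>>>
>>> img = np.random.rand(2048, 2048).astype(np.float32)
>>> f_cpu = np.fft.fft2(img)                # FFT on the CPU
>>> f_gpu = cp.fft.fft2(cp.asarray(img))    # same call, run on the GPU by cuFFT
>>>
>>> The vendor library does the heavy lifting, which is why FFT-bound imaging 
>>> codes port to GPUs so readily.)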
>>>
>>> On the other hand, MX algorithms for image processing, at least the ones I 
>>> know of, do only a handful of FLOP per pixel, and they will probably not 
>>> benefit significantly from GPU processing, even if ported to such an 
>>> architecture – which would also be a non-negligible effort. So while it is 
>>> not impossible to imagine GPU-accelerated MX software, and hopefully people 
>>> are working on this, it is not a low-hanging fruit like GPU acceleration 
>>> for imaging or cryo-EM.
>>>
>>> On a side note, if one could find a way to use machine learning for data 
>>> processing and implement the data processing pipeline in TensorFlow, then 
>>> GPUs would pay off quickly.
>>>
>>> Regarding Tim’s Raspberry Pi argument – it should be compared with the 
>>> price of an Nvidia Jetson, which is more or less an RPi with a GPU, and the 
>>> difference won’t actually be that significant.
>>>
>>> Best,
>>> Filip
>>>
>>>
>>> From: CCP4 bulletin board <CCP4BB@JISCMAIL.AC.UK> on behalf of Ana 
>>> Carolina de Mattos Zeri <ana.z...@lnls.br>
>>> Reply to: Ana Carolina de Mattos Zeri <ana.z...@lnls.br>
>>> Date: Tuesday, 18 February 2020 at 20:58
>>> To: "CCP4BB@JISCMAIL.AC.UK" <CCP4BB@JISCMAIL.AC.UK>
>>> Subject: [ccp4bb] MX data processing with GPUs??
>>>
>>> Dear all
>>> we have asked this of a few people, but the question remains:
>>> have any of you experienced/tried using GPU-based software to treat MX 
>>> data, for reduction or for subsequent image analysis?
>>> is it a lost battle?
>>> how do you deal with the growing amount of data we are facing at 
>>> synchrotrons and XFELs?
>>> Here at the Manaca beamline at Sirius we will continue to support CPU-based 
>>> software, but due to developments at the imaging beamlines, GPU machines 
>>> are looking very attractive.
>>> many thanks in advance for your thoughts,
>>> all the best
>>> Ana
>>>
>>>
>>> Ana Carolina Zeri, PhD
>>> Manaca Beamline Coordinator (Macromolecular Micro and Nano Crystallography)
>>> Brazilian Synchrotron Light Laboratory (LNLS)
>>> Brazilian Center for Research in Energy and Materials (CNPEM)
>>> Zip Code 13083-970, Campinas, Sao Paulo, Brazil.
>>> (19) 3518-2498
>>> www.lnls.br
>>> ana.z...@lnls.br
>>>
>
>
>-- 
>=======================================================================
>All Things Serve the Beam
>=======================================================================
>                                David J. Schuller
>                                modern man in a post-modern world
>                                MacCHESS, Cornell University
>                                schul...@cornell.edu
>
