Hi,

> I checked the node running the job and there seems to be no memory issue, but out of 2 MPI x 30 threads apparently only two threads are running at 100% CPU usage and the rest at about 5% each.

This is the typical inefficiency you get when you have too many
threads per MPI process.

Each thread handles one particle in Polish.

If you have 35 particles on a movie, all threads handle
the first 30 particles in the first round but in the next round,
only 5 threads have particles to work on. 25 threads become idle.
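The arithmetic above can be sketched in a few lines of Python. This is only
an illustrative model, not RELION code: it assumes one particle per thread
per round and equal processing time per particle, and the function name
thread_utilization is made up for this example.

```python
import math


def thread_utilization(n_particles: int, n_threads: int) -> float:
    """Fraction of thread slots that have a particle to work on.

    Simplified model (an assumption, not RELION's actual scheduler):
    particles on one movie are processed in rounds, each thread takes
    one particle per round, and every particle takes the same time.
    """
    if n_particles <= 0 or n_threads <= 0:
        raise ValueError("both counts must be positive")
    rounds = math.ceil(n_particles / n_threads)
    return n_particles / (rounds * n_threads)


if __name__ == "__main__":
    # Compare a few thread counts for a movie with 35 particles.
    for threads in (30, 15, 8, 4):
        print(threads, round(thread_utilization(35, threads), 3))
```

In this model, 35 particles on 30 threads fill 35 of 60 thread slots
(about 58%), while fewer threads per MPI process waste fewer slots.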

> Our file system is based on ZFS and we use the --sbs option. Previously, when we tried to use many MPI processes in polishing, disk I/O seemed to become slow. Is that expected?

The file system itself is not very important; the underlying hardware
is.

> Is it useful for polishing to have the number of frames divisible by the number of threads, like in motioncor?

No, because parallelization is over particles, not frames.

> Or could the problem be related to the --only_do_unfinished issue discussed in https://github.com/3dem/relion/issues/985 ?

I believe Polish does not have this problem.

Best regards,

Takanori Nakane

On 14.11.24 13:58, Takanori Nakane wrote:

Hi,

> We normally don't use many Mpi to limit disk load - perhaps at "combining frames" step it is OK to use many Mpi and less threads?

In general, having more MPI processes is more efficient,
provided that your file system can handle many parallel reads.
If you are using CephFS, GPFS, etc, this is the case.
If you are using a simple disk, or a RAID with not many disks,
that is surely limiting.

The combining step is more I/O bound than the earlier
trajectory estimation step. If you cannot increase the number of
MPI processes in the earlier step, you cannot do so in this step either.

> Could it be memory leak problem similar to "subtracting particles"

First, that is not a memory leak in RELION. It is caused by a
bad implementation or bad parameters of malloc.

You can check this by looking at the memory consumption.
Unlike subtraction, Polish supports continuation of a job.
If in doubt, you can kill and continue the job.
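One way to watch the memory consumption on a Linux node is to read the
resident set size from /proc. The sketch below is only an illustration:
it assumes a Linux /proc file system, and the helper names are made up
for this example. You need to supply the PID of the running process
yourself.

```python
import re
from typing import Optional


def parse_vmrss_kib(status_text: str) -> Optional[int]:
    """Extract the resident set size in KiB from /proc/<pid>/status text.

    Returns None when no VmRSS line is present (e.g. kernel threads).
    """
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid: int) -> Optional[int]:
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vmrss_kib(handle.read())
```

Calling read_rss_kib(pid) every few minutes and checking whether the value
keeps growing is enough to tell whether memory is the issue.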

Best regards,

Takanori Nakane

On 11/14/24 21:49, Leonid Sazanov wrote:
Hi, we are having some trouble finishing the polishing step. There are about 11k movies from K3 super-resolution (pixel 0.53), motion-corrected in relion5 and binned to pixel 1.06, with about 80 particles per image and a resolution of about 2.3 A. We use 2 MPI processes with 30 threads each and 16 GB memory per thread. The process seems to get slower and slower as it goes through the last "combining frames" step: initially it went through about 7k movies in one day, and now on the third day it is at 8k movies and predicts at least another 5 days. We normally don't use many MPI processes, to limit disk load. Perhaps at the "combining frames" step it is OK to use many MPI processes and fewer threads? Could it be a memory leak problem similar to "subtracting particles"? We do have this problem on our system and could try to fix it with a previous solution for subtraction. Any other options?

Many thanks for any info!
Best
Leonid

########################################################################

To unsubscribe from the CCP4BB list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1

This message was issued to members of www.jiscmail.ac.uk/CCP4BB, a mailing list hosted by www.jiscmail.ac.uk, terms & conditions are available at https://www.jiscmail.ac.uk/policyandsecurity/

