Hi Troels,

Please see below.

On 27 May 2015 at 02:10, Troels E. Linnet <NO-REPLY.INVALID-ADDRESS@xxxxxxx> wrote:
URL: <http://gna.org/bugs/?23618>

Summary: queuing system for multi processors is not well designed
Project: relax
Submitted by: tlinnet
Submitted on: Wed 27 May 2015 12:10:57 AM UTC
Category: relax's source code
Specific analysis category: None
Priority: 5 - Normal
Severity: 3 - Normal
Status: None
Assigned to: None
Originator Name:
Originator Email:
Open/Closed: Open
Release:
Repository: trunk
Discussion Lock: Any
Operating System: All systems

_______________________________________________________

Details:

The queuing system for multi processors appears to be poorly designed. This was detected in a dispersion analysis: a clustered fit of 74 spins, with 100 Monte Carlo simulations. The test was run with 10 processors, with 1 CPU as master.

The problem seems to reside in:

multi.processor.run_queue()
multi.multi_processor.chunk_queue()

The current queuing system takes the 100 Monte Carlo simulations, chunks them into pieces of 10, and distributes one chunk to each CPU. Each CPU thus has 10 simulations to handle. The problem is that not all simulations are equally fast to solve, so a CPU will "hang" until all of its simulations have finished. This blocks the possibility of assigning CPU power to other tasks until all simulations have finished.

A suggestion for a first fix is not to chunk up the queue, but to let each simulation be handled independently. In multi/processor.py:

--------------
- lqueue = self.chunk_queue(self.command_queue)
- self.run_command_queue(lqueue)
+ #lqueue = self.chunk_queue(self.command_queue)
+ self.run_command_queue(self.command_queue)
-------------

This does not seem to improve the timing much, but it gives a better overview in the process.
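To make the described behaviour concrete, here is a minimal sketch (my own illustration, not the actual relax `chunk_queue()` implementation) of how splitting 100 simulations into one contiguous chunk per CPU can leave processors idle: each CPU only reports back once its whole chunk is done, so one slow simulation delays nine finished ones.

```python
# Hypothetical sketch of per-CPU chunking, NOT the real relax API.

def chunk_queue(queue, n_cpus):
    """Split the command queue into one contiguous chunk per CPU."""
    chunk_size = len(queue) // n_cpus  # 100 sims / 10 CPUs -> 10 per chunk
    return [queue[i:i + chunk_size] for i in range(0, len(queue), chunk_size)]

sims = list(range(100))      # stand-ins for 100 Monte Carlo simulations
chunks = chunk_queue(sims, 10)

# Each CPU processes its whole chunk before reporting back, so a single
# slow simulation in a chunk delays the other 9 results from that CPU.
assert len(chunks) == 10 and all(len(c) == 10 for c in chunks)
```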
This is actually a balancing act which depends on the data transfer rate between the nodes and the per-node computation time. For applications where data transfer is rate limiting (either data transfer is slow, or the calculations are relatively very fast), the chunking is very, very useful. This is the case for the model-free analyses, where the parallelisation is at the per-residue level.
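This trade-off can be illustrated with a toy cost model (my own numbers, not measured from relax): if every message between master and slave costs a fixed overhead, chunking divides the number of messages by the chunk size, which dominates when tasks are much cheaper than transfers.

```python
# Toy cost model for the chunking trade-off (illustrative assumptions only).

def total_messages(n_tasks, chunk_size):
    # Number of master<->slave dispatches needed (ceiling division).
    return -(-n_tasks // chunk_size)

n_tasks, overhead, task_time = 100, 0.5, 0.01  # transfer-limited case

unchunked = total_messages(n_tasks, 1) * overhead + n_tasks * task_time
chunked = total_messages(n_tasks, 10) * overhead + n_tasks * task_time

# With slow transfer and fast tasks, chunking wins: roughly 51 s vs 6 s.
print(unchunked, chunked)
```

With the opposite assumptions (cheap transfer, long and uneven tasks, as in the Monte Carlo case above), the overhead term becomes negligible and load balance dominates instead.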
It appears that the queuing system could be enhanced even more. The "Running set" list is not replenished before all jobs in the "Running set" are completed.
This is not what I remember happening. I clearly remember seeing the queue being replenished. Maybe a bug has been introduced, or maybe this new bug is specific to the parallelisation of the Monte Carlo simulations and not the other parallelisations. We need to get to the bottom of this.
This influences the solving time.

----
Only 20 Monte Carlo simulations were run for comparison:

/usr/bin/time -p relax_multi bug.py

1 CPU, no multi processor:                   real 510.94  user 5903.01  sys 133.96
1 CPU, 4 multi processors:                   real 214.89  user 1786.39  sys  37.09
1 CPU, 10 multi processors:                  real 108.39  user 1930.21  sys  44.45
1 CPU, 4 multi processors, with first fix:   real 235.46  user 1892.20  sys  38.58
1 CPU, 10 multi processors, with first fix:  real 110.50  user 1957.99  sys  43.60
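For reference, the speedups implied by the 'real' times above can be computed directly (simple arithmetic on the reported numbers, nothing more):

```python
# Speedup relative to the single-CPU run, from the 'real' times above.
serial = 510.94

speedup_4 = serial / 214.89    # 4 processors: roughly 2.4x
speedup_10 = serial / 108.39   # 10 processors: roughly 4.7x

print(round(speedup_4, 1), round(speedup_10, 1))
```

So 10 processors give well under 10x, consistent with some combination of load imbalance and data transfer overhead.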
What is the 'relax_multi' file?

The times with the fix look to be the same. I don't believe that this change is a fix though, and you should probably revert it.

For the 4 to 10 processor 'sys' time increase, this might be due to data transfer being a bottleneck. I cannot check this yet however, as I don't know how to execute the 'bug.py' script ;)

Cheers,

Edward