Re: [bug #23618] queuing system for multi processors is not well designed.


Posted by Edward d'Auvergne on June 08, 2015 - 17:33:
Hi Troels,

Please see below:


On 27 May 2015 at 02:10, Troels E. Linnet
<NO-REPLY.INVALID-ADDRESS@xxxxxxx> wrote:
URL:
  <http://gna.org/bugs/?23618>

                 Summary: queuing system for multi processors is not well designed.
                 Project: relax
            Submitted by: tlinnet
            Submitted on: Wed 27 May 2015 12:10:57 AM UTC
                Category: relax's source code
Specific analysis category: None
                Priority: 5 - Normal
                Severity: 3 - Normal
                  Status: None
             Assigned to: None
         Originator Name:
        Originator Email:
             Open/Closed: Open
                 Release: Repository: trunk
         Discussion Lock: Any
        Operating System: All systems

    _______________________________________________________

Details:

The queuing system for multi processors appears not to be well designed.

This has been detected in a dispersion analysis:
a clustered fit of 74 spins, running 100 Monte Carlo simulations.

The test was run with the number of multi processors set to 10, with 1 CPU as
the master.

The problem seems to reside in:
multi.processor.run_queue()
multi.multi_processor.chunk_queue()

The current queuing system takes the 100 Monte Carlo simulations, chunks them
up into pieces of 10, and distributes one of these chunks to each CPU.

Each CPU thus has 10 simulations to handle.

The problem is that not all simulations are equally fast to solve.
Thus, a CPU will "hang" until all simulations have finished.
This "blocks" the possibility of assigning CPU power to other tasks until
all simulations have finished.
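
To illustrate the behaviour described above, here is a minimal Python sketch
of static chunking (only a sketch, not the actual
multi.multi_processor.chunk_queue() code):

--------------
def chunk_queue(jobs, n_slaves):
    """Split the job list into one chunk per slave (sketch, not relax's code).

    Assumes the job count divides evenly by the slave count.
    """
    size = len(jobs) // n_slaves
    return [jobs[i * size:(i + 1) * size] for i in range(n_slaves)]

sims = list(range(100))          # The 100 Monte Carlo simulations.
chunks = chunk_queue(sims, 10)   # 10 slave processors, 10 simulations each.

# Each slave owns a whole chunk, so the wall-clock time is set by the slowest
# chunk, and a slave that finishes its chunk early sits idle.
--------------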

A suggestion for a "first" fix is not to chunk up the queue,
but to let each simulation be handled independently.

In multi/processor.py
--------------
-        lqueue = self.chunk_queue(self.command_queue)
-        self.run_command_queue(lqueue)
+        #lqueue = self.chunk_queue(self.command_queue)
+        self.run_command_queue(self.command_queue)
-------------

This does not seem to improve the timing much, but it gives a better overview
of the process.

This is actually a balancing act which depends on the data transfer
rate between the nodes and the per-node computation time.  For
applications where data transfer is rate limiting (either data
transfer is slow, or the calculations are relatively very fast), the
chunking is very, very useful.  This is the case for the per-residue
level parallelisation of the model-free analyses.
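
As a toy cost model only (assuming a fixed per-message transfer overhead,
with illustrative numbers rather than measured relax data), the trade-off can
be sketched as:

--------------
def wall_time(n_jobs, n_cpus, t_compute, t_transfer, chunked):
    """Rough per-CPU wall time: transfer overhead paid per chunk or per job."""
    if chunked:
        return (n_jobs / float(n_cpus)) * t_compute + t_transfer
    else:
        return (n_jobs / float(n_cpus)) * (t_compute + t_transfer)

# Fast calculations but slow transfer - chunking wins by a large margin.
print(wall_time(100, 10, t_compute=0.01, t_transfer=0.5, chunked=True))   # ~0.6 s
print(wall_time(100, 10, t_compute=0.01, t_transfer=0.5, chunked=False))  # ~5.1 s
--------------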


It appears that the queuing system can be enhanced even more.
The list of the "Running set" is not replenished before all jobs in the
"Running set" are completed.
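
For illustration only, here is a generic sketch of the kind of per-job
replenishment being described, using the Python standard library rather than
relax's MPI-based multi package:

--------------
from multiprocessing import Pool
import random
import time

def run_simulation(sim_index):
    """Stand-in for one Monte Carlo simulation with an uneven run time."""
    time.sleep(random.uniform(0.1, 0.5))
    return sim_index

if __name__ == '__main__':
    pool = Pool(processes=10)
    # imap_unordered hands out one job at a time and yields results as they
    # complete, so an idle worker immediately picks up the next simulation.
    for result in pool.imap_unordered(run_simulation, range(100)):
        pass
    pool.close()
    pool.join()
--------------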

This is not what I remember as happening.  I remember clearly seeing
the queue being replenished.  Maybe a bug has been introduced.  Or
maybe this new bug is specific to the parallelisation of Monte Carlo
simulations, and not the other parallelisations.  We need to get to
the bottom of this.


This influences the solving time.


----
Only 20 Monte Carlo simulations were run for comparison.
/usr/bin/time -p relax_multi bug.py

The running time for 1 CPU, no multi processor:
real 510.94
user 5903.01
sys 133.96

The running time for 1 CPU, 4 multi processor:
real 214.89
user 1786.39
sys 37.09

The running time for 1 CPU, 10 multi processor:
real 108.39
user 1930.21
sys 44.45


The running time for 1 CPU, 4 multi processor with first fix:
real 235.46
user 1892.20
sys 38.58

The running time for 1 CPU, 10 multi processor with first fix:
real 110.50
user 1957.99
sys 43.60
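
For reference, the speedups implied by the wall-clock ('real') times above
can be checked with a few lines (the numbers are simply those reported
above):

--------------
serial = 510.94
for label, t in [('4 processors', 214.89), ('10 processors', 108.39),
                 ('4 processors + fix', 235.46), ('10 processors + fix', 110.50)]:
    print('%s: %.1fx speedup' % (label, serial / t))
# Gives roughly 2.4x, 4.7x, 2.2x and 4.6x respectively.
--------------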

What is the 'relax_multi' file?  The times with the fix look to be the
same.  I don't believe that this change is a fix though, and you
should probably revert it.  For the 4 to 10 processor 'sys' time
increase, this might be due to data transfer being a bottleneck.  I
cannot however check this, as I don't know how to execute the 'bug.py'
script yet ;)

Cheers,

Edward


