The scaling is looking awesome! Obviously the MC sims will need work
and other tests will be required. But the functionality of the branch
is looking very promising and exciting. I have more responses below.
Please expect delayed responses to the previous messages. I will
respond to your posts, Gary, but I'm three days away from leaving
Australia and am flat out organising and packing. I'll then be
spending a week in London before heading to Germany. It could be a
few weeks before I'll be able to respond properly to posts.
On 4/20/07, Gary S. Thompson <garyt@xxxxxxxxxxxxxxx> wrote:
Dear All
I have now had a chance to do some true multitasking on our local
cluster, with real overhead from interprocess communication, and the
results are as follows:
processors    min    eff       mc    eff      grid    eff
     1         18    100       80    100       134    100
     2          9    100
     4          5     90
     8          3     75
    16          1    112.5
    32          1     56.25     8     31.25      4    104.6
and the picture that speaks 1000 words
key:
top graph, black line - achieved runtimes
top graph, red line - expected runtimes with perfect scaling efficiency
bottom graph - scaling efficiency
some notes
0. data was collected on one of Chris's small data sets containing 28
residues, not all of which are active for minimisation

columns:
processors - number of slave MPI processors
min - time for a minimisation of models m1-m9 with a fixed diffusion
tensor
eff - approximate parallel efficiency, i.e. expected runtime / actual
runtime
It would be interesting to see if the efficiencies all converge to
100% when a larger number of spin systems are minimised. Maybe
duplicating the data a number of times, creating an artificially large
protein, would be useful in that regard.
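
A minimal sketch of that duplication idea, assuming a simple list of
residue records numbered from 1 (the real relax data structures will
of course differ):

    def duplicate_residues(residues, n):
        # Hypothetical helper: replicate a small residue list n times,
        # renumbering each copy so the fake protein has n * 28 unique
        # residues (assumes the originals are numbered 1..len(residues)).
        big = []
        for copy in range(n):
            for res in residues:
                clone = dict(res)
                clone["num"] = res["num"] + copy * len(residues)
                big.append(clone)
        return big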
mc - 256 Monte Carlo calculations
eff - efficiency of the above
grid - a grid search on an anisotropic diffusion tensor, 6 steps per
dimension
Do you mean the spheroid (axially symmetric) or the ellipsoid, as both
are anisotropic? I would recommend increasing the number of steps in
this grid search if MPI is running. With that type of scaling
efficiency, I would recommend 11 or 21 increments per dimension on a
32 processor cluster. A drop from 134 min to 4 min is huge!
eff - efficiency of the above
tests were run on a cluster of Opterons using gigabit Ethernet and MPI
1. these results are crude wall times as measured by Python's
time.time function for the master, but they do not include startup and
shutdown overhead
They should be more than accurate enough for these types of comparison.
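
Just for reference, a minimal sketch of that style of wall-clock
timing, where the timed() wrapper is purely illustrative and not
anything in the branch:

    import time

    def timed(func, *args):
        # Crude wall-clock timing: brackets the master's calculation
        # only, so MPI startup and shutdown overhead are excluded.
        start = time.time()
        result = func(*args)
        print("wall time: %.0f s" % (time.time() - start))
        return result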
2. these tests are single point measurements; there are no statistics
For now statistics are unnecessary.
3. timings were rounded to 1 second, so, for example, we must consider
the data points for more than 16 processors in the min run to be
suspect
That would explain the 56% efficiency for 32 processors: the expected
time is 18/32 = 0.5625 of a time unit, and against a measured time
rounded up to 1 the efficiency is 0.5625/1 = 56.25%.
The results also highlight some interesting considerations
1. our local cluster has very poor disk I/O, with the result that when
I first ran the calculations I saw no multiprocessor improvements in
the min run (in actual fact it got worse!). I got round this for this
crude test by switching off virtually all text output from the various
minimisation commands. Now obviously this isn't a long term solution,
but I can think of other methods, e.g. using an output thread on the
master or output batching, that would improve these results.
Both options sound good. This type of threading is not very
complicated, although debugging blocked threads is hell. Sending the
minimisation print out in one hit at the end would be very useful as
well.
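
To make the output thread idea concrete, here is a minimal sketch,
where output_queue and writer are illustrative names only and not
anything from the branch:

    import sys
    import threading
    import Queue

    output_queue = Queue.Queue()

    def writer():
        # Drain queued text to stdout so that slow disk I/O never
        # blocks the thread servicing the MPI calls.
        while True:
            text = output_queue.get()
            if text is None:  # sentinel used to shut the writer down
                break
            sys.stdout.write(text)
            sys.stdout.flush()

    thread = threading.Thread(target=writer)
    thread.start()

    # The MPI-servicing code then queues output instead of printing:
    output_queue.put("minimisation print out for one model\n")

    # At shutdown, send the sentinel and wait for the writer to finish:
    output_queue.put(None)
    thread.join()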
2. comparison of the results from the grid calculation and the other
calculations is quite informative. Clearly the grid results are
excellent. I believe the difference arises because I am returning
individual subtask results to the master as they complete, and the
resulting overhead from waiting for the master is a problem. To make
this clearer, here is an example: in the case of the mc run I take the
256 mc runs and distribute a batch of 8 to each processor (in the case
of a 32 processor run); I then return the results individually as they
complete. I believe this can lead to access to the master becoming the
bottleneck (this is most probably due to output overhead on stdout
again, though contention due to the coherence of the calculation
lengths could also be a problem).
In the Monte Carlo simulations, all of the output of the minimisations
is suppressed. Therefore the sending of the minimisation print out
shouldn't be the issue as nothing needs to be sent. There must be
something else at play. Finding out exactly what this is will be
important before investing in the threading or batching.
In the case of the grid there are no subtasks, as the grid is almost
ideally subdivided by processor, so only one task is run on each
slave.
Do the MC sims have the same scaling efficiency if only one simulation
is sent to each processor at once? Does the efficiency increase or
decrease?
I can see at least two answers to this. One is to batch the return of
results so all results get returned at once, and the second is to have
an output thread on the master, separate from the thread servicing MPI
calls, so that processing of returned data doesn't block the master
and thus the rest of the cluster.
As I mentioned above, both would be useful. However it would be good
to know which will be the most beneficial for increasing efficiency
before implementation. It could be that one or the other will not
result in any significant improvement.
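
As a very rough sketch of the batched return, assuming an mpi4py-style
interface (the branch may well use a different MPI layer, and
minimise_model() is just a hypothetical stand-in for the real
per-simulation calculation):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    def minimise_model(task):
        # Stand-in for the real per-simulation minimisation.
        return task

    def run_batch(tasks):
        # Run the whole batch of subtasks on the slave, then return
        # all results to the master in one message instead of one
        # send per completed simulation.
        results = [minimise_model(task) for task in tasks]
        comm.send(results, dest=0, tag=0)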
I have some more comments on the design of the current minimisation
interface, how text output from the commands is controlled, and unit
testing, but these will have to follow in another message later on
regards
gary
n.b. if the picture doesn't display well, my apologies
The picture at https://mail.gna.org/public/relax-devel/2007-04/msg00048.html
displays perfectly. I've been thinking about how you could release an
MPI relax version prior to the merging of a patch into the 1.3 line.
You could release a relax-1.3.0-gt version (gt for Gary Thompson).
This could itself have a few versions associated with it (I don't know
how they would be named though). What do you think, Gary?
Cheers,
Edward