Re: The multi-processor branch.

Posted by Gary S. Thompson on May 18, 2007 - 10:11:
Edward d'Auvergne wrote:

On Fri, 2007-05-04 at 13:59 +0100, Gary S. Thompson wrote:



the processor implementation gives some feedback as to what processor
you are running:

M S> script
M S>
M S>
M S>
M S>                                      relax repository checkout
M S>
M S>                           Protein dynamics by NMR relaxation data
M S>
M S>                              Copyright (C) 2001-2006 Edward
M S>
M S> This is free software which you are welcome to modify and
redistribute under the conditions of the
M S> GNU General Public License (GPL).  This program, including all
modules, is licensed under the GPL
M S> and comes with absolutely no warranty.  For details type 'GPL'.
Assistance in using this program
M S> can be accessed by typing 'help'.
M S>
M S> processor = MPI running via mpi4py with 5 slave processors & 1
master, mpi version = 1.2
M S>
M S> script = ''
M S>

If using '-np 6', shouldn't the number of slaves be 6?

Nope. There needs to be one processor which is the master, and you just tell MPI how many processors you want in total, so '-np 6' gives 1 master and 5 slaves. (I will investigate running jobs on the master in a thread at some point, maybe never depending ;-), but this places extra requirements on the MPI implementation and is thus a special case. I can give more details if you want me to.)

note the processor =  line

The processor line is in line with the 'script = ...' notation and that
information would be useful if the output is caught and stored in a log
file.  I like that touch.

another couple of things to note are that the output from the program
is prepended with some text indicating which stream and which
processors the output is coming from: The output prefix is divided
into two parts

'processor' 'stream'>  [normal output line]

where processor is either a number to identify the rank of the processor, or
a series of M's to indicate the master
stream is either E or S for the error or output streams
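
A minimal sketch of the prefixing described above. The helper name and its arguments are hypothetical and not taken from the multi module, and a single 'M' is assumed for the master, as in the log fragments:

```python
def prefix_lines(text, rank, stream, master=False):
    """Prefix each line of text with 'processor stream> '."""
    # Hypothetical helper: the master is shown as 'M', slaves by their rank.
    who = "M" if master else str(rank)
    lines = text.split("\n")
    # rstrip() keeps empty lines as a bare prefix, e.g. '1 S>'.
    return "\n".join(("%s %s> %s" % (who, stream, line)).rstrip() for line in lines)


print(prefix_lines("Hessian calls:    0\nWarning:          None\n", 1, "S"))
```

The stream letter is 'S' for standard output and 'E' for the error stream, matching the description above.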

so here is another fragment

1 S> Hessian calls:    0
1 S> Warning:          None
1 S>
M S> idle set set([1, 2])
M S> running_set set([2, 3, 4, 5])
M S>
2 S>
2 S>
2 S> Fitting to residue: 24 ALA
2 S> ~~~~~~~~~~~~~~~~~~~~~~~~~~
2 S>
2 S> Grid search
2 S> ~~~~~~~~~~~
2 S>
2 S> Searching the grid.
2 S> k: 0       xk: array([ 0.

in this case we finish a minimisation on processor 1 '1 S>'
then have some output from the master processor  'M S>'
and then some output from processor 2 '2 S>'

This output is very useful.  I would prefer though if the notation:

[M S] relax>

rather than:

M S> relax>

The text '1 S>' looks like a prompt when it is not whereas [1 S] is less

That's fine, I will do it.

On another note, I strongly believe that for ordinary operation this
output is not necessary.  The user doesn't need to know which slave
process the code has executed on.  All the user will care about is that
the calculation has occurred successfully.  Nevertheless this output is
very useful for programming and debugging purposes.  I propose that it
is shown when the --debug flag is passed to relax and suppressed
otherwise.  Importantly, that way the MPI or threading mode of operation
will look very similar to the normal uni-processor operation.  Or maybe
a verbosity flag to relax could be added to activate this printout?

Certainly. I think it is quite useful to have myself, because it assures the user that work is actually being distributed. However, I was planning to add a flag to switch it off anyway. It's just a case of taste I guess ;-)

when running under the threaded and mpi4py implementations you may see
long gaps with no output and the output to the terminal can be quite
'jerky'. This is because the multiprocessor implementation uses a
threaded output queue to decouple the writing of output on the master
from the queuing of calculations on the slaves, as otherwise for
systems with slow io the rate of io on the master can control the
rate of calculation!
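
The idea of a threaded output queue can be sketched roughly as follows. This is an illustration only, assuming a plain queue and one writer thread; the class name is hypothetical and the real multi code may be organised quite differently:

```python
import io
import queue
import threading


class OutputQueue:
    """Write queued text on a background thread so producers never block on io."""

    def __init__(self, stream):
        self._stream = stream
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def write(self, text):
        # Returns immediately; the (possibly slow) write happens on the worker.
        self._queue.put(text)

    def _drain(self):
        while True:
            text = self._queue.get()
            if text is None:  # sentinel: stop the worker thread
                break
            self._stream.write(text)

    def close(self):
        # Flush everything queued so far, then stop the worker.
        self._queue.put(None)
        self._thread.join()


buffer = io.StringIO()
out = OutputQueue(buffer)
out.write("1 S> Grid search\n")
out.close()
print(buffer.getvalue(), end="")
```

Because the writer drains the queue in its own time, output can arrive in bursts, which is the 'jerky' behaviour described above.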

I'll have to test this later and see if I can cosmetically minimise the

You can't! Well OK, you can, but there are 'implications'. The jerkiness is intrinsic to the batching up of results from Slave_commands, so you can switch off the batching of results from the Slave_processors, but this will put more stress on the master and the interprocessor communication fabric. If you want to return string results one line at a time the design also allows you to do this, but again you stress the master processor and the interprocessor communication fabric, possibly slowing the overall calculation. Note also that what works well for a computer with fast interprocess interconnects will not work well on a computer with a slow communication fabric. Anyway, the overall message is that if you block or slow the master you can end up slowing the whole multiprocessor....

also note the std error stream is not currently used as race
conditions between writing to the  stderr and stdout streams can lead
to garbled output.

This will definitely need to be fixed prior to merging into the 1.3
line.  Stdout and stderr separation is quite important.

Indeed this is true, and what I intend to do is to reintroduce an output stream that splits output on the master based on what the line's prefix is. This is all down to efficiency again: I could return each line of text as it is output to the output stream on the slave, but so that there are not lots of objects to send between processors I join the streams together, with the tags identifying where each line came from. The intention was to give the user the choice to split them again at the other end, but I still haven't had time to write that code.
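
Splitting the tagged lines again on the master might look something like this. The prefix pattern is an assumption based on the log fragments earlier in this message, and the function name is hypothetical:

```python
import re

# Assumed prefix shape: '<rank or M's> <S|E>> ', e.g. '2 S> ' or 'M E> '.
PREFIX = re.compile(r"^(\d+|M+) ([SE])> ?")


def split_streams(lines):
    """Return (stdout_lines, stderr_lines) from a list of tagged lines."""
    out, err = [], []
    for line in lines:
        match = PREFIX.match(line)
        # Untagged lines are treated as ordinary output.
        target = err if match and match.group(2) == "E" else out
        target.append(line)
    return out, err


stdout_lines, stderr_lines = split_streams(["1 S> ok", "2 E> boom", "M S>"])
```

Only one joined object per batch crosses between processors, and the master decides afterwards which stream each line belongs to.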

further note that the implementation includes a simple timer that gives
some benchmarking as to the speed of calculation; this is the total
time that it takes for the master process to run

M S> relax>'save', dir=None, force=1,
M S> Opening the file 'save.bz2' for writing.
M S>
M S> overall runtime: 0:00:24

What triggers the time printout?  Does it occur at a specific location?
Does it occur multiple times during execution?

No, only once on completion; it is the last thing the processor does on normal termination. See multi.processor.prerun, multi.processor.postrun and multi.Application_callback.
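
For illustration, a wall-clock timer of this kind can be written in a few lines. This is only a sketch; it does not reproduce the prerun and postrun hooks named above, and the helper name is hypothetical:

```python
import datetime
import time


def format_runtime(start, end):
    """Format elapsed wall time as H:MM:SS, rounded to whole seconds."""
    return str(datetime.timedelta(seconds=round(end - start)))


start = time.time()
# ... the master would run its script here ...
print("overall runtime: %s" % format_runtime(start, time.time()))
```

An elapsed time of 24 seconds formats as 0:00:24, the same shape as the line in the log fragment above.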

Interactive terminals: the multi implementation still has an
interactive terminal. This may be started by typing mpiexec -np
6 ../relax --multi mpi4py      for example in the case of an mpi4py
session. All io to the terminal takes place on the master processor,
but commands that are parallel still run across the whole cluster.

Perfect (although I already knew this)!
Note I think there may be a problem with command completion etc. I think this is because the streams aren't notifying the system that they are connected to terminals as opposed to a file, e.g. isatty is not correctly passed through. I will investigate sometime soon but it isn't my highest priority ... also the output can look a bit manky, but I will work at this.

Exceptions: exceptions from slave  processors appear with slightly
different stack traces compared to normal exceptions:

Traceback (most recent call last):
 line 351, in run
line 75, in default_init_master
 line 177, in run
line 216, in run
   run_script(intro=self.relax.intro_string, local=self.local,
script_file=self.relax.script_file, quit=1)
line 392, in run_script
   console.interact(intro, local, script_file, quit)
line 343, in interact_script
   execfile(script_file, local)
 File "", line 54, in ?
   grid_search(name, inc=11)
 line 147, in grid_search
 line 270, in run_queue
 line 335, in run_command_queue
 line 109, in put
 line 76, in put
 line 221, in process_result,memo)
line 276, in run
   raise self.exception


 line 381, in run,completed)
line 297, in run
   raise 'dummy'

Nested Exception from sub processor
Rank: 1  Name: fbsdpcu156-pid31522
Exception type: dummy (legacy string exception)
Message: dummy


here we have an exception 'dummy' which was raised at line 297, in the
run function /multi/ on  slave 1 processor node  fbsdpcu156
process id 31522 and transferred back to line 276 of  function run in
multi/ on the master where it was raised again.

I think that the error printout should be made to resemble the standard
Python printout.  For example if you raise a RelaxError with the text
'hello', the printout could look like:

   raise RelaxError, 'hello'
Nested Exception from sub processor
Rank: 1  Name: fbsdpcu156-pid31522
RelaxError: hello

Hence the exception type is not separated from its message.

OK, I can do that; it's only a question of munging a format string.
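
As a rough sketch of the format being agreed here, the nested report could be built with one format string that keeps the exception type and message together on one line. The function name and arguments are hypothetical, and ValueError stands in for RelaxError:

```python
def format_nested_exception(exc, rank, name):
    """Build the report text for an exception raised on a slave processor."""
    # Hypothetical helper: rank and name identify the slave, for example
    # 1 and 'fbsdpcu156-pid31522' in the report shown above.
    header = "Nested Exception from sub processor\nRank: %d  Name: %s" % (rank, name)
    # Type and message share one line, as in a standard Python traceback.
    return "%s\n%s: %s" % (header, type(exc).__name__, exc)


print(format_nested_exception(ValueError("hello"), 1, "fbsdpcu156-pid31522"))
```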

Now some caveats

1. not all exceptions can be handled by this mechanism, as
exceptions can only be handed back once communication between the
slaves has been set up. This can be a problem on some mpi
implementations as they don't provide redirection of stdout back to
the master controlling terminal.

There's probably not much that can be done there.

yes, what I am looking at in this case is putting output to a file, one per processor

(this won't work in all cases as some clusters don't have disk storage?)

2. I have had a few cases where raising an exception has wedged the
whole multiprocessor without any output. These can be quite hard to
debug as they are due to errors in the overrides I put on the io
streams! a pointer that may help is that  using the
sys.settrace(traceit)  as shown in will produce copious
output tracing  (and a very slow program)

The sorting out and separation of the IO streams may cause this problem
to disappear.
Nope, this may be to do with exceptions being thrown on remote processors and the master processor waiting infinitely long for communication from dead processors...

3. not all exception states seem to be leading to an exit from the
program currently so you should monitor output from the program

Do you know why this is happening?
no I am investigating
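
On the sys.settrace(traceit) pointer in caveat 2: a trace function along these lines prints every line as it executes. This is a generic sketch rather than the exact function referred to above; it writes to the original stderr so that overridden io streams are not involved:

```python
import sys


def traceit(frame, event, arg):
    """Report every executed line; returning traceit keeps tracing nested frames."""
    if event == "line":
        code = frame.f_code
        # sys.__stderr__ is the interpreter's original stderr, if available.
        out = sys.__stderr__ or sys.stderr
        out.write("%s:%d in %s\n" % (code.co_filename, frame.f_lineno, code.co_name))
    return traceit


def demo():
    x = 1
    return x + 1


sys.settrace(traceit)
result = demo()
sys.settrace(None)  # switch tracing off again
```

As noted above, tracing every line produces a great deal of output and slows the program considerably, so it is best enabled only around the suspect code.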


the following calculations are currently parallelised

1. model free minimisations across sets of residues with a fixed
diffusion tensor frame
2. model free grid searches for the diffusion tensor frame
3. monte carlo simulations

This is great work!

in future it may also be possible to parallelise the minimisation of
model-free calculations of the 'all' case where model fitting and the
tensor frame are optimised at the same time. However, this will require
modifications to the model free Hessian, gradient and function
calculation routines and development of a parallel Newton line search
which are both major undertakings.

These are possible targets for parallelisation but I would very strongly
recommend against working at this position.  And adding optimisation
algorithms would require very careful testing.  From my experience with
optimisation in the model-free space, I would probably bet that the
algorithm will fail for certain model-free motions (not many algorithms
find all minima in such a convoluted space).  The place to target is the
following three functions:

Specifically the loop over all residues (to be renamed to all spin
systems in the 1.3 line) to create the value, gradient, and Hessian
would be the ideal spot to parallelise!

Indeed this is what I thought, and Neil is working on it. One side note is that (if such a thing existed) a line search which is adventitious and looked at a superset of Newton positions would work ;-) Again it would have to be tested, but all such things have to be tested to some degree. Note: are there tests in the relax test suite for the cases where lm failed?

Indeed the problem may be fine grained enough that use of c mpi and
recoding of the hessian etc calculations for model free in c is

This conversion should significantly speed up calculations anyway.  I
will do this one day.

The later we do this the better; C is a bind. I still think pyrex, which compiles what almost looks like Python to C, would be a good thing to look at ;-) I might try and prototype something for you to look at at some point.

speedups on all calculations with increasing numbers of processors
should be near perfect as alluded to in message more
benchmarks will follow soon

processors      min     eff     mc      eff     grid    eff
1               18      100     80      100     134     100
2               9       100
4               5       90
8               3       75
16              1       112.5
32              1       56.25   8       31.25   4       104.6

and the picture that speaks 1000 words

key top graph black line achieved runtimes
       top graph red line expected runtimes with perfect scaling
       bottom graph scaling efficiency
some notes

0. data was collected on one of Chris's small data sets containing 28
residues, not all of which are active for minimisation

columns
       processors             - number of slave mpi processors
       min                    - time for a minimisation of models
m1-m9 with a fixed diffusion tensor
       eff                    - approximate parallel efficiency:
expected runtime / actual runtime
       mc                     - 256 monte carlo calculations
       eff                    - efficiency of the above
       grid                   - a grid search on an anisotropic
diffusion tensor, 6 steps
       eff                    - efficiency of the above
    tests were run on a cluster of opterons using gigabit ethernet
and mpi
1. these results are crude wall times as measured by Python's time.time
function for the master, but they do not include startup and shutdown
2. these tests are single point measurements there are no statistics
3. timings were rounded to 1 second, so for example we must consider
data points for  more than 16 processors for the min run to be suspect
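
The eff columns can be reproduced from the timings with the definition given in the notes (expected runtime / actual runtime, where the expected runtime is the single-processor time divided by the number of processors). The function name here is hypothetical:

```python
def efficiency(t1, processors, tn):
    """Percent parallel efficiency: expected runtime / actual runtime."""
    expected = t1 / processors
    return 100.0 * expected / tn


# 'min' column of the table above: processors -> runtime as listed.
min_times = {1: 18, 2: 9, 4: 5, 8: 3, 16: 1, 32: 1}
for n, t in sorted(min_times.items()):
    print("%2d  %2d  %.2f" % (n, t, efficiency(min_times[1], n, t)))
```

With runtimes rounded to whole units, a value such as 112.5 at 16 processors is a rounding artefact rather than real super-linear scaling, which is the point of note 3. The grid entry at 32 processors evaluates to 104.6875, which the table lists as 104.6.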

note if you watch the output carefully you will see one difference
between the multiprocessor and uniprocessor runs of the grid search.
The grid search reports all cases of the search where the target
function has improved for each processor, rather than for the whole
grid search....

I wonder what is causing the Monte Carlo simulations to not be 100%
efficient.  Of all the code in relax, I would see this as the most
amenable for parallelisation.  Each simulation can be queued to a
different slave.

I agree; this may be an artefact of the way our cluster is set up... or a one-off problem due to contention on its ethernet fabric (I said I didn't do any statistics).

Bugs, missing features, todos etc:

1. There is very little commenting

This is quite important for the future maintainability of the code.

well, it's on the list

2. some exceptions do not stop the interpreter properly and there may
still be some bugs that cause lockups on throwing exceptions


well, I have sorted everything I can find but... we need more tests

3. there are no unit tests (though the amount of code that can be unit
tested is rather limited as for example writing mock objects for mpi
could be fun!)

Most parts of the code could be tested relatively easily.  The unit test
framework makes this job quite easy.

Indeed ;-) I can test everything except the communication, which would require me to write a mock mpi object (ouch), though I do now have some thoughts on this...

4. there are no documentation strings

I would recommend compiling the API documentation using scons and then
looking at the HTML output.  That output should very clearly show what
docstrings are missing or deficient.

Well, until last week nothing was documented so I didn't bother! Also I note that epydoc has a checking mode that will note which interfaces are undocumented...

5. the command line handling needs to be improved: we need to find the
current processor implementation, load it and then ask it what command
line options it needs (this will also allow the simplification of the
handling of setting up the number of processors and allow
multiprocessors that need more command line arguments such as ssh
tunnels to get extra arguments) I will also have to design a way of
getting the help text for all the processor command line options
whether they are loaded or not

I don't follow.  The user shouldn't be asked a question by relax.
when I say 'ask the user' I really mean interrogate the command line I was just being a bit loose in my terminology

6. there are many task comments littered around the code FIXME: TODO:
etc all of these except the ones labelled PY3K: will need to be
reviewed resolved and removed
7. the relax class still has much code for the slave command setup
which needs to be removed as the multi module replaces it
8. The Get_name_command hasn't been tested recently especially across
all of the current processor fabrics
9. there needs to be a way of running the relax system test suite
against a list of processor fabrics

I disagree.  The unit tests should test all functions (except those
requiring missing dependencies).  The system/functional tests should be
run in the current mode of operation.  Therefore to test the operation
of relax against the different processor fabrics you would run the test
suite multiple times in those different modes.

What I was thinking of was allowing --multi=mpi4py,threads,uni for the system tests, which would repeatedly run the system tests with the different implementations.

10. code to control the use of batched command queueing and returning,
and the threaded output queue has been implemented but hasn't got an
interface to turn it on and off yet
11. the command queuing code has an idea of how many grains there
should be per processor. This isn't under user control at the moment
(the grainyness controls how many batches of commands each processor
should see; take for example 3 slaves and 18 commands with a
grainyness of 1. On the task queue they would be divided up into 3
batched commands, one for each processor, with each batched command
containing 6 sub commands. With a grainyness of 3 there would be 9
batched commands with each batched command containing 2 commands).
This allows for some load balancing on more heterogeneous systems as the
batched commands are held in a queue and handed out to the slave
processors as the slaves become available.
12. some of the output prefixing has off by 1 errors
13. re-segregation of the stdout and stderr streams back out into
their correct streams is not implemented; everything is reported on
stdout. This will require work for the uni_processor as well

I think that the two should never be combined.
see my comments above about why I combine stdout and stderr and how it's a requirement for accountable efficiency ;-) since they are tagged, recombining them is not a problem....

This makes it
challenging with the threading, but it is doable (I did it with my
ancient threading code).
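
The batching arithmetic in item 11 can be sketched as below. This only illustrates the numbers in that example; the function name and signature are hypothetical and not taken from the command queuing code:

```python
def batch_commands(commands, slaves, grainyness=1):
    """Split commands into slaves * grainyness batches of near-equal size."""
    if not commands:
        return []
    n_batches = min(len(commands), slaves * grainyness)
    size, extra = divmod(len(commands), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # The first 'extra' batches take one more command each.
        end = start + size + (1 if i < extra else 0)
        batches.append(commands[start:end])
        start = end
    return batches


commands = list(range(18))
coarse = batch_commands(commands, slaves=3, grainyness=1)  # 3 batches of 6
fine = batch_commands(commands, slaves=3, grainyness=3)    # 9 batches of 2
```

With finer grains a fast slave can come back to the queue for another batch while a slower one is still busy, which is where the load balancing comes from.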

14.  parallelisation of Hessian calculations and the all minimisation

See above.

15 . it would be good to give users control of which parts of the
program are parallelised during a run

I don't know if this is important or useful from the user's perspective.

It can be important for speed. Consider: if you have a cluster with slow communication you might well want to run the monte carlos in parallel but not the (currently unimplemented) all-Hessian optimisation, as the overhead of sending out the parts of the Hessian could well overwhelm the gains you make if your communication is slow.

16. uni processor could be implemented as a subclass of
17. true virtual classes are not implemented

What is a virtual class?

sorry, me being sleepy; I meant abstract class

18.  the stdio stream interceptors should be implemented as delegates
to StringIO rather than inheriting from StringIO which would also
allow for the use of cStringIO
19. The master processor only does io and no calculations

Is that not how it currently works?

yep but the question is are we wasting resources and can it be improved?
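
On item 18, delegation rather than inheritance might look roughly like this. It is written for Python 3, where io.StringIO replaces the StringIO and cStringIO modules; the class name is hypothetical, and each write is assumed to start on a line boundary:

```python
import io


class PrefixedStream:
    """Stream interceptor that delegates to a buffer instead of subclassing it."""

    def __init__(self, prefix, buffer=None):
        self._prefix = prefix
        # Any object with write() and getvalue() can be wrapped here.
        self._buffer = buffer if buffer is not None else io.StringIO()

    def write(self, text):
        # Tag every line before passing it on to the wrapped buffer.
        for line in text.splitlines(True):
            self._buffer.write(self._prefix + line)

    def __getattr__(self, name):
        # Everything not overridden (getvalue, flush, isatty, ...) is delegated.
        return getattr(self._buffer, name)


stream = PrefixedStream("1 S> ")
stream.write("Grid search\n~~~~~~~~~~~\n")
print(stream.getvalue(), end="")
```

Because the wrapped object is passed in rather than inherited from, a faster buffer implementation can be swapped in without touching the interceptor, and calls such as isatty fall through to the underlying stream.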

anyway that's it for now

Again I have to say that this code is looking really good.  This should
be released as a proper relax release.  I would look at the instructions
for creating a relax release.  A tag needs to be created and the source
and binary packages created.  For that you will need to create a GPG key
pair specifically for relax and then send me your public key.  Then you
will be able to package and upload signed files to the download site.  I
can then check the packages and sign them with the relax key.

I will look into it soon and try to make a release in the next couple of weeks; I certainly need to iron out this bug of Neil's first.




This is the relax-devel mailing list

Dr Gary Thompson
Astbury Centre for Structural Molecular Biology,
University of Leeds, Astbury Building,
Leeds, LS2 9JT, West-Yorkshire, UK             Tel. +44-113-3433024
email: garyt@xxxxxxxxxxxxxxx                   Fax  +44-113-2331407
