On Fri, 2007-05-04 at 13:59 +0100, Gary S. Thompson wrote: [cut]
output: the processor implementation gives some feedback as to what processor you are running:

M S> script
M S> 
M S> 
M S> 
M S> relax repository checkout
M S> 
M S> Protein dynamics by NMR relaxation data analysis
M S> 
M S> Copyright (C) 2001-2006 Edward d'Auvergne
M S> 
M S> This is free software which you are welcome to modify and redistribute under the conditions of the
M S> GNU General Public License (GPL). This program, including all modules, is licensed under the GPL
M S> and comes with absolutely no warranty. For details type 'GPL'. Assistance in using this program
M S> can be accessed by typing 'help'.
M S> 
M S> processor = MPI running via mpi4py with 5 slave processors & 1 master, mpi version = 1.2
M S> 
M S> script = 'test_small.py'
M S> ----------------------------------------------------------------------------------------------------
If using '-np 6', shouldn't the number of slaves be 6?
note the processor = line
The processor line is in line with the 'script = ...' notation and that information would be useful if the output is caught and stored in a log file. I like that touch.
another couple of things to note are that the output from the program is prepended with some text indicating which stream and which processor the output is coming from. The output prefix is divided into two parts:

'processor' 'stream'> [normal output line]

where processor is either a number identifying the rank of the processor, or a series of M's to indicate the master, and stream is either E or S for the error or output streams. So here is another fragment:

1 S> Hessian calls: 0
1 S> Warning: None
1 S> 
M S> idle set set([1, 2])
M S> running_set set([2, 3, 4, 5])
M S> 
2 S> 
2 S> 
2 S> Fitting to residue: 24 ALA
2 S> ~~~~~~~~~~~~~~~~~~~~~~~~~~
2 S> 
2 S> Grid search
2 S> ~~~~~~~~~~~
2 S> 
2 S> Searching the grid.
2 S> k: 0 xk: array([ 0.

in this case we finish a minimisation on processor 1 ('1 S>'), then have some output from the master processor ('M S>'), and then some output from processor 2 ('2 S>')
This output is very useful. I would prefer though the notation:

[M S] relax>

rather than:

M S> relax>

The text '1 S>' looks like a prompt when it is not, whereas [1 S] is less ambiguous. On another note, I strongly believe that for ordinary operation this output is not necessary. The user doesn't need to know which slave process the code has executed on. All the user will care about is that the calculation has occurred successfully. Nevertheless this output is very useful for programming and debugging purposes. I propose that it is shown when the --debug flag is passed to relax and suppressed otherwise. Importantly, that way the MPI or threading mode of operation will look very similar to the normal uni-processor operation. Or maybe a verbosity flag to relax could be added to activate this printout?
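A minimal sketch of how such a prefix could be generated. The function names here are hypothetical, chosen for illustration only, and do not come from the relax sources:

```python
# Sketch of the 'processor stream>' output prefixing scheme described
# above. format_prefix() and prefix_lines() are illustrative names,
# not the actual relax implementation.

def format_prefix(rank, stream):
    """Build the "processor stream> " prefix.

    rank   -- the slave's rank (int), or None for the master.
    stream -- 'S' for the output stream or 'E' for the error stream.
    """
    processor = 'M' if rank is None else str(rank)
    return '%s %s> ' % (processor, stream)

def prefix_lines(text, rank, stream='S'):
    """Prepend the prefix to every line of a block of output."""
    prefix = format_prefix(rank, stream)
    return '\n'.join(prefix + line for line in text.splitlines())
```

With the bracketed notation suggested above, only `format_prefix()` would need to change.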
when running under the threaded and mpi4py implementations you may see long gaps with no output, and the output to the terminal can be quite 'jerky'. This is because the multiprocessor implementation uses a threaded output queue to decouple the writing of output on the master from the queuing of calculations on the slaves, as otherwise, for systems with slow IO, the rate of IO on the master can control the rate of calculation!
I'll have to test this later and see if I can cosmetically minimise the jerkiness.
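The decoupling described above can be sketched as a queue drained by a dedicated writer thread, so that a slow terminal or log file never stalls the code that queues calculations. This is an illustrative sketch, not the relax code; the class name is invented:

```python
import queue
import threading

# Sketch of a threaded output queue: producers call put() and return
# immediately; a background thread performs the (possibly slow) writes.

class Output_queue:
    def __init__(self, write=None):
        self._queue = queue.Queue()
        self._write = write or (lambda text: print(text, end=''))
        self._thread = threading.Thread(target=self._drain)
        self._thread.daemon = True
        self._thread.start()

    def put(self, text):
        # Non-blocking from the caller's point of view.
        self._queue.put(text)

    def close(self):
        # A None sentinel tells the writer thread to stop; then wait.
        self._queue.put(None)
        self._thread.join()

    def _drain(self):
        while True:
            text = self._queue.get()
            if text is None:
                break
            self._write(text)
```

The 'jerkiness' follows naturally from this design: output appears in bursts whenever the writer thread catches up.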
also note the stderr stream is not currently used, as race conditions between writing to the stderr and stdout streams can lead to garbled output.
This will definitely need to be fixed prior to merging into the 1.3 line. Stdout and stderr separation is quite important.
further note that the implementation includes a simple timer that gives some benchmarking as to the speed of calculation; this is the total time that it takes for the master process to run:

M S> relax> state.save(file='save', dir=None, force=1, compress_type=1)
M S> Opening the file 'save.bz2' for writing.
M S> 
M S> overall runtime: 0:00:24
What triggers the time printout? Does it occur at a specific location? Does it occur multiple times during execution?
Interactive terminals: the multi implementation still has an interactive terminal. This may be started by typing, for example in the case of an mpi4py session:

mpiexec -np 6 ../relax --multi mpi4py

All IO to the terminal takes place on the master processor, but commands that are parallel still run across the whole cluster.
Perfect (although I already knew this)!
Exceptions: exceptions from slave processors appear with slightly different stack traces compared to normal exceptions:

Traceback (most recent call last):
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/multi/multi_processor.py", line 351, in run
    self.callback.init_master(self)
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/multi/processor.py", line 75, in default_init_master
    self.master.run()
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/relax_tests_chris/../relax", line 177, in run
    self.interpreter.run()
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/prompt/interpreter.py", line 216, in run
    run_script(intro=self.relax.intro_string, local=self.local, script_file=self.relax.script_file, quit=1)
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/prompt/interpreter.py", line 392, in run_script
    console.interact(intro, local, script_file, quit)
  File "/nmr/jessy/garyt/projects/relax_branch/branch_mult1/prompt/interpreter.py", line 343, in interact_script
    execfile(script_file, local)
  File "test_small.py", line 54, in ?
    grid_search(name, inc=11)
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/prompt/minimisation.py", line 147, in grid_search
    self.relax.processor.run_queue()
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/multi/multi_processor.py", line 270, in run_queue
    self.run_command_queue(lqueue)
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/multi/multi_processor.py", line 335, in run_command_queue
    result_queue.put(result)
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/multi/multi_processor.py", line 109, in put
    super(Threaded_result_queue,self).put(job)
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/multi/multi_processor.py", line 76, in put
    self.processor.process_result(job)
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/multi/multi_processor.py", line 221, in process_result
    result.run(self,memo)
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/multi/processor.py", line 276, in run
    raise self.exception
Capturing_exception:
------------------------------------------------------------------------------------------------------------------------
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/multi/multi_processor.py", line 381, in run
    command.run(self,completed)
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/multi/commands.py", line 297, in run
    raise 'dummy'

Nested Exception from sub processor
Rank: 1 Name: fbsdpcu156-pid31522

Exception type: dummy (legacy string exception)
Message: dummy
------------------------------------------------------------------------------------------------------------------------

here we have an exception 'dummy' which was raised at line 297, in the run function of multi/commands.py, on slave 1 (processor node fbsdpcu156, process id 31522) and transferred back to line 276 of the run function in multi/processor.py on the master, where it was raised again.
I think that the error printout should be made to resemble the standard Python printout. For example, if you raise a RelaxError with the text 'hello':

raise RelaxError, 'hello'

the printout could look like:

Nested Exception from sub processor
Rank: 1 Name: fbsdpcu156-pid31522
RelaxError: hello

Hence the exception type is not separated from its message.
Now some caveats:

1. not all exceptions can be handled by this mechanism, as the exceptions can only be handed back once communication between the slaves has been set up. This can be a problem on some MPI implementations as they don't provide redirection of stdout back to the master's controlling terminal.
There's probably not much that can be done there.
2. I have had a few cases where raising an exception has wedged the whole multiprocessor without any output. These can be quite hard to debug as they are due to errors in the overrides I put on the IO streams! A pointer that may help: using sys.settrace(traceit) as shown in processor.py will produce copious output tracing (and a very slow program)
The sorting out and separation of the IO streams may cause this problem to disappear.
3. not all exception states seem to lead to an exit from the program currently, so you should monitor output from the program carefully
Do you know why this is happening?
Speedups
-----------

the following calculations are currently parallelised:

1. model-free minimisations across sets of residues with a fixed diffusion tensor frame
2. model-free grid searches for the diffusion tensor frame
3. Monte Carlo simulations
This is great work!
in future it may also be possible to parallelise the minimisation of model-free calculations in the 'all' case, where model fitting and the tensor frame are optimised at the same time. However, this will require modifications to the model-free Hessian, gradient and function calculation routines and the development of a parallel Newton line search, which are both major undertakings.
These are possible targets for parallelisation but I would very strongly recommend against working at this position. And adding optimisation algorithms would require very careful testing. From my experience with optimisation in the model-free space, I would probably bet that the algorithm will fail for certain model-free motions (not many algorithms find all minima in such a convoluted space). The place to target is the following three functions:

maths_fns.mf.Mf.func_all()
maths_fns.mf.Mf.dfunc_all()
maths_fns.mf.Mf.d2func_all()

Specifically the loop over all residues (to be renamed to all spin systems in the 1.3 line) to create the value, gradient, and Hessian would be the ideal spot to parallelise!
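The idea behind targeting that loop can be sketched as follows. Each residue contributes an independent term to the total chi-squared value, so the per-residue terms can be farmed out and summed. The target function below is a toy stand-in, not the real maths_fns.mf.Mf code, and a thread pool stands in for the slave processors:

```python
from concurrent.futures import ThreadPoolExecutor

def residue_chi2(params, residue_data):
    # Toy per-residue chi-squared contribution: a one-parameter linear
    # model against (x, observed) pairs. Illustrative only.
    return sum((obs - params[0] * x) ** 2 for x, obs in residue_data)

def func_all_serial(params, residues):
    # The existing pattern: one loop over all residues on one processor.
    return sum(residue_chi2(params, r) for r in residues)

def func_all_parallel(params, residues, workers=4):
    # The parallelised pattern: farm the per-residue terms out to a
    # pool (slaves, in the MPI case) and sum the partial results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda r: residue_chi2(params, r), residues))
```

The gradient and Hessian follow the same pattern, with elementwise sums of per-residue vectors and matrices in place of the scalar sum.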
Indeed, the problem may be fine-grained enough that the use of C MPI, and a recoding of the Hessian etc. calculations for model-free in C, is required
This conversion should significantly speed up calculations anyway. I will do this one day.
speedups on all calculations with increasing numbers of processors should be near perfect, as alluded to in message https://mail.gna.org/public/relax-devel/2007-04/msg00048.html more benchmarks will follow soon

processors   min   eff     mc    eff     grid   eff
 1           18    100     80    100     134    100
 2            9    100
 4            5     90
 8            3     75
16            1    112.5
32            1     56.25   8     31.25    4    104.6

and the picture that speaks 1000 words [figure: runtime and scaling efficiency graphs]

key:
top graph, black line - achieved runtimes
top graph, red line - expected runtimes with perfect scaling efficiency
bottom graph - scaling efficiency

some notes

0. data was collected on one of chris's small data sets containing 28 residues, not all of which are active for minimisation. columns:
   processors - no. of slave MPI processors
   min - time for a minimisation of models m1-m9 with a fixed diffusion tensor
   eff - approximate parallel efficiency (expected runtime / actual runtime)
   mc - 256 Monte Carlo calculations
   eff - efficiency of the above
   grid - a grid search on an anisotropic diffusion tensor (6 steps)
   eff - efficiency of the above
   tests were run on a cluster of Opterons using gigabit ethernet and MPI
1. these results are crude wall times as measured by python's time.time function for the master, but they do not include startup and shutdown overhead
2. these tests are single point measurements; there are no statistics
3. timings were rounded to 1 second, so for example we must consider data points for more than 16 processors for the min run to be suspect

note if you watch the output carefully you will see one difference between the multiprocessor and uniprocessor runs of the grid search.
The grid search reports all cases of the search where the target function has improved for each processor, rather than for the whole grid search....
I wonder what is causing the Monte Carlo simulations to not be 100% efficient. Of all the code in relax, I would see this as the most amenable for parallelisation. Each simulation can be queued to a different slave.
Bugs, missing features, todos etc.:
-------------------------------------

1. There is very little commenting
This is quite important for the future maintainability of the code.
2. some exceptions do not stop the interpreter properly and there may still be some bugs that cause lockups on throwing exceptions
Yikes.
3. there are no unit tests (though the amount of code that can be unit tested is rather limited as for example writing mock objects for mpi could be fun!)
Most parts of the code could be tested relatively easily. The unit test framework makes this job quite easy.
4. there are no documentation strings
I would recommend compiling the API documentation using scons and then looking at the HTML output. That output should very clearly show what docstrings are missing or deficient.
5. the command line handling needs to be improved: we need to find the current processor implementation, load it, and then ask it what command line options it needs (this will also allow the simplification of the handling of setting up the number of processors, and allow multiprocessors that need more command line arguments, such as ssh tunnels, to get extra arguments). I will also have to design a way of getting the help text for all the processor command line options, whether they are loaded or not
I don't follow. The user shouldn't be asked a question by relax.
6. there are many task comments littered around the code (FIXME:, TODO:, etc.); all of these, except the ones labelled PY3K:, will need to be reviewed, resolved and removed
7. the relax class still has much code for the slave command setup which needs to be removed, as the multi module replaces it
8. the Get_name_command hasn't been tested recently, especially across all of the current processor fabrics
9. there needs to be a way of running the relax system test suite against a list of processor fabrics
I disagree. The unit tests should test all functions (except those requiring missing dependencies). The system/functional tests should be run in the current mode of operation. Therefore to test the operation of relax against the different processor fabrics you would run the test suite multiple times in those different modes.
10. code to control the use of batched command queueing and returning, and the threaded output queue, has been implemented but hasn't got an interface to turn it on and off yet
11. the command queuing code has an idea of how many grains there should be per processor. This isn't under user control at the moment. (The graininess controls how many batches of commands each processor should see. Take for example 3 slaves and 18 commands: with a graininess of 1, on the task queue they would be divided up into 3 batched commands, one for each processor, with each batched command containing 6 sub commands. With a graininess of 3 there would be 9 batched commands, with each batched command containing 2 commands.) This allows for some load balancing on more heterogeneous systems, as the batched commands are held in a queue and handed out to the slave processors as the slaves become available.
12. some of the output prefixing has off by 1 errors
13. re-segregation of the stdout and stderr streams back out into their correct streams is not implemented; everything is reported on stdout. This will require work for the uni_processor as well
I think that the two should never be combined. This makes it challenging with the threading, but it is doable (I did it with my ancient threading code).
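The batching arithmetic of item 11 above (3 slaves, 18 commands: graininess 1 gives 3 batches of 6, graininess 3 gives 9 batches of 2) can be sketched as a small helper. The function name is illustrative, not from the relax sources:

```python
# Sketch of command batching under a graininess setting: the command
# list is split into n_slaves * grain batches, which are then handed
# out to slaves as they become available (allowing load balancing on
# heterogeneous systems).

def batch_commands(commands, n_slaves, grain):
    """Split a command list into n_slaves * grain ordered batches."""
    n_batches = n_slaves * grain
    size, rem = divmod(len(commands), n_batches)
    batches = []
    start = 0
    for i in range(n_batches):
        # The first `rem` batches absorb any remainder command each.
        end = start + size + (1 if i < rem else 0)
        if start < end:
            batches.append(commands[start:end])
        start = end
    return batches
```

A higher graininess means smaller batches and better balancing, at the cost of more queue traffic.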
14. parallelisation of Hessian calculations and the 'all' minimisation
See above.
15. it would be good to give users control of which parts of the program are parallelised during a run
I don't know if this is important or useful from the user's perspective.
16. the uni processor could be implemented as a subclass of multi_processor
17. true virtual classes are not implemented
What is a virtual class?
18. the stdio stream interceptors should be implemented as delegates to StringIO rather than inheriting from StringIO, which would also allow for the use of cStringIO
19. the master processor only does IO and no calculations
Is that not how it currently works?
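The delegation idea of item 18 can be sketched as follows: instead of subclassing StringIO, the interceptor wraps any file-like backend and forwards write() calls to it, so a cStringIO instance (in Python 2) or a real stream could be dropped in unchanged. The class name and prefixing behaviour are illustrative only:

```python
from io import StringIO

# Sketch of a stream interceptor implemented by delegation: the
# backend can be any object with a write() method, rather than being
# fixed by inheritance.

class Prefixed_stream:
    def __init__(self, backend, prefix):
        self._backend = backend   # e.g. StringIO(), sys.__stdout__, ...
        self._prefix = prefix

    def write(self, text):
        # Prefix each complete line before forwarding to the backend.
        for line in text.splitlines(True):
            self._backend.write(self._prefix + line)

    def getvalue(self):
        return self._backend.getvalue()
```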
anyway, that's it for now
Again I have to say that this code is looking really good. This should be released as a proper relax release. I would look at the instructions for creating a relax release. A tag needs to be created and the source and binary packages created. For that you will need to create a GPG key pair specifically for relax and then send me your public key. Then you will be able to package and upload signed files to the download site. I can then check the packages and sign them with the relax key. Cheers, Edward