On Fri, 2007-05-04 at 13:59 +0100, Gary S. Thompson wrote: [cut]
output: the processor implementation gives some feedback as to what processor you are running:

M S> script
M S> 
M S> 
M S> 
M S> relax repository checkout
M S> 
M S> Protein dynamics by NMR relaxation data analysis
M S> 
M S> Copyright (C) 2001-2006 Edward d'Auvergne
M S> 
M S> This is free software which you are welcome to modify and redistribute under the conditions of the
M S> GNU General Public License (GPL). This program, including all modules, is licensed under the GPL
M S> and comes with absolutely no warranty. For details type 'GPL'. Assistance in using this program
M S> can be accessed by typing 'help'.
M S> 
M S> processor = MPI running via mpi4py with 5 slave processors & 1 master, mpi version = 1.2
M S> 
M S> script = 'test_small.py'
M S> ----------------------------------------------------------------------------------------------------
If using '-np 6', shouldn't the number of slaves be 6?
note the processor = line
The processor line is in line with the 'script = ...' notation and that information would be useful if the output is caught and stored in a log file. I like that touch.
another couple of things to note are that the output from the program is prepended with some text indicating which stream and which processor the output is coming from. The output prefix is divided into two parts:

'processor' 'stream'> [normal output line]

where processor is either a number identifying the rank of the processor, or a series of M's to indicate the master, and stream is either E or S for the error or output streams. So here is another fragment:

1 S> Hessian calls: 0
1 S> Warning: None
1 S> 
M S> idle set set([1, 2])
M S> running_set set([2, 3, 4, 5])
M S> 
2 S> 
2 S> 
2 S> Fitting to residue: 24 ALA
2 S> ~~~~~~~~~~~~~~~~~~~~~~~~~~
2 S> 
2 S> Grid search
2 S> ~~~~~~~~~~~
2 S> 
2 S> Searching the grid.
2 S> k: 0 xk: array([ 0.

in this case we finish a minimisation on processor 1 ('1 S>'), then have some output from the master processor ('M S>'), and then some output from processor 2 ('2 S>')
This output is very useful. I would prefer though the notation:

[M S] relax>

rather than:

M S> relax>

The text '1 S>' looks like a prompt when it is not, whereas [1 S] is less ambiguous. On another note, I strongly believe that for ordinary operation this output is not necessary. The user doesn't need to know which slave process the code has executed on. All the user will care about is that the calculation has occurred successfully. Nevertheless this output is very useful for programming and debugging purposes. I propose that it is shown when the --debug flag is passed to relax and suppressed otherwise. Importantly, that way the MPI or threading mode of operation will look very similar to the normal uni-processor operation. Or maybe a verbosity flag to relax could be added to activate this printout?
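A minimal sketch of how such a prefix could be generated. The function names here are hypothetical, chosen for illustration only, and do not come from the relax sources:

```python
# Sketch of the 'processor stream>' output prefixing scheme described
# above. format_prefix() and prefix_lines() are illustrative names,
# not the actual relax implementation.

def format_prefix(rank, stream):
    """Build the "processor stream> " prefix.

    rank   -- the slave's rank (int), or None for the master.
    stream -- 'S' for the output stream or 'E' for the error stream.
    """
    processor = 'M' if rank is None else str(rank)
    return '%s %s> ' % (processor, stream)

def prefix_lines(text, rank, stream='S'):
    """Prepend the prefix to every line of a block of output."""
    prefix = format_prefix(rank, stream)
    return '\n'.join(prefix + line for line in text.splitlines())
```

With the bracketed notation suggested above, only `format_prefix()` would need to change.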
when running under the threaded and mpi4py implementations you may see long gaps with no output, and the output to the terminal can be quite 'jerky'. This is because the multiprocessor implementation uses a threaded output queue to decouple the writing of output on the master from the queuing of calculations on the slaves, as otherwise, for systems with slow IO, the rate of IO on the master can control the rate of calculation!
I'll have to test this later and see if I can cosmetically minimise the jerkiness.
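The decoupling described above can be sketched as a queue drained by a dedicated writer thread, so that a slow terminal or log file never stalls the code that queues calculations. This is an illustrative sketch, not the relax code; the class name is invented:

```python
import queue
import threading

# Sketch of a threaded output queue: producers call put() and return
# immediately; a background thread performs the (possibly slow) writes.

class Output_queue:
    def __init__(self, write=None):
        self._queue = queue.Queue()
        self._write = write or (lambda text: print(text, end=''))
        self._thread = threading.Thread(target=self._drain)
        self._thread.daemon = True
        self._thread.start()

    def put(self, text):
        # Non-blocking from the caller's point of view.
        self._queue.put(text)

    def close(self):
        # A None sentinel tells the writer thread to stop; then wait.
        self._queue.put(None)
        self._thread.join()

    def _drain(self):
        while True:
            text = self._queue.get()
            if text is None:
                break
            self._write(text)
```

The 'jerkiness' follows naturally from this design: output appears in bursts whenever the writer thread catches up.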
also note the stderr stream is not currently used, as race conditions between writing to the stderr and stdout streams can lead to garbled output.
This will definitely need to be fixed prior to merging into the 1.3 line. Stdout and stderr separation is quite important.
further note that the implementation includes a simple timer that gives some benchmarking as to the speed of calculation; this is the total time that it takes for the master process to run:

M S> relax> state.save(file='save', dir=None, force=1, compress_type=1)
M S> Opening the file 'save.bz2' for writing.
M S> 
M S> overall runtime: 0:00:24
What triggers the time printout? Does it occur at a specific location? Does it occur multiple times during execution?
Interactive terminals: the multi implementation still has an interactive terminal. This may be started by typing, for example in the case of an mpi4py session:

mpiexec -np 6 ../relax --multi mpi4py

All IO to the terminal takes place on the master processor, but commands that are parallel still run across the whole cluster.
Perfect (although I already knew this)!
Exceptions: exceptions from slave processors appear with slightly different stack traces compared to normal exceptions:

Traceback (most recent call last):
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/multi/multi_processor.py", line 351, in run
    self.callback.init_master(self)
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/multi/processor.py", line 75, in default_init_master
    self.master.run()
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/relax_tests_chris/../relax", line 177, in run
    self.interpreter.run()
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/prompt/interpreter.py", line 216, in run
    run_script(intro=self.relax.intro_string, local=self.local, script_file=self.relax.script_file, quit=1)
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/prompt/interpreter.py", line 392, in run_script
    console.interact(intro, local, script_file, quit)
  File "/nmr/jessy/garyt/projects/relax_branch/branch_mult1/prompt/interpreter.py", line 343, in interact_script
    execfile(script_file, local)
  File "test_small.py", line 54, in ?
    grid_search(name, inc=11)
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/prompt/minimisation.py", line 147, in grid_search
    self.relax.processor.run_queue()
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/multi/multi_processor.py", line 270, in run_queue
    self.run_command_queue(lqueue)
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/multi/multi_processor.py", line 335, in run_command_queue
    result_queue.put(result)
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/multi/multi_processor.py", line 109, in put
    super(Threaded_result_queue,self).put(job)
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/multi/multi_processor.py", line 76, in put
    self.processor.process_result(job)
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/multi/multi_processor.py", line 221, in process_result
    result.run(self,memo)
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/multi/processor.py", line 276, in run
    raise self.exception
Capturing_exception:
------------------------------------------------------------------------------------------------------------------------
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/multi/multi_processor.py", line 381, in run
    command.run(self,completed)
  File "/nmr/jessy/garyt/projects/relax_branch/branch_multi1/multi/commands.py", line 297, in run
    raise 'dummy'

Nested Exception from sub processor
Rank: 1 Name: fbsdpcu156-pid31522

Exception type: dummy (legacy string exception)
Message: dummy
------------------------------------------------------------------------------------------------------------------------

here we have an exception 'dummy' which was raised at line 297, in the run function of multi/commands.py, on slave 1 (processor node fbsdpcu156, process id 31522) and transferred back to line 276 of the run function in multi/processor.py on the master, where it was raised again.
I think that the error printout should be made to resemble the standard Python printout. For example, if you raise a RelaxError with the text 'hello':

raise RelaxError, 'hello'

the printout could look like:

Nested Exception from sub processor
Rank: 1 Name: fbsdpcu156-pid31522
RelaxError: hello

Hence the exception type is not separated from its message.
Now some caveats:

1. not all exceptions can be handled by this mechanism, as the exceptions can only be handed back once communication between the slaves has been set up. This can be a problem on some MPI implementations as they don't provide redirection of stdout back to the master's controlling terminal.
There's probably not much that can be done there.
2. I have had a few cases where raising an exception has wedged the whole multiprocessor without any output. These can be quite hard to debug as they are due to errors in the overrides I put on the IO streams! A pointer that may help: using sys.settrace(traceit) as shown in processor.py will produce copious output tracing (and a very slow program)
The sorting out and separation of the IO streams may cause this problem to disappear.
3. not all exception states seem to lead to an exit from the program currently, so you should monitor output from the program carefully
Do you know why this is happening?
Speedups
-----------

the following calculations are currently parallelised:

1. model-free minimisations across sets of residues with a fixed diffusion tensor frame
2. model-free grid searches for the diffusion tensor frame
3. Monte Carlo simulations
This is great work!
in future it may also be possible to parallelise the minimisation of model-free calculations in the 'all' case, where model fitting and the tensor frame are optimised at the same time. However, this will require modifications to the model-free Hessian, gradient and function calculation routines and the development of a parallel Newton line search, which are both major undertakings.
These are possible targets for parallelisation but I would very strongly recommend against working at this position. And adding optimisation algorithms would require very careful testing. From my experience with optimisation in the model-free space, I would probably bet that the algorithm will fail for certain model-free motions (not many algorithms find all minima in such a convoluted space). The place to target is the following three functions:

maths_fns.mf.Mf.func_all()
maths_fns.mf.Mf.dfunc_all()
maths_fns.mf.Mf.d2func_all()

Specifically the loop over all residues (to be renamed to all spin systems in the 1.3 line) to create the value, gradient, and Hessian would be the ideal spot to parallelise!
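The idea behind targeting that loop can be sketched as follows. Each residue contributes an independent term to the total chi-squared value, so the per-residue terms can be farmed out and summed. The target function below is a toy stand-in, not the real maths_fns.mf.Mf code, and a thread pool stands in for the slave processors:

```python
from concurrent.futures import ThreadPoolExecutor

def residue_chi2(params, residue_data):
    # Toy per-residue chi-squared contribution: a one-parameter linear
    # model against (x, observed) pairs. Illustrative only.
    return sum((obs - params[0] * x) ** 2 for x, obs in residue_data)

def func_all_serial(params, residues):
    # The existing pattern: one loop over all residues on one processor.
    return sum(residue_chi2(params, r) for r in residues)

def func_all_parallel(params, residues, workers=4):
    # The parallelised pattern: farm the per-residue terms out to a
    # pool (slaves, in the MPI case) and sum the partial results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda r: residue_chi2(params, r), residues))
```

The gradient and Hessian follow the same pattern, with elementwise sums of per-residue vectors and matrices in place of the scalar sum.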
Indeed, the problem may be fine-grained enough that the use of C MPI, and a recoding of the Hessian etc. calculations for model-free in C, is required
This conversion should significantly speed up calculations anyway. I will do this one day.
speedups on all calculations with increasing numbers of processors should be near perfect, as alluded to in message https://mail.gna.org/public/relax-devel/2007-04/msg00048.html more benchmarks will follow soon

processors   min   eff     mc    eff     grid   eff
 1           18    100     80    100     134    100
 2            9    100
 4            5     90
 8            3     75
16            1    112.5
32            1     56.25   8     31.25    4    104.6

and the picture that speaks 1000 words [figure: runtime and scaling efficiency graphs]

key:
top graph, black line - achieved runtimes
top graph, red line - expected runtimes with perfect scaling efficiency
bottom graph - scaling efficiency

some notes

0. data was collected on one of chris's small data sets containing 28 residues, not all of which are active for minimisation. columns:
   processors - no. of slave MPI processors
   min - time for a minimisation of models m1-m9 with a fixed diffusion tensor
   eff - approximate parallel efficiency (expected runtime / actual runtime)
   mc - 256 Monte Carlo calculations
   eff - efficiency of the above
   grid - a grid search on an anisotropic diffusion tensor (6 steps)
   eff - efficiency of the above
   tests were run on a cluster of Opterons using gigabit ethernet and MPI
1. these results are crude wall times as measured by python's time.time function for the master, but they do not include startup and shutdown overhead
2. these tests are single point measurements; there are no statistics
3. timings were rounded to 1 second, so for example we must consider data points for more than 16 processors for the min run to be suspect

note if you watch the output carefully you will see one difference between the multiprocessor and uniprocessor runs of the grid search.
The grid search reports all cases of the search where the target function has improved for each processor, rather than for the whole grid search....
I wonder what is causing the Monte Carlo simulations to not be 100% efficient. Of all the code in relax, I would see this as the most amenable for parallelisation. Each simulation can be queued to a different slave.
Bugs, missing features, todos etc.:
-------------------------------------

1. There is very little commenting
This is quite important for the future maintainability of the code.
2. some exceptions do not stop the interpreter properly and there may still be some bugs that cause lockups on throwing exceptions
Yikes.
3. there are no unit tests (though the amount of code that can be unit tested is rather limited as for example writing mock objects for mpi could be fun!)
Most parts of the code could be tested relatively easily. The unit test framework makes this job quite easy.
4. there are no documentation strings
I would recommend compiling the API documentation using scons and then looking at the HTML output. That output should very clearly show what docstrings are missing or deficient.
5. the command line handling needs to be improved: we need to find the current processor implementation, load it, and then ask it what command line options it needs (this will also allow the simplification of the handling of setting up the number of processors, and allow multiprocessors that need more command line arguments, such as ssh tunnels, to get extra arguments). I will also have to design a way of getting the help text for all the processor command line options, whether they are loaded or not
I don't follow. The user shouldn't be asked a question by relax.
6. there are many task comments littered around the code (FIXME:, TODO:, etc.); all of these, except the ones labelled PY3K:, will need to be reviewed, resolved and removed
7. the relax class still has much code for the slave command setup which needs to be removed, as the multi module replaces it
8. the Get_name_command hasn't been tested recently, especially across all of the current processor fabrics
9. there needs to be a way of running the relax system test suite against a list of processor fabrics
I disagree. The unit tests should test all functions (except those requiring missing dependencies). The system/functional tests should be run in the current mode of operation. Therefore to test the operation of relax against the different processor fabrics you would run the test suite multiple times in those different modes.
10. code to control the use of batched command queueing and returning, and the threaded output queue, has been implemented but hasn't got an interface to turn it on and off yet
11. the command queuing code has an idea of how many grains there should be per processor. This isn't under user control at the moment. (The graininess controls how many batches of commands each processor should see. Take for example 3 slaves and 18 commands: with a graininess of 1, on the task queue they would be divided up into 3 batched commands, one for each processor, with each batched command containing 6 sub commands. With a graininess of 3 there would be 9 batched commands, with each batched command containing 2 commands.) This allows for some load balancing on more heterogeneous systems, as the batched commands are held in a queue and handed out to the slave processors as the slaves become available.
12. some of the output prefixing has off by 1 errors
13. re-segregation of the stdout and stderr streams back out into their correct streams is not implemented; everything is reported on stdout. This will require work for the uni_processor as well
I think that the two should never be combined. This makes it challenging with the threading, but it is doable (I did it with my ancient threading code).
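The batching arithmetic of item 11 above (3 slaves, 18 commands: graininess 1 gives 3 batches of 6, graininess 3 gives 9 batches of 2) can be sketched as a small helper. The function name is illustrative, not from the relax sources:

```python
# Sketch of command batching under a graininess setting: the command
# list is split into n_slaves * grain batches, which are then handed
# out to slaves as they become available (allowing load balancing on
# heterogeneous systems).

def batch_commands(commands, n_slaves, grain):
    """Split a command list into n_slaves * grain ordered batches."""
    n_batches = n_slaves * grain
    size, rem = divmod(len(commands), n_batches)
    batches = []
    start = 0
    for i in range(n_batches):
        # The first `rem` batches absorb any remainder command each.
        end = start + size + (1 if i < rem else 0)
        if start < end:
            batches.append(commands[start:end])
        start = end
    return batches
```

A higher graininess means smaller batches and better balancing, at the cost of more queue traffic.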
14. parallelisation of Hessian calculations and the 'all' minimisation
See above.
15. it would be good to give users control of which parts of the program are parallelised during a run
I don't know if this is important or useful from the user's perspective.
16. the uni processor could be implemented as a subclass of multi_processor
17. true virtual classes are not implemented
What is a virtual class?
18. the stdio stream interceptors should be implemented as delegates to StringIO rather than inheriting from StringIO, which would also allow for the use of cStringIO
19. the master processor only does IO and no calculations
Is that not how it currently works?
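The delegation idea of item 18 can be sketched as follows: instead of subclassing StringIO, the interceptor wraps any file-like backend and forwards write() calls to it, so a cStringIO instance (in Python 2) or a real stream could be dropped in unchanged. The class name and prefixing behaviour are illustrative only:

```python
from io import StringIO

# Sketch of a stream interceptor implemented by delegation: the
# backend can be any object with a write() method, rather than being
# fixed by inheritance.

class Prefixed_stream:
    def __init__(self, backend, prefix):
        self._backend = backend   # e.g. StringIO(), sys.__stdout__, ...
        self._prefix = prefix

    def write(self, text):
        # Prefix each complete line before forwarding to the backend.
        for line in text.splitlines(True):
            self._backend.write(self._prefix + line)

    def getvalue(self):
        return self._backend.getvalue()
```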
anyway, that's it for now
Again I have to say that this code is looking really good. This should be released as a proper relax release. I would look at the instructions for creating a relax release. A tag needs to be created and the source and binary packages created. For that you will need to create a GPG key pair specifically for relax and then send me your public key. Then you will be able to package and upload signed files to the download site. I can then check the packages and sign them with the relax key. Cheers, Edward