Re: relax and Grid computing.



Posted by Gary S. Thompson on March 19, 2007 - 23:29:
Edward d'Auvergne wrote:

Hi,

Sorry I had to change the subject.  Unfortunately Gmail will cause the
thread to break!  This is a response to the post located at
https://mail.gna.org/public/relax-devel/2007-03/msg00090.html
(Message-id: <45FEB28A.7010600@xxxxxxxxxxxxxxx>).


On 3/20/07, Gary S. Thompson <garyt@xxxxxxxxxxxxxxx> wrote:

Edward d'Auvergne wrote:

> Gary,
>
> It might be important to note that the code that you commented out was
> actually grid computing code rather than threading code.
> Unfortunately I called my grid computing code 'threading'!  Half of it
> could probably be kept as is, although relabelled to 'grid'.

Can you give an outline of how the grid code works?  I found it fairly
convoluted when I tried to look at it....


Ok, I'll try to explain as best I can.  It has been quite a while
since I wrote this grid computing code so please excuse me if I get
something wrong.

Firstly, relax is executed as you normally would in either the prompt
or script UI mode.  To set up grid computing you run the
'thread.read()' user function.  This reads a file which defines each
host by its host name or IP address, your user name on the machine,
the location of relax, the slave process priority number on the
machine, and the number of CPUs or CPU cores on the machine (to launch
multiple slave processes on one machine).  More information about
this setup, SSH public key authentication (hence a password-less login
to the machine), etc. is given in the thread.read() documentation.
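
For example, the set up might look something like this (the column
layout and the argument name are only a guess here, the thread.read()
documentation has the real format):

    # A hypothetical hosts file, one machine per line:
    #
    #   host name or IP     user     relax location          priority   CPUs
    #   192.168.0.10        edward   /usr/local/bin/relax    15         2
    #   node1.example.org   edward   /home/edward/relax      10         4

    # Then, in the prompt or in a script:
    thread.read(file='hosts')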

As the script or prompt statements are executed, relax will operate as
normal.  That is until the minimise() method of the
generic_fns.minimise.Minimise class instance is executed by the
minimise() user function.  Currently only the Monte Carlo simulation
calculations are sent off to the elements of the computer grid.  See
the code for the full details.  The instantiation of the
RelaxMinParentThread class starts the process.  Essentially what
happens is that the parent thread starts n RelaxThread instances,
which are true threads, one for each of the n Monte Carlo simulations.
Each thread then does all the grid computing work, asynchronously
communicating with the slave processes.  Unfortunately there is no
separation between the threading framework and the grid computing
framework at this point.
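
In outline the pattern is something like this (a toy sketch only, not
the actual RelaxMinParentThread code, and monte_carlo_sim() is just a
stand-in):

    import threading

    def monte_carlo_sim(sim_index):
        # Stand-in for the grid computing work a RelaxThread does for one
        # Monte Carlo simulation (talking to a slave process and collecting
        # the result).
        pass

    n = 8  # the number of Monte Carlo simulations (made-up value)
    threads = [threading.Thread(target=monte_carlo_sim, args=(i,)) for i in range(n)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()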

The grid computing algorithm I have come up with is the code of the
RelaxThread.run() method (see the thread_classes.py file).  I have
used two queues, self.job_queue and self.results_queue (see the
Python module Queue).  Both are queues of job numbers.  An
infinite loop is used for execution.  Firstly, a job number is taken
from self.job_queue.  The job number is then added back to the end
of the job queue - this makes the threads and slaves fail-safe,
and means that idle faster machines will pick up the jobs of the slower
machines while those are still running.  To prevent race conditions,
the element of the self.job_locks array corresponding to the job
number is locked.  A list of completed jobs, self.finished_jobs, is used
to determine if the job has already been finished by a faster thread, to
prevent the job number being added back to the job queue.  This allows
the job queue to be depopulated as jobs finish.  Once a job has been
completed its number is added to self.results_queue.  Termination of
the infinite loop occurs once the job number None is pulled out of the
queue.  To terminate all threads (and the corresponding slave processes),
None is added back to the job queue.
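
As a rough sketch (not the real RelaxThread.run() code, and
run_job_on_slave() is just a stand-in for the slave communication),
each thread runs a loop along these lines:

    import queue        # the module is called Queue in older Python versions
    import threading

    def run_job_on_slave(job_number):
        # Stand-in for sending one Monte Carlo simulation to a slave process
        # and waiting for its result.
        pass

    def worker(job_queue, results_queue, job_locks, finished_jobs):
        # Toy version of the loop described above.
        while True:
            job = job_queue.get()
            # Pulling None out terminates the loop; it is put back so that
            # every other thread (and hence its slave process) also shuts down.
            if job is None:
                job_queue.put(None)
                return
            # Add the job number back onto the end of the queue, unless it has
            # already been finished, so that an idle fast machine can pick up
            # the job of a slow or dead machine.
            if job not in finished_jobs:
                job_queue.put(job)
            # The per-job lock prevents two threads running the same job at once.
            with job_locks[job]:
                if job in finished_jobs:
                    continue
                run_job_on_slave(job)
                finished_jobs.append(job)
                results_queue.put(job)

    # Minimal driver: 5 jobs shared between 3 threads.
    n_jobs = 5
    job_queue = queue.Queue()
    for number in range(n_jobs):
        job_queue.put(number)
    results_queue = queue.Queue()
    job_locks = [threading.Lock() for _ in range(n_jobs)]
    finished_jobs = []

    threads = [threading.Thread(target=worker, args=(job_queue, results_queue,
                                                     job_locks, finished_jobs))
               for _ in range(3)]
    for t in threads:
        t.start()

    # Collect one result per job, then terminate the threads by adding None.
    for _ in range(n_jobs):
        results_queue.get()
    job_queue.put(None)
    for t in threads:
        t.join()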

I hope that wasn't too confusing,

Edward


Hi Ed

No, this wasn't too confusing. It helps quite a lot and is relatively compatible with what I have (policy is relatively weak inside the processor objects, and for good reason: there are several possibilities for setting it up).

The one thing that confuses me currently is how to bring up relax on a remote machine in a state where it is runnable without running a script into it... I have played around with dummy runs in the latest iteration of the multi branch but am not sure if this is the way to go... I also had a look at the save state code in state.py and this seems quite heavy. I presume that it dumps the complete program state to a pickle and then rejuvenates it at the other side? Consider line 101 of mpi4py_processor: the command is given a copy of the relax_instance and should now execute commands against it (whether to update the state, or to do something and then return an object via the processor). How do I ensure that it is in a usable state? I guess I could initialise the main interpreter and then save its state, but by that point it is running a script!
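
To make the pickle part concrete, this is roughly what I am imagining, as a toy sketch only (I haven't checked it against the actual state.py code):

    import pickle

    # Toy sketch: serialise the complete program state on the master and
    # rejuvenate it on the slave.  Not the real state.py code.
    def dump_state(relax_instance):
        # Everything reachable from the relax instance goes into one byte
        # string, which could be written to disk or sent over MPI.
        return pickle.dumps(relax_instance)

    def rejuvenate_state(byte_string):
        # Rebuild the instance on the other side.
        return pickle.loads(byte_string)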


One thing to note here is that I will at some stage try to rewrite the commands to keep the slave states in sync as we run, so we don't have to save the whole state. But that is for a later day, or never if you consider that not to be the way to go...

More questions:

Where should I be attacking the division problem? My main thought was to effectively add restrictions to some commands. So, considering the grid search, I would add an extra parameter at the generic and functional levels which would give a range of steps within the current parameters to calculate... e.g. here are the ranges which give a grid of 10x10x10, i.e. 1000 steps; slave 1, you calculate 1-250, slave 2, 251-500, and so on... Is this the correct way to go?
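
To make that concrete, this is the sort of split I have in mind (a rough sketch only, the function name and the flattened step numbering are made up):

    # Divide a flattened grid of total_steps points into contiguous ranges,
    # one range per slave.  Illustrative only, not existing relax code.
    def split_grid(total_steps, n_slaves):
        chunk = (total_steps + n_slaves - 1) // n_slaves
        ranges = []
        for i in range(n_slaves):
            lower = i * chunk
            upper = min((i + 1) * chunk, total_steps)
            if lower < upper:
                ranges.append((lower, upper))
        return ranges

    # A 10x10x10 grid has 1000 points; with 4 slaves this gives
    # [(0, 250), (250, 500), (500, 750), (750, 1000)].
    print(split_grid(10 * 10 * 10, 4))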

regards
gary

--
-------------------------------------------------------------------
Dr Gary Thompson
Astbury Centre for Structural Molecular Biology,
University of Leeds, Astbury Building,
Leeds, LS2 9JT, West-Yorkshire, UK             Tel. +44-113-3433024
email: garyt@xxxxxxxxxxxxxxx                   Fax  +44-113-2331407
-------------------------------------------------------------------




