Re: Redesign of the relax data model: 2. A new run concept



Posted by Edward d'Auvergne on January 15, 2007 - 07:44:
On 1/8/07, gary thompson <garyt@xxxxxxxxxxxxxxx> wrote:
Sorry for the delay in replying but this needed some uninterrupted
time for me to sort through it

Sorry for my delays. The proposed changes of the redesign are quite drastic. Your timing isn't too late though and the ideas in your three posts are very useful:

* re: Redesign of the relax data model: 2. A new run concept;
https://mail.gna.org/public/relax-devel/2007-01/msg00013.html
(Message-id: <f001463a0701071314i61276e67hde685fe3afb8fe42@xxxxxxxxxxxxxx>)

* re: Redesign of the relax data model: 3. Molecules, residues, and
spins; https://mail.gna.org/public/relax-devel/2007-01/msg00014.html
(Message-id: <f001463a0701071417w6bd7927cp8fdd052e698575ec@xxxxxxxxxxxxxx>)

* Re: redesign of the relax data model: 1. Why change?;
https://mail.gna.org/public/relax-devel/2007-01/msg00015.html
(Message-id: <f001463a0701071445i6a2a4e3bid302bb515a40de3c@xxxxxxxxxxxxxx>)

What I'll do is try to break the ideas into those which can be
incorporated into the proposed redesign and those which can be part of
their own subsequent redesign and sanitising of the relax internals.
As I have no formal training as a programmer, your input about design
patterns and anti-patterns is much appreciated (Wikipedia is very
useful for learning about these concepts).  As some of your
suggestions involve drastic changes as well, I think that incremental
changes to the internals would be better than doing it all in one hit.
As part of the redesign, I'm planning on moving through the code
function by function rewriting the function docstrings, writing unit
tests, and removing 'self.run' (see the file
'docs/data_model_redesign' in the 1.3 line for more details).  If we
had unit tests for all functions and methods in relax, then one massive
refactorisation of all the internals would make more sense.  Therefore
a lot of the removal of poor design choices (anti-patterns) could
occur after a comprehensive unit test framework has been implemented.


>   Posted by Edward d'Auvergne on October 11, 2006 - 10:32:
>
>   On Wed, 2006-10-11 at 17:02 +1000, Edward d'Auvergne wrote:
>
>       This post is a proposal for the redesign of the relax data model.  This will
>       affect how data is input into the program, how data is selected, how
>       molecular structures are handled, how spin systems are handled, and how
>       many other parts of relax function.  Importantly the internal structure
>       of 'self.relax.data' will completely change.  These modifications will
>       essentially break every part of relax (the isolated code in the
>       directories 'minimise', 'maths_fns', and 'docs' will be safe from the
>       carnage, as will a few files in the base directory).  If you have any
>       ideas for extending or improving the proposed data model, can see any
>       short-comings, deficiencies, or flaws, are familiar with the PDB
>       conventions, etc., your input is very much sought after.  The changes
>       should occur in the 1.3 line of the repository.  1.2 versions will be
>       unaffected - scripts will remain compatible and the 1.2 line will
>       continue to be supported with bug fixes, etc.
>
>       I have to apologise in advance for the size of this proposal, to
>       simplify it I have divided the text into numbered sections.  Once this
>       initial parent message has been sent I will respond to it with the text
>       of the 4 major sections.  This will allow 4 major threads to branch off
>       from this message on the mailing list archive
>       (https://mail.gna.org/public/relax-devel).  If you have an opinion,
>       idea, etc. about a specific section, could you please post a separate
>       message in response to the relevant major section post?  Also if you
>       have unrelated ideas for one of these sections, could you post these as
>       separate messages as well?  For example if you have separate points
>       about sections 3.1 and 3.5.1, two different posts responding to the
>       parent Section 3 post would be appreciated.  Thanks.  This will help to
>       focus each discussion point into specific threads.
>
>       Edward
>
>
>
>       Redesign of the relax data model
>
>       Index:
>       1.  Why change?
>           1.1  The runs
>           1.2  The molecules
>           1.3  The residues
>           1.4  The spins
>       2.  A new run concept
>           2.1  Parcelling up an abstract space
>           2.2  The run data model
>           2.3  The pipe concept
>       3.  Molecules, residues, and spins
>           3.1  The spin data model
>           3.2  The data selection concept - identifying spin systems
>               3.2.1  Function arguments
>               3.2.2  NH data of a single protein macromolecule
>               3.2.3  A single organic molecule (non-polymeric)
>               3.2.4  A single RNA or DNA macromolecule
>               3.2.5  Complexes
>           3.3  Regular expression
>           3.4  The spin loop
>           3.5  Molecule, sequence, and spin user function classes
>               3.5.1  The 'molecule' user function class
>               3.5.2  The 'sequence' user function class
>               3.5.3  The 'spin' user function class
>           3.6  The input and output files
>       4.  Conclusion
>
>
>
>   Before reading this post, please read the previous posts:
>
>   * The parent message 'Redesign of the relax data model:  A HOWTO for
>   breaking relax.' located at
>   https://mail.gna.org/public/relax-devel/2006-10/msg00053.html
>   (Message-id:
>   <1160550133.9523.54.camel@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>).
>
>   * Section 1 'Redesign of the relax data model:  1.  Why change?' located
>   at https://mail.gna.org/public/relax-devel/2006-10/msg00054.html
>   (Message-id:
>   <1160551172.9523.60.camel@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>).
>
>
>
>   2.  A new run concept
>
>   2.1  Parcelling up an abstract space
>
>   The general idea is to further increase the prominence of the 'run'.
>   Rather than relax executing in an abstract space where the 'run' is
>   passed into each user function as necessary, the idea is that relax
>   executes within a space dedicated to a certain 'run'.  So at the
>   relax prompt, you could type a user function such as:
>
>   relax> run.current()
>   'm8'
>
>   By working in the 'm8' run space, each user function can be executed
>   without the need for the 'run' argument.  Other user functions, such as
>   'run.switch()', can be used to change between runs.

I agree that carrying the run argument throughout the data structure
is an annoying problem and I like the solution, but here is an
extension to it that may engender more flexibility.

There is an interesting parallel here... basically the proposal
is that there should always be a current run
(much in the same way that most shells have a present working
directory). However, it is worth noting that many Unix tools take a
directory argument which overrides the current working directory, and
this engenders both simplicity and flexibility as to which 'context' a
command runs in.

This is analogous to Chris' idea of the 'runs' keyword argument for all user functions (https://mail.gna.org/public/relax-devel/2006-10/msg00060.html, Message-id: <1160562239.14487.98.camel@mrspell>). By default, not supplying the 'runs' argument will cause the user function to operate on the current run, whereas supplying the argument will cause the user function to operate on one or more alternative runs.
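
As a rough illustration of this default-to-current behaviour, here is a minimal sketch. The module-level names 'data' and 'current_run', the helper 'resolve_runs()' and the user function 'value_set()' are all made up for the example and are not actual relax code:

```python
# Hypothetical stand-ins for 'self.relax.data' and 'self.relax.run'.
data = {'m7': {}, 'm8': {}}
current_run = 'm8'


def resolve_runs(runs=None):
    """Return the list of run names a user function should operate on."""
    if runs is None:
        # No argument supplied: fall back to the current run.
        return [current_run]
    if isinstance(runs, str):
        # A single run name.
        return [runs]
    # Otherwise assume a sequence of run names.
    return list(runs)


def value_set(param, value, runs=None):
    """Made-up user function: set a parameter in one or more runs."""
    for name in resolve_runs(runs):
        data[name][param] = value


value_set('param_a', 1.0)                       # current run ('m8') only
value_set('param_b', 2.0, runs=['m7', 'm8'])    # both runs explicitly
```

The only design point is that 'None' means "use the current run", so existing calls without the argument keep working.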


>   2.2  The run data model
>
>   The current run name could be stored in the single data structure
>   'self.relax.run'.  The relax data structure could then be accessed by
>   typing 'self.relax.data[self.relax.run]'.  I.e. 'self.relax.data' is a
>   DictType object (it has key-value pairs) in which the run name key is
>   associated with a specific data container.  As most data structures in
>   the current relax data model are associated with a run (e.g.
>   'self.relax.data.diff[self.run]', 'self.relax.data.res[self.run]',
>   'self.relax.data.pdb[self.run]', etc), the data model significantly
>   simplifies.

Now following on from the comment above, I would suggest that a data
structure containing a stack of runs would be a good idea. Consider a
command that took a run parameter:

def command(self, run=None):
   self.relax.run.push(run)   # make 'run' the current run
   # ... do something
   self.relax.run.pop()       # restore the previous run

Is this stack idea a suggestion for the UI design (i.e. the end user sees it) or is it an internal implementation suggestion for Chris' runs argument idea?


Now there are some intrinsic problems with this setup (basically it is
far too easy to miss a pop, and then debugging really does become a
nightmare). However, Python actually has at least three solutions to
this (not all of which are available in version 2.4; the 'with' solution
requires 2.5):

1. decorators (python 2.4)
   @relax_command
   def command():
      ...do something

   @relax_command then wraps command in a self.relax.run.push/pop(run) pair


2. define relax_command as a functor and then have a default relax_command functor that wraps around with a push and a pop

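   class relax_command:
      # Assumes 'self.relax' has been made available to the functor.
      def __init__(self, function):
          self.function = function
      def __call__(self, *args, **kwargs):
          # Find the run arg, save it in a local variable and remove it from the args.
          run = kwargs.pop('run', None)
          self.relax.run.push(run)
          try:
              return self.function(*args, **kwargs)
          finally:
              # Always restore the previous run, even if the function fails.
              self.relax.run.pop()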

3. the with statement (python 2.5)
see 
http://www.dalkescientific.com/writings/diary/archive/2006/08/23/with_statement.html
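
To make options 1 and 3 concrete, here is a minimal sketch written for current Python rather than 2.4/2.5. The 'RunStack' class, its 'using()' method and the 'which_run()' function are hypothetical names invented for the example; only the push/pop idea comes from the discussion above:

```python
from contextlib import contextmanager
from functools import wraps


class RunStack:
    """Hypothetical stack of run names; the top entry is the current run."""

    def __init__(self, initial):
        self._stack = [initial]

    def current(self):
        return self._stack[-1]

    def push(self, run):
        # Passing None keeps the current run.
        self._stack.append(self.current() if run is None else run)

    def pop(self):
        return self._stack.pop()

    @contextmanager
    def using(self, run=None):
        """Temporarily switch to 'run'; 'finally' guarantees the pop."""
        self.push(run)
        try:
            yield self.current()
        finally:
            self.pop()


runs = RunStack('m8')


def relax_command(function):
    """Decorator: strip the 'run' keyword and wrap the call in push/pop."""
    @wraps(function)
    def wrapper(*args, run=None, **kwargs):
        with runs.using(run):
            return function(*args, **kwargs)
    return wrapper


@relax_command
def which_run():
    return runs.current()


print(which_run())            # uses the current run
print(which_run(run='m4'))    # temporarily overrides it
print(runs.current())         # the override has been popped again
```

Because the pop sits in a 'finally' clause, an exception inside the wrapped function cannot leave a stale run on the stack, which is the debugging problem described above.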

Some asides

A.  I believe the runs that are passed around in relax are strings
which are then used to lookup data in a map. Why not just have
(runs/pipes) as objects... Then for example the call

self.relax.data[self.relax.run]

above becomes

self.relax.run.data

which is a much more object-oriented and encapsulated structure.

You still need a list (array) or dictionary (hash) type structure to store multiple run objects. 'self.relax.data' would be a dictionary type object with key-value pairs, the key being the run name and the value being a standard class instance object containing all the data associated with the run as objects. The dictionary would be more logical than an array in this case.
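
A minimal sketch of such a layout, assuming hypothetical class names 'RunContainer' and 'Relax' (the attribute names 'diff', 'res' and 'pdb' simply mirror the structures mentioned in section 2.2):

```python
class RunContainer:
    """Hypothetical data container: everything belonging to one run."""

    def __init__(self):
        self.diff = None   # placeholder for diffusion tensor data
        self.res = []      # placeholder for per-residue data
        self.pdb = None    # placeholder for structural data


class Relax:
    """Hypothetical stand-in for the 'self.relax' object."""

    def __init__(self):
        self.data = {}     # run name -> RunContainer
        self.run = None    # name of the current run

    def current(self):
        """Return the container of the current run."""
        return self.data[self.run]


relax = Relax()
relax.data['m8'] = RunContainer()
relax.run = 'm8'

# The run name is looked up once, then plain attribute access follows.
relax.current().res.append('Gly 1')
```

This keeps the dictionary for storing multiple runs while still giving each run an encapsulated object of its own.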


B. There is a twist here: if relax is a global variable referenced by
everything, and you want to run relax in a threaded manner
(multiprocessor machines are becoming more and more popular), then
self.relax poses a problem, as we may need a different relax variable
for each processor. The relax variable would then need to be accessed
from thread-local storage, cf.
(http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/302088)

This is a good idea which should be developed a bit more. Instead of passing 'self.relax' to each class in relax and then re-aliasing the argument to 'self.relax' (i.e. creating lots of pointers) the object 'self.relax' could be placed into '__builtin__' as 'relax'. Is this how you would do this in Java?

As for threading problems, if the threads simply pass the calculated
data back to the parent thread, then race conditions in the
modification of the data structures in 'self.relax.data' can be
avoided.  Anyway, the threading code in relax is currently broken.  My
implementation is a poorly designed form of grid computing whereby a
separate relax process is launched on the target machine (which could be a
second CPU on the current machine).  The parent process sends minimal
data to the slave process which does the calculation and returns the
results.  Ideally I would like to separate the threading from the grid
computing so that three different usages could one day be supported:
   * One relax process with multiple threads.
   * Grid computing.
   * Clustering (see the MPI thread starting at
https://mail.gna.org/public/relax-devel/2006-04/msg00023.html,
Message-id: <443E3032.8040907@xxxxxxxxxxxxxxx>.  The thread continues
into the next month as well:
https://mail.gna.org/public/relax-devel/2006-05/threads.html).

The threading issues can be dealt with later though.
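
For reference, a minimal sketch combining the two ideas discussed here: thread-local storage for the per-thread state, and worker threads handing results back so that only the parent modifies the shared data. The names 'state', 'finished' and 'worker()' are hypothetical, and the "calculation" is a placeholder string:

```python
import queue
import threading

# Hypothetical per-thread state: each thread sees its own 'run' attribute.
state = threading.local()

# Workers hand their results back through a thread-safe queue.
finished = queue.Queue()


def worker(run_name):
    # Setting the attribute here affects only the calling thread.
    state.run = run_name
    finished.put((state.run, 'calculated for ' + state.run))


threads = [threading.Thread(target=worker, args=(name,))
           for name in ('m1', 'm2', 'm3')]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

# Only the parent thread touches the shared dictionary.
results = {}
while not finished.empty():
    name, value = finished.get()
    results[name] = value

# The parent never set 'state.run', so it has no such attribute.
print(hasattr(state, 'run'))
print(sorted(results))
```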


>   More information about the data model change is given in the message
>   located at https://mail.gna.org/public/relax-devel/2006-05/msg00008.html
>   (Message-id:
>   <7f080ed10605232038j5036278dg39136d75a05a9904@xxxxxxxxxxxxxx>) and the
>   response located at
>   https://mail.gna.org/public/relax-devel/2006-05/msg00010.html
>   (Message-id:
>   <7f080ed10605241912i7c35f574i94f139588c5fa16b@xxxxxxxxxxxxxx>).
>
>
>   2.3  The pipe concept
>
>   A single run can be thought of as a pipe where data is input, processed,
>   or output as user functions are called.  There are different types of
>   pipe for different analyses, e.g. a reduced spectral density mapping
>   pipe, a model-free pipe, an exponential curve-fitting pipe, etc.  When
>   running relax you choose which run (or pipe) you are currently in and
>   the 'run.switch()' user function allows you to jump between multiple
>   runs (or pipes).  The modification of user functions in which runs are
>   combined or branched (which can be thought of as the pipes merging or
>   splitting) would be straightforward.  For example the
>   'model_selection()' user function currently accepts the following
>   arguments:
>
>   model_selection(self, method=None, modsel_run=None, runs=None)
>
>   In this case the 'modsel_run' can be dropped and the results of model
>   selection placed into the current run (or pipe).  The 'run' user
>   function class could contain the following user functions for pipe
>   manipulation:
>
>   run.copy()    # Create a new run (or pipe) with the current contents of
>   another run (or pipe).
>   run.create()    # Create a new run (or pipe).  Switch to this pipe by
>   default.
>   run.current()    # Print the current run (or pipe).
>   run.delete()    # Delete the given run (or pipe).
>   run.delete_all()    # Delete all runs.  Essentially deleting
>   'self.relax.data'.

You might want to consider a nullObject here so that, if all runs are
deleted, you don't crash but just raise error messages...

The proposed behaviour would be an empty dictionary type object 'self.relax.data' which hence contains no runs. Each of these user functions currently checks for the existence of the runs and throws a RelaxError if the user tells relax to access a non-existent run. Having 'self.relax.data' set to None rather than an empty but modifiable dictionary-like container wouldn't be necessary.
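
A minimal sketch of this behaviour. The 'RelaxError' class below is only a stand-in for relax's own error class, and 'check_run()', 'run_create()' and 'run_delete_all()' are hypothetical helper names:

```python
class RelaxError(Exception):
    """Stand-in for relax's error class; the real one may differ."""


data = {}   # hypothetical stand-in for 'self.relax.data'


def check_run(name):
    """Raise a RelaxError if the named run does not exist."""
    if name not in data:
        raise RelaxError("The run %r does not exist." % name)


def run_create(name):
    if name in data:
        raise RelaxError("The run %r already exists." % name)
    data[name] = {}


def run_delete_all():
    # Empty the dictionary in place rather than replacing it with None.
    data.clear()


run_create('m8')
run_delete_all()
run_create('m8')          # still works: the container is empty but usable

try:
    check_run('m9')
except RelaxError as error:
    print(error)
```

Since the container stays a dictionary after all runs are deleted, the membership test alone is enough to produce a clean error message.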


>   run.hybridise()    # Fuse two runs (or pipes) into the current run (or
>   pipe).  Overlapping data in the two runs must be identical!
>   run.list()    # Print all runs (or pipes).
>   run.switch()    # Switch to another run (or pipe).

Now here is a further comment: if run were an object that contained its
own data, many of these processes could be dealt with using Python's own
semantics

e.g.

run.copy():
        from copy import copy
        new_run = copy(run)

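That's exactly the idea: each run in the 'self.relax.data' dictionary would have its own class instance acting as a data container. The function deepcopy would be better in this case, as copy only makes a shallow copy and the two runs would then share their underlying data structures.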
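
The difference between the two functions can be seen in a small sketch, where the 'Run' class and its 'values' attribute are hypothetical and the numbers are made up:

```python
from copy import copy, deepcopy


class Run:
    """Hypothetical run container holding one mutable data structure."""

    def __init__(self):
        self.values = {'s2': [0.8, 0.9]}


original = Run()
shallow = copy(original)
deep = deepcopy(original)

# The shallow copy shares the inner dictionary with the original ...
print(shallow.values is original.values)
# ... whereas the deep copy owns an independent duplicate.
print(deep.values is original.values)

# Changing the original therefore leaks into the shallow copy only.
original.values['s2'].append(0.7)
print(len(shallow.values['s2']), len(deep.values['s2']))
```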


run.create():
        new_run = Run()

run.delete():
        new_run = Run()
        new_run = None  # run disappears due to garbage collection/ref counts

How about:

run.delete():
   del self.relax.data['name']
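
The reason for deleting the dictionary entry rather than rebinding a name can be shown with a short sketch (the 'Run' class and the 'data' dictionary are hypothetical stand-ins):

```python
class Run:
    """Hypothetical run container."""


data = {'m8': Run()}      # stand-in for 'self.relax.data'

# Rebinding a local name only drops that one reference;
# the dictionary still holds the run.
run = data['m8']
run = None
print('m8' in data)

# Removing the key is what actually releases the run.
del data['m8']
print('m8' in data)
```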


>   One evolutionary path of the run concept which could be followed with
>   this set of proposed changes is to completely replace it with the pipe
>   concept.  All instances of 'run' in relax would be renamed to 'pipe'.
>   For example 'run.create()' will become 'pipe.create()',
>   'self.relax.data[self.relax.run]' will become
>   'self.relax.data[self.relax.pipe]', etc.  I believe that the name 'pipe'
>   is a better representation of the run concept than 'run'.  What do you
>   think of the idea?

Another name would be processor or command.

I'm not sure if these names encapsulate the UI concept that well. Essentially the run corresponds to a data pipe. A command would be more synonymous with the user functions, many of which can operate sequentially on the data pipe. The run or data pipe refers to the data rather than any actions, and the name processor implies an action.

Cheers,

Edward


