Redesign of the relax data model: 1. Why change? -- October 11, 2006

On Wed, 2006-10-11 at 17:02 +1000, Edward d'Auvergne wrote:

This post is a proposal for the redesign of the relax data model.  This will
affect how data is input into the program, how data is selected, how
molecular structures are handled, how spin systems are handled, and how
many other parts of relax function.  Importantly the internal structure
of 'self.relax.data' will completely change.  These modifications will
essentially break every part of relax (the isolated code in the
directories 'minimise', 'maths_fns', and 'docs' will be safe from the
carnage, as will a few files in the base directory).  If you have any
ideas for extending or improving the proposed data model, can see any
short-comings, deficiencies, or flaws, are familiar with the PDB
conventions, etc., your input is very much sought after.  The changes
should occur in the 1.3 line of the repository.  1.2 versions will be
unaffected - scripts will remain compatible and the 1.2 line will
continue to be supported with bug fixes, etc.

I have to apologise in advance for the size of this proposal.  To
simplify it I have divided the text into numbered sections.  Once this
initial parent message has been sent I will respond to it with the text
of the 4 major sections.  This will allow 4 major threads to branch off
from this message on the mailing list archive
(https://mail.gna.org/public/relax-devel).  If you have an opinion,
idea, etc. about a specific section, could you please post a separate
message in response to the relevant major section post?  Also if you
have unrelated ideas for one of these sections, could you post these as
separate messages as well?  For example if you have separate points
about sections 3.1 and 3.5.1, two different posts responding to the
parent Section 3 post would be appreciated.  Thanks.  This will help to
focus each discussion point into specific threads.

Edward



Redesign of the relax data model

Index:
1.  Why change?
    1.1  The runs
    1.2  The molecules
    1.3  The residues
    1.4  The spins
2.  A new run concept
    2.1  Parcelling up an abstract space
    2.2  The run data model
    2.3  The pipe concept
3.  Molecules, residues, and spins
    3.1  The spin data model
    3.2  The data selection concept - identifying spin systems
        3.2.1  Function arguments
        3.2.2  NH data of a single protein macromolecule
        3.2.3  A single organic molecule (non-polymeric)
        3.2.4  A single RNA or DNA macromolecule
        3.2.5  Complexes
    3.3  Regular expression
    3.4  The spin loop
    3.5  Molecule, sequence, and spin user function classes
        3.5.1  The 'molecule' user function class
        3.5.2  The 'sequence' user function class
        3.5.3  The 'spin' user function class
    3.6  The input and output files
4.  Conclusion




Before reading this post, please read the parent message 'Redesign of
the relax data model:  A HOWTO for breaking relax.' located at
https://mail.gna.org/public/relax-devel/2006-10/msg00053.html
(Message-id:
<1160550133.9523.54.camel@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>).


1.  Why change?

It is becoming apparent that the capability and flexibility of the
current relax data model is very limited.  For example posts to both the
relax-users (https://mail.gna.org/public/relax-users) and relax-devel
(https://mail.gna.org/public/relax-devel) mailing lists by Alex Hansen
demonstrate that the data model is not flexible enough to handle
relaxation data from RNA or DNA.  While handling the RNA data in the 1.2
relax versions is possible, the required script would be far too
complex.  I have identified four major categories where I believe that
the relax data model is not flexible or elegant enough:  the runs, the
molecules, the residues, and the spin systems (or atoms).  If you can
think of any other logical categories, please say.  As the planned
changes are so disruptive, fine tuning and perfecting the design now is
very important.


1.1  The runs

The way runs are currently handled in relax is far from optimal.  The
history of the run concept is linked to the history of the program.
relax was originally designed as a small program for the creation of
input files for Art Palmer's Modelfree program, for reading the output
files, and as an implementation of the model selection data analysis
step.  While relax still does all of these things (with Dasha now
supported as well), it has evolved such that it implements all of the
steps of the data analysis chain and can be used as a complete
replacement.  The origin of the run was the Modelfree input and output
file handling function where each model-free model optimised was called
a run.  When the prompt and script based interfaces were implemented and
user functions first created, no user functions were associated with a
'run'.  As relax has evolved, the percentage of user functions
associated with the run concept has gradually risen.  Now almost every
single user function requires a run argument.

The problem with the current run concept is that, because you must
supply the run name to each user function, the run string needs to be
passed to all relevant functions in the program.  For each user function
called this results in the string bouncing around many functions, each
time being placed into 'self.run'.  This dense branching effect means
that the text 'self.run' is extremely widespread within the relax code
base.  For example, by grepping the source code, the number of lines
containing the text 'self.run' in relax version 1.2.7 is 1645!  This
branching approach mandates knowing which functions require the run
argument, passing the string to that function, and then making sure that
the run argument is placed into 'self.run' so that the previous run is
not accidentally accessed.  The current branching run concept is an
important and constant source of bugs.  The complexity of the source
code is unnecessarily high.
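
As a rough illustration of the problem described above, here is a minimal
sketch of the pattern.  The class 'Analysis' and its methods are hypothetical
and are not taken from the relax source code.  They only mimic a run name
being copied into 'self.run' by each user function, and show what happens
when one function forgets to do so.

```python
# Hypothetical sketch, not relax code: a run name threaded through helpers.
class Analysis:
    def __init__(self):
        self.run = None
        self.data = {}          # Maps a run name to a list of values.

    def read(self, run, values):
        self.run = run          # Every user function must remember this step.
        self._store(values)

    def _store(self, values):
        # Internal helpers rely on 'self.run' having been set by the caller.
        self.data.setdefault(self.run, []).extend(values)

    def append(self, run, values):
        # Bug: 'self.run' is not updated, so the previous run is accessed.
        self._store(values)


analysis = Analysis()
analysis.read('m1', [1.0])
analysis.read('m2', [2.0])
analysis.append('m1', [3.0])    # Intended for 'm1', but stored under 'm2'.
```

In this sketch the value 3.0 ends up in the data of run 'm2' without any
error being raised, which is the kind of accidental access to the previous
run mentioned above.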


1.2  The molecules

In the 1.2 relax versions only a single molecule is supported.  This
molecule must also be the first molecule in the supplied PDB file.
relax cannot currently handle molecular complexes.  For example, in a PDB
file of an RNA/protein complex there is no way of specifying whether you
are studying the RNA or the protein.


1.3  The residues

Currently relax assumes that you are studying a polymer system in which
the data can be characterised by residue number (and name).  However if
you are studying a molecule that isn't a polymer, there are no residues.
A 'residue number' and 'residue name' could be invented for the system
but, as the residue concept is entrenched into relax, there may be
problems with this approach.


1.4  The spins

relax also assumes that you only have a single relaxation data set per
residue.  For studying the backbone NH relaxation of a protein this
isn't a problem.  However in RNA, DNA, and non-polymeric molecules,
there are likely to be multiple spins studied per 'residue'.  The data
model should be modified to allow the analysis of these types of
molecules.  An additional benefit when studying a protein is that you
would then be able to study the backbone NH data simultaneously with CA
data (together with any other data sets you may have collected).
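
The limitation can be sketched with plain Python dictionaries.  This is only
an assumed illustration: the names 'per_residue' and 'per_spin', the key
layout, and the numbers are placeholders, and neither structure is the actual
layout of 'self.relax.data' nor the proposed data model.

```python
# Hypothetical sketch, not relax code: one data set per residue versus
# one data set per spin.  The R1 values are arbitrary placeholders.

# Keyed by (residue number, residue name): only one data set fits.
per_residue = {}
per_residue[(5, 'GLY')] = {'spin': 'N', 'r1': 1.2}
per_residue[(5, 'GLY')] = {'spin': 'CA', 'r1': 0.9}   # Overwrites the N data.

# Keyed by (residue number, residue name, spin name): both data sets fit.
per_spin = {}
per_spin[(5, 'GLY', 'N')] = {'r1': 1.2}
per_spin[(5, 'GLY', 'CA')] = {'r1': 0.9}              # Both are kept.
```

With the residue-only key the second assignment silently replaces the first,
whereas adding the spin name to the key lets the NH and CA data coexist.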
