Redesign of the relax data model: 3. Molecules, residues, and spins -- October 11, 2006

On Wed, 2006-10-11 at 17:02 +1000, Edward d'Auvergne wrote:

This post is a proposal for the redesign of the relax data model.  This will
affect how data is input into the program, how data is selected, how
molecular structures are handled, how spin systems are handled, and how
many other parts of relax function.  Importantly the internal structure
of 'self.relax.data' will completely change.  These modifications will
essentially break every part of relax (the isolated code in the
directories 'minimise', 'maths_fns', and 'docs' will be safe from the
carnage, as will a few files in the base directory).  If you have any
ideas for extending or improving the proposed data model, can see any
short-comings, deficiencies, or flaws, are familiar with the PDB
conventions, etc., your input is very much sought after.  The changes
should occur in the 1.3 line of the repository.  1.2 versions will be
unaffected - scripts will remain compatible and the 1.2 line will
continue to be supported with bug fixes, etc.

I have to apologise in advance for the size of this proposal.  To
simplify it I have divided the text into numbered sections.  Once this
initial parent message has been sent I will respond to it with the text
of the 4 major sections.  This will allow 4 major threads to branch off
from this message on the mailing list archive
(https://mail.gna.org/public/relax-devel).  If you have an opinion,
idea, etc. about a specific section, could you please post a separate
message in response to the relevant major section post?  Also if you
have unrelated ideas for one of these sections, could you post these as
separate messages as well?  For example if you have separate points
about sections 3.1 and 3.5.1, two different posts responding to the
parent Section 3 post would be appreciated.  Thanks.  This will help to
focus each discussion point into specific threads.

Edward



Redesign of the relax data model

Index:
1.  Why change?
    1.1  The runs
    1.2  The molecules
    1.3  The residues
    1.4  The spins
2.  A new run concept
    2.1  Parcelling up an abstract space
    2.2  The run data model
    2.3  The pipe concept
3.  Molecules, residues, and spins
    3.1  The spin data model
    3.2  The data selection concept - identifying spin systems
        3.2.1  Function arguments
        3.2.2  NH data of a single protein macromolecule
        3.2.3  A single organic molecule (non-polymeric)
        3.2.4  A single RNA or DNA macromolecule
        3.2.5  Complexes
    3.3  Regular expression
    3.4  The spin loop
    3.5  Molecule, sequence, and spin user function classes
        3.5.1  The 'molecule' user function class
        3.5.2  The 'sequence' user function class
        3.5.3  The 'spin' user function class
    3.6  The input and output files
4.  Conclusion



Before reading this post, please read the previous posts:

* The parent message 'Redesign of the relax data model:  A HOWTO for
breaking relax.' located at
https://mail.gna.org/public/relax-devel/2006-10/msg00053.html
(Message-id:
<1160550133.9523.54.camel@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>).

* Section 1 'Redesign of the relax data model:  1.  Why change?' located
at https://mail.gna.org/public/relax-devel/2006-10/msg00054.html
(Message-id:
<1160551172.9523.60.camel@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>).

* Section 2 'Redesign of the relax data model:  2.  A new run concept'
located at https://mail.gna.org/public/relax-devel/2006-10/msg00056.html
(Message-id:
<1160555137.9523.70.camel@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>).



3.  Molecules, residues, and spins

3.1  The spin data model

For model-free analysis, etc., it is the NMR relaxation of individual
spin systems (or atoms) which is most important.  The residue number and
molecule that the spin belongs to are of secondary importance.  A three
level system could be used to access and categorise the spins.  In
letting

data = self.relax.data[self.relax.run],

the first level is the molecule:

data.mol[0]

The structure 'mol' is an array, each element being a container for
molecule specific data.  This allows for multiple molecules if
necessary.  The second level is the residue:

data.mol[0].res[0]

This 'res' data structure is an array, each element being a container
for residue specific data.  This allows for multiple residues per
molecule (although this is not essential).  The third and last level is
the spin or atom:

data.mol[0].res[0].spin[0]

Again the 'spin' data structure is an array, each element being a
container for spin specific data.  This allows for multiple spins per
residue.

For example, the optimised chi-squared value of the 2nd spin of the 56th
residue of the 3rd molecule would be stored at
'data.mol[2].res[55].spin[1].chi2'.  The residue number would be stored
at 'data.mol[2].res[55].num'.  The molecule name would be stored at
'data.mol[2].name'.
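
As a rough illustration of this three-level layout, the nesting could be
sketched as below.  This is only a sketch: the class names 'Container',
'Residue', 'Molecule', and 'RunData' are hypothetical and are not part of
relax, and the chi-squared value is a made-up number.

```python
class Container(object):
    """Hypothetical generic container holding a number and a name."""

    def __init__(self, num=None, name=None):
        self.num = num
        self.name = name


class Residue(Container):
    def __init__(self, num=None, name=None):
        Container.__init__(self, num, name)
        self.spin = []    # Third level: one container per spin.


class Molecule(Container):
    def __init__(self, num=None, name=None):
        Container.__init__(self, num, name)
        self.res = []     # Second level: one container per residue.


class RunData(object):
    def __init__(self):
        self.mol = []     # First level: one container per molecule.


# Build a tiny example: one molecule, one residue, two spins.
data = RunData()
data.mol.append(Molecule(num=1, name='A'))
data.mol[0].res.append(Residue(num=56, name='GLY'))
data.mol[0].res[0].spin.append(Container(num=1, name='N'))
data.mol[0].res[0].spin.append(Container(num=2, name='CA'))

# Spin-specific results hang off the spin container (illustrative value).
data.mol[0].res[0].spin[1].chi2 = 12.5
```

Note that the array index and the stored number are independent: here
residue number 56 sits at 'data.mol[0].res[0]'.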


3.2  The data selection concept - identifying spin systems

3.2.1  Function arguments

The current way that spins are identified in the user functions (as well
as internal relax functions) is through the residue number and/or
residue name.  There is no formal or consistent way that this is done
though.  Some arguments are called 'res_num' while others are just
'num'.  The proposal is to standardise the interface and create the file
called 'generic_fns/spin_selector.py'.  Using the three-level spin data
model introduced in section 3.1, six identifiers are possible.  These
are:

Molecule number, 'data.mol[0].num' (e.g. the NMR model number).
Molecule name,   'data.mol[0].name' (e.g. the chain or segment ID).
Residue number,  'data.mol[0].res[0].num'.
Residue name,    'data.mol[0].res[0].name'.
Atom number,     'data.mol[0].res[0].spin[0].num' (e.g. the PDB atom number).
Atom name,       'data.mol[0].res[0].spin[0].name' (e.g. the PDB atom name).

These could be synonymous with the spin identifying function arguments
'mol_num', 'mol_name', 'res_num', 'res_name', 'atom_num', and
'atom_name'.  These would all default to the inactive value of None and
would be the very last arguments of the relevant user functions.  Are
there other ways that a spin or set of spins can be identified?

3.2.2  NH data of a single protein macromolecule

The operation of relax would remain essentially the same for those
studying NH relaxation data of single molecule protein systems.  The
residues can be individually selected using the 'res_num' and/or
'res_name' arguments.  In this case the molecule number and name are
left as None, hence this will default to 'data.mol[0]'.  As there is
only one spin per residue, the spin number and name can be left as None,
hence defaulting to 'data.mol[0].res[index].spin[0]'.  The average user
need not know about these default data structures as this information
will essentially be invisible.  However power users are free to manipulate
this data structure.

3.2.3  A single organic molecule (non-polymeric)

For a single non-polymeric organic molecule, only the 'atom_num' and
'atom_name' arguments need be used.  The molecule and residue will,
invisibly from the perspective of the user, default to 'data.mol[0]' and
'data.mol[0].res[0]' respectively.

3.2.4  A single RNA or DNA macromolecule

The four arguments 'res_num', 'res_name', 'atom_num', and 'atom_name'
can be used to identify the residue and the different spins of those
residues.  Individual spins can be selected using all four arguments.
Like spins of all residues (e.g. N3 data) can be selected using solely
the atom specific arguments, whereas all data of an individual residue
can be selected using the residue specific arguments.  This approach can
also be used when both NH and CA data of a single protein macromolecule
have been collected.

3.2.5  Complexes

The individual molecules in a complex can be selected using the
molecular arguments 'mol_num' and 'mol_name'.


3.3  Regular expression

The six identifiers 'mol_num', 'mol_name', 'res_num', 'res_name',
'atom_num', and 'atom_name' will all be allowed to be Python regular
expression strings (the number arguments can be integers and the names
simple strings).  This allows for the selection of ranges of residues,
multiple residue types at the same time, etc.  For example
"res_name='[UG]'" when working with RNA will select both uracil and
guanine.

The user-supplied regular expressions for all six identifiers will need
to be tested for validity.  This could be done with the function
'self.relax.generic_fns.spin_selector.validate()' using try statements
together with the 'compile' function from the 're' module.  An example
of this testing is at the start of the 'self.sel_res()' function in the
file 'generic_fns/selection.py'.
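
A minimal sketch of such a validity test is given below.  It assumes a
stand-alone function taking the identifiers as keyword arguments and
returning a dictionary of error messages.  Both the signature and the
return value are assumptions made for illustration, not the existing
relax code.

```python
import re


def validate(**identifiers):
    """Check that every string identifier compiles as a regular expression.

    Hypothetical sketch: returns a dictionary mapping the name of each bad
    argument to an error message.  The dictionary is empty if all are valid.
    """
    errors = {}
    for arg_name, value in identifiers.items():
        # None (inactive) and integers need no regular expression check.
        if value is None or isinstance(value, int):
            continue
        try:
            re.compile(value)
        except (re.error, TypeError) as err:
            # re.error: malformed pattern.  TypeError: not a string at all.
            errors[arg_name] = str(err)
    return errors


# "[UG]" is a valid character class, whereas "[UG" is unterminated.
ok = validate(res_num=5, res_name='[UG]', mol_name=None)
bad = validate(res_name='[UG')
```

Here 'ok' is an empty dictionary and 'bad' has the single key 'res_name'.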


3.4  The spin loop

Many parts of relax require looping over all the relaxation data (or
spins).  The implementation of this proposal will require nested looping
over all molecules, all residues, and all spins combined with tests for
matches to the 'mol_num', 'mol_name', 'res_num', 'res_name', 'atom_num',
and 'atom_name' arguments.  Rather than implementing this numerous times
throughout the program, the loop could be implemented just once within
the function 'self.relax.generic_fns.spin_selector.spin_loop()'.  In
addition to the six identifiers, this new function could accept as an
argument a spin-specific function passed by the part of the code
requesting the loop.  The 'spin_loop()' function will then pass the data
structure 'spin', which is for example an alias to
'self.relax.data[self.relax.run].mol[0].res[16].spin[3]', to the
spin-specific function.  A sample implementation of the loop function
could be:

 
    from re import match

    def spin_loop(self, fn=None, mol_num=None, mol_name=None, res_num=None,
                  res_name=None, atom_num=None, atom_name=None):
        """Function for selectively looping over all spins."""

        # Reassign the data container.
        data = self.relax.data[self.relax.run]

        # Loop over the molecules.
        for mol in data.mol:
            # Skip the molecule if there is no match to 'mol_num'
            # (an integer is compared directly, a string is a regex).
            if isinstance(mol_num, int) and mol.num != mol_num:
                continue
            elif isinstance(mol_num, str) and not match(
                    mol_num, str(mol.num)):
                continue

            # Skip the molecule if there is no match to 'mol_name'.
            if mol_name is not None and not match(mol_name, str(mol.name)):
                continue

            # Loop over the residues.
            for res in mol.res:
                # Skip the residue if there is no match to 'res_num'.
                if isinstance(res_num, int) and res.num != res_num:
                    continue
                elif isinstance(res_num, str) and not match(
                        res_num, str(res.num)):
                    continue

                # Skip the residue if there is no match to 'res_name'.
                if res_name is not None and not match(
                        res_name, str(res.name)):
                    continue

                # Loop over the spins.
                for spin in res.spin:
                    # Skip the spin if there is no match to 'atom_num'.
                    if isinstance(atom_num, int) and spin.num != atom_num:
                        continue
                    elif isinstance(atom_num, str) and not match(
                            atom_num, str(spin.num)):
                        continue

                    # Skip the spin if there is no match to 'atom_name'.
                    if atom_name is not None and not match(
                            atom_name, str(spin.name)):
                        continue

                    # Execute the supplied spin-specific function,
                    # passing in the data for the current spin.
                    fn(spin)


It will be up to the spin-specific function passed in by the calling
function to handle the 'spin.select' value.  Because of the complexity
of the loop, the use of this single 'spin_loop()' function will simplify
the relax code base, will minimise potential bugs, and will simplify
future changes to the relax data model (if necessary).
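
To show how calling code might use such a loop, here is a self-contained
sketch.  It is not the relax implementation: the 'Container' class, the
'matches()' helper, the 'mols' argument (used in place of
'self.relax.data[self.relax.run].mol'), and the 'collect()' callback are
all hypothetical names introduced for this example.

```python
import re


class Container(object):
    """Hypothetical stand-in for a molecule, residue, or spin container."""

    def __init__(self, num=None, name=None, **kwargs):
        self.num = num
        self.name = name
        self.__dict__.update(kwargs)


def matches(identifier, value):
    """Return True if 'value' passes the identifier (None matches all)."""
    if identifier is None:
        return True
    if isinstance(identifier, int):
        return value == identifier
    return re.match(identifier, str(value)) is not None


def spin_loop(mols, fn, mol_num=None, mol_name=None, res_num=None,
              res_name=None, atom_num=None, atom_name=None):
    """Call fn(spin) for every spin matching the six identifiers."""
    for mol in mols:
        if not (matches(mol_num, mol.num) and matches(mol_name, mol.name)):
            continue
        for res in mol.res:
            if not (matches(res_num, res.num)
                    and matches(res_name, res.name)):
                continue
            for spin in res.spin:
                if (matches(atom_num, spin.num)
                        and matches(atom_name, spin.name)):
                    fn(spin)


# A small RNA-like example with residues 1 (G) and 2 (U).
g = Container(1, 'G', spin=[Container(23, 'N1', select=True),
                            Container(18, 'C8', select=True)])
u = Container(2, 'U', spin=[Container(38, 'N3', select=False),
                            Container(52, 'C5', select=True)])
mols = [Container(None, None, res=[g, u])]

# The spin-specific function decides what to do with deselected spins.
selected = []


def collect(spin):
    if spin.select:
        selected.append(spin.name)


# Collect all selected carbon spins of the G and U residues.
spin_loop(mols, collect, res_name='[UG]', atom_name='^C')
```

One point to keep in mind is that 're.match' only anchors the pattern at
the start of the string, so "res_num='1'" would also match residue 11.
Anchoring the expression, e.g. "res_num='^1$'", avoids this.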


3.5  Molecule, sequence, and spin user function classes

For the three levels of the new data model, currently only user
functions relating to the sequence or residues exist.  These are all
located in the 'sequence' user function class.  The idea would be to
create three independent classes of user function: 'molecule',
'sequence', and 'spin'.

3.5.1  The 'molecule' user function class

This user function class could contain functions such as:

molecule.add()
molecule.copy()    # Copy the molecule information (name and num) from another pipe.
molecule.delete()
molecule.info()    # Print the molecule info.
molecule.sort()

Other functions could be created to enable associations between the
'data.mol[index]' data structure and the Scientific Python PDB data
structure.  This will allow the 'vectors()' user function to correctly
extract XH bond vectors from the PDB data structure.  The 'pdb' user
function class could also be renamed to 'structure' to enable other 3D
molecular structure files to be transparently supported by the
'molecule' user functions (e.g. CIF).  The mapping of the structure to
the molecule-residue-spin framework could be quite complex, especially
if the Scientific PDB format is not the only format handled.  Would
anyone have ideas of how to map multiple molecules in the PDB file to the
molecule name and number in the proposed molecule-residue-spin data
model?  NMR models in a PDB are easy to handle as these are treated as
different structures by the Scientific Python PDB reader, each can be
isolated by model number.  If a specific model is chosen, the model
number could become 'data.mol[index].num'.  If the average of all models
is chosen, then 'data.mol[index].num' could be None.  A number of other
questions are:

How would you expect to handle chain IDs and segment IDs?  Should this
be the 'mol_name'?

Are there other molecule identifiers in the PDB or other structure file
formats (CIF, etc.)?

How do you think a multi-chain molecule should be handled when a segment
of the polypeptide or nucleotide chain has been removed (e.g. insulin)?

If two chains 'A' and 'B' are part of the same polypeptide or RNA
transcript and have sequentially numbered residue numbers, should both
chains be handled as the same molecule in data.mol?  Or should
data.mol[0].name = 'A' and data.mol[1].name = 'B'?

Do you think that the molecule-residue-spin data model is sufficient to
handle all standard variants of the PDB (or other 3D structure formats)?

Would a molecule-chain-residue-spin data model be better?

For identifying a molecule, are the data structures data.mol[0].name and
data.mol[0].num sufficient?

Are there other data structures which could be placed in data.mol[0]?

Which of these issues do you think should be handled by the user rather
than internally by relax?  I would prefer that relax does most of the
work.

What other user functions do you think would be useful to add to the
'molecule' class?

And finally what other 3D structure files do you think should be
supported?  These must have parsers available as Python modules.

3.5.2  The 'sequence' user function class

This user function class could remain as is.  The user functions could
be modified to include the arguments 'mol_num' and 'mol_name' so that
they can be associated with certain molecules if required.

3.5.3  The 'spin' user function class

This new user function class could contain functions such as:

spin.add()
spin.copy()    # Copy the spin info (name and num) from another pipe.
spin.delete()
spin.display()
spin.read()
spin.sort()
spin.write()

These functions could be applied selectively using the 'mol_num',
'mol_name', 'res_num', or 'res_name' arguments.


3.6  The input and output files

Up to 6 columns could be used to identify spin-specific data (for both
input and output).  These could correspond to the six spin identifiers
'mol_num', 'mol_name', 'res_num', 'res_name', 'atom_num', and
'atom_name'.  If any of these are set to None for all spins, the column
could be dropped.  For example if no molecule info exists, these two
columns can be dropped.  If no residues exist, these can be dropped as
well.  For protein NH data of a single molecule, the data could appear
as:

res_num res_name    atom_num    atom_name   ...
1       GLY         1           N           ...
2       PRO         11          N           ...
3       LEU         28          N           ...

For RNA, the data could appear as:

res_num res_name    atom_num    atom_name   ...
1       G           23          N1          ...
1       G           18          C8          ...
2       U           38          N3          ...
2       U           52          C5          ...
2       U           46          C6          ...

For a non-polymeric organic molecule, the data could appear as:

atom_num    atom_name   ...
1           C1          ...
16          C16         ...
23          C23         ...

Are there any other standard ways of representing this data in a
columnar format?  These formats may not be the best solution.
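
The column dropping rule described above could be sketched as follows.
The function name 'format_spin_table()', the list of dictionaries used
as input, and the fixed column width of 12 are assumptions made for this
example only.

```python
def format_spin_table(rows):
    """Format spin identifier rows, dropping columns that are None everywhere.

    'rows' is a list of dictionaries keyed by the six identifier names
    (a hypothetical intermediate representation).
    """
    columns = ['mol_num', 'mol_name', 'res_num', 'res_name',
               'atom_num', 'atom_name']

    # Keep a column only if at least one spin has a value for it.
    kept = [c for c in columns
            if any(row.get(c) is not None for row in rows)]

    # Header line followed by one line per spin.
    lines = [''.join(c.ljust(12) for c in kept).rstrip()]
    for row in rows:
        cells = [str(row.get(c)).ljust(12) for c in kept]
        lines.append(''.join(cells).rstrip())
    return '\n'.join(lines)


# Protein NH data of a single molecule: no molecule columns are written.
rows = [
    {'res_num': 1, 'res_name': 'GLY', 'atom_num': 1, 'atom_name': 'N'},
    {'res_num': 2, 'res_name': 'PRO', 'atom_num': 11, 'atom_name': 'N'},
]
table = format_spin_table(rows)
```

Reading such a file back would need the header line to determine which
of the six identifiers are present.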
