Re: Redesign of the relax data model: 3. Molecules, residues, and spins -- January 15, 2007

> 3.5 Molecule, sequence, and spin user function classes


[snip]

>   3.5.1  The 'molecule' user function class
>
>   This user function class could contain functions such as:
>
>   molecule.add()
>   molecule.copy()    # Copy the molecule information (name and num) from
>   another pipe.
>   molecule.delete()
>   molecule.info()   # Print the molecule info.
>   molecule.sort()
>
>   Other functions could be created to enable associations between the
>   'data.mol[index]' data structure and the Scientific Python PDB data
>   structure.  This will allow the 'vectors()' user function to correctly
>   extract XH bond vectors from the PDB data structure.  The 'pdb' user
>   function class could also be renamed to 'structure' to enable other 3D
>   molecular structure files to be transparently supported by the
>   'molecule' user functions (e.g. CIF).  The mapping of the structure to
>   the molecule-residue-spin framework could be quite complex, especially
>   if the Scientific PDB format is not the only format handled.  Would any
>   one have ideas of how to map multiple molecules in the PDB file to the
>   molecule name and number in the proposed molecule-residue-spin data
>   model?


>   NMR models in a PDB are easy to handle as these are treated as
>   different structures by the Scientific Python PDB reader, each can be
>   isolated by model number.  If a specific model is chosen, the model
>   number could become 'data.mol[index].num'.  If the average of all models
>   is chosen, then 'data.mol[index].num' could be None.  A number of other
>   questions are:

As an aside, I wrote a parser for PDB files a while back that dealt
with multiple molecules identified by chain IDs, segment IDs, or no IDs
at all, and I had to prioritise along the lines of segid > chain-id >
nothing.  However, I had to add special cases for NMR structure files,
and actually had the idea of an ensemble of structures as well as a
single structure.
>
>   How would you expect to handle chain IDs and segment IDs?  Should this
>   be the 'mol_name'?

    I would just give each molecule a name.  If you have a segment ID,
use that; if not, use the chain ID; and if there is no chain ID either,
use a generic name like mol_1, mol_2, etc.


If there is no segment ID or chain ID, would it be safe to set the
molecule name to None?  In this case, surely there is only a single
macromolecule in the PDB file?
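
The naming priority discussed above (segment ID, then chain ID, then a
generic name) could be sketched roughly as follows.  This is only an
illustration: the function name 'molecule_name' and its arguments are
hypothetical and not part of relax or of any PDB parser.

```python
def molecule_name(seg_id, chain_id, index):
    """Pick a molecule name with the priority segment ID > chain ID > generic.

    'index' is the zero-based position of the molecule in the file and is
    only used to build the generic fallback name (an assumption).
    """
    # PDB fields are often blank strings rather than missing, so strip first.
    for candidate in (seg_id, chain_id):
        if candidate is not None and candidate.strip():
            return candidate.strip()

    # Hypothetical fallback naming: mol_1, mol_2, ...
    return "mol_%d" % (index + 1)


# Example calls with made-up identifiers.
names = [
    molecule_name("SEGA", "A", 0),  # segment ID wins
    molecule_name("    ", "B", 1),  # blank segment ID, so chain ID is used
    molecule_name(None, "", 2),     # neither is present
]
```

If None were preferred over a generic name when neither ID is present,
only the final return statement would need to change.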

>   Are there other molecule identifiers in the PDB or other structure file
>   formats (CIF, etc.)?
>
>   How do you think a multi-chain molecule should be handled when a segment
>   of the polypeptide or nucleotide chain has been removed (e.g. insulin)?

I am not following this one
>
>   If two chains 'A' and 'B' are part of the same polypeptide or RNA
>   transcript and have sequentially numbered residue numbers, should both
>   chains be handled as the same molecule in data.mol?  Or should
>   data.mol[0].name = 'A' and data.mol[1].name = 'B'?
>
I would call them A and B and start numbering residues at each break
of the chain ID.  Numbering of all things in PDB files can be a
nightmare (numbers can even be negative and in the wrong order...), and
many programs just renumber residues and atoms in the order they are
read.


So if the one molecule has two chains, A and B, we should treat the
structure as containing two molecules A and B.  Does this have the
potential to confuse the user?

As the spin specific data structures will all be implemented as list
types (arrays), there is no need to renumber the residues of the
protein, nucleic acid, etc.  Their name and number will be stored
separately from the array indices.
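
A minimal sketch of this idea, assuming simple container classes, is
given below.  Only 'data.mol[index]' with its 'name' appears in the
proposal; the class names and the 'res' and 'spin' list attributes are
hypothetical placeholders.

```python
class Spin:
    """One spin; the name and number are plain attributes."""
    def __init__(self, name=None, num=None):
        self.name = name
        self.num = num


class Residue:
    """One residue holding a list of spins (attribute name is an assumption)."""
    def __init__(self, name=None, num=None):
        self.name = name
        self.num = num
        self.spin = []


class Molecule:
    """One molecule holding a list of residues (attribute name is an assumption)."""
    def __init__(self, name=None):
        self.name = name
        self.res = []


# Residue numbers as they might appear in a file: negative and out of order.
mol = Molecule('A')
for num, name in [(-3, 'GLY'), (10, 'PRO'), (4, 'LEU')]:
    res = Residue(name, num)
    res.spin.append(Spin('N'))
    mol.res.append(res)

# The list position is independent of the stored residue number, so no
# renumbering is required.
numbers = [res.num for res in mol.res]
```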

>   Do you think that the molecule-residue-spin data model is sufficient to
>   handle all standard variants of the PDB (or other 3D structure formats)?
>
>   Would a molecule-chain-residue-spin data model be better?

No, definitely not!  Trying to identify which chains belong to which
molecules is generally a hopeless operation for most of these files,
and what would it gain us?

>
>   For identifying a molecule, are the data structures data.mol[0].name and
>   data.mol[0].num sufficient?

Yes.  Does the number need to be explicit?  What happens if you delete a
molecule or add one?  Are the molecule numbers increased monotonically
on reading and never reused, or should the number just be the position
of the molecule in the current ordering?


I was unsure about the 'data.mol[index].num' variable.  This probably
has no use.

>   Are there other data structures which could be placed in data.mol[0]?
>
>   Which of these issues do you think should be handled by the user rather
>   than internally by relax?  I would prefer that relax does most of the
>   work.
>
>   What other user functions do you think would be useful to add to the
>   'molecule' class?
>
>   And finally what other 3D structure files do you think should be
>   supported?  These must have parsers available as Python modules.
>

I think we should support import/export via plugin classes (which can
be loaded using the same code as is used in the unit_test) and then
provide a PDB reader as a starting place.  The plugins would be based
around an interface class, much in the same way that SAX is structured:
http://www.saxproject.org/

Both readers and writers would then be subclasses of the same class:


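class structure_interface:
    """Callback interface shared by structure readers and writers."""

    def start_molecule(self, name, number):
        pass

    def end_molecule(self):
        pass

    # 'res_type' rather than 'type' to avoid shadowing the Python builtin.
    def start_residue(self, res_type, number):
        pass

    def end_residue(self):
        pass

    def add_atom(self, number, name, x, y, z, b_factor=None):
        pass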

Consider a simple molecule:

A.TYR.HA
A.TYR.CO
A.GAL.HA
A.GAL.CO

B.HIS.HA
B.HIS.CO


A reader would get called as follows:

start_molecule('A',1)
start_residue('TYR',1)
add_atom(1,'HA',1.0,1.0,1.0)
add_atom(2,'CO',1.0,1.0,1.0)
end_residue()
start_residue('GAL',1)
add_atom(1,'HA',1.0,1.0,1.0)
add_atom(2,'CO',1.0,1.0,1.0)
end_residue()
end_molecule()

start_molecule('B',2)
start_residue('HIS',1)
add_atom(1,'HA',1.0,1.0,1.0)
add_atom(2, 'CO',1.0,1.0,1.0)
end_residue()
end_molecule()


Writing is done in the inverse manner.

This design produces a simple interface to work with and allows for
future additions such as ensembles to be catered for with relative
ease.  It is simple to learn and code and is symmetric.

As the plugins would be classes which are identified at run time, they
could take extra arguments during instantiation to allow for, e.g.,
selection of chain-id in preference to segid.
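
To illustrate the symmetry of this design, here is a rough sketch of
one handler that collects the callbacks into nested lists and
dictionaries, plus a driver that replays that data into any other
handler (the "writing" direction).  The interface is restated so the
example runs on its own; the names 'StructureInterface',
'StructureCollector', and 'replay', and the storage layout, are all
hypothetical.

```python
class StructureInterface:
    """Restatement of the proposed callback interface (no-op defaults)."""
    def start_molecule(self, name, number): pass
    def end_molecule(self): pass
    def start_residue(self, res_type, number): pass
    def end_residue(self): pass
    def add_atom(self, number, name, x, y, z, b_factor=None): pass


class StructureCollector(StructureInterface):
    """Hypothetical handler storing everything it is told about."""
    def __init__(self):
        self.molecules = []

    def start_molecule(self, name, number):
        self.molecules.append({'name': name, 'number': number, 'residues': []})

    def start_residue(self, res_type, number):
        self.molecules[-1]['residues'].append(
            {'type': res_type, 'number': number, 'atoms': []})

    def add_atom(self, number, name, x, y, z, b_factor=None):
        self.molecules[-1]['residues'][-1]['atoms'].append(
            (number, name, x, y, z, b_factor))


def replay(molecules, handler):
    """Drive any handler from collected data, mirroring the reading calls."""
    for mol in molecules:
        handler.start_molecule(mol['name'], mol['number'])
        for res in mol['residues']:
            handler.start_residue(res['type'], res['number'])
            for atom in res['atoms']:
                handler.add_atom(*atom)
            handler.end_residue()
        handler.end_molecule()


# A reader would make calls like these while parsing a file.
collector = StructureCollector()
collector.start_molecule('A', 1)
collector.start_residue('TYR', 1)
collector.add_atom(1, 'HA', 1.0, 1.0, 1.0)
collector.add_atom(2, 'CO', 1.0, 1.0, 1.0)
collector.end_residue()
collector.end_molecule()

# Replaying into a second collector reproduces the same structure.
copy = StructureCollector()
replay(collector.molecules, copy)
```

A real writer plugin would presumably override the same methods to emit
records of its file format instead of storing them.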


This is another great idea!  The ScientificPython, scipy, or other
pre-written PDB parsers could be easily supported as well as
readers/writers coded into relax.  I would suggest that this idea be
implemented separately from the proposed redesign.

>   3.6  The input and output files
I presume this refers to the relax run files...


This refers to the 'results' files and the value writing/displaying
user functions (for the relaxation data, etc).

>   Up to 6 columns could be used to identify spin-specific data (for both
>   input and output).  These could correspond to the six spin identifiers
>   'mol_num', 'mol_name', 'res_num', 'res_name', 'atom_num', and
>   'atom_name'.  If any of these are set to None for all spins, the column
>   could be dropped.  For example if no molecule info exists, these two
>   columns can be dropped.  If no residues exist, these can be dropped as
>   well.

I would leave them in and give them null values.  Dropping things just
increases code complexity.


I don't mind complexity on our side if it means simplicity for the
user.  For example if you are working on a protein studying solely the
backbone N relaxation, the molecule name, atom number, and atom name
need not be placed in the file containing the R1 relaxation data.  In
that case all you need is the residue number and name and the
relaxation values and errors.  If you are an organic chemist working
on a non-polymeric molecule, the residue number and name are
inconsequential and could be dropped from the output.  Parsing these
files using the names in the header line should be straightforward.
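
As a rough sketch of how the column dropping and header-based parsing
could work, assuming whitespace-separated columns and spins held as
plain dictionaries: the function names 'write_table' and 'read_table'
are hypothetical, and the R1 values are made-up placeholders.

```python
# The six spin identifiers named in the proposal.
COLUMNS = ['mol_num', 'mol_name', 'res_num', 'res_name',
           'atom_num', 'atom_name']


def write_table(spins, value_columns):
    """Format spins as text, dropping identifiers that are None for all spins."""
    kept = [c for c in COLUMNS
            if any(s.get(c) is not None for s in spins)]
    header = kept + value_columns
    lines = ['\t'.join(header)]
    for s in spins:
        lines.append('\t'.join(str(s.get(c)) for c in header))
    return '\n'.join(lines)


def read_table(text):
    """Parse the text back into dictionaries using the header line names."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        row = dict.fromkeys(COLUMNS)       # dropped identifiers become None
        row.update(zip(header, line.split()))  # values are read as strings
        rows.append(row)
    return rows


# Backbone N data for a single molecule; the R1 values are placeholders.
spins = [
    {'res_num': 1, 'res_name': 'GLY', 'atom_num': 1, 'atom_name': 'N', 'R1': 1.5},
    {'res_num': 2, 'res_name': 'PRO', 'atom_num': 11, 'atom_name': 'N', 'R1': 1.7},
]
text = write_table(spins, ['R1'])
rows = read_table(text)
```

Because the reader keys everything on the header names, a file with or
without the molecule columns is handled by the same code.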

>   For protein NH data of a single molecule, the data could appear
>   as:
>
>   res_num res_name    atom_num    atom_name   ...
>   1       GLY         1           N           ...
>   2       PRO         11          N           ...
>   3       LEU         28          N           ...
>
>   For RNA, the data could appear as:
>
>   res_num res_name    atom_num    atom_name   ...
>   1       G           23          N1          ...
>   1       G           18          C8          ...
>   2       U           38          N3          ...
>   2       U           52          C5          ...
>   2       U           46          C6          ...
>
>   For a non-polymeric organic molecule, the data could appear as:
>
>   atom_num    atom_name   ...
>   1           C1          ...
>   16          C16         ...
>   23          C23         ...
>
>   Are there any other standard ways of representing this data in a
>   columnar format?  These formats may not be the best solution.


Cheers,

Edward
