re: Redesign of the relax data model: 3. Molecules, residues, and spins -- January 07, 2007

  On Wed, 2006-10-11 at 17:02 +1000, Edward d'Auvergne wrote:

      This post is a proposal for the redesign of the relax data model.  This will
      affect how data is input into the program, how data is selected, how
      molecular structures are handled, how spin systems are handled, and how
      many other parts of relax function.  Importantly the internal structure
      of 'self.relax.data' will completely change.  These modifications will
      essentially break every part of relax (the isolated code in the
      directories 'minimise', 'maths_fns', and 'docs' will be safe from the
      carnage, as will a few files in the base directory).  If you have any
      ideas for extending or improving the proposed data model, can see any
      short-comings, deficiencies, or flaws, are familiar with the PDB
      conventions, etc., your input is very much sought after.  The changes
      should occur in the 1.3 line of the repository.  1.2 versions will be
      unaffected - scripts will remain compatible and the 1.2 line will
      continue to be supported with bug fixes, etc.

      I have to apologise in advance for the size of this proposal.  To
      simplify it I have divided the text into numbered sections.  Once this
      initial parent message has been sent I will respond to it with the text
      of the 4 major sections.  This will allow 4 major threads to branch off
      from this message on the mailing list archive
      (https://mail.gna.org/public/relax-devel).  If you have an opinion,
      idea, etc. about a specific section, could you please post a separate
      message in response to the relevant major section post?  Also if you
      have unrelated ideas for one of these sections, could you post these as
      separate messages as well?  For example if you have separate points
      about sections 3.1 and 3.5.1, two different posts responding to the
      parent Section 3 post would be appreciated.  Thanks.  This will help to
      focus each discussion point into specific threads.

      Edward

      Redesign of the relax data model

      Index:
      1.  Why change?
          1.1  The runs
          1.2  The molecules
          1.3  The residues
          1.4  The spins
      2.  A new run concept
          2.1  Parcelling up an abstract space
          2.2  The run data model
          2.3  The pipe concept
      3.  Molecules, residues, and spins
          3.1  The spin data model
          3.2  The data selection concept - identifying spin systems
              3.2.1  Function arguments
              3.2.2  NH data of a single protein macromolecule
              3.2.3  A single organic molecule (non-polymeric)
              3.2.4  A single RNA or DNA macromolecule
              3.2.5  Complexes
          3.3  Regular expression
          3.4  The spin loop
          3.5  Molecule, sequence, and spin user function classes
              3.5.1  The 'molecule' user function class
              3.5.2  The 'sequence' user function class
              3.5.3  The 'spin' user function class
          3.6  The input and output files
      4.  Conclusion


  Before reading this post, please read the previous posts:

  * The parent message 'Redesign of the relax data model:  A HOWTO for
  breaking relax.' located at
  https://mail.gna.org/public/relax-devel/2006-10/msg00053.html
  (Message-id:
  <1160550133.9523.54.camel@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>).

  * Section 1 'Redesign of the relax data model:  1.  Why change?' located
  at https://mail.gna.org/public/relax-devel/2006-10/msg00054.html
  (Message-id:
  <1160551172.9523.60.camel@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>).

  * Section 2 'Redesign of the relax data model:  2.  A new run concept'
  located at https://mail.gna.org/public/relax-devel/2006-10/msg00056.html
  (Message-id:
  <1160555137.9523.70.camel@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>).

  3.  Molecules, residues, and spins

  3.1  The spin data model

  For model-free analysis, etc., it is the NMR relaxation of individual
  spin systems (or atoms) which is most important.  The residue number and
  molecule that the spin belongs to are of secondary importance.  A three
  level system could be used to access and categorise the spins.  In
  letting

  data = self.relax.data[self.relax.run],

  the first level is the molecule:

  data.mol[0]

  The structure 'mol' is an array, each element being a container for
  molecule specific data.  This allows for multiple molecules if
  necessary.  The second level is the residue:

  data.mol[0].res[0]

  This 'res' data structure is an array, each element being a container
  for residue specific data.  This allows for multiple residues per
  molecule (although this is not essential).  The third and last level is
  the spin or atom:

  data.mol[0].res[0].spin[0]

  Again the 'spin' data structure is an array, each element being a
  container for spin specific data.  This allows for multiple spins per
  residue.

  For example, the optimised chi-squared value of the 2nd spin of the 56th
  residue of the 3rd molecule would be stored at
  'data.mol[2].res[55].spin[1].chi2'.  The residue number would be stored
  at 'data.mol[2].res[55].num'.  The molecule name would be stored at
  'data.mol[2].name'.
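
The three-level layout described above can be sketched with a few plain container classes. This is only an illustration: the class names 'RunData', 'Molecule', 'Residue', and 'Spin' are hypothetical and not part of the proposal; only the attribute names 'mol', 'res', 'spin', 'num', 'name', and 'chi2' come from the text.

```python
class Container:
    """Hypothetical base container; extra attributes are attached freely."""

    def __init__(self, name=None, num=None):
        self.name = name
        self.num = num


class Spin(Container):
    """Third level: holds spin specific data such as chi2."""


class Residue(Container):
    """Second level: holds residue specific data and a list of spins."""

    def __init__(self, name=None, num=None):
        super().__init__(name, num)
        self.spin = []


class Molecule(Container):
    """First level: holds molecule specific data and a list of residues."""

    def __init__(self, name=None, num=None):
        super().__init__(name, num)
        self.res = []


class RunData:
    """Stand-in for 'self.relax.data[self.relax.run]'."""

    def __init__(self):
        self.mol = []


# Build one molecule with one residue and one spin.
data = RunData()
data.mol.append(Molecule(name='A', num=1))
data.mol[0].res.append(Residue(name='GLY', num=56))
data.mol[0].res[0].spin.append(Spin(name='N', num=1))

# Spin specific values are stored on the spin container itself.
data.mol[0].res[0].spin[0].chi2 = 12.5
```

Keeping 'num' and 'name' on every level means the six identifiers of section 3.2.1 map directly onto attributes of the three container types.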


  3.2  The data selection concept - identifying spin systems

  3.2.1  Function arguments

  The current way that spins are identified in the user functions (as well
  as internal relax functions) is through the residue number and/or
  residue name.  There is no formal or consistent way that this is done
  though.  Some arguments are called 'res_num' while others are just
  'num'.  The proposal is to standardise the interface and create the file
  called 'generic_fns/spin_selector.py'.  Using the three-level spin data
  model introduced in section 3.1, six identifiers are possible.  These
  are:

  Molecule number, 'data.mol[0].num' (e.g. the NMR model number).
  Molecule name,   'data.mol[0].name' (e.g. the chain or segment ID).
  Residue number,  'data.mol[0].res[0].num'.
  Residue name,    'data.mol[0].res[0].name'.
  Atom number,     'data.mol[0].res[0].spin[0].num' (e.g. the PDB atom
  number).
  Atom name,       'data.mol[0].res[0].spin[0].name' (e.g. the PDB atom
  name).

  These could be synonymous with the spin identifying function arguments
  'mol_num', 'mol_name', 'res_num', 'res_name', 'atom_num', and
  'atom_name'.  These would all default to the inactive value of None and
  would be the very last arguments of the relevant user functions.  Are
  there other ways that a spin or set of spins can be identified?


one answer would be to use a little language concept. Thus for example
Molmol and the UCSF systems use

#<molecule number> | <molecule_name>:<residue_selection>[,
<residue_selection>...]@<atom_selection> [,<atom_selection>]

residue_selection=<residue_number> | <residue_range> | <residue_type>
residue_range=<residue_number>-<residue_number>
atom_selection=<string_and_wildcards>


this reduces selection to a single argument plus a simple parser which
would yield selection objects which can identify if a
molecule/residue/spin selection is selected and be passed around the
system. Having a selection object engenders clarity and simplicity:

e.g.

class selection:
    def selected_spins(self):
        '''Return an iterator over the selected spins, where each spin is
        a reference to a spin container of the form
        self.relax.data.mol[0].res[16].spin[3].'''


This allows for considerable flexibility for the user and a simple
internal structure
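
A minimal sketch of such a selection object follows, assuming a simplified form of the grammar above: an optional '#<molecule>' part, an optional ':<residue_selection>,...' part, and an optional '@<atom_selection>,...' part. The class name 'Selection', its method names, and the use of shell-style wildcards via 'fnmatch' are assumptions made for illustration, not the Molmol or UCSF syntax in full.

```python
import fnmatch
import re
from types import SimpleNamespace as NS


class Selection:
    """Hypothetical selection object parsed from e.g. '#A:1-10,GLY@N,C*'."""

    def __init__(self, text):
        m = re.fullmatch(r'(?:#([^:@]+))?(?::([^@]+))?(?:@(.+))?', text)
        if m is None:
            raise ValueError("bad selection string: %r" % text)
        mol, res, atoms = m.groups()
        self.mol = mol
        self.res = res.split(',') if res else []
        self.atoms = atoms.split(',') if atoms else []

    def has_mol(self, mol):
        # No molecule part selects every molecule.
        return self.mol is None or self.mol in (str(mol.num), mol.name)

    def has_res(self, res):
        if not self.res:
            return True
        for item in self.res:
            rng = re.fullmatch(r'(-?\d+)-(-?\d+)', item)
            if rng:
                # A residue range such as '1-10'.
                low, high = int(rng.group(1)), int(rng.group(2))
                if res.num is not None and low <= res.num <= high:
                    return True
            elif item in (str(res.num), res.name):
                # A single residue number or a residue type.
                return True
        return False

    def has_spin(self, spin):
        if not self.atoms:
            return True
        return any(fnmatch.fnmatchcase(spin.name, p) for p in self.atoms)

    def selected_spins(self, data):
        """Yield references to every selected spin container."""
        for mol in data.mol:
            if not self.has_mol(mol):
                continue
            for res in mol.res:
                if not self.has_res(res):
                    continue
                for spin in res.spin:
                    if self.has_spin(spin):
                        yield spin


# Toy data standing in for the mol/res/spin structure.
data = NS(mol=[NS(num=1, name='A', res=[
    NS(num=2, name='GLY', spin=[NS(num=1, name='N'), NS(num=2, name='CA')]),
    NS(num=15, name='LEU', spin=[NS(num=3, name='N')]),
])])

picked = [(s.num, s.name) for s in Selection('#A:1-10@N').selected_spins(data)]
```

Because the residue part is parsed rather than treated as a regular expression, number ranges such as '1-10' are handled directly.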


  3.2.2  NH data of a single protein macromolecule

  The operation of relax would remain essentially the same for those
  studying NH relaxation data of single molecule protein systems.  The
  residues can be individually selected using the 'res_num' and/or
  'res_name' arguments.  In this case the molecule number and name are
  left as None, hence this will default to 'data.mol[0]'.  As there is
  only one spin per residue, the spin number and name can be left as None,
  hence defaulting to 'data.mol[0].res[index].spin[0]'.  The average user
  need not know about these default data structures, this information will
  essentially be invisible.  However power users are free to manipulate
  this data structure.

  3.2.3  A single organic molecule (non-polymeric)

  For a single non-polymeric organic molecule, only the 'atom_num' and
  'atom_name' arguments need be used.  The molecule and residue will,
  invisibly from the perspective of the user, default to 'data.mol[0]' and
  'data.mol[0].res[0]' respectively.

  3.2.4  A single RNA or DNA macromolecule

  The four arguments 'res_num', 'res_name', 'atom_num', and 'atom_name'
  can be used to identify the residue and the different spins of those
  residues.  Individual spins can be selected using all four arguments.
  Like spins of all residues (e.g. N3 data) can be selected using solely
  the atom specific arguments, whereas all data of an individual residue
  can be selected using the residue specific arguments.  This approach can
  also be used when both NH and CA data of a single protein macromolecule
  have been collected.

  3.2.5  Complexes

  The individual molecules in a complex can be selected using the
  molecular arguments 'mol_num' and 'mol_name'.

  3.3  Regular expression

  The six identifiers 'mol_num', 'mol_name', 'res_num', 'res_name',
  'atom_num', and 'atom_name' will all be allowed to be Python regular
  expression strings (the number arguments can be integers and the names
  simple strings).  This allows for the selection of ranges of residues,
  multiple residue types at the same time, etc.  For example
  "res_name='[UG]'" when working with RNA will select both uracil and
  guanine.


This doesn't really allow for selection of residue number ranges
though... which is one of the most useful selections generally


  The user supplied regular expression for all six identifiers will need
  to be tested for validity.  This could be done with the function
  'self.relax.generic_fns.spin_selector.validate()' using try statements
  together with the 'compile' function from the 're' module.  An example
  of this testing is at the start of the 'self.sel_res()' function in the
  file 'generic_fns/selection.py'.
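
The validity test described above could look roughly like the following sketch. The function name 'validate' comes from the text, but its keyword-argument signature and the choice to raise ValueError are assumptions made for illustration.

```python
import re


def validate(**identifiers):
    """Check that every string identifier compiles as a regular expression.

    Hypothetical signature: identifiers are passed by keyword, for example
    validate(res_num='1[0-9]', res_name='[UG]').
    """
    for arg, value in identifiers.items():
        # None (inactive) and integers need no compilation.
        if not isinstance(value, str):
            continue
        try:
            re.compile(value)
        except re.error as err:
            raise ValueError("invalid regular expression for %s=%r: %s"
                             % (arg, value, err))


# Valid patterns, integers, and None all pass silently.
validate(mol_name=None, res_num=5, res_name='[UG]')

# An unterminated character set is reported with the argument name.
try:
    validate(res_name='[UG')
    failed = False
except ValueError:
    failed = True
```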


  3.4  The spin loop


wouldn't a function that returned an iterator be better?

  Many parts of relax require looping over all the relaxation data (or
  spins).  The implementation of this proposal will require nested looping
  over all molecules, all residues, and all spins combined with tests for
  matches to the 'mol_num', 'mol_name', 'res_num', 'res_name', 'atom_num',
  and 'atom_name' arguments.  Rather than implementing this numerous times
  throughout the program, the loop could be implemented just once within
  the function 'self.relax.generic_fns.spin_selector.spin_loop()'.  In
  addition to the six identifiers, this new function could accept as an
  argument a spin-specific function passed by the part of the code
  requesting the loop.  The 'spin_loop()' function will then pass the data
  structure 'spin', which is for example an alias to
  'self.relax.data.mol[0].res[16].spin[3]', to the spin-specific function.
  A sample implementation of the loop function could be:


      from re import match

      def spin_loop(self, fn=None, mol_num=None, mol_name=None,
                    res_num=None, res_name=None, atom_num=None,
                    atom_name=None):
          """Function for selectively looping over all spins."""

          # Reassign the data container.
          data = self.relax.data[self.relax.run]

          # Loop over the molecules.
          for mol in data.mol:
              # Skip the molecule if there is no match to 'mol_num'.
              if isinstance(mol_num, int) and mol.num != mol_num:
                  continue
              elif (isinstance(mol_num, str)
                    and not match(mol_num, str(mol.num))):
                  continue

              # Skip the molecule if there is no match to 'mol_name'.
              if mol_name is not None and not match(mol_name, str(mol.name)):
                  continue

              # Loop over the residues.
              for res in mol.res:
                  # Skip the residue if there is no match to 'res_num'.
                  if isinstance(res_num, int) and res.num != res_num:
                      continue
                  elif (isinstance(res_num, str)
                        and not match(res_num, str(res.num))):
                      continue

                  # Skip the residue if there is no match to 'res_name'.
                  if (res_name is not None
                          and not match(res_name, str(res.name))):
                      continue

                  # Loop over the spins.
                  for spin in res.spin:
                      # Skip the spin if there is no match to 'atom_num'.
                      if isinstance(atom_num, int) and spin.num != atom_num:
                          continue
                      elif (isinstance(atom_num, str)
                            and not match(atom_num, str(spin.num))):
                          continue

                      # Skip the spin if there is no match to 'atom_name'.
                      if (atom_name is not None
                              and not match(atom_name, str(spin.name))):
                          continue

                      # Execute the supplied spin-specific function,
                      # passing in the data for the current spin.
                      fn(spin)


  It will be up to the spin-specific function passed in by the calling
  function to handle the 'spin.select' value.  Because of the complexity
  of the loop, the use of this single 'spin_loop()' function will simplify
  the relax code base, will minimise potential bugs, and will simplify
  future changes to the relax data model (if necessary).


Use of an iterator object will provide flexibility, as iterators can be
wrapped, filtered, and generally mucked about with using Python's loops
and itertools. What's more, they are a doddle to code, as all you do is
write an ordinary function and call yield with a value each time you
have identified a selected spin
(http://www.python.org/dev/peps/pep-0255/). This also allows
arbitrary selection to be added as wrapper iterators or filtered
iterators.
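
A rough sketch of the generator variant is given below. It assumes the data container is passed in explicitly rather than read from 'self.relax.data', and the helper 'matches' is a hypothetical name. The matching rules follow the quoted 'spin_loop()' sample: integers are compared directly and strings are treated as regular expressions.

```python
import re
from types import SimpleNamespace as NS


def matches(selector, value):
    """Hypothetical helper: None matches all, int is exact, str is a regex."""
    if selector is None:
        return True
    if isinstance(selector, int):
        return value == selector
    return re.match(selector, str(value)) is not None


def spin_loop(data, mol_num=None, mol_name=None, res_num=None,
              res_name=None, atom_num=None, atom_name=None):
    """Generator yielding each spin container matching the identifiers."""
    for mol in data.mol:
        if not (matches(mol_num, mol.num) and matches(mol_name, mol.name)):
            continue
        for res in mol.res:
            if not (matches(res_num, res.num)
                    and matches(res_name, res.name)):
                continue
            for spin in res.spin:
                if (matches(atom_num, spin.num)
                        and matches(atom_name, spin.name)):
                    yield spin


# Toy data standing in for the mol/res/spin structure.
data = NS(mol=[NS(num=1, name='A', res=[
    NS(num=2, name='GLY', spin=[NS(num=1, name='N', select=True),
                                NS(num=2, name='CA', select=True)]),
    NS(num=15, name='LEU', spin=[NS(num=3, name='N', select=False)]),
])])

# Plain use in a for loop or comprehension.
all_n = [s.num for s in spin_loop(data, atom_name='N')]

# Extra selection added as a filter on top of the iterator.
selected_n = [s.num for s in spin_loop(data, atom_name='N') if s.select]
```

The caller handles 'spin.select' with an ordinary filter, so no callback function needs to be passed into the loop.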

  3.5  Molecule, sequence, and spin user function classes

  For the three levels of the new data model, currently only user
  functions relating to the sequence or residues exist.  These are all
  located in the 'sequence' user function class.  The idea would be to
  create three independent classes of user function: 'molecule',
  'sequence', and 'spin'.

  3.5.1  The 'molecule' user function class

  This user function class could contain functions such as:

  molecule.add()
  molecule.copy()    # Copy the molecule information (name and num) from
  another pipe.
  molecule.delete()
  molecule.info()   # Print the molecule info.
  molecule.sort()

  Other functions could be created to enable associations between the
  'data.mol[index]' data structure and the Scientific Python PDB data
  structure.  This will allow the 'vectors()' user function to correctly
  extract XH bond vectors from the PDB data structure.  The 'pdb' user
  function class could also be renamed to 'structure' to enable other 3D
  molecular structure files to be transparently supported by the
  'molecule' user functions (e.g. CIF).  The mapping of the structure to
  the molecule-residue-spin framework could be quite complex, especially
  if the Scientific PDB format is not the only format handled.  Would any
  one have ideas of how to map multiple molecules in the PDB file to the
  molecule name and number in the proposed molecule-residue-spin data
  model?

  NMR models in a PDB are easy to handle as these are treated as
  different structures by the Scientific Python PDB reader, each can be
  isolated by model number.  If a specific model is chosen, the model
  number could become 'data.mol[index].num'.  If the average of all models
  is chosen, then 'data.mol[index].num' could be None.  A number of other
  questions are:


As an aside, I wrote a parser for PDB files a while back that dealt
with multiple molecules with either chain IDs, segment IDs, or no IDs,
and I had to prioritise along the lines of segid > chain-id > nothing.
However, I had to add special cases for NMR structure files, and
actually had the idea of an ensemble of structures as well as a single
structure.


  How would you expect to handle chain IDs and segment IDs?  Should this
  be the 'mol_name'?


I would just give each molecule a name: use the segment ID if there is
one, otherwise the chain ID, and if there is no chain ID either, use a
generic name like mol_1, mol_2, etc.
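
The naming priority suggested here could be sketched as follows. The function name 'molecule_name' and the treatment of blank strings as missing IDs are assumptions made for illustration.

```python
def molecule_name(seg_id, chain_id, index):
    """Pick a name: segment ID first, then chain ID, then 'mol_<n>'.

    'index' is the zero-based position of the molecule in the file and is
    only used for the generic fallback name.
    """
    for candidate in (seg_id, chain_id):
        # PDB fields are often padded with blanks, so strip before testing.
        if candidate and candidate.strip():
            return candidate.strip()
    return "mol_%d" % (index + 1)


names = [
    molecule_name('SEGA', 'A', 0),   # segment ID wins
    molecule_name('    ', 'B', 1),   # blank segment ID, use the chain ID
    molecule_name(None, '', 2),      # nothing available, generic name
]
```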


  Are there other molecule identifiers in the PDB or other structure file
  formats (CIF, etc.)?

  How do you think a multi-chain molecule should be handled when a segment
  of the polypeptide or nucleotide chain has been removed (e.g. insulin)?

I am not following this one


  If two chains 'A' and 'B' are part of the same polypeptide or RNA
  transcript and have sequentially numbered residue numbers, should both
  chains be handled as the same molecule in data.mol?  Or should
  data.mol[0].name = 'A' and data.mol[1].name = 'B'?

I would call them a and b and start numbering residues at each break
of the chain ID. Numbering of all things in PDB files can be a
nightmare (they can even be negative and in the wrong order...), and
many programs just renumber residues and atoms in the order they are
read.

  Do you think that the molecule-residue-spin data model is sufficient to
  handle all standard variants of the PDB (or other 3D structure formats)?

  Would a molecule-chain-residue-spin data model be better?


No, definitely not! Trying to identify which chains belong to which
molecules is generally a hopeless operation for most of these files,
and what would it gain us?


  For identifying a molecule, are the data structures data.mol[0].name and
  data.mol[0].num sufficient?


Yes. Does the number need to be explicit? What happens if you delete a
molecule or add one? Are the molecule numbers increased monotonically
on reading and never reused, or should the number just be the number of
the molecule in the current ordering?


  Are there other data structures which could be placed in data.mol[0]?

  Which of these issues do you think should be handled by the user rather
  than internally by relax?  I would prefer that relax does most of the
  work.

  What other user functions do you think would be useful to add to the
  'molecule' class?

  And finally what other 3D structure files do you think should be
  supported?  These must have parsers available as Python modules.


I think we should support import/export via plugin classes (which can
be loaded using the same code as is used in the unit_test) and then
provide a PDB reader as a starting place. The plugins would be based
around an interface class, much in the same way that SAX is structured
(http://www.saxproject.org/).

both readers and writers would then be subclasses of the same class:


class structure_interface:
    def start_molecule(self, name, number):
        pass
    def end_molecule(self):
        pass
    def start_residue(self, type, number):
        pass
    def end_residue(self):
        pass
    def add_atom(self, number, name, x, y, z, b_factor=None):
        pass

consider a simple molecule

A.TYR.HA
A.TYR.CO
A.GAL.HA
A.GAL.CO

B.HIS.HA
B.HIS.CO


a reader would get called as follows

start_molecule('A',1)
start_residue('TYR',1)
add_atom(1,'HA',1.0,1.0,1.0)
add_atom(2,'CO',1.0,1.0,1.0)
end_residue()
start_residue('GAL',1)
add_atom(1,'HA',1.0,1.0,1.0)
add_atom(2,'CO',1.0,1.0,1.0)
end_residue()
end_molecule()

start_molecule('B',2)
start_residue('HIS',1)
add_atom(1,'HA',1.0,1.0,1.0)
add_atom(2, 'CO',1.0,1.0,1.0)
end_residue()
end_molecule()


writing is done in the inverse manner

This design produces a simple interface to work with and allows for
future additions such as ensembles to be catered for with relative
ease. It is simple to learn and code and is symmetric.

As the plugins would be classes which are identified at run time, they
could take extra arguments during instantiation to allow for, e.g.,
selection of chain-id in preference to segid.
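
To make the symmetry concrete, here is a minimal sketch of one handler that consumes the events and one function that emits them again. The method names follow the interface above; the 'Recorder' class, the 'replay' function, and the dictionary layout are hypothetical.

```python
class StructureInterface:
    """Event interface shared by readers and writers."""

    def start_molecule(self, name, number):
        pass

    def end_molecule(self):
        pass

    def start_residue(self, res_type, number):
        pass

    def end_residue(self):
        pass

    def add_atom(self, number, name, x, y, z, b_factor=None):
        pass


class Recorder(StructureInterface):
    """Hypothetical handler that stores the events as nested dictionaries."""

    def __init__(self):
        self.molecules = []

    def start_molecule(self, name, number):
        self.molecules.append({'name': name, 'num': number, 'res': []})

    def start_residue(self, res_type, number):
        self.molecules[-1]['res'].append(
            {'name': res_type, 'num': number, 'atoms': []})

    def add_atom(self, number, name, x, y, z, b_factor=None):
        self.molecules[-1]['res'][-1]['atoms'].append(
            (number, name, (x, y, z), b_factor))


def replay(handler, molecules):
    """Writing direction: walk stored data and emit the same events."""
    for mol in molecules:
        handler.start_molecule(mol['name'], mol['num'])
        for res in mol['res']:
            handler.start_residue(res['name'], res['num'])
            for number, name, (x, y, z), b_factor in res['atoms']:
                handler.add_atom(number, name, x, y, z, b_factor)
            handler.end_residue()
        handler.end_molecule()


# A reader would drive the handler like this (molecule B from the example).
rec = Recorder()
rec.start_molecule('B', 2)
rec.start_residue('HIS', 1)
rec.add_atom(1, 'HA', 1.0, 1.0, 1.0)
rec.add_atom(2, 'CO', 1.0, 1.0, 1.0)
rec.end_residue()
rec.end_molecule()

# Replaying into a second handler reproduces the same structure.
copy = Recorder()
replay(copy, rec.molecules)
```

A file writer would subclass the same interface and print a record in 'add_atom()', so 'replay()' could drive it unchanged.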

  3.5.2  The 'sequence' user function class

  This user function class could remain as is.  The user functions could
  be modified to include the arguments 'mol_num' and 'mol_name' so that
  they can be associated with certain molecules if required.

  3.5.3  The 'spin' user function class

  This new user function class could contain functions such as:

  spin.add()
  spin.copy()    # Copy the spin info (name and num) from another pipe.
  spin.delete()
  spin.display()
  spin.read()
  spin.sort()
  spin.write()

  These functions could be applied selectively using the 'mol_num',
  'mol_name', 'res_num', or 'res_name' arguments.

  3.6  The input and output files

I presume this refers to the relax run files...


  Up to 6 columns could be used to identify spin-specific data (for both
  input and output).  These could correspond to the six spin identifiers
  'mol_num', 'mol_name', 'res_num', 'res_name', 'atom_num', and
  'atom_name'.  If any of these are set to None for all spins, the column
  could be dropped.  For example if no molecule info exists, these two
  columns can be dropped.  If no residues exist, these can be dropped as
  well.


I would leave them in and give them null values; dropping things just
increases code complexity.

  For protein NH data of a single molecule, the data could appear
  as:

  res_num res_name    atom_num    atom_name   ...
  1       GLY         1           N           ...
  2       PRO         11          N           ...
  3       LEU         28          N           ...

  For RNA, the data could appear as:

  res_num res_name    atom_num    atom_name   ...
  1       G           23          N1          ...
  1       G           18          C8          ...
  2       U           38          N3          ...
  2       U           52          C5          ...
  2       U           46          C6          ...

  For a non-polymeric organic molecule, the data could appear as:

  atom_num    atom_name   ...
  1           C1          ...
  16          C16         ...
  23          C23         ...

  Are there any other standard ways of representing this data in a
  columnar format?  These formats may not be the best solution.
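
Both column policies discussed above (dropping identifier columns that are None for all spins, or always keeping all six) can be compared with a small sketch. The function name 'id_table' and the list-of-dictionaries input are assumptions made for illustration.

```python
COLUMNS = ('mol_num', 'mol_name', 'res_num', 'res_name',
           'atom_num', 'atom_name')


def id_table(rows, drop_empty=True):
    """Return a header row plus data rows as lists of strings.

    'rows' is a list of dictionaries keyed by the six identifiers.  With
    drop_empty=True, columns that are None for every spin are omitted;
    with drop_empty=False all six columns are kept and missing values
    are written as the string 'None'.
    """
    keep = [c for c in COLUMNS
            if not drop_empty or any(r.get(c) is not None for r in rows)]
    table = [list(keep)]
    for r in rows:
        table.append([str(r.get(c)) for c in keep])
    return table


# Protein NH data of a single molecule, as in the first example table.
rows = [
    {'res_num': 1, 'res_name': 'GLY', 'atom_num': 1, 'atom_name': 'N'},
    {'res_num': 2, 'res_name': 'PRO', 'atom_num': 11, 'atom_name': 'N'},
]

compact = id_table(rows)
full = id_table(rows, drop_empty=False)
```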
