Re: Redesign of the relax data model: 3. Molecules, residues, and spins -- October 11, 2006

On Wed, 2006-10-11 at 18:57 +1000, Edward d'Auvergne wrote:

3.5.1  The 'molecule' user function class

This user function class could contain functions such as:

molecule.add()
molecule.copy()    # Copy the molecule information (name and num) from
another pipe.
molecule.delete()
molecule.info()   # Print the molecule info.
molecule.sort()

Other functions could be created to enable associations between the
'data.mol[index]' data structure and the Scientific Python PDB data
structure.  This will allow the 'vectors()' user function to correctly
extract XH bond vectors from the PDB data structure.  The 'pdb' user
function class could also be renamed to 'structure' to enable other 3D
molecular structure files to be transparently supported by the
'molecule' user functions (e.g. CIF).  The mapping of the structure to
the molecule-residue-spin framework could be quite complex, especially
if the Scientific PDB format is not the only format handled.  Would any
one have ideas of how to map multiple molecules in the PDB file to the
molecule name and number in the proposed molecule-residue-spin data
model?  NMR models in a PDB are easy to handle as these are treated as
different structures by the Scientific Python PDB reader, each can be
isolated by model number.  If a specific model is chosen, the model
number could become 'data.mol[index].num'.  If the average of all models
is chosen, then 'data.mol[index].num' could be None.


One question worth keeping in mind is whether we want to stick with
Scientific for our structure handling. Although it works well,
Scientific is currently tying relax to Numeric. The conversion to Numpy
is trivial and has some performance benefits (based on my limited
tests). Other Python PDB handlers exist, though I have no experience
with them. Even starting from scratch shouldn't be too much of a
problem, despite the widespread abuse the PDB format suffers. 

With that question in the back of the mind, it seems to me that there
are two extreme approaches we might consider in terms of the mechanics
of the molecule/residue/spin disection. Either we take the PDB file as
read, using the PDB definitions of structures and chains to define our
molecules, and the PDB definitions of residues to define our residues.
In this case, if the user needs flexibility in this regard, they achieve
it by hacking the PDB file appropriately. Clearly the behaviour of our
PDB handler will be critical in this case. The other extreme is to just
read in atoms (spins) and their coordinates from the PDB file, and let
the user define their own molecule/residue definitions inside relax with
a series of commands like:

residue.create(res_name, res_num, atoms)
molecule.create(mol_name, mol_num, residues)

(where atoms and residues are lists of the components to include in the
new residue or molecule). Obviously this emphasisies flexibility for the
user at the cost of lengthy molecular setup scripts. In reality we would
want a balance of the two approaches - the first approach should work
very well for basic protein and nucleic acid work, but will be a pain
for more exotic molecules, or if the user wants to treat their sugars as
different residues from their nucleobases, or their methyls as different
residues from their protein backbone. The question is how common will be
the need for that flexibility?

My feeling is that the best balance is as follows:
Every chain in every structure in the PDB is a new molecule by default.
Residues are as defined by the PDB by default.
Names and numbers for molecules, residues, atoms come from the PDB by
default. 
Full flexibility is provided to the brave user with commands like:
    residue.create()
    residue.delete()
    residue.merge()
    residue.rename()
    residue.renumber()????
and similar for molecules and atoms

It might be that some common modifications to the default behaviour are
achieved by specific arguments to the pdb.read() user command.

  A number of other
questions are:

How would you expect to handle chain IDs and segment IDs?  Should this
be the 'mol_name'?

Are there other molecule identifiers in the PDB or other structure file
formats (CIF, etc.)?

How do you think a multi-chain molecule should be handled when a segment
of the polypeptide or nucleotide chain has been removed (e.g. insulin)?

If two chains 'A' and 'B' are part of the same polypeptide or RNA
transcript and have sequentially numbered residue numbers, should both
chains be handled as the same molecule in data.mol?  Or should
data.mol[0].name = 'A' and data.mol[1].name = 'B'?

Do you think that the molecule-residue-spin data model is sufficient to
handle all standard variants of the PDB (or other 3D structure formats)?


I don't think this is the issue. The relax data model should be just
complex enough to handle the requirements of relaxation analysis in
diverse macromolecules, and not more complex. We can always massage a
molecular structure into the molecule-residue-spin data model (or any
other we deside is useful), whatever its source.

Would a molecule-chain-residue-spin data model be better?


I don't see any benefit for the added complexity. All separate chains
are separate molecules, unless the user chooses to merge them.

For identifying a molecule, are the data structures data.mol[0].name and
data.mol[0].num sufficient?

Are there other data structures which could be placed in data.mol[0]?

Which of these issues do you think should be handled by the user rather
than internally by relax?  I would prefer that relax does most of the
work.


In priciple I agree, with the proviso that adequate flexibility is
always availible. The user can't exercise flexibility without doing some
work. The issue is to ensure that the most common applications of relax
are easy, while ensuring that the more exotic requirements of a few are
available if they want to do the work.


What other user functions do you think would be useful to add to the
'molecule' class?

And finally what other 3D structure files do you think should be
supported?  These must have parsers available as Python modules.

Re: Redesign of the relax data model: 3. Molecules, residues, and spins

Header

Content

Related Messages