On Wed, 2006-10-11 at 18:57 +1000, Edward d'Auvergne wrote:
3.5.1 The 'molecule' user function class

This user function class could contain functions such as:

    molecule.add()
    molecule.copy()    # Copy the molecule information (name and num) from another pipe.
    molecule.delete()
    molecule.info()    # Print the molecule info.
    molecule.sort()

Other functions could be created to enable associations between the 'data.mol[index]' data structure and the Scientific Python PDB data structure. This will allow the 'vectors()' user function to correctly extract XH bond vectors from the PDB data structure. The 'pdb' user function class could also be renamed to 'structure' to enable other 3D molecular structure files to be transparently supported by the 'molecule' user functions (e.g. CIF).

The mapping of the structure to the molecule-residue-spin framework could be quite complex, especially if the Scientific PDB format is not the only format handled. Would anyone have ideas of how to map multiple molecules in the PDB file to the molecule name and number in the proposed molecule-residue-spin data model?

NMR models in a PDB are easy to handle as these are treated as different structures by the Scientific Python PDB reader, and each can be isolated by model number. If a specific model is chosen, the model number could become 'data.mol[index].num'. If the average of all models is chosen, then 'data.mol[index].num' could be None.
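To make the proposal a little more concrete, here is a minimal sketch of what the container behind 'data.mol' might look like. Everything in it is an assumption for illustration only: the class names Molecule and MoleculeList, the method signatures and the sort order are hypothetical and not part of relax. The 'num' field follows the convention suggested above (a model number, or None for the average of all models).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Molecule:
    """Hypothetical stand-in for one element of data.mol."""
    name: Optional[str] = None
    num: Optional[int] = None   # model number, or None for the average of all models
    res: list = field(default_factory=list)


class MoleculeList(list):
    """Hypothetical container mimicking molecule.add(), delete(), info() and sort()."""

    def add(self, name=None, num=None):
        # Append a new, empty molecule and return it.
        mol = Molecule(name=name, num=num)
        self.append(mol)
        return mol

    def delete(self, name=None, num=None):
        # Remove every molecule matching both name and num.
        self[:] = [m for m in self if not (m.name == name and m.num == num)]

    def sort_by_num(self):
        # Molecules with num=None (averaged models) are placed first.
        self.sort(key=lambda m: (m.num is not None, m.num or 0, m.name or ""))

    def info(self):
        # Return one line of text per molecule rather than printing directly.
        return ["%s %s" % (m.name, m.num) for m in self]


mol = MoleculeList()
mol.add("A", 2)
mol.add("A", None)
mol.add("A", 1)
mol.sort_by_num()
```

A molecule.copy() between pipes is left out because it depends on how pipes are stored, which is not settled here.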
One question worth keeping in mind is whether we want to stick with Scientific for our structure handling. Although it works well, Scientific is currently tying relax to Numeric. The conversion to NumPy is trivial and has some performance benefits (based on my limited tests). Other Python PDB handlers exist, though I have no experience with them. Even starting from scratch shouldn't be too much of a problem, despite the widespread abuse the PDB format suffers.

With that question in the back of our minds, it seems to me that there are two extreme approaches we might consider for the mechanics of the molecule/residue/spin dissection.

The first is to take the PDB file as read, using the PDB definitions of structures and chains to define our molecules, and the PDB definitions of residues to define our residues. In this case, if the user needs flexibility, they achieve it by hacking the PDB file appropriately. Clearly the behaviour of our PDB handler will be critical in this case.

The other extreme is to just read in atoms (spins) and their coordinates from the PDB file, and let the user define their own molecule/residue definitions inside relax with a series of commands like:

    residue.create(res_name, res_num, atoms)
    molecule.create(mol_name, mol_num, residues)

(where atoms and residues are lists of the components to include in the new residue or molecule). Obviously this emphasises flexibility for the user at the cost of lengthy molecular setup scripts.

In reality we would want a balance of the two approaches - the first approach should work very well for basic protein and nucleic acid work, but will be a pain for more exotic molecules, or if the user wants to treat their sugars as different residues from their nucleobases, or their methyls as different residues from their protein backbone. The question is how common the need for that flexibility will be.
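As a rough illustration of the second extreme, the two commands could be sketched as plain functions that build nested dictionaries. The function names residue_create and molecule_create, the dictionary keys and the residue names below are all hypothetical, chosen only to mirror the argument lists given above.

```python
def residue_create(res_name, res_num, atoms):
    """Hypothetical residue.create(): group a list of atom (spin) names."""
    return {"name": res_name, "num": res_num, "spins": list(atoms)}


def molecule_create(mol_name, mol_num, residues):
    """Hypothetical molecule.create(): group a list of residues."""
    return {"name": mol_name, "num": mol_num, "res": list(residues)}


# Example: treat the sugar and the nucleobase of one nucleotide as
# separate residues, as in the 'exotic' case described above.
sugar = residue_create("RIB", 1, ["C1'", "H1'"])
base = residue_create("URA", 2, ["C6", "H6"])
rna = molecule_create("RNA", 1, [sugar, base])
```

Even this tiny case needs three calls, which gives a feel for how long a full setup script would become if nothing were taken from the PDB file by default.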
My feeling is that the best balance is as follows:

    Every chain in every structure in the PDB is a new molecule by default.
    Residues are as defined by the PDB by default.
    Names and numbers for molecules, residues, atoms come from the PDB by default.

Full flexibility is provided to the brave user with commands like:

    residue.create()
    residue.delete()
    residue.merge()
    residue.rename()
    residue.renumber()????

and similar for molecules and atoms.

It might be that some common modifications to the default behaviour are achieved by specific arguments to the pdb.read() user command.
A number of other questions are: How would you expect to handle chain IDs and segment IDs? Should this be the 'mol_name'? Are there other molecule identifiers in the PDB or other structure file formats (CIF, etc.)? How do you think a multi-chain molecule should be handled when a segment of the polypeptide or nucleotide chain has been removed (e.g. insulin)? If two chains 'A' and 'B' are part of the same polypeptide or RNA transcript and have sequentially numbered residue numbers, should both chains be handled as the same molecule in data.mol? Or should data.mol[0].name = 'A' and data.mol[1].name = 'B'? Do you think that the molecule-residue-spin data model is sufficient to handle all standard variants of the PDB (or other 3D structure formats)?
I don't think this is the issue. The relax data model should be just complex enough to handle the requirements of relaxation analysis in diverse macromolecules, and no more complex. We can always massage a molecular structure into the molecule-residue-spin data model (or any other we decide is useful), whatever its source.
Would a molecule-chain-residue-spin data model be better?
I don't see any benefit for the added complexity. All separate chains are separate molecules, unless the user chooses to merge them.
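For the case where the user does want two chains treated as one molecule (two sequentially numbered chains 'A' and 'B', say), a merge could look something like the sketch below. The function name molecule_merge, the dictionary layout and the choice to reject clashing residue numbers are all assumptions for illustration.

```python
def molecule_merge(mols, names, new_name):
    """Hypothetical molecule.merge(): combine the named molecules into one.

    Residues keep their original numbers, so the merge is refused if two of
    the merged molecules share a residue number.
    """
    merged = {"name": new_name, "num": None, "res": []}
    kept = []
    for mol in mols:
        if mol["name"] in names:
            merged["res"].extend(mol["res"])
        else:
            kept.append(mol)
    nums = [res["num"] for res in merged["res"]]
    if len(nums) != len(set(nums)):
        raise ValueError("residue numbers clash between the merged molecules")
    return kept + [merged]


chain_a = {"name": "A", "num": None, "res": [{"num": 1}, {"num": 2}]}
chain_b = {"name": "B", "num": None, "res": [{"num": 3}, {"num": 4}]}
merged_mols = molecule_merge([chain_a, chain_b], ["A", "B"], "AB")
```

With the default left as one molecule per chain, this kind of merge is the only extra step needed for the multi-chain case, and no chain level is required in the data model.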
For identifying a molecule, are the data structures data.mol[0].name and data.mol[0].num sufficient? Are there other data structures which could be placed in data.mol[0]? Which of these issues do you think should be handled by the user rather than internally by relax? I would prefer that relax does most of the work.
In principle I agree, with the proviso that adequate flexibility is always available. The user can't exercise flexibility without doing some work. The issue is to ensure that the most common applications of relax are easy, while ensuring that the more exotic requirements of a few are available if they want to do the work.
What other user functions do you think would be useful to add to the 'molecule' class? And finally what other 3D structure files do you think should be supported? These must have parsers available as Python modules.