On Wed, 2006-10-11 at 17:02 +1000, Edward d'Auvergne wrote:
This post is proposal for the redesign the relax data model. This will affect how data is input into the program, how data is selected, how molecular structures are handled, how spin systems are handled, and how many other parts of relax function. Importantly the internal structure of 'self.relax.data' will completely change. These modifications will essentially break every part of relax (the isolated code in the directories 'minimise', 'maths_fns', and 'docs' will be safe from the carnage, as will a few files in the base directory). If you have any ideas for extending or improving the proposed data model, can see any short-comings, deficiencies, or flaws, are familiar with the PDB conventions, etc., your input is very much sought after. The changes should occur in the 1.3 line of the repository. 1.2 versions will be unaffected - scripts will remain compatible and the 1.2 line will continue to be supported with bug fixes, etc. I have to apologise in advance for the size of this proposal, to simplify it I have divided the text into numbered sections. Once this initial parent message has been sent I will respond to it with the text of the 4 major sections. This will allow 4 major threads to branch off from this message on the mailing list archive (https://mail.gna.org/public/relax-devel). If you have an opinion, idea, etc. about a specific section, could you please post a separate message in response to the relevant major section post? Also if you have unrelated ideas for one of these sections, could you post these as separate messages as well? For example if you have separate points about sections 3.1 and 3.5.1, two different posts responding to the parent Section 3 post would be appreciated. Thanks. This will help to focus each discussion point into specific threads. Edward Redesign of the relax data model Index: 1. Why change? 1.1 The runs 1.2 The molecules 1.3 The residues 1.4 The spins 2. A new run concept 2.1 Parcelling up an abstract space 2.2 The run data model 2.3 The pipe concept 3. Molecules, residues, and spins 3.1 The spin data model 3.2 The data selection concept - identifying spin systems 3.2.1 Function arguments 3.2.2 NH data of a single protein macromolecule 3.2.3 A single organic molecule (non-polymeric) 3.2.4 A single RNA or DNA macromolecule 3.2.5 Complexes 3.3 Regular expression 3.4 The spin loop 3.5 Molecule, sequence, and spin user function classes 3.5.1 The 'molecule' user function class 3.5.2 The 'sequence' user function class 3.5.3 The 'spin' user function class 3.6 The input and output files 4. Conclusion
Before reading this post, please read the previous posts: * The parent message 'Redesign of the relax data model: A HOWTO for breaking relax.' located at https://mail.gna.org/public/relax-devel/2006-10/msg00053.html (Message-id: <1160550133.9523.54.camel@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>). * Section 1 'Redesign of the relax data model: 1. Why change?' located at https://mail.gna.org/public/relax-devel/2006-10/msg00054.html (Message-id: <1160551172.9523.60.camel@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>). * Section 2 'Redesign of the relax data model: 2. A new run concept' located at https://mail.gna.org/public/relax-devel/2006-10/msg00056.html (Message-id: <1160555137.9523.70.camel@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>). 3. Molecules, residues, and spins 3.1 The spin data model For model-free analysis, etc., it is the NMR relaxation of individual spin systems (or atoms) which is most important. The residue number and molecule that the spin belongs to is of secondary importance. A three level system could be used to access and categorise the spins. In letting data = self.relax.data[self.relax.run], the first level is the molecule: data.mol[0] The structure 'mol' is an array, each element being a container for molecule specific data. This allows for multiple molecules if necessary. The second level is the residue: data.mol[0].res[0] This 'res' data structure is an array, each element being a container for residue specific data. This allows for multiple residues per molecule (although this is not essential). The third and last level is the spin or atom: data.mol[0].res[0].spin[0] Again the 'spin' data structure is an array, each element being a container for spin specific data. This allows for multiple spins per residue. For example, the optimised chi-squared value of the 2nd spin of the 56th residue of the 3rd molecule would be stored at 'data.mol[2].res[55].spin[1].chi2'. The residue number would be stored at 'data.mol[2].res[55].num'. The molecule name would be stored at 'data.mol[2].name'. 3.2 The data selection concept - identifying spin systems 3.2.1 Function arguments The current way that spins are identified in the user functions (as well as internal relax functions) is through the residue number and/or residue name. There is no formal or consistent way that this is done though. Some arguments are called 'res_num' while others are just 'num'. The proposal is to standardise the interface and create the file called 'generic_fns/spin_selector.py'. Using the three-level spin data model introduced in section 3.1, six identifiers are possible. These are: Molecule number, 'data.mol[0].num' (e.g. the NMR model number). Molecule name, 'data.mol[0].name' (e.g. the chain or segment ID). Residue number, 'data.mol[0].res[0].num'. Residue name, 'data.mol[0].res[0].name'. Atom number, 'data.mol[0].res[0].spin[0].num' (e.g. the PDB atom number). Atom name, 'data.mol[0].res[0].spin[0].name' (e.g. the PDB atom name). These could be synonymous with the spin identifying function arguments 'mol_num', 'mol_name', 'res_num', 'res_name', 'atom_num', and 'atom_name'. These would all default to the inactive value of None and would be the very last arguments of the relevant user functions. Are there other ways that a spin or set of spins be identified? 3.2.2 NH data of a single protein macromolecule The operation of relax would remain essentially the same for those studying NH relaxation data of single molecule protein systems. The residues can be individually selected using the 'res_num' and/or 'res_name' arguments. In this case the molecule number and name are left as None, hence this will default to 'data.mol[0]'. As there is only one spin per residue, the spin number and name can be left as None, hence defaulting to 'data.mol[0].res[index].spin[0]'. The average user need not know about these default data structures, this information will essentially be invisible. However power users are free to manipulate this data structure. 3.2.3 A single organic molecule (non-polymeric) For a single non-polymeric organic molecule, only the 'atom_num' and 'atom_name' arguments need be used. The molecule and residue will, invisibly from the perspective of the user, default to 'data.mol[0]' and 'data.mol[0].res[0]' respectively. 3.2.4 A single RNA or DNA macromolecule The four arguments 'res_num', 'res_name', 'atom_num', and 'atom_name' can be used to identify the residue and the different spins of those residues. Individual spins can be selected using all four arguments. Like spins of all residues (e.g. N3 data) can be selected using solely the atom specific arguments, whereas all data of an individual residue can be selected using the residue specific arguments. This approach can also be used when both NH and CA data of a single protein macromolecule have been collected. 3.2.5 Complexes The individual molecules in a complex can be selected using the molecular arguments 'mol_num' and 'mol_name'. 3.3 Regular expression The six identifiers 'mol_num', 'mol_name', 'res_num', 'res_name', 'atom_num', and 'atom_name' will all be allowed to be Python regular expression strings (the number arguments can be integers and the names simple strings). This allows for the selection of ranges of residues, multiple residue types at the same time, etc. For example "res_name='[UG]'" when working with RNA will select both uracil and guanine. The user supplied regular expression for all six identifiers will need to be tested for validity. This could be done with the function 'self.relax.generic_fns.spin_selector.validate()' using try statements together with the 'compile' function from the 're' module. An example of this testing is at the start of the 'self.sel_res()' function in the file 'generic_fns/selection.py'. 3.4 The spin loop Many parts of relax require looping over all the relaxation data (or spins). The implementation of this proposal will require nested looping over all molecules, all residues, and all spins combined with tests for matches to the 'mol_num', 'mol_name', 'res_num', 'res_name', 'atom_num', and 'atom_name' arguments. Rather than implementing this numerous times throughout the program, the loop could be implemented just once within the function 'self.relax.generic_fns.spin_selector.spin_loop()'. In addition to the six identifiers, this new function could except as an argument a spin-specific function passed by the part of the code requesting the loop. The 'spin_loop()' function will then pass the data structure 'spin', which is for example an alias to 'self.relax.data.mol[0].res[16].spin[3]', to the spin-specific function. A sample implementation of the loop function could be: def spin_loop(fn=None, mol_num=None, mol_name=None, res_num=None, res_name=None, atom_num=None, atom_name=None): """Function for selectively looping over all spins.""" # Reassign the data container. data = self.relax.data[self.relax.run] # Loop over the molecules. for mol in data.mol: # Skip the molecule if there is no match to 'mol_num'. if type(mol_num) == int and not mol.num == mol_num: continue elif type(mol_num) == str and not match(mol_num, `mol.num`): continue # Skip the molecule if there is no match to 'mol_name'. if mol_name != None and not match(mol_name, `mol.name`): continue # Loop over the residues. for res in mol.res: # Skip the residue if there is no match to 'res_num'. if type(res_num) == int and not res.num == res_num: continue elif type(res_num) == str and not match(res_num, `res.num`): continue # Skip the residue if there is no match to 'res_name'. if res_name != None and not match(res_name, `res.name`): continue # Loop over the spins. for spin in res.spin: # Skip the spin if there is no match to 'atom_num'. if type(atom_num) == int and not spin.num == atom_num: continue elif type(atom_num) == str and not match(atom_num, `spin.num`): continue # Skip the spin if there is no match to 'atom_name'. if atom_name != None and not match(atom_name, `spin.name`): continue # Execute the supplied spin-specific function, passing in the data for the current spin. fn(spin) It will be up to the spin-specific function passed in by the calling function to handle the 'spin.select' value. Because of the complexity of the loop, the use of this single 'spin_loop()' function will simplify the relax code base, will minimise potential bugs, and will simplify future changes to the relax data model (if necessary). 3.5 Molecule, sequence, and spin user function classes For the three levels of the new data model, currently only user functions relating to the sequence or residues exist. These are all located in the 'sequence' user function class. The idea would be to create three independent classes of user function: 'molecule', 'sequence', and 'spin'. 3.5.1 The 'molecule' user function class This user function class could contain functions such as: molecule.add() molecule.copy() # Copy the molecule information (name and num) from another pipe. molecule.delete() molecule.info() # Print the molecule info. molecule.sort() Other functions could be created to enable associations between the 'data.mol[index]' data structure and the Scientific Python PDB data structure. This will allow the 'vectors()' user function to correctly extract XH bond vectors from the PDB data structure. The 'pdb' user function class could also be renamed to 'structure' to enable other 3D molecular structure files to be transparently supported by the 'molecule' user functions (e.g. CIF). The mapping of the structure to the molecule-residue-spin framework could be quite complex, especially if the Scientific PDB format is not the only format handled. Would any one have ideas of how to map multiple molecules in the PDB file to the molecule name and number in the proposed molecule-residue-spin data model? NMR models in a PDB are easy to handle as these are treated as different structures by the Scientific Python PDB reader, each can be isolated by model number. If a specific model is chosen, the model number could become 'data.mol[index].num'. If the average of all models is chosen, then 'data.mol[index].num' could be None. A number of other questions are: How would you expect to handle chain IDs and segment IDs? Should this be the 'mol_name'? Are there other molecule identifiers in the PDB or other structure file formats (CIF, etc.)? How do you think a multi-chain molecule should be handled when a segment of the polypeptide or nucleotide chain has been removed (e.g. insulin)? If two chains 'A' and 'B' are part of the same polypeptide or RNA transcript and have sequentially numbered residue numbers, should both chains be handled as the same molecule in data.mol? Or should data.mol[0].name = 'A' and data.mol[1].name = 'B'? Do you think that the molecule-residue-spin data model is sufficient to handle all standard variants of the PDB (or other 3D structure formats)? Would a molecule-chain-residue-spin data model be better? For identifying a molecule, are the data structures data.mol[0].name and data.mol[0].num sufficient? Are there other data structures which could be placed in data.mol[0]? Which of these issues do you think should be handled by the user rather than internally by relax? I would prefer that relax does most of the work. What other user functions do you think would be useful to add to the 'molecule' class? And finally what other 3D structure files do you think should be supported? These must have parsers available as Python modules. 3.5.2 The 'sequence' user function class This user function class could remain as is. The user functions could be modified to include the arguments 'mol_num' and 'mol_name' so that they can be associated with certain molecules if required. 3.5.3 The 'spin' user function class This new user function class could contain functions such as: spin.add() spin.copy() # Copy the spin info (name and num) from another pipe. spin.delete() spin.display() spin.read() spin.sort() spin.write() These functions could be applied selectively using the 'mol_num', 'mol_name', 'res_num', or 'res_name' arguments. 3.6 The input and output files Up to 6 columns could be used to identify spin-specific data (for both input and output). These could correspond to the six spin identifiers 'mol_num', 'mol_name', 'res_num', 'res_name', 'atom_num', and 'atom_name'. If any of these are set to None for all spins, the column could be dropped. For example if no molecule info exists, these two columns can be dropped. If no residues exist, these can be dropped as well. For protein NH data of a single molecule, the data could appear as: res_num res_name atom_num atom_name ... 1 GLY 1 N ... 2 PRO 11 N ... 3 LEU 28 N ... For RNA, the data could appear as: res_num res_name atom_num atom_name ... 1 G 23 N1 ... 1 G 18 C8 ... 2 U 38 N3 ... 2 U 52 C5 ... 2 U 46 C6 ... For a non-polymeric organic molecule, the data could appear as: atom_num atom_name ... 1 C1 ... 16 C16 ... 23 C23 ... Are there any other standard ways of representing this data in a columnar format? These formats may not be the best solution.