On Wed, 2006-10-11 at 17:02 +1000, Edward d'Auvergne wrote:
This post is a proposal for the redesign of the relax data model. This will affect how data is input into the program, how data is selected, how molecular structures are handled, how spin systems are handled, and how many other parts of relax function. Importantly, the internal structure of 'self.relax.data' will completely change. These modifications will essentially break every part of relax (the isolated code in the directories 'minimise', 'maths_fns', and 'docs' will be safe from the carnage, as will a few files in the base directory). If you have any ideas for extending or improving the proposed data model, can see any shortcomings, deficiencies, or flaws, are familiar with the PDB conventions, etc., your input is very much sought after.

The changes should occur in the 1.3 line of the repository. The 1.2 versions will be unaffected: scripts will remain compatible and the 1.2 line will continue to be supported with bug fixes, etc.

I have to apologise in advance for the size of this proposal. To simplify it I have divided the text into numbered sections. Once this initial parent message has been sent I will respond to it with the text of the 4 major sections. This will allow 4 major threads to branch off from this message on the mailing list archive (https://mail.gna.org/public/relax-devel).

If you have an opinion, idea, etc. about a specific section, could you please post a separate message in response to the relevant major section post? Also, if you have unrelated ideas for one of these sections, could you post these as separate messages as well? For example, if you have separate points about sections 3.1 and 3.5.1, two different posts responding to the parent Section 3 post would be appreciated. This will help to focus each discussion point into specific threads. Thanks.

Edward


Redesign of the relax data model

Index:

1. Why change?
    1.1 The runs
    1.2 The molecules
    1.3 The residues
    1.4 The spins
2. A new run concept
    2.1 Parcelling up an abstract space
    2.2 The run data model
    2.3 The pipe concept
3. Molecules, residues, and spins
    3.1 The spin data model
    3.2 The data selection concept - identifying spin systems
        3.2.1 Function arguments
        3.2.2 NH data of a single protein macromolecule
        3.2.3 A single organic molecule (non-polymeric)
        3.2.4 A single RNA or DNA macromolecule
        3.2.5 Complexes
    3.3 Regular expression
    3.4 The spin loop
    3.5 Molecule, sequence, and spin user function classes
        3.5.1 The 'molecule' user function class
        3.5.2 The 'sequence' user function class
        3.5.3 The 'spin' user function class
    3.6 The input and output files
4. Conclusion
Before reading this post, please read the parent message 'Redesign of the relax data model: A HOWTO for breaking relax.' located at https://mail.gna.org/public/relax-devel/2006-10/msg00053.html (Message-id: <1160550133.9523.54.camel@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>).


1. Why change?

It is becoming apparent that the capability and flexibility of the current relax data model are very limited. For example, posts to both the relax-users (https://mail.gna.org/public/relax-users) and relax-devel (https://mail.gna.org/public/relax-devel) mailing lists by Alex Hansen demonstrate that the data model is not flexible enough to handle relaxation data from RNA or DNA. While handling the RNA data in the 1.2 relax versions is possible, the required script would be far too complex.

I have identified four major categories where I believe that the relax data model is not flexible or elegant enough: the runs, the molecules, the residues, and the spin systems (or atoms). If you can think of any other logical categories, please say so. As the planned changes are so disruptive, fine tuning and perfecting the design now is very important.


1.1 The runs

The way runs are currently handled in relax is far from optimal. The history of the run concept is linked to the history of the program. relax was originally designed as a small program for the creation of input files for Art Palmer's Modelfree program, for reading the output files, and as an implementation of the model selection data analysis step. While relax still does all of these things (with Dasha now supported as well), it has evolved such that it implements all of the steps of the data analysis chain and can be used as a complete replacement.

The origin of the run was the Modelfree input and output file handling function, where each model-free model optimised was called a run. When the prompt and script based interfaces were implemented and user functions first created, no user functions were associated with a 'run'.
As relax has evolved, the percentage of user functions associated with the run concept has gradually risen. Now almost every single user function requires a run argument.

The problem with the current run concept is that, because you must supply the run name to each user function, the run string needs to be passed to all relevant functions in the program. For each user function called this results in the string bouncing around many functions, each time being placed into 'self.run'. This dense branching effect means that the text 'self.run' is extremely widespread within the relax code base. For example, by grepping the source code, the number of lines containing the text 'self.run' in relax version 1.2.7 is 1645!

This branching approach mandates knowing which functions require the run argument, passing the string to each of those functions, and then making sure that the run argument is placed into 'self.run' so that the previous run is not accidentally accessed. The current branching run concept is an important and constant source of bugs, and it makes the source code unnecessarily complex.


1.2 The molecules

In the 1.2 relax versions only a single molecule is supported. This molecule must also be the first molecule in the supplied PDB file. relax cannot currently handle molecular complexes. For example, in a PDB file of an RNA/protein complex there is no way of specifying whether you are studying the RNA or the protein.


1.3 The residues

Currently relax assumes that you are studying a polymer system in which the data can be characterised by residue number (and name). However, if you are studying a molecule that isn't a polymer, there are no residues. A 'residue number' and 'residue name' could be invented for the system but, as the residue concept is entrenched in relax, there may be problems with this approach.


1.4 The spins

relax also assumes that you only have a single relaxation data set per residue.
For studying the backbone NH relaxation of a protein this isn't a problem. However, in RNA, DNA, and non-polymeric molecules, there are likely to be multiple spins studied per 'residue'. The data model should be modified to allow the analysis of these types of molecules. An additional benefit when studying a protein is that you would then be able to study the backbone NH data simultaneously with CA data (together with any other data sets you may have collected).
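
To make the limitation described in section 1.4 concrete, here is a minimal Python sketch. It is purely illustrative: the containers 'per_residue' and 'per_spin', the helper 'spins_of_residue', and the numerical values are all hypothetical placeholders and are not the actual relax data structures (neither the current ones nor the proposed ones). It only shows why a container keyed by residue number alone can hold one data set per residue, whereas a key that also includes the spin name can hold several.

```python
# Hypothetical sketch only: these containers are NOT relax's real data structures,
# and the R1 values are arbitrary placeholders, not measured data.

# One data set per residue: the residue number is the only key.
per_residue = {}
per_residue[10] = {"res_name": "GLY", "r1": 1.5}   # backbone NH data
per_residue[10] = {"res_name": "GLY", "r1": 1.1}   # CA data replaces the NH data

# One data set per spin: the key carries the residue number and the spin name.
per_spin = {}
per_spin[(10, "N")] = {"res_name": "GLY", "r1": 1.5}
per_spin[(10, "CA")] = {"res_name": "GLY", "r1": 1.1}


def spins_of_residue(data, res_num):
    """Return the sorted spin names stored for the given residue number."""
    return sorted(spin for (num, spin) in data if num == res_num)


print(len(per_residue))               # a single entry survives for residue 10
print(spins_of_residue(per_spin, 10)) # both spins of residue 10 are kept
```

With the residue-only key the second assignment silently replaces the first, which is the situation faced when NH and CA data are collected for the same residue. The tuple key is just one possible way of avoiding this; the data model actually proposed is described in the later sections.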