Hi, This sounds like an awesome plan! Because of the breadth of ideas you propose, this email is forcibly long. But please carefully read all of the details. Some of things you mention are trivial and some actually already function, others would require a little planning. The relax infrastructure is designed to be extremely flexible and modular, so the amount of work involved, as you mentioned, would not be too great. I think some would be trivial to have done in time for the paper while others could and should be done later. However there really needs to be someone pushing these ideas - i.e. possibly someone from your side wherein the interest lies in an internal project and hence a publication (an application paper). Someone who would be willing to join as a relax developer and to push the boundaries. Someone who would not mind intensive code review (I check every line of code that goes into relax, sometimes the other relax developers as well), solid pre-planning on the mailing lists, and working with code in a group environment (rather than the standard work alone in a shielded environment). For example an idea might be rejected if there is a better, more flexible way of doing it. For the good of the whole project, the simplest way is not always the best. And especially if the idea would break other parts of relax, then it will not be acceptable. Note that, when needed, I will point new developers in the correct direction and often lay out the steps needed for implementing a feature - what file needs to be changed, how it should be changed, and the order of the changes. This is needed to be able to understand such a large, established code base. This will require cooperation - I cannot give the time to implement all of these ideas myself. I would be happy to see someone from your side take charge and push those features you desire, now and in to the future. I think that each situation needs to be taken on a case by case basis and that absolutely all details of the problem needs to be laid out. The reason is because the code cannot be written to accept all possibilities - this will make the optimisation far too slow. And there are far too many combinations possible, and we cannot predict future models and data types. There is a lot of flexibility what can be done in relax, but there are boundaries. Some combinations will make no sense, while others might appear to function but without test data and a system test, the results might be false and meaningless. So in each specific case, either a synthetic or real data set needs to exist (ideally both data set types would be good to have). The real data can be obtained from published papers by email, as was done with Dr. Flemming Hansen's data from http://dx.doi.org/10.1021/jp074793o now located in the test_suite/shared_data/dispersion/Hansen directory. Then, before any coding is done, a relax script is written to perform the full analysis on this test data. Or at least how we think the analysis should be performed. This allows the deficiencies to be discovered. The script is then copied into test_suite/system_tests/scripts/relax_disp/ and the data into test_suite/shared_data/dispersion/. It is then called a system test. The test also checks that the result is as you expect. For the rest of your ideas, I have answered below:
Yes, relax could become one of the most complete programs, and this would be a strong point for it. Concerning features that would make relax really unique, I have a suggestion.
It already is the most complete program! However it does not have 100% of the capabilities of all dispersion software combined, though it shouldn't be hard to reach that point. It also has access to better optimisation (for the dispersion problem) and other algorithms that the other software does not have. And it has a very simple GUI for non-power users. The question is how far do we, or you, want to push it. For example there are little features such as using matlibplot which, if Troels doesn't get around to converting his code first, I can help direct someone on your side to copy your Python code to fit into relax. If there is someone who is accepted as a relax developer from your side, you are free to take relax where you wish.
It would be great if relax allowed to perform simultaneous fits of multiple data sets. For example, in a number of cases people may have CPMG relaxation dispersion on methyl 13C, and backbone (15N or CA).
Actually, relax will already automatically handle this. In relax you work with spin systems (or a structure called interatomic data containers for data such as NOEs, J-couplings, RDCs, etc). So if the data handling, equations, and models are the same for all the spins with relaxation data, then this will work now. You do have to specify clusters manually though. For the methyl, I'm assuming CD2H data, otherwise relax will not handle the complex interference pathways.
Likewise, fitting SQ and MQ at the same time would be another instance for such simultaneous fits.
This might be more effort. The way of handling the relaxation data inside a spin container (the relax data store structure for spins) would need to be modified. Currently there is the assumption of only one data set per spin. But, with planning, this would not be too hard to handle. It would be a similar concept to how multiple field strength data is labelled. This would absolutely require test data. Also note that the MQ dispersion models are currently not supported by relax, but you probably already know how trivial this would be to add. Especially considering that almost all the SQ CPMG and R1rho analytic and numeric models are already functional, or will soon be.
One might also think about situations where R1rho and CPMG dispersions might be fit at the same time (although this is a bit more special, as it requires that the time scale can be detected by both methods).
Again, this would require test data. An planning how you would do this. Currently in relax you specify the experiment type at the start. This would need to be changed, and maybe in the same way as labelling the relaxation data as SQ or MQ for the same spin - but just labelling the data instead as 'SQ R1rho', 'SQ CPMG', 'MQ CPMG', etc. The data would then be sorted out prior to sending it to the target function. Is there a paper using this data combination?
In all these cases, one would like to have a set of common fit parameters.
This is already implemented - the clustering concept does this.
Typically, the population and the kinetics would be such common parameters. Then, one would have fit parameters that are specific to each data set, e.g. the chemical shift difference. There might also be cases where e.g. the 13C chemical shift difference is fit from both SQ and MQ dispersions, but the MQ has, in addition, also a contribution from the 1H shift difference.
The cluster vs. spin specific parameter handling is already implemented. For both SQ and MQ, this would require two different models. This is not implemented yet. The way to do this would be to create a special target function for each model combination of interest (see target_functions/relax_disp.py). Then the data is split and sent into two different functions from the relax library (lib.dispersion.*). A chi-squared sum is used for both. Adding the target function itself is simply a case of copy and paste one of the current target functions, modify for the different parameter vector, and duplicate the line with the model and change it to the second model. Done. For the support of simultaneous SQ and MQ data this is more complex, as I mentioned above, and this should be tackled first.
Then, it would of course also be great if such a fit could be done simultaneously for a set of residues.
This requires zero coding, it's already done ;) But in relax you don't work with residues but spins. However you can specify all spins of the residue when defining clusters.
In principle, implementing such joint fits should be rather straightforward, as the target function would contain all these data sets from different residues at the same time, and the fit parameters would contain all the various rates, populations and shift differences.
This is relatively straightforward if the exact problem is stated and data can be created, downloaded, or asked for. Having the data first really makes the coding faster and more reliable. And then converting the data and script to a system test ensures that the code functions perfectly forever.
Of course, the number of fit parameters quickly increases (e.g. the fitted plateau values and the chemical shift differences are separate for each data set and residue). So there will be some careful thinking involved when designing the minimization procedure.
This is a problem, but I think I have already solved that in the dispersion auto-analysis in relax. This might be a key part of the paper I'm writing. See the pre_run_parameters() and nesting() methods in auto_analyses/relax_disp.py. I use 3 tricks. These have already been mentioned in a few different emails I sent. They are: 1) Parameter nesting. If a simpler nested model has already been optimised, then the more complex model of the pair takes the parameters of the simpler as the starting point for optimisation. The models must absolutely be nested for this. 2) Model equivalence. I use the parameters of the optimised CR72 model (Carver & Richards) as the starting point for the CPMG numeric models. This is an incredibly huge computational win :) I don't know why no one has thought of this simple concept before! Maybe they have but I have missed it. 3) Pre-run data. You first run relax with no clustering. Then the second time your run with clustering, but point to the non-clustered results. The averaged parameters from the non-clustered results are taken as the starting point for optimisation. In this case, 1) and 2) are ignored. All 3 tricks remove the need for a grid search. For point 3), the grid search for a cluster of >3 spins is already impossible. The starting points appear to be close enough to the solution to allow for relatively fast optimisation. I might use some of this text for the paper :)
Do you see ways to do this?
If points 1), 2) and 3) above are ok, then this problem is already smashed!
Another challenge will be that this is flexible enough such that the user can chose which data sets to fit, and which parameters affect which data set - such as: fit 13C SQ, 13C MQ, 15N SQ CPMGs for residues X, Y, and Z, and optimize parameters XX, YY, ZZ,...
Again this needs to be on a case by case basis. For example if you have 15N SQ CPMG data and 13Calpha MG CPMG data, this will be treated differently to the data coming from the same spin. Though once simultaneous SQ and MQ data is handled, this is easy. We would need to create special 'combination' models for the user to choose from. For example 'SQ+MQ CR72', or 'SQ+MQ NS 2-site'. Each model combination would be a different target function. This is simply for speed. Which residues or spins are used is already handled.
Possibly, if one could then even fix some of the parameter, such that they are not fitted, this could be helpful.
This is more of a problem. And I believe it encourages bad behaviour from the users. But implementing this is quite complex. That would require the whole of the minfx library (https://gna.org/projects/minfx/) to be modified. But each optimisation algorithm would handle this differently and I don't know how the performance would be affected. It is also possible to create special target functions for each fixed parameter case.
For example, one might know the chemical shift difference for some of the nuclei from somewhere...
This is a different problem. The problem would be tailored differently. Forget about fixing parameters. You have a new input data set and, importantly for Monte Carlo simulations, also have errors. These shifts would need to be assembled in the specific_analyses.relax_disp.api module, as all data is, and then sent into the dispersion target function class with all the other data. Then there would be a new target function for this problem. The dw value would not be a parameter of this target function. Note that the lib.dispersion.* code would not change - the target function just pulls out the appropriate dw value from the permanent class data rather than from the parameter vector. Again each case would be handled separately.
Any thoughts on this?
I hope I've covered everything ;) What do you think? Regards, Edward P. S. We still need the copyright permission on the code you attached to https://gna.org/task/?7712. There's no licence in the code, so you need to give permission there.