Re: Updating the copyright notices in relax. -- July 26, 2013

Hi,

This sounds like an awesome plan!  Because of the breadth of ideas you
propose, this email is forcibly long.  But please carefully read all
of the details.

Some of things you mention are trivial and some actually already
function, others would require a little planning.  The relax
infrastructure is designed to be extremely flexible and modular, so
the amount of work involved, as you mentioned, would not be too great.
 I think some would be trivial to have done in time for the paper
while others could and should be done later.  However there really
needs to be someone pushing these ideas - i.e. possibly someone from
your side wherein the interest lies in an internal project and hence a
publication (an application paper).  Someone who would be willing to
join as a relax developer and to push the boundaries.   Someone who
would not mind intensive code review (I check every line of code that
goes into relax, sometimes the other relax developers as well), solid
pre-planning on the mailing lists, and working with code in a group
environment (rather than the standard work alone in a shielded
environment).  For example an idea might be rejected if there is a
better, more flexible way of doing it.  For the good of the whole
project, the simplest way is not always the best.  And especially if
the idea would break other parts of relax, then it will not be
acceptable.  Note that, when needed, I will point new developers in
the correct direction and often lay out the steps needed for
implementing a feature - what file needs to be changed, how it should
be changed, and the order of the changes.  This is needed to be able
to understand such a large, established code base.  This will require
cooperation - I cannot give the time to implement all of these ideas
myself.  I would be happy to see someone from your side take charge
and push those features you desire, now and in to the future.

I think that each situation needs to be taken on a case by case basis
and that absolutely all details of the problem needs to be laid out.
The reason is because the code cannot be written to accept all
possibilities - this will make the optimisation far too slow.  And
there are far too many combinations possible, and we cannot predict
future models and data types.  There is a lot of flexibility what can
be done in relax, but there are boundaries.  Some combinations will
make no sense, while others might appear to function but without test
data and a system test, the results might be false and meaningless.

So in each specific case, either a synthetic or real data set needs to
exist (ideally both data set types would be good to have).  The real
data can be obtained from published papers by email, as was done with
Dr. Flemming Hansen's data from http://dx.doi.org/10.1021/jp074793o
now located in the test_suite/shared_data/dispersion/Hansen directory.
 Then, before any coding is done, a relax script is written to perform
the full analysis on this test data.  Or at least how we think the
analysis should be performed.  This allows the deficiencies to be
discovered.  The script is then copied into
test_suite/system_tests/scripts/relax_disp/ and the data into
test_suite/shared_data/dispersion/.  It is then called a system test.
The test also checks that the result is as you expect.

For the rest of your ideas, I have answered below:

Yes, relax could become one of the most complete programs, and this would be
a strong point for it. Concerning features that would make relax really
unique, I have a suggestion.


It already is the most complete program!  However it does not have
100% of the capabilities of all dispersion software combined, though
it shouldn't be hard to reach that point.  It also has access to
better optimisation (for the dispersion problem) and other algorithms
that the other software does not have.  And it has a very simple GUI
for non-power users.  The question is how far do we, or you, want to
push it.  For example there are little features such as using
matlibplot which, if Troels doesn't get around to converting his code
first, I can help direct someone on your side to copy your Python code
to fit into relax.  If there is someone who is accepted as a relax
developer from your side, you are free to take relax where you wish.

It would be great if relax allowed to perform simultaneous fits of multiple
data sets. For example, in a number of cases people may have CPMG relaxation
dispersion on methyl 13C, and backbone (15N or CA).


Actually, relax will already automatically handle this.  In relax you
work with spin systems (or a structure called interatomic data
containers for data such as NOEs, J-couplings, RDCs, etc).  So if the
data handling, equations, and models are the same for all the spins
with relaxation data, then this will work now.  You do have to specify
clusters manually though.  For the methyl, I'm assuming CD2H data,
otherwise relax will not handle the complex interference pathways.

Likewise, fitting SQ and
MQ at the same time would be another instance for such simultaneous fits.


This might be more effort.  The way of handling the relaxation data
inside a spin container (the relax data store structure for spins)
would need to be modified.  Currently there is the assumption of only
one data set per spin.  But, with planning, this would not be too hard
to handle.  It would be a similar concept to how multiple field
strength data is labelled.  This would absolutely require test data.
Also note that the MQ dispersion models are currently not supported by
relax, but you probably already know how trivial this would be to add.
 Especially considering that almost all the SQ CPMG and R1rho analytic
and numeric models are already functional, or will soon be.

One might also think about situations where R1rho and CPMG dispersions might
be fit at the same time (although this is a bit more special, as it requires
that the time scale can be detected by both methods).


Again, this would require test data.  An planning how you would do
this.  Currently in relax you specify the experiment type at the
start.  This would need to be changed, and maybe in the same way as
labelling the relaxation data as SQ or MQ for the same spin - but just
labelling the data instead as 'SQ R1rho', 'SQ CPMG', 'MQ CPMG', etc.
The data would then be sorted out prior to sending it to the target
function.  Is there a paper using this data combination?

In all these cases, one would like to have a set of common fit parameters.


This is already implemented - the clustering concept does this.

Typically, the population and the kinetics would be such common parameters.
Then, one would have fit parameters that are specific to each data set, e.g.
the chemical shift difference. There might also be cases where e.g. the 13C
chemical shift difference is fit from both SQ and MQ dispersions, but the MQ
has, in addition, also a contribution from the 1H shift difference.


The cluster vs. spin specific parameter handling is already
implemented.  For both SQ and MQ, this would require two different
models.  This is not implemented yet.  The way to do this would be to
create a special target function for each model combination of
interest (see target_functions/relax_disp.py).  Then the data is split
and sent into two different functions from the relax library
(lib.dispersion.*).  A chi-squared sum is used for both.  Adding the
target function itself is simply a case of copy and paste one of the
current target functions, modify for the different parameter vector,
and duplicate the line with the model and change it to the second
model.  Done.  For the support of simultaneous SQ and MQ data this is
more complex, as I mentioned above, and this should be tackled first.

Then, it would of course also be great if such a fit could be done
simultaneously for a set of residues.


This requires zero coding, it's already done ;)  But in relax you
don't work with residues but spins.  However you can specify all spins
of the residue when defining clusters.

In principle, implementing such joint fits should be rather straightforward,
as the target function would contain all these data sets from different
residues at the same time, and the fit parameters would contain all the
various rates, populations and shift differences.


This is relatively straightforward if the exact problem is stated and
data can be created, downloaded, or asked for.  Having the data first
really makes the coding faster and more reliable.  And then converting
the data and script to a system test ensures that the code functions
perfectly forever.

Of course, the number of
fit parameters quickly increases (e.g. the fitted plateau values and the
chemical shift differences are separate for each data set and residue). So
there will be some careful thinking involved when designing the minimization
procedure.


This is a problem, but I think I have already solved that in the
dispersion auto-analysis in relax.  This might be a key part of the
paper I'm writing.  See the pre_run_parameters() and nesting() methods
in auto_analyses/relax_disp.py.  I use 3 tricks.  These have already
been mentioned in a few different emails I sent.  They are:

1)  Parameter nesting.  If a simpler nested model has already been
optimised, then the more complex model of the pair takes the
parameters of the simpler as the starting point for optimisation.  The
models must absolutely be nested for this.
2)  Model equivalence.  I use the parameters of the optimised CR72
model (Carver & Richards) as the starting point for the CPMG numeric
models.  This is an incredibly huge computational win :)  I don't know
why no one has thought of this simple concept before!  Maybe they have
but I have missed it.
3)  Pre-run data.  You first run relax with no clustering.  Then the
second time your run with clustering, but point to the non-clustered
results.  The averaged parameters from the non-clustered results are
taken as the starting point for optimisation.  In this case, 1) and 2)
are ignored.

All 3 tricks remove the need for a grid search.  For point 3), the
grid search for a cluster of >3 spins is already impossible.  The
starting points appear to be close enough to the solution to allow for
relatively fast optimisation.  I might use some of this text for the
paper :)

Do you see ways to do this?


If points 1), 2) and 3) above are ok, then this problem is already smashed!

Another challenge will be that this is flexible
enough such that the user can chose which data sets to fit, and which
parameters affect which data set - such as: fit 13C SQ, 13C MQ, 15N SQ CPMGs
for residues X, Y, and Z, and optimize parameters XX, YY, ZZ,...


Again this needs to be on a case by case basis.  For example if you
have 15N SQ CPMG data and 13Calpha MG CPMG data, this will be treated
differently to the data coming from the same spin.  Though once
simultaneous SQ and MQ data is handled, this is easy.  We would need
to create special 'combination' models for the user to choose from.
For example 'SQ+MQ CR72', or 'SQ+MQ NS 2-site'.  Each model
combination would be a different target function.  This is simply for
speed.  Which residues or spins are used is already handled.

 Possibly,
if one could then even fix some of the parameter, such that they are not
fitted, this could be helpful.


This is more of a problem.  And I believe it encourages bad behaviour
from the users.  But implementing this is quite complex.  That would
require the whole of the minfx library
(https://gna.org/projects/minfx/) to be modified.  But each
optimisation algorithm would handle this differently and I don't know
how the performance would be affected.  It is also possible to create
special target functions for each fixed parameter case.

For example, one might know the chemical
shift difference for some of the nuclei from somewhere...


This is a different problem.  The problem would be tailored
differently.  Forget about fixing parameters.  You have a new input
data set and, importantly for Monte Carlo simulations, also have
errors.  These shifts would need to be assembled in the
specific_analyses.relax_disp.api module, as all data is, and then sent
into the dispersion target function class with all the other data.
Then there would be a new target function for this problem.  The dw
value would not be a parameter of this target function.  Note that the
lib.dispersion.* code would not change - the target function just
pulls out the appropriate dw value from the permanent class data
rather than from the parameter vector.  Again each case would be
handled separately.

Any thoughts on this?


I hope I've covered everything ;)  What do you think?

Regards,

Edward


P. S.  We still need the copyright permission on the code you attached
to https://gna.org/task/?7712.  There's no licence in the code, so you
need to give permission there.

Re: Updating the copyright notices in relax.

Header

Content

Related Messages