On 24 November 2010 10:31, Dr. Klaus-Peter Neidig
<peter.neidig@xxxxxxxxxxxxxxxxx> wrote:
Dear Edward,
I talked to some colleagues at Bruker to discuss your various items.
In general there is agreement that being as consistent with relax as
possible is a good idea. However, other software at Bruker should not
be affected (some of the code is used in multiple areas), and changes
that cause too much effort or that touch behaviour which is already
documented and in use should be avoided. Enhancing the export
functionality to provide more information to relax is regarded as a
good way to go.
Hi,
Yes, it would be a bad idea to make big changes. Interfacing with
relax will be quite straightforward and probably no changes will be
required from Bruker's side.
In detail:
Do the users define the signal-free region? For example there is
usually more noise in the random coil part of the spectrum due to
unwanted junk in the sample, so the peaks in this region have larger
errors. What about signals close to the water signal? Is it possible
to have different errors for these peaks?
No, the PDC currently does not allow different errors to be assigned
to different peaks. We take several regions of the spectrum
automatically, e.g. close to the four corners, and finally use the
one with the lowest noise. We have to rely on some assumptions such
as proper baseline correction, a reasonably clean sample, and careful
post-processing, if any.
Ah, ok. We will have to make sure users manually exclude spins with
signals near the water region. Again this is an easy thing to solve.
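As a rough illustration of what I mean by excluding these spins,
something along the lines of the sketch below would do. This is not
the actual relax interface, and the water position (~4.7 ppm) and the
exclusion window are assumed values:

    # Minimal sketch: flag peaks close to the water signal for exclusion.
    # The water position (~4.7 ppm) and the +/-0.3 ppm window are assumptions.
    WATER_PPM = 4.7
    WINDOW_PPM = 0.3

    def near_water(peaks, water=WATER_PPM, window=WINDOW_PPM):
        """Return the IDs of peaks whose 1H shift lies in the water region.

        'peaks' maps a peak ID to its 1H chemical shift in ppm.
        """
        return [pid for pid, shift in peaks.items()
                if abs(shift - water) <= window]

    if __name__ == "__main__":
        peaks = {"A3N-H": 8.25, "G10N-H": 4.85, "L42N-H": 7.90}
        print("Exclude:", near_water(peaks))   # -> Exclude: ['G10N-H']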
I don't understand the origin of this scaling. Since Art Palmer's
seminal 1991 publication (http://dx.doi.org/10.1021/ja00012a001), the
RMSD of the pure base-plane noise has been used for the standard
deviation. The key text is on page 4375:
Thanks for the various references. The factor we have used so far is
empirical, but no problem, I will remove it. Other pieces of software
at Bruker that use such factors are not affected. I have already run
some tests; the fitted parameters do not really change, but of course
the errors of the fitted parameters do. This needs to be documented
for the user.
Thank you. In a model-free analysis the accuracy of the errors is
just as important as, or maybe even more important than, the data
itself, so this will make a big difference.
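For reference, the Palmer-style estimate is nothing more than the
following sketch, with the RMSD of a signal-free base-plane region
used directly as the peak height standard deviation and no empirical
scaling applied. The noise region indices and the synthetic spectrum
are of course arbitrary:

    # Minimal sketch: base-plane RMSD as the peak height standard deviation.
    # Assumes proper baseline correction so the noise mean is ~0.
    import numpy as np

    def baseplane_rmsd(spectrum, row_slice, col_slice):
        """RMSD of a signal-free region of a 2D spectrum."""
        region = spectrum[row_slice, col_slice]
        return np.sqrt(np.mean(region**2))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        spec = rng.normal(scale=1200.0, size=(512, 1024))  # synthetic noise
        sigma = baseplane_rmsd(spec, slice(0, 64), slice(0, 128))
        print("Peak height standard deviation:", sigma)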
From my work a long time ago, I noticed that the spectral errors
decreased with an increasing relaxation period. This can be taken
into account in relax if all spectra are duplicated/triplicated/etc.
But if not, then the errors for all spectra are averaged (using
variance averaging, not standard deviation averaging). For a single
duplicated/triplicated spectrum, the error is taken as the average
variance of each peak. So when some, but not all, spectra are
replicated, there will be one error for all spin systems at all
relaxation periods. This sounds similar to what the PDC does, but is
it exactly the same?
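To make the averaging explicit, the relax scheme amounts to the
following sketch, where the variances (not the standard deviations)
of the replicate intensities are averaged across all peaks and all
replicated relaxation periods. The data layout here is invented for
illustration only:

    # Minimal sketch of variance averaging for replicated spectra.
    # 'replicates' maps a relaxation period to {peak_id: [intensities]}.
    import numpy as np

    def averaged_error(replicates):
        """A single error for all peaks at all relaxation periods."""
        variances = []
        for period, peaks in replicates.items():
            for peak_id, intensities in peaks.items():
                if len(intensities) > 1:
                    variances.append(np.var(intensities, ddof=1))
        return np.sqrt(np.mean(variances))

    if __name__ == "__main__":
        replicates = {
            0.01: {"A3": [1.02e6, 1.05e6], "G10": [0.98e6, 0.93e6]},
            0.20: {"A3": [0.61e6, 0.63e6], "G10": [0.55e6, 0.58e6]},
        }
        print("Single averaged error:", averaged_error(replicates))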
I think the Bruker internal opinion was different in the sense that
one should either have enough replicates (which never happens) and do
sdev averaging, or just provide an estimate of the systematic error.
The change in intensity obviously depends on the peaks. We have seen
peaks with longer T2 varying much more than those with shorter T2.
We concluded to assume a worst case error scenario and check for the
largest difference of all peak intensities at replicated mixing
times. This error was then applied (added) to all peaks at all mixing
times. I should also mention that in the PDC the user has the option
to replace the peak intensities/integrals of replicated mixing times
by their mean, so that each mixing time occurs only once in the fit.
This yields slightly different results. The request for implementing
this came from our application people who obviously talked to their
customers.
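If I have followed the description correctly, the worst case
estimate amounts to something like the sketch below, with the data
layout again invented purely for illustration:

    # Minimal sketch of the worst case estimate described above: take the
    # largest difference between replicate peak intensities and apply that
    # single error to all peaks at all mixing times.
    def worst_case_error(replicates):
        """'replicates' maps a mixing time to {peak_id: [intensities]}."""
        largest = 0.0
        for mixing_time, peaks in replicates.items():
            for peak_id, intensities in peaks.items():
                if len(intensities) > 1:
                    largest = max(largest,
                                  max(intensities) - min(intensities))
        return largest

    if __name__ == "__main__":
        replicates = {
            0.01: {"A3": [1.02e6, 1.05e6], "G10": [0.98e6, 0.93e6]},
            0.20: {"A3": [0.61e6, 0.63e6], "G10": [0.55e6, 0.58e6]},
        }
        print("Worst case error:", worst_case_error(replicates))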
Having enough replicates would never be possible for properly
determining errors, so it must be crudely estimated. The maximum
error would mean that the errors would probably never be
underestimated. But for downstream analysis, underestimating some
errors (and overestimating others) would be better as no bias would be
introduced into the errors. Using the maximum error applied to all
peaks will actually result in molecules appearing more rigid than they
are! The result will be the hiding of many motional processes. So
the second option would be better. Even better would be to use an
average error rather than the maximum by averaging variances (sdev
values cannot be averaged as they cannot be summed, but variances
sum). Is the peak intensity error located in the PDC files? I cannot
see it in our current testing file
https://gna.org/file/testT2.txt?file_id=10641.
Since changing the error estimation of replicates could influence
the results significantly, we do not want to blindly do it. The
proposal would be to offer several options (variance averaging, sdev
averaging, worst case estimate)
The sdev averaging here should be avoided as it statistically should
not be done. For example from Wikipedia:
"The standard deviation of the sum of two random variables can be
related to their individual standard deviations and the covariance
between them:
stdev(X + Y) = sqrt(var(X) + var(Y) + 2 * cov(X, Y)),
where var = stdev^2 and cov stand for variance and covariance, respectively."
For replicate measurements the covariance is zero, so with equal
individual standard deviations stdev(X + Y) = sqrt(2)*stdev(X); the
standard deviations do not add linearly, whereas the variances do.
Another example is the discussion in the thread
http://mathforum.org/kb/message.jspa?messageID=1602446&tstart=0.
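A quick numerical check of the quoted formula for independent (zero
covariance) noise shows why it is the variances, and not the standard
deviations, that can be summed and averaged:

    # Numerical check: for independent noise, stdev(X + Y) matches
    # sqrt(var(X) + var(Y)), not stdev(X) + stdev(Y).
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(scale=3.0, size=1_000_000)
    y = rng.normal(scale=4.0, size=1_000_000)

    print("stdev(X + Y)         :", np.std(x + y))                    # ~5.0
    print("sqrt(var(X) + var(Y)):", np.sqrt(np.var(x) + np.var(y)))   # ~5.0
    print("stdev(X) + stdev(Y)  :", np.std(x) + np.std(y))            # ~7.0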
Having a number of options (variance averaging, worst case estimate,
etc.) would be great. Could the user's choice be recorded in
the PDC file? I would like relax to give the user a detailed warning
about hiding dynamics if they choose the worst case estimate. The
worst case estimate would nevertheless be quite useful so I would not
suggest removing it.
but to advise the use of variance averaging in the documentation.
Your remark about decreasing spectral errors with increasing mixing
times is correct; I see this in all the data I have. The peak
intensities/integrals decrease much more, however, so the relative
error of each increases.
This is a bit off topic, but would you know if this has something to
do with the total power of the spectrum? Maybe Wolfgang Bermel's new
pulse sequences with both the temperature compensation block and
single-scan interleaving will eliminate this strange effect.
Does peak intensity mean peak height? In relax, the term peak
intensity covers volume, height, or any other measure of the peak.
For relaxation data with proper temperature calibration and proper
temperature compensation, peak heights are now often considered the
best method. Peter Wright published a while back that volumes are more
accurate, but with the temperature consistency improvements the
problems Peter encountered are all solved and hence heights are better
as overlap problems and other spectral issues are avoided. Is there a
way of automatically determining the method used in the PDC file,
because in the end the user will have to specify this for BMRB data
deposition? This is not so important though, the user can manually
specify this if needed.
Yes, intensity === height. The integration method used is part of
the PDF report; I see, however, that it is not included in the export
(text or xls). I will change that, of course. The problem with peak
heights arises if peaks overlap, say one peak is hidden in the
shoulder of another. Such peaks usually have to be picked by hand,
and then peak deconvolution would be preferable. We keep offering
our options but advise in the handbook that peak intensities would
be preferred.
We could have relax read the integration method so that this is
automatically placed in the BMRB deposition file for the user.
Actually, maybe it would be an idea to have relax store the entire PDC
file and have this deposited within one of the BMRB method saveframes?
relax allows the user to do whatever they wish (the GUI interface does
not though), but if there is significant overlap, that data should not
be used. This is all up to the user though.
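As a sketch of how relax might pick the integration method up once
it is in the text export - and I stress that the 'Integration
method:' line below is purely a hypothetical placeholder, since the
export format is yours to define - something like this would suffice:

    # Hypothetical sketch only: the 'Integration method:' line format is an
    # assumption, as the PDC text export does not yet contain this field.
    import re

    def read_integration_method(path):
        """Return the integration method from a PDC text export, or None."""
        pattern = re.compile(r"^\s*Integration method\s*:\s*(.+?)\s*$",
                             re.IGNORECASE)
        with open(path) as handle:
            for line in handle:
                match = pattern.match(line)
                if match:
                    return match.group(1)
        return None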
The covariance analysis is known to be the lowest quality of all the
error estimates. For an exponential fit, I have not checked how
non-quadratic the minimum is, so I don't know the performance of the
method. Do the errors from that method compare well to those from the
area integral together with the base-plane RMSD?
I can't remember exactly, but I think Art Palmer published something
different to this and that is why his curvefit program uses MC as well
as Jackknife simulation. Anyway, if the minimum has a perfectly
quadratic curvature, then the covariance matrix analysis and MC will
give identical results - the methods converge. But as the minimum
deviates from quadratic, the performance of the covariance matrix
method drops off quickly. MC simulations are the gold standard for
error propagation, when the direct error equation is too complex to
derive. But again I don't know how non-quadratic the minimum is in
the exponential optimisation space, so I don't know how well the
covariance matrix performs. The difference in techniques might be
seen for very weak exchanging peaks with large errors. Or special
cases such as when the spectra are not properly processed and there
are truncation artifacts.
With my data the errors obtained from LM (taking errorY into
account) and MC are pretty close. If a typical T1 is 0.473, the
errors are 0.0026 and 0.0025. The base plane RMS decreases with
mixing time; the relative errors with the given examples are
0.000386, ...0.00651, ...0.003124.
I haven't tested this personally, and just followed Art Palmer's and
others' lead when I implemented this fitting and used the gold
standard Monte Carlo simulations. It could be that the covariance
matrix technique performs just as well as MC simulations for all
exponential curve-fitting - the equations are not particularly
complex. Maybe the user choice of covariance matrix estimate or Monte
Carlo simulations could be recorded in the PDC file so that this
information is also deposited in the BMRB? I would have to talk to
Eldon Ulrich at the BMRB about adding some new tags for that.
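To make the comparison concrete, here is a self-contained sketch of
the Monte Carlo approach for a two-parameter exponential, printed
next to the covariance matrix errors from the fit. The data are
synthetic and this is not the relax or PDC implementation:

    # Minimal sketch: covariance matrix vs Monte Carlo errors for
    # I(t) = I0 * exp(-R * t), using synthetic data.
    import numpy as np
    from scipy.optimize import curve_fit

    def expo(t, i0, r):
        return i0 * np.exp(-r * t)

    rng = np.random.default_rng(2)
    times = np.array([0.01, 0.05, 0.1, 0.2, 0.4, 0.8, 1.2])
    sigma = 2.0e4                          # peak intensity error
    data = expo(times, 1.0e6, 2.0) + rng.normal(scale=sigma, size=times.size)

    # Fit the measured data once to get the best-fit parameters.
    errs = np.full(times.size, sigma)
    popt, pcov = curve_fit(expo, times, data, p0=(1.0e6, 1.0),
                           sigma=errs, absolute_sigma=True)

    # Monte Carlo: randomise the back-calculated intensities with the
    # measured error, refit, and take the spread of the fitted parameters.
    back_calc = expo(times, *popt)
    sims = np.empty((500, 2))
    for i in range(sims.shape[0]):
        noisy = back_calc + rng.normal(scale=sigma, size=times.size)
        sims[i], _ = curve_fit(expo, times, noisy, p0=popt,
                               sigma=errs, absolute_sigma=True)

    print("Best fit (I0, R):        ", popt)
    print("Covariance matrix errors:", np.sqrt(np.diag(pcov)))
    print("Monte Carlo errors:      ", sims.std(axis=0, ddof=1))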
What is this factor and why is it used? The MC simulation errors
should be exact, assuming the input peak intensity errors are exact
estimates. MC simulations propagate the errors through the non-linear
space, and hence these errors are perfectly exact. Is there a
reference for this post-MC sim scaling? Is it optional? For relax,
it would be good to know both this scaling factor and that mentioned
above for the peak intensity error scaling, so that both can be
removed prior to analysis. These will impact model-free analysis in a
very negative way.
In many applications people do not just want a single error but an
error range they can rely on with a certain confidence. They would
also like to account for the fact that different numbers of peaks may
be used for different fits (sometimes not all peaks are picked in all
planes, or the user removes points from the fitted curve
interactively). This is why we implemented the same as Matlab
typically does. I would put an extra column in the output and list
the factor (typically ~2) for each fit. This way you can just
recalculate the original errors.
This would be perfect! Cheers.
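For what it is worth, a factor of roughly 2 sounds like a Student-t
confidence scaling, along the lines of the sketch below. The 95%
level and the example numbers are my own assumptions, but once the
extra column is there relax would simply divide the reported errors
by it:

    # Minimal sketch: a ~2x confidence scaling factor from the Student-t
    # distribution, and its removal to recover 1-sigma errors.  The 95%
    # level and the example values are assumptions.
    from scipy.stats import t

    def confidence_factor(n_points, n_params, level=0.95):
        """Two-sided Student-t factor for the given confidence level."""
        dof = n_points - n_params
        return t.ppf(0.5 + level / 2.0, dof)

    if __name__ == "__main__":
        factor = confidence_factor(n_points=12, n_params=2)  # ~2.23
        reported_error = 0.0057                # hypothetical PDC value
        print("Factor:", factor)
        print("Recovered 1-sigma error:", reported_error / factor)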
Certainly, model-free analysis will be influenced; whether in a
negative way or not I don't know. What I have seen so far (limited
experience, however) is that if the errors are too tight, the
modelling calculates forever and often finds solutions with errors so
large that they are practically unusable. I must admit, however, that
the PDC does not currently allow multiple fields to be used.
In model-free analysis, underestimating errors has the strange effect
of introducing artificial motions! Overestimating causes real motions
to disappear. Error estimates are incredibly important.
Ah, historic reasons. We will be able to handle both in relax. But
if the PDC presents T1 and T2 plots, and these values are in the PDC
output file, the user is highly likely to then use these in
publications. Some people prefer T1 and T2 plots, but these are an
exception in publications. This will likely lead to a convention
change in the literature (for Bruker users only) which will make it
harder to compare systems. It would be good to consider this effect.
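Switching between the two presentations, including the errors, is of
course trivial; a minimal sketch of the conversion with standard
first-order error propagation, using the T1 value quoted earlier in
this thread:

    # Minimal sketch: convert a relaxation time T (s) and its error into
    # the rate R = 1/T (s^-1), propagating the error as sigma_R = sigma_T/T^2.
    def time_to_rate(t_value, t_error):
        rate = 1.0 / t_value
        rate_error = t_error / t_value**2
        return rate, rate_error

    if __name__ == "__main__":
        t1, t1_err = 0.473, 0.0026        # values quoted earlier
        print(time_to_rate(t1, t1_err))   # -> (~2.114, ~0.0116)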
The Bruker people seem to regard this as a minor issue for various reasons.
One is that plots for publications will probably be made with more advanced
tools like Excel and it is no effort to switch between T and R. Another
argument is that people who really want to compare details need to compare
numbers anyway. I will check if there is a simple way to allow the user
to switch between the presentations but currently we would keep the T
displays and outputs.
Ok, that makes sense. relax uses xmgrace to display the relaxation
rates (never the rotational relaxation times though). With BMRB data
deposition very close to being complete within relax, maybe more data
will be placed in the database and not just shown as a low resolution
T1 or T2 plot. In many currently published papers, unfortunately
comparing low resolution graphs is the only way. This is the purpose
of the relax-BMRB collaboration.
This all gives me some work to do and of course some testing. I hope to
do everything this week and then send new export files to you for
inspection.
If accepted I will proceed to version PDC 1.1.3.
Cheers. Michael Bieri has created a relax branch for implementing the
reading of the file. Having newer files attached to
https://gna.org/task/?7180 will help in our implementation.
Thank you,
Edward