Re: Curve fitting -- October 20, 2008

On Mon, Oct 20, 2008 at 1:20 AM, Chris MacRaild <macraild@xxxxxxxxxxx> wrote:

On Sat, Oct 18, 2008 at 8:20 AM, Edward d'Auvergne
<edward.dauvergne@xxxxxxxxx> wrote:

Hi,

Before you sent this message, I was talking to Ben Frank (a PhD
student in Griesinger's lab) about this exact problem - baseplane RMSD
noise to volume error.  The formula of Nicholson et al., 1992 you
mentioned makes perfect sense as that's what we came up with too.
Volume integration over a given area is the sum of the heights of all
the discrete points in the frequency domain spectrum within that box.
So the error of a single point is the same as that of the peak height.
We just have n*m points within this box.  And as variances add, not
standard deviations, then the variance (sigma^2) of the volume is:

sigma_vol^2 = sigma_i^2 * n * m,

where sigma_vol is the standard deviation of the volume, sigma_i is
the standard deviation of a single point assumed to be equal to the
RMSD of the baseplane noise, and n and m are the dimensions of the
box.  Taking the square root of this gives the Nicholson et al.
formula:

sigma_vol = sigma_i * sqrt(n*m).


This is the strategy I have used to try and get precision estimates
from peak volumes. As I said earlier, in my hands it does not perform
well. Uncertainties from this method will systematically over-estimate
the precision of strong peaks and underestimate the precision of weak
ones as compared to estimates from duplicate spectra (or perhaps it's
the other way around, I don't remember). This may not be evident for
proteins like ubiquitin, where virtually all amides give uniformly
strong peaks in the HSQC, but for proteins with more varied relaxation
behaviour, this can be a major issue. It's important to keep in mind
just how much signal processing goes on between a raw fid (in which
the noise in adjacent points is independent and uncorrelated) and the
spectrum that we integrate (in which, apparently, noise in adjacent
points is not always independent and uncorrelated).
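
To illustrate the point about correlated noise, here is a toy Python
sketch.  It is only a model: a moving average stands in for whatever
the window function and zero filling really do, and the function names
and parameters are made up for the illustration.  It compares the
sigma_i * sqrt(N) prediction with the spread actually observed when N
adjacent, smoothed noise points are summed.

```python
import math
import random
import statistics


def smoothed_noise(n_points, width, rng):
    """White Gaussian noise passed through a moving average (assumed model)."""
    raw = [rng.gauss(0.0, 1.0) for _ in range(n_points + width - 1)]
    return [sum(raw[i:i + width]) / width for i in range(n_points)]


def compare_sum_error(n_box=16, width=4, trials=4000, seed=1):
    """Return (predicted, observed) standard deviation of a summed box."""
    rng = random.Random(seed)
    points, sums = [], []
    for _ in range(trials):
        noise = smoothed_noise(n_box, width, rng)
        points.extend(noise)
        sums.append(sum(noise))
    sigma_i = statistics.pstdev(points)      # per-point "baseplane RMSD"
    predicted = sigma_i * math.sqrt(n_box)   # sigma_i * sqrt(N)
    observed = statistics.pstdev(sums)       # actual spread of the sums
    return predicted, observed


predicted, observed = compare_sum_error()
print(f"predicted {predicted:.3f}, observed {observed:.3f}")
```

In this model the neighbouring points are positively correlated, so the
observed spread of the sum comes out larger than the sqrt(N)
prediction.  Real processed spectra may behave differently.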

Even apart from this issue, I have always found peak height to give
better results for fitting relaxation data. Heights would be expected
to be less sensitive to all sorts of experimental complications like
imperfect baselines, peak overlap, phase errors, etc. In my hands this
always seems to outweigh the greater precision afforded by peak
volumes.


I'm about to implement a much better system for handling spectra and
peak intensities in relax, by creating a new 'spectrum' user function
class.  I hope to implement as many different ways of handling
intensities and errors as possible.  I might summarise them all later,
but these include:

Intensity type   Noise source                             Error scope

height           RMSD baseplane                           sigma per peak per spectrum
height           partial duplicate + variance averaging   one sigma for all peaks, all spectra
height           all replicated + variance averaging      one sigma per time point
volume           partial duplicate + variance averaging   one sigma for all peaks, all spectra
volume           all replicated + variance averaging      one sigma per time point
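
As a rough sketch of what "variance averaging" could look like for the
"one sigma for all peaks" case: take the sample variance of each
replicated peak, average the variances (not the standard deviations),
and take the square root.  The function name, the dictionary layout,
and the numbers are invented for the example; this is not relax code.

```python
import math
import statistics


def sigma_from_replicates(replicates):
    """Pooled sigma from replicated intensities.

    replicates maps a peak identifier to the list of intensities measured
    for that peak in the replicated spectra (hypothetical data layout).
    """
    variances = [
        statistics.variance(values)       # sample variance, n - 1 denominator
        for values in replicates.values()
        if len(values) >= 2               # peaks without replicates are skipped
    ]
    if not variances:
        raise ValueError("need at least one peak measured in replicate")
    # Average the variances, then convert back to a standard deviation.
    return math.sqrt(sum(variances) / len(variances))


# Made-up duplicate heights for two peaks.
example = {"peak_a": [100.0, 104.0], "peak_b": [50.0, 48.0]}
sigma = sigma_from_replicates(example)
print(sigma)
```

For the "one sigma per time point" case the same function could be
called once per time point, using only the replicates recorded at that
time.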

Note that there is no volume + RMSD of baseplane yet because I don't
know how to handle this.  Maybe I could let the user specify how many
points were used in the integration (and force them to state the
integration method to internally check for compatibility - i.e.
disallow Sparky Gaussian integration!).  As you said, the errors are
correlated in the frequency domain - I think this is due to the
smoothing of the window function and the wavelet-like interpolation of
zero filling - but the RMSD measure takes that into account.  So for
volume integration methods using point summing, we can use the
equation:

sigma_vol = sigma_i * sqrt(N),

where sigma_vol is the standard deviation of the volume, sigma_i is
the standard deviation of a single point assumed to be equal to the
RMSD of the baseplane noise, and N is the total number of points used
in the summation integration method.  Does anyone know any other
methods that could be used here?  Given your description, Chris,
do you think we should have relax throw a RelaxWarning stating that
this error estimation method is not very accurate?
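
A minimal sketch of how the equation and the warning might fit
together, using only the Python standard library.  The names
volume_error and VolumeErrorWarning are hypothetical stand-ins and not
part of relax; the warning class simply plays the role a RelaxWarning
would.

```python
import math
import warnings


class VolumeErrorWarning(UserWarning):
    """Hypothetical stand-in for a relax warning class."""


def volume_error(rmsd, n_points):
    """Return sigma_vol = sigma_i * sqrt(N) for a point-summed volume.

    rmsd is the baseplane RMSD taken as sigma_i, and n_points is the
    total number of points N used in the summation.
    """
    if n_points < 1:
        raise ValueError("n_points must be at least 1")
    # The estimate treats the points as independent, so flag it as rough.
    warnings.warn(
        "volume error from baseplane RMSD assumes uncorrelated points "
        "and may not be very accurate",
        VolumeErrorWarning,
        stacklevel=2,
    )
    return rmsd * math.sqrt(n_points)
```

Calling volume_error(2.0, 25) would return 10.0 and issue the warning
once per call site under Python's default warning filters.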

Edward d'Auvergne wrote:

Oh, I forgot about the std error formula. Is that where the sqrt(2) comes
from? Doh, that would be retarded. Then I know someone who would
require sqrt(3) for the NOE spectra! Is that really what Palmer
meant, that std error is the same as "the standard deviation of the
differences between the heights of corresponding peaks in the paired
spectra" which "is equal to sqrt(2)*sigma" (Palmer et al., 1991)?

I'm pretty sure though that the standard error is not the measure we
want for the confidence interval of the peak intensity. The reason is
because I think that the std error is a measure of how far the sample
mean is from the true mean (ignore this, this is a quick reference for
myself: http://en.wikipedia.org/wiki/Standard_error_(statistics) ).
(Warning, from here to the end of the paragraph is a rant!) This is
similar in concept to AIC model selection (see
http://en.wikipedia.org/wiki/Akaike_information_criterion and
http://en.wikipedia.org/wiki/Model_selection if you haven't heard
about the advanced statistical field of model selection before). AIC
is a little more advanced though as it estimates the Kullback-Leibler
discrepancy (http://en.wikipedia.org/wiki/Kullback–Leibler_divergence)
which is a measure of distance between the true distribution and the
back-calculated distribution using all information about the
distribution. Ok, that wasn't too relevant. Anyway, the std error as
a measure of the differences in means of 2 different distributions is
not a measure of the spread of either the true, measured, or
back-calculated distributions (or the 4th distribution, the
back-calculated from the fit to the true model). The std error is not
the confidence intervals of any of these 4 distributions, just the
difference between 2 of them using only a small part of the
information of those distributions. It's the statistical measure of
the difference in means of the true and measured distributions. As an
aside, for those completely lost now a clearer explanation of these 4
distributions fundamental to data analysis, likelihood, discrepancies,
etc. can be read in section 2.2 of my PhD thesis at
http://dtl.unimelb.edu.au:80/R/-?func=dbin-jump-full&object_id=67077&current_base=GEN01
or
http://www.amazon.com/Protein-Dynamics-Model-free-Analysis-Relaxation/dp/3639057627/ref=sr_1_6?ie=UTF8&s=books&qid=1219247007&sr=8-6
(sorry for the blatant plug ;). Oh, the best way to picture all of
these concepts and the links between them is to draw and label 4
distributions on a piece of paper on the same x and y-axes with not
too much overlap between them and connect them with arrows labelled
with all the weird terminology.

Sorry, that was just a long way of saying that the std error measures
how well the sample mean matches the true mean, whereas the
standard deviation is the spread of the distribution. That being so,
I would avoid setting the standard error as the peak height
uncertainty. Maybe it would be best to do as you say Chris, and also
avoid the averaging of the replicated intensities.


I didn't mean to suggest that std error should be taken as the peak
height uncertainty. Rather, it should be taken as the uncertainty for
any value which is the mean of peak heights (e.g. the mean value of a
duplicate measurement). So, for a peak height measured from a single
spectrum, the uncertainty is best estimated as the std dev of
duplicates, but failing that from the RMS noise. For a mean value
derived from duplicates the std error is the appropriate estimate of
precision.
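
The distinction can be sketched in a few lines of Python.  This assumes
the relation quoted above from Palmer et al., 1991, that the standard
deviation of the paired differences equals sqrt(2)*sigma, plus the
usual sigma/sqrt(n) formula for the standard error of a mean of n
measurements.  The function names and the numbers are invented for the
example.

```python
import math
import statistics


def sigma_from_pairs(first, second):
    """Single-spectrum sigma from heights of the same peaks in paired spectra."""
    if len(first) != len(second):
        raise ValueError("paired spectra must list the same peaks")
    diffs = [a - b for a, b in zip(first, second)]
    # The std dev of the differences is sqrt(2) * sigma, so divide it out.
    return statistics.stdev(diffs) / math.sqrt(2)


def std_error_of_mean(sigma, n_replicates):
    """Uncertainty of a value that is the mean of n replicated heights."""
    return sigma / math.sqrt(n_replicates)


# Made-up heights of three peaks in a duplicated spectrum.
sigma = sigma_from_pairs([10.0, 20.0, 30.0], [9.0, 22.0, 29.0])
print(sigma)                        # uncertainty of a single-spectrum height
print(std_error_of_mean(sigma, 2))  # uncertainty of a duplicate-averaged height
```

So a height taken from one spectrum would carry sigma, while the mean
of a duplicate pair would carry sigma/sqrt(2).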


Would you recommend, therefore, that we should not average peak
heights and use the normal peak height standard deviation from the
replicated spectra?  This would weight the fitting towards the
replicated points, but maybe that would be more accurate for the error
analysis.  Does anyone have opinions as to the best method for
fitting+error propagation using replicated spectra?

Regards,

Edward
