Re: Addition of PDC support to relax 1.3.10 and an inconsistency in the error calculation



Posted by Edward d'Auvergne on February 21, 2011 - 12:38:
Dear Peter,

Please see below:


On 21 February 2011 09:51, Dr. Klaus-Peter Neidig
<peter.neidig@xxxxxxxxxxxxxxxxx> wrote:

Dear Edward,

I'm running a current set of output files today and will send them.

The error of a ratio of numbers y = a/b with given aErr, bErr is
yErr = sqrt [ (aErr/a)^2 + (bErr/b)^2], which for a=1 and aErr=0 is
yErr = bErr/b

Does this correspond to your notation?

This is not quite right; see, for example, the NOE error formula on page
5988 of Farrow et al., Biochemistry (1994), 33, 5984-6003.
The formula should be:

yErr/y = sqrt [ (aErr/a)^2 + (bErr/b)^2].

For a=1 and aErr=0, the formula will then be:

yErr/y = bErr/b.

In the case of the R1:

sigma_R1 / R1 = sigma_T1 / T1,

or, since R1 = 1/T1:

sigma_R1 = sigma_T1 / T1^2 = sigma_T1 * R1^2.
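As a quick numerical check of these propagation formulae (the T1 value
and its error below are invented for illustration):

from math import sqrt

T1, sigma_T1 = 0.5, 0.01      # hypothetical T1 (s) and its error
R1 = 1.0 / T1                 # y = a/b with a = 1, aErr = 0

# Relative errors add in quadrature:  sigma_R1/R1 = sigma_T1/T1.
sigma_R1 = R1 * sqrt((0.0 / 1.0)**2 + (sigma_T1 / T1)**2)

print(sigma_R1)               # 0.04
print(sigma_T1 / T1**2)       # 0.04, matching the first form above
print(sigma_T1 * R1**2)       # 0.04, matching the second form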

There is a chance that this formula from the Kay paper is incorrect.
I've been trying to find another reference, one that I found a long
time ago, but I can't find it any more.  Do you have a reference for
the yErr = sqrt [ (aErr/a)^2 + (bErr/b)^2] formula?  This does not
explain the differences though; they are even more pronounced using
yErr = sqrt [ (aErr/a)^2 + (bErr/b)^2].


The fit error determination in the PDC depends on the method the user
has chosen. Under the tag "Fit parameter Error estimation method:" the
options are:
- from fit using arbitray y uncertainties
- from fit using calculated y uncertainties
- from Monte Carlo simulations

Which of these does the "SECTION:  used integral errors" belong to?
This section appears to be errors associated with each peak intensity
from "SECTION:  used integrals".  I simply took these to be individual
peak errors and then performed standard Monte Carlo simulations on
these.  Each peak appears to have a different error, but does "Fit
parameter Error estimation method:  from fit using arbitray y
uncertainties" mean one error for all peaks?


The first 2 are offered by our standard Marquardt implementation.

The Levenberg-Marquardt optimisation and Monte Carlo simulations are
two unrelated techniques: one is for optimisation, the other for error
propagation.  A very low quality error estimation method comes from
the covariance matrix.  So one can use Levenberg-Marquardt (or any
other optimisation algorithm) and then determine the errors using the
covariance matrix defined at the solution.  I don't understand the
differences in the error estimation methods in the PDC :S
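For concreteness, covariance matrix error estimation after a
Levenberg-Marquardt fit of a single exponential looks roughly like the
sketch below.  The delays, intensities, and errors are invented, and
scipy's curve_fit is used simply because it returns the covariance
matrix at the solution:

import numpy as np
from scipy.optimize import curve_fit

def decay(t, I0, R1):
    # Single exponential, I(t) = I0 * exp(-R1 * t).
    return I0 * np.exp(-R1 * t)

t = np.array([0.01, 0.05, 0.1, 0.2, 0.5, 1.0])      # delays (s)
I = np.array([0.98, 0.90, 0.82, 0.67, 0.37, 0.13])  # peak intensities
sigma_I = np.full_like(I, 0.02)                     # per-peak errors

# Levenberg-Marquardt optimisation; pcov is the covariance matrix
# defined at the solution.
popt, pcov = curve_fit(decay, t, I, p0=[1.0, 2.0], sigma=sigma_I,
                       absolute_sigma=True)
R1_err = np.sqrt(np.diag(pcov))[1]                  # error of R1
print("R1 = %.3f +/- %.3f" % (popt[1], R1_err))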


The first option assumes the same error for all y values. This then
cancels out of the formula; therefore we call it arbitrary.

But this could be used for either the Monte Carlo simulations or the
covariance matrix error estimation methods.  In relax, this is used in
combination with MC simulations.
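To illustrate the cancellation, if every point is given the same
uncertainty then the position of the chi-squared minimum does not
move, so the fitted parameters are identical whatever constant is
used.  A sketch with invented data:

import numpy as np
from scipy.optimize import curve_fit

def decay(t, I0, R1):
    return I0 * np.exp(-R1 * t)

t = np.array([0.01, 0.05, 0.1, 0.2, 0.5, 1.0])
I = np.array([0.98, 0.90, 0.82, 0.67, 0.37, 0.13])

# Two different, but constant, error values give identical parameters.
for s in (0.01, 1.0):
    popt, _ = curve_fit(decay, t, I, p0=[1.0, 2.0],
                        sigma=np.full_like(I, s))
    print(s, popt)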

The second option uses the y errors as determined before. As we
discussed earlier, the base plane RMS is one source for these errors;
the other is from optional repetition experiments. If repetition
experiments are available, the systematic error can be calculated as:
are available the systematic error can be calculated as:
- worst case per peak scenario
- average variance calculation

As I showed using theoretical examples in
https://mail.gna.org/public/relax-devel/2010-11/msg00027.html (the data,
python and java scripts are attached to https://gna.org/task/?7180),
the 2 worst case per peak methods on average significantly
overestimate the errors and have huge variances:  the estimates
correspond to the real errors multiplied by a normally distributed
factor of 1.594+/-0.852.  So 95% of the estimates from duplicated
spectra (assuming a normal distribution) will be scaled by a factor
between -0.11 and 3.298!  This type of error estimate would be fatal
for subsequent model-free analysis.
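A toy version of this comparison can be simulated directly.  This is
only an illustration, not the scripts attached to the task, and the
exact definition of the PDC's worst case estimate is an assumption
here:

import numpy as np

rng = np.random.default_rng(0)
sigma_true = 1.0                               # true per-peak error
n_peaks, n_trials = 100, 10000

# Two duplicate measurements of every peak in every trial.
I1 = rng.normal(0.0, sigma_true, (n_trials, n_peaks))
I2 = rng.normal(0.0, sigma_true, (n_trials, n_peaks))
d = I1 - I2                                    # variance is 2*sigma^2

worst_case = np.abs(d)                         # one estimate per peak
pooled = np.sqrt(np.mean(d**2, axis=1) / 2.0)  # one estimate per trial

# Ratios of estimated to true error:  the pooled estimate is unbiased
# with a small spread, the worst case estimate is biased and noisy.
print("worst case:", worst_case.mean(), "+/-", worst_case.std())
print("pooled:    ", pooled.mean(), "+/-", pooled.std())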

Therefore in relax we will need to catch "Systematic error estimation
of data:  worst case per peak scenario" and re-perform the relaxation
curve-fitting and Monte Carlo simulations.  Or maybe throw an error in
relax and tell the user that they must go back to the PDC and use
"average variance calculation" with "Monte Carlo simulations".  In the
example file you sent
(http://svn.gna.org/viewcvs/relax/1.3/test_suite/shared_data/pdc_files/testT1.txt?revision=12551&view=markup),
it says "Fit parameter Error estimation method:  from fit using
arbitray y uncertainties".  Above you said that this corresponds to 1
error for all peaks.  But you described in
https://mail.gna.org/public/relax-devel/2010-11/msg00021.html that the
current "worst case per peak scenario" method in the PDC uses a
different estimate for each peak.  The file also says "Random error
estimation of data: RMS per spectrum (or trace/plane)".  So I am quite
confused as to how to interpret the data.  Where do I read the peak
intensity errors from?


The test output files I have sent so far calculated the fit parameter
errors via "from fit using arbitray y uncertainties".
This option was chosen accidentally. The error there is different from the 
other 2 methods. Perhaps it is
a better idea to generate a new set of output files using the option "from 
fit using calculated y uncertainties".
That usually compares well with MC simulations.

What is the "from fit using calculated y uncertainties"?  Is this the
covariance matrix technique for error estimation?  For simple single
exponentials, the covariance matrix technique and Monte Carlo
simulations should not be too different, I hope.  How does this
combine with "worst case per peak scenario" and "average variance
calculation"?


In the PDC I do 1000 MC simulations such that I vary the y values
based on a Gaussian random distribution; the mean is the original y,
the width is the error of y.

You mean you take the back-calculated y value and randomise this 1000
times with a Gaussian distribution centred at the back-calculated y
with an error set to the peak intensity error?  If you take the
original y, this is not Monte Carlo simulations but Bootstrapping - a
technique which should never be used for error estimation.  Well, you
can use bootstrapping for a direct calculation of values, but never
for optimised values.
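In code the distinction looks like this.  A sketch with invented data,
where back_calc is the fitted curve and the second randomisation is
the bootstrapping-style use of the measured values:

import numpy as np
from scipy.optimize import curve_fit

def decay(t, I0, R1):
    return I0 * np.exp(-R1 * t)

rng = np.random.default_rng(1)
t = np.array([0.01, 0.05, 0.1, 0.2, 0.5, 1.0])
I = np.array([0.98, 0.90, 0.82, 0.67, 0.37, 0.13])
err = 0.02                                   # per-peak intensity error

popt, _ = curve_fit(decay, t, I, p0=[1.0, 2.0])
back_calc = decay(t, *popt)                  # fitted curve at each delay

R1_mc, R1_boot = [], []
for _ in range(1000):
    # Monte Carlo:  Gaussian noise around the back-calculated values.
    R1_mc.append(curve_fit(decay, t, rng.normal(back_calc, err),
                           p0=popt)[0][1])
    # Bootstrapping-style:  Gaussian noise around the measured values.
    R1_boot.append(curve_fit(decay, t, rng.normal(I, err),
                             p0=popt)[0][1])

print("MC sigma_R1:       ", np.std(R1_mc))
print("bootstrap sigma_R1:", np.std(R1_boot))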

I am quite confused.  The way I see the analysis, using the notations
from the statistics and optimisation fields, there are the following
categories:

Peak intensity notation:
  - Height (the physical height of the Lorentzian peak at its mean, or
maximum point)
  - Volume (sum of all points, i.e. the discrete integration of the
Lorentzian)
  - Height (as above, but via fitting the peak shape to a function)
  - Volume (as above, but via fitting the peak shape to a function)

Peak errors:
  - 1 average error for all peaks (from some spectra duplicated,
triplicated, etc.)
  - 1 max error for all peaks (largest ever occurring difference)
  - 1 average error for all peaks in one spectrum (only if all spectra
are duplicated)
  - 1 max error for each peak across all spectra (current "worst case
per peak scenario"?).
  - Each peak has a different error (from the baseplane RMSD)
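As a sketch of how some of these estimates could be computed from
duplicated spectra (the array layout and the exact worst case
definition are my assumptions):

import numpy as np

# Intensities as (n_spectra, n_peaks) arrays, each spectrum duplicated.
IA = np.array([[100.0, 52.0, 80.0],
               [ 60.0, 30.0, 45.0]])
IB = np.array([[102.0, 50.0, 81.0],
               [ 59.0, 32.0, 44.0]])
d = IA - IB                               # per-peak differences

# 1 average error for all peaks (pooled over all duplicated points).
avg_all = np.sqrt(np.mean(d**2) / 2.0)

# 1 max error for all peaks (largest difference ever occurring).
max_all = np.max(np.abs(d))

# 1 average error for all peaks in one spectrum.
avg_per_spectrum = np.sqrt(np.mean(d**2, axis=1) / 2.0)

# 1 max error for each peak across all spectra ("worst case per peak"?).
max_per_peak = np.max(np.abs(d), axis=0)

print(avg_all, max_all, avg_per_spectrum, max_per_peak)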

Optimisation methods:
  - Levenberg-Marquardt
  - Simplex
  - Newton
  - etc.

Error estimation methods:
  - Covariance matrix (bad)
  - Jackknife (also bad)
  - Bootstrapping (never to be used with optimisation)
  - Monte Carlo simulations (considered the gold standard)
  - etc.

I am having trouble classifying the PDC file
http://svn.gna.org/viewcvs/relax/1.3/test_suite/shared_data/pdc_files/testT1.txt?revision=12551&view=markup
into these categories.  This is very, very confusing :S

Regards,

Edward


