Re: Addition of PDC support to relax 1.3.10 and an inconsistency in the error calculation.



Posted by Edward d'Auvergne on February 23, 2011 - 15:47:
Dear Peter,

Please see below:


On 23 February 2011 12:33, Dr. Klaus-Peter Neidig
<peter.neidig@xxxxxxxxxxxxxxxxx> wrote:

Dear Edward,

I will provide the other output file examples as soon as the format is
accepted by you. We still have a few points open.

The relation between dialog window and output:

dialog                             explanation                                      appears in output as
error estimation by fit            all errors are unknown but the same for all y    "from fit using arbitrary y uncertainties"
error estimation by weighted fit   errors as listed in the output are used          "from fit using calculated y uncertainties"
error estimation by Monte Carlo    N fits with synthetically modified y values      "from Monte Carlo simulations"

Ah, ok, that explanation helps.  Maybe the explanation could be
expanded a little as both "error estimation by weighted fit" and
"error estimation by Monte Carlo" use the same errors from the "used
integral error" section.  The first appears to be the covariance
matrix estimate.  Is this correct?  What is the technique used in
"error estimation by fit"?  Is this where the target function is
actually the SSE (sum of squared errors) rather than the chi2, so that
the errors are coming from the residuals of the fit?  I think the
explanation should identify the exact source of the errors, the target
function, and the precise technique used (i.e. covariance matrix,
Monte Carlo, etc.).
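
To illustrate my reading of "error estimation by fit" (an assumption on
my part, not necessarily what the PDC does): with an unweighted SSE
target function, a single y uncertainty is estimated from the residuals
and used to scale the covariance matrix.  A minimal Python sketch with
made-up data:

import numpy as np
from scipy.optimize import curve_fit

def decay(t, i0, r1):
    """Two-parameter exponential decay I(t) = I0 * exp(-R1*t)."""
    return i0 * np.exp(-r1 * t)

times = np.array([0.01, 0.05, 0.1, 0.2, 0.5, 1.0])        # delays [s]
heights = np.array([0.98, 0.89, 0.80, 0.64, 0.33, 0.11])  # peak heights

# With no sigma argument, curve_fit minimises the plain SSE and rescales
# the covariance matrix by the residual variance SSE/(n - p), i.e. the
# parameter errors come from the residuals of the fit.
popt, pcov = curve_fit(decay, times, heights, p0=[1.0, 2.0])
perr = np.sqrt(np.diag(pcov))
print("R1 = %.4f +/- %.4f" % (popt[1], perr[1]))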


errorScale:
Matlab and other software packages provide the fit results within
confidence bounds which depend on a confidence level and the number of
degrees of freedom. The formula is c = f +/- t*fErr. f would be the
fitted parameter value, fErr is the fit parameter error as coming out
of Marquardt or MC, and t is calculated.

This is too complex :S  Most users will likely be confused by this -
they will not know that the standard deviation (sd) on the relaxation
rates is equal to t*fErr.  They will be reporting fErr rather than the
standard deviation as they should.  The error bars in R1 and R2
figures must be standard deviations
(http://en.wikipedia.org/wiki/Error_bar), or alternatively the
standard error.  The reported relaxation rates also need to be the
mean (from the fit) and sd.  All subsequent analyses require the sd
and not fErr.

There is a much simpler alternative.  It is simple error propagation
(http://en.wikipedia.org/wiki/Propagation_of_uncertainty).  This
concept is much simpler and there are no confidence levels or degrees
of freedom used.  The aim is to transfer the errors (sd) in the peak
heights or volumes to the fitted R1 or R2 errors (sd).  So using Art
Palmer's notation from his 1991 paper, we are converting the peak
height sd sigma_h to sigma_R1 or sigma_R2 via Monte Carlo simulations.
We are going from the sd in the base data to the sd in the parameters.
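
To sketch the idea in Python (illustrative data and names, not relax's
actual code):

import numpy as np
from scipy.optimize import curve_fit

def decay(t, i0, r1):
    """Two-parameter exponential decay I(t) = I0 * exp(-R1*t)."""
    return i0 * np.exp(-r1 * t)

times = np.array([0.01, 0.05, 0.1, 0.2, 0.5, 1.0])        # delays [s]
heights = np.array([0.98, 0.89, 0.80, 0.64, 0.33, 0.11])  # peak heights
sigma_h = 0.02                                            # peak height sd

# Fit the measured data, then back-calculate the fitted curve.
popt, _ = curve_fit(decay, times, heights, p0=[1.0, 2.0])
back_calc = decay(times, *popt)

# Monte Carlo: refit N synthetic data sets built by adding Gaussian
# noise of size sigma_h to the back-calculated points.
rng = np.random.default_rng(0)
r1_sims = []
for _ in range(500):
    synth = back_calc + rng.normal(0.0, sigma_h, size=times.shape)
    p_sim, _ = curve_fit(decay, times, synth, p0=popt)
    r1_sims.append(p_sim[1])

# The sd of the simulated R1 values is sigma_R1.
print("R1 = %.4f, sigma_R1 = %.4f" % (popt[1], np.std(r1_sims, ddof=1)))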

All current NMR software that I know of follows this simple error
propagation principle.  The standard error propagation technique, if a
direct formula cannot be derived, is Monte Carlo simulations.  Other,
lower quality techniques are sometimes used, such as Jackknife
simulations, and sometimes the very rough covariance matrix estimate.

The complexity used by the PDC is not necessary and most users will
not understand the errors they are presented with.  If they are not
using relax to read the PDC file and convert the times to rates and
the fErr to a sd, this will be very bad for model-free analysis,
reduced spectral density mapping, SRLS, etc. (ok, the word
catastrophic would be more apt).  Even excluding this confusion, most
users will have a problem converting from the sd of the times Tx to
the sd of the rates Rx, as required for all subsequent analyses!  relax,
Modelfree4, Dasha, and Tensor2 (the most commonly used programs) all
require rates and the sd of the rates - they do not accept times.

Would it be possible to extend the PDC format in the results section
to include not only all of the current columns but to also add the Tx
standard deviation (from fErr and the errorScale), the Rx value, and
the Rx sd?  If the user
is presented with the fErr value and the standard deviation, they will
instantly recognise which is the correct one to use.  Also presenting the
rates and their errors:

R1 = 1/T1
sigma_R1 = sigma_T1 / T1^2

Then the user will not make mistakes in the absolutely required
conversion process.  So for example:

SECTION:         results
Peak name   F1 [ppm]   F2 [ppm]   T1 [s]   fErr        errorScale   T1 sd [s]   R1 [rad.s-1]   R1 sd [rad.s-1]
Gln [2]     122.508    8.898      0.4560   0.0068944   2.2281       0.0030942   2.192982       0.0148805
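
In Python, the full conversion chain for this example row would simply
be (under my reading of fErr and the errorScale t above):

T1 = 0.4560
fErr, t = 0.0068944, 2.2281
sigma_T1 = fErr / t          # 0.0030942, the T1 sd column
R1 = 1.0 / T1                # 2.192982, the R1 column
sigma_R1 = sigma_T1 / T1**2  # 0.0148805, the R1 sd column
print(R1, sigma_R1)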


The pdf output of the PDC shows f and (t*fErr) but you only want f and
fErr. Therefore I have added an extra column for you, called it
errScale and put t into it. The column always contains the same number
since at the moment all peaks are fitted to the same model function and
there is one user-defined confidence level.

Most users won't know what to do with the fErr, or even how it is
useful, so I would recommend presenting both fErr and the basic sd.


This whole thing has nothing to do with NC_proc. NC_proc is for the
scaling of all data points of spectra. This is because Bruker
originally did the FFT in integer arithmetic. To use as many bits as
possible in the integers, the data are scaled up by 2**NC_proc. The
parameter NC_proc is stored and any software can do the reverse
calculation to make different spectra comparable.
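
As a sketch of that reverse calculation (the exact sign convention of
NC_proc should be checked against the Bruker documentation):

import numpy as np

nc_proc = 3                            # read from the procs parameter file
stored = np.array([4096, -1024, 512])  # integer data points as stored
true_values = stored / 2.0**nc_proc    # undo the 2**NC_proc scaling-up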

Ok, I was not sure how t and fErr were derived.  I'm quite familiar
with nc_proc and the ancient convention of the integer storage of the
ser file.  The scaling concept is much clearer with the formula
c = f +/- t*fErr.  I can see exactly what it is now.  fErr is the half
measure of 95% of the spread of the distribution, assuming a 95%
confidence limit.  And t is the parameter converting this back to the
standard deviation.


Monte Carlo:
We had an initial internal meeting and there were some shaking heads
here. I have furthermore searched the internet for other software
packages and articles, and what we offer as Monte Carlo is indeed what
many others also offer as Monte Carlo. We agreed to go into some more
details within the next few days, one colleague will do some Matlab
simulations and we will have another meeting then. So for the moment I
would not change anything in the PDC related to this topic.

One thing to be careful of is the direct calculation.  The R1 and R2
use optimisation.  But the NOE uses a direct calculation.  Hence the
back-calculated value for the NOE is the original value.  So if one
uses Monte Carlo simulations for the NOE, this collapses to the
bootstrap technique.
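
A tiny sketch of why this collapse happens (illustrative numbers): as
the NOE is calculated directly as I_sat/I_ref, the back-calculated
value is the measurement itself, so the synthetic data sets are
resampled around the measured data - which is Bootstrapping.

import numpy as np

i_sat, i_ref = 0.75, 1.00            # measured peak heights
sigma_sat, sigma_ref = 0.02, 0.02    # their errors
rng = np.random.default_rng(0)

# "Monte Carlo" around the back-calculated (= measured) values.
noes = [(i_sat + rng.normal(0.0, sigma_sat))
        / (i_ref + rng.normal(0.0, sigma_ref)) for _ in range(500)]
print("NOE = %.3f, sd = %.3f" % (i_sat / i_ref, np.std(noes, ddof=1)))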

Also please don't be tricked by similarity of numbers.  Comparing the
numbers from Bootstrapping versus those of Monte Carlo simulations is
of no use, as one doesn't know the real errors.  This is not how one
should judge Bootstrapping versus Monte Carlo simulations.  There is
one case where Bootstrapping can be used.  This is when the original
errors in peak heights or volumes are not known.  Then it can be used.
However it is still very bad, and techniques such as Jackknife
simulations have been developed to improve this situation and remove
the Bootstrap bias.

This is a complex topic of information theory, and numerical analysis
books will usually just state that Monte Carlo simulations are the
gold standard and that Bootstrapping is not the way to go.  A
relatively simple explanation is given in a short PDF file at
http://www.terry.uga.edu/~satkinso/8120/8120_montebs.pdf.  Here they
mention how the Jackknife technique is for removing the bias of the
Bootstrapping technique.  They also mention that bootstrapping is for
when there are no known original errors and that you use the residuals
of the fit, though this is one of the other Bootstrapping techniques
(this PDF talks about 3 of the many types).


People here accept that Marquardt's covariance method might cause
problems depending on the function and the actual data, but for the
relaxation functions it worked well and gave very similar results to
what you obtained. So it can't be that bad. It might be a more serious
problem for other functions of course.

I would consider it ok.  Note that the covariance matrix error
estimate and Levenberg-Marquardt are completely distinct concepts,
derived 100% independently and coming from different areas of
numerical analysis.  The covariance estimate is not ideal, but the
exponential functions are not very complex and there is not much
convolution between the fitted parameters.  This can be seen in the
matrix: the off-diagonal elements of the covariance matrix should be
much lower than the diagonal, i.e. there should not be much
covariance.  If this is the case, then the covariance estimate should
be ok.
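
As a sketch of this check (using scipy's covariance matrix estimate
and the same illustrative data as above, not the PDC's internals):

import numpy as np
from scipy.optimize import curve_fit

def decay(t, i0, r1):
    return i0 * np.exp(-r1 * t)

times = np.array([0.01, 0.05, 0.1, 0.2, 0.5, 1.0])
heights = np.array([0.98, 0.89, 0.80, 0.64, 0.33, 0.11])

popt, pcov = curve_fit(decay, times, heights, p0=[1.0, 2.0],
                       sigma=np.full(times.shape, 0.02),
                       absolute_sigma=True)

# Normalise to a correlation matrix: off-diagonal elements near zero
# mean little convolution between I0 and R1.
d = np.sqrt(np.diag(pcov))
print(pcov / np.outer(d, d))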


So, tell me if you want the other output files at the current status,
or whether there are small things that should be changed first.

It might be best to attach the files to https://gna.org/task/?7180
after things are sorted out a little.

Cheers,

Edward


