RE: Addition of PDC support to relax 1.3.10 and an inconsistency in the error calculation.




Posted by Neidig, Klaus-Peter on February 23, 2011 - 19:33:
Dear Edward,

in short: 

the first 2 fit options (by fit, by weighted fit) both use the covariance
technique.  The difference is that the first option applies an additional
internal scaling by: final chi / number of degrees of freedom.
This means that when doing modelling of (NOE, T1, T2) with (M1-M5, TM1-TM5)
it cannot be used if these triples of numbers are only available at one
field and the model contains 3 parameters.  But in that area the PDC user
does not have a selection of fit methods anyway; we internally use a
combination of simplex and Marquardt.  But this is another area and should
be done by relax anyway.

I pick up your statements about the confidence stuff.  Indeed you are
right that everything would be much simpler if the PDC did just output Rx
instead of Tx and the corresponding error.  Whether to keep the confidence
stuff at all is then a question; in the meantime I think I follow your
opinion and would rather give it up.  But this I have to discuss
internally here.  It might be that I'm not allowed to just give it up;
then I would have to extend the output with more columns or a further
section which you could read instead of the current section.

I also have to say that my developments on the PDC must end by April (ENC)
and this includes updating the manual, the release procedure and some
internal developments.  So there is not too much time left.  The PDC
release number will then go up to 1.2.  After April I can still do
maintenance work but no major changes, at least until someone decides
otherwise.

Best regards,
Peter

Dr. Klaus-Peter Neidig
Software Development / Head of Analysis Group

Bruker BioSpin GmbH
Silberstreifen
76287 Rheinstetten, Germany
Phone: +49 721 5161-6447
Fax:     +49 721 5161-6480

                                                               
Peter.Neidig@xxxxxxxxxxxxxxxxx
www.bruker.com

Bruker BioSpin GmbH: Sitz der Gesellschaft/Registered Office: Rheinstetten, 
HRB 102368 Amtsgericht Mannheim
Geschäftsführer/Managing Directors: Jörg Laukien, Dr. Bernd Gewiese, Dr. 
Gerhard Roth


This message and any attachments may contain trade secrets or privileged, 
undisclosed or otherwise confidential information. If you have received this 
e-mail in error, you are hereby notified that any review, copying or 
distribution of it and its attachments is strictly prohibited. Please inform 
the sender immediately and delete/destroy the original message and any 
copies. Thank you.
________________________________________
From: edward.dauvergne@xxxxxxxxx [edward.dauvergne@xxxxxxxxx] On Behalf Of 
Edward d'Auvergne [edward@xxxxxxxxxxxxx]
Sent: Wednesday, February 23, 2011 3:46 PM
To: Neidig, Klaus-Peter
Cc: Neidig, Klaus-Peter; relax-devel@xxxxxxx; Michael Bieri
Subject: Re: Addition of PDC support to relax 1.3.10 and an inconsistency in 
the error calculation.

Dear Peter,

Please see below:


On 23 February 2011 12:33, Dr. Klaus-Peter Neidig
<peter.neidig@xxxxxxxxxxxxxxxxx> wrote:

Dear Edward,

I will provide the other output file examples as soon as the format is
accepted by you.  We still have a few points open.

The relation between dialog window and output:

dialog                              explanation                                      appears in output as
error estimation by fit             all errors are unknown but the same for all y    "from fit using arbitrary y uncertainties"
error estimation by weighted fit    errors as listed in the output are used          "from fit using calculated y uncertainties"
error estimation by Monte Carlo     N fits with synthetically modified y values      "from Monte Carlo simulations"

Ah, ok, that explanation helps.  Maybe the explanation could be
expanded a little as both "error estimation by weighted fit" and
"error estimation by Monte Carlo" use the same errors from the "used
integral error" section.  The first appears to be the covariance
matrix estimate.  Is this correct?  What is the technique used in
"error estimation by fit"?  Is this where the target function is
actually the SSE (sum of squared errors) rather than the chi2, so that
the errors are coming from the residuals of the fit?  I think the
explanation should identify the exact source of the errors, the target
function, and the precise technique used (i.e. covariance matrix,
Monte Carlo, etc.).


errorScale:
Matlab and other software packages  provide the fit results within 
confidence bounds which depend on a confidence
level and number of degrees of freedom. The formula is c  = f +/- t* fErr. 
f would be the fitted parameter value,
fErr is the fit parameter error as coming out of Marquardt or MC, t is 
calculated.

This is too complex :S  Most users will likely be confused by this -
they will not know that the standard deviation (sd) on the relaxation
rates is equal to t*fErr.  They will be reporting fErr rather than the
standard deviation as they should.  The error bars in R1 and R2
figures must be standard deviations
(http://en.wikipedia.org/wiki/Error_bar), or alternatively the
standard error.  The reported relaxation rates also need to be the
mean (from the fit) and sd.  All subsequent analyses require the sd
and not fErr.

There is a much simpler alternative.  It is simple error propagation
(http://en.wikipedia.org/wiki/Propagation_of_uncertainty).  This
concept is much simpler and there are no confidence levels or degrees
of freedom used.  The aim is to transfer the errors (sd) in the peak
heights or volumes to the fitted R1 or R2 errors (sd).  So using Art
Palmer's notation from his 1991 paper, we are converting the peak
height sd sigma_h to sigma_R1 or sigma_R2 via Monte Carlo simulations.
We are going from the sd in the base data to the sd in the parameters.

All current NMR software that I know of follows this simple error
propagation principle.  The standard error propagation technique, if a
direct formula cannot be derived, is Monte Carlo simulations.  Other
lower quality techniques are sometimes used such as Jackknife
simulations and sometimes the very rough covariance matrix estimate.
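The Monte Carlo propagation described above can be sketched as follows (a self-contained toy example with synthetic data and a simple log-linear fit; real software such as relax uses proper non-linear optimisation, and all delay times, intensities, and error values here are made up):

```python
import numpy as np

# Monte Carlo propagation of the peak-height error (sigma_h) into the
# fitted R1 error (sigma_R1) for a mono-exponential decay
# I(t) = I0 * exp(-R1 * t).  All values below are synthetic.
rng = np.random.default_rng(0)
times = np.array([0.01, 0.05, 0.1, 0.2, 0.5, 1.0])   # relaxation delays [s]
I0, R1_true, sigma_h = 1.0e6, 2.2, 1.0e4
heights = I0 * np.exp(-R1_true * times)              # back-calculated data

def fit_R1(y, t):
    # Log-linear least squares: ln(y) = ln(I0) - R1 * t.
    slope, _ = np.polyfit(t, np.log(y), 1)
    return -slope

R1_fit = fit_R1(heights, times)

# Randomise the back-calculated heights with sigma_h and refit each time;
# the sd of the fitted values is the propagated parameter error sigma_R1.
sims = [fit_R1(heights + rng.normal(0.0, sigma_h, heights.size), times)
        for _ in range(500)]
sigma_R1 = np.std(sims)
```

This goes directly from the sd in the base data (sigma_h) to the sd in the parameter (sigma_R1), with no confidence levels involved.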

The complexity used by the PDC is not necessary and most users will
not understand the errors they are presented with.  If they are not
using relax to read the PDC file and convert the times to rates and
the fErr to a sd, this will be very bad for model-free analysis,
reduced spectral density mapping, SRLS, etc. (ok, the word
catastrophic would be more apt).  Excluding this confusion, most users
will even have a problem converting from the sd of the times Tx to the
sd of the rates Rx, as required for all subsequent analyses!  relax,
Modelfree4, Dasha, and Tensor2 (the most commonly used programs) all
require rates and the sd of the rates - they do not accept times.

Would it be possible to extend the PDC format in the results section
to include not only all of the current columns but to also add the Tx
standard deviation (t*fErr), the Rx value, and the Rx sd?  If the user
is presented with both the fErr value and the standard deviation, they
will instantly recognise which is the correct one to use.  Also presenting the
rates and their errors:

R1 = 1/T1
sigma_R1 = sigma_T1 / T1^2

Then the user will not make mistakes in the absolutely required
conversion process.  So for example:

SECTION:         results
Peak name   F1 [ppm]   F2 [ppm]   T1 [s]   fErr        errorScale   T1 sd [s]   R1 [rad.s-1]   R1 sd [rad.s-1]
Gln [2]     122.508    8.898      0.4560   0.0068944   2.2281       0.0030942   2.192982       0.0148805
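The conversion can be sketched in a few lines of Python, using the numbers from the example row (the function name is hypothetical):

```python
# Simple error propagation from a fitted time constant to the rate:
# R1 = 1/T1 and sigma_R1 = sigma_T1 / T1^2.  The function name is
# hypothetical; the numbers are taken from the example row above.
def time_to_rate(T, sigma_T):
    R = 1.0 / T
    sigma_R = sigma_T / T**2
    return R, sigma_R

R1, sigma_R1 = time_to_rate(0.4560, 0.0030942)
print(R1, sigma_R1)   # → 2.192982..., 0.014880...
```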


The pdf output of the PDC shows f and (t*fErr) but you only want f and
fErr.  Therefore I have added an extra column for you, called it
errorScale, and put t into it.  The column always contains the same
number since at the moment all peaks are fitted to the same model
function and there is one user-defined confidence level.

Most users won't know what to do with the fErr, or even how it is
useful, so I would recommend presenting both fErr and the basic sd.


This whole thing has nothing to do with NC_proc.  NC_proc is for the
scaling of all datapoints of spectra.  This is because Bruker originally
did the FFT in integer arithmetic.  To use as many bits as possible in
the integers the data are scaled up by 2**NC_proc.  The parameter NC_proc
is stored and any software can do the reverse calculation to make
different spectra comparable.

Ok, I was not sure how t and fErr were derived.  I'm quite familiar
with nc_proc and the ancient convention of the integer storage of the
ser file.  The scaling concept is much clearer with the formula c  = f
+/- t* fErr.  I can see exactly what it is now.  fErr is the half
measure of 95% of the spread of the distribution, assuming a 95%
confidence limit.  And t is the parameter converting this back to the
standard deviation.
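The t factor can be reproduced under the assumption that errorScale is the two-sided Student-t critical value for the chosen confidence level and the fit's degrees of freedom; the 2.2281 of the example output matches a 95% level with 10 degrees of freedom (both assumed here; a sketch requiring scipy):

```python
from scipy.stats import t as student_t

# Sketch: reproduce the errorScale value under the assumption that it is
# the two-sided Student-t critical value for the user-defined confidence
# level and the fit's degrees of freedom.  95% with 10 degrees of freedom
# (both assumed here) gives the 2.2281 of the example output.
conf_level, dof = 0.95, 10
t_crit = student_t.ppf(0.5 + conf_level / 2.0, dof)
print(round(t_crit, 4))   # → 2.2281
```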


Monte Carlo:
We had an initial internal meeting and there were some shaking heads
here.  I have furthermore searched the internet for other software
packages and articles, and what we offer as Monte Carlo is indeed what
many others offer as Monte Carlo as well.  We agreed to go into some more
detail within the next few days; one colleague will do some Matlab
simulations and we will have another meeting then.  So for the moment I
would not change anything in the PDC related to this topic.

One thing to be careful of is the direct calculation.  The R1 and R2
fits use optimisation, but the NOE uses a direct calculation.  Hence the
back-calculated value for the NOE is the original value.  So if one
uses Monte Carlo simulations for the NOE, this collapses to the
bootstrap technique.

Also please don't be tricked by similarity of numbers.  Comparing the
numbers from Bootstrapping versus those of Monte Carlo simulations is
of no use, as one doesn't know the real errors.  This is not how one
should judge Bootstrapping versus Monte Carlo simulations.  There is
one case where Bootstrapping can be used: when the original errors in
peak heights or volumes are not known.  However it is still very bad,
and techniques such as Jackknife simulations have been developed to
improve this situation and remove the Bootstrap bias.
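As an illustration of the Jackknife idea mentioned above, a minimal leave-one-out sketch (synthetic data with made-up noise and delays; not the PDC's implementation):

```python
import numpy as np

# Leave-one-out Jackknife estimate of the R1 error for a mono-exponential
# fit.  Synthetic data; the noise level and delay times are made up.
rng = np.random.default_rng(1)
times = np.array([0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 1.5, 2.0])
heights = 1.0e6 * np.exp(-2.2 * times) + rng.normal(0.0, 1.0e3, times.size)

def fit_R1(y, t):
    # Log-linear least squares: ln(y) = ln(I0) - R1 * t.
    slope, _ = np.polyfit(t, np.log(y), 1)
    return -slope

# Refit with each data point deleted in turn.
n = times.size
loo = np.array([fit_R1(np.delete(heights, i), np.delete(times, i))
                for i in range(n)])

# Standard Jackknife variance formula.
sigma_R1_jk = np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))
```

Unlike Monte Carlo, this needs no knowledge of the peak-height errors, which is exactly the situation where such resampling techniques are used.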

This is a complex topic of information theory, and numerical analysis
books will usually just state that Monte Carlo simulations are the gold
standard and that Bootstrapping is not the way to go.  A relatively
simple explanation is given in a short PDF file at
http://www.terry.uga.edu/~satkinso/8120/8120_montebs.pdf.  Here they
mention how the Jackknife technique is for removing the bias of the
Bootstrapping technique.  They also mention that bootstrapping is for
when there are no known original errors and that you use the residuals
of the fit, though this is one of the other Bootstrapping techniques
(this PDF talks about 3 of the many types).


People here accept that Marquardt's covariance method might cause
problems depending on the function and actual data, but for the
relaxation functions it worked well and gave very similar results to
what you obtained.  So it can't be that bad.  It might be a more serious
problem for other functions of course.

I would consider it ok.  Note that the covariance matrix error
estimate and Levenberg-Marquardt are completely distinct concepts,
derived 100% independently and coming from different areas of
numerical analysis.  The covariance estimate is not ideal, but the
exponential functions are not very complex and there is not much
convolution between the fitted parameters.  This can be seen in the
matrix: the off-diagonal elements of the covariance matrix should be
much lower than the diagonal, i.e. there should not be much
covariance.  If this is the case, then the covariance estimate should
be ok.
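The diagnostic above can be sketched numerically for a two-parameter exponential fit (a minimal illustration; the delay times, intensity, and error values are hypothetical):

```python
import numpy as np

# Covariance-matrix error estimate for the two-parameter model
# I(t) = I0 * exp(-R * t): cov = inv(J^T W J), with J the Jacobian at the
# fitted parameters and W the diagonal weights 1/sigma^2.  All values
# below are hypothetical.
times = np.array([0.01, 0.05, 0.1, 0.2, 0.5, 1.0])
I0, R, sigma = 1.0e6, 2.2, 1.0e4

decay = np.exp(-R * times)
# Jacobian columns: d(model)/d(I0) and d(model)/d(R).
J = np.column_stack([decay, -I0 * times * decay])
W = np.eye(times.size) / sigma**2
cov = np.linalg.inv(J.T @ W @ J)

# The diagnostic from the text: compare the off-diagonal element
# (parameter covariance) to the diagonal (parameter variances), e.g. via
# the normalised correlation coefficient.
correlation = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
sigma_R_estimate = np.sqrt(cov[1, 1])
```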


So, tell me if you want the other output files at the current status,
or are there small things that should be changed first.

It might be best to attach the files to https://gna.org/task/?7180
after things are sorted out a little.

Cheers,

Edward


