Dear Peter,

Please see below:

On 23 February 2011 12:33, Dr. Klaus-Peter Neidig <peter.neidig@xxxxxxxxxxxxxxxxx> wrote:
Dear Edward,

I will provide the other output file examples as soon as the format is accepted by you. We still have a few points open.

The relation between the dialog window and the output:

  dialog                            explanation                                     appears in output as
  error estimation by fit           all errors are unknown but the same for all y   "from fit using arbitrary y uncertainties"
  error estimation by weighted fit  errors as listed in the output are used         "from fit using calculated y uncertainties"
  error estimation by Monte Carlo   N fits with synthetically modified y values     "from Monte Carlo simulations"
Ah, ok, that explanation helps. Maybe the explanation could be expanded a little, as both "error estimation by weighted fit" and "error estimation by Monte Carlo" use the same errors from the "used integral error" section. The first appears to be the covariance matrix estimate. Is this correct? And what is the technique used in "error estimation by fit"? Is this where the target function is actually the SSE (sum of squared errors) rather than the chi2, so that the errors come from the residuals of the fit? I think the explanation should identify the exact source of the errors, the target function, and the precise technique used (i.e. covariance matrix, Monte Carlo, etc.).
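To be explicit about the two target functions I mean (a quick Python sketch, with hypothetical function names):

    # Sketch: the two candidate target functions. In the SSE case the
    # parameter errors would have to come from the fit residuals.
    import numpy as np

    def chi2(y, y_calc, sigma):
        return np.sum(((y - y_calc) / sigma) ** 2)

    def sse(y, y_calc):
        return np.sum((y - y_calc) ** 2)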
errorScale: Matlab and other software packages provide the fit results within confidence bounds, which depend on a confidence level and the number of degrees of freedom. The formula is c = f +/- t*fErr, where f is the fitted parameter value, fErr is the fit parameter error as coming out of Marquardt or MC, and t is calculated from the confidence level and the degrees of freedom.
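For illustration, such a t value can be reproduced with the standard two-sided Student's t quantile (a Python sketch only - I am assuming this is the quantile used, and the point and parameter counts below are example values):

    # Sketch: computing the confidence scale factor t, assuming the
    # standard two-sided Student's t quantile (not confirmed for the PDC).
    from scipy.stats import t as t_dist

    conf_level = 0.95  # user-defined confidence level
    n_points = 12      # example: number of relaxation delays
    n_params = 2       # example: I0 and T1 of the exponential
    dof = n_points - n_params

    # Two-sided interval c = f +/- t*fErr.
    t_val = t_dist.ppf(1.0 - (1.0 - conf_level) / 2.0, dof)
    print(t_val)  # 2.2281 for dof = 10

Note that for dof = 10 this reproduces the errorScale value of 2.2281 appearing in the example table below.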
This is too complex :S Most users will likely be confused by this - they will not know that the standard deviation (sd) of the relaxation rates is equal to t*fErr. They will be reporting fErr rather than the standard deviation they should be reporting. The error bars in R1 and R2 figures must be standard deviations (http://en.wikipedia.org/wiki/Error_bar), or alternatively the standard error. The reported relaxation rates also need to be the mean (from the fit) and the sd. All subsequent analyses require the sd and not fErr.

There is a much simpler alternative: simple error propagation (http://en.wikipedia.org/wiki/Propagation_of_uncertainty). This concept is much simpler and no confidence levels or degrees of freedom are used. The aim is to transfer the errors (sd) in the peak heights or volumes to the fitted R1 or R2 errors (sd). So using Art Palmer's notation from his 1991 paper, we are converting the peak height sd sigma_h to sigma_R1 or sigma_R2 via Monte Carlo simulations. We are going from the sd in the base data to the sd in the parameters. All current NMR software that I know of follows this simple error propagation principle. The standard error propagation technique, if a direct formula cannot be derived, is Monte Carlo simulation. Other lower quality techniques are sometimes used, such as Jackknife simulations and sometimes the very rough covariance matrix estimate.

The complexity used by the PDC is not necessary, and most users will not understand the errors they are presented with. If they are not using relax to read the PDC file and convert the times to rates and the fErr to a sd, this will be very bad for model-free analysis, reduced spectral density mapping, SRLS, etc. (ok, the word catastrophic would be more apt). Even excluding this confusion, most users will have a problem converting from the sd of the times Tx to the sd of the rates Rx, as required for all subsequent analyses! relax, Modelfree4, Dasha, and Tensor2 (the most commonly used programs) all require rates and the sd of the rates - they do not accept times.

Would it be possible to extend the PDC format in the results section to include not only all of the current columns but also the Tx standard deviation (t*fErr), the Rx value, and the Rx sd? If the user is presented with both the fErr value and the standard deviation, they will instantly recognise which is the correct one to use. Also presenting the rates and their errors:

  R1 = 1/T1
  sigma_R1 = sigma_T1 / T1^2

Then the user will not make mistakes in the absolutely required conversion process. So for example:

SECTION: results
Peak name  F1 [ppm]  F2 [ppm]  T1 [s]  fErr       errorScale  T1 sd [s]  R1 [rad.s-1]  R1 sd [rad.s-1]
Gln [2]    122.508   8.898     0.4560  0.0068944  2.2281      0.0030942  2.192982      0.0148805
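The Tx -> Rx conversion in this table is then just the direct propagation formula above (a quick Python check, using the numbers from the Gln [2] row):

    # Sketch: converting T1 and its sd to R1 and its sd by simple
    # error propagation, using the example row above.
    T1 = 0.4560           # fitted time constant [s]
    sigma_T1 = 0.0030942  # its standard deviation [s]

    R1 = 1.0 / T1                # = 2.192982...
    sigma_R1 = sigma_T1 / T1**2  # = 0.0148805...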
The pdf output of the PDC shows f and (t*fErr), but you only want f and fErr. Therefore I have added an extra column for you, called errScale, and put t into it. The column always contains the same number since, at the moment, all peaks are fitted to the same model function and there is one user-defined confidence level.
Most users won't know what to do with the fErr, or even how it is useful, so I would recommend presenting both fErr and the basic sd.
This whole thing has nothing to do with NC_proc. NC_proc is for the scaling of all data points of the spectra. This is because Bruker originally did the FFT in integer arithmetic. To use as many bits as possible in the integers, the data are scaled up by 2**NC_proc. The parameter NC_proc is stored, and any software can do the reverse calculation to make different spectra comparable.
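In code the reverse calculation is trivial (a sketch of my understanding of the convention, namely that the stored integer intensities are multiplied by 2**NC_proc to recover the true values):

    # Sketch: undoing the Bruker NC_proc scaling (assumed convention).
    nc_proc = -3           # example value from the procs file
    stored_value = 123456  # integer intensity from the processed spectrum

    true_value = stored_value * 2**nc_proc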
Ok, I was not sure how t and fErr were derived. I'm quite familiar with nc_proc and the ancient convention of the integer storage of the ser file. The scaling concept is much clearer with the formula c = f +/- t*fErr; I can see exactly what it is now. fErr is the half-width covering 95% of the spread of the distribution, assuming a 95% confidence limit, and t is the parameter converting this back to the standard deviation.
Monte Carlo: We had an initial internal meeting and there were some shaking heads here. I have also searched the internet for other software packages and articles, and what we offer as Monte Carlo is indeed what many others offer as Monte Carlo as well. We agreed to go into some more detail within the next few days; one colleague will do some Matlab simulations and we will have another meeting then. So for the moment I would not change anything in the PDC related to this topic.
One thing to be careful of is the direct calculation. The R1 and R2 fits use optimisation, but the NOE uses a direct calculation. Hence the back-calculated value for the NOE is the original value, so if one uses Monte Carlo simulations for the NOE, this collapses to the bootstrap technique.

Also, please don't be tricked by similarity of numbers. Comparing the numbers from bootstrapping versus those of Monte Carlo simulations is of no use, as one doesn't know the real errors. This is not how one should judge bootstrapping versus Monte Carlo simulations.

There is one case where bootstrapping can be used: when the original errors in the peak heights or volumes are not known. However, it is still very bad, and techniques such as Jackknife simulations have been developed to improve this situation and remove the bootstrap bias. This is a complex topic of information theory, and numerical analysis books will usually just state that Monte Carlo simulation is the gold standard and that bootstrapping is not the way to go. A relatively simple explanation is given in a short PDF file at http://www.terry.uga.edu/~satkinso/8120/8120_montebs.pdf. Here they mention how the Jackknife technique is for removing the bias of the bootstrapping technique. They also mention that bootstrapping is for when there are no known original errors and that you use the residuals of the fit, though this is just one of the many bootstrapping variants (the PDF talks about 3 of them).
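To make the distinction concrete, here is a rough Python sketch of the two techniques for an exponential fit (all data and names are invented; Monte Carlo randomises the back-calculated values using the known sigma_h, while the residual bootstrap resamples the fit residuals):

    # Sketch: Monte Carlo simulation vs residual bootstrap for an
    # exponential fit. Illustrative only - data values are invented.
    import numpy as np
    from scipy.optimize import curve_fit

    def expo(t, I0, R1):
        return I0 * np.exp(-R1 * t)

    times = np.array([0.01, 0.05, 0.1, 0.2, 0.4, 0.8, 1.6])
    heights = np.array([0.98, 0.90, 0.82, 0.67, 0.45, 0.20, 0.04])
    sigma_h = 0.01  # measured peak height sd (known base data error)

    popt, _ = curve_fit(expo, times, heights, p0=[1.0, 1.0])
    back_calc = expo(times, *popt)
    rng = np.random.default_rng(0)
    N = 500

    # Monte Carlo: randomise the back-calculated values with the known
    # sigma_h, refit, and take the sd of the fitted parameters.
    mc = [curve_fit(expo, times,
                    back_calc + rng.normal(0.0, sigma_h, times.size),
                    p0=popt)[0][1] for _ in range(N)]
    sigma_R1_mc = np.std(mc, ddof=1)

    # Bootstrap (residual variant): resample the residuals of the fit.
    # Only for when sigma_h is unknown, and it carries a bias.
    resid = heights - back_calc
    bs = [curve_fit(expo, times,
                    back_calc + rng.choice(resid, times.size),
                    p0=popt)[0][1] for _ in range(N)]
    sigma_R1_bs = np.std(bs, ddof=1)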
People here accept that Marquardt's covariance method might cause problems depending on the function and the actual data, but for the relaxation functions it worked well and gave very similar results to what you obtained. So it can't be that bad. It might be a more serious problem for other functions, of course.
I would consider it ok. Note that the covariance matrix error estimate and Levenberg-Marquardt are completely distinct concepts, derived 100% independently and coming from different areas of numerical analysis. The covariance estimate is not ideal, but the exponential functions are not very complex and there is not much convolution between the fitted parameters. This can be seen in the matrix: the off-diagonal elements of the covariance matrix should be much lower than the diagonal elements, i.e. there should not be much covariance. If this is the case, then the covariance estimate should be ok.
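This check is easy to do from the weighted Jacobian (a sketch of the standard covariance estimate, not necessarily the exact PDC implementation):

    # Sketch: covariance matrix estimate (J^T W J)^-1 for the model
    # I0*exp(-R1*t), with W = diag(1/sigma^2).
    import numpy as np

    def covariance(t, sigma, I0, R1):
        e = np.exp(-R1 * t)
        J = np.column_stack([e, -I0 * t * e])  # d/dI0, d/dR1
        Jw = J / sigma[:, None]                # weighted Jacobian
        return np.linalg.inv(Jw.T @ Jw)

    # If the off-diagonal elements are much smaller than the diagonal
    # ones, the parameters show little covariance and the estimate is ok.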
So, tell me if you want the other output files at their current status, or whether there are small things that should be changed first.
It might be best to attach the files to https://gna.org/task/?7180 after things are sorted out a little.

Cheers,

Edward