Hi Edward,

after reading the mails again and checking the code again, I think it would be good to do another cross-check, perhaps best with an artificial example: suppose we have mixing times 10, 10, 10 (3 replicates), 20, 30, 30 (2 replicates), 40, 50 and spectra with 3 peaks.

A: I would take peak1, then check for the largest difference in intensity within the first 3 replicates and within the 2 replicates. The largest difference found is then assigned as a systematic error to peak1 at all mixing times. The procedure is repeated for peak2 and peak3. Most likely peak1, peak2 and peak3 then have different systematic errors. That would be like a worst-case scenario per peak and is what is currently active in the PDC.

B: The absolute worst scenario would be to check for the largest difference ever occurring and assign that to all 3 peaks at all mixing times. This we used in earlier versions of the PDC.

C: Your variance averaging would work as follows:
- loop over the first 3 replicates and calculate the variance for each peak,
- loop over the 2 replicates and calculate the variance for each peak,
- finally take the average over the 6 variances and assign this as a systematic error to all peaks at all mixing times.

Is this correct? I think A and C should be offered.
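
As a minimal Python sketch of options A and C for the artificial example above (the intensity values and the layout of the 'intensities' table are invented purely for illustration; this is not PDC or relax code):

    # Artificial example: mixing times 10, 10, 10 and 30, 30 are replicated,
    # three peaks.  All numbers are invented for illustration only.
    from statistics import variance

    # intensities[peak][mixing time] -> list of replicate measurements.
    intensities = {
        'peak1': {10: [1.00, 1.02, 0.99], 30: [0.61, 0.64]},
        'peak2': {10: [2.10, 2.05, 2.08], 30: [1.20, 1.22]},
        'peak3': {10: [0.50, 0.55, 0.52], 30: [0.30, 0.33]},
    }

    # Option A: per-peak worst case - the largest replicate spread seen for
    # a peak is assigned as its systematic error at all mixing times.
    error_A = {}
    for peak, groups in intensities.items():
        error_A[peak] = max(max(reps) - min(reps) for reps in groups.values())

    # Option C: variance averaging - one variance per peak per replicated
    # mixing time (3 peaks x 2 replicated times = 6 variances here), then the
    # mean of those variances; its square root would serve as a single
    # sdev-type error for all peaks at all mixing times.
    variances = [variance(reps) for groups in intensities.values()
                 for reps in groups.values()]
    error_C = (sum(variances) / len(variances)) ** 0.5

    print(error_A)
    print(error_C)
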
After implementing this I see in my data that C produces much larger errors than A, and the errors of the fitted parameters get much larger, sometimes larger than the values themselves. This does not seem acceptable.

In the meantime I also spoke to Wolfgang Bermel. My statement that peaks with longer T2 seem to vary more between replicates is not obvious to him. When looking at the data it turned out that I had looked at peaks that not only have longer T2 but also have larger resonance offsets. Wolfgang suspects that there might be an offset effect. The statement that spectra taken at longer mixing times show a lower base-plane noise seems not to be true in general. The trend I had seen in one of the spectra is not fully convincing, so we checked a couple of other spectra, all taken with the latest release pulse programs. With those data we could not really confirm the trend. We concluded to keep an eye on that from now on.

Best regards,
Peter

On 11/24/2010 3:11 PM, Edward d'Auvergne wrote:

On 24 November 2010 10:31, Dr. Klaus-Peter Neidig <peter.neidig@xxxxxxxxxxxxxxxxx> wrote:

Dear Edward,

I talked to some colleagues at Bruker to discuss your various items. In general there is agreement that being as consistent with relax as possible is a good idea. However, other software at Bruker must not be affected (some of the code is used in multiple areas), and changes that cause too much effort and are already documented and in use should be avoided. Enhancing the export functionality to provide more information to relax is regarded as a good way to go.

Hi,

Yes, it would be a bad idea to make big changes. Interfacing with relax will be quite straightforward and probably no changes will be required from Bruker's side.

In detail:

Do the users define the signal-free region? For example there is usually more noise in the random coil part of the spectrum due to unwanted junk in the sample, so the peaks in this region have larger errors. What about signals close to the water signal? Is it possible to have different errors for these peaks?

No, the PDC does currently not allow assigning different errors to different peaks. We take several regions of the spectrum automatically, e.g. close to the 4 corners, and finally take the one with the lowest noise. We have to rely on some assumptions such as proper baseline correction, a reasonably clean sample and careful post-processing, if any.

Ah, ok. We will have to make sure users manually exclude spins with signals near the water region. Again this is an easy thing to solve.

I don't understand the origin of this scaling. Since Art Palmer's seminal 1991 publication (http://dx.doi.org/10.1021/ja00012a001), the RMSD of the pure base-plane noise has been used for the standard deviation. The key text is on page 4375:

Thanks for the various references. The factor we used so far is empirical, but no problem, I will remove it. Other pieces of software at Bruker that use such factors are not affected. I have already run some tests; the fitted parameters do not really change, but of course the errors of the fitted parameters do. This needs to be documented for the user.

Thank you. In a model-free analysis the accuracy of the errors is just as important as, or maybe even more important than, the data itself, so this will make a big difference.

From my work a long time ago, I noticed that the spectral errors decreased with an increasing relaxation period. This can be taken into account in relax if all spectra are duplicated/triplicated/etc. But if not, then the errors for all spectra are averaged (using variance averaging, not standard deviation averaging). For a single duplicated/triplicated spectrum, the error is taken as the average variance of each peak. So when some, but not all, spectra are replicated, there will be one error for all spin systems at all relaxation periods. This sounds similar to what the PDC does, but is it exactly the same?

I think the Bruker internal opinion was different, in the sense that one should either have enough replicates (which never happens) and do sdev averaging, or just provide an estimate of the systematic error. The change in intensity obviously depends on the peaks. We have seen peaks with longer T2 varying much more than those with shorter T2. We concluded to assume a worst-case error scenario and check for the largest difference of all peak intensities at replicated mixing times. This error was then applied (added) to all peaks at all mixing times. I should also mention that in the PDC the user has the option to replace the peak intensities/integrals of replicated mixing times by their mean, so that each mixing time occurs only once in the fit. This yields slightly different results. The request for implementing this came from our application people who obviously talked to their customers.

Having enough replicates would never be possible for properly determining errors, so they must be crudely estimated. The maximum error would mean that the errors would probably never be underestimated. But for downstream analysis, underestimating some errors (and overestimating others) would be better as no bias would be introduced into the errors. Using the maximum error applied to all peaks will actually result in molecules appearing more rigid than they are! The result will be the hiding of many motional processes. So the second option would be better. Even better would be to use an average error rather than the maximum, by averaging variances (sdev values cannot be averaged as they cannot be summed, but variances sum). Is the peak intensity error located in the PDC files? I cannot see it in our current testing file (https://gna.org/file/testT2.txt?file_id=10641).

Since changing the error estimation of replicates could influence the results significantly, we do not want to do it blindly. The proposal would be to offer several options (variance averaging, sdev averaging, worst case estimate) ...

The sdev averaging here should be avoided as it statistically should not be done. For example, from Wikipedia: "The standard deviation of the sum of two random variables can be related to their individual standard deviations and the covariance between them: stdev(X + Y) = sqrt(var(X) + var(Y) + 2*cov(X,Y)), where var = stdev^2 and cov stand for variance and covariance, respectively." With the covariance being zero, two equal errors combine to sqrt(2)*stdev(X) rather than 2*stdev(X), so it is the variances, not the standard deviations, that can be summed and hence averaged. Other examples are the messages in the thread http://mathforum.org/kb/message.jspa?messageID=1602446&tstart=0.
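
As a quick numerical check of this point (a minimal sketch with random numbers only, nothing taken from the PDC or relax):

    # Two independent Gaussian errors of size 2 and 3: the variances add,
    # the standard deviations do not.
    import random
    from statistics import stdev, variance

    random.seed(0)
    X = [random.gauss(0.0, 2.0) for _ in range(10000)]
    Y = [random.gauss(0.0, 3.0) for _ in range(10000)]

    sum_XY = [x + y for x, y in zip(X, Y)]
    print(stdev(sum_XY))                        # ~ 3.6 = sqrt(2**2 + 3**2)
    print((variance(X) + variance(Y)) ** 0.5)   # ~ 3.6, variances sum
    print(stdev(X) + stdev(Y))                  # ~ 5.0, sdev values do not sum
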
Having a number of options (variance averaging, worst case estimate, etc.) would be great to have. Could the user's choice be recorded in the PDC file? I would like relax to give the user a detailed warning about hiding dynamics if they choose the worst case estimate. The worst case estimate would nevertheless be quite useful, so I would not suggest removing it.

... but advise to use variance averaging in the documentation. Your remark about decreasing spectral errors with increasing mixing times is correct, I see this in all the data I have. The peak intensities/integrals however decrease much more, so the relative error of each increases.

This is a bit off topic, but would you know if this has something to do with the total power of the spectrum? Maybe Wolfgang Bermel's new pulse sequences, with both the temperature compensation block and single-scan interleaving, will eliminate this strange effect.

Does peak intensity mean peak height? In relax, the term peak intensity is inclusive of volume, height, or any other measure of the peak. For relaxation data with proper temperature calibration and proper temperature compensation, peak height is now often considered the best method. Peter Wright published a while back that volumes are more accurate, but with the temperature consistency improvements the problems Peter encountered are all solved, and hence heights are better as overlap problems and other spectral issues are avoided. Is there a way of automatically determining the method used from the PDC file, because in the end the user will have to specify this for BMRB data deposition? This is not so important though, the user can manually specify this if needed.

Yes, intensity === height. The integration method used is part of the PDF report; I just see that it is however not included in the export (text or xls). I will change that of course. The problem with peak heights is if peaks overlap, say one peak is hidden in the shoulder of another. Such peaks usually have to be picked by hand, and then peak deconvolution would be preferable. We keep on offering our options but advise in the handbook that peak intensities would be preferred.

We could have relax read the integration method so that this is automatically placed in the BMRB deposition file for the user. Actually, maybe it would be an idea to have relax store the entire PDC file and have this deposited within one of the BMRB method saveframes? relax allows the user to do whatever they wish (the GUI interface does not though), but if there is significant overlap, that data should not be used. This is all up to the user though.

The covariance analysis is known to be the lowest quality of all the error estimates. For an exponential fit, I have not checked how non-quadratic the minimum is, so I don't know the performance of the method. Do the errors from that compare to those from the area integral together with the base-plane RMSD?

I can't remember exactly, but I think Art Palmer published something different to this, and that is why his curvefit program uses MC as well as Jackknife simulation. Anyway, if the minimum has a perfectly quadratic curvature, then the covariance matrix analysis and MC will give identical results - the methods converge. But as the minimum deviates from quadratic, the performance of the covariance matrix method drops off quickly. MC simulations are the gold standard for error propagation when the direct error equation is too complex to derive. But again I don't know how non-quadratic the minimum is in the exponential optimisation space, so I don't know how well the covariance matrix performs. The difference in techniques might be seen for very weak exchanging peaks with large errors, or in special cases such as when the spectra are not properly processed and there are truncation artifacts.
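
For illustration only, a minimal sketch of one common Monte Carlo error propagation scheme for a two-parameter exponential fit, compared with the covariance matrix estimate, using scipy's curve_fit; the mixing times, noise level and simulation count are invented and this is not the relax or curvefit implementation:

    import numpy as np
    from scipy.optimize import curve_fit

    def expo(t, I0, R):
        """Two-parameter exponential decay."""
        return I0 * np.exp(-R * t)

    times = np.array([10., 20., 30., 40., 50.])   # mixing times (invented)
    sigma = 0.02                                  # peak intensity error (invented)
    rng = np.random.default_rng(0)
    data = expo(times, 1.0, 0.04) + rng.normal(0.0, sigma, times.size)

    # Fit the measured curve once to get the best-fit parameters and the
    # covariance matrix estimate.
    popt, pcov = curve_fit(expo, times, data, p0=[1.0, 0.05],
                           sigma=np.full(times.size, sigma), absolute_sigma=True)

    # Monte Carlo: re-fit synthetic data sets built from the back-calculated
    # curve plus Gaussian noise of size sigma; the sdev of the fitted
    # parameters over all simulations gives the parameter errors.
    back_calc = expo(times, *popt)
    sims = []
    for _ in range(500):
        synth = back_calc + rng.normal(0.0, sigma, times.size)
        p_sim, _ = curve_fit(expo, times, synth, p0=popt)
        sims.append(p_sim)
    mc_err = np.std(sims, axis=0)

    cov_err = np.sqrt(np.diag(pcov))   # covariance matrix errors for comparison
    print(popt, mc_err, cov_err)
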
With my data the errors obtained from LM (taking errorY into account) and MC are pretty close. If a typical T1 is 0.473, the errors are 0.0026 and 0.0025. The base-plane rms decreases with mixing time; the relative errors for the given examples are 0.000386, ... 0.00651, ... 0.003124.

I haven't tested this personally, and just followed Art Palmer's and others' lead when I implemented this fitting and used the gold standard Monte Carlo simulations. It could be that the covariance matrix technique performs just as well as MC simulations for all exponential curve-fitting - the equations are not particularly complex. Maybe the user's choice of covariance matrix estimate or Monte Carlo simulations could be recorded in the PDC file so that this information is also deposited in the BMRB? I would have to talk to Eldon Ulrich at the BMRB about adding some new tags for that.

What is this factor and why is it used? The MC simulation errors should be exact, assuming the input peak intensity errors are exact estimates. MC simulations propagate the errors through the non-linear space, and hence these errors are perfectly exact. Is there a reference for this post-MC simulation scaling? Is it optional? For relax, it would be good to know both this scaling factor and the one mentioned above for the peak intensity error scaling, so that both can be removed prior to analysis. These will impact a model-free analysis in a very negative way.

In many applications people do not just want a single error but an error range they can rely on with a certain confidence. Also, they would like to account for the fact that different numbers of peaks may be used for different fits (sometimes not all peaks are picked in all planes, or the user removes points from the fitted curve interactively). This is why I implemented the same as Matlab typically does. I will add an extra column to the output and list the factor (typically ~2) for each fit. This way you can just recalculate the original errors.
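
If the factor is a two-sided Student-t value for a 95% confidence level, as Matlab-style confidence intervals typically use (this is an assumption, not a documented PDC detail), it could be recomputed and divided out roughly like this:

    # Sketch only: recover a plain one-sigma error from a confidence-scaled
    # error, assuming a 95% Student-t factor with dof = points - parameters.
    from scipy.stats import t

    def confidence_factor(n_points, n_params=2, level=0.95):
        """Two-sided Student-t factor for the given confidence level."""
        dof = n_points - n_params
        return t.ppf(0.5 + level / 2.0, dof)

    factor = confidence_factor(8)          # 8 mixing times, 2 parameters -> ~2.45
    one_sigma_error = 0.0064 / factor      # 0.0064 is an invented exported error
    print(factor, one_sigma_error)
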
This would be perfect! Cheers.

Certainly, a model-free analysis will be influenced; whether in a negative way or not I don't know. What I have seen so far (limited experience however) is that if the errors are too tight, the modelling calculates forever and often finds solutions with such large errors that they are practically unusable. I must admit however that the PDC does currently not allow the use of multiple fields.

In a model-free analysis, underestimating errors has the strange effect of introducing artificial motions! Overestimating causes real motions to disappear. Error estimates are incredibly important.

Ah, historic reasons. We will be able to handle both in relax. But if the PDC presents T1 and T2 plots, and these values are in the PDC output file, the user is highly likely to then use these in publications. Some people prefer T1 and T2 plots, but these are an exception in publications. This will likely lead to a convention change in the literature (for Bruker users only) which will make it harder to compare systems. It would be good to consider this effect.

The Bruker people seem to regard this as a minor issue for various reasons. One is that plots for publications will probably be made with more advanced tools like Excel, and it is no effort to switch between T and R. Another argument is that people who really want to compare details need to compare numbers anyway. I will check if there is a simple way to allow the user to switch between the presentations, but currently we would keep the T displays and outputs.

Ok, that makes sense. relax uses xmgrace to display the relaxation rates (never the rotational relaxation times though). With BMRB data deposition very close to being complete within relax, maybe more data will be placed in the database and not just shown as a low resolution T1 or T2 plot. In many currently published papers, unfortunately, comparing low resolution graphs is the only way. This is the purpose of the relax-BMRB collaboration.

This all gives me some work to do and of course some testing. I hope to do everything this week and then send new export files to you for inspection. If accepted I will proceed to version PDC 1.1.3.

Cheers. Michael Bieri has created a relax branch for implementing the reading of the file. Having newer files attached to https://gna.org/task/?7180 will help in our implementation.

Thank you,

Edward

--
Bruker BioSpin