Re: [sr #3045] Support for pooled standard deviation for: Peak heights with partially replicated spectra



Posted by Edward d'Auvergne on June 19, 2013 - 17:11:
In code, it is easier to use the pipe_control.mol_res_spin.spin_loop()
generator function to get each spin container.  Then you look at the
spin.intensities dictionary and pull out the elements you like.
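
As a minimal sketch of the "pull out the elements you like" step, here
is how the intensities of one spin could be grouped by replicate set.
It assumes a plain Python dictionary standing in for spin.intensities
(in relax this would come from a spin container yielded by
spin_loop()).  The helper name group_by_replicates() and all intensity
values are invented for illustration.

```python
# Replicate sets as shown in the thread (cdp.replicates).
replicates = [['0_2', '7_2', '14_2'], ['15_14', '16_14', '20_14'],
              ['3_30', '8_30', '17_30'], ['9_46', '19_46', '22_46']]

# Hypothetical stand-in for one spin's spin.intensities dictionary,
# keyed by spectrum ID.  The numbers are invented for illustration.
intensities = {
    '0_2': 1000.0, '7_2': 1010.0, '14_2': 990.0,
    '15_14': 800.0, '16_14': 805.0, '20_14': 795.0,
    '3_30': 600.0, '8_30': 590.0, '17_30': 610.0,
    '9_46': 400.0, '19_46': 410.0, '22_46': 390.0,
}


def group_by_replicates(intensities, replicates):
    """Return one list of intensities per replicate set, in set order."""
    return [[intensities[spec_id] for spec_id in rep_set]
            for rep_set in replicates]


grouped = group_by_replicates(intensities, replicates)
for rep_set, values in zip(replicates, grouped):
    print(rep_set, values)
```

In a real relax session the same grouping would be applied to each spin
container in turn inside the spin_loop() iteration.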

However, with the replicates [['0_2', '7_2', '14_2'],
['15_14', '16_14', '20_14'], ['3_30', '8_30', '17_30'], ['9_46',
'19_46', '22_46']], relax does not calculate 4 standard deviations for
a single peak and then average those 4 (via variance averaging) for
that one peak in isolation.  Instead relax will average the standard
deviation over all peaks (again via variance averaging) for just the
replicates ['0_2', '7_2', '14_2'], and then do the same for each of
the other 3 sets.  Hence there will be 4 standard deviation values in
total, each one shared by all spins for the spectra in that replicate
set.
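
A rough sketch of this per-replicate-set calculation, in plain Python
rather than relax's own code: the variance of each peak is estimated
across the replicates, the variances are averaged over all peaks, and
the square root gives the one SD for that set.  The function name
replicate_set_sd() and the two peaks with their intensities are
hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def replicate_set_sd(all_spin_intensities, rep_set):
    """SD for one replicate set, from the variance averaged over all peaks."""
    # Sample variance (n-1 denominator) of each peak across the replicates.
    peak_variances = [variance([peak[spec_id] for spec_id in rep_set])
                      for peak in all_spin_intensities]
    # Average the variances over all peaks, then take the square root.
    return sqrt(mean(peak_variances))


# Two hypothetical peaks; the intensities are invented for illustration.
peaks = [
    {'0_2': 1000.0, '7_2': 1010.0, '14_2': 990.0},
    {'0_2': 500.0, '7_2': 520.0, '14_2': 480.0},
]

sd = replicate_set_sd(peaks, ['0_2', '7_2', '14_2'])
print(sd)
```

Here the two per-peak variances are 100 and 400, so the averaged
variance is 250 and the SD for the set is sqrt(250).  Repeating the
call with each of the other replicate sets would give the 4 values.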

The second averaging then comes into play in relax if you have spectra
without replicates.  This averaging will produce 1 SD value for
absolutely everything!
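
This second step can be sketched as follows, assuming one variance per
replicate set has already been obtained by averaging over all peaks.
The name single_sd() and the variance values are invented for
illustration and are not taken from relax.

```python
from math import sqrt


def single_sd(set_variances):
    """One SD for all spectra: square root of the averaged variance."""
    return sqrt(sum(set_variances) / len(set_variances))


# Hypothetical per-replicate-set variances (each already averaged over peaks).
set_variances = [250.0, 180.0, 220.0, 150.0]

overall_sd = single_sd(set_variances)
print(overall_sd)
```

The averaged variance here is 200, so every spectrum, replicated or
not, would be assigned the same SD of sqrt(200).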

This is based on the fact that the statistics for a single peak, even
with the replicates [['0_2', '7_2', '14_2'], ['15_14', '16_14',
'20_14'], ['3_30', '8_30', '17_30'], ['9_46', '19_46', '22_46']], are
far too noisy and hence give a bad error estimate.  It is also based on
the fact that the peak height errors for all peaks should be roughly
equal to the RMSD of the baseplane noise.  The assumption that the
white noise of the baseplane maps directly to peak height is robust.
This was demonstrated by Art Palmer in his 1991 JACS paper, from
memory (distant memory so the reference may not be correct).  So, if
you can measure the baseplane RMSD accurately, measuring replicated
spectra is not necessary.

Some exceptions where peaks should not have the same height error are
in the presence of truncation artifacts or if there is overlap with a
small peak due to impurities in the sample.  But in these cases, the
problem is bias and not variance.  See
http://scott.fortmann-roe.com/docs/BiasVariance.html for a great
description.  Therefore no number of replicates can determine that
error (actually truncation artifacts both introduce bias and decrease
the variance).

I hope it is now clearer how relax handles the errors.

Regards,

Edward



On 19 June 2013 16:03, Troels Emtekær Linnet <tlinnet@xxxxxxxxx> wrote:
Can we take an example?

Can I easily loop over
replicates, and extract intensity from just one spin?

cdp.replicates
[['0_2', '7_2', '14_2'], ['15_14', '16_14', '20_14'], ['3_30', '8_30',
'17_30'], ['9_46', '19_46', '22_46']]


Troels Emtekær Linnet


2013/6/19 Edward d'Auvergne <edward@xxxxxxxxxxxxx>

And again I got it wrong :)  It's not the averaged variance divided by
k!  It's the sum of variances divided by k, i.e. the average variance.
Therefore, as all peaks are either duplicated, triplicated, etc., in
all cases the pooled variance collapses down to the average
variance.  So as I see it, the pooled variance for replicated
spectra is redundant.  Or am I wrong again?
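
A small numerical check of this point, using the standard pooled
variance form from the Wikipedia page linked in this thread, with the
sum of (n_i - 1) as the denominator.  The function names and the
variance values are made up for illustration.

```python
def pooled_variance(variances, counts):
    """Pooled variance: sum((n_i - 1) * s_i^2) / sum(n_i - 1)."""
    numerator = sum((n - 1) * s2 for s2, n in zip(variances, counts))
    denominator = sum(n - 1 for n in counts)
    return numerator / denominator


def average_variance(variances):
    """Plain average of the variances (the sum divided by k)."""
    return sum(variances) / len(variances)


variances = [100.0, 400.0, 250.0]   # invented per-peak variances

# Every peak measured in triplicate: the weights (n_i - 1) cancel.
equal = pooled_variance(variances, [3, 3, 3])
print(equal, average_variance(variances))

# Unequal replication: the pooled value no longer matches the average.
unequal = pooled_variance(variances, [2, 3, 4])
print(unequal)
```

With equal replication both estimators give 250, whereas the mixed
case gives 275.  So the two only differ when the number of replicates
is not the same for all of the variances being combined.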

Regards,

Edward



On 19 June 2013 15:27, Edward d'Auvergne <edward@xxxxxxxxxxxxx> wrote:
Just wait, I got that wrong!  First some definitions:

n = number of replicated spectra,
i = each spin system or peak,
k = total number of peaks.

The sum at the bottom of the formula would mean that if n-1=1, then the
denominator would be a large sum.  As n-1 would be the same for all i,
then this becomes the averaged variance divided by k!  Therefore as
you have more and more peaks in the spectrum, the smaller and smaller
the estimator will be.  Taking this to the extreme, as you approach
infinite peaks in the spectrum, the error approaches zero.  That seems
absurd.  Maybe in this case, the unbiased estimator is absurd ;)  I
think I should read that 3rd link you posted.

Regards,

Edward




On 19 June 2013 15:15, Edward d'Auvergne <edward@xxxxxxxxxxxxx> wrote:
Hi,

I'm quite aware of this.  Another useful link is:

http://en.wikipedia.org/wiki/Pooled_variance

This has also been pointed out to me by Robert Schneider (but not on
the mailing lists).  I am wondering if it is worth it as the number of
users who would benefit is quite low.  The reason for this is that
most users will only have duplicate spectra.  Therefore n-1 ends up
being 1, as n is the number of replicated spectra, and this collapses
down to the currently used variance averaging.  In the case where you
have collected spectra in triplicate, then implementing this makes
sense.  But the number of people using relax with triplicate spectra
in the last 12 years is probably 1 or 2.  So it would be good to
implement this, but its priority is very low.  In any case, both
averaged variances and pooled variances from a large collection of
2-point sets are horrible statistics, but that's all we've got.

Note also that there are two averaging steps.  The first is to average
the variance for all peaks in the spectrum.  The variance for a single
peak is the dirty estimate from 2 points.  Then if some spectra are
only measured once, then the variances for all spectra are averaged.

Regards,

Edward






On 19 June 2013 14:50, Troels E. Linnet
<NO-REPLY.INVALID-ADDRESS@xxxxxxx> wrote:
URL:
  <http://gna.org/support/?3045>

                 Summary: Support for pooled standard deviation for: Peak heights with partially replicated spectra
                 Project: relax
            Submitted by: tlinnet
            Submitted on: Wed 19 Jun 2013 12:50:08 PM GMT
                Category: None
                Priority: 5 - Normal
                Severity: 3 - Normal
                  Status: None
                 Privacy: Public
             Assigned to: None
        Originator Email:
             Open/Closed: Open
         Discussion Lock: Any
        Operating System: None

    _______________________________________________________

Details:

According to the manual,
http://www.nmr-relax.com/manual/spectrum_error_analysis.html,
the variances for the replicated datasets are averaged and used as the
variance for a single replicated spectrum.

This is a very reasonable assumption, but I wonder if a pooled standard
deviation should be used instead.

If we look at the definition in the IUPAC Gold Book:
http://goldbook.iupac.org/P04758.html

"""
Results from various series of measurements can be combined in the
following way to give a pooled relative standard deviation $s_{r,p}$:

$$
s_{r,p}=\sqrt{\frac{\sum(n_i-1)s_{r,i}^2}{\sum n_i -1}} =
\sqrt{\frac{\sum(n_i-1)s_i^2x_i^{-2}}{\sum n_i -1}}
$$
"""

It is not an easy subject, and the discussion can be "hot".  See for
example these gals and guys:
http://www.physicsforums.com/showthread.php?t=268377


So my question is, is the use of the average of variances the right way
to estimate the variance for a single recorded data point?
And should another way be implemented?



