Speeding up the relaxation dispersion analysis - optimisation of the missing data handling.


Posted by Edward d'Auvergne on May 09, 2014 - 16:30:
Hi,

This post is a placeholder for ideas about how to optimise the
handling of the self.missing data structure in the
target_functions.relax_disp module.  The current implementation is
very slow and has a significant impact on the speed of a dispersion
analysis.

The current logic is:

# Loop over the experiment types.
for ei in range(self.num_exp):
    # Loop over the spins.
    for si in range(self.num_spins):
        # Loop over the spectrometer frequencies.
        for mi in range(self.num_frq):
            # Loop over the offsets.
            for oi in range(self.num_offsets[0][si][mi]):
                # For all missing data points, set the back-calculated value to the measured
                # values so that it has no effect on the chi-squared value.
                for di in range(self.num_disp_points[0][si][mi][oi]):
                    if self.missing[0][si][mi][oi][di]:
                        self.back_calc[0][si][mi][oi][di] = self.values[0][si][mi][oi][di]

                # Calculate the chi-squared value.
                chi2_sum += chi2(self.values[ei][si][mi][0], self.back_calc[ei][si][mi][0], self.errors[ei][si][mi][0])


Here, the back-calculated value is set to the real value when data is
missing.  That way the chi-squared contribution for this element will be
zero, as the difference between the back-calculated value (set to the real
value) and the real value is zero.
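
As a tiny illustration (a sketch only, assuming chi2() computes the usual
error-weighted sum of squared residuals), replacing the back-calculated
value with the measured value at a missing point zeroes that point's
contribution:

from numpy import array

values    = array([1.0, 2.0, 3.0])
back_calc = array([1.1, 2.0, 2.7])   # Point 1 is "missing", so its back_calc element was set to values[1].
errors    = array([0.1, 0.1, 0.1])

# Element-wise contributions to the chi-squared sum.
contributions = ((values - back_calc) / errors)**2    # Approximately [1.0, 0.0, 9.0].
chi2_value = contributions.sum()                      # Approximately 10.0 - the missing point adds nothing.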

The ideal would be to shift this logic outside of all of these loops, and
then to construct a chi2_multi_dim() function, replacing chi2(), that
calculates the chi-squared value for all elements as one huge numpy data
structure and then sums over all dimensions.  Here, for the dispersion
analysis, the data structures have 5 dimensions.  For maximum speed, a
series of functions would probably be needed - chi2_2D(), chi2_3D(),
chi2_4D() and chi2_5D().
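
As a rough sketch of what chi2_5D() might look like (this is not relax's
existing API, and note that numpy's sum() collapses all dimensions at once,
so the 2D to 5D variants could in principle share the same body):

from numpy import sum

def chi2_5D(data, back_calc_vals, errors):
    """Return the chi-squared value summed over all elements of the 5D input arrays.

    All three arguments are assumed to be numpy float64 arrays of identical shape.
    """
    return sum((1.0 / errors * (data - back_calc_vals))**2)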

For the handling of the missing data, the following could be implemented.
The logic will be the same: the elements of self.back_calc which correspond
to missing data would be set to the real values (self.values).  How can
this be done outside of the loops?  Simply:

        back_calc_new = self.back_calc * self.missing_inv + self.missing_values

For this to work, self.missing_inv and self.missing_values would need
to be pre-created in the __init__() method as numpy float64 arrays (a
rough sketch follows the two points below):

- self.missing_inv:  An element is 0.0 when data is missing, and 1.0
otherwise.  The multiplication will cause all missing elements in
self.back_calc to be set to zero while leaving all other values
unmodified.

- self.missing_values:  This will be an array full of zeros, except where
data is missing, in which case the element is set to the corresponding
number in self.values.
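
A minimal sketch of how these two structures could be pre-built (the helper
name build_missing_masks() is hypothetical, and it assumes that the missing
flags and measured values have already been packed into regular, identically
shaped numpy arrays, padded wherever the offset or dispersion point counts
vary):

from numpy import asarray, float64, where

def build_missing_masks(missing, values):
    """Pre-compute the two masking arrays described above (called once in __init__()).

    missing:  array of flags, True where a data point is absent.
    values:   float64 array of the measured values, same shape as missing.
    """
    missing = asarray(missing, dtype=bool)
    values = asarray(values, dtype=float64)

    # 0.0 where data is missing, 1.0 everywhere else.
    missing_inv = where(missing, 0.0, 1.0)

    # The measured value where data is missing, 0.0 everywhere else.
    missing_values = where(missing, values, 0.0)

    return missing_inv, missing_values

With these stored as self.missing_inv and self.missing_values, the single
array expression above replaces the innermost di loop of the current
implementation.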

The last two lines of the target function would then be:

        # Return the total chi-squared value.
        return chi2_5D(self.values, back_calc_new, self.errors)

The speed-up from such an implementation would be very significant.  A
halving of the calculation time would not be surprising, though that
would depend on the dispersion model.  Anyway, I will leave such an
implementation for anyone who is interested in speed.  Please discuss
first.

Regards,

Edward


