mailRe: Speed up suggestion for task #7807.


Others Months | Index by Date | Thread Index
>>   [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Header


Content

Posted by Edward d'Auvergne on June 11, 2014 - 12:29:
And finally kill the rank_1 flag code, and all the old code that
activates, and watch your code fly compared to the trunk :)

Regards,

Edward


On 11 June 2014 12:25, Edward d'Auvergne <edward@xxxxxxxxxxxxx> wrote:
From your profiling script, I can see that the numpy.allclose()
function is wasting a lot of your time.  In lib.dispersion.cr72,
simply replace:

        if allclose(dw, zeros(dw.shape)):

with:

        if min(abs(dw)) == 0.0:

Then watch what happens to your profile timings.  You might be
pleasantly surprised :)  If you want to be even faster, pass in the
two dw arrays and only check on the rank-1 version from the parameter
vector.  Oh, I can also see that the kex value check has a bug!

Regards,

Edward



On 11 June 2014 12:19, Edward d'Auvergne <edward@xxxxxxxxxxxxx> wrote:
Note that you can find even more savings if you use back_calc as
temporary storage higher up the lib.dispersion module function.
Actually anywhere that the {NE, NS, NM, NO, ND} structures are created
and used, such a trick will save calculation time.  Though you
probably cannot use it everywhere!

Regards,

Edward



On 11 June 2014 12:15, Edward d'Auvergne <edward@xxxxxxxxxxxxx> wrote:
And here is how to shave a few percent off the lib.dispersion code
with the numpy ufuncs:

Index: lib/dispersion/cr72.py
===================================================================
--- lib/dispersion/cr72.py      (revision 23825)
+++ lib/dispersion/cr72.py      (working copy)
@@ -92,7 +92,7 @@
 """

 # Python module imports.
-from numpy import allclose, arccosh, array, cos, cosh, isfinite,
isnan, min, max, ndarray, ones, sqrt, sum, zeros
+from numpy import add, allclose, arccosh, array, cos, cosh, isfinite,
isnan, min, max, multiply, ndarray, ones, sqrt, sum, zeros
 from numpy.ma import masked_greater_equal

 # Repetitive calculations (to speed up calculations).
@@ -211,17 +211,16 @@
             return

     # Calculate R2eff.
-    R2eff = r20_kex - cpmg_frqs * arccosh( fact )
+    multiply(cpmg_frqs, arccosh(fact), back_calc)
+    add(r20_kex, -back_calc, back_calc)

     # Replace data in array.
     if t_max_etapos:
-        R2eff[mask_max_etapos.mask] = r20a[mask_max_etapos.mask]
+        back_calc[mask_max_etapos.mask] = r20a[mask_max_etapos.mask]

     # Catch errors, taking a sum over array is the fastest way to check 
for
     # +/- inf (infinity) and nan (not a number).
-    if not isfinite(sum(R2eff)):
+    if not isfinite(sum(back_calc)):
         # Find the data mask which has nan values, and replace.
-        mask = isnan(R2eff)
-        R2eff[mask] = 1e100
-
-    back_calc[:] = R2eff
+        mask = isnan(back_calc)
+        back_calc[mask] = 1e100


Regards,

Edward


On 11 June 2014 12:12, Edward d'Auvergne <edward@xxxxxxxxxxxxx> wrote:
And if you want to take this to the extreme, in __init__() define:

            self.dw_shape = (1, 1, self.NM, self.NO, self.ND)

and then in the target function:

            self.dw_struct[:] = 1.0
            multiply(self.dw_struct, tile(asarray(dw).reshape(self.NE,
self.NS)[:,:,None,None,None], self.dw_shape), self.dw_struct)
            multiply(self.dw_struct, self.frqs_a2, self.dw_struct)

These will speed things up by a few percent.  It's a pity the
numpy.tile() function does not use the 'out' argument.

Regards,

Edward


On 11 June 2014 12:09, Edward d'Auvergne <edward@xxxxxxxxxxxxx> wrote:
Hi,

Even faster is to use:

"""
            self.dw_struct[:] = 1.0
            multiply(self.dw_struct, tile(asarray(dw).reshape(self.NE,
self.NS)[:,:,None,None,None], (1, 1, self.NM, self.NO, self.ND)),
self.dw_struct)
            multiply(self.dw_struct, self.frqs_a2, self.dw_struct)
"""

Where disp_struct and frqs_a are pre-multipled in the __init__()
function, as that maths operation does not need to happen for each
function call:

            self.frqs_a2 = self.disp_struct * self.frqs_a

Regards,

Edward


On 11 June 2014 12:00, Edward d'Auvergne <edward@xxxxxxxxxxxxx> wrote:
Hi,

Oh well, I can see you've now have an implementation (new = False)
that beats mine when clustered :)  You can use some of the ideas such
as the out ufunc argument and temporary storage to your advantage
nevertheless.  For example you can use the out argument of these
ufuncs even more, replacing:

"""
            self.dw_struct[:] = 1.0
            self.dw_struct[:] = multiply(self.dw_struct,
tile(asarray(dw).reshape(self.NE, self.NS)[:,:,None,None,None], (1, 1,
self.NM, self.NO, self.ND)), ) * self.disp_struct * self.frqs_a
"""


with:

"""
            self.dw_struct[:] = 1.0
            multiply(self.dw_struct, tile(asarray(dw).reshape(self.NE,
self.NS)[:,:,None,None,None], (1, 1, self.NM, self.NO, self.ND)),
self.dw_struct)
            multiply(self.dw_struct, self.disp_struct, self.dw_struct)
            multiply(self.dw_struct, self.frqs_a, self.dw_struct)
"""


That shaves off a few milliseconds by avoiding automatic array
creation and destruction, with before:

"""
('sfrq: ', 600000000.0, 'number of cpmg frq', 15, array([  2.,   6.,
10.,  14.,  18.,  22.,  26.,  30.,  34.,  38.,  42.,
        46.,  50.,  54.,  58.]))
('sfrq: ', 800000000.0, 'number of cpmg frq', 20, array([  2.,   6.,
10.,  14.,  18.,  22.,  26.,  30.,  34.,  38.,  42.,
        46.,  50.,  54.,  58.,  62.,  66.,  70.,  74.,  78.]))
('sfrq: ', 900000000.0, 'number of cpmg frq', 22, array([  2.,   6.,
10.,  14.,  18.,  22.,  26.,  30.,  34.,  38.,  42.,
        46.,  50.,  54.,  58.,  62.,  66.,  70.,  74.,  78.,  82.,  
86.]))
('chi2 cluster:', 0.0)
Wed Jun 11 11:45:42 2014    /tmp/tmpwkhLSr

         198252 function calls (197150 primitive calls) in 1.499 
seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.499    1.499 <string>:1(<module>)
        1    0.001    0.001    1.499    1.499 
profiling_cr72.py:449(cluster)
     1000    0.001    0.000    1.427    0.001 
profiling_cr72.py:413(calc)
     1000    0.009    0.000    1.425    0.001 
relax_disp.py:1020(func_CR72_full)
     1000    0.066    0.000    1.409    0.001 
relax_disp.py:544(calc_CR72_chi2)
     1300    0.903    0.001    1.180    0.001 cr72.py:101(r2eff_CR72)
     2300    0.100    0.000    0.222    0.000 numeric.py:2056(allclose)
     3000    0.032    0.000    0.150    0.000 shape_base.py:761(tile)
     4000    0.104    0.000    0.104    0.000 {method 'repeat' of
'numpy.ndarray' objects}
    11828    0.091    0.000    0.091    0.000 {method 'reduce' of
'numpy.ufunc' objects}
        1    0.000    0.000    0.071    0.071 
profiling_cr72.py:106(__init__)
        1    0.010    0.010    0.056    0.056
profiling_cr72.py:173(return_r2eff_arrays)
     1000    0.032    0.000    0.048    0.000 chi2.py:72(chi2_rankN)
     4609    0.005    0.000    0.045    0.000 fromnumeric.py:1762(any)
     2300    0.004    0.000    0.036    0.000 fromnumeric.py:1621(sum)
"""


And after:

"""
('sfrq: ', 600000000.0, 'number of cpmg frq', 15, array([  2.,   6.,
10.,  14.,  18.,  22.,  26.,  30.,  34.,  38.,  42.,
        46.,  50.,  54.,  58.]))
('sfrq: ', 800000000.0, 'number of cpmg frq', 20, array([  2.,   6.,
10.,  14.,  18.,  22.,  26.,  30.,  34.,  38.,  42.,
        46.,  50.,  54.,  58.,  62.,  66.,  70.,  74.,  78.]))
('sfrq: ', 900000000.0, 'number of cpmg frq', 22, array([  2.,   6.,
10.,  14.,  18.,  22.,  26.,  30.,  34.,  38.,  42.,
        46.,  50.,  54.,  58.,  62.,  66.,  70.,  74.,  78.,  82.,  
86.]))
('chi2 cluster:', 0.0)
Wed Jun 11 11:49:29 2014    /tmp/tmpML9Lx5

         198252 function calls (197150 primitive calls) in 1.462 
seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.462    1.462 <string>:1(<module>)
        1    0.001    0.001    1.462    1.462 
profiling_cr72.py:449(cluster)
     1000    0.001    0.000    1.393    0.001 
profiling_cr72.py:413(calc)
     1000    0.009    0.000    1.392    0.001 
relax_disp.py:1022(func_CR72_full)
     1000    0.056    0.000    1.376    0.001 
relax_disp.py:544(calc_CR72_chi2)
     1300    0.887    0.001    1.158    0.001 cr72.py:101(r2eff_CR72)
     2300    0.097    0.000    0.217    0.000 numeric.py:2056(allclose)
     3000    0.031    0.000    0.148    0.000 shape_base.py:761(tile)
     4000    0.103    0.000    0.103    0.000 {method 'repeat' of
'numpy.ndarray' objects}
    11828    0.090    0.000    0.090    0.000 {method 'reduce' of
'numpy.ufunc' objects}
        1    0.000    0.000    0.068    0.068 
profiling_cr72.py:106(__init__)
        1    0.010    0.010    0.053    0.053
profiling_cr72.py:173(return_r2eff_arrays)
     1000    0.031    0.000    0.047    0.000 chi2.py:72(chi2_rankN)
     4609    0.006    0.000    0.044    0.000 fromnumeric.py:1762(any)
     2300    0.004    0.000    0.036    0.000 fromnumeric.py:1621(sum)
"""


The additional suggestions I didn't specify before was to use these
ufuncs with the out argument in the lib.dispersion modules themselves.
You don't need to create R2eff here, just pack it into back_calc!

Regards,

Edward

On 11 June 2014 11:55, Troels Emtekær Linnet <tlinnet@xxxxxxxxxxxxx> 
wrote:
Hi Edward.

Some timings.
Per spin, you have a faster method.
But I win per cluster.

1000 iterations
1 / 100 spins

Edward
   ncalls  tottime  percall  cumtime  percall 
filename:lineno(function)
        1    0.000    0.000    0.523    0.523 <string>:1(<module>)
   ncalls  tottime  percall  cumtime  percall 
filename:lineno(function)
        1    0.000    0.000    3.875    3.875 <string>:1(<module>)

Troels Tile
   ncalls  tottime  percall  cumtime  percall 
filename:lineno(function)
        1    0.000    0.000    0.563    0.563 <string>:1(<module>)
   ncalls  tottime  percall  cumtime  percall 
filename:lineno(function)
        1    0.000    0.000    2.102    2.102 <string>:1(<module>)

Troels Outer
   ncalls  tottime  percall  cumtime  percall 
filename:lineno(function)
        1    0.000    0.000    0.546    0.546 <string>:1(<module>)
   ncalls  tottime  percall  cumtime  percall 
filename:lineno(function)
        1    0.000    0.000    1.974    1.974 <string>:1(<module>)

2014-06-11 11:46 GMT+02:00 Troels Emtekær Linnet 
<tlinnet@xxxxxxxxxxxxx>:
Hi Edward.

This is a really god page!
http://docs.scipy.org/doc/numpy/reference/ufuncs.html

""
Tip
The optional output arguments can be used to help you save memory for
large calculations. If your arrays are large, complicated expressions
can take longer than absolutely necessary due to the creation and
(later) destruction of temporary calculation spaces. For example, the
expression G = a * b + c is equivalent to t1 = A * B; G = T1 + C; del
t1. It will be more quickly executed as G = A * B; add(G, C, G) which
is the same as G = A * B; G += C.
""

2014-06-10 23:08 GMT+02:00 Edward d'Auvergne <edward@xxxxxxxxxxxxx>:
Note that masks and numpy.ma.multiply() and numpy.ma.add() may speed
this up even more.  However due to overheads in the numpy masking,
there is a chance that this also makes the dw and R20 data structure
construction slower.

Regards,

Edward



On 10 June 2014 22:36, Edward d'Auvergne <edward@xxxxxxxxxxxxx> 
wrote:
Hi Troels,

To make things even simpler, here is what needs to be done for R20,
R20A and R20B:

"""
from numpy import abs, add, array, float64, multiply, ones, sum, 
zeros

# Init mimic.
#############

# Values from 
Relax_disp.test_cpmg_synthetic_ns3d_to_cr72_noise_cluster.
NE = 1
NS = 2
NM = 2
NO = 1
ND = 8
R20A = array([  9.984626320294867,  11.495327724693091,
12.991028416082928, 14.498419290021163])
shape = (NE, NS, NM, NO, ND)

# Final structure for lib.dispersion.
R20A_struct = zeros(shape, float64)

# Temporary storage to avoid memory allocations and garbage 
collection.
R20A_temp = zeros(shape, float64)

# The structure for multiplication with R20A to piecewise build up 
the
full R20A structure.
R20A_mask = zeros((NS*NM,) + shape, float64)
for si in range(NS):
    for mi in range(NM):
        R20A_mask[si*NM+mi, :, si, mi] = 1.0
print(R20A_mask)
print("\n\n")

# Values to be found (again taken directly from
Relax_disp.test_cpmg_synthetic_ns3d_to_cr72_noise_cluster - as a
printout of dw_frq_a).
R20A_final = array([[[[[  9.984626320294867,   9.984626320294867,
9.984626320294867,
                          9.984626320294867,   9.984626320294867,
9.984626320294867,
                          9.984626320294867,   9.984626320294867]],

                      [[ 11.495327724693091,  11.495327724693091,
11.495327724693091,
                         11.495327724693091,  11.495327724693091,
11.495327724693091,
                         11.495327724693091,  
11.495327724693091]]],


                     [[[ 12.991028416082928,  12.991028416082928,
12.991028416082928,
                         12.991028416082928,  12.991028416082928,
12.991028416082928,
                         12.991028416082928,  12.991028416082928]],

                      [[ 14.498419290021163,  14.498419290021163,
14.498419290021163,
                         14.498419290021163,  14.498419290021163,
14.498419290021163,
                         14.498419290021163,  
14.498419290021163]]]]])


# Target function.
##################

# Loop over the R20A elements (one per spin).
for r20_index in range(NS*NM):
    # First multiply the spin specific R20A with the spin specific
frequency mask, using temporary storage.
    multiply(R20A[r20_index], R20A_mask[r20_index], R20A_temp)

    # The add to the total.
    add(R20A_struct, R20A_temp, R20A_struct)

# Show that the structure is reproduced perfectly.
print(R20A_struct)
print(R20A_struct - R20A_final)
print(sum(abs(R20A_struct - R20A_final)))
"""


You may notice one simplification compared to my previous example 
for
the dw parameter
(http://thread.gmane.org/gmane.science.nmr.relax.devel/6135/focus=6154).
The values here too come from the
Relax_disp.test_cpmg_synthetic_ns3d_to_cr72_noise_cluster system 
test.

Regards,

Edward


On 10 June 2014 21:31, Edward d'Auvergne <edward@xxxxxxxxxxxxx> 
wrote:
Hi Troels,

No need for an example.  Here is the code to add to your
infrastructure which will make the analytic dispersion models 
insanely
fast:


"""
from numpy import add, array, float64, multiply, ones, zeros

# Init mimic.
#############

# Values from 
Relax_disp.test_cpmg_synthetic_ns3d_to_cr72_noise_cluster.
NE = 1
NS = 2
NM = 2
NO = 1
ND = 8
dw = array([ 1.847792726895652,  0.193719379085542])
frqs = [-382.188861036982701, -318.479128911056137]
shape = (NE, NS, NM, NO, ND)

# Final structure for lib.dispersion.
dw_struct = zeros(shape, float64)

# Temporary storage to avoid memory allocations and garbage 
collection.
dw_temp = zeros((NS,) + shape, float64)

# The structure for multiplication with dw to piecewise build up 
the
full dw structure.
dw_mask = zeros((NS,) + shape, float64)
for si in range(NS):
    for mi in range(NM):
        dw_mask[si, :, si, mi] = frqs[mi]
print(dw_mask)

# Values to be found (again taken directly from
Relax_disp.test_cpmg_synthetic_ns3d_to_cr72_noise_cluster - as a
printout of dw_frq_a).
dw_final = array([[[[[-706.205797724669765, -706.205797724669765,
                      -706.205797724669765, -706.205797724669765,
                      -706.205797724669765, -706.205797724669765,
                      -706.205797724669765, 
-706.205797724669765]],

                    [[-588.483418069912318, -588.483418069912318,
                      -588.483418069912318, -588.483418069912318,
                      -588.483418069912318, -588.483418069912318,
                      -588.483418069912318, 
-588.483418069912318]]],


                   [[[ -74.03738885349469 ,  -74.03738885349469 ,
                       -74.03738885349469 ,  -74.03738885349469 ,
                       -74.03738885349469 ,  -74.03738885349469 ,
                       -74.03738885349469 ,  -74.03738885349469 
]],

                    [[ -61.69557910435401 ,  -61.69557910435401 ,
                       -61.69557910435401 ,  -61.69557910435401 ,
                       -61.69557910435401 ,  -61.69557910435401 ,
                       -61.69557910435401 ,  -61.69557910435401 
]]]]])


# Target function.
##################

# Loop over the dw elements (one per spin).
for si in range(NS):
    # First multiply the spin specific dw with the spin specific
frequency mask, using temporary storage.
    multiply(dw[si], dw_mask[si], dw_temp[si])

    # The add to the total.
    add(dw_struct, dw_temp[si], dw_struct)

# Show that the structure is reproduced perfectly.
print(dw_struct - dw_final)
"""

As mentioned in the comments, the structures come from the
Relax_disp.test_cpmg_synthetic_ns3d_to_cr72_noise_cluster.  I just
added a check of "if len(dw) > 1: asdfasd" to kill the test, and 
added
printouts to obtain dw, frq_a, dw_frq_a, etc.  This is exactly the
implementation I described.  Although there might be an even 
faster
way, this will eliminate all numpy array creation and deletion via
Python garbage collection in the target functions (when used for 
R20
as well).

Regards,

Edward

On 10 June 2014 21:09, Edward d'Auvergne <edward@xxxxxxxxxxxxx> 
wrote:
If you have a really complicated example of your current 
'dw_frq_a'
data structure for multiple spins and multiple fields, that 
could help
to construct an example.

Cheers,

Edward



On 10 June 2014 20:57, Edward d'Auvergne <edward@xxxxxxxxxxxxx> 
wrote:
Hi,

I'll have a look tomorrow but, as you've probably seen, some of 
the
fine details such as indices to be used need to be sorted out 
when
implementing this.

Regards,

Edward


On 10 June 2014 20:49, Troels Emtekær Linnet 
<tlinnet@xxxxxxxxxxxxx> wrote:
What ever I do, I cannot get this to work?

Can you show an example ?

2014-06-10 16:29 GMT+02:00 Edward d'Auvergne 
<edward@xxxxxxxxxxxxx>:
Here is an example of avoiding automatic numpy data structure 
creation
and then garbage collection:

"""
from numpy import add, ones, zeros

a = zeros((5, 4))
a[1] = 1
a[:,1] = 2

b = ones((5, 4))

add(a, b, a)
print(a)
"""

The result is:

[[ 1.  3.  1.  1.]
 [ 2.  3.  2.  2.]
 [ 1.  3.  1.  1.]
 [ 1.  3.  1.  1.]
 [ 1.  3.  1.  1.]]

The out argument for numpy.add() is used here to operate in a 
similar
way to the Python "+=" operation.  But it avoids the 
temporary numpy
data structures that the Python "+=" operation will create.  
This will
save a lot of time in the dispersion code.

Regards,

Edward


On 10 June 2014 15:56, Edward d'Auvergne 
<edward@xxxxxxxxxxxxx> wrote:
Hi Troels,

Here is one suggestion, of many that I have, for 
significantly
improving the speed of the analytic dispersion models in your
'disp_spin_speed' branch.  The speed ups you have currently 
achieved
for spin clusters are huge and very impressive.  But now 
that you have
the infrastructure in place, you can advance this much more!

The suggestion has to do with the R20, R20A, and R20B numpy 
data
structures.  They way they are currently handled is 
relatively
inefficient, in that they are created de novo for each 
function call.
This means that memory allocation and Python garbage 
collection
happens for every single function call - something which 
should be
avoided at almost all costs.

A better way to do this would be to have a self.R20_struct,
self.R20A_struct, and self.R20B_struct created in 
__init__(), and then
to pack in the values from the parameter vector into these 
structures.
You could create a special structure in __init__() for this. 
 It would
have the dimensions [r20_index][ei][si][mi][oi], where the 
first
dimension corresponds to the different R20 parameters.  And 
for each
r20_index element, you would have ones at the 
[ei][si][mi][oi]
positions where you would like R20 to be, and zeros 
elsewhere.  The
key is that this is created at the target function start up, 
and not
for each function call.

This would be combined with the very powerful 'out' argument 
set to
self.R20_struct with the numpy.add() and numpy.multiply() 
functions to
prevent all memory allocations and garbage collection.  
Masks could be
used, but I think that that would be much slower than having 
special
numpy structures with ones where R20 should be and zeros 
elsewhere.
For just creating these structures, looping over a single 
r20_index
loop and multiplying by the special 
[r20_index][ei][si][mi][oi]
one/zero structure and using numpy.add() and 
numpy.multiply() with out
arguments would be much, much faster than masks or the 
current
R20_axis logic.  It will also simplify the code.

Regards,

Edward

_______________________________________________
relax (http://www.nmr-relax.com)

This is the relax-devel mailing list
relax-devel@xxxxxxx

To unsubscribe from this list, get a password
reminder, or change your subscription options,
visit the list information page at
https://mail.gna.org/listinfo/relax-devel



Related Messages


Powered by MHonArc, Updated Wed Jun 11 13:20:29 2014