Re: [bug #14182] System tests failures depend on the actual machine




Posted by Edward d'Auvergne on March 16, 2010 - 16:47:
Hi,

Does this still fail in the 1.3 line?  I should have fixed this one
quite a while ago.  I think it's about time I released relax-1.3.5!

Cheers,

Edward


On 16 March 2010 15:44, Sébastien Morin <sebastien.morin.1@xxxxxxxxx> wrote:
Hi Edward,

I just tested the Gentoo machines again using relax-1.3.4 and minfx-1.0.2.

Three 32 bit machines I tested completed the test-suite without any
error. Two other machines I previously had in my possession are no
longer available...

However, one 64 bit machine failed for one test, always with the same
values:

====================
FAIL: Constrained Newton opt, GMW Hessian mod, More and Thuente line search {S2=0.970, te=2048, Rex=0.149}

...

relax> minimise(*args=('newton',), func_tol=1e-25, max_iterations=10000000, constraints=True, scaling=True, verbosity=1)
Simulation 1
Simulation 2
Simulation 3

relax> monte_carlo.error_analysis(prune=0.0)
Traceback (most recent call last):
  File "/home/semor/relax-1.3.4/test_suite/system_tests/model_free.py", line 610, in test_opt_constr_newton_gmw_mt_S2_0_970_te_2048_Rex_0_149
    self.value_test(spin, select, s2, te, rex, chi2, iter, f_count, g_count, h_count, warning)
  File "/home/semor/relax-1.3.4/test_suite/system_tests/model_free.py", line 1110, in value_test
    self.assertEqual(spin.f_count, f_count, msg=mesg)
AssertionError: Optimisation failure.

System: Linux
Release: 2.6.20-gentoo-r7
Version: #1 SMP Sat Apr 28 23:31:52 Local time zone must be set--see zic
Win32 version:
Distribution: gentoo 1.12.13
Architecture: 64bit ELF
Machine: x86_64
Processor: Intel(R) Xeon(R) CPU 5160 @ 3.00GHz
Python version: 2.6.4
numpy version: 1.3.0


s2:       0.9699999999999994
te:       2048.0000000000446
rex:      0.14900000000001615
chi2:     8.3312601381368332e-28
iter:     22
f_count:  91
g_count:  91
h_count:  22
warning:  None
====================

Regards,


Séb  :)



On 10-02-21 9:00 AM, Edward d'Auvergne wrote:
Is it different for the different machines, or is it different each
time on the same machine?  If you give a range of numbers for the
optimisation results, these tests could be relaxed a little.
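
A minimal sketch of what such a relaxed check could look like, assuming
Python 3 and only the standard library.  The helper name check_result(),
the tolerance and the count ranges are hypothetical and are not part of
relax.  The observed values are the ones printed by the failing 64 bit
machine.

```python
import math


def check_result(observed, expected, count_ranges, rel_tol=1e-9):
    """Return a list of mismatch messages (empty when all checks pass)."""
    problems = []
    # Floating point parameters: compare within a relative tolerance
    # instead of demanding bit-for-bit equality.
    for name, target in expected.items():
        if not math.isclose(observed[name], target, rel_tol=rel_tol):
            problems.append("%s: %r is not close to %r"
                            % (name, observed[name], target))
    # Counters: accept any value inside an inclusive [low, high] range.
    for name, (low, high) in count_ranges.items():
        if not low <= observed[name] <= high:
            problems.append("%s: %d is outside [%d, %d]"
                            % (name, observed[name], low, high))
    return problems


# Values printed by the failing 64 bit machine.
observed = {"s2": 0.9699999999999994, "te": 2048.0000000000446,
            "rex": 0.14900000000001615, "iter": 22,
            "f_count": 91, "g_count": 91, "h_count": 22}
# Target parameter values, taken from the test name.
expected = {"s2": 0.970, "te": 2048.0, "rex": 0.149}
# Hypothetical ranges, chosen only for illustration.
count_ranges = {"iter": (15, 30), "f_count": (80, 100),
                "g_count": (80, 100), "h_count": (15, 30)}

for line in check_result(observed, expected, count_ranges):
    print(line)
```

With these made-up ranges nothing is printed, as the fitted parameters
agree with the targets to far better than the tolerance and only the
counters differ between machines.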

Cheers,

Edward


On 21 February 2010 14:41, Sébastien Morin <sebastien.morin.1@xxxxxxxxx> wrote:

Hi Ed,

I agree with you that this is not an important issue given the small
variations observed...

I was just still a bit annoyed by this happening on our Gentoo systems...

But maybe this is just because of Gentoo itself, as in Gentoo almost
everything is compiled locally, so every system is different because of
all the variables that can be changed that affect compilation...
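
One way locally compiled code could plausibly change numerical results
(an assumption, as nothing in this thread has confirmed it) is through
the order in which floating point operations are carried out, because
floating point addition is not associative.  A minimal Python 3 sketch
of the effect:

```python
# The same three numbers, summed in two different orders.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)

print(repr(left))     # 0.6000000000000001
print(repr(right))    # 0.6
print(left == right)  # False
```

A difference in the last digit like this could be enough to send a line
search down a slightly different path and so change the function call
counts, while leaving the final parameter values essentially the same.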

Ok, let's forget all this !

Regards,


Séb


On 10-02-21 8:32 AM, Edward d'Auvergne wrote:

Hi,

The code is not parallelised as most optimisation algorithms are not
amenable to parallelisation.  There's a lot of research in that field,
but the code here is not along these lines.  Do you still see this
problem?  Maybe it is a bug in this specific version of the GCC
compiler which created the python executable?  Does it occur on
machines with different Gentoo versions installed?  Can you
reproduce the error in a virtual machine?  This is a fixed code path
and cannot in any way be different upon different runs of the test
suite.  It doesn't change on all the Mandriva installs I have, all the
Macs it has been tested on, or even on the Windows virtual image I use
to build and test relax on Windows.  I've even tested it on Solaris
without problems!  In any case, this bug is definitely machine
specific and not related to relax itself.  Sorry, I don't know what
else I can do to try to track this down.  Maybe your CPUs are doing
some strange frequency scaling depending on load, and that is causing
this bizarre behaviour?  In any case, this is not an issue for relax
execution and only affects the precision of optimisation in a small
way.

Regards,

Edward



On 21 February 2010 05:34, Sébastien Morin <sebastien.morin.1@xxxxxxxxx> wrote:


Hi Ed,

It has been a long time since we discussed this...

However, talking with Olivier last week, we discussed one possibility
to explain this issue. Is the code in question in some way parallelized,
i.e. are there multiple processes running at the same time with their
results being combined subsequently? If yes, there could be conditions
in which the problem could arise, either because of variations in
allocated memory or CPU that would change the timing between the
different processes, hence affecting the final result...

Does that make sense?

Olivier, is this what you explained to me last week?


Sébastien


On 09-09-14 3:30 AM, Edward d'Auvergne wrote:


Hi,

I've been trying to work out what is happening, but it is a complete
mystery to me.  The algorithms are fixed in stone - I coded them
myself and you can see it in the minfx code.  They are standard
optimisation algorithms that obey fixed rules.  On the same machine it
must, without question, give the same result every time!  If it
doesn't, something is wrong with the machine, either hardware or
software.  Would it be possible to install an earlier python and numpy
version (maybe 2.5 and 1.2.1 respectively) to see if that makes a
difference?  Or maybe it is the Linux kernel doing some strange things
with the CPU - maybe switching between power profiles causing the CPU
floating point math precision to change?  Are you 100% sure that all
computers give variable results (between each run), and not that they
just give a different fixed result each time?  Maybe there is a
non-fatal kernel bug not triggered by Olivier's hardware?
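
To separate "varies between runs" from "differs between machines", a
self-contained check along the following lines could be run on each
computer.  This is only a sketch, assuming Python 3.  The test function
and the steepest descent minimiser are toy stand-ins written for this
example and are not the minfx algorithms.

```python
def func(x, y):
    """A simple convex test function with its minimum at (1, -2)."""
    return (x - 1.0) ** 2 + 10.0 * (y + 2.0) ** 2


def grad(x, y):
    """Analytic gradient of func()."""
    return 2.0 * (x - 1.0), 20.0 * (y + 2.0)


def minimise(x, y, grad_tol=1e-10, max_iter=1000):
    """Steepest descent with a backtracking line search.

    Returns the tuple (x, y, iter, f_count, g_count).
    """
    f_count = g_count = 0
    fx = func(x, y)
    f_count += 1
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        gx, gy = grad(x, y)
        g_count += 1
        g2 = gx * gx + gy * gy
        if g2 < grad_tol:
            break
        step = 1.0
        while True:
            nx, ny = x - step * gx, y - step * gy
            fn = func(nx, ny)
            f_count += 1
            # Armijo sufficient decrease condition (with a step floor).
            if fn <= fx - 1e-4 * step * g2 or step < 1e-16:
                break
            step *= 0.5
        x, y, fx = nx, ny, fn
    return x, y, n_iter, f_count, g_count


# Repeat the identical optimisation and compare every returned value.
runs = [minimise(-3.0, 4.0) for _ in range(5)]
print(runs[0])
print("identical across runs:", len(set(runs)) == 1)
```

No random numbers are involved, so the five tuples should be identical
on a healthy machine.  The counts themselves may still differ from one
machine to the next.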

Regards,

Edward


P.S.  A note to others reading this - this problem is not serious for
relax's optimisation!


2009/9/4 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:



Hi Ed,

(I added Olivier Fisette in CC as he is quite computer knowledgeable
and could help us rationalize this issue...)

This strange behavior was observed for my laptop and the two other
computers in the lab with the failures in the system tests (i.e. for
the three computers of the bug report).

I performed some of the different tests proposed on the following
page:
    http://www.gentoo.org/doc/en/articles/hardware-stability-p1.xml
        (tested CPU with infinite rebuild of kernel using gcc for 4 hours)
        (tested CPU with cpuburn-1.4 for XXXX hours)
        (tested RAM with memtester-4.0.7 for > 6 hours)
to check the CPU and RAM, but did not find anything... Of course,
these tests may not have uncovered potential problems in my CPU and
RAM, but most likely they are fine. Moreover, since the problem is
observed on three different computers, it would be surprising for
hardware failures to occur in all three machines...

The three systems run Gentoo Linux with kernel-2.6.30, numpy-1.3.0 and
python-2.6.2. However, the fourth computer to which I have access (for
Olivier: this computer is 'hibou'), and which passes the system tests
properly, also runs Gentoo Linux with kernel 2.6.30, numpy-1.3.0 and
python-2.6.2...

A potential option could be that some kernel configuration is causing
these problems...

Another option would be that, although the algorithms are supposedly
fixed, they are not...

I could check if the calculations diverge always at the same step and,
if so, try to see what function is problematic...
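
A sketch of how two runs could be compared step by step, assuming
Python 3 and that each run can be made to record one tuple of values
per iteration.  The function name first_divergence() and the traces
below are made up for illustration, as relax does not provide them.

```python
def first_divergence(trace_a, trace_b):
    """Return the index of the first step at which two traces differ.

    Each trace is a list of tuples, one per iteration.  None is returned
    when the traces are identical.
    """
    for k, (a, b) in enumerate(zip(trace_a, trace_b)):
        if a != b:
            return k
    if len(trace_a) != len(trace_b):
        # One run simply took more iterations than the other.
        return min(len(trace_a), len(trace_b))
    return None


# Made-up traces of (chi2, f_count) per iteration, for illustration only.
run_1 = [(10.0, 1), (4.0, 3), (1.5, 6), (0.2, 9)]
run_2 = [(10.0, 1), (4.0, 3), (1.5, 7), (0.3, 11)]
print(first_divergence(run_1, run_2))  # 2
```

If the reported index is the same every time, the function called at
that step would be the natural place to start looking.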

Other ideas ?

Do you know any other minimisation library with which I could test to
see if these computers indeed give rise to changing results or if this
is limited to relax (and minfx) ?

Regards,


Séb  :)



Edward d'Auvergne wrote:



Hi,

This is very strange, very strange indeed!  I've never seen anything
quite like this.  Is it only your laptop that is giving this variable
result?  I'm pretty sure that it's not related to a random seed
because the optimisation at no point uses random numbers - it is 100%
fixed, pre-determined, etc. and should never, ever vary (well on
different machines it will change, but never on the same machine).
What is the operating system on the laptop?  Can you run a RAM
checking program or anything else to diagnose hardware failures?
Maybe the CPU is overheating?  Apart from hardware problems, since you
never recompile Python or numpy between these tests I cannot think of
anything else that could possibly cause this.

Cheers,

Edward



2009/9/3 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:




Hi Ed,

I've just tried what you proposed and observed something quite
strange...

Here are the results:

./relax scripts/optimisation_testing.py > /dev/null

  (stats from my laptop, different trials, see below)
    iter      161   147   151
    f_count   765   620   591
    g_count   168   152   158

./relax -s

  (stats from my laptop, different trials, see below)
    iter      146   159   160   159
    f_count   708   721   649   673
    g_count   152   166   167   166


Problem 1:
The results should be the same in both situations, right?

Problem 2:
The results should not vary when the test is done multiple times,
right?


I have tested different things to find out why the tests give rise to
different results as a function of time...

./relax scripts/optimisation_testing.py > /dev/null

    If you modify the file "test_suite/system_tests/__init__.py", then
the result will be different. By modifying, I mean just commenting out
a few lines in the run() function. (I usually do that when I want to
speed up the process of testing a specific issue.) Maybe this behavior
is related to a random seed based on the code files...

./relax -s

    This one varies as a function of time without any change. Just
doing the test several times in a row will have it varying... Maybe
this behavior is related to a random seed based on the date and time...


Any idea ?

If you want, Ed, I could create you an account on one of these
strange-behaving computers...


Regards,


Séb




Edward d'Auvergne wrote:




Hi,

I've now written a script so that you can fix this.  Try running:

./relax scripts/optimisation_testing.py > /dev/null

This will give you all the info you need, formatted ready for copying
and pasting into the correct file.  This is currently only
'test_suite/system_tests/model_free.py'.  Just paste the pre-formatted
python comment into the correct test, and add the different values to
the list of values checked.
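
As a rough illustration of checking against a list of values, and not
the actual code in model_free.py, the idea could be sketched as follows
in Python 3.  The names ACCEPTED_COUNTS and check_counts() are
hypothetical.  The pairs are the f_count and g_count values given in
the quoted message below for the More and Thuente test, where it is
also noted that the two test names may have been mixed up.

```python
# (f_count, g_count) pairs reported in this thread for the
# "Constrained BFGS opt, More and Thuente line search" test.
ACCEPTED_COUNTS = [
    (722, 164),   # originally posted values
    (375, 375),   # values from the three machines that now fail
    (322, 322),
    (385, 385),
]


def check_counts(f_count, g_count, accepted=ACCEPTED_COUNTS):
    """Raise AssertionError unless the pair matches a known machine."""
    if (f_count, g_count) not in accepted:
        raise AssertionError(
            "Optimisation failure: f_count=%d, g_count=%d not in %r"
            % (f_count, g_count, accepted))


check_counts(322, 322)   # A known pair, so this passes silently.
```

Keeping the values as pairs means that a new machine is only accepted
if both counts match one previously seen machine, not just any mix of
known numbers.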

Cheers,

Edward


2009/9/3 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:




Hi Ed,

I just checked my original mail
(https://mail.gna.org/public/relax-devel/2009-05/msg00003.html).


For the failure "FAIL: Constrained BFGS opt, backtracking line search
{S2=0.970, te=2048, Rex=0.149}", the counts were initially as follows:
    f_count   386
    g_count   386
and are now:
    f_count   743   694   761
    g_count   168   172   164


For the failure "FAIL: Constrained BFGS opt, More and Thuente line
search {S2=0.970, te=2048, Rex=0.149}", the counts were initially as
follows:
    f_count   722
    g_count   164
and are now:
    f_count   375   322   385
    g_count   375   322   385


The different values given for the "just-measured" parameters
correspond to the 3 different computers I have access to that give
rise to these two annoying failures...

I wonder if the names of the tests in the original mail were not mixed
up, as the numbers just measured in the second test seem closer to
those originally posted for the first test, and vice versa...

Anyway, the problem is that there are variations between the different
machines. Variations are also present for the other parameters (s2,
te, rex, chi2, iter).

Regards,


Séb  :)



Edward d'Auvergne wrote:




Hi,

Could you check and see if the numbers are exactly the same as in your
original email
(https://mail.gna.org/public/relax-devel/2009-05/msg00003.html)?
Specifically look at f_count and g_count.

Cheers,

Edward


2009/9/2 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:




Hi Ed,

I updated my svn copies to r9432 and checked if the problem was still
present.

Unfortunately, it is still present...

Regards,


Séb



Edward d'Auvergne wrote:




Hi,

Ah, yes, there is a reason.  I went through and fixed a series of
these optimisation difference issues - in my local svn copy.  I
collected these all together and committed them as one after I had
closed the bugs.  This was a few minutes ago at r9426.  If you update
and test now, it should work.

Cheers,

Edward



2009/9/2 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:





Hi Ed,

I just tested for the presence of this bug (1.3 repository, r9425)
and it seems it is still there...

Is there a reason why it was closed?

From the data I have, I guess this bug report should be re-opened.


Maybe I could try to give more details to help debugging...


Séb  :)



Edward d'Auvergne wrote:





Update of bug #14182 (project relax):

                  Status:               Confirmed => Fixed
             Assigned to:                    None => bugman
             Open/Closed:                    Open => Closed


     _______________________________________________________

Reply to this item at:

   <http://gna.org/bugs/?14182>

_______________________________________________
   Message sent via/by Gna!
   http://gna.org/








--
Sébastien Morin
PhD Student
S. Gagné NMR Laboratory
Université Laval & PROTEO
Québec, Canada









_______________________________________________
relax (http://nmr-relax.com)

This is the relax-devel mailing list
relax-devel@xxxxxxx

To unsubscribe from this list, get a password
reminder, or change your subscription options,
visit the list information page at
https://mail.gna.org/listinfo/relax-devel



