Hi,

The code is not parallelised, as most optimisation algorithms are not amenable to parallelisation. There is a lot of research in that field, but the code here is not along those lines.

Do you still see this problem? Maybe it is a bug in the specific GCC version used to build the Python executable? Does it occur on machines with different Gentoo versions installed? Can you reproduce the error in a virtual machine?

This is a fixed code path and cannot in any way differ between runs of the test suite. It doesn't change on any of the Mandriva installs I have, on any of the Macs it has been tested on, or even on the Windows virtual image I use to build and test relax on Windows. I've even tested it on Solaris without problems! In any case, this bug is definitely machine specific and not related to relax itself. Sorry, I don't know what else I can do to try to track this down. Maybe your CPUs are doing some strange frequency scaling depending on load, and that is causing this bizarre behaviour? In any case, this is not an issue for relax execution and only affects the precision of optimisation in a small way.

Regards,

Edward


On 21 February 2010 05:34, Sébastien Morin <sebastien.morin.1@xxxxxxxxx> wrote:
Hi Ed,

It has been a long time since we discussed this... However, talking with Olivier last week, we came up with one possibility to explain this issue. Is the code in question in some way parallelised, i.e. are there multiple processes running at the same time with their results being combined subsequently? If so, there could be conditions in which the problem arises because variations in allocated memory or CPU change the timing between the different processes, hence affecting the final result... Does that make sense? Olivier, is this what you explained to me last week?

Sébastien


On 09-09-14 3:30 AM, Edward d'Auvergne wrote:

Hi,

I've been trying to work out what is happening, but it is a complete mystery to me. The algorithms are fixed in stone - I coded them myself and you can see it in the minfx code. They are standard optimisation algorithms that obey fixed rules. On the same machine they must, without question, give the same result every time! If they don't, something is wrong with the machine, either hardware or software. Would it be possible to install earlier Python and numpy versions (maybe 2.5 and 1.2.1 respectively) to see if that makes a difference? Or maybe it is the Linux kernel doing some strange things with the CPU - maybe switching between power profiles causes the CPU floating point precision to change? Are you 100% sure that all computers give variable results (between each run), and not that they each just give a different fixed result? Maybe there is a non-fatal kernel bug not triggered by Olivier's hardware?

Regards,

Edward

P.S. A note to others reading this - this problem is not serious for relax's optimisation!

2009/9/4 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:

Hi Ed,

(I added Olivier Fisette in CC as he is quite computer knowledgeable and could help us rationalise this issue...)
This strange behaviour was observed on my laptop and the two other computers in the lab with failures in the system tests (i.e. the three computers of the bug report). I performed some of the tests proposed on the following page:

-> http://www.gentoo.org/doc/en/articles/hardware-stability-p1.xml

(tested CPU with an infinite rebuild of the kernel using gcc for 4 hours)
(tested CPU with cpuburn-1.4 for XXXX hours)
(tested RAM with memtester-4.0.7 for > 6 hours)

to check the CPU and RAM, but did not find anything... Of course, these tests may not have uncovered potential problems in my CPU and RAM, but they are most likely fine. Moreover, since the problem is observed on three different computers, it would be surprising for hardware failures to occur in all three machines...

The three systems run Gentoo Linux with kernel-2.6.30, numpy-1.3.0 and python-2.6.2. However, the fourth computer to which I have access (for Olivier: this computer is 'hibou'), which passes the system tests properly, also runs Gentoo Linux with kernel 2.6.30, numpy-1.3.0 and python-2.6.2...

A potential option could be that some kernel configuration is causing these problems... Another option would be that, although the algorithms are supposedly fixed, they are not... I could check whether the calculations always diverge at the same step and, if so, try to see which function is problematic...

Any other ideas? Do you know any other minimisation library with which I could test, to see if these computers indeed give changing results or if this is limited to relax (and minfx)?

Regards,

Séb :)


Edward d'Auvergne wrote:

Hi,

This is very strange, very strange indeed! I've never seen anything quite like this. Is it only your laptop that is giving this variable result? I'm pretty sure that it's not related to a random seed because the optimisation at no point uses random numbers - it is 100% fixed, pre-determined, etc.
and should never, ever vary (well, on different machines it will change, but never on the same machine). What is the operating system on the laptop? Can you run a RAM checking program or anything else to diagnose hardware failures? Maybe the CPU is overheating? Apart from hardware problems, since you never recompile Python or numpy between these tests, I cannot think of anything else that could possibly cause this.

Cheers,

Edward

2009/9/3 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:

Hi Ed,

I've just tried what you proposed and observed something quite strange... Here are the results:

./relax scripts/optimisation_testing.py > /dev/null
(stats from my laptop, different trials, see below)

    iter      161  147  151
    f_count   765  620  591
    g_count   168  152  158

./relax -s
(stats from my laptop, different trials, see below)

    iter      146  159  160  159
    f_count   708  721  649  673
    g_count   152  166  167  166

Problem 1: The results should be the same in both situations, right?
Problem 2: The results should not vary when the test is run multiple times, right?

I have tested different things to find out why the tests give different results as a function of time...

./relax scripts/optimisation_testing.py > /dev/null

If you modify the file "test_suite/system_tests/__init__.py", then the result will be different. By modifying, I mean just commenting a few lines in the run() function. (I usually do that when I want to speed up the process of testing a specific issue.) Maybe this behaviour is related to a random seed based on the code files...

./relax -s

This one varies as a function of time without any change. Just running the test several times in a row will have it varying... Maybe this behaviour is related to a random seed based on the date and time...

Any idea? If you want, Ed, I could create you an account on one of these strangely behaving computers...

Regards,

Séb


Edward d'Auvergne wrote:

Hi,

I've now written a script so that you can fix this.
Try running:

./relax scripts/optimisation_testing.py > /dev/null

This will give you all the info you need, formatted ready for copying and pasting into the correct file. This is currently only 'test_suite/system_tests/model_free.py'. Just paste the pre-formatted Python comment into the correct test, and add the different values to the list of values checked.

Cheers,

Edward

2009/9/3 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:

Hi Ed,

I just checked my original mail (https://mail.gna.org/public/relax-devel/2009-05/msg00003.html).

For the failure "FAIL: Constrained BFGS opt, backtracking line search {S2=0.970, te=2048, Rex=0.149}", the counts were initially:

    f_count   386
    g_count   386

and are now:

    f_count   743  694  761
    g_count   168  172  164

For the failure "FAIL: Constrained BFGS opt, More and Thuente line search {S2=0.970, te=2048, Rex=0.149}", the counts were initially:

    f_count   722
    g_count   164

and are now:

    f_count   375  322  385
    g_count   375  322  385

The different values given for the just-measured parameters correspond to the 3 different computers I have access to that give rise to these two annoying failures... I wonder if the names of the tests in the original mail were mixed up, as the numbers just measured in the second test seem closer to those originally posted for the first test, and vice versa...

Anyway, the problem is that there are variations between the different machines. Variations are also present for the other parameters (s2, te, rex, chi2, iter).

Regards,

Séb :)


Edward d'Auvergne wrote:

Hi,

Could you check and see if the numbers are exactly the same as in your original email (https://mail.gna.org/public/relax-devel/2009-05/msg00003.html)? Specifically look at f_count and g_count.

Cheers,

Edward

2009/9/2 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:

Hi Ed,

I updated my svn copies to r9432 and checked if the problem was still present. Unfortunately, it is still present...

Regards,

Séb


Edward d'Auvergne wrote:

Hi,

Ah, yes, there is a reason.
I went through and fixed a series of these optimisation difference issues in my local svn copy. I collected them all together and committed them as one after I had shut the bugs. This was a few minutes ago, at r9426. If you update and test now, it should work.

Cheers,

Edward

2009/9/2 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:

Hi Ed,

I just tested for the presence of this bug (1.3 repository, r9425) and it seems it is still there... Is there a reason why it was closed? From the data I have, I guess this bug report should be re-opened. Maybe I could try to give more details to help debugging...

Séb :)


Edward d'Auvergne wrote:

Update of bug #14182 (project relax):

    Status:        Confirmed => Fixed
    Assigned to:   None => bugman
    Open/Closed:   Open => Closed

_______________________________________________________

Reply to this item at:

    <http://gna.org/bugs/?14182>

_______________________________________________
Message sent via/by Gna!
http://gna.org/

--
Sébastien Morin
PhD Student
S. Gagné NMR Laboratory
Université Laval & PROTEO
Québec, Canada
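The kind of cross-check Sébastien asks about in this thread — testing run-to-run determinism with an independent minimisation library — can be sketched as follows. This is a hedged illustration only: SciPy's BFGS is an assumed stand-in (it is not mentioned in the thread and is unrelated to minfx), and the Rosenbrock function is just a standard fixed test problem. On a healthy machine, repeated runs must report identical iteration and function-evaluation counts and bitwise-identical solutions, the analogue of relax's iter, f_count and g_count staying constant.

```python
# Hedged sketch (not relax/minfx code): probe optimiser determinism by
# running SciPy's BFGS several times on a fixed, randomness-free problem
# and comparing the iteration count (nit), the number of function
# evaluations (nfev), and the raw bytes of the solution vector.
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    # Standard fixed test function; no random numbers anywhere.
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

runs = []
for _ in range(3):
    res = minimize(rosenbrock, np.array([-1.2, 1.0]), method="BFGS")
    # tobytes() makes the comparison bitwise, not merely approximate.
    runs.append((res.nit, res.nfev, res.x.tobytes()))

# Any variation here would point at hardware or environment problems,
# mirroring the f_count/g_count differences reported in the thread.
assert len(set(runs)) == 1, "run-to-run variation detected!"
```

If such a script varies between runs on the affected machines but not on 'hibou', that would localise the problem to the machines rather than to relax or minfx, which is the distinction the thread is trying to establish.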