Re: [bug #14182] System tests failures depend on the actual machine



Posted by Sébastien Morin on September 05, 2009 - 07:11:
Hi Ed,

(I added Olivier Fisette in CC, as he is quite knowledgeable about
computers and could help us make sense of this issue...)

This strange behavior was observed on my laptop and on the two other
computers in the lab that show the system test failures (i.e. on all
three computers from the bug report).

I performed some of the tests proposed on the following page:
    ->  http://www.gentoo.org/doc/en/articles/hardware-stability-p1.xml
          (tested the CPU with an infinite kernel rebuild using gcc for 4 hours)
          (tested the CPU with cpuburn-1.4 for XXXX hours)
          (tested the RAM with memtester-4.0.7 for > 6 hours)
to check the CPU and RAM, but did not find anything... Of course, these
tests may not have uncovered every potential problem with the CPU and
RAM, but chances are they are fine. Moreover, since the problem is
observed on three different computers, it would be surprising if all
three machines had hardware failures...
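
The same idea can also be checked at the software level, without relax:
repeat an identical floating-point computation many times and compare
the results bit for bit. Below is only a rough sketch of that idea,
written against the python-2.6/numpy-1.3 setup described below; the
matrix size and repeat count are arbitrary choices, not anything relax
uses.

    import hashlib
    import numpy

    def fingerprint(size=200):
        """Run a fixed chain of floating-point operations and hash the result."""
        numpy.random.seed(0)                # fixed seed => identical input data
        a = numpy.random.rand(size, size)
        b = numpy.linalg.inv(a + numpy.eye(size))
        c = numpy.dot(a, b)
        return hashlib.md5(c.tostring()).hexdigest()

    # On a healthy, deterministic machine every repeat gives the same hash;
    # any variation here would point at the hardware or numerical libraries.
    hashes = set(fingerprint() for i in range(50))
    print hashes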

The three systems run Gentoo Linux with kernel-2.6.30, numpy-1.3.0 and
python-2.6.2. However, the fourth computer to which I have access (for
Olivier: this computer is 'hibou'), and which passes the system tests
properly, also runs Gentoo Linux with kernel 2.6.30, numpy-1.3.0 and
python-2.6.2...  

A potential option could be that some kernel configuration is causing
these problems...

Another option would be that, although the algorithms are supposedly
fixed and deterministic, they are not...

I could check whether the calculations always diverge at the same step
and, if so, try to identify the problematic function...
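
One way to do this, sketched below, would be to wrap the objective
function so that every evaluation is logged with full precision, then
diff the logs of two runs to find the first point of divergence. This is
only an illustration; traced(), first_divergence() and the
chi2_func/log file names are made-up placeholders, not relax or minfx
code.

    import itertools

    def traced(func, log):
        """Wrap an objective function so that each call is logged."""
        def wrapper(params, *args):
            value = func(params, *args)
            log.write("%s -> %.17g\n"
                      % (" ".join(["%.17g" % p for p in params]), value))
            return value
        return wrapper

    def first_divergence(file_a, file_b):
        """Return the first line where two evaluation traces differ."""
        pairs = itertools.izip(open(file_a), open(file_b))
        for i, (la, lb) in enumerate(pairs):
            if la != lb:
                return i + 1, la.strip(), lb.strip()
        return None

    # Usage sketch: pass traced(chi2_func, log) to the minimiser for two
    # separate runs, then:
    #     print first_divergence("trace_run_a.log", "trace_run_b.log")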

Other ideas?

Do you know of any other minimisation library I could test with, to see
whether these computers really do produce changing results or whether
this is limited to relax (and minfx)?
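
For instance, scipy's BFGS implementation could be driven in a loop on a
standard test function; its func_calls and grad_calls outputs are the
direct analogues of the f_count and g_count values discussed below. A
minimal sketch, assuming scipy is installed (it is not a relax
dependency):

    from scipy.optimize import fmin_bfgs, rosen, rosen_der

    counts = set()
    for i in range(20):
        out = fmin_bfgs(rosen, [1.3, 0.7, 0.8], fprime=rosen_der,
                        full_output=True, disp=False)
        xopt, fopt, gopt, Bopt, func_calls, grad_calls, warnflag = out
        counts.add((func_calls, grad_calls))

    # A single entry means this machine optimises reproducibly outside
    # relax/minfx; multiple entries would incriminate the machine itself.
    print counts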

Regards,


Séb  :)



Edward d'Auvergne wrote:
Hi,

This is very strange, very strange indeed!  I've never seen anything
quite like this.  Is it only your laptop that is giving this variable
result?  I'm pretty sure that it's not related to a random seed
because the optimisation at no point uses random numbers - it is 100%
fixed, pre-determined, etc. and should never, ever vary (well on
different machines it will change, but never on the same machine).
What is the operating system on the laptop?  Can you run a RAM-checking
program, or anything else that might diagnose hardware failures?
Maybe the CPU is overheating?  Apart from hardware problems, since you
never recompile Python or numpy between these tests I cannot think of
anything else that could possibly cause this.

Cheers,

Edward



2009/9/3 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:
Hi Ed,

I've just tried what you proposed and observed something quite strange...

Here are the results:

    ./relax scripts/optimisation_testing.py > /dev/null

  (stats from my laptop, different trials, see below)
   iter      161   147   151
   f_count   765   620   591
   g_count   168   152   158

    ./relax -s

  (stats from my laptop, different trials, see below)
   iter      146   159   160   159
   f_count   708   721   649   673
   g_count   152   166   167   166


Problem 1:
The results should be the same in both situations, right?

Problem 2:
The results should not vary when the test is run multiple times, right?
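
To quantify problem 2, the repetition could be automated with a small
harness like this sketch, which runs the same command several times and
checksums the output (any timestamps or other harmless run-to-run
differences in the output would first need to be filtered out):

    import hashlib
    import subprocess

    def run_once(cmd):
        """Run a command and return an md5 checksum of its stdout."""
        proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
        out = proc.communicate()[0]
        return hashlib.md5(out).hexdigest()

    # Identical checksums across runs mean stable output; differing ones
    # reproduce the problem automatically.
    print [run_once("./relax -s") for i in range(5)]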


I have tested different things to find out why the tests give rise to
different results from one run to the next...

    ./relax scripts/optimisation_testing.py > /dev/null

   If you modify the file "test_suite/system_tests/__init__.py", the
result will be different. By modifying, I just mean commenting out a few
lines in the run() function. (I usually do that when I want to speed up
the process of testing a specific issue.) Maybe this behavior is related
to a random seed based on the code files...

    ./relax -s

   This one varies from run to run without any change to the code. Just
running the test several times in a row makes it vary... Maybe this
behavior is related to a random seed based on the date and time...


Any ideas?

If you want, Ed, I could create an account for you on one of these
strangely behaving computers...


Regards,


Séb




Edward d'Auvergne wrote:
Hi,

I've now written a script so that you can fix this.  Try running:

./relax scripts/optimisation_testing.py > /dev/null

This will give you all the info you need, formatted and ready for
copying and pasting into the correct file, which is currently only
'test_suite/system_tests/model_free.py'.  Just paste the pre-formatted
Python comment into the correct test, and add the different values to
the list of values checked (see the sketch below).
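
For reference, the pattern described here is presumably a membership
test against a list of known-good values rather than a single equality
check. A purely hypothetical sketch (the names and numbers are
illustrative, not taken from model_free.py):

    import unittest

    class CountCheck(unittest.TestCase):
        """Illustration of checking a count against several accepted values."""
        def test_f_count(self):
            f_count = 765    # value produced on the current machine
            # Accept any of the machine-dependent counts observed so far.
            self.assert_(f_count in [765, 620, 591])

    if __name__ == "__main__":
        unittest.main()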

Cheers,

Edward


2009/9/3 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:
Hi Ed,

I just checked my original mail
(https://mail.gna.org/public/relax-devel/2009-05/msg00003.html).


For the failure "FAIL: Constrained BFGS opt, backtracking line search
{S2=0.970, te=2048, Rex=0.149}", the counts were initially as follows:
   f_count   386
   g_count   386
and are now:
   f_count   743   694   761
   g_count   168   172   164


For the failure "FAIL: Constrained BFGS opt, More and Thuente line
search {S2=0.970, te=2048, Rex=0.149}", the counts were initially as
follows:
   f_count   722
   g_count   164
and are now:
   f_count   375   322   385
   g_count   375   322   385


The different values given for the just-measured parameters correspond
to the 3 different computers I have access to that give rise to these
two annoying failures...

I wonder whether the names of the tests in the original mail were
swapped, as the numbers just measured for the second test seem closer to
those originally posted for the first test, and vice versa...

Anyway, the problem is that there are variations between the different
machines. Variations are also present for the other parameters (s2, te,
rex, chi2, iter).

Regards,


Séb  :)



Edward d'Auvergne wrote:
Hi,

Could you check and see if the numbers are exactly the same as in your
original email (https://mail.gna.org/public/relax-devel/2009-05/msg00003.html)?
Specifically, look at f_count and g_count.

Cheers,

Edward


2009/9/2 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:
Hi Ed,

I updated my svn copies to r9432 and checked whether the problem was
still present.

Unfortunately, it is still present...

Regards,


Séb



Edward d'Auvergne wrote:
Hi,

Ah, yes, there is a reason.  I went through and fixed a series of
these optimisation difference issues in my local svn copy.  I
collected them all together and committed them as one change after I
had closed the bugs.  That was a few minutes ago, at r9426.  If you
update and test now, it should work.

Cheers,

Edward



2009/9/2 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:
Hi Ed,

I just tested for the presence of this bug (1.3 repository, r9425)
and it seems it is still there...

Is there a reason why it was closed?
From the data I have, I think this bug report should be re-opened.

Maybe I could try to give more details to help debugging...


Séb  :)



Edward d'Auvergne wrote:
Update of bug #14182 (project relax):

                  Status:               Confirmed => Fixed
             Assigned to:                    None => bugman
             Open/Closed:                    Open => Closed


    _______________________________________________________

Reply to this item at:

  <http://gna.org/bugs/?14182>

_______________________________________________
  Message sent via/by Gna!
  http://gna.org/




                  
--
Sébastien Morin
PhD Student
S. Gagné NMR Laboratory
Université Laval & PROTEO
Québec, Canada