Hi,

The code is not parallelised, as most optimisation algorithms are not amenable to parallelisation. There is a lot of research in that field, but the code here is not along those lines.

Do you still see this problem? Maybe it is a bug in the specific GCC version used to build the Python executable? Does it occur on machines with different Gentoo versions installed? Can you reproduce the error in a virtual machine?

This is a fixed code path and cannot in any way differ between runs of the test suite. It doesn't change on any of the Mandriva installs I have, on any of the Macs it has been tested on, or even on the Windows virtual image I use to build and test relax on Windows. I've even tested it on Solaris without problems! In any case, this bug is definitely machine specific and not related to relax itself. Sorry, I don't know what else I can do to try to track this down. Maybe your CPUs are doing some strange frequency scaling depending on load, and that is causing this bizarre behaviour? In any case, this is not an issue for relax execution and only affects the precision of optimisation in a small way.

Regards,

Edward


On 21 February 2010 05:34, Sébastien Morin <sebastien.morin.1@xxxxxxxxx> wrote:
Hi Ed,

It has been a long time since we discussed this... However, talking with Olivier last week, we came up with one possibility to explain this issue. Is the code in question in some way parallelised, i.e. are there multiple processes running at the same time with their results being combined subsequently? If so, there could be conditions in which the problem arises because variations in allocated memory or CPU change the timing between the different processes, hence affecting the final result... Does that make sense? Olivier, is this what you explained to me last week?

Sébastien


On 09-09-14 3:30 AM, Edward d'Auvergne wrote:

Hi,

I've been trying to work out what is happening, but it is a complete mystery to me. The algorithms are fixed in stone - I coded them myself and you can see it in the minfx code. They are standard optimisation algorithms that obey fixed rules. On the same machine they must, without question, give the same result every time! If they don't, something is wrong with the machine, either hardware or software. Would it be possible to install earlier Python and numpy versions (maybe 2.5 and 1.2.1 respectively) to see if that makes a difference? Or maybe it is the Linux kernel doing some strange things with the CPU - maybe switching between power profiles causes the CPU floating point precision to change? Are you 100% sure that all computers give variable results (between each run), and not that they each just give a different fixed result? Maybe there is a non-fatal kernel bug not triggered by Olivier's hardware?

Regards,

Edward

P.S. A note to others reading this - this problem is not serious for relax's optimisation!

2009/9/4 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:

Hi Ed,

(I added Olivier Fisette in CC as he is quite computer knowledgeable and could help us rationalise this issue...)
This strange behaviour was observed on my laptop and the two other computers in the lab with failures in the system tests (i.e. the three computers of the bug report). I performed some of the tests proposed on the following page:

-> http://www.gentoo.org/doc/en/articles/hardware-stability-p1.xml

(tested CPU with an infinite rebuild of the kernel using gcc for 4 hours)
(tested CPU with cpuburn-1.4 for XXXX hours)
(tested RAM with memtester-4.0.7 for > 6 hours)

to check the CPU and RAM, but did not find anything... Of course, these tests may not have uncovered potential problems in my CPU and RAM, but they are most likely fine. Moreover, since the problem is observed on three different computers, it would be surprising for hardware failures to occur in all three machines...

The three systems run Gentoo Linux with kernel-2.6.30, numpy-1.3.0 and python-2.6.2. However, the fourth computer to which I have access (for Olivier: this computer is 'hibou'), which passes the system tests properly, also runs Gentoo Linux with kernel 2.6.30, numpy-1.3.0 and python-2.6.2...

A potential option could be that some kernel configuration is causing these problems... Another option would be that, although the algorithms are supposedly fixed, they are not... I could check whether the calculations always diverge at the same step and, if so, try to see which function is problematic...

Any other ideas? Do you know any other minimisation library with which I could test, to see if these computers indeed give changing results or if this is limited to relax (and minfx)?

Regards,

Séb :)


Edward d'Auvergne wrote:

Hi,

This is very strange, very strange indeed! I've never seen anything quite like this. Is it only your laptop that is giving this variable result? I'm pretty sure that it's not related to a random seed because the optimisation at no point uses random numbers - it is 100% fixed, pre-determined, etc.
and should never, ever vary (well, on different machines it will change, but never on the same machine). What is the operating system on the laptop? Can you run a RAM checking program or anything else to diagnose hardware failures? Maybe the CPU is overheating? Apart from hardware problems, since you never recompile Python or numpy between these tests, I cannot think of anything else that could possibly cause this.

Cheers,

Edward

2009/9/3 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:

Hi Ed,

I've just tried what you proposed and observed something quite strange... Here are the results:

./relax scripts/optimisation_testing.py > /dev/null
(stats from my laptop, different trials, see below)

    iter      161  147  151
    f_count   765  620  591
    g_count   168  152  158

./relax -s
(stats from my laptop, different trials, see below)

    iter      146  159  160  159
    f_count   708  721  649  673
    g_count   152  166  167  166

Problem 1: The results should be the same in both situations, right?
Problem 2: The results should not vary when the test is run multiple times, right?

I have tested different things to find out why the tests give different results as a function of time...

./relax scripts/optimisation_testing.py > /dev/null

If you modify the file "test_suite/system_tests/__init__.py", then the result will be different. By modifying, I mean just commenting a few lines in the run() function. (I usually do that when I want to speed up the process of testing a specific issue.) Maybe this behaviour is related to a random seed based on the code files...

./relax -s

This one varies as a function of time without any change. Just running the test several times in a row will have it varying... Maybe this behaviour is related to a random seed based on the date and time...

Any idea? If you want, Ed, I could create you an account on one of these strangely behaving computers...

Regards,

Séb


Edward d'Auvergne wrote:

Hi,

I've now written a script so that you can fix this.
Try running:

./relax scripts/optimisation_testing.py > /dev/null

This will give you all the info you need, formatted ready for copying and pasting into the correct file. This is currently only 'test_suite/system_tests/model_free.py'. Just paste the pre-formatted Python comment into the correct test, and add the different values to the list of values checked.

Cheers,

Edward

2009/9/3 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:

Hi Ed,

I just checked my original mail (https://mail.gna.org/public/relax-devel/2009-05/msg00003.html).

For the failure "FAIL: Constrained BFGS opt, backtracking line search {S2=0.970, te=2048, Rex=0.149}", the counts were initially:

    f_count   386
    g_count   386

and are now:

    f_count   743  694  761
    g_count   168  172  164

For the failure "FAIL: Constrained BFGS opt, More and Thuente line search {S2=0.970, te=2048, Rex=0.149}", the counts were initially:

    f_count   722
    g_count   164

and are now:

    f_count   375  322  385
    g_count   375  322  385

The different values given for the just-measured parameters correspond to the 3 different computers I have access to that give rise to these two annoying failures... I wonder if the names of the tests in the original mail were mixed up, as the numbers just measured in the second test seem closer to those originally posted for the first test, and vice versa...

Anyway, the problem is that there are variations between the different machines. Variations are also present for the other parameters (s2, te, rex, chi2, iter).

Regards,

Séb :)


Edward d'Auvergne wrote:

Hi,

Could you check and see if the numbers are exactly the same as in your original email (https://mail.gna.org/public/relax-devel/2009-05/msg00003.html)? Specifically look at f_count and g_count.

Cheers,

Edward

2009/9/2 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:

Hi Ed,

I updated my svn copies to r9432 and checked if the problem was still present. Unfortunately, it is still present...

Regards,

Séb


Edward d'Auvergne wrote:

Hi,

Ah, yes, there is a reason.
I went through and fixed a series of these optimisation difference issues in my local svn copy. I collected them all together and committed them as one after I had shut the bugs. This was a few minutes ago, at r9426. If you update and test now, it should work.

Cheers,

Edward

2009/9/2 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:

Hi Ed,

I just tested for the presence of this bug (1.3 repository, r9425) and it seems it is still there... Is there a reason why it was closed? From the data I have, I guess this bug report should be re-opened. Maybe I could try to give more details to help debugging...

Séb :)


Edward d'Auvergne wrote:

Update of bug #14182 (project relax):

    Status:        Confirmed => Fixed
    Assigned to:   None => bugman
    Open/Closed:   Open => Closed

_______________________________________________________

Reply to this item at:

    <http://gna.org/bugs/?14182>

_______________________________________________
Message sent via/by Gna!
http://gna.org/

--
Sébastien Morin
PhD Student
S. Gagné NMR Laboratory
Université Laval & PROTEO
Québec, Canada
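The kind of cross-check Sébastien asks about in this thread — testing run-to-run determinism with an independent minimisation library — can be sketched as follows. This is a hedged illustration only: SciPy's BFGS is an assumed stand-in (it is not mentioned in the thread and is unrelated to minfx), and the Rosenbrock function is just a standard fixed test problem. On a healthy machine, repeated runs must report identical iteration and function-evaluation counts and bitwise-identical solutions, the analogue of relax's iter, f_count and g_count staying constant.

```python
# Hedged sketch (not relax/minfx code): probe optimiser determinism by
# running SciPy's BFGS several times on a fixed, randomness-free problem
# and comparing the iteration count (nit), the number of function
# evaluations (nfev), and the raw bytes of the solution vector.
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    # Standard fixed test function; no random numbers anywhere.
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

runs = []
for _ in range(3):
    res = minimize(rosenbrock, np.array([-1.2, 1.0]), method="BFGS")
    # tobytes() makes the comparison bitwise, not merely approximate.
    runs.append((res.nit, res.nfev, res.x.tobytes()))

# Any variation here would point at hardware or environment problems,
# mirroring the f_count/g_count differences reported in the thread.
assert len(set(runs)) == 1, "run-to-run variation detected!"
```

If such a script varies between runs on the affected machines but not on 'hibou', that would localise the problem to the machines rather than to relax or minfx, which is the distinction the thread is trying to establish.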