[Fwd: Re: [bug #6503] Uncaught nan in xh_vect]



Posted by Gary S. Thompson on August 08, 2006 - 17:07:



--- Begin Message ---
Edward d'Auvergne wrote:

> Because of what I wrote above, it would probably have to be listed as a
> dependency.  However I don't really like that idea, as dependencies are
> a pain.  Do you have the link to its homepage?  All the links I've
> tried that look like they would be its homepage (for example
> http://www.warnes.net/rwndown) are still down.  Having it built
> into relax would be better.

I agree that having an additional dependency is a bad idea. The options
then are:

1) Do nothing, and accept Python's default NaN handling. Although I
agree that NaNs are a rare occurrence, I think this option just puts off
the inevitable - as more users push relax into more places, it's
inevitable that NaNs will appear again and give us more headaches,
probably in ways we can't anticipate.

2) Code our own version of fpconst for relax. Then we could catch NaNs
as and when we need to.

3) Make the move from Numeric to Numpy. As I understand it, this should
give us both the ability to catch NaNs and reliable NaN comparisons
(i.e. NaN will compare False with everything). I've done some very
preliminary testing towards this, and at first glance it seems easier
than I expected. I have run the convertcode script that is distributed
with Numpy on all modules in the maths directory of the relax 1.3
branch. All tests pass except the OpenDX tests, which also failed in 1.3
before the conversion. Obviously this will require a lot of care and
more thorough testing than the current test suite allows, but it at
least seems feasible. As we have discussed before, the shift to Numpy
will be necessary soon enough anyway, so perhaps now is the time to give
it a shot?
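Option 2) need not be much code: IEEE 754 guarantees that NaN is the only value which compares unequal to itself, so a dependency-free check (a sketch, not relax's actual code) could be as small as:

```python
def is_nan(x):
    """Return True if x is an IEEE 754 NaN.

    Relies on the IEEE 754 rule that NaN is the only value for
    which x != x, so no fpconst dependency is needed.
    """
    return x != x

def is_inf(x):
    """Return True if x is positive or negative infinity."""
    return x in (float('inf'), float('-inf'))
```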


For 1) I would prefer the NaN catching to be outside of the
'minimise/' directory.  It should be safe to assume that that code
will soon not be part of relax.  As for handling NaNs within the
minimisation code I know of no other minimisation package that does
this - if the user sends garbage to it then returning garbage should
be expected.  The sender and receiver code should do the cleanup.  I
do however think that testing for NaN during optimisation (in the
'maths_fns' code) is too computationally expensive.  If optimisation
terminates in a reasonable time then I don't think we should test for
NaNs during the number crunching phase.

We should check what the overhead actually is before declaring it too expensive.
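Measuring that overhead is straightforward. A quick timeit sketch (the loop body is a hypothetical stand-in for one chi-squared accumulation step, not relax's actual code) comparing a bare loop against one guarded by a NaN test:

```python
import timeit

# Hypothetical stand-in for the data processed in one evaluation.
setup = "data = [0.1 * i for i in range(1000)]"

plain = "s = 0.0\nfor x in data:\n    s += x * x"
# Same loop, guarded by the NaN self-comparison test (v != v).
guarded = ("s = 0.0\nfor x in data:\n    v = x * x\n"
           "    if v != v:\n        break\n    s += v")

t_plain = min(timeit.repeat(plain, setup=setup, number=1000, repeat=3))
t_guarded = min(timeit.repeat(guarded, setup=setup, number=1000, repeat=3))
print("per-element NaN test overhead: %.0f%%"
      % (100.0 * (t_guarded - t_plain) / t_plain))
```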

For 2) and 3) the NaN value comes from the chi2 variable which is just
a standard Python floating point number rather than a Numeric
construct.  Will the shift to Numpy actually change the behaviour of
the default Python floats?  Or will it just change the behaviour of
vectors, matrices, and other linear algebra?  Or is there a function
similar to the fpconst.py function isNaN() which can be used to catch
it?  Anyway, the 1.3 line is probably the best place to test the shift
from Numeric to Numpy - although in a private branch first.

My understanding is that Numpy generally propagates NaNs, and that pure floating point maths also propagates them. The only place there used to be problems is in ufuncs, which in Numeric used to raise exceptions rather than propagate NaNs. There is a function similar to isNaN, called isnan (and there is also isinf), in scipy. In general we could grep for the use of isnan and isinf in the numpy/scipy codebase to see if NaNs are caught much or just propagated. A quick look shows only a very few uses of isnan in numpy or scipy.
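To the question above about plain Python floats: numpy.isnan accepts Python scalars as well as arrays, so the chi2 value would not need to become a Numpy construct first. A small sketch, assuming Numpy is importable:

```python
import numpy

chi2 = float('nan')   # a plain Python float, as chi2 is in relax

# isnan/isinf work element-wise on arrays and directly on scalars.
print(bool(numpy.isnan(chi2)))   # True
print(bool(numpy.isinf(chi2)))   # False
print(bool(numpy.isnan(numpy.array([1.0, chi2])).any()))   # True
```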

As for the test suite, the optimisation code is completely untested.
It's where the major breakages occur, although the code in
'maths_fns/' is problematic as well.  A shift to Numpy will require
changes to both 'maths_fns/' and 'minimise/'.  To catch problems the
four optimisation classes will need to be tested - standard single
residue, diffusion tensor, all parameters (model-free + diffusion
params), and the residue specific local tm models.  It shouldn't be
too hard to code a number of tests for this as they can all use the
same data.  Then all the optimisation algorithms in ALL combinations
need to be tested - that is quite a few.  However as these minimisers
will be separated from relax, this won't be so easy.
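Enumerating the combinations themselves is cheap; the hard part is the fixtures. A sketch of how a test suite might generate every pairing over shared data (the minimiser and parameter-set names here are illustrative, not relax's actual identifiers):

```python
import itertools

# Illustrative names only -- relax's real identifiers will differ.
minimisers = ['newton', 'bfgs', 'simplex', 'grid']
param_sets = ['single_residue', 'diff_tensor', 'all_params', 'local_tm']

def optimisation_cases():
    """Return every (minimiser, parameter set) pair to be tested.

    All cases can share the same input data, so each pair only
    needs a run-and-check body around a common fixture.
    """
    return list(itertools.product(minimisers, param_sets))

cases = optimisation_cases()
print(len(cases))   # 16 combinations from 4 x 4 names
```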

I don't quite follow why this won't be easy. The combinatorial feature is of course a problem, but I guess the likely combinations are the first target.


> However I still question whether the floating point value of NaN
> actually needs to be caught at all - it would be useful, but is
> unnecessary.  It's a rare occurrence to begin with and I can easily
> modify the optimisation component of relax to break the infinite
> loops.  Therefore does anything else need to be done?  It will be
> glaringly obvious in the final results file if NaNs get through the
> analysis.  And if there is an issue in relax's handling of NaN then
> that should be classified as a relax bug.
>
>
[snip]

> NaN should propagate - however I do see one point at which they will
> disappear and that is during AIC model selection.  The AIC value will
> be NaN (NaN + 2k = NaN) and any model with a non-NaN AIC value will be
> selected.  Unless of course all model-free models for that residue are
> affected.  Yet if they do propagate the results for that residue will
> make no sense.

Agreed, and that is one reason why we need to catch NaNs (or have
properly defined comparisons on NaNs). In the current situation, if we
do model selection on both NaN and finite values, the behaviour will be
undefined. The options are to use a system that has defined NaN
comparisons (Numpy seems to fit the bill) or to remove NaN values before
model selection (another test in the model elimination, perhaps?).
Obviously this last option requires that we are able to catch NaNs in a
defined way.
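What "remove NaN values before model selection" could look like for AIC, as a sketch: the x != x test stands in for whichever NaN check is eventually adopted, and the models data structure is invented for illustration.

```python
def aic(chi2, k):
    # AIC = chi2 + 2k, so a NaN chi2 propagates to a NaN AIC.
    return chi2 + 2 * k

def select_model(models):
    """Pick the model with the lowest AIC, skipping NaN values.

    'models' maps a model name to a (chi2, k) tuple -- an invented
    structure for illustration.  Returns None if every AIC is NaN,
    which is the 'no selected model' case mentioned below.
    """
    best_name, best_aic = None, None
    for name, (chi2, k) in models.items():
        value = aic(chi2, k)
        if value != value:          # NaN: every comparison is False
            continue
        if best_aic is None or value < best_aic:
            best_name, best_aic = name, value
    return best_name

models = {'m1': (float('nan'), 1), 'm2': (10.0, 2), 'm3': (8.0, 4)}
print(select_model(models))   # m2 (AIC 14.0 beats m3's 16.0); m1 skipped
```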


I just tested it and in Python 2.1 NaN is apparently less than all
other numbers and is hence selected.  In 2.4 it's greater than all
other numbers and hence is never selected.  Therefore the model
selection code should try to catch the NaN.  But then what should we
do?  Throw a RelaxError?  Or skip these models, which brings the
notion of 'no selected model' into play and hence will require a large
rework of the code base to handle missing models?  Whether the test is
in the model elimination or model selection shouldn't matter, although
the actual model selection methodology used may dictate what should
happen with a NaN chi-squared value.  And there may be analysis
techniques added in the future which don't use model elimination (and
end users may not have even heard about model elimination).
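The version-dependent behaviour follows directly from NaN comparisons always being False: a naive minimum then depends on where the NaN sits in the sequence, not on its "value". A small demonstration (run under a current CPython, so only illustrative of the 2.1 versus 2.4 inconsistency):

```python
nan = float('nan')

# min() keeps the current candidate unless a later item compares
# smaller; every comparison against NaN is False, so the result
# depends purely on the NaN's position in the sequence.
print(min([nan, 1.0, 2.0]))   # nan (NaN seeds the minimum, never replaced)
print(min([1.0, 2.0, nan]))   # 1.0 (NaN never compares smaller)
```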

I believe though that throwing a RelaxError when NaNs occur is the
best option. That is because NaN should NEVER occur. Even though it
may cause a week long calculation to die at the very end, hence the
optimisation was for nothing, an error should still be thrown (it's
much more likely that a week long optimisation will die at the very
start). The reason for throwing a RelaxError and killing the
calculation is simple: despite the calculation running until the end,
it will need to be rerun.  If the NaN only occurs for a single
residue the entire protein (the diffusion tensor) is nevertheless
affected.


Surely not, if you skip data with NaN values?

This is because of the strong link between the diffusion
tensor parameters and the model-free parameters.  The values of one
influences the optimisation of the other and vice versa.  Therefore
the continuation of the calculation will, without doubt, produce
incorrect model-free results.

Hence throwing the RelaxError will help the user to fix the issue.
The user can either trace the error back themselves, email the
problem, or report a bug.  If an email or bug report is sent then we
can modify relax to prevent the user error in the first place.  If,
although I don't know how, the NaN is generated by relax itself rather
than user error then the user can send a bug report and we can fix the
issue.  In this situation I would insist that the bug is fixed before
the user generates any results as the results could be complete
garbage - there's just no way of knowing until we know how the NaN
arose in the first place.

To summarise my opinions:

To catch the NaN:  I think this is useful, though not necessary.
Avoiding fpconst.py as a dependency would be best.  If Numpy has a
function to catch Python native floating point values of NaN - then
migrating to Numpy is worth a go.  Otherwise migrating to Numpy isn't
an issue for this problem.

What to do when NaNs occur:  RelaxError!

Prevention vs. cure:  Well a mix.  Catch the NaN as a cure, then throw
a RelaxError.  The output can then be used to create prevention
measures.
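The catch-then-raise combination summarised above could look like this. RelaxNanError is a hypothetical name, and the self-comparison test again stands in for whichever NaN check is chosen:

```python
class RelaxNanError(Exception):
    """Hypothetical error for a NaN appearing in a computed value."""
    def __init__(self, name, value):
        Exception.__init__(
            self, "NaN caught in %s (value %r); the run must be "
                  "repeated once the cause is fixed." % (name, value))

def check_finite(name, value):
    # Catch the NaN (the cure) ...
    if value != value:
        # ... then throw the error, whose message feeds prevention.
        raise RelaxNanError(name, value)
    return value

check_finite('chi2', 42.0)           # passes through silently
try:
    check_finite('chi2', float('nan'))
except RelaxNanError as err:
    print(err)
```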

Edward



I have to say that at the end of a week-long calculation I would like to see the result. Thus, for example (this came from Chris), if a grid search failed I would personally like the residue to be deselected (and maybe a warning generated on the console) and the calculation to go on. In general I feel exceptions are a blunt tool for this sort of problem, as you lose the program state and get no results on where the calculation was going for everything other than the faulty data.


Also, for example, if a single Monte Carlo calculation produced a NaN and killed the whole run, that would be annoying in the extreme.

I can think of counter-examples: a NaN while calculating a tensor, say, should bring things to a close. But it would be nice to have a default error handler that saved state to some meaningful place.
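The deselect-and-warn alternative, with partial state saved rather than lost, might be sketched as follows; every name here is invented for illustration, and the caller supplies the real optimise and save-state machinery:

```python
import warnings

def run_all(residues, optimise, save_state):
    """Optimise each residue; deselect on NaN instead of aborting.

    'optimise' returns a chi2 value and 'save_state' persists the
    partial results somewhere meaningful -- both caller-supplied.
    """
    results, deselected = {}, []
    for res in residues:
        chi2 = optimise(res)
        if chi2 != chi2:                        # NaN check
            deselected.append(res)
            warnings.warn("NaN for residue %s; deselecting it." % res)
            continue
        results[res] = chi2
    save_state(results)                         # partial results survive
    return results, deselected

def fake_optimise(res):
    return float('nan') if res == 'R42' else 1.0

saved = {}
results, skipped = run_all(['R1', 'R42', 'R7'], fake_optimise, saved.update)
print(skipped)   # ['R42']
```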


regards gary

--
-------------------------------------------------------------------
Dr Gary Thompson
Astbury Centre for Structural Molecular Biology,
University of Leeds, Astbury Building,
Leeds, LS2 9JT, West-Yorkshire, UK             Tel. +44-113-3433024
email: garyt@xxxxxxxxxxxxxxx                   Fax  +44-113-2331407
-------------------------------------------------------------------




--- End Message ---
