Re: [bug #6503] Uncaught nan in xh_vect -- August 09, 2006

> For 1) I would prefer the NaN catching to be outside of the
> 'minimise/' directory.  It should be safe to assume that that code
> will soon not be part of relax.  As for handling NaNs within the
> minimisation code I know of no other minimisation package that does
> this - if the user sends garbage to it then returning garbage should
> be expected.  The sender and receiver code should do the cleanup.  I
> do however think that testing for NaN during optimisation (in the
> 'maths_fns' code) is too computationally expensive.  If optimisation
> terminates in a reasonable time then I don't think we should test for
> NaNs during the number crunching phase.
>

I agree with all of this. NaN handling is the job of relax proper - not
the optimisation code. The only nuance I would put on it is that if a
grid search returns a NaN, we should catch it then and take the
appropriate action, rather than proceed to the next step of the
minimisation which will necessarily entail a lot of iterations waiting
for the impossible.


Exactly, catching the output of the grid search is a brilliant idea.
I would go one further though and not only catch NaN but also catch
Inf values as well.  The question is, what do we do then?  I'll soon
send a post where my idea of a solution is presented in actual Python
code.  Feel free to post a different solution.

> I just tested it and in Python 2.1 NaN is apparently less than all
> other numbers and is hence selected.  In 2.4 it's greater than all
> other numbers and hence is never selected.  Therefore the model
> selection code should try to catch the NaN.  But then what should we
> do?  Throw a RelaxError?  Or skip these models, which brings the
> notion of 'no selected model' into play and hence will require a large
> rework of the code base to handle missing models?
>

Presumably no selected model is a possible outcome of model elimination?
Is it really not handled now? It seems to me that the way to handle it
is to deselect the effected residues and continue. I'm not quite sure
why it entails such a big change.


Actually, that's not the case.  Model elimination as implemented using
the tests presented in d'Auvergne and Gooley, (2006) JBNMR, 35(2),
117-135 will only eliminate those in which there is a correlation
time.  There will always be a model present as m1 and m3 will never be
eliminated.

relax used to handle not having a model as that was quite a common
result of using the model selection as presented in Mandle, Akke,
Palmer, (1995).  However I haven't used that technique for well over
three years in relax and I'm pretty sure that the techniques to handle
'no model' have atrophied.  It's very fixable though, especially if
someone wants to fix the Mandle model selection code.  The residue
shouldn't be deselected though as it is important to know that there
is 'no model' selected.  I think I used to say that the 'no model' was
model 'm0', yet the last couple of years I have redefined that as a
residue experiencing no internal motions.

> I believe though that throwing a RelaxError when NaNs occur is the
> best option.  That is because NaN should NEVER occur.  Even though it
> may cause a week long calculation to die at the very end, hence the
> optimisation was for nothing, an error should still be thrown (it's
> much more likely that a week long optimisation will die at the very
> start).  The reason for throwing a RelaxError and killing the
> calculation is simple.  Despite the calculation running until the end
> - it will need to be rerun.  If the NaN only occurs for a single
> residue the entire protein (the diffusion tensor) is nevertheless
> affected.  This is because of the strong link between the diffusion
> tensor parameters and the model-free parameters.  The values of one
> influences the optimisation of the other and vice versa.  Therefore
> the continuation of the calculation will, without doubt, produce
> incorrect model-free results.

I dissagree here. There are many examples I can think of where the NaN
shouldn't mix with the diffusion tensor calculation. Just one example -
if only one MF model returns NaN, then it should not be selected and
will not influence the diffusion tensor. The other point is that the
propagation behaviour of NaNs is such that if a NaN were to influence
the diffusion tensor in any way, then the effected diffusion tensor
values will be NaN (clearly this is unrecoverable, and is an appropriate
place to throw an exception).


I don't believe that solely one model-free model will have a
chi-squared value of NaN.  The sole reason for NaN is that some input
data is fatally flawed, see below.

Although I'd love to be able to agree with you that NaN should never
occur, floating point maths just isn't that cooperative. Even for
'correct' inputs, it is quite possible for a minimisation to drive a
value so small that a fp underflow occurs, then division or log of that
value will result in NaN (or INF - the two are equivalent for the
purposes of this discussion). I've never seen this with your
minimisation code, but I've certainly seen it in others (probably a
tribute to the robustness of your algorithms, but not grounds for too
much complacency)


It's not the robustness of relax's algorithms but the construction of
the model-free problem itself.   There is a numerical stabilisation of
the model-free equations which avoids NaNs and Infs that I'll present
in my next, as of yet unpublished, paper - a preview of it is in the
relax manual.  I'll go through the model-free equations, relaxation
equations, chi-squared equation, diffusion tensor equations, etc one
by one to show exactly what can cause NaN.  The only way to generate a
NaN in these equations is:
   nan = inf - inf
   nan = inf / inf
   nan = inf * 0
Fortunately logs don't exist in these equations.

First the model-free equations.  We essentially have 2 local
parameters in these equations (S2 and te) and a couple of global
correlation times (tm and equivalent in the spheroids and ellipsoids).
The extension by Clore et al., (1990) from the point of view of
generating a NaN is inconsequential - the form of the equation is the
same.  To generate a NaN the following combinations will do it:  {S2 =
0, tm = inf}, {S2 = 1, te = inf}, and {te = inf, tm = inf}.  In the
case of the spheroid and ellipsoid, just replace the tm value with the
appropriate correlation time tk where k is {-1, 0, 1} or {-2, -1, 0,
1, 2} respectively.  However the grid search will never search at inf
and using the default parameter limits the values of inf will never be
reached.  Even without the limits, it is impossible to use an
optimisation algorithm to get the correlation times to inf.  In the
model-free model elimination paper the way I forced the te and tm
values to go to inf was to manually tweak the parameters to get there
- the minimisers all terminated well before reaching that value due to
the curvature of the space and the iteration limits.  Therefore NaNs
will never come from the model-free equations (unless you manually set
a correlation time to inf using value.set()).

In the relaxation equations we have three places which can cause a
problem: the bond length, the CSA value, and the Rex value.  For the
CSA and Rex values to be problematic they have to be set to inf.  For
the CSA, that only occurs if the user types:
   value.set(run, float('inf'), 'csa')
Now that deserves a RelaxError!  The Rex value is limited by the R2,
specifically Rex < R2.  Hence an Rex value of inf is only possible if
the measured R2 is inf.  Again this deserves a RelaxError as that is
impossible.  The bond length value on the other hand can generate an
inf (and then maybe later NaN) by the user typing:
   value.set(run, 0, 'bond length')
This is unquestionably a RelaxError.

In the chi-squared equation, NaN can be generated if the measure
relaxation value is inf and the back calculated value is also inf.  An
error of zero will generate a value of inf as well.  These are all
FATAL errors - if the user sends garbage in, then a RelaxError prior
to any calculations is justified.  The user must fix their input data
before even starting the model-free analysis.

The biggest problem is the unit vector parallel to the XH bond vector.
This is used in the chi-squared calculation for the prolate spheroid,
oblate spheroid, and ellipsoid.  If the length of this vector is zero
- that will generate a NaN.  Only two scenarios can trigger this.  The
first is what occured in bug #6503 and that is where the X and H atoms
are set to the same PDB atom name.  This should be caught and a
RelaxError generated when this occurs.  The second is if the PDB file
is broken and the H and X atoms lie ontop of eachother.  This again
should be caught and a RelaxError thrown.  The position where to catch
this is when calculating the unit vector - just measure the length of
the XH bond vector and if zero, RelaxError:  The PDB file is corrupt.
Missing protons in an X-ray structure isn't a problem, relax already
catches this one.

These are the only points of failure within the model-free problem
whereby a NaN or Inf value can be generated.  This covers the
parameters and values of: S2, te, S2f, tf, S2s, ts, tm, Rex, CSA, and
the bond length.  It also covers the input relaxation data and the
input PDB file and corresponding X and H atoms.  Because of this, if
we catch the NaN and Inf after, as you said, the grid search and also
after minimisation, then throwing an error will stop the calculation
at the very start (not the fifth day of a week long calculation ;).

> To summarise my opinions:
>
> To catch the NaN:  I think this is useful, though not necessary.
> Avoiding fpconst.py as a dependency would be best.  If Numpy has a
> function to catch Python native floating point values of NaN - then
> migrating to Numpy is worth a go.  Otherwise migrating to Numpy isn't
> an issue for this problem.

I believe catching NaN is necessary for defined performance of model
selection, and useful to avoid wasting an awful lot of time minimising
NaN. I think Numpy will be useful here.


I agree.  But because the change to Numpy is disruptive it should
occur in the 1.3 line rather than the stablised 1.2 line.

> What to do when NaNs occur:  RelaxError!

RelaxError is appropriate when the NaN signals an unrecoverable state,
eg. if the diffusion tensor contains NaN. On the other hand an isolated
NaN should result in the relevant model/residue being deselected and a
warning to highlight the fact. Obviously this more context dependent
response involves more work, but I don't think it needs to be fully
implimented all at once - as you rightly point out this is a rare
occurence.


An isolated NaN will still stop the calculation at the very start, if
an error is thrown.  Then the source of the problem - garbage input -
can be fixed.  There are not many places where this can occur and we
should be able to catch and prevent them all of them.  However
catching the NaN or Inf just in case can be done after the grid search
and after minimisation.  This is relevant if someone wants to
something bizarre with the program.

Of course relax is yours, and I'm happy to recognise your benevolent
dictatorship


Everyone's input is appreciated - it makes the program better - and
this thread has been perfect for sorting out the numerous issues.  The
only sticking point I can still see with this discussion is whether to
throw an error or continue with the calculation.  As I outlined above,
the only way I can see this happening is if the input data is garbage.
If there is another way that NaN can be produced and not encountered
at the start, I'm willing to change my opinion.

Although I own the copyright on may of relax's files and can have the
final say on what goes into the repository, the program belongs to
everyone and I will accept most code (unless it is problematic or
violates someone else's copyright).  Because of the GPL all developers
and all users are the real owners of the program.  It is even possible
to take the file http://svn.gna.org/daily/relax.dump.gz and start a
new and independent fork of relax with all the revision history
preserved (although that would be very counter-productive for end
users).  Anyway, one day I may get tired and stop developing relax.
In that situation either I could appoint someone to take over or we
could write a constitution and set up a voting system to determine who
will be the leader of the relax project.

Edward

Re: [bug #6503] Uncaught nan in xh_vect

Header

Content

Related Messages