Re: [bug #6503] Uncaught nan in xh_vect -- August 09, 2006

On Wed, 2006-08-09 at 13:41 +1000, Edward d'Auvergne wrote:

Although I'd love to be able to agree with you that NaN should never
occur, floating point maths just isn't that cooperative. Even for
'correct' inputs, it is quite possible for a minimisation to drive a
value so small that a fp underflow occurs, then division or log of that
value will result in NaN (or INF - the two are equivalent for the
purposes of this discussion). I've never seen this with your
minimisation code, but I've certainly seen it in others (probably a
tribute to the robustness of your algorithms, but not grounds for too
much complacency)


It's not the robustness of relax's algorithms but the construction of
the model-free problem itself.   There is a numerical stabilisation of
the model-free equations which avoids NaNs and Infs that I'll present
in my next, as of yet unpublished, paper - a preview of it is in the
relax manual.  I'll go through the model-free equations, relaxation
equations, chi-squared equation, diffusion tensor equations, etc one
by one to show exactly what can cause NaN.  The only way to generate a
NaN in these equations is:
    nan = inf - inf
    nan = inf / inf
    nan = inf * 0
Fortunately logs don't exist in these equations.

First the model-free equations.  We essentially have 2 local
parameters in these equations (S2 and te) and a couple of global
correlation times (tm and equivalent in the spheroids and ellipsoids).
 The extension by Clore et al., (1990) from the point of view of
generating a NaN is inconsequential - the form of the equation is the
same.  To generate a NaN the following combinations will do it:  {S2 =
0, tm = inf}, {S2 = 1, te = inf}, and {te = inf, tm = inf}.  In the
case of the spheroid and ellipsoid, just replace the tm value with the
appropriate correlation time tk where k is {-1, 0, 1} or {-2, -1, 0,
1, 2} respectively.  However the grid search will never search at inf
and using the default parameter limits the values of inf will never be
reached.  Even without the limits, it is impossible to use an
optimisation algorithm to get the correlation times to inf.  In the
model-free model elimination paper the way I forced the te and tm
values to go to inf was to manually tweak the parameters to get there
- the minimisers all terminated well before reaching that value due to
the curvature of the space and the iteration limits.  Therefore NaNs
will never come from the model-free equations (unless you manually set
a correlation time to inf using value.set()).

In the relaxation equations we have three places which can cause a
problem: the bond length, the CSA value, and the Rex value.  For the
CSA and Rex values to be problematic they have to be set to inf.  For
the CSA, that only occurs if the user types:
    value.set(run, float('inf'), 'csa')
Now that deserves a RelaxError!  The Rex value is limited by the R2,
specifically Rex < R2.  Hence an Rex value of inf is only possible if
the measured R2 is inf.  Again this deserves a RelaxError as that is
impossible.  The bond length value on the other hand can generate an
inf (and then maybe later NaN) by the user typing:
    value.set(run, 0, 'bond length')
This is unquestionably a RelaxError.

In the chi-squared equation, NaN can be generated if the measure
relaxation value is inf and the back calculated value is also inf.  An
error of zero will generate a value of inf as well.  These are all
FATAL errors - if the user sends garbage in, then a RelaxError prior
to any calculations is justified.  The user must fix their input data
before even starting the model-free analysis.

The biggest problem is the unit vector parallel to the XH bond vector.
 This is used in the chi-squared calculation for the prolate spheroid,
oblate spheroid, and ellipsoid.  If the length of this vector is zero
- that will generate a NaN.  Only two scenarios can trigger this.  The
first is what occured in bug #6503 and that is where the X and H atoms
are set to the same PDB atom name.  This should be caught and a
RelaxError generated when this occurs.  The second is if the PDB file
is broken and the H and X atoms lie ontop of eachother.  This again
should be caught and a RelaxError thrown.  The position where to catch
this is when calculating the unit vector - just measure the length of
the XH bond vector and if zero, RelaxError:  The PDB file is corrupt.
Missing protons in an X-ray structure isn't a problem, relax already
catches this one.

These are the only points of failure within the model-free problem
whereby a NaN or Inf value can be generated.  This covers the
parameters and values of: S2, te, S2f, tf, S2s, ts, tm, Rex, CSA, and
the bond length.  It also covers the input relaxation data and the
input PDB file and corresponding X and H atoms.  Because of this, if
we catch the NaN and Inf after, as you said, the grid search and also
after minimisation, then throwing an error will stop the calculation
at the very start (not the fifth day of a week long calculation ;).


I'm resonably convinced by this as far as it goes - NaN in Chi2 almost
certainly reflects a serious error in the input, and will likely be
evident from the very begining of the calculation. As such I no longer
have concerns about Chi2 == NaN raising RelaxError. The only exception
would be the case of a corrupt PDB file, where not all residues are
affected. The current behaviour for missing protons is to print a
warning, deselect the affected residue, and continue. I believe a
similar approach should also apply in the case of bad proton positions.

I'm still not sure that other variables won't become NaN at some point.
One example might be internal variables in the optimisation routines. I
can't give specific examples, but concievably might become NaN or INF as
a result of under- or overflow. Clearly this means that this specific
optimisation will fail, but its not clear to me that it is necessarily a
globally fatal error. 


Chris

To summarise my opinions:

To catch the NaN:  I think this is useful, though not necessary.
Avoiding fpconst.py as a dependency would be best.  If Numpy has a
function to catch Python native floating point values of NaN - then
migrating to Numpy is worth a go.  Otherwise migrating to Numpy isn't
an issue for this problem.


I believe catching NaN is necessary for defined performance of model
selection, and useful to avoid wasting an awful lot of time minimising
NaN. I think Numpy will be useful here.


I agree.  But because the change to Numpy is disruptive it should
occur in the 1.3 line rather than the stablised 1.2 line.

What to do when NaNs occur:  RelaxError!


RelaxError is appropriate when the NaN signals an unrecoverable state,
eg. if the diffusion tensor contains NaN. On the other hand an isolated
NaN should result in the relevant model/residue being deselected and a
warning to highlight the fact. Obviously this more context dependent
response involves more work, but I don't think it needs to be fully
implimented all at once - as you rightly point out this is a rare
occurence.


An isolated NaN will still stop the calculation at the very start, if
an error is thrown.  Then the source of the problem - garbage input -
can be fixed.  There are not many places where this can occur and we
should be able to catch and prevent them all of them.  However
catching the NaN or Inf just in case can be done after the grid search
and after minimisation.  This is relevant if someone wants to
something bizarre with the program.

Of course relax is yours, and I'm happy to recognise your benevolent
dictatorship


Everyone's input is appreciated - it makes the program better - and
this thread has been perfect for sorting out the numerous issues.  The
only sticking point I can still see with this discussion is whether to
throw an error or continue with the calculation.  As I outlined above,
the only way I can see this happening is if the input data is garbage.
 If there is another way that NaN can be produced and not encountered
at the start, I'm willing to change my opinion.

Although I own the copyright on may of relax's files and can have the
final say on what goes into the repository, the program belongs to
everyone and I will accept most code (unless it is problematic or
violates someone else's copyright).  Because of the GPL all developers
and all users are the real owners of the program.  It is even possible
to take the file http://svn.gna.org/daily/relax.dump.gz and start a
new and independent fork of relax with all the revision history
preserved (although that would be very counter-productive for end
users).  Anyway, one day I may get tired and stop developing relax.
In that situation either I could appoint someone to take over or we
could write a constitution and set up a voting system to determine who
will be the leader of the relax project.

Edward

Re: [bug #6503] Uncaught nan in xh_vect

Header

Content

Related Messages