Re: bug? previous vs. current model test in full_analysis




Posted by Edward d'Auvergne on September 17, 2007 - 22:12:
Hi,

In your previous post
(https://mail.gna.org/public/relax-users/2007-09/msg00011.html,
Message-id: <59CF5ABD-2724-4547-B9E7-0F66E545F368@xxxxxxxxx>) I think
you were spot on with the diagnosis.  The reading of the results files
with None in all model positions will create variables called 'model'
with the value set to None.  Then the string comparison will fail
unless these are skipped as well.  Well done on picking up the exact
cause of the failure.  This important change will need to go into
relax before a new release with the more advanced 'full_analysis.py'
script.

If you would like to post the changes, I can add these to the
repository.  I'll need to check the change carefully first though, and
it may only require two lines in the script to be changed.  The best
way would be to attach a patch to a bug report.  If you could create a
bug report with a simple summary, that would be appreciated.  Then how
you report the changes is up to you.  If you change a checked out copy
of the SVN repository and type 'svn diff > patch', you'll get the
changes in a patch file which I can then check and commit to the
repository with your name attached.

Thanks,

Edward


On 9/17/07, Douglas Kojetin <douglas.kojetin@xxxxxxxxx> wrote:
As a followup, my changes to full_analysis.py solved my problem.  I
will clean up my code and post it within the next day or so.  Would
you prefer that I attach the script as an attachment, or inline in an
email, or provide a patch, or change the CVS code myself?

Doug


On Sep 17, 2007, at 11:48 AM, Edward d'Auvergne wrote:

Hi,

The problem is likely to be due to a circular looping around very
similar solutions close to the universal solution, as I have defined
in:

d'Auvergne EJ, Gooley PR. Set theory formulation of the model-free
problem and the diffusion seeded model-free paradigm. Mol Biosyst.
2007 Jul;3(7):483-94.

If you can't get the paper, have a look at Chapter 5 of my PhD
thesis at
http://eprints.infodiv.unimelb.edu.au/archive/00002799/.  The problem
is the interlinked mathematical optimisation and statistical model
selection where we are trying to minimise different quantities.  For
mathematical optimisation this is the chi-squared value.  For model
selection this is the quantity known as the discrepancy.  These can
not be optimised together as mathematical optimisation works in a
single fixed-dimension space or universe whereas model selection
operates across multiple spaces with different dimensions.  See the
paper for a more comprehensive description of the issue.

You should be able to see this if you look at the end of iterations.
If you have 160 iterations, look after iteration 20 (or maybe even 30
or further).  Until then, you will not have reached the circular loop.
After that point you will be able to exactly quantify this circular
loop.  You'll be able to determine its periodicity, which residues are
involved (probably only 1), and whether the diffusion tensor changes
as model selection changes.
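As an illustration only (this helper is hypothetical and not part of relax), the periodicity of such a loop could be found by recording the selected models of each round and searching backwards for a repeat:

```python
# Hypothetical sketch: find the period of a circular loop in the
# per-round model selections.  Each element of 'rounds' is the tuple of
# models selected for all residues in that round.
def find_loop_period(rounds):
    """Return the smallest p such that the final round equals the round
    p iterations earlier, or None if no repetition is found."""
    last = rounds[-1]
    for p in range(1, len(rounds)):
        if rounds[-1 - p] == last:
            return p
    return None

# Example: after the transient, two solutions alternate (period 2).
rounds = [('m2', 'm4'), ('m2', 'm1'), ('m2', 'm4'), ('m2', 'm1'), ('m2', 'm4')]
print(find_loop_period(rounds))  # -> 2
```

Comparing the members of the loop then shows which residues (and whether the diffusion tensor) actually change between rounds of the cycle.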

I mentioned all of this already in my post at
https://mail.gna.org/public/relax-users/2007-07/msg00001.html
(Message-id:
<7f080ed10707090104r60453839of0b9df58022501d2@xxxxxxxxxxxxxx>)
in response to your original post
(https://mail.gna.org/public/relax-users/2007-06/msg00004.html,
Message-id: <2ED509BE-FA07-4ED0-96F0-E4E612405DC7@xxxxxxxxx>).

I have a few more points about the tests you have done but to work out
what is happening with the printouts, it would be very useful to have
your modified 'full_analysis.py' script attached.



On 9/17/07, Douglas Kojetin <douglas.kojetin@xxxxxxxxx> wrote:
Hi,

I'm unsure if this is a bug in full_analysis.py, in the internal
relax code, or user error.  The optimization of the 'sphere' model
will not converge, now after 160+ rounds.   The chi-squared test has
converged (long, long ago):

See above.


"" from output
         Chi-squared test:
             chi2 (k-1): 100.77647517006251
             chi2 (k):   100.77647517006251
             The chi-squared value has converged.
""

However, the identical model-free models test has not converged:

"" from output
         Identical model-free models test:
             The model-free models have not converged.

         Identical parameter test:
             The model-free models haven't converged hence the
parameters haven't converged.
""

Something that confuses me is that the output files in the
round_??/aic directory suggest that, for example, the round_160 and round_161
AIC model selections are equivalent.  Here are the models for the
first few residues:

Between these 2 rounds, are you sure that all models for all residues
are identical?  From your data that you posted at
https://mail.gna.org/public/relax-users/2007-06/msg00017.html
(Message-id: <B52B2F07-8B2C-4FAE-B270-902C2A1B3619@xxxxxxxxx>), I
would guess that this is not the case and one or two residues actually
do change in their model selections.


""
         1 None None
         2 None None
         3 None None
         4 m2 m2
         5 m2 m2
         6 m2 m2
         7 m2 m2
         8 m2 m2
         9 m4 m4
         10 m1 m1
         11 None None
         12 m2 m2
         13 m2 m2
         14 m1 m1
         15 m2 m2
         16 m3 m3
         17 m3 m3
         18 None None
""

However, I modified the full_analysis.py protocol to print the
differences in the model selection, within the 'Identical model-free
model test' section of the 'convergence' definition. Here is the
beginning of the output (which only contains differences between the
previous and current rounds):

""
         residue 1: prev=None curr=m2
         residue 2: prev=None curr=m2
         residue 3: prev=None curr=m2
         residue 6: prev=m2 curr=m4
         residue 7: prev=m2 curr=m1
         residue 9: prev=m4 curr=m2
         residue 11: prev=None curr=m2
         residue 12: prev=m2 curr=m3
         residue 13: prev=m2 curr=m3
         residue 15: prev=m2 curr=m1
         residue 16: prev=m3 curr=m2
         residue 17: prev=m3 curr=m1
         residue 18: prev=None curr=m3
""

This output is quite strange.  I would need to see the
full_analysis.py script to do more with this.


There should be no data for residues 1-3, 11 and 18 (None), however
the 'Identical model-free model test' seems as if it ignores residues
for which 'None' was selected in the curr_model call in the following
code:

""
         # Create a string representation of the model-free models of
         # the previous run.
         prev_models = ''
         for i in xrange(len(self.relax.data.res['previous'])):
             if hasattr(self.relax.data.res['previous'][i], 'model'):
                 #prev_models = prev_models + self.relax.data.res['previous'][i].model
                 prev_models = prev_models + ' ' + self.relax.data.res['previous'][i].model

         # Create a string representation of the model-free models of
         # the current run.
         curr_models = ''
         for i in xrange(len(self.relax.data.res[run])):
             if hasattr(self.relax.data.res[run][i], 'model'):
                 #curr_models = curr_models + self.relax.data.res[run][i].model
                 curr_models = curr_models + ' ' + self.relax.data.res[run][i].model
""

As residues 1-3, 11 and 18 are deselected, they will not have the
attribute 'model' and hence will not be placed in the prev_models or
curr_models strings (which are then compared).
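A minimal sketch of a position-aligned comparison (hypothetical code, not the actual relax source; the stand-in Residue class is for illustration only) would emit the token 'None' both for deselected residues lacking the 'model' attribute and for residues whose model was read back as None, so both sides compare equal:

```python
# Sketch of a corrected comparison (not the actual relax code): residues
# without a 'model' attribute contribute the token 'None' instead of
# being skipped, and residues whose model is set to None produce the
# same token, so the two strings stay position-aligned.
def models_string(residues):
    tokens = []
    for res in residues:
        # getattr with a default covers both missing and None models.
        tokens.append(str(getattr(res, 'model', None)))
    return ' '.join(tokens)

# Minimal demonstration with stand-in residue objects.
class Residue:
    pass

deselected = Residue()                 # no 'model' attribute at all
from_results = Residue()
from_results.model = None              # 'model' read back as None
selected = Residue()
selected.model = 'm2'

print(models_string([deselected, selected]))    # -> None m2
print(models_string([from_results, selected]))  # -> None m2
```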


For what it's worth, I have residues 1,2,3,11 and 18 in the file
'unresolved' which is read by the full_analysis.py protocol.  I
created a separate sequence file (variable = SEQUENCE) that contains
all residues (those with data and those without), instead of using a
data file (noe data, in the default full_analysis.py file).  However,
these residues are not specified in the data (r1, r2 and noe) files,
as I did not have data for them.  Should I add them but place 'None'
in the data and error columns?  Could that be causing the problems?
Or should I create a bug report for this?

I'm not sure what you mean by the '(variable = SEQUENCE)' statement.
I would need to see the full_analysis.py script to understand this.  I
assume you mean a file whose name is given by the variable 'SEQUENCE'.
In that case, this should not be a problem.  As the data is missing
from the files containing the relaxation data, these residues will not
be used in the model-free analysis.  They will be automatically
deselected.  There is no need to add empty data for these spin systems
to the relaxation data files.  As I said before
at the top of this message and at
https://mail.gna.org/public/relax-users/2007-07/msg00001.html
(Message-id:
<7f080ed10707090104r60453839of0b9df58022501d2@xxxxxxxxxxxxxx>),
the problem is almost guaranteed to be a circular loop of equivalent
solutions circling around the universal solution - the solution
defined within the universal set (the union of all global models
(diffusion tensor + all model-free models of all residues)).  If you
have hit this circular problem, then I suggest you stop running relax.
What I would then do is identify the spin system (or systems) causing
the loop, and what the differences are between all members of the
loop.  This you can do by plotting the data and maybe using the
program diff on uncompressed versions of the results files.  It's
likely that the differences are small and inconsequential.  I hope
this helps.
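If the program diff proves awkward, the same comparison can be sketched with Python's standard difflib module (the round_160/round_161 paths below are placeholders for your real uncompressed results files):

```python
import difflib

def results_diff(prev_lines, curr_lines):
    """Return the unified diff between two rounds' results files."""
    return list(difflib.unified_diff(prev_lines, curr_lines,
                                     fromfile='previous', tofile='current'))

# Usage (placeholder paths): compare two consecutive rounds.
# prev = open('round_160/aic/results').readlines()
# curr = open('round_161/aic/results').readlines()
# print(''.join(results_diff(prev, curr)))
```

Only the differing lines are reported, which should make small, inconsequential parameter differences between loop members easy to spot.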

Regards,

Edward







Powered by MHonArc, Updated Tue Sep 18 01:21:15 2007