2009/10/9 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:
Hi Ed,

I think reporting the cause of the error (maybe stating 'n=...' and 'k=...' and saying that the dataset is too small), in addition to mentioning the use of AIC as an alternative, might give a better RelaxError message and help the user... Would something like what follows be adequate to replace line 76 of 'generic_fns/model_selection.py', in the function aicc()?

===========
if n > (k + 1.0):
    return chi2 + 2.0*k + 2.0*k*(k + 1.0) / (n - k - 1.0)
elif n == (k + 1.0):
    raise RelaxError("The dataset is too small for the model selected: n=" + str(n) + " and k=" + str(k) + ". This situation creates a fatal division by zero since:\nAICc = chi2 + 2.0*k + 2.0*k*(k + 1.0) / (n - k - 1.0).\n\nPlease try AIC model selection instead.")
I really like this idea. I would slightly reword and reformat this though (eg 'model selected' is not technically correct), so maybe:

    raise RelaxError("The size of the dataset, n=%s, is too small for this model of size k=%s. This situation causes a fatal division by zero as:\nAICc = chi2 + 2k + 2k*(k + 1) / (n - k - 1).\n\nPlease use AIC model selection instead." % (n, k))
elif n < (k + 1.0):
    raise RelaxError("The dataset is too small for the model selected: n=" + str(n) + " and k=" + str(k) + ". This situation gives a negative (nonsense) AICc score since:\nAICc = chi2 + 2.0*k + 2.0*k*(k + 1.0) / (n - k - 1.0).\n\nPlease try AIC model selection instead.")
Also for this error:

    raise RelaxError("The size of the dataset, n=%s, is too small for this model of size k=%s. This situation produces a negative, and hence nonsense, AICc score as:\nAICc = chi2 + 2k + 2k*(k + 1) / (n - k - 1).\n\nPlease use AIC model selection instead." % (n, k))

Just some small wording changes. But I like the idea of two separate errors to clearly explain to the user what went wrong! Would you like to make these changes to the code?

Cheers,

Edward
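Putting the two guards discussed above together, a self-contained sketch of the patched aicc() might look as follows. Note this is only a sketch: the RelaxError class defined here is a minimal stand-in so the example runs on its own; in relax itself the real exception class would be imported from the program, and the message wording is the one proposed above.

```python
# Minimal stand-in for relax's RelaxError, so this sketch is self-contained.
class RelaxError(Exception):
    pass


def aicc(chi2, k, n):
    """Akaike's Information Criterion corrected for small sample sizes.

    AICc = chi2 + 2k + 2k(k + 1) / (n - k - 1), which is only defined for
    n > k + 1:  n == k + 1 divides by zero, and n < k + 1 yields a negative,
    meaningless score.
    """
    if n > k + 1.0:
        return chi2 + 2.0*k + 2.0*k*(k + 1.0) / (n - k - 1.0)
    elif n == k + 1.0:
        # Fatal division by zero case.
        raise RelaxError("The size of the dataset, n=%s, is too small for this model of size k=%s. This situation causes a fatal division by zero as:\nAICc = chi2 + 2k + 2k*(k + 1) / (n - k - 1).\n\nPlease use AIC model selection instead." % (n, k))
    else:
        # Negative denominator case, giving a nonsense negative score.
        raise RelaxError("The size of the dataset, n=%s, is too small for this model of size k=%s. This situation produces a negative, and hence nonsense, AICc score as:\nAICc = chi2 + 2k + 2k*(k + 1) / (n - k - 1).\n\nPlease use AIC model selection instead." % (n, k))
```

With the n=3 dataset from the original report, model m1 (k=1) still yields its score of 8.27420, while the k=2 and k=3 models now raise an explanatory RelaxError instead of crashing with a ZeroDivisionError or being ranked first on a negative score.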
===========

What do you think?

Best regards,

Séb


Edward d'Auvergne wrote:

Hi,

The idea was that the use of single field strength data would cause a non-fatal RelaxWarning at the point of optimisation (the grid search, minimisation, and maybe again during model selection). The fatal RelaxError would occur only if AICc is being used as, although this is designed for small data sets, such a ridiculously small data set was never envisaged. Even two field strength data is probably smaller than what the AICc authors were designing for. In this case, AIC model selection would be better and could be mentioned in the RelaxError message.

Regards,

Edward

2009/10/8 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:

Hi Ed,

I agree with you, but I am thinking about one potential problem with that... When a new user gets an error preventing him from using this new program, he might feel discouraged, give up trying this wonderful new program, and instead stick to the old deprecated program he is used to working with... What I mean is that the warnings should absolutely be there, but should not prevent the users from doing an analysis... Is there a way we could include strong warnings (which might be written to the results file, or included in a second results file called something like 'Warnings_you_really_should_take_into_account'), without having relax crash?

Séb

Edward d'Auvergne wrote:

Hi,

Firstly I would prefer to throw a relax warning along the lines of:

RelaxWarning: Using single field strength data is bad - very, very BAD! Go read the literature to find out why, and don't even think about publishing the resultant nonsense!

if someone does something like this :P  Seriously though, I am considering something along those lines but maybe not so harsh. For those who continue to insist, I can catch these issues for people using AICc. I would recommend throwing a RelaxError in this case as it is 100% fatal for model selection. What do you think?
Regards,

Edward

2009/10/8 Sébastien Morin <sebastien.morin.1@xxxxxxxxx>:

Hi,

I recently used the script 'palmer.py' with a single magnetic field dataset (n=3) and tested AICc model selection (during stage 2). I faced a problem of division by zero for models with two parameters (such as models 'm2' and 'm3') since:

AICc = chi2 + 2.0*k + 2.0*k*(k + 1.0) / (n - k - 1.0)

Also, when models had 3 parameters, the division was by -1, which yielded negative AICc scores that relax ranked very well based on their very small values... The errors appeared as follows:

=================================
Model-free model of spin ':28&:GLU'.

Data pipe    Num_params_(k)    Num_data_sets_(n)    Chi2         Criterion
m5           3                 3                    2.16490      -15.83510
m4           3                 3                    2.27420      -15.72580
m1           1                 3                    2.27420      8.27420

Traceback (most recent call last):
  File "/home/semor/pse-4/collaborations/relax/relax-1.3.4/relax", line 418, in <module>
    Relax()
  File "/home/semor/pse-4/collaborations/relax/relax-1.3.4/relax", line 127, in __init__
    self.interpreter.run(self.script_file)
  File "/home/semor/pse-4/collaborations/relax/relax-1.3.4/prompt/interpreter.py", line 276, in run
    return run_script(intro=self.__intro_string, local=self.local, script_file=script_file, quit=self.__quit_flag, show_script=self.__show_script, raise_relax_error=self.__raise_relax_error)
  File "/home/semor/pse-4/collaborations/relax/relax-1.3.4/prompt/interpreter.py", line 537, in run_script
    return console.interact(intro, local, script_file, quit, show_script=show_script, raise_relax_error=raise_relax_error)
  File "/home/semor/pse-4/collaborations/relax/relax-1.3.4/prompt/interpreter.py", line 433, in interact_script
    execfile(script_file, local)
  File "./palmer.py", line 166, in <module>
  File "./palmer.py", line 118, in exec_stage_2
  File "/home/semor/pse-4/collaborations/relax/relax-1.3.4/prompt/model_selection.py", line 132, in model_selection
    model_selection.select(method=method, modsel_pipe=modsel_pipe, pipes=pipes)
  File "/home/semor/pse-4/collaborations/relax/relax-1.3.4/generic_fns/model_selection.py", line 273, in select
    crit = formula(chi2, float(k), float(n))
  File "/home/semor/pse-4/collaborations/relax/relax-1.3.4/generic_fns/model_selection.py", line 76, in aicc
    return chi2 + 2.0*k + 2.0*k*(k + 1.0) / (n - k - 1.0)
ZeroDivisionError: float division
=================================

I think it might be useful if there could be a warning message indicating when overfitting happens (division by zero or by a negative number). Also, if a division by zero occurs, the AICc score should be marked as something like 'NA (0)'. Moreover, when the division is by a negative number, the AICc score should be marked as something like 'NA (1)', with the number in parentheses indicating the actual overfitting fold... Of course, any 'NA' score should be prevented from serving as a model selector, i.e. no models should be selected using such a score... These improvements could be useful to people living on the edge of overfitting (single field data, for example), but could also serve when multiple field data was acquired but a few residues have only data at one field (due to magnetic field dependent peak overlap, for example)...

What do you think?

Séb :)

--
Sébastien Morin
PhD Student
S. Gagné NMR Laboratory
Université Laval & PROTEO
Québec, Canada

_______________________________________________
relax (http://nmr-relax.com)

This is the relax-devel mailing list
relax-devel@xxxxxxx

To unsubscribe from this list, get a password reminder, or change your subscription options, visit the list information page at https://mail.gna.org/listinfo/relax-devel