Dear Greg,

> >
> > when training a solubility model (see
> > http://code.google.com/p/rdkit/wiki/TrainAThreeClassSolubilityModel),
> > I run into the problem that three different confusion matrices are
> > output.
> >
> > I wonder what the origin of these confusion matrices is. Even though the
> > x- and y-axes might be mixed up, the diagonal entries should always be
> > the same. Thus, confusion matrices make me confused...
>
> nicely put. :-)
>
> Here's the story:
> The first confusion matrix, the one that results from calling:
> ScreenComposite.ShowVoteResults(range(len(pts)), pts, cmp, 3,
> 0,errorEstimate=True)
> contains the out-of-bag predictions of the composite model for your
> points.
>
> The next one, which comes from this bit of code:
> t = BuildSigTree(pts,nPossibleRes=3,maxDepth=3)
>
> # simple results report:
> confusionMat=numpy.zeros((3,3),int)
> for pt in pts:
>     confusionMat[pt[-1]][t.ClassifyExample(pt)]+=1
> print(confusionMat)
>
> is actually the confusion matrix for a single decision tree. It has
> essentially no connection to the first one at all.
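
The row/column convention used in the loop above can be sketched in plain Python, without RDKit. The function names and the toy labels here are made up for illustration only. Rows are indexed by the true class and columns by the predicted class, so swapping the axes transposes the matrix but leaves the diagonal unchanged:

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Count (true, predicted) pairs; rows = true class, columns = predicted class."""
    mat = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        mat[t][p] += 1
    return mat

def diagonal(mat):
    """Number of correctly classified points per class."""
    return [mat[i][i] for i in range(len(mat))]

def transpose(mat):
    """Swap the roles of the two axes."""
    return [list(row) for row in zip(*mat)]

# Hypothetical labels for seven points in three classes.
true_labels = [0, 0, 1, 1, 2, 2, 2]
predicted = [0, 1, 1, 1, 2, 0, 2]

mat = confusion_matrix(true_labels, predicted, 3)
print(mat)
print(diagonal(mat), diagonal(transpose(mat)))
```

If two matrices for the same points disagree on the diagonal, they cannot differ only by axis order; they must come from different sets of predictions.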

Thanks for your explanation (I have added it to the Wiki). Now I am a
little less confused, but some confusion remains.

I have also added the code I use to generate "my own confusion matrix" to
the Wiki. As far as I understand, my function uses the out-of-bag
predictions. But I guess that I have overlooked some nasty detail.

Cheers & Thanks,
Paul


P.S.: When comparing the results with a PipelinePilot-based Bayesian
categorization model (ECFP_4 & standard settings), I'm surprised to see
that the PipelinePilot model is significantly better. I thought that the
MorganFingerprints are comparable to the ECFPs and would have assumed that
the model quality is in a similar range.


>
> The last one I can't help with because you don't show the code where
> you assign the "solu_class" property to the molecules that go in the
> SD file. If I had to guess, and it's just a guess, I would bet that
> you calculated it by having the composite generate a prediction for
> each point in your training set, but not using the out-of-bag error
> estimate. This generates a better confusion matrix since it's testing
> using the training set, but it's not a whole lot better since there's
> not a huge amount of overfitting going on.
>
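
The difference between out-of-bag and training-set predictions described above can be illustrated with a toy bagged ensemble in plain Python. This is only a sketch under stated assumptions: it does not use RDKit and is not how the composite model is implemented. The 1-nearest-neighbour base learner, the data, and all function names are invented for the example. For each point, the out-of-bag vote uses only the models whose bootstrap sample did not contain that point, while the training-set vote uses every model:

```python
import random
from collections import Counter

# Toy 1-D data set: (x, class) with three classes and a few deliberately noisy labels.
points = [(float(x), x // 10) for x in range(30)]
for i, noisy_label in ((4, 1), (15, 2), (27, 0)):
    points[i] = (points[i][0], noisy_label)

def make_1nn(sample):
    """Return a 1-nearest-neighbour classifier built from a bootstrap sample."""
    def classify(x):
        return min(sample, key=lambda pt: abs(pt[0] - x))[1]
    return classify

def build_bag(points, n_models=25, seed=42):
    """Train n_models classifiers, remembering which point indices each one saw."""
    rng = random.Random(seed)
    n = len(points)
    bag = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        bag.append((set(idx), make_1nn([points[i] for i in idx])))
    return bag

def vote(bag, points, i, out_of_bag):
    """Majority vote for point i; optionally skip models trained on point i."""
    x = points[i][0]
    votes = [clf(x) for seen, clf in bag if not (out_of_bag and i in seen)]
    return Counter(votes).most_common(1)[0][0] if votes else None

def confusion(bag, points, n_classes, out_of_bag):
    """Rows = true class, columns = predicted class."""
    mat = [[0] * n_classes for _ in range(n_classes)]
    for i, (_, label) in enumerate(points):
        pred = vote(bag, points, i, out_of_bag)
        if pred is not None:  # a point seen by every model has no OOB prediction
            mat[label][pred] += 1
    return mat

bag = build_bag(points)
resub_mat = confusion(bag, points, 3, out_of_bag=False)
oob_mat = confusion(bag, points, 3, out_of_bag=True)
print("training-set predictions:", resub_mat)
print("out-of-bag predictions:  ", oob_mat)
```

Because models that have seen a point tend to reproduce its label, the training-set matrix will usually look more favourable than the out-of-bag one, and the two will generally not share a diagonal.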



_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
