Re: [Rdkit-discuss] how to come to a good model

Stiefl, Nikolaus Mon, 10 Oct 2011 01:49:06 -0700

Hi Paul,

I agree with Greg's comment: if this is for a hERG model, you are unlikely to 
get a reasonable model from physchem properties alone. 
I guess information on ionizability could help.

Another thing to test, in case you do need a 3-class model, would be to 
move your class borders so that you only predict the really likely and really 
unlikely molecules, which increases the size of your "I have no idea" class.
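
A minimal sketch of the border shift described above, in plain Python. The
activity scale, the values, and both pairs of borders are made up for
illustration; the thread does not say which measure or cutoffs were used.

```python
from collections import Counter

def assign_class(value, lower, upper):
    """Return 0 below `lower`, 2 at or above `upper`, else 1 ("I have no idea")."""
    if value < lower:
        return 0
    if value >= upper:
        return 2
    return 1

# Hypothetical activity values on an arbitrary scale (not data from this thread).
values = [4.2, 4.8, 5.1, 5.4, 5.6, 5.9, 6.3, 6.7]

# Original borders: a narrow middle class.
narrow = [assign_class(v, 5.0, 6.0) for v in values]
# Shifted borders: only the clear cases stay in classes 0 and 2.
wide = [assign_class(v, 4.5, 6.5) for v in values]

print("narrow middle class:", Counter(narrow))
print("wide middle class:  ", Counter(wide))
```

The price is that fewer compounds get a definite prediction, and the classes
are no longer balanced, which may need to be accounted for in training.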

Hope this helps
Nik

-----Original Message-----
From: Greg Landrum [mailto:[email protected]] 
Sent: Saturday, October 08, 2011 4:21 PM
To: [email protected]
Cc: [email protected]
Subject: Re: [Rdkit-discuss] how to come to a good model

On Fri, Oct 7, 2011 at 12:31 PM,  <[email protected]> wrote:
>
> Dear RDKitters,
>
> I'm in the process of training a 3-class decision tree model. I have roughly
> 1500 compounds with an almost equal distribution of the 3 classes.

<snip>

>
> In all cases, the statistics are really bad: about 50 percent are
> misclassified, e.g.:
> "
>         *** Vote Results ***
> misclassified: 580/1180 (%49.15)        580/1180 (%49.15)
>
> average correct confidence:    0.7837
> average incorrect confidence:  0.7528
> "
>
> Interestingly, there is only a really small difference between the average
> confidence levels for the correct and the incorrect classifications.
> As far as I understand, this tells me that the model is really bad, which
> I already knew from the vote results themselves.
>
>
> Which parameters are worthwhile to test?

We talked about this at the Knime OSD meeting already, but I think
it's worth repeating for the community: I believe that prediction of
hERG binding is too challenging for simple descriptors like the
physicochemical descriptors the RDKit provides or the standard Morgan
fingerprint. This is particularly true if you're trying to build a
three-class model (which is much more difficult than a two-class
model).

One suggestion would be to try a two-class model (either combine
two of your classes or use only classes 0 and 2 in the
training) and see if that helps. Another would be to try using different
descriptors. You might be able to get something useful with the
FeatMorgan fingerprints (similar to the FCFP fingerprints).
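
A rough sketch of both suggestions, under stated assumptions. The two
relabelling helpers are plain Python; which classes to merge is a guess, since
the message does not say. The fingerprint helper assumes RDKit is installed
and uses AllChem.GetMorganFingerprintAsBitVect with useFeatures=True; the
radius and bit count are arbitrary choices, and newer RDKit releases may
prefer a different fingerprint API. The SMILES and labels are placeholders.

```python
def merge_classes(labels, merge=(1, 2)):
    """Option 1: fold the classes in `merge` into one class (1); the rest become 0."""
    return [1 if y in merge else 0 for y in labels]

def keep_extremes(samples, labels):
    """Option 2: train only on classes 0 and 2, relabelled as 0 and 1."""
    kept = [(x, 0 if y == 0 else 1) for x, y in zip(samples, labels) if y in (0, 2)]
    return [x for x, _ in kept], [y for _, y in kept]

def feat_morgan_fp(smiles, radius=2, n_bits=2048):
    """Feature-based Morgan (FCFP-like) bit vector; imports RDKit only when called."""
    from rdkit import Chem
    from rdkit.Chem import AllChem
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("could not parse SMILES: %r" % smiles)
    return AllChem.GetMorganFingerprintAsBitVect(
        mol, radius, nBits=n_bits, useFeatures=True)

# Placeholder molecules and class labels, not data from this thread.
smiles = ["CCO", "c1ccccc1", "CCN", "CC(=O)O"]
labels = [0, 1, 2, 0]

merged = merge_classes(labels)
extreme_smiles, extreme_labels = keep_extremes(smiles, labels)
print(merged)
print(extreme_smiles, extreme_labels)
```

Calling feat_morgan_fp on each SMILES would then give the descriptor vectors
to feed into whichever learner is being trained.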

-greg

_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
