Hi Paul,

I agree with Greg's comment: if this is for a hERG model, you are unlikely to get a reasonable model from physchem properties alone. I guess information on ionizability could help.
Another thing to test - in case you need a 3-class model - would be to move your class borders so that you only predict the really likely and really unlikely molecules, and increase the size of your "I have no idea" class.

Hope this helps
Nik

-----Original Message-----
From: Greg Landrum [mailto:[email protected]]
Sent: Saturday, October 08, 2011 4:21 PM
To: [email protected]
Cc: [email protected]
Subject: Re: [Rdkit-discuss] how to come to a good model

On Fri, Oct 7, 2011 at 12:31 PM, <[email protected]> wrote:
>
> Dear RDKitters,
>
> I'm in the process of training a 3-class decision tree model. I have roughly
> about 1500 compounds with an almost equal distribution of the 3 classes.
<snip>
>
> In all cases, the statistics is really bad: about 50 percent are
> misclassified, e.g.:
> "
> *** Vote Results ***
> misclassified: 580/1180 (%49.15) 580/1180 (%49.15)
>
> average correct confidence: 0.7837
> average incorrect confidence: 0.7528
> "
>
> Interestingly, there is a really small difference between the average
> confidence level for the correct as well as the incorrect classifications.
> As far as I got it this tells me that the model is really bad - an
> information I already got by the vote results themselves.
>
>
> Which parameters are worthwhile to test?

We talked about this at the Knime OSD meeting already, but I think it's worth repeating for the community: I believe that prediction of hERG binding is too challenging for simple descriptors like the physicochemical descriptors the RDKit provides or the standard Morgan fingerprint. This is particularly true if you're trying to build a three-class model (which is much more difficult than a two-class model).

One suggestion would be to try a two-class model (either combine two of your classes or use only classes 0 and 2 in the training) and see if that helps. Another would be to try using different descriptors.
You might be able to get something useful with the FeatMorgan fingerprints (similar to the FCFP fingerprints).

-greg

_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
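
The relabelling ideas from this thread (merge two classes, train only on classes 0 and 2, or widen the "I have no idea" class) could be sketched roughly as below. This is only an illustration in plain Python, not anything from the RDKit itself: the function names, the choice of which classes to merge, and the activity cutoffs in the usage lines are all assumptions made up for the example.

```python
from typing import List, Sequence, Tuple


def merge_classes(labels: Sequence[int], merge: Tuple[int, int] = (1, 2)) -> List[int]:
    """Collapse a 3-class labelling into 2 classes.

    Labels listed in `merge` become class 1, the remaining label becomes 0.
    Which pair to merge is an assumption; try both options.
    """
    return [1 if y in merge else 0 for y in labels]


def keep_extremes(rows: Sequence, labels: Sequence[int]) -> Tuple[list, List[int]]:
    """Drop the middle class (1) and relabel classes 0 and 2 as 0 and 1."""
    kept = [(x, 0 if y == 0 else 1) for x, y in zip(rows, labels) if y in (0, 2)]
    return [x for x, _ in kept], [y for _, y in kept]


def widen_middle(values: Sequence[float], low: float, high: float) -> List[int]:
    """Re-bin raw activity values with a wider 'no idea' band.

    Values below `low` get class 0, values above `high` get class 2,
    everything in between (inclusive) gets the uncertain class 1.
    """
    return [0 if v < low else 2 if v > high else 1 for v in values]


# Hypothetical toy data, only to show the calls.
labels = [0, 1, 2, 2, 0, 1]
names = ["a", "b", "c", "d", "e", "f"]
two_class = merge_classes(labels)
extreme_rows, extreme_labels = keep_extremes(names, labels)
rebinned = widen_middle([4.2, 5.1, 5.6, 6.3], low=5.0, high=6.0)
```

The relabelled lists can then be fed to whatever model builder you are already using; only the class assignment changes, the descriptors stay the same.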

