Hi everybody, hope things are going well. I wanted to send a quick update on our service (the last one may have been many months ago) and share some of what we learned while building it (it's Python/Django-based). I also have a small recruiting pitch at the end of the note :)
I know there has been quite a bit of discussion on this mailing list around machine learning, AI, and other statistics-based approaches to text and data mining. I wanted to share our experience building our service ( http://www.orglex.com ).

As some of you may know from my previous emails, we aggregate all types of content (e.g. News, Blogs, Jobs, People, etc.) focused on Industries and Organizations (you can see a hub here: http://www.orglex.com/hubs/clinical-trials ). Our content aggregation is completely automated and very relevant to the topic at hand (see another example at www.orglex.com/hubs/venture-capital ). While we believe we have achieved a far greater degree of relevance than anything existing in the market, we still need to make a lot of progress.

We also white-label our aggregated content, and we have a very important validation from one of the leading technology and venture capital blog networks, VentureBeat: http://www.venturebeat.com/vc-news/ . Additionally, traffic to our site has been growing quite fast over the past few months.

To achieve this level of relevance, we experimented with, implemented, and iterated on many purely algorithmic techniques (e.g. TF-IDF, Bayesian methods, etc.) to assign and tag our content. However, we were not satisfied with the relevance of the results and had to adopt a hybrid approach. One of the issues with a purely algorithmic approach is that it works broadly for generic content (e.g. using topical/document similarities, see http://blogoscoped.com/archive/2006-07-28-n49.html ), but its relevance decays for narrow topics.

Our platform has many pieces to it, but some of the important elements are:

=>We utilize industry-specific semantic ontologies to help the system appropriately understand the content. This is classic semantic web stuff ( http://www.semanticweb.com/article.php/3721831 ).
=>We understand the importance of relevant sources within an industry and weight content appropriately based on the source it comes from.

Given this hybrid approach, we are able to keep relevance very high and still automate the process.

We are still a small team, but given all this exciting progress, we are looking to expand our team by 1-2 more people. We have a preference for people with 0-2 years of experience, but won't hold experience against folks :)

Looking forward to hearing feedback from folks.

Nik
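
P.S. A minimal sketch, in Python, of how the hybrid scoring described above could be put together: a TF-IDF term weight, restricted to terms from an industry ontology, then scaled by a per-source weight. This is an illustration only and not the actual Orglex implementation; the ontology terms, the source names and weights, the way the two signals are blended, and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical ontology: hub -> weighted domain terms (illustrative values only).
ONTOLOGY = {
    "clinical-trials": {"trial": 1.0, "placebo": 0.8, "fda": 0.6, "patients": 0.5},
}

# Hypothetical per-source weights; sources not listed get the default.
SOURCE_WEIGHTS = {"clinicaltrials.example": 1.0, "randomblog.example": 0.3}
DEFAULT_SOURCE_WEIGHT = 0.5


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?()").lower() for w in text.split())
    return [w for w in words if w]


def tfidf(doc_tokens, corpus_token_sets):
    """Return {term: tf-idf weight} for one document against a small corpus."""
    n_docs = len(corpus_token_sets)
    weights = {}
    for term, count in Counter(doc_tokens).items():
        doc_freq = sum(1 for tokens in corpus_token_sets if term in tokens)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed idf
        weights[term] = (count / len(doc_tokens)) * idf
    return weights


def hybrid_score(text, source, hub, corpus):
    """Ontology-restricted TF-IDF score, scaled by the weight of the source."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    weights = tfidf(tokens, [set(tokenize(doc)) for doc in corpus])
    # Only terms present in the hub's ontology contribute to the topical score.
    topical = sum(weights.get(term, 0.0) * w for term, w in ONTOLOGY[hub].items())
    return topical * SOURCE_WEIGHTS.get(source, DEFAULT_SOURCE_WEIGHT)


if __name__ == "__main__":
    corpus = [
        "The trial enrolled patients and used a placebo.",
        "Venture capital firms raised a new fund.",
    ]
    for source in ("clinicaltrials.example", "randomblog.example"):
        print(source, hybrid_score(corpus[0], source, "clinical-trials", corpus))
```

In this sketch the same article scores higher when it comes from a more heavily weighted source, and an article with no ontology terms scores zero regardless of where it was published.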
_______________________________________________
BangPypers mailing list
BangPypers@python.org
http://mail.python.org/mailman/listinfo/bangpypers