Calculating a IDF value for a document collection

2013-01-15 Thread Kasun Perera
I have a set of documents separated into doc_sections (d), which are in turn separated into (n) sentences. There is an ontology that I’m using to calculate the similarity between definitions of ontology terms and doc_sections. The documents are indexed at sentence level, so each sentence is a docu
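As a rough illustration of the setup described above (each sentence treated as one document), here is a minimal plain-Java sketch of computing document frequency and IDF. It assumes the textbook form idf = ln(N / df), which is not necessarily the formula the poster or Lucene's similarity classes use. The class name `IdfSketch`, its methods, and the sample sentences are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: IDF over a collection where each sentence is one document.
class IdfSketch {

    // Count, for each term, the number of sentences that contain it at least once.
    static Map<String, Integer> docFreq(List<String> sentences) {
        Map<String, Integer> df = new HashMap<>();
        for (String sentence : sentences) {
            // A set ensures a term is counted once per sentence.
            Set<String> unique = new HashSet<>(Arrays.asList(sentence.toLowerCase().split("\\s+")));
            for (String term : unique) {
                if (!term.isEmpty()) {
                    df.merge(term, 1, Integer::sum);
                }
            }
        }
        return df;
    }

    // Textbook IDF (an assumption): ln(numDocs / df). Rarer terms score higher.
    static double idf(int numDocs, int df) {
        return Math.log((double) numDocs / df);
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "gene expression in yeast",
                "yeast cell cycle",
                "protein binding");
        Map<String, Integer> df = docFreq(sentences);
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            System.out.println(e.getKey() + " df=" + e.getValue()
                    + " idf=" + idf(sentences.size(), e.getValue()));
        }
    }
}
```

With this form, a term that occurs in every sentence gets an IDF of 0, and the value grows as the term becomes rarer.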

Re: Lucene-MoreLikethis

2013-01-15 Thread Jack Krupansky
There are lots of parameters you can adjust, but the defaults essentially assume that you have a fairly large corpus and aren't interested in low-frequency terms. So, try MoreLikeThis#setMinDocFreq. The default is 5. You don't have any terms in your example with a doc freq over 2. Also, try
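To see why a minimum document frequency of 5 leaves nothing to query on when no term has a doc freq over 2, the filtering idea can be sketched in plain Java. This only illustrates the concept; it is not Lucene's MoreLikeThis code, and the class `MinDocFreqSketch`, its method, and the sample frequencies are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a minimum-document-frequency cutoff.
class MinDocFreqSketch {

    // Keep only terms whose document frequency is at least minDocFreq.
    static List<String> keepTerms(Map<String, Integer> docFreqs, int minDocFreq) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            if (e.getValue() >= minDocFreq) {
                kept.add(e.getKey());
            }
        }
        Collections.sort(kept); // stable, readable output order
        return kept;
    }

    public static void main(String[] args) {
        // Made-up frequencies for a tiny index: no term appears in more than 2 documents.
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("lucene", 2);
        docFreqs.put("index", 1);
        docFreqs.put("search", 2);

        System.out.println("cutoff 5: " + keepTerms(docFreqs, 5));
        System.out.println("cutoff 1: " + keepTerms(docFreqs, 1));
    }
}
```

With the cutoff at 5, every term is discarded and no query terms remain, which matches the empty result reported in the thread. Lowering the cutoff lets the low-frequency terms through.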

Lucene-MoreLikethis

2013-01-15 Thread Thomas Keller
Hey, I have a question about "MoreLikeThis" in Lucene, Java. I built up an index and want to find similar documents. But I always get no results for my query, mlt.like(1) is always empty. Can anyone find my mistake? Here is an example. (I use Lucene 4.0) public class HelloLucene { public

Re: The best way get highest frequency term from index

2013-01-15 Thread Ian Lea
java org.apache.lucene.misc.HighFreqTerms indexdir 1 field
That's for 4.0, in lucene-misc-4.0.0.jar. It has been around for ages but may have had a different package name in earlier releases. I've no idea how it works and luckily don't need to. You can look at the source if you need to know.

The best way get highest frequency term from index

2013-01-15 Thread 장용석
Hi. What is the best way to get the highest-frequency term from an index? My idea is to use a PriorityQueue and cut off the lower-frequency terms, but that requires looping over the counts of all terms. Is there a better way to get the highest-frequency term? Thanks! -- DEV용식 http://devyongsik.tistory.com
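The PriorityQueue approach mentioned in the question can be sketched in plain Java as a bounded min-heap. This is a minimal illustration over an in-memory map of term counts, not a reading of a Lucene index. The class `TopTermsSketch`, its method, and the sample counts are hypothetical. It still visits every term once, but keeps at most k entries in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: top-k terms by frequency using a bounded min-heap.
class TopTermsSketch {

    static List<Map.Entry<String, Integer>> topK(Map<String, Integer> freqs, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Min-heap ordered by frequency: the head is the weakest of the current top k.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(k, (a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Integer> e : freqs.entrySet()) {
            if (heap.size() < k) {
                heap.offer(e);
            } else if (e.getValue() > heap.peek().getValue()) {
                heap.poll();   // cut off the lowest-frequency term
                heap.offer(e);
            }
        }
        // Return the survivors, highest frequency first.
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = new HashMap<>();
        freqs.put("a", 5);
        freqs.put("b", 1);
        freqs.put("c", 9);
        freqs.put("d", 3);
        for (Map.Entry<String, Integer> e : topK(freqs, 2)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

For n terms the loop costs O(n log k), so asking for only the single highest-frequency term (k = 1) reduces to a plain linear scan for the maximum.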

Why Lucene 4.0.0 FilterAtomicReader terms method is final?

2013-01-15 Thread Jean-Claude Dauphin
Hi, I don't understand why the FilterAtomicReader terms method is declared final while the TestFilterAtomicReader test overrides the terms method. I may have missed something; any help would be welcome. Best wishes, JCD -- Jean-Claude Dauphin jc.daup...@gmail.com jc.daup...@afus.unesco.org http

Re: extensive minor garbage collection when using RAMDirectory on Lucune 3.6.2

2013-01-15 Thread Michael McCandless
RAMDirectory generally has a high GC cost for a large index because it always allocates byte[1024] as its "pages". See e.g. http://blog.mikemccandless.com/2012/07/lucene-index-in-ram-with-azuls-zing-jvm.html But you are hitting lots of new-gen garbage, which is different. ExactPhraseScorer is new f

Re: Lucene 4.0 WhitespaceAnalyzer problem

2013-01-15 Thread Alon Muchnick
Hi Maxim, you need to reset the tokenStream before the while loop: tokenStream.reset(). Check out http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/package-summary.html and look under "Invoking the Analyzer": "ts.reset(); // Resets this stream to the beginning. (Required)"

Lucene 4.0 WhitespaceAnalyzer problem

2013-01-15 Thread Maksym Krasovskiy
Hi! I am trying to use WhitespaceAnalyzer from Lucene 4.0 to split strings into words. I wrote a small test: @Test public void whitespaceAnalyzerTest() throws IOException { String string = "sdfdsf sdfsdf sd sdf "; Analyzer wa = new WhitespaceAnalyzer(Version.LUCENE_40); TokenStream tokenStre

Re: Reg Lucene Naive Bayesian classifier.

2013-01-15 Thread Tommaso Teofili
2013/1/15 VIGNESH S
> Hi All,
>
> Thanks for your replies..
>
> Actually I am trying to classify the email mail data in to categories
> and also spam mails .. I have tried clustering but it is not useful
> since we can not control categories.
>
> I am looking for a light weight implementation whi

Re: Reg Lucene Naive Bayesian classifier.

2013-01-15 Thread Alan Woodward
Hi Vignesh, You might want to have a look at something we put together last year: http://www.flax.co.uk/blog/2012/06/12/clade-a-freely-available-open-source-taxonomy-and-autoclassification-tool/. Alan Woodward a...@flax.co.uk
On 15 Jan 2013, at 05:33, VIGNESH S wrote:
> Hi All,
>
> Thanks