date:20130115

Calculating a IDF value for a document collection

2013-01-15 Thread Kasun Perera

I have a set of documents separated into doc_sections (d), which are in turn separated into n sentences. I’m using an ontology to calculate the similarity between the definitions of ontology terms and the doc_sections. The documents are indexed at sentence level, so each sentence is a docu
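The message above is cut off, so the exact requirement is not known. As a minimal sketch of the general idea, assuming each sentence is treated as one document and that terms are split on whitespace, the following plain-Java example counts document frequencies and computes an IDF value per term. The class and method names (`SentenceIdf`, `docFreq`, `idf`) are hypothetical, and the formula `1 + ln(numDocs / (docFreq + 1))` is the one used by Lucene's classic `DefaultSimilarity`; other IDF variants would work the same way.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SentenceIdf {

    /** Counts, for each term, how many sentences (documents) contain it at least once. */
    public static Map<String, Integer> docFreq(List<String> sentences) {
        Map<String, Integer> df = new HashMap<>();
        for (String sentence : sentences) {
            // A set ensures a term is counted once per sentence, however often it repeats.
            Set<String> unique = new HashSet<>(Arrays.asList(sentence.toLowerCase().split("\\s+")));
            for (String term : unique) {
                if (!term.isEmpty()) {
                    df.merge(term, 1, Integer::sum);
                }
            }
        }
        return df;
    }

    /** IDF in the style of Lucene's DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)). */
    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("a b", "b c", "b");
        Map<String, Integer> df = docFreq(sentences);
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            System.out.println(e.getKey() + " df=" + e.getValue()
                    + " idf=" + idf(e.getValue(), sentences.size()));
        }
    }
}
```

To get an IDF for a doc_section rather than a sentence, the same counting could be applied with one string per doc_section; which granularity is appropriate depends on the part of the question that is missing here.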

Re: Lucene-MoreLikethis

2013-01-15 Thread Jack Krupansky

There are lots of parameters you can adjust, but the defaults essentially assume that you have a fairly large corpus and aren't interested in low-frequency terms. So, try MoreLikeThis#setMinDocFreq. The default is 5. You don't have any terms in your example with a doc freq over 2. Also, try
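To illustrate why a minimum document frequency can leave nothing to query with, here is a simplified plain-Java sketch. It is not Lucene's code: `MinDocFreqFilter` and `keepTerms` are hypothetical names, and the only behavior modeled is dropping terms whose document frequency is below the threshold.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MinDocFreqFilter {

    /** Returns, in sorted order, the terms whose document frequency is at least minDocFreq. */
    public static List<String> keepTerms(Map<String, Integer> docFreqs, int minDocFreq) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            if (e.getValue() >= minDocFreq) {
                kept.add(e.getKey());
            }
        }
        Collections.sort(kept);
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical tiny index: no term appears in more than 2 documents.
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("lucene", 2);
        docFreqs.put("index", 1);
        docFreqs.put("java", 2);

        System.out.println(keepTerms(docFreqs, 5)); // threshold of 5: nothing survives
        System.out.println(keepTerms(docFreqs, 2)); // lower threshold: some terms survive
    }
}
```

With a threshold of 5 and no term above a document frequency of 2, every term is discarded, which matches the empty result described in the original question.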

Lucene-MoreLikethis

2013-01-15 Thread Thomas Keller

Hey, I have a question about "MoreLikeThis" in Lucene, Java. I built up an index and want to find similar documents. But I always get no results for my query, mlt.like(1) is always empty. Can anyone find my mistake? Here is an example. (I use Lucene 4.0) public class HelloLucene { public

Re: The best way get highest frequency term from index

2013-01-15 Thread Ian Lea

java org.apache.lucene.misc.HighFreqTerms indexdir 1 field
That's for 4.0, in lucene-misc-4.0.0.jar. It has been around for ages but may have had a different package name in earlier releases. I've no idea how it works and luckily don't need to. You can look at the source if you need to know.

The best way get highest frequency term from index

2013-01-15 Thread 장용석

Hi. What is the best way to get the highest-frequency term from an index? I am thinking of using a PriorityQueue and cutting off the lower-frequency terms, but that requires looping over the counts of all terms. Is there a better way to get the highest-frequency term? Thanks! -- DEV용식 http://devyongsik.tistory.com
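The PriorityQueue approach described in the question can be sketched in plain Java as follows. This is an illustration only, assuming the term frequencies have already been collected into a map; `TopTerms` and `topK` are hypothetical names and no Lucene classes are involved. A bounded min-heap keeps only the k most frequent terms seen so far, so memory stays at O(k) even though every term is still visited once.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopTerms {

    /** Returns the k most frequent terms, most frequent first. */
    public static List<String> topK(Map<String, Integer> freqs, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Min-heap ordered by frequency: the head is the weakest of the current top k.
        Comparator<Map.Entry<String, Integer>> byFreq =
                (a, b) -> Integer.compare(a.getValue(), b.getValue());
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(k, byFreq);

        for (Map.Entry<String, Integer> e : freqs.entrySet()) {
            if (heap.size() < k) {
                heap.offer(e);
            } else if (e.getValue() > heap.peek().getValue()) {
                heap.poll();   // cut off the current lowest-frequency term
                heap.offer(e);
            }
        }

        // The heap yields lowest first, so prepend to get descending order.
        LinkedList<String> result = new LinkedList<>();
        while (!heap.isEmpty()) {
            result.addFirst(heap.poll().getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = new HashMap<>();
        freqs.put("lucene", 5);
        freqs.put("index", 3);
        freqs.put("term", 9);
        freqs.put("query", 1);
        System.out.println(topK(freqs, 2));
    }
}
```

The loop over all terms that the question worries about is still there; the heap only bounds the extra memory and the per-term cost to O(log k).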

Why Lucene 4.0.0 FilterAtomicReader terms method is final?

2013-01-15 Thread Jean-Claude Dauphin

Hi, I don't understand why the FilterAtomicReader terms method is declared as final while the TestFilterAtomicReader test overrides the terms method. I may have missed something; any help would be welcome. Best wishes, JCD -- Jean-Claude Dauphin jc.daup...@gmail.com jc.daup...@afus.unesco.org http

Re: extensive minor garbage collection when using RAMDirectory on Lucune 3.6.2

2013-01-15 Thread Michael McCandless

RAMDirectory generally has a high GC cost for a large index because it always allocates byte[1024] as its "pages". See e.g. http://blog.mikemccandless.com/2012/07/lucene-index-in-ram-with-azuls-zing-jvm.html But you are hitting lots of new gen garbage, which is different. ExactPhraseScorer is new f

Re: Lucene 4.0 WhitespaceAnalyzer problem

2013-01-15 Thread Alon Muchnick

Hi Maxim, you need to reset the tokenStream before the while loop: tokenStream.reset(). Check out http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/package-summary.html and look under "Invoking the Analyzer": "ts.reset(); // Resets this stream to the beginning. (Required)"

Lucene 4.0 WhitespaceAnalyzer problem

2013-01-15 Thread Maksym Krasovskiy

Hi! I am trying to use WhitespaceAnalyzer from Lucene 4.0 to split strings into words. I wrote a small test: @Test public void whitespaceAnalyzerTest() throws IOException { String string = "sdfdsf sdfsdf sd sdf "; Analyzer wa = new WhitespaceAnalyzer(Version.LUCENE_40); TokenStream tokenStre

Re: Reg Lucene Naive Bayesian classifier.

2013-01-15 Thread Tommaso Teofili

2013/1/15 VIGNESH S
> Hi All,
>
> Thanks for your replies..
>
> Actually I am trying to classify the email mail data in to categories
> and also spam mails .. I have tried clustering but it is not useful
> since we can not control categories.
>
> I am looking for a light weight implementation whi

Re: Reg Lucene Naive Bayesian classifier.

2013-01-15 Thread Alan Woodward

Hi Vignesh,
You might want to have a look at something we put together last year: http://www.flax.co.uk/blog/2012/06/12/clade-a-freely-available-open-source-taxonomy-and-autoclassification-tool/.
Alan Woodward a...@flax.co.uk
On 15 Jan 2013, at 05:33, VIGNESH S wrote:
> Hi All,
>
> Thanks

11 matches
