I have a set of documents separated into doc_sections (d) that are in turn
separated into (n) sentences. There is an ontology that I'm using to
calculate similarity between the definitions of ontology terms and the
doc_sections.
The documents are indexed at the sentence level, so each sentence is a document.
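For context, a minimal sketch of what sentence-level indexing could look like in Lucene 4.0; the field names ("section", "sentence"), the hard-coded sentences, and the index path are illustrative assumptions, not taken from the setup above.

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SentenceIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40,
                new StandardAnalyzer(Version.LUCENE_40));
        IndexWriter writer = new IndexWriter(FSDirectory.open(new File("indexdir")), config);

        // Each sentence of each doc_section becomes its own Lucene document.
        String[] sentences = { "First sentence of section d1.", "Second sentence of section d1." };
        for (String sentence : sentences) {
            Document doc = new Document();
            doc.add(new StringField("section", "d1", Field.Store.YES));    // which doc_section it came from
            doc.add(new TextField("sentence", sentence, Field.Store.YES)); // analyzed sentence text
            writer.addDocument(doc);
        }
        writer.close();
    }
}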
There are lots of parameters you can adjust, but the defaults essentially
assume that you have a fairly large corpus and aren't interested in
low-frequency terms.
So, try MoreLikeThis#setMinDocFreq. The default is 5. You don't have any
terms in your example with a doc freq over 2.
Also, try
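A minimal sketch of lowering those thresholds in Lucene 4.0; the field name "content" and the surrounding setup are assumptions, not taken from the original code.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.Version;

public class MltExample {
    // Builds a MoreLikeThis query for docId with thresholds lowered for a tiny test index.
    static TopDocs similarTo(IndexReader reader, int docId) throws Exception {
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setAnalyzer(new StandardAnalyzer(Version.LUCENE_40));
        mlt.setFieldNames(new String[] { "content" }); // assumed field name
        mlt.setMinDocFreq(1);  // default is 5; with only a few documents nothing qualifies
        mlt.setMinTermFreq(1); // default is 2; terms occurring once would otherwise be dropped
        Query query = mlt.like(docId);
        return new IndexSearcher(reader).search(query, 10);
    }
}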
Hey,
I have a question about "MoreLikeThis" in Lucene (Java). I built an index and
want to find similar documents, but I always get no results for my query;
mlt.like(1) is always empty. Can anyone find my mistake? Here is an example. (I
use Lucene 4.0)
public class HelloLucene {
public
java org.apache.lucene.misc.HighFreqTerms indexdir 1 field
That's for 4.0, in lucene-misc-4.0.0.jar. It has been around for ages
but may have had a different package name in earlier releases.
I've no idea how it works and luckily don't need to. You can look at
the source if you need to know.
Hi.
What is the best way to get the highest-frequency term from an index?
I was thinking of using a PriorityQueue and cutting off the lower-frequency terms,
but that way I need to loop over the counts of all terms.
Is there a better way to get the highest-frequency term?
Thanks!
--
DEV용식
http://devyongsik.tistory.com
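A minimal sketch of a single pass over the terms dictionary with the stock Lucene 4.0 API; the field name is an assumption, and for the top N terms you would keep a bounded PriorityQueue as described above. The HighFreqTerms tool mentioned in the earlier reply does essentially this for you.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class TopTerm {
    // Returns the term with the highest document frequency in the given field, or null if the field is empty.
    static String highestDocFreqTerm(IndexReader reader, String field) throws Exception {
        Terms terms = MultiFields.getTerms(reader, field);
        if (terms == null) return null;
        TermsEnum te = terms.iterator(null);
        String best = null;
        int bestFreq = -1;
        for (BytesRef term = te.next(); term != null; term = te.next()) {
            if (te.docFreq() > bestFreq) { // one loop over all terms, keeping the current maximum
                bestFreq = te.docFreq();
                best = term.utf8ToString();
            }
        }
        return best;
    }
}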
Hi,
I don't understand why the FilterAtomicReader method is declared final
while the TestFilterAtomicReader test overrides the terms method.
I may have missed something; any help would be welcome.
Best wishes,
JCD
--
Jean-Claude Dauphin
jc.daup...@gmail.com
jc.daup...@afus.unesco.org
http
RAMDirectory generally has a high GC cost for a large index because it always
allocates byte[1024] arrays as its "pages". See e.g.
http://blog.mikemccandless.com/2012/07/lucene-index-in-ram-with-azuls-zing-jvm.html
But you are hitting lots of new-gen garbage, which is different.
ExactPhraseScorer is new f
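If the GC cost of RAMDirectory itself is the concern, a different, commonly suggested approach is to keep the index on disk and let the OS cache it, e.g. via MMapDirectory; a minimal sketch, with the index path assumed:

import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

public class OpenIndex {
    public static void main(String[] args) throws Exception {
        // Memory-map the on-disk index instead of copying it into byte[1024] pages on the heap.
        Directory dir = new MMapDirectory(new File("/path/to/index"));
        IndexReader reader = DirectoryReader.open(dir);
        System.out.println("docs: " + reader.maxDoc());
        reader.close();
        dir.close();
    }
}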
Hi Maxim,
You need to reset the tokenStream before the while loop: tokenStream.reset().
Check out
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/package-summary.html
and look under "Invoking the Analyzer":
"ts.reset(); // Resets this stream to the beginning. (Required)"
Hi!
I'm trying to use WhitespaceAnalyzer from Lucene 4.0 to split strings into words.
I wrote a small test:
@Test
public void whitespaceAnalyzerTest() throws IOException {
String string = "sdfdsf sdfsdf sd sdf ";
Analyzer wa = new WhitespaceAnalyzer(Version.LUCENE_40);
TokenStream tokenStre
2013/1/15 VIGNESH S
> Hi All,
>
> Thanks for your replies..
>
> Actually I am trying to classify email data into categories
> and also spam mails. I have tried clustering, but it is not useful
> since we cannot control the categories.
>
> I am looking for a light weight implementation whi
Hi Vignesh,
You might want to have a look at something we put together last year:
http://www.flax.co.uk/blog/2012/06/12/clade-a-freely-available-open-source-taxonomy-and-autoclassification-tool/.
Alan Woodward
a...@flax.co.uk
On 15 Jan 2013, at 05:33, VIGNESH S wrote:
> Hi All,
>
> Thanks