Re: Document Term matrix

2014-11-11 Thread Ahmet Arslan
Hi, Mahout and Carrot2 can cluster the documents from a Lucene index. ahmet On Tuesday, November 11, 2014 10:37 PM, Elshaimaa Ali wrote: Hi All, I have a Lucene index built with Lucene 4.9 for 584 text documents. I need to extract a Document-term matrix, and Document Document similarity matri

Re: Document Term matrix

2014-11-11 Thread Paul Libbrecht
The semanticvectors project might do what you are looking for. paul On 11 nov. 2014, at 22:37, parnab kumar wrote: > hi, > > While indexing the documents, store the Term Vectors for the content > field. Now for each document you will have an array of terms and their > corresponding fre

Re: Document Term matrix

2014-11-11 Thread parnab kumar
hi, While indexing the documents, store the Term Vectors for the content field. Now for each document you will have an array of terms and their corresponding frequencies in the document. Using the Index Reader you can retrieve these term vectors. Similarity between two documents can be computed as
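The snippet above is cut off before it names a similarity measure. As a rough sketch only, assuming cosine similarity (one common choice, not necessarily the one the poster meant) and assuming the per-document term frequencies have already been read out of the index into plain dictionaries, a document-term matrix and a document-document similarity matrix could be built like this. The function names are hypothetical and nothing here calls the Lucene API.

```python
import math


def document_term_matrix(term_freqs):
    """Build a dense document-term matrix.

    term_freqs: one dict per document, mapping term -> frequency
    (assumed to have been extracted from stored term vectors).
    Returns (vocabulary, matrix), one matrix row per document.
    """
    vocabulary = sorted({term for tf in term_freqs for term in tf})
    matrix = [[tf.get(term, 0) for term in vocabulary] for tf in term_freqs]
    return vocabulary, matrix


def cosine_similarity(row_a, row_b):
    """Cosine of the angle between two term-frequency rows."""
    dot = sum(a * b for a, b in zip(row_a, row_b))
    norm_a = math.sqrt(sum(a * a for a in row_a))
    norm_b = math.sqrt(sum(b * b for b in row_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def similarity_matrix(matrix):
    """Pairwise cosine similarity between all document rows."""
    return [[cosine_similarity(a, b) for b in matrix] for a in matrix]


# Toy input standing in for term vectors read from an index.
docs = [
    {"lucene": 2, "index": 1},
    {"lucene": 2, "index": 1},
    {"cluster": 3},
]
vocabulary, matrix = document_term_matrix(docs)
similarities = similarity_matrix(matrix)
```

For a collection of 584 documents a dense matrix like this is small enough to hold in memory; raw frequencies could be swapped for tf-idf weights before computing similarities if that suits the clustering step better.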

Document Term matrix

2014-11-11 Thread Elshaimaa Ali
Hi All, I have a Lucene index built with Lucene 4.9 for 584 text documents. I need to extract a Document-term matrix and a Document Document similarity matrix in order to use them to cluster the documents. My questions: 1- How can I extract the matrix and compute the similarity between documents in

RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-11 Thread Martin O'Shea
Ahmet, Yes, that is quite true. But as this is only a proof of concept application, I'm prepared for things to be 'imperfect'. Martin O'Shea. -----Original Message----- From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] Sent: 11 Nov 2014 18:26 To: java-user@lucene.apache.org Subject: Re: How

Re: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-11 Thread Ahmet Arslan
Hi, With that analyser, searches for the same word with different capitalisation could return different results. Ahmet On Tuesday, November 11, 2014 6:57 PM, Martin O'Shea wrote: In the end I edited the code of the StandardAnalyzer and the SnowballAnalyzer to disable the calls to the LowerCa

RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-11 Thread Martin O'Shea
In the end I edited the code of the StandardAnalyzer and the SnowballAnalyzer to disable the calls to the LowerCaseFilter. This seems to work. -----Original Message----- From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] Sent: 10 Nov 2014 15:19 To: java-user@lucene.apache.org Subject: Re: How

Re: How to map lucene scores to range from 0~100?

2014-11-11 Thread Harry Yu
Hi Rajendra, Thanks for your reply. Normalization is a good way to solve it. But there is a problem: if I normalize by your formula, the score of the top doc would always be 100. Although it maps the scores to the range 0~100, the score may not reflect the similarity between the query and the hit docs. My system is t

Re: How to map lucene scores to range from 0~100?

2014-11-11 Thread Rajendra Rao
Harry, basically, converting scores into the range 0 to 100 requires normalization (dividing each score by the highest score and multiplying by 100), but this score doesn't represent a matching percentage. On Tue, Nov 11, 2014 at 7:48 PM, Harry Yu <502437...@qq.com> wrote: > Hi everyone, > > > I met a new trouble.
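The normalization described in this reply (divide each score by the highest score, multiply by 100) can be sketched as follows. This is a minimal illustration on a plain list of floats; the function name `normalize_scores` is hypothetical and the scores are made-up values, not output from Lucene.

```python
def normalize_scores(scores, scale=100.0):
    """Rescale scores so the highest one maps to `scale`.

    Each score is divided by the maximum score and multiplied by
    `scale`, as described in the reply above.
    """
    if not scores:
        return []
    top = max(scores)
    if top <= 0:
        # Assumption: with no positive score there is nothing to rank.
        return [0.0 for _ in scores]
    return [score / top * scale for score in scores]


# Made-up raw scores for three hits of a single query.
normalized = normalize_scores([2.0, 1.0, 0.5])
```

Note that the top hit always maps to 100 however weak the match is, so the result is a rank-relative value for one query rather than a matching percentage, which is the caveat raised in the thread.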

How to map lucene scores to range from 0~100?

2014-11-11 Thread Harry Yu
Hi everyone, I have run into a new problem. In my system, we need to score docs in the range from 0 to 100. Are there some easy ways to map Lucene scores to this range? Thanks for your help~ Yu

Re: Index keeps growing, then shrinks on restart

2014-11-11 Thread Rob Nikander
On Tue, Nov 11, 2014 at 4:26 AM, Ian Lea wrote: > Telling us the version of lucene and the OS you're running on is > always a good idea. Oops, yes. Lucene 4.10.0, Linux. > A guess here is that you aren't closing index readers, so the JVM will > be holding on to deleted files until it exits. >

Re: How to improve the performance in Lucene when query is long?

2014-11-11 Thread Ahmet Arslan
Hi Harry, Maybe you can use the BooleanQuery#setMinimumNumberShouldMatch method. What happens when you set it to half of numTerms? ahmet On Tuesday, November 11, 2014 8:35 AM, Harry Yu <502437...@qq.com> wrote: Hi everyone, I have been using Lucene to build a POI searching & geocoding
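To illustrate the idea behind the suggestion, here is a toy model of what a minimum-should-match threshold means: a document is kept only if at least a given number of the optional query terms occur in it. It is a sketch of the semantics only, not Lucene's implementation; the function name `matches` and the sample terms are hypothetical.

```python
def matches(doc_terms, optional_terms, minimum_should_match):
    """Return True if enough optional query terms occur in the document.

    doc_terms: set of terms present in the document.
    optional_terms: the query's optional (SHOULD) terms.
    minimum_should_match: how many of them must be present.
    """
    matched = sum(1 for term in set(optional_terms) if term in doc_terms)
    return matched >= minimum_should_match


# A long query where half of the terms are required to match.
query_terms = ["main", "street", "springfield", "il"]
threshold = len(query_terms) // 2

kept = matches({"main", "street", "cafe"}, query_terms, threshold)
dropped = matches({"springfield", "park"}, query_terms, threshold)
```

Raising the threshold shrinks the set of candidate documents that have to be scored, which is presumably why it is suggested here for long queries.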

Re: Index keeps growing, then shrinks on restart

2014-11-11 Thread Ian Lea
Telling us the version of Lucene and the OS you're running on is always a good idea. A guess here is that you aren't closing index readers, so the JVM will be holding on to deleted files until it exits. A combination of du, ls, and lsof commands should prove it, or just lsof: run it against the j