about the performance of search with filter

2006-04-21 Thread Sen Zhou
Hi all, I want to know the difference between a search without a RangeFilter and a search with a RangeFilter. Is the latter slower than the former? Thanks, Sen Zhou - To unsubscribe, e-mail: [EMAIL PROTECTED] For additio

Re: Synonyms ...

2006-04-21 Thread Yonik Seeley
On 4/21/06, Dragon Fly <[EMAIL PROTECTED]> wrote: > I don't think the SynonymAnalyzer described in LIA would work because > some of my "synonyms" contain multiple words. The SynonymFilter in Solr can handle multi-word synonyms. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters http://in

Re: Reuters

2006-04-21 Thread Marvin Humphrey
On Apr 21, 2006, at 11:56 AM, Malcolm Clark wrote: has anyone attempted to index/search the Reuters collection which consists of SGML? Mine seems to run through the process okay but alas I'm left with nothing in the index when I check with Luke or my own Search Engine. Anyone got any hints

Re: using boolean operators with the PhraseQuery

2006-04-21 Thread Vishal Bathija
Hi, I am trying to get the frequency of a phrase using the SpanNearQuery. How can I use SpanNearQuery for boolean queries? The code I have is for a single query. How can I extend this for multiple queries? SpanTermQuery[] phrase = new SpanTermQuery[phraseTerms.length]; for(int termCount=0; termCou

Re: Reuters

2006-04-21 Thread Malcolm Clark
Okay, converting to XML sounds like a great option. Thanks, Malcolm

Re: Reuters

2006-04-21 Thread Lorenzo Viscanti
Some months ago I created an index from the Reuters collection. I converted the SGML files to XML using a tool that I found somewhere on the net (just google for it), then I parsed the files to create the index, using a standard DOM parser. If you have problems parsing the SGML files I think you
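
The DOM-parsing step described above could look roughly like the following. This is only a sketch, not the poster's actual code: the class name `ReutersXmlSketch`, the wrapping `COLLECTION` root element, and the choice of the `REUTERS`, `TITLE` and `BODY` element names are assumptions about what the converted XML looks like. It uses only the JDK's DOM parser and stops at extracting the text that would then be added to the index.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ReutersXmlSketch {

    /** Title and body of one article; these would become the indexed fields. */
    public static final class Article {
        public final String title;
        public final String body;

        Article(String title, String body) {
            this.title = title;
            this.body = body;
        }
    }

    /** Parses converted XML and returns one Article per REUTERS element (assumed tag name). */
    public static List<Article> parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = dom.getElementsByTagName("REUTERS");
            List<Article> articles = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element article = (Element) nodes.item(i);
                articles.add(new Article(firstText(article, "TITLE"), firstText(article, "BODY")));
            }
            return articles;
        } catch (Exception e) {
            // Malformed input is reported instead of silently yielding an empty result.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Returns the trimmed text of the first matching descendant, or "" if there is none. */
    private static String firstText(Element parent, String tag) {
        NodeList hits = parent.getElementsByTagName(tag);
        return hits.getLength() == 0 ? "" : hits.item(0).getTextContent().trim();
    }
}
```

Failing loudly on a parse error is deliberate here: an index that ends up empty without any error is easier to diagnose when the parser is not allowed to swallow exceptions.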

Reuters

2006-04-21 Thread Malcolm Clark
Hi all, I didn't know whether to add this to the thread asking about TREC indexing or start a new one. Anyway, has anyone attempted to index/search the Reuters collection which consists of SGML? Mine seems to run through the process okay but alas I'm left with nothing in the index when I check w

Synonyms ...

2006-04-21 Thread Dragon Fly
Hi, What is the best way to implement the following? Document 1 contains the following text: "THE CZECH REPUBLIC ORGANIZATION" Document 2 contains the following text: "THE CZE ORGANISATION" Synonym rules: (1) CZECH REPUBLIC --> CZE (2) CZE --> CZECH REPUBLIC (3) ORGANIZATION --> ORG, ORGA
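
One possible approach to rules (1) and (2), offered only as a sketch: map every multi-word phrase to a single canonical token at both index time and query time, so that "CZECH REPUBLIC" and "CZE" end up as the same term. The class `PhraseSynonymSketch` and its method are hypothetical plain Java, not a Lucene or Solr API, and the greedy longest-match strategy is an assumption about how overlapping rules should be resolved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class PhraseSynonymSketch {

    /**
     * Replaces every phrase found in rules with its canonical token.
     * At each position the longest matching phrase wins (greedy matching).
     */
    public static List<String> canonicalize(List<String> tokens, Map<List<String>, String> rules) {
        int maxLen = 1;
        for (List<String> phrase : rules.keySet()) {
            maxLen = Math.max(maxLen, phrase.size());
        }
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            boolean matched = false;
            // Try the longest candidate phrase first, then shorter ones.
            for (int len = Math.min(maxLen, tokens.size() - i); len >= 1; len--) {
                String canonical = rules.get(tokens.subList(i, i + len));
                if (canonical != null) {
                    out.add(canonical);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }
}
```

With the single rule `[CZECH, REPUBLIC] -> CZE`, the tokens of Document 1 become `THE CZE ORGANIZATION`, while Document 2 is left unchanged, so the country part of both documents is indexed under the same term.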

Re: generating document vectors

2006-04-21 Thread Grant Ingersoll
See the "Lucene In Action" book or my ApacheCon talk at http://www.cnlp.org/apachecon2005. Both of these have examples. trupti mulajkar wrote: Hi, can anyone suggest how to generate document and query vectors containing the term frequencies from a Lucene index? I need it to implement a vector s

generating document vectors

2006-04-21 Thread trupti mulajkar
Hi, can anyone suggest how to generate document and query vectors containing the term frequencies from a Lucene index? I need it to implement a vector space model using WordNet. Cheers, trupti mulajkar MSc Advanced Computer Science
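
To make the idea of term-frequency vectors concrete, here is a minimal sketch in plain Java. It does not read anything from a Lucene index: the class `TermVectorSketch`, the naive tokenizer (lowercase, split on non-alphanumerics) and the use of raw counts as weights are all assumptions made for illustration. It builds a frequency map for a document and for a query and compares them with cosine similarity, which is the core of the vector space model.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class TermVectorSketch {

    /** Builds a term -> frequency map using a naive tokenizer (an assumption, not Lucene's analysis). */
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Cosine similarity between two sparse frequency vectors; 0.0 if either is empty. */
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Euclidean length of a frequency vector. */
    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }
}
```

In a real setup the frequencies would come from the index rather than from re-tokenizing the text, and the weights would usually be something like tf-idf instead of raw counts.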

Re: Lucene, TREC, and WT10G

2006-04-21 Thread trupti mulajkar
Lucene can index the TREC documents, but it depends on how you want to index them. If you want to index the sub-files in the TREC data, then you have to modify IndexFiles.java to read the tags; otherwise you can index them normally. Cheers, trupti mulajkar Quoting thanh nguyen <[EMAIL PROTECTED]>: > Hi
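
"Reading the tags" could be sketched as follows. This is not a patch to IndexFiles.java: the class `TrecSplitSketch` is hypothetical, and it assumes the usual TREC layout in which one file holds many `<DOC>` records, each carrying a `<DOCNO>` identifier. The sketch only splits a file's content into (docno, raw text) pairs; each pair would then be indexed as its own document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrecSplitSketch {

    // DOTALL lets '.' span line breaks, since a record covers many lines.
    private static final Pattern DOC = Pattern.compile("<DOC>(.*?)</DOC>", Pattern.DOTALL);
    private static final Pattern DOCNO =
            Pattern.compile("<DOCNO>\\s*(.*?)\\s*</DOCNO>", Pattern.DOTALL);

    /** Returns one {docno, rawContent} pair per DOC record found in the file content. */
    public static List<String[]> split(String fileContent) {
        List<String[]> docs = new ArrayList<>();
        Matcher doc = DOC.matcher(fileContent);
        while (doc.find()) {
            String raw = doc.group(1);
            Matcher id = DOCNO.matcher(raw);
            // A record without a DOCNO gets an empty identifier rather than being dropped.
            String docno = id.find() ? id.group(1) : "";
            docs.add(new String[] { docno, raw.trim() });
        }
        return docs;
    }
}
```

Regular expressions are used here only because the records are not guaranteed to be well-formed XML; for very large files a streaming reader would be a better fit than loading the whole content into one string.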

RE: Lucene - FileFormat

2006-04-21 Thread Dmitry Goldenberg
Simon, I wonder if using Zoe might do the trick - http://guests.evectors.it/zoe/ Have you tried it? - Dmitry From: Fisheye [mailto:[EMAIL PROTECTED] Sent: Fri 4/21/2006 7:23 AM To: java-user@lucene.apache.org Subject: Lucene - FileFormat I'm trying to const

Lucene, TREC, and WT10G

2006-04-21 Thread thanh nguyen
Hi all, Did anyone use Lucene to index WT10G? Can it index WT10G in compressed format (.gz), or do we have to unzip it first? Furthermore, does Lucene support the TREC format? I mean, can it receive a topic file like " 1 abc def " and produce a results file which we can use with the trec_eval program? A

similar ArrayIndexOutOfBoundsException on searching and optimizing

2006-04-21 Thread Adam Constabaris
This is a puzzler, I'm not sure if I'm doing something wrong or whether I have a poisoned document, a corrupted index (failing to close my IndexModifier properly?) or what. The setup is this: I have two processes (the backend and frontend of a CMS) that run in two different VMs -- both use Luc

Lucene - FileFormat

2006-04-21 Thread Fisheye
I'm trying to construct a plaintext parser for different file formats like MS Word, Excel, PowerPoint, rich text format, plain text, HTML, PDF etc. I use the known libraries PDFBox, POI and some parts from AtLeap...and now I should support the OpenOffice formats and the more important msg format (

Re: Most used words

2006-04-21 Thread Daniel Cortes
Thanks for the reply; perhaps using something like what Luke does is the best option. My idea is to build a tag cloud (see the example in this page) for every group (field group with the id) and every portal (with the id). The problem is that I think calling reader.terms() is not the best option in my case, becau

Re: Most used words

2006-04-21 Thread Kapil Chhabra
Hi, if I have understood your question correctly, you want the terms in a field with the highest number of occurrences. Try Luke [www.getopt.org/luke/]. Or, in case you are not able to initialize graphical content on your system, you may use the following script: src/org/getopt/luke/HighF
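
The general technique behind such a "most frequent terms" listing can be sketched independently of Luke. The following is not the script referenced above; `TopTermsSketch` is a hypothetical class that assumes the per-term counts have already been collected into a map (for example while walking the index's term enumeration). It keeps only the n best entries in a bounded min-heap, so memory stays proportional to n rather than to the number of distinct terms.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopTermsSketch {

    /** Returns the n most frequent terms, highest count first; ties are ordered alphabetically. */
    public static List<String> topTerms(Map<String, Integer> counts, int n) {
        List<String> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Orders by count ascending; on equal counts the alphabetically later term is "smaller",
        // so it is the first to be evicted from the heap.
        Comparator<Map.Entry<String, Integer>> byCountThenTerm = (x, y) -> {
            int c = Integer.compare(x.getValue(), y.getValue());
            return c != 0 ? c : y.getKey().compareTo(x.getKey());
        };
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(byCountThenTerm);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // drop the current weakest candidate
            }
        }
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result); // heap drains weakest first
        return result;
    }
}
```

For a tag cloud per group or portal, one such map could be built per group id and passed through the same method.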