Re: Performance and FS block size

2006-02-10 Thread Byron Miller
Otis, if I'm not mistaken, block size, especially on ext3, becomes an issue when you hit a peak number of total blocks and lose performance on inode lookup compared with Reiserfs. For example, you may gain performance by going to 4k vs 1k on ext3; however, Reiserfs at that block size should be xx

Re: Build vs. Buy?

2006-02-10 Thread jwang
The reason we don't use the Google appliance is that our company doesn't give recommendations on OSs or hardware to run. It would look a little weird if we said, oh, you have to buy this hardware for our search engine, but for our core technology, feel free to deploy it anywhere you want. It just

Re: Performance and FS block size

2006-02-10 Thread Michael D. Curtin
Otis Gospodnetic wrote: Michael, Actually, one more thing - you said you changed the store/BufferedIndexOutput.BUFFER_SIZE from 1024 to 4096 and that turned out to yield the fastest indexing. Does your FS block size also happen to be 4k (dumpe2fs output) on that FC3 box? If so, I wonder if

Re: Performance and FS block size

2006-02-10 Thread Otis Gospodnetic
Michael, Actually, one more thing - you said you changed the store/BufferedIndexOutput.BUFFER_SIZE from 1024 to 4096 and that turned out to yield the fastest indexing. Does your FS block size also happen to be 4k (dumpe2fs output) on that FC3 box? If so, I wonder if this is more than just a
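
To make the buffer-size question above concrete, here is a minimal sketch of how one might time writes through buffers of different sizes. It uses Python's own file buffering as a loose stand-in for BUFFER_SIZE; the function name timed_write and the sizes tried are assumptions for illustration, not anything from Lucene, and the timings it prints say nothing about Lucene itself.

```python
import os
import tempfile
import time


def timed_write(total_bytes, buffer_size):
    """Write total_bytes one byte at a time through a buffer of buffer_size.

    Returns (bytes on disk after closing, elapsed seconds).
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        start = time.perf_counter()
        # buffering=N sets the size of Python's in-process write buffer,
        # only loosely analogous to the BUFFER_SIZE discussed in the thread.
        with open(path, "wb", buffering=buffer_size) as out:
            for _ in range(total_bytes):
                out.write(b"x")
        elapsed = time.perf_counter() - start
        return os.path.getsize(path), elapsed
    finally:
        os.remove(path)


if __name__ == "__main__":
    for size in (1024, 4096, 8192):
        written, seconds = timed_write(200_000, size)
        print(f"buffer={size:5d}  bytes={written}  seconds={seconds:.3f}")
```

Running it several times on the file system in question, and comparing against that file system's block size, would be one way to check whether the two are related.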

Re: Performance and FS block size

2006-02-10 Thread Otis Gospodnetic
Hi, Thanks for the speedy answer, this is good to know. However, I was wondering about the FS block size. Consider a Linux box:
    $ dumpe2fs /dev/sda1 | grep "Block size"
    dumpe2fs 1.36 (05-Feb-2005)
    Block size: 1024
That shows /dev/sda1 has blocks 1k in size. I don't think these
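
As a rough alternative to the dumpe2fs check above, the block size of a mounted file system can also be read without root through statvfs. The sketch below is Unix-only; the helper name fs_block_sizes is made up for illustration, and whether these values equal what dumpe2fs reports on a given ext2/ext3 volume is an assumption worth verifying.

```python
import os


def fs_block_sizes(path="."):
    """Return (preferred I/O block size, fundamental block size) in bytes
    for the file system that holds `path` (Unix only)."""
    st = os.statvfs(path)
    return st.f_bsize, st.f_frsize


if __name__ == "__main__":
    preferred, fundamental = fs_block_sizes(".")
    print(f"preferred={preferred} fundamental={fundamental}")
```

Pointing the path at the directory that holds the index shows the block size of the file system the index actually lives on.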

Re: Performance and FS block size

2006-02-10 Thread Michael D. Curtin
Otis Gospodnetic wrote: Hi, I'm wondering if anyone has tested Lucene indexing/search performance with different file system block sizes? I just realized one of the servers where I run a lot of Lucene indexing and searching has an FS with blocks of only 1K in size (typically they are 4k or

Re: 1.9 lucene version

2006-02-10 Thread Otis Gospodnetic
I answered it yesterday, please check the archives... Otis - Original Message From: "Aigner, Thomas" <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Fri 10 Feb 2006 09:25:14 AM EST Subject: RE: 1.9 lucene version Anyone have a comment on the below message? -Original Me

Performance and FS block size

2006-02-10 Thread Otis Gospodnetic
Hi, I'm wondering if anyone has tested Lucene indexing/search performance with different file system block sizes? I just realized one of the servers where I run a lot of Lucene indexing and searching has an FS with blocks of only 1K in size (typically they are 4k or 8k, I believe), so I starte

Re: Build vs. Buy?

2006-02-10 Thread jian chen
For reading word document as text, you can try AntiWord. I have written a simplified Lucene that does Max words match. For example, if you are searching for aa, bb, cc, then, the document that contains all words (aa, bb, cc) will be definitely ranked higher than documents containing either aa, bb
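
The "max words match" idea described above could be sketched roughly as follows. This is not the poster's code: the function name rank_by_matched_words, the whitespace tokenization, and the tie-breaking by document id are all assumptions made for the example.

```python
def rank_by_matched_words(query_words, docs):
    """Rank documents by how many distinct query words they contain.

    docs maps a document id to its text. Documents matching no query
    word are dropped; ties are broken by document id.
    """
    query = {w.lower() for w in query_words}
    scored = []
    for doc_id, text in docs.items():
        matched = len(query & set(text.lower().split()))
        if matched:
            scored.append((doc_id, matched))
    # Most matched words first, then by id for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


if __name__ == "__main__":
    docs = {"d1": "aa bb cc", "d2": "aa bb", "d3": "zz"}
    print(rank_by_matched_words(["aa", "bb", "cc"], docs))
```

With this scheme a document containing aa, bb, and cc always sorts ahead of one containing only a subset of them.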

Re: Can PDFBox or POI handle multi-byte characters with different encodings?

2006-02-10 Thread Ben Litchfield
PDFBox can handle multi-byte encodings. There are a couple recent fixes for CJK languages that are not part of 0.7.2 but are part of the nightly build. Ben On Fri, 10 Feb 2006, Zhang, Lisheng wrote: > Hi, > > Currently we are using PDFBox to process PDF files and > POI to process DOC/XLS fil

Re: QueryParser behaviour ..

2006-02-10 Thread Chris Hostetter
: I built a wrong query string "word1,word2,word3" instead of "word1 : word2 word3" : therefore I got a wrong query: field:"word1 word2 word3" instead of : field:word1 field:word2 field:word3. : : Is this an expected behaviour? : I used Standard analyzer, probably therefore, the commas were re

Can PDFBox or POI handle multi-byte characters with different encodings?

2006-02-10 Thread Zhang, Lisheng
Hi, Currently we are using PDFBox to process PDF files and POI to process DOC/XLS files, before sending strings to Lucene for indexing. Does anyone know if PDFBox or POI can process multi-byte characters like Japanese with various encodings (whatever is specified in the PDF or DOC)? Thanks very much for

Re: query formulation

2006-02-10 Thread Yonik Seeley
On 2/10/06, Rajesh Munavalli <[EMAIL PROTECTED]> wrote: > However I also want to retrieve those documents (in order) where one or more > of the terms is missing from either of the fields. i.e, BooleanQuery.setMinimumNumberShouldMatch() in the development version (1.9) of Lucene may help out in tha
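
The idea behind a minimum-should-match setting can be illustrated with a small sketch: a document qualifies only if at least a given number of the optional terms occur in it. The code below shows that idea only; it is not Lucene's implementation, and the function name satisfies_minimum is invented for the example.

```python
def satisfies_minimum(doc_terms, optional_terms, minimum_should_match):
    """Return True if at least minimum_should_match of the optional
    terms occur in doc_terms (each distinct term counted once)."""
    present = set(doc_terms)
    hits = sum(1 for term in set(optional_terms) if term in present)
    return hits >= minimum_should_match


if __name__ == "__main__":
    doc = ["t1", "t2", "t9"]
    print(satisfies_minimum(doc, ["t1", "t2", "t3", "t4"], 2))
    print(satisfies_minimum(doc, ["t1", "t2", "t3", "t4"], 3))
```

Lowering the threshold admits documents that are missing some terms, while scoring can still rank fuller matches higher.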

query formulation

2006-02-10 Thread Rajesh Munavalli
Does anyone have a good way to formulate the query in terms of performance as well as ordering of retrieved documents for the following query? Query: "field1:t1 t2 t3 t4 AND field2:t5 t6 t7" I want to achieve the following * The document which matches the query exactly in both the fields gets ran

Re: Word files & Build vs. Buy?

2006-02-10 Thread Christiaan Fluit
Dmitry Goldenberg wrote: Awesome stuff. A few questions: is your Excel extractor somehow better than POI's? and, what do you see as the timeframe for adding WordPerfect support? Are you considering supporting any other sources such as MS Project, Framemaker, etc? I just committed a WordPerfectE

RE: 1.9 lucene version

2006-02-10 Thread Aigner, Thomas
Anyone have a comment on the below message? -Original Message- From: Aigner, Thomas Sent: Wednesday, February 08, 2006 11:50 AM To: java-user@lucene.apache.org Subject: 1.9 lucene version Hello all, I have a couple of questions for the community about the 1.9 Lucene version.

Re: Changing default QueryParser operator from OR to AND

2006-02-10 Thread Erik Hatcher
On Feb 10, 2006, at 4:37 AM, <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> wrote: If QueryParser gets a phrase with a number of words (i.e., "here are words") it uses the implicit operator OR - "here OR are OR words". LIA on p94 says the operator "by default is OR", implying that there may be some

QueryParser behaviour ..

2006-02-10 Thread sergiu gordea
Hi all, I built a wrong query string "word1,word2,word3" instead of "word1 word2 word3" therefore I got a wrong query: field:"word1 word2 word3" instead of field:word1 field:word2 field:word3. Is this an expected behaviour? I used Standard analyzer, probably therefore, the commas were repl
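
One plausible reading of the behaviour described above: the query string is first split on whitespace, each chunk is then analyzed, and a chunk that yields several tokens is treated as a phrase. The sketch below mimics that under stated assumptions; analyze and parse are toy stand-ins written for this example, not the QueryParser or StandardAnalyzer code.

```python
import re


def analyze(chunk):
    """Toy analyzer: lowercase, and treat anything that is not a letter
    or digit (commas included) as a token separator."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", chunk)]


def parse(field, query_string):
    """Toy parser: split on whitespace, analyze each chunk, and turn any
    chunk that yields more than one token into a quoted phrase."""
    clauses = []
    for chunk in query_string.split():
        tokens = analyze(chunk)
        if len(tokens) == 1:
            clauses.append(field + ":" + tokens[0])
        elif len(tokens) > 1:
            clauses.append(field + ':"' + " ".join(tokens) + '"')
    return " ".join(clauses)


if __name__ == "__main__":
    print(parse("field", "word1,word2,word3"))
    print(parse("field", "word1 word2 word3"))
```

Under these assumptions the comma-separated string never reaches the parser as three separate terms, which would explain the phrase query.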

RE: Changing default QueryParser operator from OR to AND

2006-02-10 Thread Iain Willis
Hi, Instead of using the static parse() method of QueryParser, you will need to create a new instance, and then call setOperator(DEFAULT_OPERATOR_AND); Iain www.ardentia.com the home of NetSearch -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: 10 February 20
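
What changing the default operator means for results can be shown with a small sketch: with OR, a document matches if it contains any query word; with AND, it must contain all of them. This illustrates the semantics only and does not use the Lucene API; the function name search and its default_operator parameter are assumptions for the example.

```python
def search(query_words, docs, default_operator="OR"):
    """Return the sorted ids of documents matching the query words,
    combined with the given default operator ("OR" or "AND")."""
    if default_operator not in ("OR", "AND"):
        raise ValueError("default_operator must be 'OR' or 'AND'")
    query = {w.lower() for w in query_words}
    combine = all if default_operator == "AND" else any
    matched = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        if combine(word in words for word in query):
            matched.append(doc_id)
    return sorted(matched)


if __name__ == "__main__":
    docs = {"d1": "here are words", "d2": "here are some", "d3": "nothing"}
    print(search(["here", "are", "words"], docs, "OR"))
    print(search(["here", "are", "words"], docs, "AND"))
```

Switching to AND narrows the result set to documents containing every word, which is the behaviour the original question asks for.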

Changing default QueryParser operator from OR to AND

2006-02-10 Thread Tim.Wright
Hi guys, If QueryParser gets a phrase with a number of words (i.e., "here are words") it uses the implicit operator OR - "here OR are OR words". LIA on p94 says the operator "by default is OR", implying that there may be some way to change this. We'd really like the default to be AND. Is that pos

RE: lucene & ejbs

2006-02-10 Thread Ramana Jelda
Hi, I am doing the same. My design contains: Index Repository: is responsible for keeping up the index. :) Index-configurator. Index-manager: real CRUD indexing. And of course index-searcher: I want results. :) You know very well that Searcher and Indexer are separate functionalities. So the reason wh

Re: Help: tweaking search - reducing IDF skew and implementing score cutoff

2006-02-10 Thread Chris Lamprecht
> 2. If I choose to sort the results by date, then recent documents with > very very low relevancy (say the words searched appear only in > content, and not in title/bylines/summary fields that are boosted > higher) are still shown relatively high in the list, and I wish to > omit them in general.