Re: How do YOU detect corrupt indexes?

2007-08-02 Thread Daniel Noll
On Friday 03 August 2007 16:03:22 Doron Cohen wrote: > What is the anticipated cause of corruption? Malicious? > Hardware fault? This somewhat reminds me of discussions on > the list about encrypting the index. See LUCENE-737 > and a discussion it points to. One of the opinions > there was that encry

Re: How do YOU detect corrupt indexes?

2007-08-02 Thread Dmitry
Not sure how exactly to understand corrupted indexes: in the sense that one could not read / use the indexes, or something else? Thanks, DT www.ejinz.com EjinZ Search Engine - Original Message - From: "Doron Cohen" <[EMAIL PROTECTED]> To: Sent: Friday, August 03, 2007 1:03 AM Subject: Re: How do

Re: How do YOU detect corrupt indexes?

2007-08-02 Thread Doron Cohen
What is the anticipated cause of corruption? Malicious? Hardware fault? This somewhat reminds me of discussions on the list about encrypting the index. See LUCENE-737 and a discussion it points to. One of the opinions there was that encryption should be handled at a lower level (OS/FS). Wouldn't that

Re: Performance improvements using writer.delete vs reader.delete

2007-08-02 Thread Doron Cohen
Andreas Knecht wrote: > We're considering using the new IndexWriter.deleteDocuments call rather > than the IndexReader.delete call. Are there any performance > improvements that this may provide, other than the benefit of not having > to switch between readers/writers? > > We've looked at LUCENE

Performance improvements using writer.delete vs reader.delete

2007-08-02 Thread Andreas Knecht
Hi, We're considering using the new IndexWriter.deleteDocuments call rather than the IndexReader.delete call. Are there any performance improvements that this may provide, other than the benefit of not having to switch between readers/writers? We've looked at LUCENE-565, but there's no cle
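For context, a minimal sketch of the writer-side deletion being discussed, against the Lucene 2.2-era API; the index path and the "id" field are placeholders, not taken from the thread:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class DeleteViaWriter {
        public static void main(String[] args) throws Exception {
            // Open an existing index (create=false); path and field name are illustrative.
            IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
            // Delete by term through the writer itself, so adds and deletes can be
            // interleaved without closing the writer and opening an IndexReader.
            writer.deleteDocuments(new Term("id", "42"));
            writer.close();
        }
    }

The structural benefit is avoiding the reader/writer switch; whether it is also faster is the open question the thread points at.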

Re: Can I do boosting based on term positions?

2007-08-02 Thread Shailendra Sharma
I am working on the SpanTermQuery implementation for you; give me today. Sorry, I was out in meetings for 2 days. Enjoy, Shailendra On 8/3/07, Cedric Ho <[EMAIL PROTECTED]> wrote: > > Hi Paul, > > Doesn't SpanFirstQuery only match those with positions less than a > certain end position? > > I am rather l

Re: Can I do boosting based on term positions?

2007-08-02 Thread Cedric Ho
Hi Paul, Doesn't SpanFirstQuery only match those with positions less than a certain end position? I am rather looking for a query that would score a document higher when terms appear near the start, but not totally discard those where terms appear near the end. Regards, Cedric On 8/2/07, Paul Elschot

Re: Getting only the Ids, not the whole documents.

2007-08-02 Thread Mark Miller
If you are just retrieving your custom id and you have more stored fields (and they are not tiny) you certainly do want to use a field selector. I would suggest SetBasedFieldSelector. - Mark testn wrote: Hi, Why don't you consider to use FieldSelector? LoadFirstFieldSelector has an ability t
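A small sketch of what a SetBasedFieldSelector call looks like (Lucene 2.2-era API); the "myId" field name is an assumption for illustration:

    import java.util.Collections;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.document.SetBasedFieldSelector;
    import org.apache.lucene.index.IndexReader;

    public class IdOnlyLoader {
        // Loads only the stored "myId" field of one hit; other stored fields are skipped.
        static String loadId(IndexReader reader, int docId) throws Exception {
            FieldSelector selector =
                    new SetBasedFieldSelector(Collections.singleton("myId"), Collections.EMPTY_SET);
            Document doc = reader.document(docId, selector);
            return doc.get("myId");
        }
    }

The second set passed to the selector holds fields to load lazily; it is left empty here.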

Re: Getting only the Ids, not the whole documents.

2007-08-02 Thread Daniel Noll
On Thursday 02 August 2007 19:28:48 Mohammad Norouzi wrote: > you should not store them in an Array structure since they will take up the > memory. > the BitSet is the best structure to store them You can't store strings in a BitSet. What I would do is return a List but make a custom subclass of

Re: Clustered Indexing on common network filesystem

2007-08-02 Thread Michael McCandless
"Zach Bailey" <[EMAIL PROTECTED]> wrote: > Unfortunately, I am not sure the leader of the project would feel good > about running code from trunk, save without an explicit endorsement from > a majority of the developers or contributors for that particular code > (do those people keep up with t

Re: Clustered Indexing on common network filesystem

2007-08-02 Thread Michael McCandless
I have been meaning to write up a Wiki page on this general topic but have not quite made time yet ... Sharing an index with a shared filesystem will work, however there are some caveats: * This is somewhat uncharted territory because it's fairly recent fixes to Lucene that have enabled

Re: Clustered Indexing on common network filesystem

2007-08-02 Thread Zach Bailey
Mark, Thanks so much for your response. Unfortunately, I am not sure the leader of the project would feel good about running code from trunk, save without an explicit endorsement from a majority of the developers or contributors for that particular code (do those people keep up with this list

Re: Clustered Indexing on common network filesystem

2007-08-02 Thread Zach Bailey
Rajesh, I forgot to mention this, but we did investigate this option as well and even prototyped it for an internal project. It ended up being too slow for us. It was adding a lot of overhead even to small updates, IIRC, mainly due to the fact that the index was essentially stored as a files

Re: Clustered Indexing on common network filesystem

2007-08-02 Thread Rajesh parab
One more alternative, though I am not sure if anyone is using it. Apache Compass has added a plug-in to allow storing Lucene index files inside the database. This should work in clustered environment as all nodes will share the same database instance. I am not sure the impact it will have on perf

Re: Clustered Indexing on common network filesystem

2007-08-02 Thread Mark Miller
Some quick info: NFS should work, but I think you'll want to be working off the trunk. Also, sharing an index over NFS is supposed to be slow. The standard so far, if you are not partitioning the index, is to use a unix/linux filesystem and hardlinks + rsync to efficiently share index changes

Re: Clustered Indexing on common network filesystem

2007-08-02 Thread Zach Bailey
Thanks for your response -- Based on my understanding, hadoop and nutch are essentially the same thing, with nutch being derived from hadoop, and are primarily intended to be standalone applications. We are not looking for a standalone application, rather we must use a framework to implement

How do YOU detect corrupt indexes?

2007-08-02 Thread Joe R
Hello, I've been asked to devise some way to discover and correct data in Lucene indexes that have been "corrupted." The word "corrupt", in this case, has a few different meanings, some of which strike me as exceedingly difficult to grok. What concerns me are the cases where we don't know that
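For the narrower, low-level sense of "corrupt" (the index can no longer be read), one brute-force check is to open a reader and touch every document and term, treating any exception as a red flag. This is only an illustrative sketch of that idea, not an existing Lucene tool of this era, and it says nothing about the harder semantic cases raised here:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;

    public class RoughIndexCheck {
        public static void main(String[] args) {
            try {
                IndexReader reader = IndexReader.open(args[0]);
                for (int i = 0; i < reader.maxDoc(); i++) {
                    if (!reader.isDeleted(i)) {
                        reader.document(i);        // forces the stored fields to be read
                    }
                }
                TermEnum terms = reader.terms();   // walk the full term dictionary
                while (terms.next()) {
                    reader.docFreq(terms.term());
                }
                terms.close();
                reader.close();
                System.out.println("No low-level read errors found");
            } catch (Exception e) {
                System.out.println("Possible corruption: " + e);
            }
        }
    }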

Re: Clustered Indexing on common network filesystem

2007-08-02 Thread testn
Why don't you check out Hadoop and Nutch? It should provide what you are looking for. Zach Bailey wrote: > > Hi, > > It's been a couple of days now and I haven't heard anything on this > topic, while there has been substantial list traffic otherwise. > > Am I asking in the wrong place? Was I

Re: Clustered Indexing on common network filesystem

2007-08-02 Thread Zach Bailey
Hi, It's been a couple of days now and I haven't heard anything on this topic, while there has been substantial list traffic otherwise. Am I asking in the wrong place? Was I unclear? I know there are people out there that have used/are using Lucene in a clustered environment. I am just looki

Re: extracting non-english text from word, pdf, etc....??

2007-08-02 Thread Ben Litchfield
In terms of PDF documents... PDFBox should work just fine with any Latin-based languages; at this time certain PDFs that have CJK characters can pose some issues. In general English/French/Spanish should be fine. Some PDFs use custom encodings that make it impossible to extract text and

Re: extracting non-english text from word, pdf, etc....??

2007-08-02 Thread Grant Ingersoll
Hey Michael, Have you given it a try? I would think they would work, but haven't actually done it. Set up a small test that reads in a PDF in French or Spanish and give it a try. You might have to worry about encodings or something, but the structure of the files should be the same, i.
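A throwaway test along those lines might look like this; the package names and load overload are tied to the PDFBox version of that period (org.pdfbox.* rather than org.apache.pdfbox.*), so treat them as assumptions:

    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    public class PdfTextCheck {
        public static void main(String[] args) throws Exception {
            PDDocument doc = PDDocument.load(args[0]);  // e.g. a French or Spanish PDF
            try {
                String text = new PDFTextStripper().getText(doc);
                System.out.println(text);   // eyeball the accents here to spot encoding problems
            } finally {
                doc.close();
            }
        }
    }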

Re: extracting non-english text from word, pdf, etc....??

2007-08-02 Thread testn
Check out.. http://wiki.apache.org/lucene-java/LuceneFAQ#head-e7d23f91df094d7baeceb46b04d518dc426d7d2e heybluez wrote: > > Yea, I have seen those. I guess the question is what do you all use to > extract text from Word, Excel, PPT and PDF? Can I use POI, PDFBox and > so on? This is what I

Re: Do AND + OR Search in Lucene

2007-08-02 Thread Erick Erickson
Alternatively, construct a parenthesized query that reflects what you want. If you do, make sure that OR is capitalized, or make REAL SURE you understand the Lucene syntax and construct your query with that syntax. Erick On 8/2/07, testn <[EMAIL PROTECTED]> wrote: > > > You can create two queries

Re: extracting non-english text from word, pdf, etc....??

2007-08-02 Thread Michael J. Prichard
Yea, I have seen those. I guess the question is what do you all use to extract text from Word, Excel, PPT and PDF? Can I use POI, PDFBox and so on? This is what I use now to extract english. Thanks, Michael testn wrote: If you can extract token stream from those files already, you can simp

RE: IndexReader deletes more than expected

2007-08-02 Thread Ridwan Habbal
Yes, you are right, thanks for the great reply! I skimmed it too quickly today, so I re-read it now and got the point you mean. I just tried Lucene 2.2.0 (I was using 2.0.0) and I could add, delete and update docs smoothly! Based on the tests I have done so far, similar to the tests I presented in my f

Re: LUCENE-843 Release

2007-08-02 Thread testn
Thanks! Will look forward to 2.3 then. Michael McCandless-2 wrote: > > > Honestly I don't really think this is a good idea. > > While LUCENE-843 has proven stable so far (knock on wood!), it is > still a major change and I do worry (less with time :) that maybe I > broke something subtle some

Re: Using Nutch APIs in Lucene

2007-08-02 Thread Grant Ingersoll
Just use Nutch. If you look in the Crawl.java class in Nutch, you can pretty easily tear out the appropriate pieces. Question is, do you really need all of that? If so, why not just use Nutch? -Grant On Aug 2, 2007, at 2:32 AM, Srinivasarao Vundavalli wrote: How can we use nutch APIs in

Re: LUCENE-843 Release

2007-08-02 Thread Michael McCandless
Honestly I don't really think this is a good idea. While LUCENE-843 has proven stable so far (knock on wood!), it is still a major change and I do worry (less with time :) that maybe I broke something subtle somewhere. While a few brave people have tested the trunk in their production worlds and

Re: Solr's NumberUtils doesnt work

2007-08-02 Thread testn
How did you encode your integer into String? Did you use int2sortableStr? is_maximum wrote: > > Hi > I am using NumberUtils to encode and decode numbers while indexing and > searching, when I am going to decode the number retrieved from an index it > throws exception for some fields > the exce
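For reference, a sketch of the encode side being asked about, using only the method named here (org.apache.solr.util.NumberUtils.int2sortableStr); the "price" field and the exact signature are assumptions:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.solr.util.NumberUtils;

    public class SortableIntDoc {
        // Writes the int in Solr's sortable-string form; only values written this way
        // can be decoded with the matching NumberUtils call at search time, which is
        // the usual reason decoding blows up on some fields.
        static Document makeDoc(int price) {
            Document doc = new Document();
            doc.add(new Field("price", NumberUtils.int2sortableStr(price),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            return doc;
        }
    }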

Re: LUCENE-843 Release

2007-08-02 Thread testn
Mike, as a committer, what do you think? Thanks! Peter Keegan wrote: > > I've built a production index with this patch and done some query stress > testing with no problems. > I'd give it a thumbs up. > > Peter > > On 7/30/07, testn <[EMAIL PROTECTED]> wrote: >> >> >> Hi guys, >> >> Do you t

Re: extracting non-english text from word, pdf, etc....??

2007-08-02 Thread testn
If you can extract token streams from those files already, you can simply use different analyzers to analyze those token streams appropriately. Check out the Lucene contrib analyzers at http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/ heybluez wr
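As a concrete, hypothetical example, indexing extracted French text with the contrib FrenchAnalyzer; the index path and field name are placeholders:

    import org.apache.lucene.analysis.fr.FrenchAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class FrenchIndexing {
        public static void main(String[] args) throws Exception {
            // FrenchAnalyzer ships in the contrib analyzers jar linked above.
            IndexWriter writer = new IndexWriter("/path/to/index", new FrenchAnalyzer(), true);
            Document doc = new Document();
            doc.add(new Field("content", "texte extrait d'un document Word ou PDF",
                    Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.close();
        }
    }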

RE: High CPU usage duing index and search

2007-08-02 Thread testn
20,000 queries continuously? Sounds a bit too much. Can you elaborate more on what you need to do? Probably you won't need that many queries. Chew Yee Chuang wrote: > > Hi, > > Thanks for the link provided. Actually I've gone through those articles when I was developing the index and search functio

Re: Getting only the Ids, not the whole documents.

2007-08-02 Thread testn
Hi, why don't you consider using FieldSelector? LoadFirstFieldSelector can help you load only the first field in the document without loading all the fields. After that, you can keep the whole document if you like. It should help improve performance. is_maximum wrote: >

Re: Do AND + OR Search in Lucene

2007-08-02 Thread testn
You can create two queries from two query parsers, one with AND and the other with OR. After you create both of them, call setBoost() to give them different boost levels and then join them together using a BooleanQuery with BooleanClause.Occur.SHOULD. That should do the trick. askarzaidi w
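A sketch of that two-parser recipe (Lucene 2.2-era API); the "text" field and the boost values are made up for illustration:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;

    public class AndOrQuery {
        static Query build(String userQuery) throws Exception {
            QueryParser andParser = new QueryParser("text", new StandardAnalyzer());
            andParser.setDefaultOperator(QueryParser.Operator.AND);
            Query andQuery = andParser.parse(userQuery);
            andQuery.setBoost(2.0f);   // documents matching all terms score higher

            QueryParser orParser = new QueryParser("text", new StandardAnalyzer());
            orParser.setDefaultOperator(QueryParser.Operator.OR);
            Query orQuery = orParser.parse(userQuery);

            BooleanQuery combined = new BooleanQuery();
            combined.add(andQuery, BooleanClause.Occur.SHOULD);
            combined.add(orQuery, BooleanClause.Occur.SHOULD);
            return combined;
        }
    }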

Do AND + OR Search in Lucene

2007-08-02 Thread Askar Zaidi
Hey Guys, Quick question: I do this in my code for searching: queryParser.setDefaultOperator(QueryParser.Operator.AND); Lucene is OR by default so I change it to AND for my requirements. Now, I have a requirement to do OR as well. I mean while doing AND I'd like to include results from OR too .

Re: Getting only the Ids, not the whole documents.

2007-08-02 Thread Mohammad Norouzi
You should not store them in an Array structure since they will take up memory; the BitSet is the best structure to store them. On 8/2/07, makkhar <[EMAIL PROTECTED]> wrote: > > > Heres my index structure : > > Document -> contract ID -id (index AND store) > -> paramName

Re: Getting only the Ids, not the whole documents.

2007-08-02 Thread Mohammad Norouzi
Yes, it decreases performance, but it's the only solution. I've spent many weeks looking for the best way to retrieve my own IDs and settled on this one in the end. Now I am storing the ids in a BitSet structure and it's fast enough: public void collect(...){ idBitSet.set(Integer.valueOf(searcher.doc(id).get("MyOwnI
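A complete version of the collector sketched in that message; the preview is cut off, so the "MyOwnId" field name and its numeric values are assumptions. The per-hit searcher.doc() call inside collect() is exactly where the performance cost comes from:

    import java.util.BitSet;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.IndexSearcher;

    public class OwnIdCollector extends HitCollector {
        private final IndexSearcher searcher;
        private final BitSet idBitSet = new BitSet();

        public OwnIdCollector(IndexSearcher searcher) {
            this.searcher = searcher;
        }

        public void collect(int doc, float score) {
            try {
                // Load the stored application id for this hit and remember it as a bit.
                idBitSet.set(Integer.parseInt(searcher.doc(doc).get("MyOwnId")));
            } catch (Exception e) {
                throw new RuntimeException(e);   // searcher.doc() declares IOException
            }
        }

        public BitSet getIdBitSet() {
            return idBitSet;
        }
    }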

RE: Getting only the Ids, not the whole documents.

2007-08-02 Thread makkhar
Here's my index structure: Document -> contract ID - id (index AND store) -> paramName - name (index AND store) -> paramValue - value (index BUT NOT store). When I get back 2 hits, each document contains ID and paramName; I have no interest in paramN

Re: Getting only the Ids, not the whole documents.

2007-08-02 Thread makkhar
Hi, the solution you suggested will definitely work, but it will slow down my search by an order of magnitude. The problem I am trying to solve is performance; that's why I need the collection of IDs and not the whole documents. - thanks for the prompt reply. is_maximum wrote: > > y

RE: Getting only the Ids, not the whole documents.

2007-08-02 Thread Chhabra, Kapil
What is the structure of your index? If you haven't already, then add a new field to your index that stores the contractId. For all other fields, set the "store" flag to false while indexing. You can now safely retrieve the value of this contractId field based on your search results. Regards, kapil
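In code, that suggestion amounts to a sketch like the following (Lucene 2.2-era API), using the field names from the index structure quoted elsewhere in the thread:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class ContractDoc {
        // Only the contract id is stored; name and value are indexed but not stored,
        // so hits can return the id without dragging the other fields along.
        static Document build(String contractId, String paramName, String paramValue) {
            Document doc = new Document();
            doc.add(new Field("id", contractId, Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("name", paramName, Field.Store.NO, Field.Index.TOKENIZED));
            doc.add(new Field("value", paramValue, Field.Store.NO, Field.Index.TOKENIZED));
            return doc;
        }
    }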

Re: Getting only the Ids, not the whole documents.

2007-08-02 Thread Mohammad Norouzi
Yes, if you extend your class from HitCollector and override the collect() method with the following signature, you can get the IDs: public void collect(int id, float score) On 8/2/07, makkhar <[EMAIL PROTECTED]> wrote: > > > Hi all, > >Can I get just a list of document Ids given a search criteria ? To >
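A minimal collector built on that signature (Lucene 2.2-era HitCollector); it records Lucene's internal document numbers, and mapping those to stored application ids is a separate step:

    import java.util.BitSet;
    import org.apache.lucene.search.HitCollector;

    public class DocIdCollector extends HitCollector {
        private final BitSet docIds = new BitSet();

        public void collect(int doc, float score) {
            docIds.set(doc);   // remember the internal doc number, ignore the score
        }

        public BitSet getDocIds() {
            return docIds;
        }
    }

It would be driven by something like searcher.search(query, new DocIdCollector()).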

Getting only the Ids, not the whole documents.

2007-08-02 Thread makkhar
Hi all, Can I get just a list of document Ids given a search criteria ? To elaborate here is my situation: I store 2 contracts in the file system index each with some parameterName and Value. Given a search criterion - (paramValue='draft'). I need to get just an ArrayList of Strings conta

RE: IndexReader deletes more than expected

2007-08-02 Thread Ridwan Habbal
Yes, you are correct, I close the IndexWriter and then add more docs. What's wrong? It worked out fine, and the docs I add will appear to NEW INSTANCES OF INDEX SEARCHERS after calling close on the IndexWriter. As for creating a new IndexWriter, I tried to; however, I suffered from the lock exception
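For readers hitting the same lock exception, a minimal close-and-reopen sketch (Lucene 2.x API; the path is a placeholder). The usual cause of a "Lock obtain timed out" error is an earlier IndexWriter that was never closed:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class ReopenWriter {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
            // ... add documents ...
            writer.close();   // releases the write lock

            // Reopen against the existing index (create=false) instead of recreating it.
            writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
            // ... add or delete more documents ...
            writer.close();
        }
    }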

RE: IndexReader deletes more than expected

2007-08-02 Thread Ridwan Habbal
> Subject: RE: IndexReader deletes more that expected> Date: Wed, 1 Aug 2007 > 09:07:32 -0700> From: [EMAIL PROTECTED]> To: java-user@lucene.apache.org> > > If I'm reading this correctly, there's something a little wonky here. In> > your example code, you close the IndexWriter and the

Re: Can I do boosting based on term positions?

2007-08-02 Thread Paul Elschot
Cedric, SpanFirstQuery could be a solution without payloads. You may want to give it your own Similarity.sloppyFreq(). Regards, Paul Elschot On Thursday 02 August 2007 04:07, Cedric Ho wrote: > Thanks for the quick response =) > > On 8/1/07, Shailendra Sharma <[EMAIL PROTECTED]> wrote: > > Yes
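One way to act on that suggestion without discarding late matches is to OR a SpanFirstQuery with a plain TermQuery, so early occurrences earn extra score while anything else still qualifies. This is only an illustrative sketch, not the posters' actual code; the "body" field, the 100-position window, and the boost are assumptions:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.spans.SpanFirstQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class EarlyPositionBoost {
        static BooleanQuery build(String word) {
            Term term = new Term("body", word);

            // Matches only within the first 100 positions; boosted to reward early hits.
            SpanFirstQuery early = new SpanFirstQuery(new SpanTermQuery(term), 100);
            early.setBoost(3.0f);

            // Matches anywhere, so documents with late occurrences are not dropped.
            TermQuery anywhere = new TermQuery(term);

            BooleanQuery q = new BooleanQuery();
            q.add(early, BooleanClause.Occur.SHOULD);
            q.add(anywhere, BooleanClause.Occur.SHOULD);
            return q;
        }
    }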