Re: Multisearcher will maintain index order sorting?

2008-10-23 Thread Ganesh
Any comments or suggestions are greatly appreciated. Regards, Ganesh - Original Message - From: "Ganesh" <[EMAIL PROTECTED]> To: Sent: Thursday, October 23, 2008 3:45 PM Subject: Re: Multisearcher will maintain index order sorting? Multisearcher after performing search on second inde

RE: How to use regexQuery along with fuzzy logic capabilities

2008-10-23 Thread Agrawal, Aashish (IT)
Any comments or help on this question? Thanks, Aashish. Hi, I want to use Lucene for a simple search engine. If I use code like this: QueryParser parser = new QueryParser(field, analyzer); Query query = parser.parse(line); searcher.search(query) the above code doesn't give me regular expr

Re: Question about QueryParser

2008-10-23 Thread James liu
Thanks Steve, I get it. 2008/10/24 Steven A Rowe <[EMAIL PROTECTED]> > Hi James, > > On 10/23/2008 at 8:30 AM, James liu wrote: > > public class AnalyzerTest { > >@Test > >public void test() throws ParseException { > >QueryParser parser = new MultiFieldQueryParser(new > String[]{"ti

Any Spanish analyzer available?

2008-10-23 Thread Zhang, Lisheng
Hi, Is there any Spanish analyzer available for Lucene applications? I did not see any in the Lucene 2.4.0 contrib folders. Thanks very much for your help, Lisheng

Re: Multi -threaded indexing of large number of PDF documents

2008-10-23 Thread Michael McCandless
Glen Newton wrote: 2008/10/23 Michael McCandless <[EMAIL PROTECTED]>: Mark Miller wrote: Glen Newton wrote: 2008/10/23 Mark Miller <[EMAIL PROTECTED]>: It sounds like you might have some thread synchronization issues outside of Lucene. To simplify things a bit, you might try just usin

Re: Multi -threaded indexing of large number of PDF documents

2008-10-23 Thread Glen Newton
2008/10/23 Michael McCandless <[EMAIL PROTECTED]>: > > Mark Miller wrote: > >> Glen Newton wrote: >>> >>> 2008/10/23 Mark Miller <[EMAIL PROTECTED]>: >>> It sounds like you might have some thread synchronization issues outside of Lucene. To simplify things a bit, you might try just u

Re: Multi -threaded indexing of large number of PDF documents

2008-10-23 Thread Michael McCandless
Mark Miller wrote: Glen Newton wrote: 2008/10/23 Mark Miller <[EMAIL PROTECTED]>: It sounds like you might have some thread synchronization issues outside of Lucene. To simplify things a bit, you might try just using one IndexWriter. If I remember right, the IndexWriter is now pretty effi

Re: Multi -threaded indexing of large number of PDF documents

2008-10-23 Thread Michael McCandless
Also, could you kill your process with -QUIT (on Linux; maybe there is something analogous on Windows?) when you see the threads hanging? That will give a stack dump for every thread. Mike Grant Ingersoll wrote: Can you describe your process a bit more? Are you measuring just the Luce
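On the question of a Windows analogue: one portable option is to have the JVM dump its own threads. The following is only a minimal sketch using the standard `Thread.getAllStackTraces()` call; the class name `ThreadDump` and the output layout are made up for illustration and are not from the thread.

```java
import java.util.Map;

public class ThreadDump {

    /** Builds a stack dump of every live thread in this JVM. */
    static String dumpAllThreads() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            // Header line: thread name and current state (RUNNABLE, BLOCKED, ...).
            out.append('"').append(thread.getName()).append("\" state=")
               .append(thread.getState()).append('\n');
            // One line per stack frame, innermost frame first.
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

Calling `dumpAllThreads()` from a watchdog thread when indexing appears stuck should show which threads are BLOCKED or WAITING and where.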

Re: Multi -threaded indexing of large number of PDF documents

2008-10-23 Thread Mark Miller
Glen Newton wrote: 2008/10/23 Mark Miller <[EMAIL PROTECTED]>: It sounds like you might have some thread synchronization issues outside of Lucene. To simplify things a bit, you might try just using one IndexWriter. If I remember right, the IndexWriter is now pretty efficient, and there isn't

Re: Multi -threaded indexing of large number of PDF documents

2008-10-23 Thread Grant Ingersoll
Can you describe your process a bit more? Are you measuring just the Lucene part or the whole ingestion part as well? If it's the latter, how do you know the issue is in Lucene? PDF extraction is annoying at best and highly problematic at its worst. Not saying it isn't Lucene, but I've

RE: Question about QueryParser

2008-10-23 Thread Steven A Rowe
Hi James, On 10/23/2008 at 8:30 AM, James liu wrote: > public class AnalyzerTest { >@Test >public void test() throws ParseException { >QueryParser parser = new MultiFieldQueryParser(new String[]{"title", > "body"}, new StandardAnalyzer()); >Query query1 = parser.parse("中文"

Re: Multi -threaded indexing of large number of PDF documents

2008-10-23 Thread Glen Newton
2008/10/23 Mark Miller <[EMAIL PROTECTED]>: > It sounds like you might have some thread synchronization issues outside of > Lucene. To simplify things a bit, you might try just using one IndexWriter. > If I remember right, the IndexWriter is now pretty efficient, and there > isn't much need to inde

Re: Multi -threaded indexing of large number of PDF documents

2008-10-23 Thread Mark Miller
It sounds like you might have some thread synchronization issues outside of Lucene. To simplify things a bit, you might try just using one IndexWriter. If I remember right, the IndexWriter is now pretty efficient, and there isn't much need to index to smaller indexes and then merge. There is a
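The single-writer suggestion can be sketched roughly as below: several worker threads do the expensive per-document work and all feed one shared writer. This is a sketch under stated assumptions, not Lucene code: `DocumentSink` and `CountingSink` are hypothetical stand-ins for a thread-safe writer such as IndexWriter, and the "extraction" step is just a `trim()`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedWriterSketch {

    /** Hypothetical stand-in for one shared, thread-safe index writer. */
    interface DocumentSink {
        void add(String text);
    }

    /** Only counts documents; a real sink would add them to the index. */
    static final class CountingSink implements DocumentSink {
        final AtomicInteger added = new AtomicInteger();

        @Override
        public void add(String text) {
            added.incrementAndGet();
        }
    }

    /** Pushes every input through the one shared sink from a fixed thread pool. */
    static void indexAll(List<String> inputs, DocumentSink sink, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String input : inputs) {
            // The costly step (e.g. PDF text extraction) would run here, in parallel.
            pool.submit(() -> sink.add(input.trim()));
        }
        pool.shutdown();
        try {
            // Wait for all workers before the caller closes the writer.
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
    }

    public static void main(String[] args) {
        CountingSink sink = new CountingSink();
        indexAll(Arrays.asList("a.pdf", "b.pdf", "c.pdf"), sink, 2);
        System.out.println("documents added: " + sink.added.get());
    }
}
```

The point of the layout is that only the extraction is parallelised by the application; any synchronization needed for adding documents is left to the one writer.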

Re: Multi -threaded indexing of large number of PDF documents

2008-10-23 Thread Glen Newton
You might want to look at my indexing of 6.4 million PDF articles, full-text and metadata. It resulted in an 83GB index taking 20.5 hours to run. It uses multiple writers, is massively multithreaded. More info here: http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html Che

Multi -threaded indexing of large number of PDF documents

2008-10-23 Thread Sudarsan, Sithu D.
Hi, We are trying to index a large collection of PDF documents, with sizes varying from a few KB to a few GB. We use Lucene 2.3.2 with JDK 1.6.0_01 (with PDFBox for text extraction) on Windows as well as CentOS Linux. Used java -Xms and -Xmx options, both at 1080m, even though we have 4GB on Windows and 32 GB

Re: Lucene Payload

2008-10-23 Thread Grant Ingersoll
You can search the archives for some background info. Also, Michael Busch has a nice presentation from ApacheCon at http://people.apache.org/~buschmi/apachecon/AdvancedIndexingLuceneAtlanta07.ppt Basically, the payload allows you to associate an arbitrary byte array with 1 or more terms. O

Lucene Payload

2008-10-23 Thread Anshul jain
Hi all, Has anyone used the payload functionality in Lucene? I would really appreciate it if someone could provide an explanation using code or something. Thanks, Anshul

Re: Combining keyword queries with database-style queries

2008-10-23 Thread Erick Erickson
Well, assuming that token_count is an indexed field in your documents (i.e. not something you're computing on the fly), just use a RangeQuery for the numeric part. Actually, you probably want to use ConstantScoreRangeQuery... The only thing you have to watch is that Lucene does a lexical compare,
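A common way to cope with lexical comparison is to index numbers as fixed-width, zero-padded strings, so that string order matches numeric order. The sketch below assumes non-negative integers; the class name `NumericPadding` and the chosen width are illustrative only, and the padded string would be what goes into the indexed field and into the range bounds.

```java
import java.util.Locale;

public class NumericPadding {

    /** Left-pads a non-negative value with zeros to a fixed width. */
    static String pad(int value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different encoding");
        }
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        String padded = String.format(Locale.ROOT, "%0" + width + "d", value);
        if (padded.length() > width) {
            throw new IllegalArgumentException("value does not fit in " + width + " digits");
        }
        return padded;
    }

    public static void main(String[] args) {
        // Unpadded, "10" sorts before "9" as a string.
        System.out.println("10".compareTo("9") < 0);
        // Padded to the same width, 9 sorts before 10 again.
        System.out.println(pad(9, 6).compareTo(pad(10, 6)) < 0);
    }
}
```

The width has to be the same at index time and at query time, and large enough for the biggest value you expect.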

Re: Question about QueryParser

2008-10-23 Thread Erick Erickson
It looks to me like you've got a space between the characters in the second example. Best, Erick 2008/10/23 James liu <[EMAIL PROTECTED]> > public class AnalyzerTest { > @Test > public void test() throws ParseException { > QueryParser parser = new MultiFieldQueryParser(new String[]{"

Re: Combining keyword queries with database-style queries

2008-10-23 Thread mathieu
Compass handles that nicely. You can query Lucene first and build an IN (...) clause for your SQL database, or you can query SQL first and handle the result with a bitset in Lucene. M. On Thu, 23 Oct 2008 14:27:53 +0200, Niels Ott <[EMAIL PROTECTED]> wrote: > Hi everybody, > > I need to query for documents

Combining keyword queries with database-style queries

2008-10-23 Thread Niels Ott
Hi everybody, I need to query for documents not only for search terms but also for numeric values (or other general types). Let me try to explain with a hypothetical example. Assuming there is a value for the number of words in each document (or the number of person names, or whatever), I would wan

Question about QueryParser

2008-10-23 Thread James liu
public class AnalyzerTest { @Test public void test() throws ParseException { QueryParser parser = new MultiFieldQueryParser(new String[]{"title", "body"}, new StandardAnalyzer()); Query query1 = parser.parse("中文"); Query query2 = parser.parse("中 文"); System.out.pri

Re: Multisearcher will maintain index order sorting?

2008-10-23 Thread Ganesh
After searching the second index, MultiSearcher adds the max doc ID of the first index to each resulting doc ID. In my case that would be 3. After incrementing the doc ID, the document is inserted into the FieldDocSortedHitQueue. FieldDocSortedHitQueue is an extension of priority queue shoul

Re: Multisearcher will maintain index order sorting?

2008-10-23 Thread Hadi Forghani
Because when you want to find the X of the second index, you should pass docId=3 to MultiSearcher, and MultiSearcher can find the sub-searcher for this document by calculating the lengths of all sub-searchers. For example, when you get the doc with docID 3 (the second X), MultiSearcher (see the code of MultiSearcher doc(int i)) subtracts
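The offset arithmetic described in this thread can be sketched as follows, for the case of two indexes of 3 docs each. This is not MultiSearcher's actual code, only an illustration of the idea; the class name `DocIdMapping` and all method names are made up.

```java
public class DocIdMapping {

    /** starts[i] is the global ID of the first doc of sub-index i; the last entry is the total. */
    static int[] starts(int[] maxDocs) {
        int[] starts = new int[maxDocs.length + 1];
        for (int i = 0; i < maxDocs.length; i++) {
            starts[i + 1] = starts[i] + maxDocs[i];
        }
        return starts;
    }

    /** Local doc ID within a sub-index to global doc ID: add that sub-index's offset. */
    static int toGlobal(int[] starts, int subIndex, int localDoc) {
        return starts[subIndex] + localDoc;
    }

    /** Finds which sub-index holds the given global doc ID. */
    static int subIndexOf(int[] starts, int globalDoc) {
        if (globalDoc < 0 || globalDoc >= starts[starts.length - 1]) {
            throw new IllegalArgumentException("doc ID out of range: " + globalDoc);
        }
        int sub = 0;
        while (globalDoc >= starts[sub + 1]) {
            sub++;
        }
        return sub;
    }

    /** Global doc ID back to the local ID: subtract the owning sub-index's offset. */
    static int toLocal(int[] starts, int globalDoc) {
        return globalDoc - starts[subIndexOf(starts, globalDoc)];
    }

    public static void main(String[] args) {
        int[] starts = starts(new int[] {3, 3});       // offsets 0 and 3, total 6
        System.out.println(toGlobal(starts, 1, 0));    // doc 0 of the second index
        System.out.println(subIndexOf(starts, 3));     // which index holds global doc 3
        System.out.println(toLocal(starts, 3));        // its local ID there
    }
}
```

With these offsets, doc 0 of the second index becomes global doc 3, and looking up global doc 3 leads back to local doc 0 of the second index.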

Re: Multisearcher will maintain index order sorting?

2008-10-23 Thread Ganesh
In IndexA there are 3 docs:
DocID, Terms
0, X
1, X Y
2, X Z

In IndexB there are 3 docs:
DocID, Terms
0, X
1, X Y
2, X Z

When I sort on index order using MultiSearcher and ParallelMultiSearcher, it returns the result:
0, X
3, X
1, X Y
4, X Y
2, X Z
5, X Z

But it should be in the order of 0,1,2,3,4,5. Co

Re: Query Expansion Module for Lucene based on BM25 ranking function

2008-10-23 Thread Joaquin Perez Iglesias
Hi Grant and Jose, just to give some more details: as Jose said, avg_length is precalculated at indexing time using a specific Similarity class. Basically this can be done through the lengthNorm method: for each document and field the total length is stored, when the indexing process is finish

Re: Multisearcher will maintain index order sorting?

2008-10-23 Thread Ganesh
When requested to sort on doc (index order), MultiSearcher and ParallelMultiSearcher merge the results by the docID of each DB. Regards Ganesh - Original Message - From: "Paul Smith" <[EMAIL PROTECTED]> To: Sent: Thursday, October 23, 2008 10:57 AM Subject: Re: Multisearcher will m