Re: Can this regex be done?

2009-09-02 Thread Michael Thomsen
Because some of the queries that I have to convert (without modifying them, unfortunately) have literally half a page of statements expressed like that, which, if expanded, would equal a several-page-long Lucene query. On Wed, Sep 2, 2009 at 6:42 PM, Luis Alves wrote: > Why can't you use an OR? got

Re: Can this regex be done?

2009-09-02 Thread Luis Alves
Why can't you use an OR? gotham OR gothic Is it possible to translate this sort of Perl regex into a Lucene query: /goth(am|ic)/ Where the only results that would be returned would be gotham or gothic? Thanks, Mike
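The OR suggestion above amounts to expanding the alternation by hand. For the simple shape in the question, that expansion can be mechanized. The following is only a sketch in plain Java: the class and method names (AlternationExpander, toOrQuery) are hypothetical, it uses java.util.regex rather than any Lucene API, and it handles nothing beyond a literal prefix, one parenthesized alternation, and a literal suffix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: expands prefix(a|b|...)suffix
// into the equivalent "term OR term" query string.
class AlternationExpander {
    // Accepts only literal word characters around a single alternation group.
    private static final Pattern SIMPLE =
        Pattern.compile("(\\w*)\\((\\w+(?:\\|\\w+)*)\\)(\\w*)");

    static String toOrQuery(String regex) {
        Matcher m = SIMPLE.matcher(regex);
        if (!m.matches()) {
            throw new IllegalArgumentException("unsupported pattern: " + regex);
        }
        List<String> terms = new ArrayList<>();
        // Glue the shared prefix and suffix onto each alternative.
        for (String alt : m.group(2).split("\\|")) {
            terms.add(m.group(1) + alt + m.group(3));
        }
        return String.join(" OR ", terms);
    }

    public static void main(String[] args) {
        System.out.println(toOrQuery("goth(am|ic)")); // prints "gotham OR gothic"
    }
}
```

A pattern with many alternations still expands into a long query, so this does not remove the query-length concern; it only removes the manual work.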

Re: Can this regex be done?

2009-09-02 Thread Robert Muir
Have you tried the regex package in Lucene's contrib? http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/regex/package-summary.html There are several implementations. I am not sure whether any of them is exactly "Perl compatible", but for your example I think it will do the trick. On Wed, S

Can this regex be done?

2009-09-02 Thread Michael Thomsen
Is it possible to translate this sort of Perl regex into a Lucene query: /goth(am|ic)/ Where the only results that would be returned would be gotham or gothic? Thanks, Mike

RE: New "Stream closed" exception with Java 6

2009-09-02 Thread Chris Bamford
Hi Grant, I have now followed Daniel's advice and catch the exception with: try { indexWriter.addDocument(doc); } catch (CorruptIndexException ex) { throw new IndexerException ("CorruptIndexException on doc: " + doc.toString(), ex); } catch (IOException ex) {

Use of tika for parsing, offsets questions

2009-09-02 Thread David Causse
Hi, if I use Tika to parse HTML and pass the parsed String to a Lucene analyzer, what happens to the offset information needed for KWIC and for returning to the original text (like the Google cache view)? How can I keep track of the offsets between the Tika parser and the Lucene analyzer? What are the solutions/ideas to do a sort

Re: First result in the group

2009-09-02 Thread mark harwood
See "DuplicateFilter" in contrib. http://markmail.org/message/lsvnpu7mwhht3a4p Cheers Mark - Original Message From: Ganesh To: java-user@lucene.apache.org Sent: Wednesday, 2 September, 2009 12:38:35 Subject: Re: First result in the group I have a field called category and all docume

Re: First result in the group

2009-09-02 Thread Shai Erera
I see ... the solution I have in mind is not simple, but it follows the Collector approach. Index categories as payloads of documents such that there is one field (cats:all for example) that includes a posting list for all documents, each has the categories it is associated w/ in its payload: cats:

Re: First result in the group

2009-09-02 Thread Ganesh
I have a field called category, and every document belongs to some category (say, some belong to X and some to Y, etc.). The field values may change dynamically. I want the search results to be filtered to retrieve one document per category. This is similar to the 'group by' feature in a database.
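The hash-based approach mentioned in this thread (walk the ranked results and keep the first hit seen for each category) can be sketched as follows. This is an illustration only: Hit is a hypothetical stand-in for a search result, not a Lucene class, and the sketch assumes the hits arrive already sorted by rank.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FirstPerCategory {
    // Hypothetical search hit: a document id plus its category field value.
    static final class Hit {
        final int docId;
        final String category;

        Hit(int docId, String category) {
            this.docId = docId;
            this.category = category;
        }
    }

    // Keeps the first (best-ranked) hit per category, preserving rank order.
    static List<Hit> firstPerCategory(List<Hit> rankedHits) {
        Map<String, Hit> first = new LinkedHashMap<>();
        for (Hit hit : rankedHits) {
            first.putIfAbsent(hit.category, hit); // later hits in the same category are skipped
        }
        return new ArrayList<>(first.values());
    }
}
```

This still visits every hit, so it is the baseline the question wants to improve on rather than a replacement for the filter or collector approaches suggested in the replies.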

Re: First result in the group

2009-09-02 Thread Shai Erera
What do you mean by "first result in the group"? What is a group? On Wed, Sep 2, 2009 at 1:36 PM, Ganesh wrote: > Hello all, > > I want to retrieve the first result in the group. How to achieve this? > Currently I am parsing all the results, using a hash and avoiding duplicate > entries. > > Is

Re: lucene on amazon s3

2009-09-02 Thread Michael McCandless
So long as you can ensure, external to Lucene, that only one IndexWriter is open at once on the index, you can disable all of Lucene's normal locking. But you must be certain: if you accidentally allow two IndexWriters to be open at once, it will quickly corrupt the index. Beyond locking, Lucene
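As a rough illustration of the "only one writer at a time" requirement, here is a minimal guard that hands out a single writer slot. It is a sketch under strong assumptions: SingleWriterGuard is a hypothetical class, it calls no Lucene API, and it only protects code running inside one JVM. Coordinating writers across machines would need an external service instead.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical in-process guard: at most one caller may hold the writer slot.
class SingleWriterGuard {
    private final AtomicBoolean held = new AtomicBoolean(false);

    // Returns true only for the caller that actually obtains the slot.
    boolean tryAcquire() {
        return held.compareAndSet(false, true);
    }

    // Frees the slot; fails loudly if nobody was holding it.
    void release() {
        if (!held.compareAndSet(true, false)) {
            throw new IllegalStateException("writer slot was not held");
        }
    }
}
```

The intended use would be to open the index writer only after tryAcquire() returns true and to call release() in a finally block after closing it.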

First result in the group

2009-09-02 Thread Ganesh
Hello all, I want to retrieve the first result in the group. How to achieve this? Currently I am parsing all the results, using a hash and avoiding duplicate entries. Is there any better way? Regards Ganesh

Re: lucene on amazon s3

2009-09-02 Thread Simon Willnauer
Hey there, AFAIK this problem on S3 has not been solved, but there might be other ways to work around it. As you are running on Amazon anyway, you might want to consider using a locking service like ZooKeeper (http://hadoop.apache.org/zookeeper/) which could help you with other

Re: Deletion of words in articles of Wikipedia

2009-09-02 Thread mark harwood
>>I need to start off with this project where we can find the ranking of >>controversial articles. Could anyone kindly help me how to start? Check out the Wikipedia "logging" dumps, which contain the reasons for actions on page titles (including IP blocks and deletes) but without the bulk of the

lucene on amazon s3

2009-09-02 Thread prasenjit
I am exploring the possibility of creating large Lucene indices via EC2/S3. So far I have found only the following URL: http://www.kimchy.org/lucene-and-amazon-s3/ But I still don't know whether the Lucene locking problem (on a distributed FS like S3/DFS) is fixed or not. Any information is great

Re: Performance diffs between filter.bits() and searcher.docFreq()

2009-09-02 Thread Konstantyn Smirnov
hossman wrote: > > "the second", and "no" > Thanks for that. Concerning the *theoretical* performance difference, for a mid-size index, what will it be in % roughly? Is there any way to make indexReader.docFreqs() reflect the changes faster, i.e. without the need to optimize()? - Kon