Re: Lucene as a primary datastore

2010-01-19 Thread Ganesh
Thanks Otis. The download link sent via email has a file called cemail. There is no extension. I tried opening it as html and pdf, but it does not open properly. Regards Ganesh - Original Message - From: "Otis Gospodnetic" To: Sent: Wednesday, January 20, 2010 11:54 AM Subject: Re: Lucene as a primary da

Re: Lucene as a primary datastore

2010-01-19 Thread Otis Gospodnetic
Have you seen the "Hot Backups with Lucene" paper available via http://www.manning.com/hatcher3/ ? Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message > From: Ganesh > To: java-user@lucene.apache.org > Sent: Wed, January 20, 2010 1:13:21 AM > Subjec

Re: Lucene as a primary datastore

2010-01-19 Thread Ganesh
We have data in compound files and we use Lucene as our primary database. It's working great and is much faster with millions of records. The only issue I face is with sorting: Lucene sorting consumes a good amount of memory. I don't know much about the MySQL/PostgreSQL databases, and how they behave with

Re: Lucene as a primary datastore

2010-01-19 Thread Otis Gospodnetic
You are not alone, Guido. It's a good question. In my experience, Lucene is as stable as MySQL/PostgreSQL in terms of its ability to hold your data and not corrupt it. Of course, even with the most expensive databases, you'd want to make backups. The same goes with Lucene. Nowadays, one way

Re: Tag Index patch (LUCENE-1292) status?

2010-01-19 Thread Jason Rutherglen
Hi Chris, It's not actively being worked on. Are you interested in working on it? Jason On Tue, Jan 19, 2010 at 4:42 PM, Chris Harris wrote: > I'm interested in the Tag Index patch (LUCENE-1292), in particular > because of how it enables you to modify certain fields without > reindexing a whol

Lucene as a primary datastore

2010-01-19 Thread Guido Bartolucci
I know that the primary use case for Lucene is as an index of data that can be reconstructed (e.g., from a relational database or from spidering your corporate intranet). But, I'm curious if anyone uses Lucene as their primary datastore for their gold data. Is it good enough? Would anyone conside

Re: incremental document field update

2010-01-19 Thread Babak Farhang
> I see -- so your file format allows you to append to the same file > without affecting prior readers? We never do that in Lucene today > (all files are "write once"). Yes. For the most part it only appends. The exception is when the log's entry count is updated (when the appends actually "commi

Re: Unary Operators and Operator Precedence

2010-01-19 Thread Ahmet Arslan
> Here are some questions about unary > operators and operator precedence or default order of > operation. > > We all know the importance of order of operation of binary > operators (ones that operate on two operands) such as AND > and OR. We know how to impose express order of operation by > grou

Tag Index patch (LUCENE-1292) status?

2010-01-19 Thread Chris Harris
I'm interested in the Tag Index patch (LUCENE-1292), in particular because of how it enables you to modify certain fields without reindexing a whole document. However, that issue is marked Lucene 2.3.1 and hasn't been updated since July 2008. Can anyone provide any status updates on this patch? Que

Re: Hints on implementing XQuery full-text search

2010-01-19 Thread Chris Hostetter
: I'm about to embark on implementing the full-text search feature of XQuery: Good luck with that. Here are some quick suggestions on how I'd try to tackle the things you asked about, w/o putting much thought into... : title ftcontains "usability" occurs at least 2 times assuming this is

Lucene 2.9.1 'read past EOF' IOException under system load

2010-01-19 Thread Franz Garsombke
We have been able to expose this exception under system load but NOT with individual requests. Lucene version is 2.9.1. These indexed files are being read over NFS. Java version is: Java(TM) SE Runtime Environment (build 1.6.0_11-b03) Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mod

Re: unique term identifiers

2010-01-19 Thread Grant Ingersoll
Have a look at Mahout (Lucene sister project), which can create SparseVectors from Lucene term vectors where the entries are the term id and the "weight" of the term. Trivial to replicate what is done in Mahout for LibSVM or ARFF or whatever. On Jan 18, 2010, at 9:07 AM, Solt, Illés wrote: >
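
The idea described above (a sparse vector whose entries are a term id and the term's weight, written out in a format such as LibSVM) can be sketched roughly as follows. This is a minimal illustration only: it does not use Mahout's or Lucene's API, and the function names, the term-id dictionary, and the sample weights are all hypothetical.

```python
def to_sparse_vector(term_weights, term_ids):
    """Turn {term: weight} into a list of (term_id, weight) pairs sorted by id.

    term_ids is an assumed, pre-built mapping from term text to a unique
    integer id; terms missing from it are skipped.
    """
    return sorted((term_ids[t], w) for t, w in term_weights.items() if t in term_ids)


def to_libsvm_line(label, sparse_vector):
    """Format one vector as a LibSVM-style line: '<label> <id>:<weight> ...'."""
    features = " ".join(f"{term_id}:{weight:g}" for term_id, weight in sparse_vector)
    return f"{label} {features}"


# Hypothetical dictionary and per-document term weights.
term_ids = {"apache": 1, "lucene": 2, "search": 3}
doc_weights = {"search": 2.0, "apache": 0.5}

vector = to_sparse_vector(doc_weights, term_ids)
line = to_libsvm_line(1, vector)
print(line)
```

Sorting by term id keeps the feature indices in ascending order, which is what LibSVM-style readers generally expect.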

Re: Proximity of More than Single Words?

2010-01-19 Thread Ahmet Arslan
> For proximity expressions, the query > parser documentation says, "use the tilde, "~", symbol at > the end of a Phrase." It gives the example "jakarta > apache"~10 > > Does this mean that proximity can only be operated on > single words enquoted in quotation marks? Yes if you are using QueryPar

Re: Unary Operators and Operator Precedence

2010-01-19 Thread Marvin Humphrey
> 3.) Does grouping or nesting affect results with unary operators? Does > using unary operators with binary operators affect results. For example, > in the query: > > (+a +b) OR c > > has the "required" effect of the + (plus) operator been eliminated by > the OR operator, so that nevermin

Unary Operators and Operator Precedence

2010-01-19 Thread T. R. Halvorson
Here are some questions about unary operators and operator precedence or default order of operation. We all know the importance of order of operation of binary operators (ones that operate on two operands) such as AND and OR. We know how to impose express order of operation by grouping and nes

Re: Indexing and Searching linked files

2010-01-19 Thread Erick Erickson
What's a reasonable upper limit on the number of files? Because I think it would be simpler, at least to start, to allow your field to be larger (say, 1B tokens, 1,000 files of 1M tokens each), but restrict the input of each file to 1M tokens per file. The most elegant way would probably be to subc

Re: Indexing and Searching linked files

2010-01-19 Thread Danil ŢORIN
You can simply index both "files" and "cards" into the same index (no need for 2 indexes). Lucene easily supports documents of different structure. You may add some boosting per field or document, and tune similarity to get the most important stuff on top. On Tue, Jan 19, 2010 at 16:35, Anna Hunecke wro

Proximity of More than Single Words?

2010-01-19 Thread T. R. Halvorson
For proximity expressions, the query parser documentation says, "use the tilde, "~", symbol at the end of a Phrase." It gives the example "jakarta apache"~10 Does this mean that proximity can only be operated on single words enquoted in quotation marks? To clarify the question by comparison,
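
To make the "jakarta apache"~10 example concrete, here is a rough sketch of what a two-word proximity test means. It is a simplification and not Lucene's implementation: it ignores word order and counts only the words sitting between the two terms, whereas Lucene's slop is defined in terms of position moves. The function name and the sample sentence are made up for illustration.

```python
def within_proximity(tokens, first, second, slop):
    """Return True if `first` and `second` occur with at most `slop`
    other tokens between them (order ignored; simplified semantics)."""
    first_positions = [i for i, tok in enumerate(tokens) if tok == first]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second]
    return any(
        a != b and abs(a - b) - 1 <= slop
        for a in first_positions
        for b in second_positions
    )


tokens = "the jakarta project hosts apache lucene".split()

# Two words ("project", "hosts") lie between "jakarta" and "apache".
print(within_proximity(tokens, "jakarta", "apache", 10))
print(within_proximity(tokens, "jakarta", "apache", 1))
```

With a slop of 10 the pair matches; with a slop of 1 it does not, because two words separate the terms.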

Re: Indexing and Searching linked files

2010-01-19 Thread Anna Hunecke
The field size is restricted to 1 million tokens, because of the very reasons you mentioned. So, even if I have one separate field for the content of a file, I might reach the limit if the file is really big. But I can't help that. What I want to avoid is that the whole content of some files can

Re: Indexing and Searching linked files

2010-01-19 Thread Erick Erickson
What field size limit are you talking about here? Because 10,000 tokens is the default, but you can increase it to Integer.MAX_VALUE. So are you really talking billions of tokens here? Your index quickly becomes unmanageable if you're allowing it to grow by such increments. One can argue, IMO, th
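
The effect of a per-field token limit can be pictured with a small sketch. It only models the behaviour described above (tokens beyond the limit are dropped, so they cannot be found later); it does not call Lucene, and the whitespace tokenizer and function names are assumptions made for the example. The default of 10,000 is the figure quoted in the message.

```python
import sys

DEFAULT_MAX_FIELD_LENGTH = 10_000  # default quoted in the message above


def index_field(text, max_field_length=DEFAULT_MAX_FIELD_LENGTH):
    """Tokenize on whitespace and keep only the first max_field_length tokens."""
    return text.split()[:max_field_length]


def is_searchable(indexed_tokens, term):
    """A term can only be found if it survived the truncation."""
    return term in indexed_tokens


# A long field whose last token falls beyond the default limit.
text = "filler " * 25_000 + "needle"

limited = index_field(text)
unlimited = index_field(text, max_field_length=sys.maxsize)

print(len(limited), is_searchable(limited, "needle"))
print(len(unlimited), is_searchable(unlimited, "needle"))
```

Raising the limit makes the tail of the field searchable again, at the cost of a larger index, which is the trade-off the message points at.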

Re: PhraseQuery with term positions

2010-01-19 Thread Avi Rosenschein
The index is pretty large (50GB, divided into 8 shards). I'm afraid I would start running into memory issues by adding the stop words (though it is definitely something I would like to test at some point). My question was more to try to understand if this was known behavior in Lucene, since I can't re

Re: PhraseQuery with term positions

2010-01-19 Thread Erick Erickson
How big is your index? Because the simplest thing would be to just not remove stopwords at index or query time. Perhaps in a duplicate field depending upon your needs. Erick On Tue, Jan 19, 2010 at 6:50 AM, Avi Rosenschein wrote: > Hi, > > I am using PhraseQuery with explicitly set term position

Indexing and Searching linked files

2010-01-19 Thread Anna Hunecke
Hi! I have been working with Lucene for a while now. So far, I found helpful tips on this list, so I hope somebody can help me with my problem: In our app information is grouped in so-called cards. Now, it should be made possible to also search on files linked to the cards. You can link arbitrar

PhraseQuery with term positions

2010-01-19 Thread Avi Rosenschein
Hi, I am using PhraseQuery with explicitly set term positions and slop=0, in order to skip stop words. The field in my index is indexed with TermVector positions. When I do a query with stop words skipped, for example "internet for research" (translated into PhraseQuery: "internet ? research"), I
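
The query shape described above, a phrase with explicit term positions and slop=0 that leaves a hole where the stop word was, can be sketched as follows. This is an illustrative model over a toy positional index, not Lucene's PhraseQuery; the function names and sample documents are hypothetical, and it does not attempt to reproduce the behaviour the message goes on to ask about.

```python
from collections import defaultdict


def build_positions(tokens):
    """Map each term to the set of positions at which it occurs."""
    positions = defaultdict(set)
    for i, tok in enumerate(tokens):
        positions[tok].add(i)
    return positions


def phrase_match(positions, phrase_terms):
    """Exact (slop = 0) phrase match with explicit relative positions.

    phrase_terms is a list of (term, relative_position) pairs, e.g.
    [("internet", 0), ("research", 2)] for "internet ? research".
    """
    (first_term, first_offset), rest = phrase_terms[0], phrase_terms[1:]
    for start in positions.get(first_term, ()):
        base = start - first_offset
        if all((base + offset) in positions.get(term, ()) for term, offset in rest):
            return True
    return False


query = [("internet", 0), ("research", 2)]  # position 1 is left open

doc_with_gap = build_positions("internet for research".split())
doc_adjacent = build_positions("internet research tools".split())

print(phrase_match(doc_with_gap, query))
print(phrase_match(doc_adjacent, query))
```

Because position 1 is unconstrained, any single word in that slot matches, while a document where the two terms are adjacent does not.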

Re: incremental document field update

2010-01-19 Thread Michael McCandless
On Tue, Jan 19, 2010 at 1:32 AM, Babak Farhang wrote: >> This is about multiple sessions with the writer.  Ie, open writer, >> update a few docs, close.  Do the same again, but, that 2nd session >> cannot overwrite the same files from the first one, since readers may >> have those files open.  The