Re: i18n numbers

2009-03-26 Thread Marcel Overdijk
That would make sense, yes. But the problem is that I have a general query field. I don't know whether the user entered a String or a number, or what he meant... Is 2008 a year (number) or part of an address String, e.g. keeping the address? Or maybe he's combining stuff like "Potter 19,99". Robert Muir wrote:

Re: MergePolicy public but SegmentInfos package protected?

2009-03-26 Thread Marvin Humphrey
On Thu, Mar 26, 2009 at 07:06:26AM -0400, Michael McCandless wrote: > We'd need to add a few methods to IndexReader, Eep. IndexReader's too big as it is. > eg querying whether > compound file format is in use, whether separate norms are stored, > "get me total size in bytes of all files" (or

Re: i18n numbers

2009-03-26 Thread Robert Muir
Marcel, I'd suggest parsing/displaying numbers in a locale-sensitive way with NumberFormat (be sure to supply the correct locale)... and keeping them in the index in one consistent way (i.e. 19.99). On Thu, Mar 26, 2009 at 6:03 PM, Marcel Overdijk wrote: > > Thanks for your reply. > > It's indeed a webap
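A minimal sketch of the suggestion above, assuming the index stores prices with a dot as the decimal separator and two decimals. The class and method names (LocalePrice, toIndexForm, toDisplayForm) are hypothetical and are not part of Lucene or Hibernate Search; only java.text.NumberFormat comes from the message.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.text.ParsePosition;
import java.util.Locale;

public class LocalePrice {

    // Parse what the user typed using the user's locale, then render it
    // in one fixed form for the index (assumed here: "19.99").
    static String toIndexForm(String userInput, Locale userLocale) throws ParseException {
        String text = userInput.trim();
        NumberFormat parser = NumberFormat.getNumberInstance(userLocale);
        ParsePosition pos = new ParsePosition(0);
        Number parsed = parser.parse(text, pos);
        // Reject input that is not a number from start to end.
        if (parsed == null || pos.getIndex() != text.length()) {
            throw new ParseException("Not a number: " + userInput, pos.getIndex());
        }
        NumberFormat canonical = NumberFormat.getNumberInstance(Locale.US);
        canonical.setGroupingUsed(false);
        canonical.setMinimumFractionDigits(2);
        canonical.setMaximumFractionDigits(2);
        return canonical.format(parsed.doubleValue());
    }

    // Render an indexed value back in the user's locale for display.
    static String toDisplayForm(String indexedValue, Locale userLocale) {
        NumberFormat display = NumberFormat.getNumberInstance(userLocale);
        display.setMinimumFractionDigits(2);
        display.setMaximumFractionDigits(2);
        return display.format(Double.parseDouble(indexedValue));
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(toIndexForm("19,99", Locale.GERMANY));
        System.out.println(toDisplayForm("19.99", Locale.GERMANY));
    }
}
```

The same conversion would have to be applied both when indexing and when parsing the user's query, otherwise the two sides would not agree on the stored form.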

Re: Syncing lucene index with a database

2009-03-26 Thread Tim Williams
On Thu, Mar 26, 2009 at 6:28 PM, Matt Schraeder wrote: > I'm new to Lucene and just beginning my project of adding it to our web > app.  We are indexing data from a MS SQL 2000 database and building > full-text search from it. > > Everything I have read says that building the index is a resource h

Re: i18n numbers

2009-03-26 Thread Chris Lu
Marcel, First of all, do you really want the user to search price:19.99? Maybe you should use some logic like price>=19.99? If so, you should use a range query to handle this case. -- Chris Lu - Instant Scalable Full-Text Search On Any Database/Application site: http://www
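One caveat for the range-query idea: as far as I know, term ranges in the Lucene 2.4 line compare terms as strings, so "100.00" would sort before "19.99". A common workaround is to index a fixed-width, zero-padded form. The sketch below is an assumption about how that could look; SortablePrice and encode are hypothetical names, and the width of 10 digits is arbitrary.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Locale;

public class SortablePrice {

    // Turn a canonical price such as "19.99" into a zero-padded count of
    // cents, so that string order matches numeric order.
    static String encode(String canonicalPrice) {
        long cents = new BigDecimal(canonicalPrice)
                .setScale(2, RoundingMode.HALF_UP)
                .movePointRight(2)
                .longValueExact();
        if (cents < 0) {
            throw new IllegalArgumentException("Negative prices are not handled in this sketch");
        }
        return String.format(Locale.US, "%010d", cents);
    }

    public static void main(String[] args) {
        System.out.println(encode("19.99"));
        System.out.println(encode("100.00"));
    }
}
```

A query such as price>=19.99 would then use encode("19.99") as the lower bound of the range.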

Re: Syncing lucene index with a database

2009-03-26 Thread Chris Lu
There are many things you need to synchronize with the database. Besides just changed fields, you may need to deal with deleted database records, etc. In general, it's not efficient to pull over data that's changing often and may not have much effect on search. It'll overload Lucene unnecessarily

Re: i18n numbers

2009-03-26 Thread Marcel Overdijk
Thanks for your reply. It's indeed a webapp with an HTML front end. I agree letting the end user enter a Lucene query might not be what you want. Probably I will be using an "all" index which indexes all fields of my entity. So in the book example including book title, isbn, price, author.firstname, aut

Re: i18n numbers

2009-03-26 Thread Erick Erickson
What does the front end look like? Is it a web page or a custom app? And do you expect your users to actually enter the field name? I'd be reluctant to allow any but the geekiest of users to enter the Lucene syntax (i.e. the field names). Users shouldn't know anything about the underlying structure

Re: Syncing lucene index with a database

2009-03-26 Thread Erick Erickson
You've got a great grasp of the issues, comments below. But before you do, a lot of this kind of thing is incorporated in SOLR, which is built on Lucene. Particularly updating an index then using it. So you might take a look over there. It even has a DataImportHandler... WARNING: I've only been mo

i18n numbers

2009-03-26 Thread Marcel Overdijk
First of all, I'm new to Lucene. I'm experimenting right now with it in combination with Hibernate Search. What I'm wondering is if I can index numbers related to i18n. E.g. I have a Book entity with a price attribute. A book with a price of 19.99 can be found while searching for price:19.99.

Syncing lucene index with a database

2009-03-26 Thread Matt Schraeder
I'm new to Lucene and just beginning my project of adding it to our web app. We are indexing data from a MS SQL 2000 database and building full-text search from it. Everything I have read says that building the index is a resource heavy operation so we should use it sparingly. For the most part

Re: Deadlock with concurrent merges and IndexWriter [Lucene 2.4]

2009-03-26 Thread Michael McCandless
OK I opened LUCENE-1573 for this. Mike On Thu, Mar 26, 2009 at 8:48 AM, Jeremy Volkman wrote: > The indexer thread was part of a worker pool. I "stopped" the pool which > interrupted all of the worker threads. So, the interruption came from my > code. > > I didn't notice whether one CPU was pegg

Re: Assertion Error in TermsHashPerField.comparePostings - Lucene 2.4

2009-03-26 Thread Michael McCandless
Another thing is to limit the max # merge threads CMS will run at once. It defaults to 3 now. Mike On Thu, Mar 26, 2009 at 2:08 PM, Jason Rutherglen wrote: > I used the NoMergePolicy to build the index as I noticed the indexing is > faster, meaning the system simply creates large multi-megabyte

Re: Assertion Error in TermsHashPerField.comparePostings - Lucene 2.4

2009-03-26 Thread Jason Rutherglen
I used the NoMergePolicy to build the index as I noticed the indexing is faster, meaning the system simply creates large multi-megabyte segments in the RAM buffer, flushes them out and doesn't worry about merging, which causes massive disk thrashing. I am pondering some benchmarks to find the optima

Re: question about grouping text

2009-03-26 Thread Otis Gospodnetic
Hi, I'm not aware of anything in LingPipe that would do the Q&A part, though LP (and GATE) may have the building blocks for what you need. For example, they both must have sentence boundary detection/sentence chunking, which might be one of the first sub-tasks you'd need to do to begin findin

Re: Lucene index architecture question

2009-03-26 Thread Mindaugas Žakšauskas
I don't think you can write to the same index (file) from multiple locations at the same time and expect predictable behaviour. Aficionados will correct me if I'm wrong, but I think a pessimistic-locking file system (think NTFS) would simply not allow this, an optimistic-locking one (think ext3) would resu

Re: Memory Leak?

2009-03-26 Thread Michael McCandless
OK, thanks for bringing closure. Mike On Thu, Mar 26, 2009 at 8:37 AM, Chetan Shah wrote: > > Ok. I was able to conclude that I am getting OOME due to my usage of HTML > Parser to get the HTML title and HTML text. I display 10 results per page > and therefore end up calling the org.apache.luc

Re: Lucene index architecture question

2009-03-26 Thread kgeeva
Thank you guys for the reply. Solr seems to be a good solution for distributed indexes but the app is already written with a Lucene index. So I had a question on Ian's answer as to going for 2 indexes. My app is on a weblogic cluster with two servers. The app is installed on both the servers. Wha

Re: Deadlock with concurrent merges and IndexWriter [Lucene 2.4]

2009-03-26 Thread Jeremy Volkman
The indexer thread was part of a worker pool. I "stopped" the pool which interrupted all of the worker threads. So, the interruption came from my code. I didn't notice whether one CPU was pegged, however I did take a series of JVM stack dumps and each one showed the finishMerges thread in the RUNN

Re: Memory Leak?

2009-03-26 Thread Chetan Shah
Ok. I was able to conclude that I am getting OOME due to my usage of HTML Parser to get the HTML title and HTML text. I display 10 results per page and therefore end up calling the org.apache.lucene.demo.html.HTMLParser 10 times. I modified my code to store the title and html summary in the

Re: Deadlock with concurrent merges and IndexWriter [Lucene 2.4]

2009-03-26 Thread Michael McCandless
OK I like this theory, and I think it can cause a spin loop in doWait (do you see one CPU pegged?), and starvation in the merging thread. Do you know who called Thread.interrupt() in your case? Does your code do that explicitly somewhere? IndexWriter is not doing the right thing when the thread
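To illustrate the theory above in isolation: a thread whose interrupt flag is set never blocks in Object.wait(), and if the handler restores the flag, every later wait() in the loop returns at once as well. The sketch below is a generic demonstration of that behaviour and is not IndexWriter's actual code; InterruptSpinDemo and countImmediateWakeups are hypothetical names.

```java
public class InterruptSpinDemo {

    // Calls wait(1000) the given number of times on a thread whose interrupt
    // flag is set, and counts how often wait() threw instead of blocking.
    static int countImmediateWakeups(int attempts) {
        Object lock = new Object();
        int immediate = 0;
        Thread.currentThread().interrupt(); // simulate an external interrupt
        synchronized (lock) {
            for (int i = 0; i < attempts; i++) {
                try {
                    lock.wait(1000); // would normally block for up to a second
                } catch (InterruptedException e) {
                    immediate++;
                    // Restoring the flag is usually polite, but inside a retry
                    // loop it makes the next wait() throw immediately too.
                    Thread.currentThread().interrupt();
                }
            }
        }
        Thread.interrupted(); // clear the flag so the caller is unaffected
        return immediate;
    }

    public static void main(String[] args) {
        System.out.println(countImmediateWakeups(5));
    }
}
```

A loop like this keeps re-acquiring the monitor without ever sleeping, which would be consistent with both a pegged CPU and a starved merge thread.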

Re: Deadlock with concurrent merges and IndexWriter [Lucene 2.4]

2009-03-26 Thread Jeremy Volkman
Hi Michael, I originally wasn't thinking correctly about the doWait() method releasing the monitor. I was thinking of it more as a sleep method instead (which would not release the monitor). Regardless, I think I've pinpointed the problem. In my stacktrace, "Indexing Thread" had been interrupt

Re: MergePolicy public but SegmentInfos package protected?

2009-03-26 Thread Michael McCandless
Marvin Humphrey wrote: > On Wed, Mar 25, 2009 at 06:15:35AM -0400, Michael McCandless wrote: > >> I'm torn.  MergePolicy (and MergeScheduler) are "expected" to be >> something expert users could alter; their API is designed to be >> exposed & stable.  I think they should be visible in the javadocs

Re: Deadlock with concurrent merges and IndexWriter [Lucene 2.4]

2009-03-26 Thread Michael McCandless
Are there any other threads running? Can you post their stack traces too? Are you sure nothing is happening? E.g., if you look in the index, do you see files slowly increasing in size (indicating there is a merge running)? These two traces are actually normal. The ArticleIngestor thread is tryin

Re: question about grouping text

2009-03-26 Thread Amin Mohammed-Coleman
Hi, I was wondering if something like LingPipe or GATE (for text extraction) might be an idea? I've started looking at it and I'm just thinking it may be applicable (I may be wrong). Cheers Amin On Wed, Mar 25, 2009 at 4:18 PM, Grant Ingersoll wrote: > Hi MFM, > > This comes down to a preprocess