Re: MergePolicy public but SegmentInfos package protected?

2009-03-27 Thread Michael McCandless
On Fri, Mar 27, 2009 at 4:22 PM, Marvin Humphrey wrote: > I think the difference here is that Lucene gets to use multiple threads within > one process, while Lucy has to at least be capable of using a multiple-process > concurrency model in order to support real-time search for non-threaded hosts

Re: MergePolicy public but SegmentInfos package protected?

2009-03-27 Thread Marvin Humphrey
On Fri, Mar 27, 2009 at 03:59:09PM -0400, Michael McCandless wrote: > >> Why must merge policy be made public for realtime search? [In Lucy] > > > > Because real-time search under Lucy needs to be able to operate using > > multiple > > write processes, since threads will not always be available.

Re: MergePolicy public but SegmentInfos package protected?

2009-03-27 Thread Michael McCandless
On Fri, Mar 27, 2009 at 1:12 PM, Marvin Humphrey wrote: >> Why must merge policy be made public for realtime search? [In Lucy] > > Because real-time search under Lucy needs to be able to operate using multiple > write processes, since threads will not always be available. > > You need to be able

Re: Encoding detection free software?

2009-03-27 Thread Robert Muir
i've been using the one in icu for some time... http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html On Fri, Mar 27, 2009 at 2:57 PM, Zhang, Lisheng < lisheng.zh...@broadvision.com> wrote: > Hi, > > What's the best free tool for encoding detection? For example we have > a AS

Encoding detection free software?

2009-03-27 Thread Zhang, Lisheng
Hi, What's the best free tool for encoding detection? For example we have an ASCII file README.txt, which needs to be indexed, but we need to know its encoding before we can convert it to a Java String. I saw some free tools on the market, but have no experience with any of them yet. What is the be
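
For the README.txt case described in this message, one low-tech option that needs no extra library is to try a few candidate charsets with a strict decoder and keep the first one that decodes without error. This is only a sketch, and it is not the statistical detection that a dedicated detector library performs; the class name, the method name and the candidate list are illustrative assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeGuess {

    // Assumed candidate list. Order matters: ISO-8859-1 accepts every byte
    // sequence, so it must come last and acts as the fallback.
    static final Charset[] CANDIDATES = {
        StandardCharsets.US_ASCII,
        StandardCharsets.UTF_8,
        StandardCharsets.ISO_8859_1
    };

    /** Returns the first candidate charset that decodes the bytes without error. */
    static Charset guess(byte[] bytes) {
        for (Charset cs : CANDIDATES) {
            // REPORT makes the decoder throw instead of silently substituting U+FFFD.
            CharsetDecoder decoder = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(bytes));
                return cs;
            } catch (CharacterCodingException e) {
                // Not valid in this charset; try the next candidate.
            }
        }
        // Not reached with the list above, because ISO-8859-1 never fails.
        return StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        byte[] sample = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        Charset cs = guess(sample);
        // Once a charset is chosen, the bytes can be turned into a Java String.
        String text = new String(sample, cs);
        System.out.println(cs.name() + ", " + text.length() + " chars");
    }
}
```

A strict-decode check can only rule charsets out. It cannot tell two single-byte encodings apart, which is where a real detector earns its keep.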

Re: Free software for language detection

2009-03-27 Thread Boris Aleksandrovsky
Lisheng, You might want to look at the Nutch LanguageID plugin (http://wiki.apache.org/nutch/LanguageIdentifier) too. Cheers, Boris On Fri, Mar 27, 2009 at 10:22 AM, Zhang, Lisheng wrote: > Thanks very much! > > -Original Message- > From: jochen.sc...@gmail.com [mailto:jochen.sc...@gmai

RE: Free software for language detection

2009-03-27 Thread Zhang, Lisheng
Thanks very much! -Original Message- From: jochen.sc...@gmail.com [mailto:jochen.sc...@gmail.com]on Behalf Of Jochen Frey Sent: Friday, March 27, 2009 10:04 AM To: java-user@lucene.apache.org Subject: Re: Free software for language detection Lisheng, Here's a package you could take a lo

Re: MergePolicy public but SegmentInfos package protected?

2009-03-27 Thread Marvin Humphrey
On Fri, Mar 27, 2009 at 12:39:05PM -0400, Michael McCandless wrote: > Why must merge policy be made public for realtime search? [In Lucy] Because real-time search under Lucy needs to be able to operate using multiple write processes, since threads will not always be available. You need to be abl

Re: Free software for language detection

2009-03-27 Thread Jochen Frey
Lisheng, Here's a package you could take a look at. I have used it in the past and it worked reasonably well. Let me know what else you find and how it works for you. http://www.olivo.net/software/lc4j/ Good luck! Jochen Frey On Fri, Mar 27, 2009 at 9:54 AM, Zhang, Lisheng < lisheng.zh...@broad

Free software for language detection

2009-03-27 Thread Zhang, Lisheng
Hi, Are you aware of any free software for language detection (given certain text, see if it is French, or Japanese)? I saw Bob Carpenter's previous mail which explained the principle nicely, but could not locate any free tools. Thanks very much for any help, Lisheng

Re: MergePolicy public but SegmentInfos package protected?

2009-03-27 Thread Michael McCandless
On Fri, Mar 27, 2009 at 12:13 PM, Marvin Humphrey wrote: >> Whereas in Lucene neither MultiSegmentReader nor SegmentReader is public. > > I had thought making SegmentReader public was at least under consideration as > part of the implementation for segment-centric sorted search, but I guess it >

Re: Syncing lucene index with a database

2009-03-27 Thread Matt Schraeder
Thanks a million. You really helped me get going in the right direction. I'm going to start building some test cases this afternoon so I can really begin to get my hands dirty and see how everything works. It's good to know that there isn't one "right" way for my purposes, which is mostly what I wanted

Re: MergePolicy public but SegmentInfos package protected?

2009-03-27 Thread Marvin Humphrey
On Fri, Mar 27, 2009 at 11:09:09AM -0400, Michael McCandless wrote: > Whereas in Lucene neither MultiSegmentReader nor SegmentReader is public. I had thought making SegmentReader public was at least under consideration as part of the implementation for segment-centric sorted search, but I guess i

Re: Unable to improve performance

2009-03-27 Thread Paul Taylor
Michael McCandless wrote: Are you opening your IndexReader with readOnly=true? If not, you're likely hitting contention on the "isDeleted" method. When you run with a "normal" directory, either on a traditional hard drive or SSD device, do you use NIOFSDirectory? That removes contention, but,

Re: Unable to improve performance

2009-03-27 Thread Simon Willnauer
The readOnly option was introduced in 2.4. From the javadoc: "...as of 2.4, it's possible to open a read-only IndexReader using one of the static open methods that accepts the boolean readOnly parameter." http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/index/IndexReader.html#open(org.apache

Re: Unable to improve performance

2009-03-27 Thread Michael McCandless
Alas, it's new as of 2.4. Can you upgrade? Mike On Fri, Mar 27, 2009 at 11:55 AM, wrote: >> > How can I open it "readonly"? >> >> See the javadocs for IndexReader. > > I did it already for 2.3 - cannot find readonly > > > - >

RE: Unable to improve performance

2009-03-27 Thread spring
> > How can I open it "readonly"? > > See the javadocs for IndexReader. I did it already for 2.3 - cannot find readonly - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-use

Re: Unable to improve performance

2009-03-27 Thread Ian Lea
>> Are you opening your IndexReader with readOnly=true? If not, you're >> likely hitting contention on the "isDeleted" method. > > How can I open it "readonly"? See the javadocs for IndexReader. -- Ian.

RE: Unable to improve performance

2009-03-27 Thread spring
> Are you opening your IndexReader with readOnly=true? If not, you're > likely hitting contention on the "isDeleted" method. How can I open it "readonly"?

Re: Syncing lucene index with a database

2009-03-27 Thread Erick Erickson
Yes, updating a document in Lucene is "expensive" for two reasons: 1> deleting and adding a document does mean there's internal work being done. But it's not all *that* expensive. So this really comes down to how many records you expect to update every 15 minutes. You've gotta try it. 2

Re: MergePolicy public but SegmentInfos package protected?

2009-03-27 Thread Michael McCandless
On Fri, Mar 27, 2009 at 9:48 AM, Marvin Humphrey wrote: > Every indexer opens a PolyReader [analogous to MultiSegmentReader], even when > there's no data in the index, or a single segment. (I modified PolyReader so > that 1-segment and 0-segment states were officially valid for this purpose.) OK

Re: Syncing lucene index with a database

2009-03-27 Thread Matt Schraeder
I'm going to try and cover all replies so far, but for the most part this first one since it had the most help so far. Thanks to everyone who replied so far, you've given me a lot of great ideas to think about and look into. I'm going to begin some small test indexes with our data so we have somet

Re: i18n numbers

2009-03-27 Thread Robert Muir
this is really no problem at all... use RBBI to identify runs of numbers in your query string, and then replace them with the normalized version. you will need icu jar for this. String userQuery = "Potter 19,99"; Locale locale = new Locale("nl"); RuleBasedBreakIterator bi = (RuleBa

Re: MergePolicy public but SegmentInfos package protected?

2009-03-27 Thread Marvin Humphrey
On Fri, Mar 27, 2009 at 08:08:20AM -0400, Michael McCandless wrote: > Actually, how will Lucy do this? Every indexer opens a PolyReader [analogous to MultiSegmentReader], even when there's no data in the index, or a single segment. (I modified PolyReader so that 1-segment and 0-segment states wer

Re: MergePolicy public but SegmentInfos package protected?

2009-03-27 Thread Michael McCandless
Actually, how will Lucy do this? Even though Lucy's SegmentReader is lighter weight, it still seems like you shouldn't be opening them in the writer (except for realtime search)? What's your plan? Are you going to simply make the segment metadata public? Mike On Thu, Mar 26, 2009 at 9:51 PM, M

Re: Unable to improve performance

2009-03-27 Thread Michael McCandless
Also, see here for other ideas that may help: http://wiki.apache.org/lucene-java/ImproveSearchingSpeed I just updated that page with readOnly IndexReader & NIOFSDirectory. Mike On Fri, Mar 27, 2009 at 7:07 AM, Paul Taylor wrote: > Hi > > I am trying to run the performance tests against luc

Re: Unable to improve performance

2009-03-27 Thread Michael McCandless
Are you opening your IndexReader with readOnly=true? If not, you're likely hitting contention on the "isDeleted" method. When you run with a "normal" directory, either on a traditional hard drive or SSD device, do you use NIOFSDirectory? That removes contention, but, it only works on non-Windows

Unable to improve performance

2009-03-27 Thread Paul Taylor
Hi I am trying to run the performance tests against Lucene, and am surprised about the results. I have a test that creates a queue of queries, and a number of threads. The threads run concurrently getting the next query available, performing a query on the index and taking the top hits. The in
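
The test setup described in this message (a shared queue of queries drained by a fixed number of threads) might look roughly like the sketch below. It is not the poster's code: the class and method names are made up for illustration, and the `search` function is a hypothetical stand-in for running a query against the index and taking the top hits.

```java
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class QueryQueueBenchmark {

    /** Drains the queries with the given number of threads; returns how many were executed. */
    static int run(List<String> queries, int threads, Function<String, Integer> search) {
        Queue<String> queue = new ConcurrentLinkedQueue<>(queries);
        AtomicInteger executed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                String q;
                // Each worker takes the next available query until the queue is empty.
                while ((q = queue.poll()) != null) {
                    search.apply(q); // hypothetical: run the query, take the top hits
                    executed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("benchmark did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("benchmark interrupted", e);
        }
        return executed.get();
    }

    public static void main(String[] args) {
        List<String> queries = Collections.nCopies(10_000, "potter");
        for (int threads : new int[] {1, 2, 4}) {
            long start = System.nanoTime();
            int n = run(queries, threads, String::length); // stand-in for a real search
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threads + " threads: " + n + " queries in " + millis + " ms");
        }
    }
}
```

If throughput stays flat as the thread count grows with a harness like this, the search call itself is probably serializing on a shared lock, which is what the replies in this thread go on to suggest.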

Re: Scores between words. Boosting?

2009-03-27 Thread liat oren
Yes, I have scoring, I will start with the 3 that are the closest. By the way, I saw that Lucene also has Synonyms. Any reason you suggested Solr? Thanks, Liat 2009/3/24 Grant Ingersoll > Do you have any info that helps you narrow down how many to choose, like > some type of ranking of the syno

Re: MergePolicy public but SegmentInfos package protected?

2009-03-27 Thread Michael McCandless
On Thu, Mar 26, 2009 at 9:51 PM, Marvin Humphrey wrote: >> eg querying whether >> compound file format is in use, whether separate norms are stored, >> "get me total size in bytes of all files" (or maybe just "get me all >> files", plus utility method somewhere to add up the sizes), so this >> ap

RE: i18n numbers

2009-03-27 Thread Marcel Overdijk
That's an interesting idea yes. Thanks! Daan de Wit wrote: > > Maybe you can create a filter that parses numeric tokens to their > locale-specific counterpart, and then search for both the converted and > the unconverted token. > > Daan > >> -Original Message- >> From: Marcel Overdij

Re: Syncing lucene index with a database

2009-03-27 Thread Amin Mohammed-Coleman
Hi I was going to suggest looking at Hibernate Search. It comes with event listeners that modify your indexes when the persistent entity changes. It uses Lucene under the hood, so if you need to access Lucene then you can. Indexing can be done sync or async and the documentation shows how to

RE: i18n numbers

2009-03-27 Thread Daan de Wit
Maybe you can create a filter that parses numeric tokens to their locale-specific counterpart, and then search for both the converted and the unconverted token. Daan > -Original Message- > From: Marcel Overdijk [mailto:marceloverd...@gmail.com] > Sent: vrijdag 27 maart 2009 7:55 > To: jav
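
The idea in this message (emit both the token as typed and a converted form for numeric tokens) can be sketched with the JDK alone. This is an assumption-laden illustration rather than a Lucene TokenFilter: the class and method names are hypothetical, and a plain-decimal string is chosen here as the converted form.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class LocaleNumberTokens {

    /**
     * Returns the token as typed, plus a locale-neutral form when the whole
     * token parses as a number in the given locale.
     */
    static List<String> expand(String token, Locale locale) {
        List<String> out = new ArrayList<>();
        out.add(token);
        NumberFormat nf = NumberFormat.getInstance(locale);
        if (nf instanceof DecimalFormat) {
            DecimalFormat df = (DecimalFormat) nf;
            df.setParseBigDecimal(true); // avoid double rounding of values like 19,99
            ParsePosition pos = new ParsePosition(0);
            Number n = df.parse(token, pos);
            // Accept only if the parser consumed the entire token.
            if (n instanceof BigDecimal && pos.getIndex() == token.length()) {
                String normalized = ((BigDecimal) n).toPlainString();
                if (!normalized.equals(token)) {
                    out.add(normalized);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Locale dutch = Locale.forLanguageTag("nl");
        for (String token : "Potter 19,99".split("\\s+")) {
            System.out.println(token + " -> " + expand(token, dutch));
        }
    }
}
```

Searching for every form returned by `expand` keeps queries working whether the index holds the number as the user typed it or in its normalized form.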