Re: Build RAMDirectory on FSDirectory, and then synchronzing the two

2012-01-11 Thread dyzc
The cause is that my app adds indexes to those in RAM rather than updating them, so the size doubled. It seems unrelated to the OpenMode.CREATE option. -- Original -- From: "Ian Lea"; Date: Wed, Jan 11, 2012 05:20 PM To: "java-user"; Subject: Re: Build RAMDire

Re: is it possible to index wiki markup files?

2012-01-11 Thread Reyna Melara
Thanks to all who have replied to my question. Regards, Reyna 2012/1/11 Michael Wechner > Maybe Tika is also of help to you > > http://tika.apache.org/ > > HTH > > Michael > > On 11.01.12 20:13, Reyna Melara wrote: > >> Hi, my name is Reyna Melara. I'm a PhD student from Mexico, an

Re: Unsubscribe failure

2012-01-11 Thread Erick Erickson
If your e-mail client sends things in anything but plain text, you might try switching the format to plain text. I've had the spam filter reject formatted e-mail before... May not be relevant, but it's worth a try. Best Erick On Wed, Jan 11, 2012 at 12:44 PM, Bennett, Tony wrote: > I tried to u

Is it necessary to create a new searcher?

2012-01-11 Thread Cheng
I am currently using the following statement at the end of each index write, although I don't know whether the write modifies the index or not: is = new IndexSearcher(IndexReader.openIfChanged(ir)); # is -> IndexSearcher, ir-> IndexReader My question is how expensive it is to create a searcher insta

Re: is it possible to index wiki markup files?

2012-01-11 Thread Michael Wechner
Maybe Tika is also of help to you http://tika.apache.org/ HTH Michael On 11.01.12 20:13, Reyna Melara wrote: Hi, my name is Reyna Melara. I'm a PhD student from Mexico, and I have a set of 11,051,447 files with a txt extension, but the content of each file is in fact in wiki format. I want and

Re: Seem contradictive -- indexwriter in handling multiple threads

2012-01-11 Thread Cheng
Will do, thanks. On Wed, Jan 11, 2012 at 3:37 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > Yes, it's best to share one IndexSearcher/IndexReader across all > threads... and if you ever find evidence this hurts concurrency then > please post back :) > > Mike McCandless > > http://blo

Re: Seem contradictive -- indexwriter in handling multiple threads

2012-01-11 Thread Michael McCandless
Yes, it's best to share one IndexSearcher/IndexReader across all threads... and if you ever find evidence this hurts concurrency then please post back :) Mike McCandless http://blog.mikemccandless.com On Wed, Jan 11, 2012 at 3:29 PM, Cheng wrote: > Will do if I see a perf gain. > > The other is

Re: Seem contradictive -- indexwriter in handling multiple threads

2012-01-11 Thread Cheng
Will do if I see a perf gain. The other issue is that in each thread my app will do not only indexing but also searching. That means I will have to pass the RAMDirectory instance, along with the writer instance, to every thread so that the searcher can be built on it. Should I create a same rea

Re: Seem contradictive -- indexwriter in handling multiple threads

2012-01-11 Thread Michael McCandless
Yes that would work fine but you should see a net perf loss by doing so (once you include time to flush/sync the RAMDir to an FSDir). If you see a perf gain then please report back! Mike McCandless http://blog.mikemccandless.com On Wed, Jan 11, 2012 at 3:09 PM, Cheng wrote: > Can I create

Re: Seem contradictive -- indexwriter in handling multiple threads

2012-01-11 Thread Cheng
Can I create a RAMDirectory-based writer and have it work across all threads? That is, I would like to use RAMDirectory everywhere and have the RAMDirectory written to FSDirectory in the end. I suppose that should work, right? On Wed, Jan 11, 2012 at 2:31 PM, Michael McCandless < luc...@mik

RE: is it possible to index wiki markup files?

2012-01-11 Thread karl.wright
You might be interested in looking at ManifoldCF for getting your documents into Solr. See http://incubator.apache.org/connectors for more details. Karl -Original Message- From: ext Reyna Melara [mailto:reynamel...@gmail.com] Sent: Wednesday, January 11, 2012 2:13 PM To: java-user@luc

Re: Seem contradictive -- indexwriter in handling multiple threads

2012-01-11 Thread Michael McCandless
On Wed, Jan 11, 2012 at 1:32 PM, dyzc2010 wrote: > Mike, do you mean if I create a FSDirectory based writer in first place, then > the writer should be used in every thread rather than create a new > RAMDirectory based writer in that thread? Right. > What about I do want to use RAMDirectory t

Re: is it possible to index wiki markup files?

2012-01-11 Thread Ivan Brusic
Hi Reyna, I have never used it, but there is a WikipediaTokenizer defined in the analyzer contrib: http://lucene.apache.org/java/3_5_0/api/contrib-analyzers/org/apache/lucene/analysis/wikipedia/WikipediaTokenizer.html You can find a test case for this tokenizer in the source code. Hopefully othe

is it possible to index wiki markup files?

2012-01-11 Thread Reyna Melara
Hi, my name is Reyna Melara. I'm a PhD student from Mexico, and I have a set of 11,051,447 files with a txt extension, but the content of each file is in fact in wiki format. I want and I need them to be indexed, but I don't know if I have to convert this content to flat text, I have been reading and I

Re: Seem contradictive -- indexwriter in handling multiple threads

2012-01-11 Thread dyzc2010
Mike, do you mean that if I create an FSDirectory-based writer in the first place, that writer should be used in every thread rather than creating a new RAMDirectory-based writer in each thread? What if I do want to use RAMDirectory to speed up the indexing and search processes? --

Re: Seem contradictive -- indexwriter in handling multiple threads

2012-01-11 Thread Michael McCandless
You shouldn't have to write first to intermediate RAMDirectory instances anymore; just share a single IndexWriter instance across all of your threads. Mike McCandless http://blog.mikemccandless.com On Wed, Jan 11, 2012 at 12:19 PM, Cheng wrote: > I have read a lot about IndexWriter and multi-threadin

Unsubscribe failure

2012-01-11 Thread Bennett, Tony
I tried to unsubscribe from this list, without success. I sent an email to 'java-user-unsubscr...@lucene.apache.org', I received the "please confirm" response, requesting that I send an email to: java-user-uc.1326295748.hcdpoljefehgobokinbd-Bennett.Tony=con-way@lucene.apache.org I did so, a

Seem contradictive -- indexwriter in handling multiple threads

2012-01-11 Thread Cheng
I have read a lot about IndexWriter and multi-threading over the Internet. It seems to me that the normal practice is: 1) use the same IndexWriter instance for multiple threads; 2) create an individual RAMDirectory per thread; 3) use the addIndexes(Directory[]) method to add to a local drive folder al

Call for Submission Berlin Buzzwords 2012 - http://berlinbuzzwords.de

2012-01-11 Thread Simon Willnauer
Call for Submission Berlin Buzzwords 2012 - Search, Store, Scale  -- June 4 / 5. 2012 The event will comprise presentations on scalable data processing. We invite you to submit talks on the topics:  * IR / Search - Lucene, Solr, katta, ElasticSearch or comparable solutions  * NoSQL - like CouchDB,

Re: shared instance of IndexWriter doesn't improve proformance

2012-01-11 Thread Michael McCandless
I think it's hard to compare the results here? In test 1 (single IW shared across threads) you end up with one index. In test 2 (private IW per thread) you end up with N indexes, which to be "fair" need to be merged down into one index (eg with .addIndexes)? Or seen another way, test 1 should ha

Large data set or data corpus

2012-01-11 Thread findbestopensource
Hello all, Recently I saw a couple of discussions in a LinkedIn group about generating a large data set or data corpus. I have compiled them into an article. I hope it will be helpful. If you have any other links where we could get large data sets for free, please reply to this mail thread, I will up

Re: SIGSEGV when indexing documents.

2012-01-11 Thread Dawid Weiss
No clue, I am not a hardware expert. Removing memory extensions one by one (or binary-searching for the faulty one)? Dawid On Wed, Jan 11, 2012 at 10:47 AM, Frank Moss wrote: > The same with IBM J9. The dump file is attached. > > It seems to be HW related. Recently, we have added more RAM. We ac

Re: Build RAMDirectory on FSDirectory, and then synchronzing the two

2012-01-11 Thread Ian Lea
> I tried  IndexWriterConfig.OpenMode CREATE, and the size is doubled. Prove it. -- Ian. > The only way that is effective is the writer's deleteAll() methods. > > On Mon, Jan 9, 2012 at 5:23 AM, Ian Lea wrote: > >> If you load an existing disk index into a RAMDirectory, make some >> changes in

Re: shared instance of IndexWriter doesn't improve proformance

2012-01-11 Thread Ian Lea
Contention. There is always a limit somewhere, I/O, CPU, memory, locks, ... Use your OS tools or java profiling/logging/debugging to find out what is going on - or just go with what works for you. If you're doing something like loading data read from a database, it is my experience that the bott

Re: SIGSEGV when indexing documents.

2012-01-11 Thread Dawid Weiss
Oops, yes, sorry -- I only quickly looked at the invocation line on stack overflow and overlooked it. -Xms4g shouldn't make any difference. Dawid On Wed, Jan 11, 2012 at 10:02 AM, Frank Moss wrote: > 4gb is the initial heap size. Are you thinking about Xss?  I will try it > as well as the rest

Re: SIGSEGV when indexing documents.

2012-01-11 Thread Frank Moss
4gb is the initial heap size. Are you thinking about Xss? I will try it as well as the rest of your suggestions and post back the results. Thanks. On Wed, Jan 11, 2012 at 9:56 AM, Dawid Weiss wrote: > The dump you're getting indicates a SIGSEGV in a garbage collection. > This isn't unlikely

Re: SIGSEGV when indexing documents.

2012-01-11 Thread Dawid Weiss
The dump you're getting indicates a SIGSEGV in a garbage collection. This isn't unlikely (there are bugs in there as well), but less likely than a hardware error on your side... at least in my opinion. I would experiment with the following: 1) do you really need a 4gb max stack? Seems weird to me.