Query Regarding Parent Query

2014-02-20 Thread Priyanka Tufchi
Hi all, I have been experimenting with parent-child relation code in Apache Lucene using ParentBlockJoinQuery. Can anyone explain what will happen if I don't add the ParentQuery in indexSearcher and simply search by childQuery? I tried it using two docs and it gives an equal score to both docs. Than

RE: [Suggestions Required] 110 Concurrency users indexing on Lucene dont finish in 200 ms.

2014-02-20 Thread sree
A correction on the system details posted earlier... Windows runs on: 1 CPU, 1 core, 8 GB RAM. Linux runs on: 1 CPU, 2 cores, 8 GB RAM.

RE: [Suggestions Required] 110 Concurrency users indexing on Lucene dont finish in 200 ms.

2014-02-20 Thread Umashanker, Srividhya
Mike, more info: Windows on average takes 192 ms for 1 thread to index 100 JSON documents. Linux on average takes 711 ms for 1 thread to index 100 JSON documents (same set of data). We have set the heap size to 124 MB in both cases and run on JDK 7. Windows runs on: 2 CPU,

Re: Limiting the fields a user can query on

2014-02-20 Thread Jamie Johnson
I would be fine with throwing a parse exception or excluding the particular clause. I will look at the StandardQueryNodeProcessorPipeline as well as Hoss' suggestion. Thank you very much! On Thu, Feb 20, 2014 at 4:20 AM, Trejkaz wrote: > On Thu, Feb 20, 2014 at 1:43 PM, Jamie Johnson wrote:

Re: Sending a document to IndexWriter field by field

2014-02-20 Thread Michael McCandless
Yes, all postings for the entire doc are held in RAM data structures ... you could make your own indexing chain to somehow change this behavior, but I don't think that's an easy task. Mike McCandless http://blog.mikemccandless.com On Thu, Feb 20, 2014 at 4:02 PM, Igor Shalyminov wrote: > Mike,

Re: Sending a document to IndexWriter field by field

2014-02-20 Thread Igor Shalyminov
Mike, thank you! So eventually this amount of data must stay entirely in RAM (as postings) before flushing to disk? Can it be hacked?) The documents themselves (that I will deliver to user) are of a regular size, but features that I generate grow combinatorially in size and blow the index up i

Re: Sending a document to IndexWriter field by field

2014-02-20 Thread Michael McCandless
Yes, in 4.x IndexWriter now takes an Iterable that enumerates the fields one at a time. You can also pass a Reader to a Field. That said, there will still be massive RAM required by IW to hold the inverted postings for that one document, likely much more RAM than the original document's String co

Sending a document to IndexWriter field by field

2014-02-20 Thread Igor Shalyminov
Hello! I've faced a problem indexing huge documents. The indexing itself goes all right, but when the document processing becomes concurrent, OutOfMemoryErrors start appearing (even with a heap of about 32 GB). The issue, as I see it, is that I have to create a Document instance to send it to IndexW

Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Ahmet Arslan
Hi Geet, I suggest you do this kind of transformation at query time only. Don't interfere with the index. This way is more flexible: you can disable/enable it on the fly and change your list without re-indexing. Just an imaginary example: When user passes String as International Businessma

Re: Extending StandardTokenizer Jflex to not split on '/'

2014-02-20 Thread Diego Fernandez
Thanks again for the help. Upon further investigation I found out we weren't using our custom version of the analyzer, which explains why it wasn't doing what I thought it should. When I have time to get back to it I'll reconfigure it to use our tokenizer. Diego Fernandez - 爱国 Software Engine

Re: NRT indexing and ControlledRealTimeReopenThread

2014-02-20 Thread Hans Lund
I've created https://issues.apache.org/jira/browse/LUCENE-5461, and attached a small test that shows the error in a setup similar to what I would like to run. The 1% is an overestimation - it seems to be related to concurrent commits on the index writer. Hans Lund On Thu, Feb 20, 2014 at 2:04 PM, M

Re: NRT indexing and ControlledRealTimeReopenThread

2014-02-20 Thread Michael McCandless
On Thu, Feb 20, 2014 at 7:52 AM, Hans Lund wrote: > OK, that's also what I expected, but not what I observed ;-) Ahh, not good. > For the vast majority of index updates reopens are not an issue; > minutes will be fine. A very few updates are done 'interactively' and > must be in RT (or

Re: NRT indexing and ControlledRealTimeReopenThread

2014-02-20 Thread Hans Lund
OK, that's also what I expected, but not what I observed ;-) For the vast majority of index updates reopens are not an issue; minutes will be fine. A very few updates are done 'interactively' and must be in RT (or as close as possible). I don't know if this is a rare use case - but we do

Re: [Suggestions Required] 110 Concurrency users indexing on Lucene dont finish in 200 ms.

2014-02-20 Thread Michael McCandless
Can you summarize what you observed? What was the net throughput difference on Windows vs Linux? Was everything else identical (same hardware, same JVM, etc.)? Mike McCandless http://blog.mikemccandless.com On Thu, Feb 20, 2014 at 5:23 AM, sree wrote: > We tried a standalone program to index

Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Michael McCandless
If you already know the set of phrases you need to detect then you can use Lucene's SynonymFilter to spot them and insert a new token. Mike McCandless http://blog.mikemccandless.com On Thu, Feb 20, 2014 at 7:21 AM, Benson Margulies wrote: > It sounds like you've been asked to implement Named E
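The idea of spotting a known set of phrases and emitting one token per phrase can be sketched in plain Java. This is only an illustration of the technique and does not use Lucene or SynonymFilter; the class name PhraseMergeSketch and the method merge are hypothetical, and the input is assumed to be already lowercased and split into words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: merges known multi-word
// phrases in a token list into single tokens (longest match wins).
class PhraseMergeSketch {

    static List<String> merge(List<String> tokens, List<List<String>> phrases) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            // Find the longest known phrase that starts at position i.
            int bestLen = 0;
            for (List<String> phrase : phrases) {
                int n = phrase.size();
                if (n > bestLen && i + n <= tokens.size()
                        && tokens.subList(i, i + n).equals(phrase)) {
                    bestLen = n;
                }
            }
            if (bestLen > 0) {
                // Emit the whole phrase as one token and skip past it.
                out.add(String.join(" ", tokens.subList(i, i + bestLen)));
                i += bestLen;
            } else {
                // No phrase starts here: pass the token through unchanged.
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> phrases = Arrays.asList(
                Arrays.asList("international", "business", "machine"));
        List<String> tokens = Arrays.asList(
                "international", "business", "machine", "logo");
        System.out.println(merge(tokens, phrases));
    }
}
```

In a real analyzer chain this logic would live in a token filter rather than operate on a list; the list-based form is used here only to keep the sketch self-contained.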

Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Benson Margulies
It sounds like you've been asked to implement Named Entity Recognition. OpenNLP has some capability here. There are also, um, commercial alternatives. On Thu, Feb 20, 2014 at 6:24 AM, Yann-Erwan Perio wrote: > On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar > wrote: > > Hi, > > > My requirement

Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Yann-Erwan Perio
On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar wrote: Hi, > My requirement is it should have capabilities to match multiple words as > one token. for example. When user passes String as International Business > machine logo or IBM logo it should return International Business Machine as > one tok

Re: NRT indexing and ControlledRealTimeReopenThread

2014-02-20 Thread Michael McCandless
It is intended that there are two different stale times. When a specific generation is requested, we wait for the minStaleSec since the last reopen; this is to prevent too-frequent reopens when specific gens are requested. The maxStaleSec is how long we wait between reopens for the "normal" perio
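The two-stale-time rule described above can be modelled as a small scheduling function. This is a simplified sketch of the rule as stated in the message, not Lucene's actual implementation; the class StaleTimeSketch, the method nextReopenNanos, and its parameter names are hypothetical.

```java
// Simplified model of the reopen scheduling rule: if a caller is waiting
// for a specific generation, the next reopen is due minStale after the
// last reopen; otherwise it is due maxStale after the last reopen.
class StaleTimeSketch {

    static long nextReopenNanos(long lastReopenNanos,
                                boolean callerWaiting,
                                long minStaleNanos,
                                long maxStaleNanos) {
        long wait = callerWaiting ? minStaleNanos : maxStaleNanos;
        return lastReopenNanos + wait;
    }

    public static void main(String[] args) {
        long last = 1_000L;
        // A waiter shortens the interval to the minimum stale time.
        System.out.println(nextReopenNanos(last, true, 50L, 500L));
        // With no waiter, the periodic (maximum) stale time applies.
        System.out.println(nextReopenNanos(last, false, 50L, 500L));
    }
}
```

Under this model a waiting caller still waits up to the minimum stale time after the previous reopen, which is what bounds how often reopens can be forced.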

Re: [Suggestions Required] 110 Concurrency users indexing on Lucene dont finish in 200 ms.

2014-02-20 Thread sree
We tried a standalone program to index 50 documents from 100 threads concurrently. We executed the program for 1000 threads with a 10-minute delay to avoid JVM warm-up issues, as you suggested in the last post. We are also running with a restricted heap size, i.e. 124 MB (as our product is running in Linux with

Custom Tokenizer/Analyzer

2014-02-20 Thread Geet Gangwar
Hi, I have a requirement to write a custom tokenizer using the Lucene framework. It should be able to match multiple words as one token. For example, when the user passes a string such as International Business machine logo or IBM logo, it should return International Business Machine

NRT indexing and ControlledRealTimeReopenThread

2014-02-20 Thread Hans Lund
Hi all, I'm a bit unsure about the intended function of the ControlledRealTimeReopenThread in an NRT context, especially regarding stale times. As of now, if you are waiting for a generation to become refreshed, it looks like the stale time is either the min stale time or the max stale time. Is thi

Re: Limiting the fields a user can query on

2014-02-20 Thread Trejkaz
On Thu, Feb 20, 2014 at 1:43 PM, Jamie Johnson wrote: > Is there a way to limit the fields a user can query by when using the > standard query parser or a way to get all fields/terms that make up a query > without writing custom code for each query subclass? If you mean StandardQueryParser, you c