Concurrency and multiple merge threads

2012-02-18 Thread Benson Margulies
Using Lucene 3.5.0, on a 32-core machine, I have coded something shaped like: make a writer on a RAMDirectory. start: Create a near-real-time searcher from it. farm work out to multiple threads, each of which performs a search and retrieves some docs. When all are done, write some new do

Hanging with fixed thread pool in the IndexSearcher multithread code

2012-02-19 Thread Benson Margulies
3.5.0: I passed a fixed size executor service with one thread, and then with two threads, to the IndexSearcher constructor. It hung. With three threads, it didn't work, but I got different results than when I don't pass in an executor service at all. Is this expected? Should the javadoc say som

Counting all the hits with parallel searching

2012-02-19 Thread Benson Margulies
If I have a lot of segments, and an executor service in my searcher, the following runs out of memory instantly, building giant heaps. Is there another way to express this? Should I file a JIRA that the parallel code should have some graceful behavior? int longestMentionFreq = searcher.search(long

Re: Counting all the hits with parallel searching

2012-02-19 Thread Benson Margulies
thanks, that's what I needed. On Feb 19, 2012, at 9:51 AM, Robert Muir wrote: > On Sun, Feb 19, 2012 at 9:21 AM, Benson Margulies > wrote: >> If I have a lot of segments, and an executor service in my searcher, >> the following runs out of memory instantly, building g

Re: Hanging with fixed thread pool in the IndexSearcher multithread code

2012-02-19 Thread Benson Margulies
19, 2012 at 9:08 AM, Benson Margulies > wrote: >> 3.5.0: I passed a fixed size executor service with one thread, and >> then with two threads, to the IndexSearcher constructor. >> >> It hung. >> >> With three threads, it didn't work, but I got different

Re: Hanging with fixed thread pool in the IndexSearcher multithread code

2012-02-19 Thread Benson Margulies
and there was a dumb typo. 1 thread: hang 2 threads: hang 3 or more: no hang On Feb 19, 2012, at 10:40 AM, Robert Muir wrote: > On Sun, Feb 19, 2012 at 9:08 AM, Benson Margulies > wrote: >> 3.5.0: I passed a fixed size executor service with one thread, and >> then with t

Re: Hanging with fixed thread pool in the IndexSearcher multithread code

2012-02-19 Thread Benson Margulies
Conveniently, all the 'wrong-result' problems disappeared when I followed your advice about counting hits. On Sun, Feb 19, 2012 at 10:39 AM, Robert Muir wrote: > On Sun, Feb 19, 2012 at 9:08 AM, Benson Margulies > wrote: >> 3.5.0:  I passed a fixed size executor servic

Re: Hanging with fixed thread pool in the IndexSearcher multithread code

2012-02-19 Thread Benson Margulies
See https://issues.apache.org/jira/browse/LUCENE-3803 for an example of the hang. I think this nets out to pilot error, but maybe Javadoc could protect the next person from making the same mistake. - To unsubscribe, e-mail: java-u

Re: Hanging with fixed thread pool in the IndexSearcher multithread code

2012-02-19 Thread Benson Margulies
> - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > >> -Original Message- >> From: Benson Margulies [mailto:bimargul...@gmail.com] >> Sent: Monday, February 20, 2012 1:47 AM >> To:

Here a merge thread, there a merge thread ...

2012-02-19 Thread Benson Margulies
A long-running program of mine (which Uwe's read a model of) slowly keeps adding merge threads. I count 22 at the moment. Each one shows up, runs for a bit, and then goes to sleep for, seemingly ever. I don't do anything explicit to control merging behavior. They name themselves "Lucene Merge Thre

Re: Hanging with fixed thread pool in the IndexSearcher multithread code

2012-02-20 Thread Benson Margulies
On Sun, Feb 19, 2012 at 10:39 PM, Trejkaz wrote: > On Mon, Feb 20, 2012 at 12:07 PM, Uwe Schindler wrote: >> See my response. The problem is not in Lucene; its in general a problem of >> fixed >> thread pools that execute other callables from within a callable running at >> the >> moment in the
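The quoted explanation above appears to describe a general hazard of fixed thread pools: a task running in the pool submits further tasks to the same pool and then waits for them. The following is a minimal sketch of that pattern in plain Java, not Lucene code. The class and method names are hypothetical, and a bounded wait stands in for the real hang so that the sketch terminates.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; it only illustrates the thread-starvation pattern.
public class NestedPoolDemo {

    /**
     * Runs an outer task on a fixed pool of the given size. The outer task
     * submits an inner task to the SAME pool and waits for it with a timeout.
     * Returns true if the inner task completed, false if it was starved.
     */
    static boolean innerCompletes(int poolSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            Callable<Boolean> outer = () -> {
                // The inner task needs a free worker thread in the same pool.
                Future<Integer> inner = pool.submit(() -> 42);
                try {
                    inner.get(500, TimeUnit.MILLISECONDS);
                    return true;
                } catch (TimeoutException starved) {
                    // Without the timeout, this wait would never return.
                    return false;
                }
            };
            return pool.submit(outer).get();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("pool of 1: inner completes = " + innerCompletes(1));
        System.out.println("pool of 2: inner completes = " + innerCompletes(2));
    }
}
```

With one worker, the outer task occupies the only thread, so the inner task can never start. With two workers there is a free thread for it. How many threads a real searcher needs depends on how many nested tasks it submits, which this sketch does not model.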

Re: Here a merge thread, there a merge thread ...

2012-02-24 Thread Benson Margulies
s needed, > allows that thread to do another merge (if one is immediately > available), else the thread exits. They seem to exit eventually, but not quite as soon as they arrive. > > Mike McCandless > > http://blog.mikemccandless.com > > On Sun, Feb 19, 2012 at 9:05

Updating a document.

2012-03-04 Thread Benson Margulies
I am walking down the document in an index by number, and I find that I want to update one. The updateDocument API only works on queries and terms, not numbers. So I can call remove and add, but, then, what's the document's number after that? Or is that not a meaningful question until I make a new

Return value (or lack thereof) from IndexWriter.deleteDocuments

2012-03-04 Thread Benson Margulies
Is there a reason why this doesn't return a count? Would a JIRA requesting same be viewed with any sympathy?

which fields are included in similarity?

2012-03-04 Thread Benson Margulies
TopDocs top = searcher.search(contextQuery, filter, maxDocsToRetrieve); Which document fields are included in the calculation of the scores in the returned items? All fields? All fields touched in the query? Would I need a custom Similarity to exclude some?

What replaces IndexReader.openIfChanged in Lucene 4.0?

2012-03-05 Thread Benson Margulies
Sorry, I'm coming up empty in Google here.

Re: What replaces IndexReader.openIfChanged in Lucene 4.0?

2012-03-05 Thread Benson Margulies
atomic). SlowCompositeReaderWrapper (LUCENE-2597) can be >  used to emulate atomic readers on top of composites. >  Please review MIGRATE.txt for information how to migrate old code. >  (Uwe Schindler, Robert Muir, Mike McCandless) > > -Original Message- > From: Benson Margulie

Re: What replaces IndexReader.openIfChanged in Lucene 4.0?

2012-03-05 Thread Benson Margulies
To reduce noise slightly I'll stay on this thread. I'm looking at this file, and not seeing a pointer to what to do about QueryParser. Are jar file rearrangements supposed to be in that file? I think that I don't have the right jar yet; all I'm seeing is the 'surround' package. --

Re: What replaces IndexReader.openIfChanged in Lucene 4.0?

2012-03-05 Thread Benson Margulies
rserToken -> o.a.l.queryparser.classic.Token >  - o.a.l.queryParser.QueryParserTokenMgrError -> > o.a.l.queryparser.classic.TokenMgrError > > > -Original Message- > From: Benson Margulies [mailto:bimargul...@gmail.com] > Sent: Monday, March 05, 2012 11:15 AM > To: java-user@luce

Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
I've posted a self-contained test case to github of a mystery. git://github.com/bimargulies/lucene-4-update-case.git The code can be seen at https://github.com/bimargulies/lucene-4-update-case/blob/master/src/test/java/org/apache/lucene/BadFieldTokenizedFlagTest.java. I write a doc to an index,

A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
Under "LUCENE-1458, LUCENE-2111: Flexible Indexing", CHANGES.txt appears to be missing one critical hint. If you have existing code that called IndexReader.terms(), where do you start to get a FieldsEnum?

Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
of MultiFields.getFields(indexReader).iterator(); which I came up with by fishing around for myself? > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > >> -----Original Message- >>

Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
> Mike McCandless > > http://blog.mikemccandless.com > > On Tue, Mar 6, 2012 at 8:50 AM, Benson Margulies > wrote: >> Under "LUCENE-1458, LUCENE-2111: Flexible Indexing", CHANGES.txt >> appears to be missing one critical hint. If you have existing code >>

Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
Oh, I see, I didn't read far enough down. Well, the patch still repairs a bug in the code fragment relative to the Term enumeration.

Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
Oh, ouch, there's no SegmentReader.getReader, I was reading IndexWriter. Sorry. On Tue, Mar 6, 2012 at 9:14 AM, Benson Margulies wrote: > On Tue, Mar 6, 2012 at 8:56 AM, Uwe Schindler wrote: >> AtomicReader.fields() --

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
u proceed to do TermQueries on "value-1". This term won't > exist... TermQuery etc that take Term don't analyze any text. > > Instead usually higher-level things like QueryParsers analyze text into Terms. > > On Tue, Mar 6, 2012 at 8:35 AM, Benson Margulies >

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies wrote: > On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir wrote: >> I think the issue is that your analyzer is standardanalyzer, yet field >> text value is "value-1" > > Robert, > > Why is this f

Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
tool, I think that MultiFields will be fine. --benson > > Uwe > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > >> -Original Message- >> From: Benson Margulies [mailto:bimargul...@gmail.co

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 9:33 AM, Robert Muir wrote: > On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies > wrote: >> On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir wrote: >>> I think the issue is that your analyzer is standardanalyzer, yet field >>> text value is "

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
> >> Hmm something is up here... I'll dig.  Seems like we are somehow analyzing >> StringField when we shouldn't... >> >> Mike McCandless >> >> http://blog.mikemccandless.com >> >> On Tue, Mar 6, 2012 at 9:33 AM, Robert Muir wrote: >

Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
>> figuring out how to replace it. For my purposes, which are a dev tool, I >> think >> that MultiFields will be fine. >> >> --benson >> >> >> > >> > Uwe >> > >> > - >> > Uwe Schindler >> > H.-H

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
a suggestion for sneaking around this in the mean time? > > On Tue, Mar 6, 2012 at 9:58 AM, Benson Margulies > wrote: >> On Tue, Mar 6, 2012 at 9:47 AM, Uwe Schindler wrote: >>> String field is analyzed, but with KeywordTokenizer, so all should be fine. >> >> I f

Re: Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Benson Margulies
fileformat.info On Mar 30, 2012, at 1:04 PM, Denis Brodeur wrote: > Thanks Robert. That makes sense. Do you have a link handy where I can > find this information? i.e. word boundary/punctuation for any unicode > character set? > > On Fri, Mar 30, 2012 at 12:57 PM, Robert Muir wrote: > >> On F

Repeatability of results

2012-04-02 Thread Benson Margulies
We've observed something that, in some ways, is not surprising. If you take a set of documents that are close in 'score' to some query, and shuffle them in different orders and then see what results you get in what order from the reference query, the scores will vary according to the insertio

Re: Repeatability of results

2012-04-02 Thread Benson Margulies
 And, the score should not change as a function of insertion > order... Well, I assumed that TF-IDF would wiggle. > > Do you have a small test case? Since this surprises you, I will build a test case. > > Mike McCandless > > http://blog.mikemccandless.com > > On Mo

DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
I am trying to solve a problem using DisjunctionMaxQuery. Consider a query like: a:b OR c:d OR e:f OR ... name:richard OR name:dick OR name:dickie OR name:rich ... At most, one of the richard names matches. So the match score gets dragged down by the long list of things that don't match, as the
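The effect described above can be sketched with plain arithmetic. This is a simplified illustration, not Lucene code. It assumes the classic TF-IDF coord factor (matching clauses divided by total clauses) and the documented DisjunctionMaxQuery combination (maximum sub-score plus a tie-breaker times the sum of the other sub-scores). The class name, the clause scores, and the convention that a score of 0 means "did not match" are hypothetical, and query normalization is ignored.

```java
// Hypothetical arithmetic sketch; no Lucene classes are used.
public class CoordVsDisMaxSketch {

    // Classic coord factor: fraction of optional clauses that matched.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Sum of matching SHOULD-clause scores, scaled by coord.
    // A score of 0 is treated as "clause did not match" (an assumption).
    static float booleanScore(float[] clauseScores) {
        float sum = 0f;
        int overlap = 0;
        for (float s : clauseScores) {
            if (s > 0f) {
                sum += s;
                overlap++;
            }
        }
        return sum * coord(overlap, clauseScores.length);
    }

    // Disjunction-max combination: best clause plus tieBreaker * the rest.
    static float disMaxScore(float[] clauseScores, float tieBreaker) {
        float max = 0f;
        float sum = 0f;
        for (float s : clauseScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        // Four name variants (richard, dick, dickie, rich); only one matches.
        float[] nameVariants = {1.0f, 0f, 0f, 0f};
        System.out.println("boolean with coord: " + booleanScore(nameVariants));
        System.out.println("dismax, tie 0.0:    " + disMaxScore(nameVariants, 0.0f));
    }
}
```

With these made-up numbers, the boolean combination scores the single matching variant at 0.25, while the disjunction-max combination keeps the full 1.0. That is the "dragged down" effect in miniature.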

Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
On Thu, Apr 19, 2012 at 1:34 PM, Robert Muir wrote: > On Thu, Apr 19, 2012 at 1:26 PM, Benson Margulies > wrote: >> I am trying to solve a problem using DisjunctionMaxQuery. >> >> >> Consider a query like: >> >> a:b OR c:d OR e:f OR ... >> name:

Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
Turning on disableCoord for a nested boolean query does not seem to change the overall maxCoord term as displayed in explain.

Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
On Thu, Apr 19, 2012 at 4:21 PM, Robert Muir wrote: > On Thu, Apr 19, 2012 at 3:49 PM, Benson Margulies > wrote: >> On Thu, Apr 19, 2012 at 1:34 PM, Robert Muir wrote: >>> On Thu, Apr 19, 2012 at 1:26 PM, Benson Margulies >>> wrote: >>>

Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
On Thu, Apr 19, 2012 at 5:10 PM, Robert Muir wrote: > On Thu, Apr 19, 2012 at 5:05 PM, Benson Margulies > wrote: >> On Thu, Apr 19, 2012 at 4:21 PM, Robert Muir wrote: >>> On Thu, Apr 19, 2012 at 3:49 PM, Benson Margulies >>> wrote: >>>> On Thu, A

Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
I see why I'm so confused, but I think I need to construct a simpler test case. My top-level BooleanQuery, which has disableCoord=false, has 22 clauses. All but three are ordinary SHOULD TermQueries. The remainder are a spanNear and a nested BooleanQuery, and an empty PhraseQuery (that's a bug).

Re: DisjunctionMaxQuery and scoring

2012-04-19 Thread Benson Margulies
for > BooleanQuery bq = new BooleanQuery(false); >  bq.set*Maximum*NumberShouldMatch(1); > > Is there a good way to accomplish this? > > On Thu, Apr 19, 2012 at 7:37 PM, Robert Muir wrote: > >> On Thu, Apr 19, 2012 at 6:36 PM, Benson Margulies >> wrote: >>

Re: DisjunctionMaxQuery and scoring

2012-04-20 Thread Benson Margulies
Uwe and Robert, Thanks. David and I are two peas in one pod here at Basis. --benson On Fri, Apr 20, 2012 at 2:33 AM, Uwe Schindler wrote: > Hi, > > Ah sorry, I misunderstood, you wanted to score the duplicate match lower! To > achieve this, you have to change the coord function in your > simila

Payload class

2012-08-29 Thread Benson Margulies
I'm failing to find advice in MIGRATE.txt on how to replace 'new Payload(...)' in migrating to 4.0. What am I missing?

ResourceLoader?

2012-08-29 Thread Benson Margulies
Our Solr 3.x code used init(ResourceLoader) and then called the loader to read a file. What's the new approach to reading content from files in the 'usual place'?

Re: ResourceLoader?

2012-08-29 Thread Benson Margulies
That's what I meant, thanks. On Wed, Aug 29, 2012 at 10:20 AM, Robert Muir wrote: > On Wed, Aug 29, 2012 at 10:10 AM, Benson Margulies > wrote: > > Our Solr 3.x code used init(ResourceLoader) and then called the loader to > > read a file. > > > > What's

Re: ResourceLoader?

2012-08-29 Thread Benson Margulies
I'm confused. Isn't inform/ResourceLoader deprecated? But your example uses it? On Wed, Aug 29, 2012 at 10:20 AM, Robert Muir wrote: > On Wed, Aug 29, 2012 at 10:10 AM, Benson Margulies > wrote: > > Our Solr 3.x code used init(ResourceLoader) and then called the loa

Using a char filter in solr createComponents

2012-08-29 Thread Benson Margulies
I'm close to the bottom of my list here. I've got an Analyzer that, in 3.1, set up a CharFilter in the tokenStream method. So now I have to migrate that to createComponents. Can someone give me a shove in the right direction?

Re: ResourceLoader?

2012-08-29 Thread Benson Margulies
On Wed, Aug 29, 2012 at 10:30 AM, Robert Muir wrote: > On Wed, Aug 29, 2012 at 10:27 AM, Benson Margulies > wrote: > > I'm confused. Isn't inform/ResourceLoader deprecated? But your example > use > > it? > > > > Where is it deprecated? What does the

Re: ResourceLoader?

2012-08-29 Thread Benson Margulies
Hang on: [deprecation] org.apache.solr.util.plugin.ResourceLoaderAware in org.apache.solr.util.plugin has been deprecated On Wed, Aug 29, 2012 at 10:30 AM, Robert Muir wrote: > On Wed, Aug 29, 2012 at 10:27 AM, Benson Margulies > wrote: > > I'm confused. Isn't

Re: ResourceLoader?

2012-08-29 Thread Benson Margulies
On Wed, Aug 29, 2012 at 10:42 AM, Robert Muir wrote: > Right and what does the @deprecated message say :) > Yes, indeed, sorry. I got caught in a maze of twisty passages and my brain turned off. I'm better now. > > On Wed, Aug 29, 2012 at 10:40 AM, Benson Margulies >

reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
I've read the javadoc through a few times, but I confess that I'm still feeling dense. Are all tokenizers responsible for implementing some way of retaining the contents of their reader, so that a call to reset without a call to setReader rewinds? I note that CharTokenizer doesn't implement #reset

Re: reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
der) is only on Tokenizer, it means replace the Reader > with a different one to be processed. > The fact that CharTokenizer is doing 'reset()-like-stuff' in here is > bogus IMO, but I dont think it will cause any bugs. Don't emulate it > :) > > On Wed, Aug 29, 201

Re: reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
Some interlinear commentary on the doc. * Resets this stream to the beginning. To me this implies a rewind. As previously noted, I don't see how this works for the existing implementations. * As all TokenStreams must be reusable, * any implementations which have state that needs to be re

Re: reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
I think I'm beginning to get the idea. Is the following plausible? At the bottom of the stack, there's an actual source of data -- like a tokenizer. For one of those, reset() is a bit silly, and something like setReader is the brains of the operation. Some number of other components may be stacke

Re: reset versus setReader on TokenStream

2012-08-29 Thread Benson Margulies
If I'm following, you've created a division of labor between setReader and reset. We have a tokenizer that has a good deal of state, since it has to split the input into chunks. If I'm following here, you'd recommend that we do nothing special in setReader, but have #reset fix up all the state on

Re: Issue with documentation for org.apache.lucene.analysis.synonym.SynonymMap.Builder.add() method

2012-09-06 Thread Benson Margulies
On Thu, Sep 6, 2012 at 1:59 PM, Robert Muir wrote: > Thanks for reporting this Mark. > > I think it was not intended to have actual null characters here (or > probably anywhere in javadocs). > > Our javadocs checkers should be failing on stuff like this... > > On Thu, Sep 6, 2012 at 1:52 PM, Mark

LookaheadTokenFilter

2013-09-05 Thread Benson Margulies
This useful-looking item is in the test-framework jar. Is there some subtle reason that it isn't in the common analyzer jar? Some reason why I'd regret using it?

LookaheadTokenFilter

2013-09-05 Thread Benson Margulies
I'm trying to work through the logic of reading ahead until I've seen a marker for the end of a sentence, then applying some analysis to all of the tokens of the sentence, and then changing some attributes of each token to reflect the results. The queue of tokens for a position is just a State, so t

Re: LookaheadTokenFilter

2013-09-06 Thread Benson Margulies
On Fri, Sep 6, 2013 at 7:31 AM, Michael McCandless wrote: > > On Thu, Sep 5, 2013 at 8:44 PM, Benson Margulies wrote: > > I'm trying to work through the logic of reading ahead until I've seen > > marker for the end of a sentence, then applying some analysis to

PositionLengthAttribute

2013-09-06 Thread Benson Margulies
I'm confused by the comment about compound components here. If a single token fissions into multiple tokens, then what belongs in the PositionLengthAttribute? I'm wanting to store a fraction in here! Or is the idea to store N in the 'mother' token and then '1' in each of the babies?

Re: LookaheadTokenFilter

2013-09-06 Thread Benson Margulies
omething simple. public boolean incrementToken() throws IOException { if (positions.getMaxPos() < 0) { peekSentence(); } return nextToken(); } On Fri, Sep 6, 2013 at 8:13 AM, Benson Margulies wrote: > On Fri, Sep 6, 2013 at 7:31 AM, Michae

Re: LookaheadTokenFilter

2013-09-06 Thread Benson Margulies
not use it. On Fri, Sep 6, 2013 at 9:10 PM, Benson Margulies wrote: > Michael, > > I'm apparently not fully deconfused yet. > > I've got a very simple incrementToken function. It calls peekToken to > stack up the tokens. > > afterPosition is never called; I expe

Re: PositionLengthAttribute

2013-09-06 Thread Benson Margulies
On Fri, Sep 6, 2013 at 9:28 PM, Robert Muir wrote: > its the latter. the way its designed to work i think is illustrated > best in kuromoji analyzer where it heuristically decompounds nouns: > > if it decompounds ABCD into AB + CD, then the tokens are AB and CD. > these both have posinc=1. > howev

Re: PositionLengthAttribute

2013-09-07 Thread Benson Margulies
e offset in the original that might as well be blamed for any given component. On Fri, Sep 6, 2013 at 9:37 PM, Robert Muir wrote: > On Fri, Sep 6, 2013 at 9:32 PM, Benson Margulies wrote: >> On Fri, Sep 6, 2013 at 9:28 PM, Robert Muir wrote: >>> its the latter. the way its design

Re: PositionLengthAttribute

2013-09-07 Thread Benson Margulies
On Sat, Sep 7, 2013 at 8:39 AM, Robert Muir wrote: > On Sat, Sep 7, 2013 at 7:44 AM, Benson Margulies wrote: >> In Japanese, compounds are just decompositions of the input string. In >> other languages, compounds can manufacture entire tokens from thin >> air. In those cases

Re: LookaheadTokenFilter

2013-09-07 Thread Benson Margulies
nextToken() calls peekToken(). That seems to prevent my lookahead processing from seeing that item later. Am I missing something? On Fri, Sep 6, 2013 at 9:15 PM, Benson Margulies wrote: > I think that the penny just dropped, and I should not be using this class. > > If I call peekToken

Re: LookaheadTokenFilter

2013-09-07 Thread Benson Margulies
f Position), then nextToken() > to emit the buffered tokens, and to insert your own tokens when > afterPosition() is called ... > > Mike McCandless > > http://blog.mikemccandless.com > > > On Sat, Sep 7, 2013 at 1:10 PM, Benson Margulies wrote: >> nextToken() calls peek

Re: LookaheadTokenFilter

2013-09-07 Thread Benson Margulies
, thanks! > > Mike McCandless > > http://blog.mikemccandless.com > > > On Sat, Sep 7, 2013 at 3:40 PM, Benson Margulies wrote: >> I think I had better build you a test case for this situation, and >> attach it to a JIRA. >> >> On Sat, Sep 7, 2013 at 3:33 PM

Re: org.apache.lucene.analysis.icu.ICUNormalizer2Filter -- why Token?

2013-09-16 Thread Benson Margulies
ilter should be useful: there is > a JIRA for it, but it has some unresolved issues > > https://issues.apache.org/jira/browse/LUCENE-4072 > > On Sun, Sep 15, 2013 at 7:05 PM, Benson Margulies > wrote: > > Can anyone shed light as to why this is a token filter and not a char &

org.apache.lucene.analysis.icu.ICUNormalizer2Filter -- why Token?

2013-09-16 Thread Benson Margulies
Can anyone shed light as to why this is a token filter and not a char filter? I'm wishing for one of these _upstream_ of a tokenizer, so that the tokenizer's lookups in its dictionaries are seeing normalized contents.

How to make good use of the multithreaded IndexSearcher?

2013-09-30 Thread Benson Margulies
The multithreaded index searcher fans out across segments. How aggressively does 'optimize' reduce the number of segments? If the segment count goes way down, is there some other way to exploit multiple cores?

Re: How to make good use of the multithreaded IndexSearcher?

2013-10-01 Thread Benson Margulies
e segment >> structure. >> >> But then again, this need (using concurrent hardware to reduce latency >> of a single query) is somewhat rare; most apps are fine using the >> concurrency across queries rather than within one query. >> >> Mike McCan

Analyzer classes versus the constituent components

2013-10-08 Thread Benson Margulies
Is there some advice around about when it's appropriate to create an Analyzer class, as opposed to just Tokenizer and TokenFilter classes? The advantage of the constituent elements is that they allow the consuming application to add more filters. The only disadvantage I see is that the following i

Exploiting a whole lot of memory

2013-10-08 Thread Benson Margulies
Consider a Lucene index consisting of 10m documents with a total disk footprint of 3G. Consider an application that treats this index as read-only, and runs very complex queries over it. Queries with many terms, some of them 'fuzzy' and 'should' terms and a dismax. And, finally, consider doing all

Re: Exploiting a whole lot of memory

2013-10-08 Thread Benson Margulies
mccandless.com > > > On Tue, Oct 8, 2013 at 5:45 PM, Benson Margulies > wrote: > > Consider a Lucene index consisting of 10m documents with a total disk > > footprint of 3G. Consider an application that treats this index as > > read-only, and runs very complex queries ov

Re: Exploiting a whole lot of memory

2013-10-08 Thread Benson Margulies
Oh, drat, I left out an 's'. I got it now. On Tue, Oct 8, 2013 at 7:40 PM, Benson Margulies wrote: > Mike, where do I find DirectPostingFormat? > > > On Tue, Oct 8, 2013 at 5:50 PM, Michael McCandless < > luc...@mikemccandless.com> wrote: > >> DirectP

Re: Exploiting a whole lot of memory

2013-10-09 Thread Benson Margulies
de a codec that returns it as the postings guy, is that the whole recipe?. Does it make sense to extend it any further to any of the other codec pieces? > > Mike McCandless > > http://blog.mikemccandless.com > > > On Tue, Oct 8, 2013 at 5:45 PM, Benson Margulies > wrote

Re: Exploiting a whole lot of memory

2013-10-09 Thread Benson Margulies
On Wed, Oct 9, 2013 at 7:18 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Wed, Oct 9, 2013 at 7:13 PM, Benson Margulies > wrote: > > On Tue, Oct 8, 2013 at 5:50 PM, Michael McCandless < > > luc...@mikemccandless.com> wrote: > > > >>

Re: Exploiting a whole lot of memory

2013-10-10 Thread Benson Margulies
On Wed, Oct 9, 2013 at 7:18 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Wed, Oct 9, 2013 at 7:13 PM, Benson Margulies > wrote: > > On Tue, Oct 8, 2013 at 5:50 PM, Michael McCandless < > > luc...@mikemccandless.com> wrote: > > > >>

Re: Handling special characters in Lucene 4.0

2013-10-20 Thread Benson Margulies
It might be helpful if you would explain, at a higher level, what you are trying to accomplish. Where do these things come from? What higher-level problem are you trying to solve? On Sun, Oct 20, 2013 at 7:12 PM, saisantoshi wrote: > Thanks. > > So, if I understand correctly, StandardAnalyzer won

Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?

2013-10-28 Thread Benson Margulies
I'm working on a tool that wants to construct analyzers 'at arms length' -- a bit like from a solr schema -- so that multiple dueling analyzers could be in their own class loaders at one time. I want to just define a simple configuration for char filters, tokenizer, and token filter. So it would be,

Re: Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?

2013-10-28 Thread Benson Margulies
OK, so, here I go again making a public idiot of myself. Could it be that the tokenizer factory is 'relatively recent' as in since 4.1? On Mon, Oct 28, 2013 at 7:39 AM, Benson Margulies wrote: > I'm working on tool that wants to construct analyzers 'at arms length'

Re: Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?

2013-10-28 Thread Benson Margulies
e all in Lucene's analyzers-commons module > (since 4.0). They are no longer part of Solr. > > Uwe > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > > -Original Message- > > F

Re: Why is there a token filter factory abstraction but not a tokenizer factory abstraction in Lucene?

2013-10-28 Thread Benson Margulies
ematic. I don't suppose there are some guidelines? On Mon, Oct 28, 2013 at 9:43 AM, Benson Margulies wrote: > Just how 'experimental' is the SPI system at this point, if that's a > reasonable question? > > > On Mon, Oct 28, 2013 at 8:41 AM, Uwe Schindler wrote

Anyone interested in a worked-out example of the SPIs for analyzer components?

2013-10-28 Thread Benson Margulies
I just built myself a sort of Solr-schema-in-a-test-tube. It's a class that builds a classloader on some JAR files and then uses the SPI mechanism to manufacture Analyzer objects made out of tokenizers and filters. I can make this visible in github, or even attach it to a JIRA, if anyone is intere

new consistency check for token filters in 4.5.1

2013-10-29 Thread Benson Margulies
My token filter has no end() method at all. Am I required to have an end() method? BaseLinguisticsTokenFilterTest.testSegmentationReadings:175->Assert.assertTrue:41->Assert.fail:88 super.end()/clearAttributes() was not called correctly in end() BaseLinguisticsTokenFilterTest.testSpacesInLemma:1

Re: new consistency check for token filters in 4.5.1

2013-10-30 Thread Benson Margulies
u...@thetaphi.de > > > -Original Message- > > From: Benson Margulies [mailto:ben...@basistech.com] > > Sent: Wednesday, October 30, 2013 12:30 AM > > To: java-user@lucene.apache.org > > Subject: new consistency check for token filters in 4.5.1 > > > >

Threads and LuceneTestCase in 3.6.0

2013-10-31 Thread Benson Margulies
I just backported some code to 3.6.0, and it includes tests that use org.apache.lucene.analysis.BaseTokenStreamTestCase#checkRandomData(java.util.Random, org.apache.lucene.analysis.Analyzer, int, int) The tests that use this method fail in 3.6.0 in ways that suggest that multiple threads are hitt

Re: Modify the StandardTokenizerFactory to concatenate all words

2013-11-05 Thread Benson Margulies
How would you expect to recognize that 'Toy Story' is a thing? On Tue, Nov 5, 2013 at 6:32 PM, Kevin wrote: > Currently I'm using StandardTokenizerFactory which tokenizes the words > based on spaces. For Toy Story it will create tokens toy and story. > Ideally, I would want to extend the functi

Where is the source for the .dat files in Kuromoji?

2013-12-02 Thread Benson Margulies
There are a handful of binary files in ./src/resources/org/apache/lucene/analysis/ja/dict/ with filenames ending in .dat. Trailing around in the source, it seems as if at least one of these derives from a source file named "unk.def". In turn, this file comes from a dependency. Should the build ge

Re: Where is the source for the .dat files in Kuromoji?

2013-12-02 Thread Benson Margulies
stored in the dat file. See also the ivy.xml. > > Uwe > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > > -Original Message- > > From: Benson Margulies [mailto:ben...@basistech.com] >

Re: Where is the source for the .dat files in Kuromoji?

2013-12-02 Thread Benson Margulies
from git, not from the official release, so I don't know. > > Many thanks, > > Christian Moen > アティリカ株式会社 > http://www.atilika.com > > On Dec 3, 2013, at 2:11 AM, Benson Margulies wrote: > > > There are a handful of binary files in > ./src/resources/org/

How is incrementToken supposed to detect the lack of reset()?

2014-01-07 Thread Benson Margulies
In 4.6.0, org.apache.lucene.analysis.BaseTokenStreamTestCase#checkResetException fails if incrementToken fails to throw if there's a missing reset. How am I supposed to organize this in a Tokenizer? A quick look at CharTokenizer did not reveal any code for the purpose.

Re: How is incrementToken supposed to detect the lack of reset()?

2014-01-07 Thread Benson Margulies
izer.java for the state machine logic. In general you should > not have to do anything if the tokenizer is well-behaved (e.g. close > calls super.close() and so on). > > > > On Tue, Jan 7, 2014 at 2:50 PM, Benson Margulies > wrote: > > In 4.6.0, > org.apa

Re: How is incrementToken supposed to detect the lack of reset()?

2014-01-07 Thread Benson Margulies
urpose. i think its confusing and contributes to bugs that you have > to have logic in e.g. the ctor THEN ALSO in reset(). > > if someone does it correctly in the ctor, but they only test "one > time", they might think everything is working.. > > On Tue, Jan 7, 2014 at 3:23

Re: How is incrementToken supposed to detect the lack of reset()?

2014-01-08 Thread Benson Margulies
, Item 16). > > Hope this helps somebody. > > [1] > http://stackoverflow.com/questions/20624339/having-trouble-rereading-a-lucene-tokenstream/20630673#20630673 > > Regards, > Mindaugas > > On Tue, Jan 7, 2014 at 9:45 PM, Benson Margulies > wrote: > > Yes I Do.

Re: How is incrementToken supposed to detect the lack of reset()?

2014-01-08 Thread Benson Margulies
Sure, why not - I'm just not sure if my approach (of setting reader in > reset()) is preferred over yours (using this.input instead of input in > ctor)? Or are they both equally good? > > m. > > On Wed, Jan 8, 2014 at 12:18 PM, Benson Margulies > wrote: > > If y

Re: LUCENE-5388 AbstractMethodError

2014-01-30 Thread Benson Margulies
If you are sensitive to things being committed to trunk, that suggests that you are building your own jars and using the trunk. Are you perfectly sure that you have built, and are using, a consistent set of jars? It looks as if you've got some trunk-y stuff and some 4.6.1 stuff. On Thu, Jan 30,

Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Benson Margulies
It sounds like you've been asked to implement Named Entity Recognition. OpenNLP has some capability here. There are also, um, commercial alternatives. On Thu, Feb 20, 2014 at 6:24 AM, Yann-Erwan Perio wrote: > On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar > wrote: > > Hi, > > > My requirement
