Using Lucene 3.5.0, on a 32-core machine, I have coded something shaped like:
make a writer on a RAMDirectory.
start:
create a near-real-time searcher from it.
farm work out to multiple threads, each of which performs a search
and retrieves some docs.
When all are done, write some new docs and loop back to the start.
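In code, the shape is roughly this (a sketch against the 3.5 APIs; the loop condition, thread count, and worker bodies are placeholders):

    Directory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_35,
            new StandardAnalyzer(Version.LUCENE_35)));
    ExecutorService workers = Executors.newFixedThreadPool(nThreads);
    while (moreToDo()) {                                   // 'start:'
      IndexReader reader = IndexReader.open(writer, true); // near-real-time reader
      IndexSearcher searcher = new IndexSearcher(reader);
      // farm searches out to the workers and wait for all of them ...
      searcher.close();
      reader.close();
      // ... then write some new docs with the writer and loop
    }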
3.5.0: I passed a fixed size executor service with one thread, and
then with two threads, to the IndexSearcher constructor.
It hung.
With three threads, it didn't work, but I got different results than
when I don't pass in an executor service at all.
Is this expected? Should the javadoc say something?
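For concreteness, the constructor in question (a sketch; the pool is the one the searcher slices queries across segments with, and the caller still owns its shutdown):

    ExecutorService pool = Executors.newFixedThreadPool(2);
    IndexSearcher searcher = new IndexSearcher(reader, pool);

As the thread below works out, the hang nets out to pilot error: submitting the searches from inside the same fixed pool the searcher is using, which is the classic recipe for starving it.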
If I have a lot of segments, and an executor service in my searcher,
the following runs out of memory instantly, building giant heaps. Is
there another way to express this? Should I file a JIRA that the
parallel code should have some graceful behavior?
int longestMentionFreq = searcher.search(long
Thanks, that's what I needed.
On Feb 19, 2012, at 9:51 AM, Robert Muir wrote:
> On Sun, Feb 19, 2012 at 9:21 AM, Benson Margulies
> wrote:
>> If I have a lot of segments, and an executor service in my searcher,
>> the following runs out of memory instantly, building g…
> On Sun, Feb 19, 2012 at 9:08 AM, Benson Margulies
> wrote:
>> 3.5.0: I passed a fixed size executor service with one thread, and
>> then with two threads, to the IndexSearcher constructor.
>>
>> It hung.
>>
>> With three threads, it didn't work, but I got different
… and there was a dumb typo. The actual results:
1 thread: hang
2 threads: hang
3 or more: no hang
On Feb 19, 2012, at 10:40 AM, Robert Muir wrote:
> On Sun, Feb 19, 2012 at 9:08 AM, Benson Margulies
> wrote:
>> 3.5.0: I passed a fixed size executor service with one thread, and
>> then with t
Conveniently, all the 'wrong-result' problems disappeared when I
followed your advice about counting hits.
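For anyone finding this later: 'counting hits' here means something like a TotalHitCountCollector instead of asking search() for top docs (a sketch; Robert's exact wording was cut from the archive):

    TotalHitCountCollector counter = new TotalHitCountCollector();
    searcher.search(query, counter);
    int freq = counter.getTotalHits();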
On Sun, Feb 19, 2012 at 10:39 AM, Robert Muir wrote:
> On Sun, Feb 19, 2012 at 9:08 AM, Benson Margulies
> wrote:
>> 3.5.0: I passed a fixed size executor servic
See https://issues.apache.org/jira/browse/LUCENE-3803 for an example
of the hang. I think this nets out to pilot error, but maybe Javadoc
could protect the next person from making the same mistake.
A long-running program of mine (which Uwe's read a model of) slowly
keeps adding merge threads. I count 22 at the moment. Each one shows
up, runs for a bit, and then goes to sleep, seemingly forever. I
don't do anything explicit to control merging behavior.
They name themselves "Lucene Merge Thread #<n>".
On Sun, Feb 19, 2012 at 10:39 PM, Trejkaz wrote:
> On Mon, Feb 20, 2012 at 12:07 PM, Uwe Schindler wrote:
>> See my response. The problem is not in Lucene; it's in general a problem of
>> fixed thread pools that execute other callables from within a callable
>> running at the moment …
> … as needed,
> allows that thread to do another merge (if one is immediately
> available), else the thread exits.
They seem to exit eventually, but not quite as soon as they arrive.
I am walking down the document in an index by number, and I find that
I want to update one. The updateDocument API only works on queries and
terms, not numbers.
So I can call remove and add, but, then, what's the document's number
after that? Or is that not a meaningful question until I make a new reader?
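The usual answer, sketched (3.x APIs; the "id" field and value here are mine): keep a unique ID field of your own and update by Term. The new doc number isn't knowable until a new reader is opened, since merges renumber documents.

    Document doc = new Document();
    doc.add(new Field("id", "doc-42",
        Field.Store.YES, Field.Index.NOT_ANALYZED));
    // ... the fields being updated ...
    writer.updateDocument(new Term("id", "doc-42"), doc); // delete-then-add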
Is there a reason why this doesn't return a count? Would a JIRA
requesting same be viewed with any sympathy?
TopDocs top = searcher.search(contextQuery, filter, maxDocsToRetrieve);
Which document fields are included in the calculation of the scores in
the returned items? All fields? All fields touched in the query? Would
I need a custom Similarity to exclude some?
Sorry, I'm coming up empty in Google here.
> … (atomic). SlowCompositeReaderWrapper (LUCENE-2597) can be
> used to emulate atomic readers on top of composites.
> Please review MIGRATE.txt for information how to migrate old code.
> (Uwe Schindler, Robert Muir, Mike McCandless)
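A sketch of that emulation (4.0 APIs; directoryReader stands for whatever composite reader is already open):

    AtomicReader atomic = SlowCompositeReaderWrapper.wrap(directoryReader);
    Fields fields = atomic.fields(); // atomic-reader view over the composite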
To reduce noise slightly I'll stay on this thread.
I'm looking at this file, and not seeing a pointer to what to do about
QueryParser. Are jar file rearrangements supposed to be in that file?
I think that I don't have the right jar yet; all I'm seeing is the
'surround' package.
> - o.a.l.queryParser.QueryParserToken -> o.a.l.queryparser.classic.Token
> - o.a.l.queryParser.QueryParserTokenMgrError ->
> o.a.l.queryparser.classic.TokenMgrError
I've posted a self-contained test case to github of a mystery.
git://github.com/bimargulies/lucene-4-update-case.git
The code can be seen at
https://github.com/bimargulies/lucene-4-update-case/blob/master/src/test/java/org/apache/lucene/BadFieldTokenizedFlagTest.java.
I write a doc to an index, …
Under "LUCENE-1458, LUCENE-2111: Flexible Indexing", CHANGES.txt
appears to be missing one critical hint. If you have existing code
that called IndexReader.terms(), where do you start to get a
FieldsEnum?
… of
MultiFields.getFields(indexReader).iterator(),
which I came up with by fishing around for myself?
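Spelled out, the fishing expedition lands at roughly this (a 4.0 sketch; indexReader assumed open):

    Fields fields = MultiFields.getFields(indexReader);
    if (fields != null) {                 // null when the index has no postings
      for (String field : fields) {       // Fields iterates field names
        Terms terms = fields.terms(field);
        TermsEnum termsEnum = terms.iterator(null);
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
          // term.utf8ToString(), termsEnum.docFreq(), ...
        }
      }
    }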
Oh, I see, I didn't read far enough down. Well, the patch still
repairs a bug in the code fragment relative to the Term enumeration.
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail
Oh, ouch, there's no SegmentReader.getReader, I was reading IndexWriter. Sorry.
On Tue, Mar 6, 2012 at 9:14 AM, Benson Margulies wrote:
> On Tue, Mar 6, 2012 at 8:56 AM, Uwe Schindler wrote:
>> AtomicReader.fields()
> … you proceed to do TermQueries on "value-1". This term won't
> exist... TermQuery etc. that take a Term don't analyze any text.
>
> Instead usually higher-level things like QueryParsers analyze text into Terms.
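A tiny illustration of Robert's point (assuming a field "f" whose text "value-1" was indexed with StandardAnalyzer, which splits it into the tokens "value" and "1"):

    // Finds nothing: the raw term "value-1" was never indexed.
    Query miss = new TermQuery(new Term("f", "value-1"));

    // The parser runs the analyzer, so it looks for the tokens that exist.
    QueryParser parser = new QueryParser(Version.LUCENE_40, "f",
        new StandardAnalyzer(Version.LUCENE_40));
    Query hit = parser.parse("value-1");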
On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies wrote:
> On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir wrote:
>> I think the issue is that your analyzer is standardanalyzer, yet field
>> text value is "value-1"
>
> Robert,
>
> Why is this f
… For my purposes, which are a dev tool, I think that MultiFields will be fine.
--benson
On Tue, Mar 6, 2012 at 9:33 AM, Robert Muir wrote:
> On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies
> wrote:
>> On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir wrote:
>>> I think the issue is that your analyzer is standardanalyzer, yet field
>>> text value is "
>> Hmm something is up here... I'll dig. Seems like we are somehow analyzing
>> StringField when we shouldn't...
… a suggestion for sneaking around this in the meantime?
>
> On Tue, Mar 6, 2012 at 9:58 AM, Benson Margulies
> wrote:
>> On Tue, Mar 6, 2012 at 9:47 AM, Uwe Schindler wrote:
>>> StringField is analyzed, but with KeywordTokenizer, so all should be fine.
>>
>> I f
fileformat.info
On Mar 30, 2012, at 1:04 PM, Denis Brodeur wrote:
> Thanks Robert. That makes sense. Do you have a link handy where I can
> find this information? i.e. word boundary/punctuation for any unicode
> character set?
We've observed something that, in some ways, is not surprising.
If you take a set of documents that are close in 'score' to some query,
and shuffle them in different orders
and then see what results you get in what order from the reference query,
the scores will vary according to the insertion order.
> And, the score should not change as a function of insertion
> order...
Well, I assumed that TF-IDF would wiggle.
>
> Do you have a small test case?
Since this surprises you, I will build a test case.
I am trying to solve a problem using DisjunctionMaxQuery.
Consider a query like:
a:b OR c:d OR e:f OR ...
name:richard OR name:dick OR name:dickie OR name:rich ...
At most, one of the richard names matches. So the match score gets
dragged down by the long list of things that don't match, as the …
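The shape I'm trying, sketched (the 0.0f tie-breaker means only the best-matching name variant contributes the clause's score):

    DisjunctionMaxQuery names = new DisjunctionMaxQuery(0.0f);
    names.add(new TermQuery(new Term("name", "richard")));
    names.add(new TermQuery(new Term("name", "dick")));
    names.add(new TermQuery(new Term("name", "dickie")));
    names.add(new TermQuery(new Term("name", "rich")));

    BooleanQuery top = new BooleanQuery();
    top.add(names, BooleanClause.Occur.SHOULD);
    // ... the other a:b, c:d, e:f clauses ...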
Turning on disableCoord for a nested boolean query does not seem to
change the overall maxCoord term as displayed in explain.
I see why I'm so confused, but I think I need to construct a simpler test case.
My top-level BooleanQuery, which has disableCoord=false, has 22
clauses. All but three are ordinary SHOULD TermQueries; the remainder
are a SpanNear and a nested BooleanQuery, and an empty PhraseQuery
(that's a bug).
> BooleanQuery bq = new BooleanQuery(false);
> bq.set*Maximum*NumberShouldMatch(1);
>
> Is there a good way to accomplish this?
Uwe and Robert,
Thanks. David and I are two peas in one pod here at Basis.
--benson
On Fri, Apr 20, 2012 at 2:33 AM, Uwe Schindler wrote:
> Hi,
>
> Ah sorry, I misunderstood, you wanted to score the duplicate match lower! To
> achieve this, you have to change the coord function in your
> similarity …
I'm failing to find advice in MIGRATE.txt on how to replace 'new
Payload(...)' in migrating to 4.0. What am I missing?
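For the archive: my understanding is that the Payload class is gone in 4.0, and payloads are plain BytesRef values set through PayloadAttribute in a token filter, something like:

    PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);
    payAtt.setPayload(new BytesRef(bytes)); // 'bytes' is whatever byte[] you used to wrap in Payload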
Our Solr 3.x code used init(ResourceLoader) and then called the loader to
read a file.
What's the new approach to reading content from files in the 'usual place'?
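If I've understood the answer, the new shape is roughly this (a sketch against the 4.x analysis-common factories; plumbing varies a bit across 4.x minors, and the factory and file name here are mine):

    public class MyFilterFactory extends TokenFilterFactory
        implements ResourceLoaderAware {
      private CharArraySet words;

      @Override
      public void inform(ResourceLoader loader) throws IOException {
        // Called after construction; read files from the 'usual place' here.
        words = getWordSet(loader, "mywords.txt", true);
      }

      @Override
      public TokenStream create(TokenStream input) {
        return new MyTokenFilter(input, words); // hypothetical filter
      }
    }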
That's what I meant, thanks.
I'm confused. Isn't inform/ResourceLoader deprecated? But your example uses
it?
I'm close to the bottom of my list here.
I've got an Analyzer that, in 3.1, set up a CharFilter in the tokenStream
method. So now I have to migrate that to createComponents. Can someone give
me a shove in the right direction?
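For the record, the shove, sketched (4.x: CharFilters move into initReader, the chain into createComponents; the mapping filter and normMap stand in for my actual components):

    Analyzer a = new Analyzer() {
      @Override
      protected Reader initReader(String fieldName, Reader reader) {
        return new MappingCharFilter(normMap, reader); // normMap assumed built elsewhere
      }

      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_40, reader);
        TokenStream result = new LowerCaseFilter(Version.LUCENE_40, source);
        return new TokenStreamComponents(source, result);
      }
    };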
On Wed, Aug 29, 2012 at 10:30 AM, Robert Muir wrote:
> Where is it deprecated? What does the …
Hang on:
[deprecation] org.apache.solr.util.plugin.ResourceLoaderAware in
org.apache.solr.util.plugin has been deprecated
On Wed, Aug 29, 2012 at 10:42 AM, Robert Muir wrote:
> Right and what does the @deprecated message say :)
>
Yes, indeed, sorry. I got caught in a maze of twisty passages and my brain
turned off. I'm better now.
I've read the javadoc through a few times, but I confess that I'm still
feeling dense.
Are all tokenizers responsible for implementing some way of retaining the
contents of their reader, so that a call to reset without a call to
setReader rewinds? I note that CharTokenizer doesn't implement #reset
> setReader(Reader) is only on Tokenizer; it means replace the Reader
> with a different one to be processed.
> The fact that CharTokenizer is doing 'reset()-like-stuff' in here is
> bogus IMO, but I don't think it will cause any bugs. Don't emulate it
> :)
Some interlinear commentary on the doc.
* Resets this stream to the beginning.
To me this implies a rewind. As previously noted, I don't see how this
works for the existing implementations.
* As all TokenStreams must be reusable,
* any implementations which have state that needs to be reset …
I think I'm beginning to get the idea. Is the following plausible?
At the bottom of the stack, there's an actual source of data -- like a
tokenizer. For one of those, reset() is a bit silly, and something like
setReader is the brains of the operation.
Some number of other components may be stacked …
If I'm following, you've created a division of labor between setReader and
reset.
We have a tokenizer that has a good deal of state, since it has to split
the input into chunks. If I'm following here, you'd recommend that we do
nothing special in setReader, but have #reset fix up all the state on …
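In other words, a sketch of the division of labor as I now understand it (the tokenizer and its chunk state are hypothetical):

    public final class ChunkingTokenizer extends Tokenizer {
      private List<String> pendingChunks; // per-stream state

      protected ChunkingTokenizer(Reader input) {
        super(input);
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        pendingChunks = null; // rebuild from the (possibly new) reader on demand
      }

      @Override
      public boolean incrementToken() throws IOException {
        // split the input into chunks lazily, emit tokens ...
        return false;
      }
    }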
On Thu, Sep 6, 2012 at 1:59 PM, Robert Muir wrote:
> Thanks for reporting this Mark.
>
> I think it was not intended to have actual null characters here (or
> probably anywhere in javadocs).
>
> Our javadocs checkers should be failing on stuff like this...
>
> On Thu, Sep 6, 2012 at 1:52 PM, Mark
This useful-looking item is in the test-framework jar. Is there some subtle
reason that it isn't in the common analyzer jar? Some reason why I'd regret
using it?
I'm trying to work through the logic of reading ahead until I've seen
marker for the end of a sentence, then applying some analysis to all of the
tokens of the sentence, and then changing some attributes of each token to
reflect the results.
The queue of tokens for a position is just a State, so t
I'm confused by the comment about compound components here.
If a single token fissions into multiple tokens, then what belongs in
the PositionLengthAttribute. I'm wanting to store a fraction in here!
Or is the idea to store N in the 'mother' token and then '1' in each
of the babies?
Something simple:

    public boolean incrementToken() throws IOException {
      if (positions.getMaxPos() < 0) {
        peekSentence();   // buffer the whole sentence once, up front
      }
      return nextToken(); // then replay tokens from the buffer
    }
… not use it.
On Fri, Sep 6, 2013 at 9:10 PM, Benson Margulies wrote:
> Michael,
>
> I'm apparently not fully deconfused yet.
>
> I've got a very simple incrementToken function. It calls peekToken to
> stack up the tokens.
>
> afterPosition is never called; I expected …
On Fri, Sep 6, 2013 at 9:28 PM, Robert Muir wrote:
> It's the latter. The way it's designed to work, I think, is illustrated
> best in the kuromoji analyzer, where it heuristically decompounds nouns:
>
> if it decompounds ABCD into AB + CD, then the tokens are AB and CD.
> These both have posinc=1.
> Howev…
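Concretely, for ABCD -> AB + CD with the compound kept, the 'N in the mother, 1 in the babies' scheme works out to this (a sketch of the attribute settings, not kuromoji's actual code):

    PositionIncrementAttribute posIncAtt =
        addAttribute(PositionIncrementAttribute.class);
    PositionLengthAttribute posLenAtt =
        addAttribute(PositionLengthAttribute.class);
    // emit AB:   posIncAtt.setPositionIncrement(1); posLenAtt.setPositionLength(1);
    // emit ABCD: posIncAtt.setPositionIncrement(0); posLenAtt.setPositionLength(2); // spans both parts
    // emit CD:   posIncAtt.setPositionIncrement(1); posLenAtt.setPositionLength(1);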
… the offset in the original that might as well be blamed for any given
component.
On Sat, Sep 7, 2013 at 8:39 AM, Robert Muir wrote:
> On Sat, Sep 7, 2013 at 7:44 AM, Benson Margulies wrote:
>> In Japanese, compounds are just decompositions of the input string. In
>> other languages, compounds can manufacture entire tokens from thin
>> air. In those cases …
nextToken() calls peekToken(). That seems to prevent my lookahead
processing from seeing that item later. Am I missing something?
On Fri, Sep 6, 2013 at 9:15 PM, Benson Margulies wrote:
> I think that the penny just dropped, and I should not be using this class.
>
> If I call peekToken
> … of Position), then nextToken()
> to emit the buffered tokens, and to insert your own tokens when
> afterPosition() is called ...
… thanks!
> On Sat, Sep 7, 2013 at 3:40 PM, Benson Margulies wrote:
>> I think I had better build you a test case for this situation, and
>> attach it to a JIRA.
> … a normalizing CharFilter should be useful: there is
> a JIRA for it, but it has some unresolved issues:
>
> https://issues.apache.org/jira/browse/LUCENE-4072
Can anyone shed light as to why this is a token filter and not a char
filter? I'm wishing for one of these _upstream_ of a tokenizer, so that the
tokenizer's lookups in its dictionaries are seeing normalized contents.
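Meanwhile, the closest stand-in I can see is a MappingCharFilter doing fixed folds ahead of the tokenizer (a sketch; the fold is an example, and an ICU normalizing CharFilter is what LUCENE-4072 proposes):

    NormalizeCharMap.Builder b = new NormalizeCharMap.Builder();
    b.add("ﬁ", "fi"); // fold the ligature before the tokenizer's dictionary sees it
    final NormalizeCharMap map = b.build();

    Analyzer a = new Analyzer() {
      @Override
      protected Reader initReader(String fieldName, Reader reader) {
        return new MappingCharFilter(map, reader);
      }
      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        return new TokenStreamComponents(new StandardTokenizer(Version.LUCENE_44, reader));
      }
    };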
The multithreaded index searcher fans out across segments. How aggressively
does 'optimize' reduce the number of segments? If the segment count goes
way down, is there some other way to exploit multiple cores?
>> … the segment
>> structure.
>>
>> But then again, this need (using concurrent hardware to reduce latency
>> of a single query) is somewhat rare; most apps are fine using the
>> concurrency across queries rather than within one query.
Is there some advice around about when it's appropriate to create an
Analyzer class, as opposed to just Tokenizer and TokenFilter classes?
The advantage of the constituent elements is that they allow the
consuming application to add more filters. The only disadvantage I see
is that the following is …
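To make the trade concrete, the component route looks roughly like this (a sketch; the version constant and the particular filters are arbitrary choices): the consumer owns the chain and can append whatever it likes, which a packaged Analyzer would prevent.

    Tokenizer tok = new StandardTokenizer(Version.LUCENE_44, reader);
    TokenStream stream = new LowerCaseFilter(Version.LUCENE_44, tok);
    stream = new StopFilter(Version.LUCENE_44, stream,
        StandardAnalyzer.STOP_WORDS_SET);
    // the consuming application tacks on one more of its own:
    stream = new PorterStemFilter(stream);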
Consider a Lucene index consisting of 10m documents with a total disk
footprint of 3G. Consider an application that treats this index as
read-only, and runs very complex queries over it. Queries with many terms,
some of them 'fuzzy' and 'should' terms and a dismax. And, finally,
consider doing all …
Oh, drat, I left out an 's'. I got it now.
On Tue, Oct 8, 2013 at 7:40 PM, Benson Margulies wrote:
> Mike, where do I find DirectPostingFormat?
>
… provide a codec that returns it as the postings guy; is that the whole
recipe? Does it make sense to extend it any further to any of the other
codec pieces?
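My reading of the recipe, sketched (4.5-era APIs; the analyzer is assumed):

    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
    iwc.setCodec(new Lucene45Codec() {
      @Override
      public PostingsFormat getPostingsFormatForField(String field) {
        return PostingsFormat.forName("Direct"); // DirectPostingsFormat: postings held in RAM
      }
    });

The other codec pieces stay default; the codecs jar just has to be on the classpath at read time too, so the SPI lookup finds "Direct".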
It might be helpful if you would explain, at a higher level, what you
are trying to accomplish. Where do these things come from? What
higher-level problem are you trying to solve?
On Sun, Oct 20, 2013 at 7:12 PM, saisantoshi wrote:
> Thanks.
>
> So, if I understand correctly, StandardAnalyzer won't …
I'm working on a tool that wants to construct analyzers 'at arms length' -- a
bit like from a Solr schema -- so that multiple dueling analyzers could be
in their own class loaders at one time. I want to just define a simple
configuration for char filters, tokenizer, and token filters. So it would
be, …
OK, so, here I go again making a public idiot of myself. Could it be that
the tokenizer factory is 'relatively recent' as in since 4.1?
On Mon, Oct 28, 2013 at 7:39 AM, Benson Margulies wrote:
> I'm working on tool that wants to construct analyzers 'at arms length'
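The SPI route, sketched against the 4.4+ factory signatures (the names come from the SPI registry; luceneMatchVersion is required by many factories):

    Map<String, String> targs = new HashMap<String, String>();
    targs.put("luceneMatchVersion", "LUCENE_45");
    TokenizerFactory tf = TokenizerFactory.forName("standard", targs);

    Map<String, String> fargs = new HashMap<String, String>();
    fargs.put("luceneMatchVersion", "LUCENE_45");
    TokenFilterFactory ff = TokenFilterFactory.forName("lowercase", fargs);

    Tokenizer tok = tf.create(reader);
    TokenStream stream = ff.create(tok);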
> … They are all in Lucene's analyzers-common module
> (since 4.0). They are no longer part of Solr.
… problematic. I don't suppose there are some guidelines?
On Mon, Oct 28, 2013 at 9:43 AM, Benson Margulies wrote:
> Just how 'experimental' is the SPI system at this point, if that's a
> reasonable question?
I just built myself a sort of Solr-schema-in-a-test-tube. It's a class that
builds a classloader on some JAR files and then uses the SPI mechanism to
manufacture Analyzer objects made out of tokenizers and filters.
I can make this visible in github, or even attach it to a JIRA, if anyone
is interested …
My token filter has no end() method at all. Am I required to have an end()
method?
BaseLinguisticsTokenFilterTest.testSegmentationReadings:175->Assert.assertTrue:41->Assert.fail:88
super.end()/clearAttributes() was not called correctly in end()
BaseLinguisticsTokenFilterTest.testSpacesInLemma:1
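For reference, the pattern the 4.5.1 consistency check wants when a filter does override end(), as I understand it (a sketch):

    @Override
    public void end() throws IOException {
      super.end(); // let the rest of the chain set the final offset state first
      // then adjust any attributes that apply after the last token
    }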
I just backported some code to 3.6.0, and it includes tests that use
org.apache.lucene.analysis.BaseTokenStreamTestCase#checkRandomData(java.util.Random,
org.apache.lucene.analysis.Analyzer, int, int)
The tests that use this method fail in 3.6.0 in ways that suggest that
multiple threads are hitting …
How would you expect to recognize that 'Toy Story' is a thing?
On Tue, Nov 5, 2013 at 6:32 PM, Kevin wrote:
> Currently I'm using StandardTokenizerFactory which tokenizes the words
> based on spaces. For Toy Story it will create tokens toy and story.
> Ideally, I would want to extend the functi…
There are a handful of binary files
in ./src/resources/org/apache/lucene/analysis/ja/dict/ with filenames
ending in .dat.
Trawling around in the source, it seems as if at least one of these derives
from a source file named "unk.def". In turn, this file comes from a
dependency. Should the build ge…
> … stored in the .dat file. See also the ivy.xml.
… from git, not from the official release,
so I don't know.
In 4.6.0, org.apache.lucene.analysis.BaseTokenStreamTestCase#checkResetException
fails if incrementToken fails to throw if there's a missing reset.
How am I supposed to organize this in a Tokenizer? A quick look at
CharTokenizer did not reveal any code for the purpose.
> … see Tokenizer.java for the state machine logic. In general you should
> not have to do anything if the tokenizer is well-behaved (e.g. close
> calls super.close() and so on).
> … purpose. I think it's confusing and contributes to bugs that you have
> to have logic in e.g. the ctor THEN ALSO in reset().
>
> If someone does it correctly in the ctor, but they only test "one
> time", they might think everything is working...
> … Item 16).
>
> Hope this helps somebody.
>
> [1]
> http://stackoverflow.com/questions/20624339/having-trouble-rereading-a-lucene-tokenstream/20630673#20630673
>
> Regards,
> Mindaugas
> Sure, why not - I'm just not sure if my approach (of setting reader in
> reset()) is preferred over yours (using this.input instead of input in
> ctor)? Or are they both equally good?
If you are sensitive to things being committed to trunk, that suggests that
you are building your own jars and using the trunk. Are you perfectly sure
that you have built, and are using, a consistent set of jars? It looks as
if you've got some trunk-y stuff and some 4.6.1 stuff.
It sounds like you've been asked to implement Named Entity Recognition.
OpenNLP has some capability here. There are also, um, commercial
alternatives.