I still think that we're looking at an "XY Problem" here, haggling over a
"solution" when the problem has not been clearly and fully stated.
In particular, rather than parsing straight natural language text, the data
appears to have a structured form. Until the structure is fully defined,
detailing a parser, especially by playing games such as "injecting spaces",
is an exercise in futility. I mean, you MIGHT come up with a solution that
SEEMS to work (at least for SOME cases), and MAY make you happy, but I would
hate to see other Lucene users adopt such an approach to problem solving.
Tell us the full problem and then we can focus on legitimate "solutions".
-- Jack Krupansky
-----Original Message-----
From: Erick Erickson
Sent: Sunday, November 04, 2012 8:06 AM
To: java-user
Subject: Re: using CharFilter to inject a space
Ahh, I don't know of a better way. I can imagine complex solutions
involving something akin to WordDelimiterFilter... and I can imagine that
that would be ridiculously expensive to maintain when there are really
simple solutions like you're looking at.
Mostly I was curious about your use-case....
Erick
On Sat, Nov 3, 2012 at 11:35 PM, Igal @ getRailo.org <i...@getrailo.org> wrote:
Well, my main goal is to use a ShingleFilter that only produces shingles
whose words are not separated by commas etc.
For example, the phrase:
"red apples, green tomatoes, and brown potatoes"
should yield the shingles "red apples", "green tomatoes", "and brown",
"brown potatoes"; but not "apples green" and not "tomatoes and", as those
are separated by commas.
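To make the expected output concrete, here is a minimal plain-Java sketch of the rule described above. It is only an illustration: it does not use Lucene at all, the class and method names are made up for this example, and it assumes that commas are the only boundary characters and that words are separated by whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-JDK illustration only; this is NOT Lucene's ShingleFilter.
// The class name and method name are hypothetical.
final class CommaBoundedShingles {

    // Returns two-word shingles that never span a comma.
    static List<String> shingles(String text) {
        List<String> result = new ArrayList<>();
        // Each comma-separated segment is shingled on its own,
        // so no shingle can cross a comma.
        for (String segment : text.split(",")) {
            String trimmed = segment.trim();
            if (trimmed.isEmpty()) {
                continue; // nothing between two commas
            }
            String[] words = trimmed.split("\\s+");
            for (int i = 0; i + 1 < words.length; i++) {
                result.add(words[i] + " " + words[i + 1]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String phrase = "red apples, green tomatoes, and brown potatoes";
        for (String shingle : shingles(phrase)) {
            System.out.println(shingle);
        }
    }
}
```

For the phrase above this prints "red apples", "green tomatoes", "and brown" and "brown potatoes", one per line, and never "apples green" or "tomatoes and".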
The problem with the common tokenizers is that they discard the commas,
so a ShingleFilter placed after them has no way to tell whether there was
a comma there or not.
(Another option I'm considering is to add an Attribute that records whether
there was a comma before or after a token.)
If there's a better way, I'm open to suggestions.
Igal
On 11/3/2012 8:10 PM, Erick Erickson wrote:
So I've gotta ask... _why_ do you want to inject the spaces?
If it's just to break this up into tokens, wouldn't something like
LetterTokenizer do, assuming you don't need to keep numbers?
Or even StandardTokenizer, unless you have e-mail addresses and the like.
Or what about PatternReplaceCharFilter?
FWIW,
Erick
On Sat, Nov 3, 2012 at 9:22 PM, Igal Sapir <i...@getrailo.org> wrote:
You're right. I'm not sure what I was thinking.
Thanks for all your help,
Igal
On Nov 3, 2012 5:44 PM, "Robert Muir" <rcm...@gmail.com> wrote:
On Sat, Nov 3, 2012 at 8:32 PM, Igal @ getRailo.org <i...@getrailo.org> wrote:
hi Robert,
thank you for your replies.
I couldn't find much documentation/examples of this, but this is what I
came up with (below). Is that the way I'm supposed to use the
MappingCharFilter?
You don't need to extend anything.
You also don't want to create a NormalizeCharMap for each reader
(that's way too heavy).
Just build the NormalizeCharMap once, and pass it to
MappingCharFilter's constructor.
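A rough sketch of that advice, assuming the Lucene 4.x analyzers-common API (NormalizeCharMap.Builder and the MappingCharFilter(NormalizeCharMap, Reader) constructor). The class name, the helper methods, and the "," to " , " mapping are hypothetical; they stand in for whatever mapping is actually needed.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

// Hypothetical helper; assumes Lucene 4.x analyzers-common on the classpath.
final class CommaSpacing {

    // Built once and reused for every reader, instead of per reader.
    private static final NormalizeCharMap COMMA_MAP = buildMap();

    private static NormalizeCharMap buildMap() {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        // Hypothetical mapping: surround each comma with spaces.
        builder.add(",", " , ");
        return builder.build();
    }

    // Wraps a reader with the shared map; no subclassing needed.
    static Reader wrap(Reader in) {
        return new MappingCharFilter(COMMA_MAP, in);
    }

    // Small helper to drain a reader, handy for checking the result.
    static String readAll(Reader in) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = in.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In an Analyzer, a call like wrap(reader) would typically go in the initReader override, so that every field reader passes through the same prebuilt map.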
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org