Hi Erick
I think you've made some really important observations. Steven has
provided a good regular expression to help with word and non-word
characters. For the moment I have reverted to my analyser and am going to
try doing some clever pattern matching later. Also, I'll try using a
different analyser.
Hi Rahil,
Rahil wrote:
> I was just wondering whether there is a
> difference between the regular expression you sent me i.e.
> (i) \s*(?:\b|(?<=\S)(?=\s)|(?<=\s)(?=\S))\s*
>
>and
> (ii) \\b
>
> as they lead to the same output. For example, the string search "testing
> a-new string=3/4"
Hi Steve
Thanks for your response. I was just wondering whether there is a
difference between the regular expression you sent me i.e.
(i) \s*(?:\b|(?<=\S)(?=\s)|(?<=\s)(?=\S))\s*
and
(ii) \\b
as they lead to the same output. For example, the string search "testing
a-new string=3/4"
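For what it's worth, the two patterns don't always agree. A quick plain-Java check (class and method names are mine, outside any Lucene API) shows that a bare \b emits whitespace runs as tokens, while pattern (i) absorbs the surrounding whitespace; the outputs only coincide if whatever consumes the tokens discards whitespace-only and empty tokens:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class BoundarySplit {
    // (ii) bare word boundary
    static final Pattern WB = Pattern.compile("\\b");
    // (i) word boundary / whitespace transitions, surrounding whitespace absorbed
    static final Pattern FULL =
        Pattern.compile("\\s*(?:\\b|(?<=\\S)(?=\\s)|(?<=\\s)(?=\\S))\\s*");

    // Split on the pattern and drop the empty strings that
    // zero-width matches can produce.
    static List<String> tokens(Pattern p, String text) {
        List<String> out = new ArrayList<String>();
        for (String t : p.split(text)) {
            if (t.length() > 0) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens(WB, "foo bar"));   // [foo,  , bar]
        System.out.println(tokens(FULL, "foo bar")); // [foo, bar]
    }
}
```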
Steven Rowe wrote:
>\s*(?:\b|(?<=\S)(?=\s)|(?<=\s)(?=\S))\s*
Oops, here's an improved version to cover the beginning- and
end-of-string non-alphanumeric cases (E.g. "=some text-"):
\s*(?:\b|(?<=\S)(?=\s)|(?<=\s)(?=\S)|\A|\z)\s*
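A quick sanity check of the improved pattern against Rahil's example string, using plain Pattern.split (class name is mine). Note that the zero-width alternatives can make split produce empty strings, so they are filtered out here:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ClinicalTokenizer {
    // Steven's improved boundary pattern, as a Java string literal
    static final Pattern BOUNDARIES = Pattern.compile(
        "\\s*(?:\\b|(?<=\\S)(?=\\s)|(?<=\\s)(?=\\S)|\\A|\\z)\\s*");

    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<String>();
        for (String t : BOUNDARIES.split(text)) {
            // zero-width matches can yield empty strings; drop them
            if (t.length() > 0) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("O/E - visual acuity R-eye=6/24"));
        // [O, /, E, -, visual, acuity, R, -, eye, =, 6, /, 24]
    }
}
```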
Hi Rahil,
Rahil wrote:
> I couldn't figure out a valid regular expression for
> Pattern.compile(String regex) which can tokenise the string "O/E -
> visual acuity R-eye=6/24" into "O", "/", "E", "-", "visual", "acuity",
> "R", "-", "eye", "=", "6", "/", "24".
The following regular expression should come close:
\s*(?:\b|(?<=\S)(?=\s)|(?<=\s)(?=\S))\s*
My intuition is that you'll have a real problem using regular expressions.
It'll either be incredibly ugly (and unmaintainable) or just won't work,
since the regular expression tools tend to throw out the delimiters.
I think you'll be much better off writing your own analyzer (see LIA, the
synonym analyzer example).
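The core loop of such a tokenizer is small enough in plain Java. A minimal sketch (class and method names are mine; wrapping this in Lucene's Tokenizer/TokenStream API is version-dependent and left out):

```java
import java.util.ArrayList;
import java.util.List;

public class HandRolledTokenizer {
    // Groups runs of letters/digits into words; every other non-whitespace
    // character becomes a single-character token; whitespace is dropped.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<String>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
            } else {
                if (word.length() > 0) {          // flush the pending word
                    out.add(word.toString());
                    word.setLength(0);
                }
                if (!Character.isWhitespace(c)) { // punctuation stands alone
                    out.add(String.valueOf(c));
                }
            }
        }
        if (word.length() > 0) out.add(word.toString());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("O/E - visual acuity R-eye=6/24"));
        // [O, /, E, -, visual, acuity, R, -, eye, =, 6, /, 24]
    }
}
```

Unlike the regex route, the delimiters here are kept explicitly as tokens, which is exactly the part the split-based tools make awkward.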
Hi Erick
I'm having trouble writing a good regular expression for the
PatternAnalyzer to deal with word and non-word characters. I couldn't
figure out a valid regular expression for Pattern.compile(String regex)
which can tokenise the string "O/E - visual acuity R-eye=6/24" into "O",
"/", "E", "-", "visual", "acuity", "R", "-", "eye", "=", "6", "/", "24".
: I have a custom-built Analyzer where I tokenize all non-whitespace
: characters in the field "TERM" (which is the only field being
: tokenised).
: If I now query my index file for a term "6/12" for instance, I get back
: only ONE result
: instead of TWO. There is another token
Well, I'm not the greatest expert, but a quick look doesn't show me anything
obvious. But I have to ask: wouldn't WhitespaceAnalyzer work for you?
Although I don't remember whether WhitespaceAnalyzer lowercases or not.
It sure looks like you're getting reasonable results given how you're
tokenizing.
Hi Erick
Thanks for your response. There's a lot to chew on in your reply and I'm
looking at the suggestions you've made.
Yeah, I have Luke installed and have queried my index, but there isn't any
great explanation I'm getting out of it. A query for "6/12" is sent as
"TERM:6/12", which is quite
Most often, from what I've seen on this e-mail list, unexpected results are
because you're not indexing on the tokens you *think* you're indexing. Or
not searching on them. By that I mean that the analyzers you're using are
behaving in ways you don't expect.
That said, I think you're getting exactly what you asked for.
Hi
I have a custom-built Analyzer where I tokenize all non-whitespace
characters in the field "TERM" (which is the only field being
tokenised).
If I now query my index file for a term "6/12" for instance, I get back
only ONE result
SCORE  DESCRIPTION  STATUS  CONCEPTID