[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

Robert Muir (JIRA) Sun, 07 Nov 2010 09:20:33 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929376#action_12929376
 ]


Robert Muir commented on LUCENE-2167:
-------------------------------------

bq. So we're talking about two separate issues here: a) Lucene's default 
behavior; and b) Lucene's capabilities.

agreed!

bq. For a), you'll have a lot of 'splaining to do if you drop existing 
functionality (e.g. email and hostname "recognition" - where quotes indicate 
"bad" things, right? "Cool"!)

To me, recognizing hostnames is specific to what one application might want.
If you recognize www.facebook.com as a single token but my app wants to find it 
with a query of 'facebook', it can't.
Yet if we just stick to UAX#29 and a user who queries on www.facebook.com is 
unsatisfied with the results,
that user can always "refine" the query by searching on "www.facebook.com" 
(in double quotes) and get a PhraseQuery.
I think this is pretty intuitive and users are used to it... again, this is 
just about general defaults...
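
To make the trade-off above concrete, here is a minimal sketch in plain Java. 
It does not use Lucene and does not implement UAX#29: the two regex 
"tokenizers", the class name HostnameTokenDemo, and the helper methods are all 
hypothetical, and the plain word splitter only stands in for the default 
behavior argued for above. The point is just that a term query can only match 
a token that was actually indexed, while a quoted query can still be matched 
as a phrase over the smaller tokens.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HostnameTokenDemo {
    // Hypothetical hostname-aware tokenizer: keeps dotted names as one token.
    static final Pattern HOSTNAME_AWARE =
            Pattern.compile("[A-Za-z0-9]+(?:\\.[A-Za-z0-9]+)*");
    // Hypothetical plain word splitter: runs of letters and digits only.
    // This is an assumption for illustration, not the UAX#29 rule set.
    static final Pattern PLAIN_WORDS = Pattern.compile("[A-Za-z0-9]+");

    // Collect every match of the pattern as a lowercased token.
    static List<String> tokenize(Pattern pattern, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase());
        }
        return tokens;
    }

    // A term query matches only if the term equals an indexed token.
    static boolean termMatches(List<String> indexed, String term) {
        return indexed.contains(term.toLowerCase());
    }

    // A phrase query matches if its tokens occur consecutively, in order.
    static boolean phraseMatches(List<String> indexed, List<String> phrase) {
        return Collections.indexOfSubList(indexed, phrase) >= 0;
    }

    public static void main(String[] args) {
        String doc = "see www.facebook.com for details";
        List<String> aware = tokenize(HOSTNAME_AWARE, doc);
        List<String> plain = tokenize(PLAIN_WORDS, doc);

        // With the hostname kept whole, the term 'facebook' is not indexed.
        System.out.println(termMatches(aware, "facebook")); // false
        // With plain word splitting, 'facebook' is its own token.
        System.out.println(termMatches(plain, "facebook")); // true
        // The quoted form is analyzed the same way and matched as a phrase.
        List<String> phrase = tokenize(PLAIN_WORDS, "www.facebook.com");
        System.out.println(phraseMatches(plain, phrase));   // true
    }
}
```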

And again, hostnames are just an example: why do we recognize them and not 
filenames?
Yet a lot of people are happy being able to do 'partial filename' matching 
rather than matching the whole path...
Users who are unhappy with this 'default' behavior can use double quotes to 
refine their results.

And in both cases, apps that need something more specific can use a custom 
tokenizer.

bq.  Why not? Why shouldn't Lucene be a catch-all for "cool" linguistic stuff?

In this case I think analysis won't meet their needs anyway. A lot of people 
who want to recognize full URLs or proper names (Mike's example)
actually want to do this at 'document build' time and dump the extracted 
entities into a separate field, so they can do things like
facet on this field, or find other documents that refer to the same person. 
This is because they are trying to 'find structure in the unstructured',
but it starts to get complicated if we mix this problem with 'feature 
extraction', which is what I think analysis should be.
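
As a rough sketch of that 'document build' approach, again in plain Java 
rather than Lucene code: the entities are pulled out before indexing and 
stored in their own field, leaving the body text to the ordinary analyzer. 
The class name EntityFieldDemo, the field names "body" and "urls", and the 
deliberately naive URL pattern are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityFieldDemo {
    // Deliberately naive URL pattern, for illustration only.
    static final Pattern URL = Pattern.compile("https?://\\S+");

    // Build a toy "document": field name -> list of field values.
    static Map<String, List<String>> buildDocument(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("body", List.of(text)); // analyzed as ordinary text
        doc.put("urls", urls);          // extracted entities, usable for faceting
        return doc;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = buildDocument(
                "see https://issues.apache.org/jira/browse/LUCENE-2167 for details");
        System.out.println(doc.get("urls"));
    }
}
```

Keeping the extraction outside the tokenizer means the "urls" field can be 
faceted on or used to find related documents without changing how "body" is 
tokenized.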





> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1, 4.0
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, 
> LUCENE-2167-lucene-buildhelper-maven-plugin.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, 
> LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, 
> LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the 
> standard as much as we can with jflex. Then its name would actually make 
> sense.
> Such a transition would involve renaming the old StandardTokenizer to 
> EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the English/Euro-centric stuff like the acronym/company/apostrophe stuff 
> can stay with that EuropeanTokenizer, and it could be used by the European 
> analyzers.
