Another good reference is this one: http://unicode.org/reports/tr29/
Since the latest Lucene uses this as the basis of its text
segmentation, it's worth getting familiar with it.
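
For a concrete feel for what UAX#29 segmentation does, here is a minimal
sketch using ICU4J's BreakIterator, which implements the UAX#29 default
word-break rules; the sample string and class name are just illustrative:

import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.RuleBasedBreakIterator;
import com.ibm.icu.util.ULocale;

public class Uax29WordBreakDemo {
    public static void main(String[] args) {
        // Two Tibetan letters, a tsheg (U+0F0B), and a gter tsheg (U+0F14)
        String text = "\u0F40\u0F41\u0F0B\u0F14";
        RuleBasedBreakIterator bi =
            (RuleBasedBreakIterator) BreakIterator.getWordInstance(ULocale.ROOT);
        bi.setText(text);
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE;
                start = end, end = bi.next()) {
            // Rule status WORD_NONE marks segments (punctuation, symbols,
            // whitespace) that a word tokenizer would discard.
            boolean isWord = bi.getRuleStatus() != BreakIterator.WORD_NONE;
            System.out.printf("[%s] %s%n", text.substring(start, end),
                isWord ? "word" : "non-word");
        }
    }
}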
On Fri, Mar 30, 2012 at 1:03 PM, Denis Brodeur wrote:
> Thanks Robert. That makes sense. Do you have a link handy where I can
> find this information? i.e. word boundary/punctuation for any unicode
> character set?
>
yeah, usually i use
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\u0
fileformat.info
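
If you'd rather check the same property data from Java than from the web
utilities, the JDK exposes the Unicode general category tables directly
through Character.getType(). A small sketch (the class name is illustrative,
and the output depends on the Unicode version your JDK ships):

public class TibetanCategoryDump {
    public static void main(String[] args) {
        // Dump the Unicode general category of U+0F10..U+0F19
        for (int cp = 0x0F10; cp <= 0x0F19; cp++) {
            String label;
            switch (Character.getType(cp)) {
                case Character.OTHER_PUNCTUATION: label = "Po (other punctuation)"; break;
                case Character.OTHER_SYMBOL:      label = "So (other symbol)";      break;
                case Character.NON_SPACING_MARK:  label = "Mn (nonspacing mark)";   break;
                default:                          label = "other category";         break;
            }
            System.out.printf("U+%04X  %s%n", cp, label);
        }
    }
}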
Thanks Robert. That makes sense. Do you have a link handy where I can
find this information? i.e. word boundary/punctuation for any unicode
character set?
On Fri, Mar 30, 2012 at 12:57 PM, Robert Muir wrote:
> unicode doesn't consider most of these characters part of a word: most
> are punctuation and symbols
On Fri, Mar 30, 2012 at 12:46 PM, Denis Brodeur wrote:
> Hello, I'm currently working out some problems when searching for Tibetan
> characters, more specifically \u0F10-\u0F19.

unicode doesn't consider most of these characters part of a word: most
are punctuation and symbols
Hello, I'm currently working out some problems when searching for Tibetan
characters, more specifically \u0F10-\u0F19. We are using the
StandardAnalyzer (3.4) and I've narrowed the problem down to
StandardTokenizerImpl throwing away these characters, i.e. in
getNextToken() it falls through to case 1.
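
A minimal sketch of how the reported behavior can be observed with the
Lucene 3.x analysis API (field name and sample text are illustrative;
assumes lucene-core 3.4 on the classpath):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TibetanTokenDemo {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
        // Tibetan letters plus marks from the \u0F10-\u0F19 range under discussion
        String text = "\u0F40\u0F41\u0F0B\u0F14\u0F15";
        TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println("token: [" + term + "]");
        }
        ts.end();
        ts.close();
        // If the diagnosis above holds, the marks in \u0F10-\u0F19 never
        // appear in the token output: the tokenizer drops them as non-word
        // characters rather than emitting them as terms.
    }
}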