Re: Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Brandon Mintern
Another good reference is this one: http://unicode.org/reports/tr29/ (UAX #29, Unicode Text Segmentation). Since the latest Lucene uses it as the basis of its text segmentation, it's worth getting familiar with it.
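For checking the UAX #29 Word_Break property of these code points programmatically rather than in the browser, a minimal sketch along these lines should work, assuming ICU4J is on the classpath (the ICU4J dependency and the class name are assumptions for illustration, not something from this thread):

    import com.ibm.icu.lang.UCharacter;
    import com.ibm.icu.lang.UProperty;

    // Prints the UAX #29 Word_Break property value and the character name
    // for each code point in the range U+0F10..U+0F19 discussed below.
    public class TibetanWordBreak {
        public static void main(String[] args) {
            for (int cp = 0x0F10; cp <= 0x0F19; cp++) {
                int wb = UCharacter.getIntPropertyValue(cp, UProperty.WORD_BREAK);
                String wbName = UCharacter.getPropertyValueName(
                        UProperty.WORD_BREAK, wb, UProperty.NameChoice.LONG);
                System.out.printf("U+%04X  Word_Break=%-12s %s%n",
                        cp, wbName, UCharacter.getName(cp));
            }
        }
    }

This makes it easy to cross-check, character by character, why a UAX #29 word segmenter does or does not treat a given code point as part of a word.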

Re: Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Robert Muir
On Fri, Mar 30, 2012 at 1:03 PM, Denis Brodeur wrote:
> Thanks Robert. That makes sense. Do you have a link handy where I can
> find this information? i.e. word boundary/punctuation for any unicode
> character set?

Yeah, usually I use http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\u0

Re: Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Benson Margulies
fileformat.info

Re: Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Denis Brodeur
Thanks Robert. That makes sense. Do you have a link handy where I can find this information? i.e. word boundary/punctuation for any unicode character set?

Re: Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Robert Muir
Unicode doesn't consider most of these characters part of a word: most are punctuation and symbols.
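A quick way to see those categories without any external tools is to print the JDK's Unicode general category for each code point in the range. This is only an illustrative sketch (the class name is made up) using the plain standard library:

    // Prints the Unicode general category the JDK reports for each code
    // point in U+0F10..U+0F19; per the reply above, most are punctuation
    // and symbols rather than letters or digits.
    public class TibetanCategories {
        public static void main(String[] args) {
            for (int cp = 0x0F10; cp <= 0x0F19; cp++) {
                int type = Character.getType(cp);
                String category;
                switch (type) {
                    case Character.OTHER_PUNCTUATION:    category = "Po (other punctuation)"; break;
                    case Character.OTHER_SYMBOL:         category = "So (other symbol)";      break;
                    case Character.NON_SPACING_MARK:     category = "Mn (non-spacing mark)";  break;
                    case Character.DECIMAL_DIGIT_NUMBER: category = "Nd (decimal digit)";     break;
                    default:                             category = "general category " + type; break;
                }
                System.out.printf("U+%04X  %s%n", cp, category);
            }
        }
    }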

Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Denis Brodeur
Hello, I'm currently working out some problems when searching for Tibetan characters, more specifically \u0f10-\u0f19. We are using the StandardAnalyzer (3.4), and I've narrowed the problem down to StandardTokenizerImpl throwing away these characters, i.e. in getNextToken() it falls through case 1:
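The behaviour is easy to reproduce in isolation. The sketch below is only illustrative and assumes Lucene 3.4 on the classpath; the class name, field name, and sample string are made up (a few Tibetan consonants separated by tsheg, U+0F0B, followed by some of the marks in question):

    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class TibetanTokens {
        public static void main(String[] args) throws Exception {
            // Made-up sample: the letters KA, KHA, GA separated by tsheg (U+0F0B),
            // followed by a few of the marks from U+0F10..U+0F19.
            String text = "\u0F40\u0F0B\u0F41\u0F0B\u0F42\u0F13\u0F14\u0F15";

            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
            TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // Print each emitted term as escaped code points so it is easy to
                // compare against the input and spot which characters were dropped.
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < term.length(); i++) {
                    sb.append(String.format("\\u%04X", (int) term.charAt(i)));
                }
                System.out.println("token: " + sb);
            }
            ts.end();
            ts.close();
            analyzer.close();
        }
    }

Comparing the printed terms with the input should show which characters in the range never make it into a token, which matches the explanation in the replies above.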