Re: Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Brandon Mintern
Another good reference is this one: http://unicode.org/reports/tr29/ Since the latest Lucene uses this for the basis of its text segmentation, it's worth getting familiar with it.
On Fri, Mar 30, 2012 at 10:09 AM, Robert Muir wrote:
> On Fri, Mar 30, 2012 at 1:03 PM, Denis Brodeur wrote:
>> Tha

Re: Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Robert Muir
On Fri, Mar 30, 2012 at 1:03 PM, Denis Brodeur wrote:
> Thanks Robert. That makes sense. Do you have a link handy where I can
> find this information? i.e. word boundary/punctuation for any unicode
> character set?
>
yeah, usually i use http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\u0

Re: Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Benson Margulies
fileformat.info
On Mar 30, 2012, at 1:04 PM, Denis Brodeur wrote:
> Thanks Robert. That makes sense. Do you have a link handy where I can
> find this information? i.e. word boundary/punctuation for any unicode
> character set?
>
> On Fri, Mar 30, 2012 at 12:57 PM, Robert Muir wrote:
>
>> On F

Re: Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Denis Brodeur
Thanks Robert. That makes sense. Do you have a link handy where I can find this information? i.e. word boundary/punctuation for any unicode character set?
On Fri, Mar 30, 2012 at 12:57 PM, Robert Muir wrote:
> On Fri, Mar 30, 2012 at 12:46 PM, Denis Brodeur wrote:
> > Hello, I'm currently w

Re: Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Robert Muir
On Fri, Mar 30, 2012 at 12:46 PM, Denis Brodeur wrote:
> Hello, I'm currently working out some problems when searching for Tibetan
> Characters. More specifically: /u0f10-/u0f19. We are using the [...]
Unicode doesn't consider most of these characters part of a word: most are punctuation and symbols
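The point above, that most of these code points are punctuation or symbols, can be checked roughly with Python's standard `unicodedata` module. This is only a sketch: it looks at the Unicode general category, which is an approximation of (not the same thing as) the UAX #29 word-break rules that Lucene's tokenizer is based on. The names `TIBETAN_RANGE`, `describe`, `rows`, and `word_like` are made up for this illustration.

```python
import unicodedata

# Code points from the original question: U+0F10 through U+0F19 inclusive.
TIBETAN_RANGE = range(0x0F10, 0x0F1A)


def describe(code_point):
    """Return (label, general category, character name) for one code point."""
    ch = chr(code_point)
    return (
        f"U+{code_point:04X}",
        unicodedata.category(ch),
        unicodedata.name(ch, "<unnamed>"),
    )


rows = [describe(cp) for cp in TIBETAN_RANGE]
for label, category, name in rows:
    print(label, category, name)

# Categories starting with L (letter) or N (number) are the ones a
# word-oriented tokenizer would normally keep as token content.
word_like = [label for label, category, _ in rows if category[0] in ("L", "N")]
print("letter/number code points in range:", word_like)
```

Categories beginning with `P` are punctuation, `S` symbols, and `M` combining marks; an empty `word_like` list would be consistent with the tokenizer not treating these characters as words on their own.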

Problems Indexing/Parsing Tibetan Text

2012-03-30 Thread Denis Brodeur
Hello, I'm currently working out some problems when searching for Tibetan Characters. More specifically: /u0f10-/u0f19. We are using the StandardAnalyzer (3.4) and I've narrowed the problem down to StandardTokenizerImpl throwing away these characters i.e. in getNextToken(), falls through case1:
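The symptom described here can be reproduced in miniature outside Lucene. The sketch below is not StandardTokenizerImpl and does not implement UAX #29; it is a deliberately simple stand-in tokenizer (the hypothetical `simple_tokens`) that keeps only letters, numbers, and combining marks, to show how a character such as U+0F10 can vanish from the token stream and so never be searchable. The sample string is invented for the illustration.

```python
import unicodedata


def simple_tokens(text):
    """Split text into runs of letters, numbers, and combining marks.

    Everything else (punctuation, symbols, whitespace) acts as a
    separator and is discarded. This is a stand-in, not Lucene's logic.
    """
    tokens, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in ("L", "N", "M"):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# Three Tibetan letters separated by U+0F0B (tsheg) and U+0F10.
sample = "\u0f40\u0f0b\u0f41\u0f10\u0f42"
tokens = simple_tokens(sample)
print([t.encode("unicode_escape").decode("ascii") for t in tokens])

# U+0F10 never reaches the index, so a query for it matches nothing.
print("\u0f10" in "".join(tokens))
```

If these marks need to be searchable, one option would be an analyzer that treats them as token characters rather than separators; which Lucene component suits that is left open here.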

RE: PyLucene Error Message

2012-03-30 Thread David Mosca
I have added the wait but the script still crashes from time to time. (I noticed that the value of self.jvm.attachCurrentThread() is always 0, i.e. the script always enters the while loop only once).
thanks
-----Original Message-----
From: Greg Bowyer [mailto:gbow...@shopzilla.com]
Sent: 29 Mar

Surge 2012 CFP is Open!

2012-03-30 Thread Katherine Jeschke
Surge 2012, the scalability conference, September 27-28, Baltimore, MD has opened its CFP. Please visit http://omniti.com/surge/2012/cfp for details. -- Katherine Jeschke Director of Marketing and Creative Services OmniTI Computer Consulting, Inc. 7070 Samuel Morse Drive, Ste.150 Columbia, MD 210