Hi Jim,

Thanks again for the help.

So are you suggesting that when I read the file I need to decode it as
UTF-16 if I detect the BOM? (The bytes FE FF mean UTF-16, big-endian.)
Should I then also throw in a lexer rule that skips the character, and
make sure that any line in my file can accept it first?

BOM : '\uFEFF' { skip(); };

On Tue, May 05, 2009 at 10:11:01AM -0700, Jim Idle wrote:
> ian eyberg wrote:
> > Hi,
> >
> > someone has sent me a file to parse and there are all sorts of
> > '<feff>' characters in them in arbitrary spots -- looking it up
> > online it appears it's some sort of character to indicate what
> > encoding the strings are -- '(bom) byte order mark'
> >
> > my question -- what should I do with these? should I accept that
> > some files are going to have these and convert them to spaces as a
> > sort of pre-processor or should I take the easy way out and say
> > "we don't support this" ;)
> >
> > the person handing me the file says he never opened it in a text
> > editor and it was a piece of software on an OS X box
> >
> > maybe if I detect a bom in one of my documents I can convert the
> > entire file to the appropriate encoding first??
> >
> > thanks,
> >
> >
> You don't say what target language you are going to use, but if you open
> the file using the correct encoding, then I believe it will take care
> of the BOM for you. The BOM is optional but it indicates if the string
> that follows is Big Endian or Little Endian, which is important when
> reading UCS2 and similar character encodings (of more than one byte).
> You can't ignore it if the machine you are parsing on has a different
> ordering than the one where the file was created.
>
> See: http://unicode.org/faq/utf_bom.html for more information than you
> could possibly want to know about the BOM.
>
> Jim
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe:
> http://www.antlr.org/mailman/options/antlr-interest/your-email-address

-- 
ian eyberg
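
For reference, here is a minimal sketch of the "open the file using the correct encoding" idea discussed above. It assumes a Java target, which the thread never confirms. The class name BomDecoder, the UTF-8 fallback when no BOM is present, and the choice to drop stray U+FEFF characters found later in the text are all assumptions made for illustration. UTF-32 BOMs are not handled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ANTLR: picks a charset from a leading BOM.
final class BomDecoder {

    /** Decodes raw file bytes; falls back to UTF-8 if no BOM is found (an assumption). */
    static String decode(byte[] b) {
        Charset cs = StandardCharsets.UTF_8;
        int skip = 0;
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            skip = 3;                         // UTF-8 BOM: EF BB BF
        } else if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            cs = StandardCharsets.UTF_16BE;   // FE FF: big-endian
            skip = 2;
        } else if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            cs = StandardCharsets.UTF_16LE;   // FF FE: little-endian
            skip = 2;
        }
        String text = new String(b, skip, b.length - skip, cs);
        // Drop any U+FEFF left in arbitrary spots, e.g. from concatenated files.
        return text.replace("\uFEFF", "");
    }
}
```

The decoded String could then be handed to the lexer (with ANTLR 3's Java runtime, probably through ANTLRStringStream), in which case a BOM skip rule in the grammar would only be a second line of defence.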