Hola Juan,

On 08/21/2008 at 1:16 PM, Juan Pablo Morales wrote:

> I have an index in Spanish and I use Snowball to stem and
> analyze and it works perfectly. However, I am running into
> trouble storing (not indexing, only storing) words that
> have special characters.
>
> That is, I store the special character, but then it comes
> back garbled when I read it. To provide an example:
>
> String content = "niños";
> doc.add(new Field("name", content, Store.YES, Index.Tokenized));
> writer.addDocument(doc, new SnowballAnalyzer("Spanish"));
If your source code is encoded as Latin-1, the character will likely look correct to you (depending on the editor/viewer you're using and how it's configured), but Java may not convert it to Unicode properly; that depends on the encoding it expects your source code to be in (see the -encoding option to javac; if you don't specify it, the platform default encoding is used). You can test whether this is the problem by writing the character as a Unicode escape instead:

    String content = "ni\u00F1os";

> Looking at the index with Luke it shows me "ni�os" but
> when I want to see the full text (by right clicking) it shows
> me ni os.

� is the Unicode replacement character (U+FFFD). It is routinely used, including within Lucene itself, as the substitute for byte sequences that are not valid in the designated source encoding.

> I know Lucene is supposed to store fields in UTF-8, but then,
> how can I make sure I store something and get it back just as
> it was, including special characters?

Make sure that the data you give to Lucene is encoded properly, and then what you get back should be too. Please try the suggestion above ("ni\u00F1os"). If you still have the same problem, you may have found a bug; please report back what you find.

Steve
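P.S. The mis-decoding itself can be reproduced without Lucene at all. Here is a minimal sketch using only the JDK (the class name EncodingDemo is just for illustration): it takes the Latin-1 bytes for "niños" and decodes them as UTF-8, which is analogous to what happens when a Latin-1 source file is read under the wrong encoding.

```java
import java.nio.charset.Charset;

public class EncodingDemo {
    public static void main(String[] args) {
        Charset latin1 = Charset.forName("ISO-8859-1");
        Charset utf8 = Charset.forName("UTF-8");

        // "niños", with the n-tilde written as an explicit Unicode escape
        String original = "ni\u00F1os";

        // Encode as Latin-1 (one byte per character), then pretend those
        // bytes were UTF-8, as a mis-configured decoder would:
        byte[] bytes = original.getBytes(latin1);
        String garbled = new String(bytes, utf8);

        // The lone 0xF1 byte is not a valid UTF-8 sequence, so the decoder
        // substitutes U+FFFD, the Unicode replacement character:
        System.out.println(garbled);                 // ni�os
        System.out.println((int) garbled.charAt(2)); // 65533 (U+FFFD)
    }
}
```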