Here's a unit test:
import junit.framework.TestCase;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;


public class SpanishTest extends TestCase {

  public void testSpanish() throws Exception {
    RAMDirectory directory = new RAMDirectory();
    String content = "niños";
    IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
    Document document = new Document();
    document.add(new Field("name", content, Field.Store.YES, Field.Index.TOKENIZED));
    SnowballAnalyzer snowballAnalyzer = new SnowballAnalyzer("Spanish");
    writer.addDocument(document, snowballAnalyzer);
    writer.close();

    IndexSearcher searcher = new IndexSearcher(directory);
    QueryParser parser = new QueryParser("name", snowballAnalyzer);
    Query query = parser.parse(content);
    System.out.println("Query: " + query);
    Hits hits = searcher.search(query);
    assertTrue("hits Size: " + hits.length() + " is not: " + 1, hits.length() == 1);
    Document theDoc = hits.doc(0);
    String nombre = theDoc.get("name");
    System.out.println("Nombre: " + nombre);
  }
}


When I run this in IntelliJ, I get:

Query: name:niñ
Nombre: niños

Process finished with exit code 0


Are you by chance indexing XML?
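
If the text reaches Lucene from a file or an XML stream, one possible cause (an assumption, not something the test above confirms) is a charset mismatch when the bytes are decoded, before Lucene ever sees the string. Here is a minimal sketch, independent of Lucene, that reproduces that kind of damage. The class name EncodingCheck and the method recode are hypothetical names used only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {

  // Encode text with one charset and decode the bytes with another,
  // which is what happens when a file or stream is read with the wrong encoding.
  public static String recode(String text, Charset writtenAs, Charset readAs) {
    byte[] bytes = text.getBytes(writtenAs);
    return new String(bytes, readAs);
  }

  public static void main(String[] args) {
    // The \u00f1 escape keeps the example independent of the source file's encoding.
    String content = "ni\u00f1os";

    // Same charset on both sides: the string survives the round trip.
    String same = recode(content, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

    // Latin-1 bytes decoded as UTF-8: the single byte for the n-tilde is not
    // valid UTF-8, so the decoder substitutes U+FFFD (the replacement character).
    String latin1AsUtf8 = recode(content, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

    // UTF-8 bytes decoded as Latin-1: the two-byte sequence becomes two characters.
    String utf8AsLatin1 = recode(content, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

    System.out.println("round trip intact: " + content.equals(same));
    System.out.println("Latin-1 read as UTF-8 intact: " + content.equals(latin1AsUtf8));
    System.out.println("UTF-8 read as Latin-1 intact: " + content.equals(utf8AsLatin1));
  }
}
```

If the string is already damaged before document.add(...) is called, the stored field will be damaged as well, whatever analyzer is used. Comparing the string against a \u-escaped literal right before indexing is one way to narrow that down.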



On Aug 21, 2008, at 1:16 PM, Juan Pablo Morales wrote:

I have an index in Spanish and I use Snowball to stem and analyze and it works perfectly. However, I am running into trouble storing (not indexing,
only storing) words that have special characters.

That is, I store the special character, but it comes back garbled when I
read it.
To provide an example:

String content = "niños";
document.add(new Field("name", content, Store.YES, Index.TOKENIZED));
writer.addDocument(document, new SnowballAnalyzer("Spanish"));
When I read the field back:
String nombre = doc.get("name");

Then nombre will contain "ni�os"

Looking at the index with Luke it shows me "ni�os" but when I want to
see the full text (by right clicking) it shows me ni�os.

I know Lucene is supposed to store fields in UTF-8, but then, how can I make
sure I store something and get it back just as it was, including special
characters?

Thanks
--
Juan Pablo Morales
Ingenian Software ltda
Bogotá, Colombia

--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







