Re: Storing special characters in Lucene

Grant Ingersoll Thu, 21 Aug 2008 15:31:32 -0700

Here's a unit test:
import junit.framework.TestCase;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;



public class SpanishTest extends TestCase {

  public void testSpanish() throws Exception {
    RAMDirectory directory = new RAMDirectory();
    String content = "niños";

    IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);

    Document document = new Document();

    document.add(new Field("name", content, Field.Store.YES, Field.Index.TOKENIZED));
    SnowballAnalyzer snowballAnalyzer = new SnowballAnalyzer("Spanish");

    writer.addDocument(document, snowballAnalyzer);
    writer.close();

    IndexSearcher searcher = new IndexSearcher(directory);
    QueryParser parser = new QueryParser("name", snowballAnalyzer);
    Query query = parser.parse(content);
    System.out.println("Query: " + query);
    Hits hits = searcher.search(query);

    assertTrue("hits Size: " + hits.length() + " is not: " + 1, hits.length() == 1);

    Document theDoc = hits.doc(0);
    String nombre = theDoc.get("name");
    System.out.println("Nombre: " + nombre);
  }
}


When I run this in IntelliJ, I get:

Query: name:niñ
Nombre: niños

Process finished with exit code 0


Are you by chance indexing XML?
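
Since the test above round-trips the stored value correctly, the garbling may be happening before the text reaches Lucene or after it comes back out, for instance when bytes are decoded with the wrong charset. Here is a minimal sketch of that failure mode, independent of Lucene. The class and method names (EncodingCheck, misdecode, roundTrip) are hypothetical, chosen only for illustration, and it assumes a JDK that provides java.nio.charset.StandardCharsets. The string is written with a \u escape so the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

public class EncodingCheck {

  // Encode as UTF-8, then (incorrectly) decode the bytes as ISO-8859-1.
  // Each byte of a multi-byte UTF-8 sequence becomes its own character.
  static String misdecode(String s) {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    return new String(utf8, StandardCharsets.ISO_8859_1);
  }

  // Encode and decode with the same charset; the text survives unchanged.
  static String roundTrip(String s) {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    return new String(utf8, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String content = "ni\u00f1os"; // "niños" without relying on file encoding

    // true: matching charsets preserve the special character
    System.out.println(roundTrip(content).equals(content));

    // The two UTF-8 bytes of the n-with-tilde turn into two characters,
    // so the misdecoded string is one character longer than the original.
    System.out.println(content.length() + " -> " + misdecode(content).length());
  }
}
```

If the value is already damaged in the String handed to Field, checking the charset used to read the input (file, XML, HTTP request) and the encoding the compiler assumes for source files is a reasonable next step.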



On Aug 21, 2008, at 1:16 PM, Juan Pablo Morales wrote:

I have an index in Spanish and I use Snowball to stem and analyze, and it
works perfectly. However, I am running into trouble storing (not indexing,
only storing) words that have special characters.
That is, I store the special character but it comes back garbled when I
read it.
To provide an example:

String content = "niños";
document.add(new Field("name", content, Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(document, new SnowballAnalyzer("Spanish"));
.
When I read the field back
String nombre = doc.get("name");

Then nombre will contain "ni�os".
Looking at the index with Luke, it shows me "ni�os", but when I want to
see the full text (by right clicking) it shows me ni�os.
I know Lucene is supposed to store fields in UTF-8, but then, how can I
make sure I store something and get it back just as it was, including
special characters?

Thanks
--
Juan Pablo Morales
Ingenian Software ltda
Bogotá, Colombia


--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







