Do continue to experiment with Solr as a "testbed" - all of the analysis
filters used by Solr are... part of Lucene, so once you figure things out in
Solr (using the Solr Admin UI analysis page), you can mechanically translate
to raw Lucene API calls.
Look at the standard tokenizer; it should do a better job with punctuation.
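As a minimal sketch of what that "mechanical translation" can look like in Lucene 3.x (my example, not tested against your data): a custom Analyzer that chains StandardTokenizer with whatever filters you settle on in the analysis page - here just StandardFilter and LowerCaseFilter:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    // Hypothetical example: a hand-rolled analyzer mirroring a Solr field type.
    public final class TranslatedAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // StandardTokenizer splits on most punctuation; the exact behavior
            // depends on the Version constant, which is why checking the output
            // on the Solr analysis page first is useful.
            TokenStream stream = new StandardTokenizer(Version.LUCENE_30, reader);
            stream = new StandardFilter(stream);
            stream = new LowerCaseFilter(stream);
            return stream;
        }
    }

Swap in the same filters you see in the Solr field type and the analysis output should match what the Solr Admin UI shows you.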
-- Jack Krupansky
-----Original Message-----
From: Todd Hunt
Sent: Thursday, June 27, 2013 1:14 PM
To: java-user@lucene.apache.org
Subject: Questions about doing a full text search with numeric values
I am working on an application that uses Tika to index text-based
documents and store the extracted text in Lucene. These documents can range
anywhere from one page to thousands of pages.
We are currently using Lucene 3.0.3, with the StandardAnalyzer to index and
search the text contained in a single Lucene document field.
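Roughly, the setup looks like the following sketch (simplified for the list; the field name and sample text are placeholders, not our actual code):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class IndexAndSearchSketch {
        public static void main(String[] args) throws Exception {
            Directory dir = new RAMDirectory();
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

            // Index the text extracted by Tika into a single analyzed field.
            IndexWriter writer = new IndexWriter(dir, analyzer,
                    IndexWriter.MaxFieldLength.UNLIMITED);
            Document doc = new Document();
            doc.add(new Field("contents", "Call 1-800-costumes.com for $118.30",
                    Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();

            // Search the same field with the same analyzer.
            IndexSearcher searcher = new IndexSearcher(dir, true);
            QueryParser parser = new QueryParser(Version.LUCENE_30, "contents", analyzer);
            TopDocs hits = searcher.search(parser.parse("800"), 10);
            System.out.println("hits for '800': " + hits.totalHits);
            searcher.close();
        }
    }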
For strictly alphabetic English words, the searches return the results as
expected. The problem has to do with searching for numeric values in the
indexed documents. Examples of text in the documents that cannot be found
unless wildcards are used are below (see the token-dump sketch after the list):
- 1-800-costumes.com
  - Searching for 800 does not find the text above
- $118.30
  - Searching for 118 does not find the text above
- 3tigers
  - Searching for 3 does not find the text above
- 000000123456
  - Searching for 123456 does not find the text above
- 123,abc,foo,bar,456
  - This is in a CSV file
  - Neither 123 nor 456 finds the text above
    - I realize this is because the text is separated only by commas and so
      is treated as one token, but I think the issue is no different from
      the others
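The token-dump sketch mentioned above (a throwaway illustration, not our production code) just prints what StandardAnalyzer emits for these strings; the bracketed terms are what Lucene actually matches query terms against:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    public class DumpTokens {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
            String[] samples = {
                "1-800-costumes.com", "$118.30", "3tigers",
                "000000123456", "123,abc,foo,bar,456"
            };
            for (String sample : samples) {
                TokenStream stream = analyzer.tokenStream("contents",
                        new StringReader(sample));
                TermAttribute term = stream.addAttribute(TermAttribute.class);
                System.out.print(sample + " ->");
                // Each bracketed value is one indexed term; a query term must
                // match one of these exactly (barring wildcards) to get a hit.
                while (stream.incrementToken()) {
                    System.out.print(" [" + term.term() + "]");
                }
                System.out.println();
                stream.close();
            }
        }
    }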
The expectation from our users is that if they can open a document in its
default application (Word, Adobe, Notepad, etc.), perform a "find" within
that application, and locate the text, then our Lucene-based application
should be able to find the same text.
It is not reasonable for us to ask our users to surround their searches
with wildcards. It also seems like a kludge to programmatically wrap
wildcards around any numeric values a user enters for searching.
Is there some type of numeric parser or filter that would help me out with
these scenarios?
I've looked at Solr, but we already have a strong foundation of code built
on Spring, Hibernate, and Lucene. Trying to integrate Solr into our
application would take more refactoring and time than is available for
this release.
Also, since these numeric values are embedded within the documents, I don't
think storing them in their own field would make sense; I want to maintain
the context of the numeric values within the document.
Thank you.