Do you want to search for shingles?
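If so, one way to read that suggestion is sketched below in plain Java, without any Lucene calls: break the long query word into all of its contiguous substrings (character n-grams) within a length range, then look each one up as an exact term. The class name SubwordLookup, the method names, and the in-memory Set standing in for the index are all assumptions made for illustration. In a real index each substring would presumably become an exact term lookup (for example one TermQuery per substring, combined in a BooleanQuery), and that wiring is left out here.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: finds indexed words contained in a longer query word.
public class SubwordLookup {

    // All contiguous substrings of `word` whose length lies in [minLen, maxLen],
    // in order of increasing length, then increasing start offset.
    static Set<String> charNGrams(String word, int minLen, int maxLen) {
        Set<String> grams = new LinkedHashSet<>();
        int longest = Math.min(maxLen, word.length());
        for (int len = minLen; len <= longest; len++) {
            for (int start = 0; start + len <= word.length(); start++) {
                grams.add(word.substring(start, start + len));
            }
        }
        return grams;
    }

    // Indexed words that occur somewhere inside `query`.
    // `indexedWords` stands in for the term dictionary of a real index (assumption).
    static List<String> containedWords(String query, Set<String> indexedWords,
                                       int minLen, int maxLen) {
        List<String> hits = new ArrayList<>();
        for (String gram : charNGrams(query, minLen, maxLen)) {
            if (indexedWords.contains(gram)) {
                hits.add(gram);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> indexed = Set.of("sequ", "quenc", "ence", "subst");
        // prints [sequ, ence, quenc]
        System.out.println(containedWords("sequence", indexed, 4, 5));
    }
}
```

For query words of 100 to 1000 characters the number of substrings grows with both the word length and the width of the length range, so it is probably worth restricting minLen and maxLen to the lengths that actually occur in the index.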
On 3/4/2015 9:16 PM, Stephen Rudd wrote:
I have created a slightly hairy document collection that contains tens of
millions of DNA sequence words, which I wish to process to find the rare and
unique words. Each word is between 100 and 1000 characters (nucleotides) in
length.
I have been able to use WildcardQuery and FuzzyQuery to select for words -
using the query “*ubst*” I can recover subst, substring etc.
I am a little challenged in selecting words in the reciprocal direction - if I
start with a long word such as “sequence”, what would be the most appropriate
way to select the words in the database that are contained within it, e.g.
sequ, quenc and ence?
Is there a simple logical way that this could or should be done? A few pointers
would be very much appreciated.
Cheers
Stephen
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org