On 2/24/2009 at 5:36 PM, Chris Hostetter wrote:
> Shingling is (lucene specific?) vernacular for word based ngrams

"Shingle" is not a Lucene-specific term - here's an entry, e.g., from an IBM "Glossary of terms for enterprise search" at <http://publib.boulder.ibm.com/infocenter/discover/v8r5m0/index.jsp?topi c=/com.ibm.discovery.es.common.doc/standard/iiysgloss.htm>: ----- shingle A string of consecutive tokens (words) that are taken from a sentence. For example, from "This is a very short sentence.", the 3-word shingles (or trigrams) are: This is a is a very a very short very short sentence Shingles can be used in statistical linguistics. For example, if two different texts have a lot of common shingles, the texts are probably related somehow. ----- The earliest usage I can find is Andrei Broder et al.'s 1997 report "Syntactic Clustering of the Web": http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1997-015.pdf I first saw the term in Broder's 2000 paper "Identifying and Filtering Near-Duplicate Documents": http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/CPM%202 000.pdf Steve --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org