On 2/24/2009 at 5:36 PM, Chris Hostetter wrote:
> Shingling is (lucene specific?) vernacular for word based ngrams

"Shingle" is not a Lucene-specific term. Here, for example, is an entry
from an IBM "Glossary of terms for enterprise search" at
<http://publib.boulder.ibm.com/infocenter/discover/v8r5m0/index.jsp?topic=/com.ibm.discovery.es.common.doc/standard/iiysgloss.htm>:

-----
shingle

    A string of consecutive tokens (words) that are taken from a
sentence. For example, from "This is a very short sentence.", the 3-word
shingles (or trigrams) are:

    This is a
    is a very
    a very short
    very short sentence

    Shingles can be used in statistical linguistics. For example, if two
different texts have a lot of common shingles, the texts are probably
related somehow.
-----
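
To make the glossary entry concrete, here is a minimal Python sketch of
word shingling. It is only an illustration, not Lucene code: the function
names word_shingles and shingle_overlap are made up for this example, the
tokenizer is a naive whitespace split with punctuation stripped, and the
overlap measure (shared shingles divided by all distinct shingles) is one
simple way to quantify "a lot of common shingles".

```python
def word_shingles(text, n=3):
    """Return the n-word shingles of text, in order of appearance."""
    # Naive tokenization (an assumption for this sketch): split on
    # whitespace and strip punctuation from the ends of each token.
    tokens = [t.strip(".,;:!?\"'()") for t in text.split()]
    tokens = [t for t in tokens if t]
    # Slide a window of n tokens across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shingle_overlap(a, b, n=3):
    """Fraction of distinct shingles that the two texts share (0.0 to 1.0)."""
    sa, sb = set(word_shingles(a, n)), set(word_shingles(b, n))
    if not sa and not sb:
        return 0.0  # neither text is long enough to produce a shingle
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    for shingle in word_shingles("This is a very short sentence."):
        print(shingle)
    print(shingle_overlap("This is a very short sentence.",
                          "This is a very long sentence."))
```

Run on the glossary's sentence, word_shingles yields the four trigrams
listed above. A real analyzer would tokenize more carefully than this.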

The earliest usage I can find is Andrei Broder et al.'s 1997 report
"Syntactic Clustering of the Web":

http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1997-015.pdf

I first saw the term in Broder's 2000 paper "Identifying and Filtering
Near-Duplicate Documents":

http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/CPM%202000.pdf

Steve

