OK, I see.  Well, on non-trivially sized corpora I think storage
requirements can become an issue, since every extra value of n adds another
set of keys to the index.  And if you're handling user queries, one might
wonder how often someone will actually query a 10-gram.  But if you can make
it work, go nuts!
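
To put a rough shape on the storage concern, here is a minimal sketch that
counts how many distinct keys an index would need as the maximum n-gram size
goes up.  The function names (word-ngrams, index-key-count) and the toy input
are made up for illustration; this is not a measurement of any real corpus.

```clojure
(require '[clojure.string :as str])

(defn word-ngrams
  "All contiguous runs of n words in tokens."
  [n tokens]
  (partition n 1 tokens))

(defn index-key-count
  "Number of distinct keys needed to index every 1..max-n word n-gram."
  [max-n tokens]
  (reduce +
          (for [n (range 1 (inc max-n))]
            (count (distinct (word-ngrams n tokens))))))

;; Toy input (hypothetical); a real corpus would be vastly larger.
(def tokens
  (str/split "go tell aunt rhodie go tell aunt rhodie the old gray goose is dead"
             #"\s+"))

;; Key counts for a maximum n-gram size of 1, 2, 3 and 4.
(map #(index-key-count % tokens) [1 2 3 4])
```

Each step up in maximum size adds a whole new layer of keys on top of the
ones already stored, and on real text the longer n-grams repeat less often, so
they tend to deduplicate poorly.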

For a lot of statistical language modeling there seems to be a sweet spot
at the 3-gram point.  I feel like I even saw a paper recently that compared
different human languages and concluded something about the importance of
trigrams, but I can't find it now.
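
For what capping at trigrams might look like, here is a rough sketch of the
kind of index described further down the thread, with an in-memory map
standing in for the key-value store.  The song ids, the lyrics, the function
names and the whitespace tokenizer are all assumptions for illustration only.

```clojure
(require '[clojure.string :as str])

(defn tokenize
  "Lower-case the text and split it on whitespace (a deliberately naive tokenizer)."
  [text]
  (str/split (str/lower-case (str/trim text)) #"\s+"))

(defn ngram-keys
  "String keys for every 1..max-n word n-gram in text, tokens joined by a space."
  [max-n text]
  (let [tokens (tokenize text)]
    (for [n (range 1 (inc max-n))
          gram (partition n 1 tokens)]
      (str/join " " gram))))

(defn index-doc
  "Add doc-id to the set stored under each n-gram key of text."
  [index max-n doc-id text]
  (reduce (fn [idx k] (update idx k (fnil conj #{}) doc-id))
          index
          (ngram-keys max-n text)))

;; Hypothetical documents, indexed up to trigrams.
(def songs
  {"song-9"  "Go tell Aunt Rhodie"
   "song-44" "Aunt Rhodie the old gray goose is dead"
   "song-12" "The old woman"})

(def index
  (reduce-kv (fn [idx id text] (index-doc idx 3 id text)) {} songs))

(get index "aunt rhodie")          ; ids of both songs containing the bigram
(get index "the old gray goose")   ; nil: a 4-gram is beyond the cap
```

A query longer than the cap simply misses, so in practice it would have to be
broken into trigrams and the resulting id sets intersected, which is the price
of the smaller index.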


On Tue, Mar 10, 2015 at 10:58 AM, Sam Raker <sam.ra...@gmail.com> wrote:

> I more meant deciding on a maximum size and storing them qua ngrams--it
> seems limiting. On the other hand, after a certain size, they stop being
> ngrams and start being something else--"texts," possibly.
>
> On Tuesday, March 10, 2015 at 1:29:44 PM UTC-4, John Wiseman wrote:
>>
>> By "hard coding" n-grams, do you mean using the simple string
>> representation, e.g. "aunt rhodie" as the key in your database?  If so,
>> then maybe it helps to think of it from the perspective that it's not
>> really just text, it's a string that encodes an n-gram just like
>> "[\"aunt\", \"rhodie\"]" is another way to encode an n-gram--the
>> encoding/decoding uses clojure.string/join and clojure.string/split instead
>> of json/write and json/read, and escaping tokens that contain spaces is on
>> your TODO list at a low priority :)
>>
>> (And I think the Google n-gram corpus
>> <https://catalog.ldc.upenn.edu/LDC2006T13> uses the same format.)
>>
>>
>> John
>>
>>
>> On Mon, Mar 9, 2015 at 7:09 PM, Sam Raker <sam....@gmail.com> wrote:
>>
>>> That's interesting. I've been really reluctant to "hard code" n-grams,
>>> but it's probably the best way to go.
>>>
>>> On Monday, March 9, 2015 at 6:12:43 PM UTC-4, John Wiseman wrote:
>>>>
>>>> One thing you can do is index 1-, 2-, 3-, ..., n-grams in a simple, fast
>>>> key-value store (like LevelDB).  For example, you could have entries like
>>>>
>>>> "aunt rhodie" -> song-9, song-44
>>>> "woman" -> song-12, song-65, song-96
>>>>
>>>>
>>>> That's basically how I made the Metafilter N-gram Viewer
>>>> <http://mefingram.appspot.com/>, a clone of Google Books Ngram Viewer
>>>> <https://books.google.com/ngrams>.
>>>>
>>>> Another possibility is using Lucene.  Just be aware that Lucene calls
>>>> n-grams of characters ("au", "un", "nt") n-grams but it calls n-grams of
>>>> words ("that the", "the old", "old gray") shingles.  So you would end up
>>>> using (I think, I haven't done this) the ShingleFilter
>>>> <https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html>.
>>>>
>>>> You might also find this article by Russ Cox interesting, where he
>>>> describes building and using an inverted trigram index:
>>>> http://swtch.com/~rsc/regexp/regexp4.html
>>>>
>>>>
>>>> John
>>>>
>>>>
>>>> On Sat, Mar 7, 2015 at 3:22 AM, Matching Socks <phill...@gmail.com>
>>>> wrote:
>>>>
>>>>> A lot of guys would use Lucene.  Lucene calls n-grams of words
>>>>> "shingles". [1]
>>>>>
>>>>> As for "architecture", here is a suggestion to use Lucene to find keys
>>>>> to records in your "real" database. [2]
>>>>>
>>>>> [1] https://lucidworks.com/blog/whats-a-shingle-in-lucene-parlance/
>>>>>
>>>>> [2] https://groups.google.com/d/msg/datomic/8yrCYxcQq34/GIomGaarX5QJ
>>>>>
>>>>>
>>>>>  --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Clojure" group.
>>>>> To post to this group, send email to clo...@googlegroups.com
>>>>> Note that posts from new members are moderated - please be patient
>>>>> with your first post.
>>>>> To unsubscribe from this group, send email to
>>>>> clojure+u...@googlegroups.com
>>>>> For more options, visit this group at
>>>>> http://groups.google.com/group/clojure?hl=en
>>>>> ---
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Clojure" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to clojure+u...@googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>