By "hard coding" n-grams, do you mean using the simple string
representation, e.g. "aunt rhodie", as the key in your database?  If so,
it may help to look at it this way: it's not really just text, it's a
string that encodes an n-gram, just as "[\"aunt\", \"rhodie\"]" is another
way to encode one.  The encoding/decoding uses clojure.string/join and
clojure.string/split instead of json/write and json/read, and escaping
tokens that contain spaces is on your TODO list at a low priority :)

(And I think the Google n-gram corpus
<https://catalog.ldc.upenn.edu/LDC2006T13> uses the same format.)
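
To make that concrete, here is a rough sketch of the string-key encoding
together with the kind of n-gram index described in my earlier message
quoted below.  The function names (encode-ngram, decode-ngram, ngrams,
index-songs) and the sample song texts are made up for illustration, a plain
Clojure map stands in for a real key-value store, and it assumes tokens
never contain spaces:

```clojure
(require '[clojure.string :as str])

;; Encode an n-gram (a sequence of tokens) as one space-joined string key.
;; Assumption: no token contains a space, so no escaping is done.
(defn encode-ngram [tokens]
  (str/join " " tokens))

;; Decode a key back into a vector of tokens.
(defn decode-ngram [k]
  (str/split k #" "))

;; All word n-grams of size 1..max-n in a token sequence.
(defn ngrams [max-n tokens]
  (for [n (range 1 (inc max-n))
        gram (partition n 1 tokens)]
    gram))

;; Build an in-memory index: encoded n-gram -> set of song ids.
;; `songs` is a map of song id -> lyrics (hypothetical sample data below).
(defn index-songs [max-n songs]
  (reduce (fn [idx [song-id text]]
            (reduce (fn [idx gram]
                      (update idx (encode-ngram gram) (fnil conj #{}) song-id))
                    idx
                    (ngrams max-n (str/split (str/lower-case text) #"\s+"))))
          {}
          songs))

(def idx
  (index-songs 2 {"song-9"  "go tell aunt rhodie"
                  "song-44" "aunt rhodie the old gray goose is dead"}))

(get idx "aunt rhodie")       ; a set containing "song-9" and "song-44"
(decode-ngram "aunt rhodie")  ; => ["aunt" "rhodie"]
```

With a real store you would swap the map for put/get calls and serialize
the id set somehow; the key encoding stays the same.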


John


On Mon, Mar 9, 2015 at 7:09 PM, Sam Raker <sam.ra...@gmail.com> wrote:

> That's interesting. I've been really reluctant to "hard code" n-grams, but
> it's probably the best way to go.
>
> On Monday, March 9, 2015 at 6:12:43 PM UTC-4, John Wiseman wrote:
>>
>> One thing you can do is index 1, 2, 3...n-grams and use a simple & fast
>> key-value store (like leveldb etc.)  e.g., you could have entries like
>>
>> "aunt rhodie" -> song-9, song-44
>> "woman" -> song-12, song-65, song-96
>>
>>
>> That's basically how I made the Metafilter N-gram Viewer
>> <http://mefingram.appspot.com/>, a clone of Google Books Ngram Viewer
>> <https://books.google.com/ngrams>.
>>
>> Another possibility is using Lucene.  Just be aware that Lucene calls
>> n-grams of characters ("au", "un", "nt") n-grams but it calls n-grams of
>> words ("that the", "the old", "old gray") shingles.  So you would end up
>> using (I think, I haven't done this) the ShingleFilter
>> <https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html>
>> .
>>
>> You might also find this article by Russ Cox interesting, where he
>> describes building and using an inverted trigram index:
>> http://swtch.com/~rsc/regexp/regexp4.html
>>
>>
>> John
>>
>>
>> On Sat, Mar 7, 2015 at 3:22 AM, Matching Socks <phill...@gmail.com>
>> wrote:
>>
>>> A lot of guys would use Lucene.  Lucene calls n-grams of words
>>> "shingles". [1]
>>>
>>> As for "architecture", here is a suggestion to use Lucene to find keys
>>> to records in your "real" database. [2]
>>>
>>> [1] https://lucidworks.com/blog/whats-a-shingle-in-lucene-parlance/
>>>
>>> [2] https://groups.google.com/d/msg/datomic/8yrCYxcQq34/GIomGaarX5QJ
>>>
>>>
>>>  --
>>> You received this message because you are subscribed to the Google
>>> Groups "Clojure" group.
>>> To post to this group, send email to clo...@googlegroups.com
>>> Note that posts from new members are moderated - please be patient with
>>> your first post.
>>> To unsubscribe from this group, send email to
>>> clojure+u...@googlegroups.com
>>> For more options, visit this group at
>>> http://groups.google.com/group/clojure?hl=en
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "Clojure" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to clojure+u...@googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
