OK, I see. Well, on non-trivially sized corpora, I think storage requirements can become an issue, and in a situation where you're handling user queries one might wonder how often someone will query a 10-gram. But if you can make it work, go nuts!
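To put rough numbers on the storage concern, here is a back-of-the-envelope sketch in Clojure. It only counts n-gram occurrences, which is an upper bound on the number of distinct keys, and the function names are made up for illustration. It relies on the fact that a text of T tokens contains T - n + 1 n-grams of size n.

```clojure
(defn ngram-occurrences
  "Number of n-grams of size n in a text of token-count tokens."
  [token-count n]
  (max 0 (inc (- token-count n))))

(defn total-entries
  "Total n-gram occurrences when indexing every size from 1 to max-n."
  [token-count max-n]
  (reduce + (map #(ngram-occurrences token-count %)
                 (range 1 (inc max-n)))))

(total-entries 1000 3)  ;; => 2997
(total-entries 1000 10) ;; => 9955
```

So for a 1000-token text, raising the maximum size from 3 to 10 a bit more than triples the number of entries, before counting the longer keys themselves.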
For a lot of statistical language modeling there seems to be a sweet spot at the 3-gram point. I feel like I even saw a paper recently that compared different human languages and concluded something about the importance of trigrams, but I can't find it now.

On Tue, Mar 10, 2015 at 10:58 AM, Sam Raker <sam.ra...@gmail.com> wrote:

> I more meant deciding on a maximum size and storing them qua ngrams--it
> seems limiting. On the other hand, after a certain size, they stop being
> ngrams and start being something else--"texts," possibly.
>
> On Tuesday, March 10, 2015 at 1:29:44 PM UTC-4, John Wiseman wrote:
>>
>> By "hard coding" n-grams, do you mean using the simple string
>> representation, e.g. "aunt rhodie" as the key in your database? If so,
>> then maybe it helps to think of it from the perspective that it's not
>> really just text, it's a string that encodes an n-gram just like
>> "[\"aunt\", \"rhodie\"]" is another way to encode an n-gram--the
>> encoding/decoding uses clojure.string/join and clojure.string/split instead
>> of json/write and json/read, and escaping tokens that contain spaces is on
>> your TODO list at a low priority :)
>>
>> (And I think the Google n-gram corpus
>> <https://catalog.ldc.upenn.edu/LDC2006T13> uses the same format.)
>>
>> John
>>
>> On Mon, Mar 9, 2015 at 7:09 PM, Sam Raker <sam....@gmail.com> wrote:
>>
>>> That's interesting. I've been really reluctant to "hard code" n-grams,
>>> but it's probably the best way to go.
>>>
>>> On Monday, March 9, 2015 at 6:12:43 PM UTC-4, John Wiseman wrote:
>>>>
>>>> One thing you can do is index 1, 2, 3...n-grams and use a simple & fast
>>>> key-value store (like leveldb etc.) e.g., you could have entries like
>>>>
>>>> "aunt rhodie" -> song-9, song-44
>>>> "woman" -> song-12, song-65, song-96
>>>>
>>>> That's basically how I made the Metafilter N-gram Viewer
>>>> <http://mefingram.appspot.com/>, a clone of Google Books Ngram Viewer
>>>> <https://books.google.com/ngrams>.
>>>>
>>>> Another possibility is using Lucene. Just be aware that Lucene calls
>>>> n-grams of characters ("au", "un", "nt") n-grams but it calls n-grams of
>>>> words ("that the", "the old", "old gray") shingles. So you would end up
>>>> using (I think, I haven't done this) the ShingleFilter
>>>> <https://lucene.apache.org/core/4_2_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html>.
>>>>
>>>> You might also find this article by Russ Cox interesting, where he
>>>> describes building and using an inverted trigram index:
>>>> http://swtch.com/~rsc/regexp/regexp4.html
>>>>
>>>> John
>>>>
>>>> Three things that you might find interesting:
>>>>
>>>> Russ Cox' explanation of doing indexing and retrieval with an inverted
>>>> trigram index: http://swtch.com/~rsc/regexp/regexp4.html
>>>>
>>>> On Sat, Mar 7, 2015 at 3:22 AM, Matching Socks <phill...@gmail.com>
>>>> wrote:
>>>>
>>>>> A lot of guys would use Lucene. Lucene calls n-grams of words
>>>>> "shingles". [1]
>>>>>
>>>>> As for "architecture", here is a suggestion to use Lucene to find keys
>>>>> to records in your "real" database. [2]
>>>>>
>>>>> [1] https://lucidworks.com/blog/whats-a-shingle-in-lucene-parlance/
>>>>>
>>>>> [2] https://groups.google.com/d/msg/datomic/8yrCYxcQq34/GIomGaarX5QJ
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Clojure" group.
>>>>> To post to this group, send email to clo...@googlegroups.com
>>>>> Note that posts from new members are moderated - please be patient
>>>>> with your first post.
>>>>> To unsubscribe from this group, send email to
>>>>> clojure+u...@googlegroups.com
>>>>> For more options, visit this group at
>>>>> http://groups.google.com/group/clojure?hl=en
>>>>> ---
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Clojure" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to clojure+u...@googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
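
For anyone reading the thread later, here is a minimal sketch of the key-value approach John describes, with an in-memory Clojure map standing in for a store like leveldb. Only the space-joined key format and the use of clojure.string/join and clojure.string/split come from the thread; the function names (`tokenize`, `ngrams`, `encode`, `decode`, `index-doc`) and the song texts are made up for illustration. As noted in the thread, tokens that contain spaces would need escaping, which this sketch does not attempt.

```clojure
(require '[clojure.string :as str])

(defn tokenize
  "Lower-cases text and splits it on runs of whitespace."
  [text]
  (str/split (str/lower-case (str/trim text)) #"\s+"))

(defn ngrams
  "Returns every run of n consecutive tokens as a vector."
  [n tokens]
  (map vec (partition n 1 tokens)))

(defn encode
  "Encodes an n-gram (a sequence of tokens) as a space-joined key."
  [ngram]
  (str/join " " ngram))

(defn decode
  "Decodes a space-joined key back into a vector of tokens."
  [k]
  (str/split k #" "))

(defn index-doc
  "Adds every 1..max-n gram of text to index, pointing at doc-id."
  [index max-n doc-id text]
  (let [tokens (tokenize text)]
    (reduce (fn [idx gram]
              (update idx (encode gram) (fnil conj (sorted-set)) doc-id))
            index
            (mapcat #(ngrams % tokens) (range 1 (inc max-n))))))

;; Hypothetical documents, invented for this example.
(def songs
  {"song-9"  "Go tell Aunt Rhodie"
   "song-44" "Aunt Rhodie the old gray goose is dead"
   "song-12" "Oh pretty woman"})

;; Index unigrams, bigrams and trigrams for every song.
(def index
  (reduce-kv (fn [idx id text] (index-doc idx 3 id text)) {} songs))

(get index "aunt rhodie") ;; => #{"song-44" "song-9"}
(get index "woman")       ;; => #{"song-12"}
(decode "aunt rhodie")    ;; => ["aunt" "rhodie"]
```

A lookup for anything longer than the chosen maximum (here 3) simply misses, which is the limitation Sam raises about fixing a maximum size up front.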