On Wed, Apr 18, 2012 at 2:54 PM, Pedro <pedro...@gmail.com> wrote: > So to put it in a simple way, I need to tokenize all my data and > create an index which I load into memory...? > Is this how it is usually done? For example, does my browser (firefox) > keep an index of all the words present in urls and page titles on > memory at any given time?
Yes; features like "autocomplete" use data structures to index and quickly retrieve matches. For example, tries or suffix trees allow for quick searches given partial keywords. There's a question on Stack Overflow that talks about this: http://stackoverflow.com/questions/2426583/data-structure-for-auto-completion I recently refreshed my suffix trees implementation with scribbled documentation (http://planet.racket-lang.org/package-source/dyoo/suffixtree.plt/1/2/planet-docs/manual/index.html). I do I need to make a revised release that does not produce a spurious error message at compile-time; my apologies for that. The library might come in handy. For example: ###################################### > (require (planet dyoo/suffixtree:1:2)) > (define tree (make-tree)) > (tree-add! tree (string->label "apple sauce")) > (tree-add! tree (string->label "fruit juice")) > (tree-add! tree (string->label "mango")) ###################################### At this point, we can ask the tree if a substring exists in any suffix of the strings we've added to the tree: ###################################### > (tree-contains? tree (string->label "juic")) #t > (tree-contains? tree (string->label "moo")) #f ###################################### We can augment the suffix tree structure so that we can pick out the original terms that matched the given substring. It shouldn't be too bad, though it'll take a little more work. ____________________ Racket Users list: http://lists.racket-lang.org/users