On Jun 11, 3:05 am, Chris Rebert <c...@rebertia.com> wrote: > On Wed, Jun 10, 2009 at 5:53 PM, Mensanator<mensana...@aol.com> wrote: > > On Jun 10, 5:24 pm, David Wilson <d...@botanicus.net> wrote: > >> Hi, > > >> During a fun coding session yesterday, I came across a problem that I > >> thought was already solved by itertools, but on investigation it seems > >> it isn't. > > >> The problem is simple: given one or more ordered sequences, return > >> only the objects that appear in each sequence, without reading the > >> whole set into memory. This is basically an SQL many-many join. > > > Why not use SQL? > > Agreed. I seem to recall the last person asking for such a function > wanted to use it to combine SQL results. >
My original use case was a full text indexer. Here's the code: http://code.google.com/p/ghetto-fts/ Let me invert the question and ask: why would I want to use SQL for this? Or in my own words: what kind of girly-code requires an RDBMS just to join some sequences? =) Given that Google reports 14.132 billion occurrences of "the" on the English web, which is about right, given that some estimate the English web at ~15 billion documents, or about 33.8 bits to uniquely identify each document, let's assume we use a 64bit integer, that's theoretically 111.7GiB of data loaded into SQL just for a single word. Introducing SQL quickly results in artificially requiring a database system, when a 15 line function would have sufficed. It also restricts how I store my data, and prevents, say, using a columnar, variable length, or delta encoding on my sequence of document IDs, which would massively improve the storage footprint (say, saving 48-56 bits per element). I'll avoid mention of the performance aspects altogether. "What the hell are you thinking", David > Cheers, > Chris > --http://blog.rebertia.com -- http://mail.python.org/mailman/listinfo/python-list