Re: Working with big datasets, merging two ordered lists by key

Leif Mon, 10 Mar 2014 21:41:30 -0700

Re. Tim's points below:

*i)* The seqs have to be ordered, or one of them has to be loaded fully 
into memory; I don't think there's any way around that.


*ii)* Frank's solution does *not* require the seqs to be the same length, 
and it gives you the complete 'diff' of the seqs (i.e. a full outer join), 
which could be handy.  The one snag I see is that it is eager, not lazy, so 
it builds the entire result in memory.  Unless you are projecting out a 
small subset of the fields from each record, you will probably end up 
using as much memory as the other solutions.  I wrote a lazy version using 
'iterate', but I'm not sure whether it still keeps both entire seqs in memory.
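
For illustration, here is a minimal sketch of a lazy outer join over two 
sorted seqs.  It is not the 'iterate' version mentioned above; it uses 
'lazy-seq' instead, and the name 'merge-sorted' and the optional 'keyfn' 
argument are made up for this example.  It assumes both inputs are sorted 
ascending by key with no duplicate keys.

```clojure
(defn merge-sorted
  "Lazily pairs up two seqs sorted ascending by key (an outer join).
  Emits [x y] when the keys match, otherwise [x nil] or [nil y].
  `keyfn` is a hypothetical key-extraction argument; defaults to identity."
  ([a b] (merge-sorted identity a b))
  ([keyfn a b]
   (lazy-seq
     (let [a (seq a)
           b (seq b)]
       (cond
         ;; both inputs exhausted: end of the result seq
         (and (nil? a) (nil? b)) nil
         ;; only b has elements left
         (nil? a) (cons [nil (first b)] (merge-sorted keyfn nil (rest b)))
         ;; only a has elements left
         (nil? b) (cons [(first a) nil] (merge-sorted keyfn (rest a) nil))
         :else
         (let [fa (first a)
               fb (first b)
               c  (compare (keyfn fa) (keyfn fb))]
           (cond
             (zero? c) (cons [fa fb]  (merge-sorted keyfn (rest a) (rest b)))
             (neg? c)  (cons [fa nil] (merge-sorted keyfn (rest a) b))
             :else     (cons [nil fb] (merge-sorted keyfn a (rest b))))))))))

(merge-sorted [1 2 3] [2 4])
;; => ([1 nil] [2 2] [3 nil] [nil 4])
```

Whether this stays within memory depends on the caller: if nothing holds on 
to the head of the inputs or of the result (e.g. the result is consumed with 
'doseq' or 'reduce'), the already-processed part can be garbage collected.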

My two cents:

1. If you have enough memory, go with Moritz' suggestion to read the 
smaller seq into a map.  Then you can do a simple for comprehension and 
arrange it so that the second, larger seq will never be completely in 
memory.
2. Another possible solution is to load the textfile into a temp table in 
your database.  Then the solution is one simple SQL query, backed by 
hyper-optimized code designed to deal with this exact problem.
3. You may want to try the naive approach: 400k records sounds like it 
could very well fit into memory, as long as each record doesn't have a huge 
amount of data.
4. A library that has tools to deal with big files: 
https://github.com/kyleburton/clj-etl-utils
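
A rough sketch of option 1, assuming the records are maps and the join key 
can be pulled out with a function.  The name 'join-by-key' and its arguments 
are hypothetical, not taken from Moritz' post.  Only the smaller seq is 
indexed in memory; the larger one is streamed lazily through 'for'.

```clojure
(defn join-by-key
  "Indexes `small` by `keyfn`, then lazily walks `large`, pairing each
  record with its match from the index (or nil when there is none)."
  [keyfn small large]
  (let [index (into {} (map (juxt keyfn identity) small))]
    (for [rec large]
      [rec (get index (keyfn rec))])))

(join-by-key :id
             [{:id 1 :name "a"} {:id 3 :name "c"}]
             [{:id 1 :qty 10} {:id 2 :qty 20}])
;; => ([{:id 1, :qty 10} {:id 1, :name "a"}] [{:id 2, :qty 20} nil])
```

Note that this behaves like a left join from the larger seq: records that 
exist only in the smaller seq are not reported, and if the smaller seq has 
duplicate keys the last one wins.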

--Leif

On Monday, March 10, 2014 11:01:07 PM UTC-4, frye wrote:
>
> Hey Frank, 
>
> Right. So I tried this loop / recur, and it runs, giving a result of *([4 
> nil] [3 3] [2 nil] [1 1])*. But I'm not sure how that's going to help you 
> (although not discounting the possibility). 
>
> You can simultaneously iterate through pairs of lists, to compare values. 
> However you cannot guarantee that those lists will be *i)* ordered, and 
> *ii)* the same length. Both those conditions are required for your 
> algorithm to work. Plus, what you suggest still means that you'll have to 
> scan through the entire space of both results. So we're not going to avoid 
> that. 
>
> Based on your requirements, I still see my original *for* comprehension 
> as the most straightforward way to solve the problem. My second suggested 
> algorithm could also work. But I could be wrong and am always learning too. 
> So trying different solutions is a good habit to keep. 
>
>
> Hth 
>
> Tim Washington 
> Interruptsoftware.com <http://interruptsoftware.com> 
>
>  
> On Mon, Mar 10, 2014 at 4:53 AM, Frank Behrens <fbeh...@gmail.com> wrote:
>
>> Hey, just to share, I came up with this code, which seems quite OK to me.
>> Feels like I already understand something, or do I?
>> Have a nice day, Frank
>>
>> (loop
>>   [a '(1 2 3 4)
>>    b '(1 3)
>>    out ()]
>>   (cond 
>>     (and (empty? a) (empty? b)) out
>>     (empty? a) (recur a (rest b) (conj out [nil (first b)]))
>>     (empty? b) (recur (rest a) b (conj out [(first a) nil]))
>>     :else (let
>>             [fa   (first a)
>>              fb   (first b)
>>              cmp  (compare fa fb)]
>>             (cond 
>>                 (= 0 cmp) (recur (rest a) (rest b) (conj out [fa fb]))
>>                 (> 0 cmp) (recur (rest a)  b       (conj out [fa nil]))
>>                 :else     (recur  a       (rest b) (conj out [nil fb]))))))
>>
>>
>> Am Montag, 10. März 2014 09:26:14 UTC+1 schrieb Frank Behrens:
>>
>>> Thanks for your suggestions. 
>>> A for loop has to do 100.000 * 300.000 compares.
>>> Storing the database table in a 300.000-element hash would be a 
>>> memory penalty I want to avoid.
>>>
>>> I'm quite sure that an essential part of the solution is a function to 
>>> iterate through both lists at once,
>>> spitting out pairs of values according to compare:
>>>
>>> (merge-sortedlists 
>>>   '(1 2 3)
>>>   '(   2    4))
>>> => ([1 nil] [2 2] [3 nil] [nil 4])
>>>
>>> Seems quite doable.
>>> Try to implement now.
>>>
>>> Frank
>>>
>>> 

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
