Re: all values for a key must fit in memory

2014-05-25 Thread Nilesh
Hi Patrick, In this particular case, at the end of my tasks I have X different types of keys. I need to write their values to X different files respectively. For now I'm writing everything to the driver node's local FS. While the number of key-value pairs can grow to millions (billions?), X is mo
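
A minimal sketch of the task described above: stream key-value pairs one at a time and append each value to a file for its key, so that no key's values are ever collected in memory. This is an illustration, not Nilesh's actual code. A plain Python generator stands in for the records a driver-side iterator would yield, the names `write_values_by_key` and `sample_pairs` are hypothetical, and it assumes the number of distinct keys is small enough to hold one open file handle per key and that keys are safe to use as file names.

```python
import os
import tempfile
from contextlib import ExitStack


def write_values_by_key(pairs, out_dir):
    """Stream (key, value) pairs, appending each value to <out_dir>/<key>.txt.

    Only one open file handle per distinct key is held; the values
    themselves are written as they arrive and never buffered per key.
    Returns a dict mapping each key to the number of values written.
    """
    counts = {}
    handles = {}
    # ExitStack closes every opened file when the block exits,
    # even if an error occurs partway through the stream.
    with ExitStack() as stack:
        for key, value in pairs:
            if key not in handles:
                path = os.path.join(out_dir, f"{key}.txt")
                handles[key] = stack.enter_context(
                    open(path, "w", encoding="utf-8")
                )
                counts[key] = 0
            handles[key].write(f"{value}\n")
            counts[key] += 1
    return counts


def sample_pairs():
    """Hypothetical stand-in for a stream of (key, value) records."""
    for i in range(10):
        yield ("even" if i % 2 == 0 else "odd", i)


out_dir = tempfile.mkdtemp()
counts = write_values_by_key(sample_pairs(), out_dir)
```

Memory use here depends on the number of distinct keys, not on the number of pairs, which is why the approach only suits a small X.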

Re: all values for a key must fit in memory

2014-05-25 Thread Patrick Wendell
Nilesh - out of curiosity - what operation are you doing on the values for the key? On Sun, May 25, 2014 at 6:35 PM, Nilesh wrote: > Hi Andrew, > > Thanks for the reply! > > It's clearer about the API part now. That's what I wanted to know. > > Wow, tuples, why didn't that occur to me. That's a l

Re: all values for a key must fit in memory

2014-05-25 Thread Nilesh
Hi Andrew, Thanks for the reply! It's clearer about the API part now. That's what I wanted to know. Wow, tuples, why didn't that occur to me. That's a lovely ugly hack. :) I also came across something that solved my real problem though - the RDD.toLocalIterator method from 1.0, the logic of whic

credential transfer question

2014-05-25 Thread Shihaoliang (Shihaoliang)
Hi, I have looked at the code that uses UGI in Spark. If Spark interacts with Kerberos-secured HDFS, Spark obtains a delegation token on the scheduler side and stores it as a credential in the UGI. The credential is then transferred to the Spark executors so that they can authenticate to HDFS. My question is

Re: all values for a key must fit in memory

2014-05-25 Thread Andrew Ash
Hi Nilesh, Matei's change from (Key, Seq[Value]) to (Key, Iterable[Value]) was made to enable the optimization in future releases without breaking the API. Currently though, all values on a single key are still held in memory on a single machine. The way I've gotten around this is by i

Re: all values for a key must fit in memory

2014-05-25 Thread Nilesh
I would like to clarify something. Matei mentioned that in Spark 1.0 groupBy returns a (Key, Iterable[Value]) instead of a (Key, Seq[Value]). Does this also automatically assure us that the whole Iterable[Value] is not in fact stored in memory? That is to say, with 1.0, will it be possible to do gro