Re: groupByKey() and keys with many values

2015-09-08 Thread Reynold Xin
On Tue, Sep 8, 2015 at 6:51 AM, Antonio Piccolboni wrote:
> As far as the DB writes, remember Spark can retry a computation, so your
> writes have to be idempotent (see this thread, in
> which Reynold is a bit optimistic about f
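One way to read "idempotent writes" is sketched below, assuming the results are keyed and the store supports an upsert. It uses an in-memory SQLite database as a stand-in for the real one; the table `results` and the function `write_results` are hypothetical names, not anything from the thread.

```python
import sqlite3

# In-memory SQLite stands in for the real database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (key TEXT PRIMARY KEY, total INTEGER)")

def write_results(conn, rows):
    """Upsert keyed on the primary key, so replaying the same rows is harmless."""
    conn.executemany(
        "INSERT OR REPLACE INTO results (key, total) VALUES (?, ?)", rows
    )
    conn.commit()

rows = [("a", 9), ("b", 2)]
write_results(conn, rows)
write_results(conn, rows)  # simulate a retried task writing the same output again

stored = conn.execute("SELECT key, total FROM results ORDER BY key").fetchall()
```

Because each row is addressed by its key, the second write overwrites rather than appends, so a retried computation leaves the table in the same state as a single run.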

Re: groupByKey() and keys with many values

2015-09-08 Thread Antonio Piccolboni
You may also consider selecting distinct keys and fetching from the database first, then joining on key with the values. This is in case Sean's approach is not viable -- in case you need to have the DB data before the first reduce call. By not revealing your problem, you are forcing us to make guesses, which are
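A rough illustration of the "distinct keys, fetch, then join" idea, written in plain Python so it runs without a cluster. `fetch_from_db` is a hypothetical placeholder for the real query; on an RDD the corresponding steps would presumably be `rdd.keys().distinct()`, a map over those keys, and `rdd.join(...)`.

```python
# Hypothetical stand-in for the per-key database query.
def fetch_from_db(key):
    return {"a": 10, "b": 100}[key]

pairs = [("a", 1), ("b", 2), ("a", 3)]

# Step 1: collect each key once, however many values it has.
distinct_keys = {key for key, _ in pairs}

# Step 2: fetch the database data once per distinct key.
db_rows = {key: fetch_from_db(key) for key in distinct_keys}

# Step 3: join on key, giving (key, (value, db_data)) like an RDD join does.
joined = [(key, (value, db_rows[key])) for key, value in pairs]
```

Every value now arrives paired with its key's database data, so that data is available before any reduce step runs.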

Re: groupByKey() and keys with many values

2015-09-08 Thread Sean Owen
I think groupByKey is intended for cases where you do want the values in memory; for one-pass use cases, it's more efficient to use reduceByKey, or aggregateByKey if lower-level operations are needed. For your case, you probably want to do your reduceByKey, then perform the expensive per-key lookup
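A minimal sketch of that ordering (reduce first, look up afterwards), in plain Python rather than Spark so it runs anywhere. `reduce_by_key` is a local stand-in for what `RDD.reduceByKey` does per key, and `fetch_key_data` is a hypothetical placeholder for the expensive database lookup; neither name comes from the thread.

```python
# Hypothetical stand-in for the expensive per-key database query.
lookup_calls = []

def fetch_key_data(key):
    lookup_calls.append(key)  # record each call so the count can be checked
    return {"a": 10, "b": 100}[key]

def reduce_by_key(pairs, combine):
    """Fold values per key one at a time, never holding a key's full value list."""
    acc = {}
    for key, value in pairs:
        acc[key] = combine(acc[key], value) if key in acc else value
    return acc

pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 5)]

# Step 1: reduce first, so each key carries a single combined value.
totals = reduce_by_key(pairs, lambda x, y: x + y)

# Step 2: do the expensive lookup once per key, on the reduced result.
results = {key: total * fetch_key_data(key) for key, total in totals.items()}
```

The lookup runs once per key regardless of how many values that key had, and no key's values are ever materialized as a list, which is the memory concern with groupByKey.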

Re: groupByKey() and keys with many values

2015-09-07 Thread kaklakariada
Hi Antonio! Thank you very much for your answer! You are right in that in my case the computation could be replaced by a reduceByKey. The thing is that my computation also involves database queries: 1. Fetch key-specific data from database into memory. This is expensive and I only want to do this

Re: groupByKey() and keys with many values

2015-09-07 Thread Antonio Piccolboni
To expand on what Sean said, I would look into replacing groupByKey with reduceByKey. Also take a look at this doc. I happen to have designed a library that was subject to

Re: groupByKey() and keys with many values

2015-09-07 Thread Sean Owen
That's how it's intended to work; if it's a problem, you probably need to re-design your computation to not use groupByKey. Usually you can do so.
On Mon, Sep 7, 2015 at 9:02 AM, kaklakariada wrote:
> Hi,
>
> I already posted this question on the users mailing list
> (http://apache-spark-user-lis