Hi Mark. This was very much exploratory work, and a lot of it was just about learning the Spark paradigms. That being said, merging for future work seems appropriate, but it's not clear yet whether I will be pursuing this work further. I might wind up using Shark instead [I would love to use Cascading over Spark as well, if it existed].
I'd like to know what issues you ran into getting the code/examples to work. I had a couple of people try this out from scratch on clean systems, and it worked for them.

Serializing the functions is necessary as far as I can tell; it would not work for me without it. My best guess (and this is largely guesswork) is that each time an anonymous function is compiled on a different JVM it gets a different class name (e.g. fn_123). There is a high likelihood that the name assigned on the master is not the same as the name on the task JVMs, so you wind up with a ClassNotFoundException. I don't know why this would work for you; if you have any insight into it, I would love to hear it. (A small sketch of what I think is happening is at the bottom of this message.)

Marc

On Tue, Jan 22, 2013 at 8:09 AM, Mark Hamstra <markhams...@gmail.com> wrote:
> Hmmm... a lot of duplicated work. Sorry I didn't get my stuff into a more
> usable form for you, but I wasn't aware that anybody was even interested
> in it. I've got some stuff that I want to rework a little, and I'm still
> thinking through the best way to integrate with the new reducers code in
> Clojure, but I haven't had the right combination of time and motivation
> to finish off what I started and document it. At any rate, we should work
> at merging the two efforts, since I don't see any need for duplicate APIs.
>
> In taking a quick first pass at it, I wasn't able to get your code and
> examples to work, but I'm curious what your reasoning is for using
> serializable.fn and avoiding clojure.core/fn or #(). I'm not sure that is
> strictly necessary. For example, the following works just fine with my API:
>
> (require 'spark.api.clojure.core)
>
> (wrappers!) ; one of the pieces I want to re-work, but allows functions like map to work with either Clojure collections or RDDs
>
> (set-spark-context! "local[4]" "cljspark")
>
> (def rdd (parallelize [1 2 3 4]))
>
> (def mrdd1 (map #(+ 2 %) rdd))
>
> (def result1 (collect mrdd1))
>
> (def offset1 4)
>
> (def mrdd2 (map #(+ offset1 %) rdd))
>
> (def result2 (collect mrdd2))
>
> (def mrdd3 (map (let [offset2 5] #(+ offset2 %)) rdd))
>
> (def result3 (collect mrdd3))
>
> That will result in result1, result2, and result3 being [3 4 5 6],
> [5 6 7 8], and [6 7 8 9] respectively, without any need for serializable.fn.
>
> On Tuesday, January 22, 2013 6:55:53 AM UTC-8, Marc Limotte wrote:
>
>> A Clojure API for the Spark project. I am aware that there is another
>> Clojure Spark wrapper project which looks very interesting; this project
>> has similar goals. Like that project, it is not absolutely complete, but
>> it does have some documentation and examples, and it is usable and should
>> be easy enough to extend as needed. This is the result of about three
>> weeks of work. It handles many of the initial problems, like serializing
>> anonymous functions, converting back and forth between Scala Tuples and
>> Clojure seqs, and converting RDDs to PairRDDs.
>>
>> The project is available here:
>>
>> https://github.com/TheClimateCorporation/clj-spark
>>
>> Thanks to The Climate Corporation for allowing me to release it. At
>> Climate, we do the majority of our Big Data work with Cascalog (on top of
>> Cascading). I was looking into Spark for some of the benefits that it
>> provides. I suspect we will explore Shark next, and may work it into our
>> processes for some of our more ad hoc/exploratory queries.
>>
>> I think it would be interesting to see a Cascading planner on top of
>> Spark, which would enable Cascalog queries (mostly) for free. I suspect
>> that might be a superior method of using Clojure on Spark.
>>
>> Marc Limotte
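A minimal sketch of the class-name problem described above, assuming the serializable-fn library (the serializable.fn namespace that clj-spark uses) is on the classpath. The surrounding names are illustrative, not part of the clj-spark API:

(require '[serializable.fn :as sfn])

;; A plain anonymous fn compiles to a JVM class with a generated name
;; (something like user$fn__456). That class exists only in the JVM that
;; compiled it, so a worker that tries to deserialize the closure by class
;; name can fail with ClassNotFoundException.
(def plain-inc #(+ 1 %))
(class plain-inc)    ; => a generated class, e.g. user$fn__456

;; A serializable fn keeps its own source form, so it can be shipped as
;; data and rebuilt on the worker JVM instead of being loaded by class
;; name. Locally it still behaves like an ordinary function.
(def ser-inc (sfn/fn [x] (+ 1 x)))
(ser-inc 41)         ; => 42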
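And a small sketch of the Scala Tuple <-> Clojure seq conversion mentioned in the announcement. This is plain Scala/Java interop rather than clj-spark's own helpers, and it assumes a Spark (or Scala) jar providing scala.Tuple2 is on the classpath:

(import 'scala.Tuple2)

(defn tuple->vec
  "Convert a scala.Tuple2 (as used by PairRDDs) into a Clojure vector [k v]."
  [^Tuple2 t]
  [(._1 t) (._2 t)])

(defn vec->tuple
  "Convert a two-element Clojure collection into a scala.Tuple2."
  [[k v]]
  (Tuple2. k v))

;; Round trip:
(tuple->vec (vec->tuple [:a 1]))   ; => [:a 1]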