Okay, that begins to make more sense. For a Clojure coder, Cascalog is definitely a more attractive means of expressing workflows at a high level than Cascading is. And like I said before, I can get behind an effort to produce a higher-level syntax for expressing Spark workflows. Shark is nice as far as it goes, but it is not really a means to express iterative machine learning algorithms and other things that SQL was never intended to do (and nobody expects it to do). I also remain unconvinced that Cascading is the right foundation on which to build a higher-level abstraction for Spark -- Cascading can express more than SQL can, but I still don't think it is designed to effectively express everything that Spark can do. I'm much more interested in something like MLbase <http://www.mlbase.org/> as a higher-level interface to Spark (and other frameworks).
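To make that point concrete, here is a minimal sketch in plain Clojure of the kind of loop -- repeatedly refining an estimate over the same data set until it converges -- that a single declarative SQL query has no natural way to express. On Spark, the `points` collection would be a cached RDD so each pass stays in memory; the data and step size here are made up for illustration.

;; 1-D least-squares fit by gradient descent: an inherently iterative job.
(def points [[1.0 2.1] [2.0 3.9] [3.0 6.2] [4.0 7.8]]) ; [x y] samples

(defn gradient [w]
  ;; average of d/dw (w*x - y)^2 over the data set
  (/ (reduce + (map (fn [[x y]] (* 2 x (- (* w x) y))) points))
     (count points)))

(defn fit [w0 step eps]
  (loop [w w0]
    (let [g (gradient w)]
      (if (< (Math/abs g) eps)
        w
        (recur (- w (* step g)))))))

;; (fit 0.0 0.01 1e-6) ;=> approximately 1.99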
On Thursday, January 24, 2013 6:31:25 PM UTC-8, Marc Limotte wrote:
>
> I'm interested in the Cascading layer because that would enable Cascalog
> queries. I like the declarative (logic programming) form of Cascalog over
> the more imperative Spark style. SQL style is good for some use cases
> also, but I believe Cascalog is more composable than SQL.
>
> Cascading uses a [pluggable] planner to transform the workflow into some
> DAG. I believe there are two planners now: hadoop and local, and the idea
> of planning for additional non-hadoop engines in the future is a design
> goal. I think it would be very similar to the way that Shark takes the
> Hive AST and then creates a physical plan for Spark.
>
> Thanks for the error report with clj-spark. That error looks familiar,
> but I thought it was fixed. I'll look into reproducing it next week.
>
> Marc
>
> On Wed, Jan 23, 2013 at 11:24 AM, Mark Hamstra <markh...@gmail.com> wrote:
>
>> I certainly understand the exploration and learning motivation -- I did
>> much the same thing. At this point, I wouldn't consider either of our
>> efforts to be a complete or fully usable Clojure API for Spark, but there
>> are definitely ideas worth looking at in both if anyone gets to the point
>> of attempting to write a complete and robust API -- which I won't be doing
>> in the immediate future.
>>
>> I'm not sure that I am following you on Cascading and Spark. Are you
>> saying that you want to use the Cascading API to express workflows which
>> will then be transformed into a DAG of Spark stages and run as a Spark job?
>> I don't think that I agree with that strategy. While I can get behind
>> various higher-level abstractions for expressing Spark jobs (which is what
>> Shark is doing, after all), I don't find Cascading's API to be terribly
>> elegant: when writing a Spark job in Scala, I just don't find myself
>> thinking that it would be a whole lot easier if I could write the job in
>> Cascading. Part of that is because I'm not fluent in Cascading, but from
>> what I have seen and done with it, I don't lust after Cascading. The other
>> problem I have with the Cascading-to-Spark strategy is that Cascading has
>> been designed and implemented very much with Hadoop in mind, but Spark can
>> do quite a bit that Hadoop cannot. I don't think that Cascading itself
>> would be a good fit for expressing Spark jobs that really leverage the
>> advantages Spark has over Hadoop. None of that is meant to say that
>> Cascading isn't a step forward over writing jobs using Hadoop's Java API;
>> but at this point I just don't see Cascading as a step forward for writing
>> Spark jobs.
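For readers who have not seen the contrast Marc draws here, the following is a rough word-count sketch, assuming Cascalog's classic query operators (?<-, defmapcatop, c/count), with `sentences` standing for any source tap; the imperative version below it runs over plain Clojure seqs standing in for RDDs and mimics only the shape of Spark-style code.

(ns wordcount.core
  (:use cascalog.api)                    ; ?<-, stdout, defmapcatop
  (:require [cascalog.ops :as c]
            [clojure.string :as str]))

;; Declarative: state the relation you want; the planner picks the DAG.
(defmapcatop split-words [line]
  (str/split line #"\s+"))

(defn declarative-wordcount [sentences]
  (?<- (stdout) [?word ?count]
       (sentences ?line)
       (split-words ?line :> ?word)
       (c/count ?count)))

;; Imperative: spell out each transformation step in order.
(defn imperative-wordcount [lines]
  (->> lines
       (mapcat #(str/split % #"\s+"))
       frequencies))

Because the Cascalog query is just data handed to a planner, retargeting it to a new engine means writing a new planner (as Marc describes), not rewriting the queries.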
>>
>> Anyway, here's one of the early problems I ran into when trying to follow
>> your README:
>>
>> $ lein --version
>> Leiningen 2.0.0 on Java 1.7.0_09 OpenJDK 64-Bit Server VM
>> $ lein deps
>> $ lein compile
>> Compiling clj-spark.spark.functions
>> Compiling clj-spark.api
>> Compiling clj-spark.util
>> Compiling clj-spark.examples.query
>> $ lein run
>> 2013-01-23 11:17:10,436 WARN api:1 - JavaSparkContext local Simple Job /home/mark/Desktop/Scala/Spark/0.6 [] {}
>> Exception in thread "main" java.lang.ClassCastException: clojure.lang.PersistentVector cannot be cast to java.lang.CharSequence
>>   at clojure.string$split.invoke(string.clj:174)
>>   at clj_spark.api$spark_context.doInvoke(api.clj:18)
>>   at clojure.lang.RestFn.invoke(RestFn.java:805)
>>   at clj_spark.examples.query$_main.doInvoke(query.clj:33)
>>   at clojure.lang.RestFn.invoke(RestFn.java:397)
>>   at clojure.lang.Var.invoke(Var.java:411)
>>   at user$eval22.invoke(NO_SOURCE_FILE:1)
>>   at clojure.lang.Compiler.eval(Compiler.java:6511)
>>   at clojure.lang.Compiler.eval(Compiler.java:6501)
>>   at clojure.lang.Compiler.eval(Compiler.java:6477)
>>   at clojure.core$eval.invoke(core.clj:2797)
>>   at clojure.main$eval_opt.invoke(main.clj:297)
>>   at clojure.main$initialize.invoke(main.clj:316)
>>   at clojure.main$null_opt.invoke(main.clj:349)
>>   at clojure.main$main.doInvoke(main.clj:427)
>>   at clojure.lang.RestFn.invoke(RestFn.java:421)
>>   at clojure.lang.Var.invoke(Var.java:419)
>>   at clojure.lang.AFn.applyToHelper(AFn.java:163)
>>   at clojure.lang.Var.applyTo(Var.java:532)
>>   at clojure.main.main(main.java:37)
>> zsh: exit 1 lein run
>>
>> On Wednesday, January 23, 2013 7:02:43 AM UTC-8, Marc Limotte wrote:
>>
>>> Hi Mark.
>>>
>>> This was very much exploratory work, and a lot of it was just about
>>> learning the Spark paradigms. That being said, merging for future work
>>> seems appropriate, but it's not clear yet whether I will be pursuing this
>>> work further. I might wind up using Shark instead [I would love to use
>>> Cascading over Spark as well, if it existed].
>>>
>>> I'd like to know what issue you had in getting the code/examples to
>>> work. I had a couple of people try this out from scratch on clean systems,
>>> and it did work for them.
>>>
>>> Serializing the functions is necessary as far as I can tell. It would
>>> not work for me without this. As far as I can tell (this is largely
>>> guesswork), the problem is that each time the anonymous function is
>>> evaluated on a different JVM it gets a different class name (e.g. fn_123).
>>> There is a high likelihood that the name assigned on the master is not the
>>> same as the name on the task JVMs, so you wind up with a
>>> ClassNotFoundException.
>>>
>>> I don't know why this would work for you. If you have any insight on
>>> this, I would love to hear it.
>>>
>>> Marc
>>>
>>> On Tue, Jan 22, 2013 at 8:09 AM, Mark Hamstra <markh...@gmail.com> wrote:
>>>
>>>> Hmmm... a lot of duplicated work. Sorry I didn't get my stuff into a
>>>> more usable form for you, but I wasn't aware that anybody was even
>>>> interested in it. I've got some stuff that I want to rework a little, and
>>>> I'm still thinking through the best way to integrate with the new reducers
>>>> code in Clojure, but I haven't had the right combination of time and
>>>> motivation to finish off what I started and document it. At any rate, we
>>>> should work at merging the two efforts, since I don't see any need for
>>>> duplicate APIs.
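A minimal illustration of the failure mode Marc is guessing at, assuming the serializable.fn library that Cascalog uses; the class-name printing just shows why shipping a compiled closure by name cannot be relied on across JVMs.

;; A plain Clojure fn compiles to a class whose name depends on what else
;; this JVM happened to compile first, e.g. user$fn__123.
(def plain-inc #(+ 1 %))
(println (.getName (class plain-inc)))

;; Java serialization of plain-inc ships only that class name; a worker
;; JVM with a different compilation history has no such class, hence the
;; ClassNotFoundException Marc describes.

;; serializable.fn serializes the *source form* instead, and the worker
;; re-evaluates it on arrival, so no class lookup is needed.
(require '[serializable.fn :as sfn])
(def ser-inc (sfn/fn [x] (+ 1 x)))
(ser-inc 41) ;=> 42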
>>>>
>>>> In taking a quick first pass at it, I wasn't able to get your code and
>>>> examples to work, but I'm curious what your reasoning is for using
>>>> serializable.fn and avoiding use of clojure.core/fn or #(). I'm not sure
>>>> that is strictly necessary. For example, the following works just fine
>>>> with my API:
>>>>
>>>> (require 'spark.api.clojure.core)
>>>>
>>>> (wrappers!) ; one of the pieces I want to re-work, but allows functions
>>>> like map to work with either Clojure collections or RDDs
>>>>
>>>> (set-spark-context! "local[4]" "cljspark")
>>>>
>>>> (def rdd (parallelize [1 2 3 4]))
>>>>
>>>> (def mrdd1 (map #(+ 2 %) rdd))
>>>>
>>>> (def result1 (collect mrdd1))
>>>>
>>>> (def offset1 4)
>>>>
>>>> (def mrdd2 (map #(+ offset1 %) rdd))
>>>>
>>>> (def result2 (collect mrdd2))
>>>>
>>>> (def mrdd3 (map (let [offset2 5] #(+ offset2 %)) rdd))
>>>>
>>>> (def result3 (collect mrdd3))
>>>>
>>>> That will result in result1, result2, and result3 being [3 4 5 6],
>>>> [5 6 7 8], and [6 7 8 9] respectively, without any need for
>>>> serializable.fn.
>>>>
>>>> On Tuesday, January 22, 2013 6:55:53 AM UTC-8, Marc Limotte wrote:
>>>>
>>>>> A Clojure API for the Spark project. I am aware that there is another
>>>>> Clojure Spark wrapper project which looks very interesting; this project
>>>>> has similar goals. Also like that project, it is not absolutely complete,
>>>>> but it does have some documentation and examples, it is usable, and it
>>>>> should be easy enough to extend as needed. This is the result of about
>>>>> three weeks of work. It handles many of the initial problems, like
>>>>> serializing anonymous functions, converting back and forth between Scala
>>>>> Tuples and Clojure seqs, and converting RDDs to PairRDDs.
>>>>>
>>>>> The project is available here:
>>>>>
>>>>> https://github.com/TheClimateCorporation/clj-spark
>>>>>
>>>>> Thanks to The Climate Corporation for allowing me to release it. At
>>>>> Climate, we do the majority of our Big Data work with Cascalog (on top of
>>>>> Cascading). I was looking into Spark for some of the benefits that it
>>>>> provides. I suspect we will explore Shark next, and may work it into our
>>>>> processes for some of our more ad hoc/exploratory queries.
>>>>>
>>>>> I think it would be interesting to see a Cascading planner on top of
>>>>> Spark, which would enable Cascalog queries (mostly) for free. I suspect
>>>>> that might be a superior method of using Clojure on Spark.
>>>>>
>>>>> Marc Limotte
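As a footnote to Marc's list of interop chores, this is roughly what the Tuple conversion looks like; scala.Tuple2 is the real class Spark's pair RDDs use (so scala-library must be on the classpath), but tuple->vec and vec->tuple are hypothetical helper names for illustration, not functions from clj-spark.

(import 'scala.Tuple2)

;; Spark's pair APIs traffic in scala.Tuple2; Clojure code would rather
;; see a two-element vector.
(defn tuple->vec [^Tuple2 t] [(._1 t) (._2 t)])
(defn vec->tuple [[k v]] (Tuple2. k v))

(tuple->vec (vec->tuple [:a 1])) ;=> [:a 1]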