Okay, that begins to make more sense. For a Clojure coder, Cascalog is definitely a more attractive means of expressing workflows at a high level than Cascading is. And like I said before, I can get behind an effort to produce a higher-level syntax for expressing Spark workflows. Shark is nice as far as it goes, but it is not really a means to express iterative machine learning algorithms and other things that SQL was never intended to do (and nobody expects it to do). I also remain unconvinced that Cascading is the right foundation on which to build a higher-level abstraction for Spark -- Cascading can express more than SQL can, but I still don't think it is designed to effectively express everything that Spark can do. I'm much more interested in something like MLbase <http://www.mlbase.org/> as a higher-level interface to Spark (and other frameworks).
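To make that point concrete, here is a minimal sketch in plain Clojure of the kind of loop -- repeatedly refining an estimate over the same data set until it converges -- that a single declarative SQL query has no natural way to express. On Spark, the `points` collection would be a cached RDD so each pass stays in memory; the data and step size here are made up for illustration.

;; 1-D least-squares fit by gradient descent: an inherently iterative job.
(def points [[1.0 2.1] [2.0 3.9] [3.0 6.2] [4.0 7.8]]) ; [x y] samples

(defn gradient [w]
  ;; average of d/dw (w*x - y)^2 over the data set
  (/ (reduce + (map (fn [[x y]] (* 2 x (- (* w x) y))) points))
     (count points)))

(defn fit [w0 step eps]
  (loop [w w0]
    (let [g (gradient w)]
      (if (< (Math/abs g) eps)
        w
        (recur (- w (* step g)))))))

;; (fit 0.0 0.01 1e-6) ;=> approximately 1.99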
On Thursday, January 24, 2013 6:31:25 PM UTC-8, Marc Limotte wrote:
>
> I'm interested in the Cascading layer because that would enable Cascalog
> queries. I like the declarative (logic programming) form of Cascalog over
> the more imperative Spark style. SQL style is good for some use cases
> also, but I believe Cascalog is more composable than SQL.
>
> Cascading uses a [pluggable] planner to transform the workflow into some
> DAG. I believe there are two planners now: hadoop and local, and the idea
> of planning for additional non-hadoop engines in the future is a design
> goal. I think it would be very similar to the way that Shark takes the
> Hive AST and then creates a physical plan for Spark.
>
> Thanks for the error report with clj-spark. That error looks familiar,
> but I thought it was fixed. I'll look into reproducing it next week.
>
> Marc
>
> On Wed, Jan 23, 2013 at 11:24 AM, Mark Hamstra <markh...@gmail.com> wrote:
>
>> I certainly understand the exploration and learning motivation -- I did
>> much the same thing. At this point, I wouldn't consider either of our
>> efforts to be a complete or fully usable Clojure API for Spark, but there
>> are definitely ideas worth looking at in both if anyone gets to the point
>> of attempting to write a complete and robust API -- which I won't be doing
>> in the immediate future.
>>
>> I'm not sure that I am following you on Cascading and Spark. Are you
>> saying that you want to use the Cascading API to express workflows which
>> will then be transformed into a DAG of Spark stages and run as a Spark job?
>> I don't think that I agree with that strategy. While I can get behind
>> various higher-level abstractions for expressing Spark jobs (which is what
>> Shark is doing, after all), I don't find Cascading's API to be terribly
>> elegant: when writing a Spark job in Scala, I just don't find myself
>> thinking that it would be a whole lot easier if I could write the job in
>> Cascading. Part of that is because I'm not fluent in Cascading, but from
>> what I have seen and done with it, I don't lust after Cascading. The other
>> problem I have with the Cascading-to-Spark strategy is that Cascading has
>> been designed and implemented very much with Hadoop in mind, but Spark can
>> do quite a bit that Hadoop cannot. I don't think that Cascading itself
>> would be a good fit for expressing Spark jobs that really leverage the
>> advantages Spark has over Hadoop. None of that is meant to say that
>> Cascading isn't a step forward over writing jobs using Hadoop's Java API;
>> but at this point I just don't see Cascading as a step forward for writing
>> Spark jobs.
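For readers who have not seen the contrast Marc draws here, the following is a rough word-count sketch, assuming Cascalog's classic query operators (?<-, defmapcatop, c/count), with `sentences` standing for any source tap; the imperative version below it runs over plain Clojure seqs standing in for RDDs and mimics only the shape of Spark-style code.

(ns wordcount.core
  (:use cascalog.api)                    ; ?<-, stdout, defmapcatop
  (:require [cascalog.ops :as c]
            [clojure.string :as str]))

;; Declarative: state the relation you want; the planner picks the DAG.
(defmapcatop split-words [line]
  (str/split line #"\s+"))

(defn declarative-wordcount [sentences]
  (?<- (stdout) [?word ?count]
       (sentences ?line)
       (split-words ?line :> ?word)
       (c/count ?count)))

;; Imperative: spell out each transformation step in order.
(defn imperative-wordcount [lines]
  (->> lines
       (mapcat #(str/split % #"\s+"))
       frequencies))

Because the Cascalog query is just data handed to a planner, retargeting it to a new engine means writing a new planner (as Marc describes), not rewriting the queries.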
>>
>> Anyway, here's one of the early problems I ran into when trying to follow
>> your README:
>>
>> $ lein --version
>> Leiningen 2.0.0 on Java 1.7.0_09 OpenJDK 64-Bit Server VM
>> $ lein deps
>> $ lein compile
>> Compiling clj-spark.spark.functions
>> Compiling clj-spark.api
>> Compiling clj-spark.util
>> Compiling clj-spark.examples.query
>> $ lein run
>> 2013-01-23 11:17:10,436 WARN api:1 - JavaSparkContext local Simple Job /home/mark/Desktop/Scala/Spark/0.6 [] {}
>> Exception in thread "main" java.lang.ClassCastException: clojure.lang.PersistentVector cannot be cast to java.lang.CharSequence
>>   at clojure.string$split.invoke(string.clj:174)
>>   at clj_spark.api$spark_context.doInvoke(api.clj:18)
>>   at clojure.lang.RestFn.invoke(RestFn.java:805)
>>   at clj_spark.examples.query$_main.doInvoke(query.clj:33)
>>   at clojure.lang.RestFn.invoke(RestFn.java:397)
>>   at clojure.lang.Var.invoke(Var.java:411)
>>   at user$eval22.invoke(NO_SOURCE_FILE:1)
>>   at clojure.lang.Compiler.eval(Compiler.java:6511)
>>   at clojure.lang.Compiler.eval(Compiler.java:6501)
>>   at clojure.lang.Compiler.eval(Compiler.java:6477)
>>   at clojure.core$eval.invoke(core.clj:2797)
>>   at clojure.main$eval_opt.invoke(main.clj:297)
>>   at clojure.main$initialize.invoke(main.clj:316)
>>   at clojure.main$null_opt.invoke(main.clj:349)
>>   at clojure.main$main.doInvoke(main.clj:427)
>>   at clojure.lang.RestFn.invoke(RestFn.java:421)
>>   at clojure.lang.Var.invoke(Var.java:419)
>>   at clojure.lang.AFn.applyToHelper(AFn.java:163)
>>   at clojure.lang.Var.applyTo(Var.java:532)
>>   at clojure.main.main(main.java:37)
>> zsh: exit 1 lein run
>>
>> On Wednesday, January 23, 2013 7:02:43 AM UTC-8, Marc Limotte wrote:
>>
>>> Hi Mark.
>>>
>>> This was very much exploratory work, and a lot of it was just about
>>> learning the Spark paradigms. That being said, merging for future work
>>> seems appropriate, but it's not clear yet whether I will be pursuing this
>>> work further. I might wind up using Shark instead [I would love to use
>>> Cascading over Spark as well, if it existed].
>>>
>>> I'd like to know what issue you had in getting the code/examples to
>>> work. I had a couple of people try this out from scratch on clean systems,
>>> and it did work for them.
>>>
>>> Serializing the functions is necessary as far as I can tell. It would
>>> not work for me without this. As far as I can tell (this is largely
>>> guesswork), the problem is that each time the anonymous function is
>>> evaluated on a different JVM it gets a different class name (e.g. fn_123).
>>> There is a high likelihood that the name assigned on the master is not the
>>> same as the name on the task JVMs, so you wind up with a
>>> ClassNotFoundException.
>>>
>>> I don't know why this would work for you. If you have any insight on
>>> this, I would love to hear it.
>>>
>>> Marc
>>>
>>> On Tue, Jan 22, 2013 at 8:09 AM, Mark Hamstra <markh...@gmail.com> wrote:
>>>
>>>> Hmmm... a lot of duplicated work. Sorry I didn't get my stuff into a
>>>> more usable form for you, but I wasn't aware that anybody was even
>>>> interested in it. I've got some stuff that I want to rework a little, and
>>>> I'm still thinking through the best way to integrate with the new reducers
>>>> code in Clojure, but I haven't had the right combination of time and
>>>> motivation to finish off what I started and document it. At any rate, we
>>>> should work at merging the two efforts, since I don't see any need for
>>>> duplicate APIs.
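A minimal illustration of the failure mode Marc is guessing at, assuming the serializable.fn library that Cascalog uses; the class-name printing just shows why shipping a compiled closure by name cannot be relied on across JVMs.

;; A plain Clojure fn compiles to a class whose name depends on what else
;; this JVM happened to compile first, e.g. user$fn__123.
(def plain-inc #(+ 1 %))
(println (.getName (class plain-inc)))

;; Java serialization of plain-inc ships only that class name; a worker
;; JVM with a different compilation history has no such class, hence the
;; ClassNotFoundException Marc describes.

;; serializable.fn serializes the *source form* instead, and the worker
;; re-evaluates it on arrival, so no class lookup is needed.
(require '[serializable.fn :as sfn])
(def ser-inc (sfn/fn [x] (+ 1 x)))
(ser-inc 41) ;=> 42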
>>>>
>>>> In taking a quick first pass at it, I wasn't able to get your code and
>>>> examples to work, but I'm curious what your reasoning is for using
>>>> serializable.fn and avoiding use of clojure.core/fn or #(). I'm not sure
>>>> that is strictly necessary. For example, the following works just fine
>>>> with my API:
>>>>
>>>> (require 'spark.api.clojure.core)
>>>>
>>>> (wrappers!) ; one of the pieces I want to re-work, but allows functions
>>>> like map to work with either Clojure collections or RDDs
>>>>
>>>> (set-spark-context! "local[4]" "cljspark")
>>>>
>>>> (def rdd (parallelize [1 2 3 4]))
>>>>
>>>> (def mrdd1 (map #(+ 2 %) rdd))
>>>>
>>>> (def result1 (collect mrdd1))
>>>>
>>>> (def offset1 4)
>>>>
>>>> (def mrdd2 (map #(+ offset1 %) rdd))
>>>>
>>>> (def result2 (collect mrdd2))
>>>>
>>>> (def mrdd3 (map (let [offset2 5] #(+ offset2 %)) rdd))
>>>>
>>>> (def result3 (collect mrdd3))
>>>>
>>>> That will result in result1, result2, and result3 being [3 4 5 6],
>>>> [5 6 7 8], and [6 7 8 9] respectively, without any need for
>>>> serializable.fn.
>>>>
>>>> On Tuesday, January 22, 2013 6:55:53 AM UTC-8, Marc Limotte wrote:
>>>>
>>>>> A Clojure API for the Spark project. I am aware that there is another
>>>>> Clojure Spark wrapper project which looks very interesting; this project
>>>>> has similar goals. Also like that project, it is not absolutely complete,
>>>>> but it does have some documentation and examples, it is usable, and it
>>>>> should be easy enough to extend as needed. This is the result of about
>>>>> three weeks of work. It handles many of the initial problems, like
>>>>> serializing anonymous functions, converting back and forth between Scala
>>>>> Tuples and Clojure seqs, and converting RDDs to PairRDDs.
>>>>>
>>>>> The project is available here:
>>>>>
>>>>> https://github.com/TheClimateCorporation/clj-spark
>>>>>
>>>>> Thanks to The Climate Corporation for allowing me to release it. At
>>>>> Climate, we do the majority of our Big Data work with Cascalog (on top of
>>>>> Cascading). I was looking into Spark for some of the benefits that it
>>>>> provides. I suspect we will explore Shark next, and may work it into our
>>>>> processes for some of our more ad hoc/exploratory queries.
>>>>>
>>>>> I think it would be interesting to see a Cascading planner on top of
>>>>> Spark, which would enable Cascalog queries (mostly) for free. I suspect
>>>>> that might be a superior method of using Clojure on Spark.
>>>>>
>>>>> Marc Limotte
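As a footnote to Marc's list of interop chores, this is roughly what the Tuple conversion looks like; scala.Tuple2 is the real class Spark's pair RDDs use (so scala-library must be on the classpath), but tuple->vec and vec->tuple are hypothetical helper names for illustration, not functions from clj-spark.

(import 'scala.Tuple2)

;; Spark's pair APIs traffic in scala.Tuple2; Clojure code would rather
;; see a two-element vector.
(defn tuple->vec [^Tuple2 t] [(._1 t) (._2 t)])
(defn vec->tuple [[k v]] (Tuple2. k v))

(tuple->vec (vec->tuple [:a 1])) ;=> [:a 1]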