I've found Clojure to be an excellent fit for big data processing for a few reasons:
- the nature of big data is that it is often unstructured or semi-structured, and Clojure's immutable, ad hoc, map-based orientation is well suited to this
- much of the big data ecosystem is Java or JVM-based (and continues to be!), and Clojure's interop with Java lets you use all of that tooling and those platforms from Clojure

That said, some Clojure libs in the space (like Cascalog, which you mentioned) seem to have been quiet the past few years. I personally would favor more active Java/JVM projects and simply interop with them from Clojure.

Here are a couple of issues that I've run into in Clojure -> Java interop on some of these big data platforms, and their solutions:

1) Some big data Java frameworks want you to extend their base classes and provide generic parameters as you do. Clojure's class generation tools (gen-class, proxy, etc.) do not support providing generic parameters when extending Java types. The Java compiler, on the other hand, keeps generic parameter values in the compiled class as class metadata (which is how some of these big data systems -- Apache Beam, for one -- use them at runtime). The solution here is to write Java classes that delegate back to Clojure functions through vars.

2) These same frameworks often want you to serialize the functions you provide so that the code can be distributed throughout the cluster. Clojure disables serialization for the classes it generates, so you reuse the Java classes you wrote to pin down the generic parameters: make them Serializable and instantiate them from Clojure by passing a Var bound to a function. Vars in Clojure are serializable, so doing things this way allows (refs to) Clojure functions to be distributed across the cluster.

The key thing is that all of this is very simple to arrange in code once you get the basics down, but I've seen a few people stumble on these, not knowing the tricks. And I realize my short descriptions here may leave some people wanting.
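To make the two points above a bit more concrete, here is a rough, JDK-only sketch of the mechanics. Everything named in it is an assumption for illustration: `Transform` is a hypothetical stand-in for a framework base class (it is not Beam's API), and `SerializableFn` is a hypothetical stand-in for `clojure.lang.Var`, used so the example runs without the Clojure jar on the classpath. In real code the delegate field would hold a Var and the Java class would call `invoke` on it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

class DelegateSketch {

    // ASSUMPTION: hypothetical stand-in for a framework base class that has to
    // be extended with concrete generic parameters.
    abstract static class Transform<I, O> implements Serializable {
        abstract O apply(I input);
    }

    // ASSUMPTION: hypothetical stand-in for clojure.lang.Var, i.e. a
    // serializable handle that can be invoked with one argument.
    interface SerializableFn extends Serializable {
        Object invoke(Object arg);
    }

    // Plays the role of the Clojure function the Var would point at.
    static class LengthFn implements SerializableFn {
        @Override
        public Object invoke(Object arg) {
            return ((String) arg).length();
        }
    }

    // The small Java class you write by hand: it fixes the generic parameters
    // (point 1) and is Serializable while only delegating the work (point 2).
    static class StringToInt extends Transform<String, Integer> {
        private final SerializableFn fn;

        StringToInt(SerializableFn fn) {
            this.fn = fn;
        }

        @Override
        Integer apply(String input) {
            return (Integer) fn.invoke(input);
        }
    }

    // Reads the generic arguments javac recorded for the direct superclass.
    static Type[] typeArgs(Class<?> c) {
        ParameterizedType parent = (ParameterizedType) c.getGenericSuperclass();
        return parent.getActualTypeArguments();
    }

    // Serializes and deserializes an object in memory, roughly what a
    // framework does when it ships a function to workers.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T obj) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (T) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        StringToInt copy = roundTrip(new StringToInt(new LengthFn()));
        for (Type t : typeArgs(StringToInt.class)) {
            System.out.println(t); // the recorded generic arguments
        }
        System.out.println(copy.apply("hello"));
    }
}
```

From the Clojure side the idea would be to construct the Java class with a var, something like `(StringToInt. #'my-ns/my-fn)` (hypothetical names), assuming the constructor takes a Var instead of the stand-in interface used here.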
I may try a blog post on these when time permits.

On Tuesday, July 2, 2019 at 11:07:49 AM UTC-5, orazio wrote:
>
> Hi All,
>
> I'm newbie on Clojure/Big Data, and I'm starting with hadoop.
> I have installed Hortonworks HDP 3.1
> I have to design a Big Data Layer that ingests large iot datasets and
> social media datasets, process data with MapReduce job and produce
> aggregation to store on HBASE tables.
>
> For now, my focus is addressed on data processing issue. My question is:
> Is Clojure a good choice for distributed data processing on hadoop ?
> I found Cascalog as fully-featured data processing and querying library
> for Clojure or Java. But are there any active maintainers, for this library
> ?
> Do you know other excellent clojure/Hadoop work in the community, about
> data processing?
>
> I would appreciate some help.
>
> Orazio
>

--
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en
To view this discussion on the web visit https://groups.google.com/d/msgid/clojure/be2e6800-874d-4a30-8b6f-44aa32bd3901%40googlegroups.com.