I've found Clojure to be an excellent fit for big data processing for a few 
reasons:

- the nature of big data is that it is often unstructured or 
semi-structured, and Clojure's immutable ad hoc map-based orientation is 
well suited to this
- much of the big data ecosystem is Java or JVM-based (and continues to 
be!) and Clojure interop with Java enables using all of the tooling and 
platforms in Clojure

That said, some Clojure libs in the space (like Cascalog, which you 
mentioned) have seemed quiet for the past few years. I personally would 
favor more active Java/JVM projects and simply interop with them from 
Clojure.

Here are a couple of issues that I've run into in Clojure -> Java interop 
in some of these big data platforms and their solutions:

1) Some big data Java frameworks want you to extend their base classes and 
provide generic parameters as you do. Clojure's class generation tools 
(gen-class, proxy, etc.) do not support providing generic parameters when 
extending Java types. The Java compiler, on the other hand, keeps generic 
parameter values in the compiled class as class metadata, which is how some 
of these big data systems (Apache Beam, for one) use them at runtime. The 
solution here is to write Java classes that delegate back to Clojure 
functions through vars.

2) These same frameworks often want you to serialize the functions you 
provide so that the code can be distributed throughout the cluster. Clojure 
disables serialization for the classes it generates, so make the Java 
classes from (1), the ones that pin down the generic parameters, 
Serializable, and instantiate them from Clojure by passing a Var bound to a 
function. Vars in Clojure are serializable, so doing things this way allows 
(refs to) Clojure functions to be distributed across the cluster.
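
To make the two points above a bit more concrete, here is a minimal Java 
sketch of the shape of such a delegating class. Everything in it is 
hypothetical: BaseFn stands in for a framework base class (it is not a real 
Beam type), and a plain map called REGISTRY plays the role of Clojure var 
lookup so that the sketch runs without Clojure on the classpath. In a real 
setup the field would presumably hold the Var itself and the method would 
invoke it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class DelegatingFnSketch {

    /** Hypothetical stand-in for a framework base class with generic parameters. */
    public abstract static class BaseFn<I, O> implements Serializable {
        public abstract O apply(I input);
    }

    /** Assumption: stands in for var lookup, mapping "ns/name" to a function. */
    public static final Map<String, Function<Object, Object>> REGISTRY =
            new ConcurrentHashMap<>();

    /**
     * Written in Java so the compiler records <String, Integer> in the class
     * metadata. Only the function's name is serialized, not the function itself.
     */
    public static class StringToIntFn extends BaseFn<String, Integer> {
        private static final long serialVersionUID = 1L;
        private final String fnName;

        public StringToIntFn(String fnName) {
            this.fnName = fnName;
        }

        @Override
        public Integer apply(String input) {
            // Resolve the function at call time, wherever this object ended up.
            Function<Object, Object> f = REGISTRY.get(fnName);
            if (f == null) {
                throw new IllegalStateException("No function registered as " + fnName);
            }
            return (Integer) f.apply(input);
        }
    }

    /** Reads back the generic arguments the way a framework could at runtime. */
    public static Type[] typeArgs(Class<?> c) {
        ParameterizedType p = (ParameterizedType) c.getGenericSuperclass();
        return p.getActualTypeArguments();
    }

    /** Serializes and deserializes an object, as shipping it to a worker would. */
    public static <T> T roundTrip(T obj) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            @SuppressWarnings("unchecked")
            T copy = (T) in.readObject();
            return copy;
        }
    }

    public static void main(String[] args) throws Exception {
        // "my.ns/count-chars" is a made-up name for illustration.
        REGISTRY.put("my.ns/count-chars", s -> ((String) s).length());

        for (Type t : typeArgs(StringToIntFn.class)) {
            System.out.println(t.getTypeName());
        }

        StringToIntFn shipped = roundTrip(new StringToIntFn("my.ns/count-chars"));
        System.out.println(shipped.apply("hello"));
    }
}
```

The design choice worth noting is that the serialized object carries only a 
reference to the function, so the receiving side has to be able to resolve 
that reference itself. With Clojure that would mean the namespace defining 
the function is loaded on the worker before the delegate is called.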

The key thing is that all of this is very simple to arrange in code once 
you get the basics down, but I've seen a few people stumble on these, not 
knowing the tricks. I realize my short descriptions here may leave some 
people wanting. I may try a blog post on these when time permits.

On Tuesday, July 2, 2019 at 11:07:49 AM UTC-5, orazio wrote:
>
> Hi All,
>
> I'm a newbie to Clojure/Big Data, and I'm starting with Hadoop.
> I have installed Hortonworks HDP 3.1.
> I have to design a Big Data layer that ingests large IoT datasets and 
> social media datasets, processes the data with MapReduce jobs, and 
> produces aggregations to store in HBase tables.
>
> For now, my focus is on the data processing issue. My question is: 
> Is Clojure a good choice for distributed data processing on Hadoop?
> I found Cascalog as a fully-featured data processing and querying library 
> for Clojure or Java. But are there any active maintainers for this 
> library? 
> Do you know of other excellent Clojure/Hadoop work in the community about 
> data processing? 
>
> I would appreciate some help.
>
> Orazio
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
