Re: Clojure is a good choice for Big Data? Which clojure/Hadoop work to use?

orazio Thu, 04 Jul 2019 10:07:07 -0700

Probably, as Thad says, the farsighted choice for batch processing is 
Apache Spark. But I'm worried about its learning curve and the time it 
takes; I don't have much time to develop my MapReduce algorithms, and I 
would like to use a mature tool that is widely used in production. 
Recently I also came across Scalding 
(https://github.com/twitter/scalding). Scalding is written in Scala and 
built on top of Cascading <http://www.cascading.org/>, a Java library 
that abstracts away low-level Hadoop details. It is used in production 
by many companies such as eBay, Sky, Twitter, LinkedIn, Spotify, etc. 
(https://github.com/twitter/scalding/wiki/Powered-By).
Scalding seems better maintained and supported than Cascalog. With the many 
examples available on GitHub, it seems to have a gentler learning curve than 
Apache Spark. If you know Scalding, what do you think about it? Any 
suggestions are welcome.
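
To make the kind of job under discussion concrete, here is a minimal sketch 
of a map/shuffle/reduce aggregation in plain Python. It does not use Hadoop, 
Spark, Scalding, or Cascalog, and it is not taken from any of those 
projects; it only illustrates the shape of the computation that any of them 
would distribute across a cluster. The record fields (`device_id`, `value`) 
and all function names are hypothetical, chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record shape: {"device_id": str, "value": float}
Record = Dict[str, object]


def map_phase(records: Iterable[Record]) -> Iterator[Tuple[str, float]]:
    """Map: emit one (key, value) pair per input record."""
    for record in records:
        yield str(record["device_id"]), float(record["value"])


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Reduce: compute count, sum, and average for each key."""
    return {
        key: {
            "count": len(values),
            "sum": sum(values),
            "avg": sum(values) / len(values),
        }
        for key, values in groups.items()
    }


def aggregate(records: Iterable[Record]) -> Dict[str, Dict[str, float]]:
    """Run the three phases in sequence on an in-memory collection."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    sample = [
        {"device_id": "a", "value": 1.0},
        {"device_id": "b", "value": 4.0},
        {"device_id": "a", "value": 3.0},
    ]
    print(aggregate(sample))
```

In a real deployment the framework would take care of the shuffle step, of 
reading from HDFS, and of writing the per-key results to the serving store; 
only the map and reduce logic would be written by hand.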



On Thursday, 4 July 2019 at 16:43:05 UTC+2, Thad Guidry wrote:
>
> "Batch" - doing things in chunks
> "Processing" - THE WORLD :-)  because it means so many different things to 
> so many folks (including your boss)
>
> Without a doubt, you will love Apache Spark for your batch processing and 
> for writing Spark programs to conquer any world you are building.
> Take the time to install Spark in standalone deploy mode and then use its 
> powerful Spark Shell <https://spark.apache.org/docs/latest/quick-start.html> 
> (it has the feel of a Clojure REPL!)
> If you just want to jump into a public cluster and try Spark, then I 
> would suggest Databricks <https://databricks.com/spark/about>. 
> Spend time reading about the features under the Libraries drop-down menu on 
> the Apache Spark website <https://spark.apache.org/>.
>
> You might even be encouraged enough to write an official API in Clojure 
> for Apache Spark within a year!  (win-win)
>
> One note of caution: if you are building something for the long term, you 
> will eventually need data versioning, ACID transactions, and schema 
> evolution. For this I use Delta Lake <https://delta.io/> (not Datomic), 
> since it's fully compatible with Spark.
>
> Best of luck!
> Thad
> https://www.linkedin.com/in/thadguidry/
>
>
> On Thu, Jul 4, 2019 at 3:22 AM orazio <orazio...@gmail.com> wrote:
>
>> Hi @atdixon and Thad, thanks for your help.
>>
>> Here are more details about my project.
>> My big data layer is inspired by the Lambda architecture. The pipeline 
>> includes the following layers, with the tool chosen for each:
>> - *NiFi* for *data ingestion* and for publishing data/messages to a Kafka 
>> topic.
>> - *Kafka* as the *message broker*, which, with Kafka Connect, lets me store 
>> data in MongoDB (MongoDB sink, 1-day retention period) and HDFS 
>> (HDFS sink, 1-year retention period).
>> - *Real-time processing* with *MongoDB*, using its built-in query engine, 
>> which provides extensive querying, filtering, and searching abilities.
>> - *Batch processing* of data stored on HDFS, which performs data 
>> aggregation and stores the results in an HBase table. *?* The question is: 
>> which tool do you suggest for processing the data stored on HDFS?
>> - *Serving layer* with *HBase/Phoenix* to store the batch views and allow 
>> access to them.
>>
>> Now I'm asking for your help in choosing *the most appropriate tool to 
>> execute batch jobs (MapReduce)* that will have to aggregate data.
>> Nathan Marz suggests Clojure/Cascalog. Do you know of other excellent 
>> Clojure/Hadoop work in the community for data processing?
>> If you know of some particularly appropriate tools, I could also consider 
>> other work/libraries outside the Clojure community.
>>
>> Thanks
>>
>>
>>
>> On Wednesday, 3 July 2019 at 14:56:09 UTC+2, Thad Guidry wrote:
>>>
>>> "The best code is never written"
>>>
>>> https://zeppelin.apache.org/ 
>>> https://nifi.apache.org/  
>>>  
>>> Thad
>>> https://www.linkedin.com/in/thadguidry/
>>>
>>>
>>> On Tue, Jul 2, 2019 at 11:07 AM orazio <orazio...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I'm a newbie to Clojure/Big Data, and I'm starting with Hadoop.
>>>> I have installed Hortonworks HDP 3.1.
>>>> I have to design a Big Data layer that ingests large IoT datasets and 
>>>> social media datasets, processes the data with MapReduce jobs, and 
>>>> produces aggregations to store in HBase tables.
>>>>
>>>> For now, my focus is on the data processing issue. My question 
>>>> is: is Clojure a good choice for distributed data processing on Hadoop?
>>>> I found Cascalog, a fully featured data processing and querying library 
>>>> for Clojure or Java. But are there any active maintainers for this 
>>>> library?
>>>> Do you know of other excellent Clojure/Hadoop work in the community 
>>>> for data processing?
>>>>
>>>> I would appreciate some help.
>>>>
>>>> Orazio
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google
>>>> Groups "Clojure" group.
>>>> To post to this group, send email to clo...@googlegroups.com
>>>> Note that posts from new members are moderated - please be patient with 
>>>> your first post.
>>>> To unsubscribe from this group, send email to
>>>> clo...@googlegroups.com
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/clojure?hl=en
>>>> --- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "Clojure" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to clo...@googlegroups.com.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/clojure/fbc26ffb-5f00-46a7-bf33-7a899f1ffead%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/clojure/fbc26ffb-5f00-46a7-bf33-7a899f1ffead%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>
>

