As Thad says, Apache Spark is probably the farsighted choice for batch processing. But I'm worried about its learning curve and the time it would take. I don't have much time to develop my MapReduce algorithms, and I would like to use a consolidated tool that is widely used in production.

Recently I also came across Scalding (https://github.com/twitter/scalding). Scalding is written in Scala and built on top of Cascading <http://www.cascading.org/>, a Java library that abstracts away low-level Hadoop details. It is used in production by many companies such as eBay, Sky, Twitter, LinkedIn, Spotify, etc. (https://github.com/twitter/scalding/wiki/Powered-By). Scalding seems better maintained and supported than Cascalog, and with the many examples available on GitHub, it seems to have a smoother learning curve than Apache Spark.

If you know Scalding, what do you think of it? Any suggestions are welcome.
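To make the kind of batch job under discussion concrete, here is a minimal sketch of the map / group / reduce shape of an aggregation, written in plain Python rather than Scalding, Cascalog, or Spark. It only illustrates the logic that any of those tools would distribute across a cluster; the field names `device_id` and `value` and all function names are assumptions made up for the example, not part of any of those libraries.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record shape: one reading per record, e.g. from an IoT feed.
Record = Dict[str, object]


def map_phase(records: Iterable[Record]) -> Iterator[Tuple[str, float]]:
    """Emit one (key, value) pair per input record."""
    for record in records:
        yield str(record["device_id"]), float(record["value"])


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Group all values by key (the framework does this step on a cluster)."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Collapse each group of values into count, sum, and average."""
    return {
        key: {"count": len(values), "sum": sum(values), "avg": sum(values) / len(values)}
        for key, values in groups.items()
    }


def aggregate(records: Iterable[Record]) -> Dict[str, Dict[str, float]]:
    """Run the three phases in sequence on an in-memory collection."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    sample = [
        {"device_id": "a", "value": 1.0},
        {"device_id": "a", "value": 3.0},
        {"device_id": "b", "value": 5.0},
    ]
    for device, stats in sorted(aggregate(sample).items()):
        print(device, stats)
```

In a real job the records would be read from HDFS and the per-key results written to an HBase table; those parts depend on the tool chosen and are left out here.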
On Thursday, 4 July 2019 at 16:43:05 UTC+2, Thad Guidry wrote:
>
> "Batch" - doing things in chunks
> "Processing" - THE WORLD :-) because it means so many different things to
> so many folks (including your boss)
>
> Without a doubt, you will love Apache Spark for your batch processing and
> writing Spark programs to conquer any world you are building.
> Spend time installing a Spark standalone deployment and then use its powerful
> Spark Shell <https://spark.apache.org/docs/latest/quick-start.html> (the
> feeling of the Clojure REPL !!)
> If you just want to jump into a public cluster and try Spark, then I
> would suggest Databricks <https://databricks.com/spark/about>.
> Spend time reading about the features under the Libraries drop-down menu on the
> Apache Spark website <https://spark.apache.org/>.
>
> You might even be encouraged enough to write an official API in Clojure
> for Apache Spark within a year! (win-win)
>
> One note of caution: if you are building something for the long term, you will
> eventually need data versioning, ACID transactions, and schema
> evolution. For this I use Delta Lake <https://delta.io/> (not Datomic),
> since it's fully compatible with Spark.
>
> Best of luck!
> Thad
> https://www.linkedin.com/in/thadguidry/
>
> On Thu, Jul 4, 2019 at 3:22 AM orazio <orazio...@gmail.com> wrote:
>
>> Hi @atdixon and Thad, thanks for your help.
>>
>> Here are more details about my project.
>> My big data layer is inspired by the Lambda architecture. The pipeline
>> includes the following layers, with the tool chosen for each:
>> - *NiFi* for *data ingestion* and for publishing data/messages to a Kafka
>> topic.
>> - *Kafka* as *message broker*, which with Kafka Connect allows me to store
>> data in MongoDB (MongoDB sink, 1 day retention period) and HDFS
>> (HDFS sink, 1 year retention period)
>> - *Real-time processing* with *MongoDB*, using its built-in query engine,
>> which provides extensive querying, filtering, and searching abilities.
>> - *Batch processing* of data stored on HDFS, which performs data
>> aggregation and stores the results in an HBase table. *?* The question is:
>> which tool do you suggest for processing the data stored on HDFS?
>> - *Serving layer* with *HBase/Phoenix* to store and give access to the
>> batch views.
>>
>> Now I'm asking for your help to choose *the most appropriate tool to
>> execute batch jobs (MapReduce)* that will aggregate the data.
>> Nathan Marz suggests Clojure/Cascalog. Do you know of other excellent
>> Clojure/Hadoop work in the community for data processing?
>> If you know of particularly appropriate tools, I could also consider
>> other work/libraries outside the Clojure community.
>>
>> Thanks
>>
>> On Wednesday, 3 July 2019 at 14:56:09 UTC+2, Thad Guidry wrote:
>>>
>>> "The best code is never written"
>>>
>>> https://zeppelin.apache.org/
>>> https://nifi.apache.org/
>>>
>>> Thad
>>> https://www.linkedin.com/in/thadguidry/
>>>
>>> On Tue, Jul 2, 2019 at 11:07 AM orazio <orazio...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I'm a newbie to Clojure/Big Data, and I'm starting with Hadoop.
>>>> I have installed Hortonworks HDP 3.1.
>>>> I have to design a Big Data layer that ingests large IoT datasets and
>>>> social media datasets, processes the data with MapReduce jobs, and
>>>> produces aggregations to store in HBase tables.
>>>>
>>>> For now, my focus is on the data processing issue. My question
>>>> is: is Clojure a good choice for distributed data processing on Hadoop?
>>>> I found Cascalog, a fully-featured data processing and querying library
>>>> for Clojure or Java.
>>>> But are there any active maintainers for this library?
>>>> Do you know of other excellent Clojure/Hadoop work in the community
>>>> for data processing?
>>>>
>>>> I would appreciate some help.
>>>>
>>>> Orazio
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "Clojure" group.
>>>> To post to this group, send email to clo...@googlegroups.com
>>>> Note that posts from new members are moderated - please be patient with
>>>> your first post.
>>>> To unsubscribe from this group, send email to
>>>> clo...@googlegroups.com
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/clojure?hl=en
>>>> ---
>>>> You received this message because you are subscribed to the Google
>>>> Groups "Clojure" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to clo...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/clojure/fbc26ffb-5f00-46a7-bf33-7a899f1ffead%40googlegroups.com
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.