Just to add one concrete example regarding the HDFS dependency, have a look at checkpointing: https://spark.apache.org/docs/1.6.2/streaming-programming-guide.html#checkpointing For example, in Spark Streaming you cannot do any window operation in a cluster without checkpointing to HDFS (or S3).
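A minimal sketch of what that looks like in Scala on Spark 1.6 (the socket source and the checkpoint path are illustrative assumptions, not from this thread): reduceByKeyAndWindow with an inverse-reduce function keeps running state across batches, which is exactly what makes a checkpoint directory on a fault-tolerant file system (HDFS, S3) mandatory:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WindowedCounts {
      // Illustrative location; any fault-tolerant FS (HDFS, S3) works
      val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("WindowedCounts")
        val ssc  = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint(checkpointDir) // must be set before any window/stateful op

        val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
        // The inverse-reduce variant carries state across batches, which is
        // what forces checkpointing to a fault-tolerant file system.
        val counts = words.map(w => (w, 1))
          .reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(10))
        counts.print()
        ssc
      }

      def main(args: Array[String]): Unit = {
        // On restart, rebuild the context from the checkpoint rather than anew
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }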
Ofir Manor
Co-Founder & CTO | Equalum
Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Thu, Aug 25, 2016 at 11:13 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Hi Kant,
>
> I trust the following will be of use.
>
> Big Data depends on the Hadoop ecosystem from whichever angle one looks at it.
>
> At the heart of it, and with reference to the points you raised about HDFS, one needs a working knowledge of the Hadoop core system, including HDFS, the MapReduce algorithm and YARN, whether one uses them or not. After all, Big Data is all about horizontal scaling with a master and nodes (as opposed to vertical scaling, like SQL Server running on a single host) and distributed data (by default, data is replicated three times on different nodes for scalability and availability).
>
> Other members, including Sean, have described the limits of how far one can operate Spark in its own space. If you are going to deal with data (data in motion and data at rest), then you will need to interact with some form of storage, and HDFS and compatible file systems like S3 are the natural choices.
>
> ZooKeeper is not just about high availability. It is used in Spark Streaming with Kafka, and it is also used with Hive for concurrency. It is also a distributed locking system.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On 25 August 2016 at 20:52, Mark Hamstra <m...@clearstorydata.com> wrote:
>
>> s/paying a role/playing a role/
>>
>> On Thu, Aug 25, 2016 at 12:51 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>
>>> One way you can start to make this make more sense, Sean, is to exploit the code/data duality, so that the non-distributed data you are sending out from the driver is actually paying a role more like code (or at least parameters). What is sent from the driver to an Executor is then used (typically as seeds or parameters) to execute some procedure on the Worker node that generates the actual data on the Workers. After that, you proceed to execute in a more typical fashion with Spark, using the now-instantiated distributed data.
>>>
>>> But I don't get the sense that this meta-programming-ish style is really what the OP was aiming at.
>>>
>>> On Thu, Aug 25, 2016 at 12:39 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> Without a distributed storage system, your application can only create data on the driver and send it out to the workers, and collect data back from the workers. You can't read or write data in a distributed way. There are use cases for this, but they are pretty limited (unless you're running on one machine).
>>>>
>>>> I can't really imagine a serious use of (distributed) Spark without (distributed) storage; put another way, I don't think many apps exist that don't read/write data.
>>>>
>>>> The premise here is not just replication, but partitioning data across compute resources. With a distributed file system, your big input exists across a bunch of machines and you can send the work to the pieces of data.
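To make Sean's point concrete, a minimal Scala sketch contrasting the two situations (the HDFS path is an illustrative assumption, not from the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    object DriverVsDistributed {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("DriverVsDistributed"))

        // Without distributed storage: the data starts life on the driver,
        // is shipped out to the executors, and results are collected back.
        val fromDriver = sc.parallelize(1 to 1000000)
        val doubledSum = fromDriver.map(_ * 2L).reduce(_ + _)

        // With a distributed FS: the input already lives across machines, and
        // the work is sent to the partitions of the data (path illustrative).
        val fromHdfs  = sc.textFile("hdfs:///data/big-input")
        val lineCount = fromHdfs.count()

        println(s"sum=$doubledSum lines=$lineCount")
        sc.stop()
      }
    }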
>>>> On Thu, Aug 25, 2016 at 7:57 PM, kant kodali <kanth...@gmail.com> wrote:
>>>>
>>>>> @Mich I understand why I would need ZooKeeper: it is there for fault tolerance, given that Spark is a master-slave architecture, and when a master goes down ZooKeeper will run a leader-election algorithm to elect a new leader. However, DevOps teams hate ZooKeeper; they would be much happier with etcd & Consul, and it looks like if we use the Mesos scheduler we should be able to drop ZooKeeper.
>>>>>
>>>>> HDFS I am still trying to understand why I would need for Spark. I understand the purpose of distributed file systems in general, but not in the context of Spark, since many people say you can run a distributed Spark cluster in standalone mode, and I am not sure what the pros/cons are of doing it that way. In the Hadoop world I understand that one of the reasons HDFS exists is replication; in other words, if we write some data to HDFS it stores each block across different nodes, so that if one node goes down the block can still be retrieved from the other nodes. In the context of Spark I am not really sure, because 1) I am new, and 2) the Spark paper says it doesn't replicate data; instead it stores the lineage (all the transformations) so that it can reconstruct it.
>>>>>
>>>>> On Thu, Aug 25, 2016 9:18 AM, Mich Talebzadeh mich.talebza...@gmail.com wrote:
>>>>>
>>>>>> You can use Spark on Oracle as a query tool.
>>>>>>
>>>>>> It all depends on the mode of the operation.
>>>>>>
>>>>>> If you are running Spark with yarn-client/cluster then you will need YARN. It comes as part of Hadoop core (HDFS, MapReduce and YARN).
>>>>>>
>>>>>> I have not tried installing YARN without installing Hadoop.
>>>>>>
>>>>>> What is the overriding reason to have Spark on its own?
>>>>>>
>>>>>> You can use Spark in local or standalone mode if you do not want Hadoop core.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>> On 24 August 2016 at 21:54, kant kodali <kanth...@gmail.com> wrote:
>>>>>>
>>>>>> What do I lose if I run Spark without using HDFS or ZooKeeper? Which of them is almost a must in practice?
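As a footnote to the lineage point raised above, a minimal Scala sketch (names illustrative): Spark records how each RDD partition was derived rather than replicating the partitions themselves, and toDebugString prints that recipe, which is what gets replayed to rebuild partitions lost with a failed executor. The original input, though, still has to live somewhere fault-tolerant, which is where HDFS comes back in:

    import org.apache.spark.{SparkConf, SparkContext}

    object LineageDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("LineageDemo"))

        val base    = sc.parallelize(1 to 100, 4)
        val derived = base.map(_ * 2).filter(_ % 3 == 0)

        // Spark stores no replica of `derived`; it stores the recipe printed
        // below. If an executor dies, lost partitions are recomputed from it.
        println(derived.toDebugString)
        sc.stop()
      }
    }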