Sorry, I think you misunderstood. Spark can read from JDBC sources, so the claim that accessing the data via beeline means it isn't a Spark application isn't really true. Would you say the same if you were pulling data into Spark from Oracle or DB2? There are a couple of different design patterns and use cases where the data lives in Hive yet your only access method is via a JDBC or Thrift/REST service. Think also of implementations where compute and storage are split across separate clusters.
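To make that concrete, here's a rough sketch (Spark 1.x-style Scala API; the URL, table name, and credentials are made-up placeholders) of pulling a table into Spark over JDBC. The pattern is the same whether the far end is Oracle, DB2, or a HiveServer2/STS endpoint, and the appropriate JDBC driver jar has to be on the classpath:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("jdbc-source-sketch"))
    val sqlContext = new SQLContext(sc)

    // The scan itself runs in the executors on the cluster,
    // not on the edge node where the driver sits.
    val df = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/SOMESVC")  // hypothetical endpoint
      .option("dbtable", "SOME_SCHEMA.SOME_TABLE")               // hypothetical table
      .option("user", "someuser")
      .option("password", "notreal")
      .load()

    df.count()  // executed on the cluster; only the count comes back to the driver

The resulting DataFrame is distributed across the executors like any other, so downstream joins against other RDDs and lookup tables behave exactly as they would for data read from HDFS.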
WRT #2, that's not exactly what I meant by exposing the data… and there are limitations to the Thrift service…

> On Jun 21, 2016, at 5:44 PM, ayan guha <guha.a...@gmail.com> wrote:
>
> 1. Yes, in the sense that you control the number of executors from the Spark application config.
>
> 2. Any I/O will be done from the executors (never on the driver, unless you explicitly call collect()). For example, a connection to a DB happens once for each worker (and is used by the local executors). Also, if you run a reduceByKey job and write to HDFS, you will find that a bunch of files were written from various executors. What happens when you want to expose the data to the world: Spark Thrift Server (STS), which is a long-running Spark application (i.e. a Spark context) that can serve data from RDDs.
>
> Suppose I have a data source… like a couple of Hive tables and I access the tables via beeline. (JDBC) -
> This is NOT a Spark application, and there is no RDD created. Beeline is just a JDBC client tool. You use beeline to connect to HS2 or STS.
>
> In this case… Hive generates a map/reduce job and then would stream the result set back to the client node where the RDD result set would be built.
> --
> This is never true. When you connect to Hive from Spark, Spark actually reads the Hive metastore and streams data directly from HDFS. Hive MR jobs do not play any role here, which is what makes Spark faster than Hive.
>
> HTH....
>
> Ayan
>
> On Wed, Jun 22, 2016 at 9:58 AM, Michael Segel <msegel_had...@hotmail.com> wrote:
> Ok, it's the end of the day and I'm trying to make sure I understand the locale of where things are running.
>
> I have an application where I have to query a bunch of sources, creating some RDDs, and then I need to join off the RDDs and some other lookup tables.
>
> YARN has two modes… client and cluster.
>
> I get that in cluster mode… everything is running on the cluster.
> But in client mode, the driver is running on the edge node while the workers are running on the cluster.
>
> When I run a Spark SQL command that generates a new RDD, does the result set live on the cluster with the workers and get referenced by the driver, or does the result set get migrated to the driver running on the client? (I'm pretty sure I know the answer, but it's never safe to assume anything…)
>
> The follow-up questions:
>
> 1) If I kill the app running the driver on the edge node… will that cause YARN to free up the cluster's resources? (In cluster mode… that doesn't happen.) What happens and how quickly?
>
> 1a) If using client mode… can I spin up and spin down the number of executors on the cluster? (Assuming that when I kill an executor, any portion of the RDDs associated with that executor is gone, but the Spark context is still alive on the edge node? [Again assuming that the Spark context lives with the driver.])
>
> 2) Any I/O between my Spark job and the outside world… (e.g. walking through the data set and writing out a data set to a file) will occur on the edge node where the driver is located? (This may seem kinda silly, but what happens when you want to expose the result set to the world…?)
>
> Now for something slightly different…
>
> Suppose I have a data source… like a couple of Hive tables and I access the tables via beeline. (JDBC) In this case… Hive generates a map/reduce job and then would stream the result set back to the client node where the RDD result set would be built.
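(Aside, just to anchor Ayan's earlier point: when Spark itself reads a Hive table there is no Hive map/reduce job; Spark hits the metastore for the table metadata and the executors scan the files on HDFS directly. A minimal sketch, using the Spark 1.x HiveContext API and a hypothetical table name:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("hive-direct-read-sketch"))
    val hiveContext = new HiveContext(sc)  // talks to the Hive metastore for table metadata

    // "somedb.some_table" is hypothetical. Spark plans and runs the scan itself;
    // the executors read the table's files from HDFS, no Hive MR job is launched.
    val df = hiveContext.sql("SELECT col_a, col_b FROM somedb.some_table WHERE col_b > 100")
    df.count()

That is a different access path from going through beeline/HS2, which is the case I'm asking about below.)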
> I realize that I could run Hive on top of Spark, but that's a separate issue. Here the RDD will reside on the client only. (That is, I could in theory run this as a single Spark instance.)
> If I were to run this on the cluster… then the result set would stream through the beeline gateway and would reside back on the cluster, sitting in RDDs within each executor?
>
> I realize that these are silly questions, but I need to make sure that I know the flow of the data and where it ultimately resides. There really is a method to my madness, and if I could explain it… these questions really would make sense. ;-)
>
> TIA,
>
> -Mike
>
>
> --
> Best Regards,
> Ayan Guha