Sorry, I think you misunderstood. 
Spark can read from JDBC sources, so saying that accessing the data through 
beeline means it isn't a Spark application isn't really true.  Would you say 
the same if you were pulling data into Spark from Oracle or DB2? 
There are a couple of different design patterns and use cases where data could 
be stored in Hive yet your only access method is via a JDBC or Thrift/REST 
service.  Think also of implementations where compute and storage are separate 
clusters. 
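
For example, a minimal sketch of reading such a table over JDBC from Spark 
(the HiveServer2/STS endpoint, table name, and user are purely hypothetical, 
and the Hive JDBC driver would need to be on the classpath):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val df = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:hive2://gateway-host:10000/default")  // hypothetical HS2/STS endpoint
      .option("dbtable", "some_table")                           // hypothetical table
      .option("user", "etl_user")
      .load()
    df.show()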

WRT #2, that's not exactly what I meant by exposing the data… and there are 
limitations to the Thrift service…

> On Jun 21, 2016, at 5:44 PM, ayan guha <guha.a...@gmail.com> wrote:
> 
> 1. Yes, in the sense that you control the number of executors from the Spark 
> application config. 
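> A minimal sketch of that config (assuming Spark 1.x property names; the 
> values are just placeholders):
> 
>     // fixed number of executors for the whole application…
>     val conf = new org.apache.spark.SparkConf()
>       .set("spark.executor.instances", "4")
>     // …or let YARN grow and shrink them instead:
>     //   .set("spark.dynamicAllocation.enabled", "true")
>     //   .set("spark.shuffle.service.enabled", "true")
> 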
> 2. Any I/O is done from the executors (never on the driver, unless you 
> explicitly call collect()). For example, a connection to a DB is opened once 
> per worker (and used by the local executors). Also, if you run a reduceByKey 
> job and write to HDFS, you will find that the output files were written from 
> various executors. As for exposing the data to the world: use the Spark 
> Thrift Server (STS), which is a long-running Spark application (i.e. Spark 
> context) that can serve data from its RDDs. 
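> 
> A minimal sketch of that second point (the HDFS paths are hypothetical):
> 
>     val counts = sc.textFile("hdfs:///data/input")
>       .flatMap(_.split("\\s+"))
>       .map(word => (word, 1))
>       .reduceByKey(_ + _)
> 
>     counts.saveAsTextFile("hdfs:///data/output")  // part-* files written by the executors
>     // val local = counts.collect()               // only this would pull the data onto the driver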
> 
> Suppose I have a data source… like a couple of Hive tables and I access the 
> tables via beeline. (JDBC)
> -- This is NOT a Spark application, and no RDD is created. Beeline is just a 
> JDBC client tool; you use beeline to connect to HS2 or STS. 
> 
> In this case… Hive generates a map/reduce job and then would stream the 
> result set back to the client node where the RDD result set would be built.
> -- This is never true. When you connect to Hive from Spark, Spark reads the 
> Hive metastore and streams the data directly from HDFS. Hive MR jobs do not 
> play any role here, which is part of what makes Spark faster than Hive. 
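> 
> For example (a minimal sketch, assuming a Spark 1.x HiveContext and a purely 
> hypothetical table name):
> 
>     val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>     // table metadata comes from the metastore; the executors read the
>     // underlying files straight off HDFS, and no Hive MR job is launched
>     val orders = hiveContext.sql("SELECT * FROM sales.orders")
>     orders.count()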
> 
> HTH....
> 
> Ayan
> 
> On Wed, Jun 22, 2016 at 9:58 AM, Michael Segel <msegel_had...@hotmail.com> wrote:
> Ok, it's the end of the day and I’m trying to make sure I understand where 
> things are actually running.
> 
> I have an application where I have to query a bunch of sources, creating some 
> RDDs, and then I need to join those RDDs against some other lookup tables.
> 
> 
> YARN has two modes… client and cluster.
> 
> I get that in cluster mode… everything is running on the cluster.
> But in client mode, the driver runs on the edge node while the workers 
> run on the cluster.
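> 
> (For reference, a minimal sketch of how the two modes are selected from the 
> application itself on Spark 1.x; the app name is just a placeholder:)
> 
>     val conf = new org.apache.spark.SparkConf()
>       .setAppName("locality-check")
>       .setMaster("yarn-client")   // driver stays on this edge node; "yarn-cluster" would run it in the YARN AM
>     val sc = new org.apache.spark.SparkContext(conf)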
> 
> When I run a Spark SQL command that generates a new RDD, does the result set 
> live on the cluster with the workers and get referenced by the driver, or 
> does the result set get migrated to the driver running on the client? (I’m 
> pretty sure I know the answer, but it's never safe to assume anything…)
> 
> The follow-up questions:
> 
> 1) If I kill the app running the driver on the edge node… will that cause 
> YARN to free up the cluster’s resources? (In cluster mode… that doesn’t 
> happen.) What happens, and how quickly?
> 
> 1a) If using client mode… can I spin up and spin down the number of 
> executors on the cluster? (Assuming that when I kill an executor, any portion 
> of the RDDs associated with that executor is gone, while the Spark context 
> is still alive on the edge node? [Again assuming that the Spark context lives 
> with the driver.])
> 
> 2) Any I/O between my Spark job and the outside world… (e.g. walking through 
> the data set and writing it out to a file) will occur on the edge 
> node where the driver is located? (This may seem kinda silly, but what 
> happens when you want to expose the result set to the world…?)
> 
> Now for something slightly different…
> 
> Suppose I have a data source… like a couple of Hive tables, and I access the 
> tables via beeline. (JDBC)  In this case… Hive generates a map/reduce job and 
> then would stream the result set back to the client node where the RDD result 
> set would be built.  I realize that I could run Hive on top of Spark, but 
> that’s a separate issue. Here the RDD will reside on the client only.  (That 
> is, I could in theory run this as a single Spark instance.)
> If I were to run this on the cluster… then the result set would stream through 
> the beeline gateway and would reside back on the cluster, sitting in RDDs 
> within each executor?
> 
> I realize that these are silly questions, but I need to make sure that I know 
> the flow of the data and where it ultimately resides.  There really is a 
> method to my madness, and if I could explain it… these questions really would 
> make sense. ;-)
> 
> TIA,
> 
> -Mike
> 
> 
> 
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha
