I may be wrong here, but beeline is basically a JDBC client tool. So you
"connect" to STS and/or HS2 using beeline.

Spark connecting over JDBC is a different discussion and is in no way related
to beeline. When you read data from a DB (Oracle, DB2, etc.) you do not use
beeline; you use a JDBC connection to the DB directly from Spark.
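
For example, reading straight from an RDBMS with Spark's JDBC data source looks
roughly like this (the host, table and credentials below are made up, and I'm
using the 1.x SQLContext API):

  // Spark opens its own JDBC connection(s) to the database and builds a
  // DataFrame from the table; beeline is nowhere in the picture.
  val ordersDF = sqlContext.read
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")   // hypothetical
    .option("dbtable", "SALES.ORDERS")                       // hypothetical
    .option("user", "app_user")
    .option("password", "secret")
    .load()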

Re #2: agreed on the Thrift limitations, but I'm not sure there is any other
mechanism for sharing data with other systems such as APIs/BI tools.
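
For what it's worth, what a BI tool or a thin API layer typically does against
STS (or HS2) is plain JDBC over that Thrift endpoint, e.g. something like this
in Scala (the host/port and table are made up):

  import java.sql.DriverManager

  // The HiveServer2 JDBC driver speaks Thrift under the covers, so STS and
  // HS2 look identical to the client.
  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection(
    "jdbc:hive2://sts-host:10015/default", "hive_user", "")  // hypothetical endpoint
  val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM some_table")
  while (rs.next()) println(rs.getLong(1))
  conn.close()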

On Wed, Jun 22, 2016 at 1:47 PM, Michael Segel <msegel_had...@hotmail.com>
wrote:

>
> Sorry, I think you misunderstood.
> Spark can read from JDBC sources, so saying that using beeline as a way to
> access data is not a Spark application isn't really true.  Would you say
> the same if you were pulling data into Spark from Oracle or DB2?
> There are a couple of different design patterns and use cases where data
> could be stored in Hive yet your only access method is via a JDBC or
> Thrift/REST service.  Think also of compute / storage cluster
> implementations.
>
> WRT #2, that's not exactly what I meant by exposing the data… and there are
> limitations to the Thrift service…
>
> On Jun 21, 2016, at 5:44 PM, ayan guha <guha.a...@gmail.com> wrote:
>
> 1. Yes, in the sense that you control the number of executors from the
> Spark application config (sketch below).
> 2. Any I/O is done from the executors (never on the driver, unless you
> explicitly call collect()). For example, a connection to a DB is opened once
> per worker (and shared by the local executors). Also, if you run a reduceByKey
> job and write to HDFS, you will find a bunch of part files written by the
> various executors. When you want to expose the data to the world, use the
> Spark Thrift Server (STS), a long-running Spark application (i.e. a long-lived
> Spark context) that can serve data from its RDDs.
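>
> A rough sketch of both points in Scala (paths and sizes below are made up):
>
>   import org.apache.spark.{SparkConf, SparkContext}
>
>   // 1. Executor count/size comes from the application config.
>   val conf = new SparkConf()
>     .setAppName("example")
>     .set("spark.executor.instances", "4")          // hypothetical sizing
>     .set("spark.executor.memory", "4g")
>   val sc = new SparkContext(conf)
>
>   // 2. The reduceByKey and the write run on the executors; HDFS ends up with
>   // one part-NNNNN file per output partition, not a single file on the driver.
>   sc.textFile("hdfs:///data/events")                // hypothetical input path
>     .map(line => (line.split(",")(0), 1L))
>     .reduceByKey(_ + _)
>     .saveAsTextFile("hdfs:///data/event_counts")    // hypothetical output path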
>
> Suppose I have a data source… like a couple of hive tables and I access
> the tables via beeline. (JDBC)  -
> This is NOT a Spark application, and no RDD is created. Beeline is just a
> JDBC client tool; you use beeline to connect to HS2 or STS.
>
> In this case… Hive generates a map/reduce job and then would stream the
> result set back to the client node where the RDD result set would be built.
>  --
> This is never true. When you connect to Hive from Spark, Spark actually reads
> the Hive metastore and streams the data directly from HDFS. Hive MR jobs play
> no role here, which is what makes Spark faster than Hive.
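>
> In code that is just the HiveContext path, e.g. (table name made up):
>
>   import org.apache.spark.sql.hive.HiveContext
>
>   // Spark looks up the table layout in the Hive metastore and then reads the
>   // underlying HDFS files with its own readers; no Hive MR job is launched.
>   val hiveCtx = new HiveContext(sc)
>   val df = hiveCtx.sql("SELECT * FROM mydb.some_hive_table")   // hypothetical table
>   println(df.count())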
>
> HTH....
>
> Ayan
>
> On Wed, Jun 22, 2016 at 9:58 AM, Michael Segel <msegel_had...@hotmail.com>
> wrote:
>
>> OK, it's the end of the day and I'm trying to make sure I understand
>> where things are actually running.
>>
>> I have an application where I have to query a bunch of sources, creating
>> some RDDs, and then I need to join those RDDs against some other lookup tables.
>>
>>
>> YARN has two deploy modes… client and cluster.
>>
>> I get it that in cluster mode… everything is running on the cluster.
>> But in client mode, the driver is running on the edge node while the
>> workers are running on the cluster.
>>
>> When I run a SparkSQL command that generates a new RDD, does the result
>> set live on the cluster with the workers and get referenced by the
>> driver, or does the result set get migrated to the driver running on the
>> client? (I'm pretty sure I know the answer, but it's never safe to assume
>> anything…)
>>
>> The follow-up questions:
>>
>> 1) If I kill the  app running the driver on the edge node… will that
>> cause YARN to free up the cluster’s resources? (In cluster mode… that
>> doesn’t happen) What happens and how quickly?
>>
>> 1a) If using client mode… can I spin up and spin down the number of
>> executors on the cluster? (Assuming that when I kill an executor, any
>> portion of the RDDs associated with that executor is gone, but the
>> Spark context is still alive on the edge node? [again assuming that the
>> Spark context lives with the driver.])
>>
>> 2) Any I/O between my spark job and the outside world… (e.g. walking
>> through the data set and writing out a data set to a file) will occur on
>> the edge node where the driver is located?  (This may seem kinda silly, but
>> what happens when you want to expose the result set to the world… ? )
>>
>> Now for something slightly different…
>>
>> Suppose I have a data source… like a couple of hive tables and I access
>> the tables via beeline. (JDBC)  In this case… Hive generates a map/reduce
>> job and then would stream the result set back to the client node where the
>> RDD result set would be built.  I realize that I could run Hive on top of
>> spark, but that’s a separate issue. Here the RDD will reside on the client
>> only.  (That is I could in theory run this as a single spark instance.)
>> If I were to run this on the cluster… then the result set would stream
>> through the beeline gateway and would reside back on the cluster sitting in
>> RDDs within each executor?
>>
>> I realize that these are silly questions but I need to make sure that I
>> know the flow of the data and where it ultimately resides.  There really is
>> a method to my madness, and if I could explain it… these questions really
>> would make sense. ;-)
>>
>> TIA,
>>
>> -Mike
>>
>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>


-- 
Best Regards,
Ayan Guha
