how to get partition column info in Data Source V2 writer

2019-12-17 Thread aakash aakash
Hi Spark dev folks,

First of all, kudos on the new Data Source V2: the API looks simple, and it
makes it easy to develop a new data source and use it.

In my current work, I am trying to implement a new Data Source V2 writer
with Spark 2.3, and I was wondering how I can get the information about the
partition-by columns. I see that they are passed to Data Source V1 from
DataFrameWriter, but not to V2.
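
For reference, this is the kind of DataFrameWriter usage I mean; a minimal
sketch, with a hypothetical V2 format name and a placeholder output path:

import org.apache.spark.sql.SparkSession

// Illustration of "partition by columns": the columns given to
// DataFrameWriter.partitionBy before writing through a custom source.
// "com.example.mysource" and the output path are placeholders.
object PartitionByExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partitionBy-example").getOrCreate()
    val df = spark.range(10).selectExpr("id", "id % 3 AS bucket")

    df.write
      .format("com.example.mysource")   // hypothetical V2 data source
      .partitionBy("bucket")            // the info the V2 writer needs to see
      .save("/tmp/partition-by-example")
  }
}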


Thanks,
Aakash


Re: how to get partition column info in Data Source V2 writer

2019-12-17 Thread aakash aakash
Thanks Andrew!

It seems there are drastic changes in 3.0; I am going through them.

-Aakash

On Tue, Dec 17, 2019 at 11:01 AM Andrew Melo  wrote:

> Hi Aakash
>
> On Tue, Dec 17, 2019 at 12:42 PM aakash aakash 
> wrote:
>
>> Hi Spark dev folks,
>>
>> First of all, kudos on the new Data Source V2: the API looks simple, and it
>> makes it easy to develop a new data source and use it.
>>
>> In my current work, I am trying to implement a new Data Source V2 writer
>> with Spark 2.3, and I was wondering how I can get the information about the
>> partition-by columns. I see that they are passed to Data Source V1 from
>> DataFrameWriter, but not to V2.
>>
>
> Not directly related to your question, but just so you're aware, the DSv2 API
> evolved from 2.3 to 2.4 and then again from 2.4 to 3.0.
>
> Cheers
> Andrew
>
>
>>
>>
>> Thanks,
>> Aakash
>>
>


Re: how to get partition column info in Data Source V2 writer

2019-12-18 Thread aakash aakash
Thanks Wenchen!

On Wed, Dec 18, 2019 at 7:25 PM Wenchen Fan  wrote:

> Hi Aakash,
>
> You can try the latest DS v2 with the 3.0 preview; the API is in quite a
> stable shape now. With the latest API, a Writer is created from a
> Table, and the Table has the partitioning information.
>
> Thanks,
> Wenchen
>
> On Wed, Dec 18, 2019 at 3:22 AM aakash aakash 
> wrote:
>
>> Thanks Andrew!
>>
>> It seems there are drastic changes in 3.0; I am going through them.
>>
>> -Aakash
>>
>> On Tue, Dec 17, 2019 at 11:01 AM Andrew Melo 
>> wrote:
>>
>>> Hi Aakash
>>>
>>> On Tue, Dec 17, 2019 at 12:42 PM aakash aakash 
>>> wrote:
>>>
>>>> Hi Spark dev folks,
>>>>
>>>> First of all, kudos on the new Data Source V2: the API looks simple, and it
>>>> makes it easy to develop a new data source and use it.
>>>>
>>>> In my current work, I am trying to implement a new Data Source V2 writer
>>>> with Spark 2.3, and I was wondering how I can get the information about the
>>>> partition-by columns. I see that they are passed to Data Source V1 from
>>>> DataFrameWriter, but not to V2.
>>>>
>>>
>>> Not directly related to your question, but just so you're aware, the DSv2 API
>>> evolved from 2.3 to 2.4 and then again from 2.4 to 3.0.
>>>
>>> Cheers
>>> Andrew
>>>
>>>
>>>>
>>>>
>>>> Thanks,
>>>> Aakash
>>>>
>>>
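
To make Wenchen's point above concrete, here is a minimal sketch of a Table
that exposes its partitioning through the 3.0 DSv2 interfaces. MyTable, the
schema, and the partition column event_date are placeholders, and the exact
method signatures shifted between the 3.0 previews and the final release:

import java.util

import org.apache.spark.sql.connector.catalog.{SupportsWrite, Table, TableCapability}
import org.apache.spark.sql.connector.expressions.{Expressions, Transform}
import org.apache.spark.sql.connector.write.{LogicalWriteInfo, WriteBuilder}
import org.apache.spark.sql.types.StructType

// Hypothetical table backing a custom connector.
class MyTable extends Table with SupportsWrite {
  override def name(): String = "my_table"

  override def schema(): StructType =
    StructType.fromDDL("id BIGINT, event_date DATE, payload STRING")

  // This is where the write path sees the partitioning: Spark asks the Table
  // for its partition transforms instead of handing the partitionBy columns
  // to the writer directly, as Data Source V1 did.
  override def partitioning(): Array[Transform] =
    Array(Expressions.identity("event_date"))

  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_WRITE)

  override def newWriteBuilder(info: LogicalWriteInfo): WriteBuilder = {
    // The actual WriteBuilder / BatchWrite implementation is omitted; the
    // point of this sketch is only where partitioning() lives.
    throw new UnsupportedOperationException("write path omitted in this sketch")
  }
}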


graceful shutdown for slave host for Spark Standalone Cluster

2020-04-20 Thread aakash aakash
Hi,

We use EC2 to run batch Spark jobs to filter and process our data, and
sometimes we need to replace a host or deploy a new fleet. Since we run the
driver in cluster mode, an abrupt host shutdown would be detrimental. We also
use some native code to make sure our table is modified by only one customer
at a time, and that does not allow us to use supervise mode.

I was wondering whether the Standalone cluster has a way to shut down slaves
gracefully: allow the currently running drivers and executors to finish while
accepting no new requests from the master. Then we could implement a sidecar
that lets the host shut down once all running drivers and executors have
finished (see the sketch below).
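
A minimal sketch of what such a sidecar could look like. It only detects when
the host is idle by looking for the standard standalone-mode executor and
cluster-mode driver JVMs; it does not stop the master from assigning new work
to the worker, which is the missing piece being asked about here. The polling
interval and the shutdown hand-off are placeholders:

import scala.sys.process._

// Wait until no Spark executor or cluster-mode driver JVM is left on this
// host, then signal that the instance may be replaced or terminated.
object WaitForSparkIdle {
  private val sparkJvmMains = Seq(
    "org.apache.spark.executor.CoarseGrainedExecutorBackend", // executors
    "org.apache.spark.deploy.worker.DriverWrapper"            // cluster-mode drivers
  )

  private def sparkWorkRunning(): Boolean = {
    val jps = "jps -l".!!                  // list running JVMs with their main classes
    sparkJvmMains.exists(jps.contains)
  }

  def main(args: Array[String]): Unit = {
    while (sparkWorkRunning()) {
      println("Executors or drivers still running; waiting...")
      Thread.sleep(30000)                  // poll every 30 seconds
    }
    println("Host is idle; safe to shut down.")
    // Hand off to the fleet-management tooling here.
  }
}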

Thanks for your help and suggestion!

Regards,
Aakash


Fwd: using Spark Streaming with Kafka 0.9/0.10

2016-11-15 Thread aakash aakash
Re-posting it at dev group.

Thanks and Regards,
Aakash


-- Forwarded message --
From: aakash aakash 
Date: Mon, Nov 14, 2016 at 4:10 PM
Subject: using Spark Streaming with Kafka 0.9/0.10
To: user-subscr...@spark.apache.org


Hi,

I am planning to use Spark Streaming to consume messages from Kafka 0.9. I
have a couple of questions regarding this:


   - I see the APIs are annotated with @Experimental. Can you please tell me
   when we are planning to make them production ready?
   - Currently, I see we are using Kafka 0.10, and I am curious why we did not
   start with Kafka 0.9 instead. As I understand it, the 0.10 Kafka client is
   not compatible with the 0.9 client, since there are some changes in the
   consumer API arguments.
   - The current API extends InputDStream, and as per the documentation this
   means the RDD will be generated by running a service/thread only on the
   driver node instead of the worker nodes. Can you please explain why this is
   done and what is required to make sure it runs on the worker nodes?


Thanks in advance!

Regards,
Aakash
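
For reference, this is roughly how the 0.10 direct stream discussed above is
created with the spark-streaming-kafka-0-10 integration; the broker address,
topic, and group id below are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-0-10-direct-stream-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder Kafka connection settings.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,   // spread partitions evenly over executors
      ConsumerStrategies.Subscribe[String, String](Seq("example-topic"), kafkaParams)
    )

    // The per-record work below runs on the executors, not the driver.
    stream.map(record => (record.key, record.value)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}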


Re: using Spark Streaming with Kafka 0.9/0.10

2016-11-15 Thread aakash aakash
> You can use the 0.8 artifact to consume from a 0.9 broker

We are currently using "Camus
<http://docs.confluent.io/1.0/camus/docs/intro.html>" in production, and one
of the main goals of moving to Spark is to use the new Kafka Consumer API of
Kafka 0.9. In our case we need the security provisions available in 0.9,
which is why we cannot use the 0.8 client.

> Where are you reading documentation indicating that the direct stream
> only runs on the driver?

I might be wrong here, but I see that the new
<http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html>
Kafka + Spark Streaming code extends InputDStream
<http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.InputDStream>
and its documentation says:

*Input streams that can generate RDDs from new data by running a
service/thread only on the driver node (that is, without running a receiver
on worker nodes)*

Thanks and regards,
Aakash Pradeep


On Tue, Nov 15, 2016 at 2:55 PM, Cody Koeninger  wrote:

> It'd probably be worth no longer marking the 0.8 interface as
> experimental.  I don't think it's likely to be subject to active
> development at this point.
>
> You can use the 0.8 artifact to consume from a 0.9 broker
>
> Where are you reading documentation indicating that the direct stream
> only runs on the driver?  It runs consumers on the worker nodes.
>
>
> On Tue, Nov 15, 2016 at 10:58 AM, aakash aakash 
> wrote:
> > Re-posting it at dev group.
> >
> > Thanks and Regards,
> > Aakash
> >
> >
> > -- Forwarded message --
> > From: aakash aakash 
> > Date: Mon, Nov 14, 2016 at 4:10 PM
> > Subject: using Spark Streaming with Kafka 0.9/0.10
> > To: user-subscr...@spark.apache.org
> >
> >
> > Hi,
> >
> > I am planning to use Spark Streaming to consume messages from Kafka 0.9.
> > I have a couple of questions regarding this:
> >
> > I see the APIs are annotated with @Experimental. Can you please tell me
> > when we are planning to make them production ready?
> > Currently, I see we are using Kafka 0.10, and I am curious why we did not
> > start with Kafka 0.9 instead. As I understand it, the 0.10 Kafka client is
> > not compatible with the 0.9 client, since there are some changes in the
> > consumer API arguments.
> > The current API extends InputDStream, and as per the documentation this
> > means the RDD will be generated by running a service/thread only on the
> > driver node instead of the worker nodes. Can you please explain why this
> > is done and what is required to make sure it runs on the worker nodes?
> >
> >
> > Thanks in advance!
> >
> > Regards,
> > Aakash
> >
>


Re: using Spark Streaming with Kafka 0.9/0.10

2016-11-15 Thread aakash aakash
Thanks for the link and info, Cody!


Regards,
Aakash


On Tue, Nov 15, 2016 at 7:47 PM, Cody Koeninger  wrote:

> Generating / defining an RDD is not the same thing as running the
> compute() method of an RDD.  The direct stream definitely runs Kafka
> consumers on the executors.
>
> If you want more info, the blog post and video linked from
> https://github.com/koeninger/kafka-exactly-once refer to the 0.8
> implementation, but the general design is similar for the 0.10
> version.
>
> I think the likelihood of an official release supporting 0.9 is fairly
> slim at this point; it's a year out of date and wouldn't be a drop-in
> dependency change.
>
>
> On Tue, Nov 15, 2016 at 5:50 PM, aakash aakash 
> wrote:
> >
> >
> >> You can use the 0.8 artifact to consume from a 0.9 broker
> >
> > We are currently using "Camus" in production, and one of the main goals
> > of moving to Spark is to use the new Kafka Consumer API of Kafka 0.9. In
> > our case we need the security provisions available in 0.9, which is why
> > we cannot use the 0.8 client.
> >
> >> Where are you reading documentation indicating that the direct stream
> >> only runs on the driver?
> >
> > I might be wrong here, but I see that the new Kafka + Spark Streaming
> > code extends InputDStream, and its documentation says: Input streams
> > that can generate RDDs from new data by running a service/thread only on
> > the driver node (that is, without running a receiver on worker nodes)
> >
> > Thanks and regards,
> > Aakash Pradeep
> >
> >
> > On Tue, Nov 15, 2016 at 2:55 PM, Cody Koeninger 
> wrote:
> >>
> >> It'd probably be worth no longer marking the 0.8 interface as
> >> experimental.  I don't think it's likely to be subject to active
> >> development at this point.
> >>
> >> You can use the 0.8 artifact to consume from a 0.9 broker
> >>
> >> Where are you reading documentation indicating that the direct stream
> >> only runs on the driver?  It runs consumers on the worker nodes.
> >>
> >>
> >> On Tue, Nov 15, 2016 at 10:58 AM, aakash aakash
> >> wrote:
> >> > Re-posting it at dev group.
> >> >
> >> > Thanks and Regards,
> >> > Aakash
> >> >
> >> >
> >> > -- Forwarded message --
> >> > From: aakash aakash 
> >> > Date: Mon, Nov 14, 2016 at 4:10 PM
> >> > Subject: using Spark Streaming with Kafka 0.9/0.10
> >> > To: user-subscr...@spark.apache.org
> >> >
> >> >
> >> > Hi,
> >> >
> >> > I am planning to use Spark Streaming to consume messages from Kafka
> >> > 0.9. I have a couple of questions regarding this:
> >> >
> >> > I see the APIs are annotated with @Experimental. Can you please tell
> >> > me when we are planning to make them production ready?
> >> > Currently, I see we are using Kafka 0.10, and I am curious why we did
> >> > not start with Kafka 0.9 instead. As I understand it, the 0.10 Kafka
> >> > client is not compatible with the 0.9 client, since there are some
> >> > changes in the consumer API arguments.
> >> > The current API extends InputDStream, and as per the documentation
> >> > this means the RDD will be generated by running a service/thread only
> >> > on the driver node instead of the worker nodes. Can you please explain
> >> > why this is done and what is required to make sure it runs on the
> >> > worker nodes?
> >> >
> >> >
> >> > Thanks in advance!
> >> >
> >> > Regards,
> >> > Aakash
> >> >
> >
> >
>
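
To illustrate the distinction Cody draws above: compute() runs on the driver
and only defines an RDD for each batch, while the work inside that RDD runs
on the executors. ConstantStream below is a made-up example, not Spark's
Kafka implementation:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

class ConstantStream(ssc0: StreamingContext) extends InputDStream[String](ssc0) {
  override def start(): Unit = ()   // driver-side setup; nothing to do here
  override def stop(): Unit = ()    // driver-side teardown

  override def compute(validTime: Time): Option[RDD[String]] = {
    // Called on the driver once per batch interval: a cheap RDD definition.
    val rdd = context.sparkContext
      .parallelize(Seq("a", "b", "c"))
      .map(_.toUpperCase)           // this closure executes on the executors
    Some(rdd)
  }
}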