think about side effects to any application, including Spark (memory consumption, etc.).
On 21 Nov 2016, at 18:26, Samy Dindane wrote:
Hi,
I'd like to extend the file:// file system and add some custom logic to the API
that lists files.
I think I need to extend FileSystem or LocalFileSystem from
org.apache.hadoop.fs, but I am not sure how to go about it exactly.
How to write a custom file system and make it usable by Spark?
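For reference, a minimal sketch of what such an extension might look like, assuming Hadoop's org.apache.hadoop.fs API; the class name, the "filtered" scheme, and the *.tmp filter are illustrative only:

import org.apache.hadoop.fs.{FileStatus, LocalFileSystem, Path}

// Hypothetical example: a local file system whose listing hides *.tmp files.
class FilteredFileSystem extends LocalFileSystem {
  // Expose the file system under its own scheme, e.g. "filtered://".
  override def getScheme: String = "filtered"

  // Custom listing logic goes here.
  override def listStatus(path: Path): Array[FileStatus] =
    super.listStatus(path).filterNot(_.getPath.getName.endsWith(".tmp"))
}

Spark should then pick it up through the usual Hadoop configuration, e.g. setting spark.hadoop.fs.filtered.impl to the class's fully qualified name, after which paths like filtered:///some/dir go through the custom listing.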
'll query it using Spark, you'll always get the latest version.
Daniel
On Thu, Nov 17, 2016 at 9:05 PM, Samy Dindane <s...@dindane.com> wrote:
Hi,
I have some data partitioned this way:
/data/year=2016/month=9/version=0
/data/year=2016/month=10/version=0
/data/year=2016/month=10/version=1
/data/year=2016/month=10/version=2
/data/year=2016/month=10/version=3
/data/year=2016/month=11/version=0
/data/year=2016/month=11/version=1
When usi
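One way to read that layout and keep only the newest version per month is a window over the partition columns; a rough sketch, assuming Parquet files under /data and an existing SparkSession named spark (both assumptions, not stated in the thread):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max}

// year, month and version become columns through partition discovery.
val df = spark.read.parquet("/data")

val byMonth = Window.partitionBy("year", "month")
val latest = df
  .withColumn("max_version", max(col("version")).over(byMonth))
  .where(col("version") === col("max_version"))
  .drop("max_version")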
Hi,
In order to impersonate a user when submitting a job with `spark-submit`, the
`proxy-user` option is used.
Is there a similar feature when running a job inside a Scala program? Maybe by
specifying some configuration value?
Thanks.
Samy
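One common way to do this outside of spark-submit is through Hadoop's UserGroupInformation; a hedged sketch, assuming the launching user is allowed to proxy other users (hadoop.proxyuser.* settings) and with "alice" standing in for the user to impersonate:

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.sql.SparkSession

val proxyUser = UserGroupInformation.createProxyUser(
  "alice", UserGroupInformation.getCurrentUser)

proxyUser.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    // Everything created inside doAs, including the SparkSession, acts as "alice".
    val spark = SparkSession.builder().appName("impersonated-job").getOrCreate()
    try {
      // job logic here
    } finally {
      spark.stop()
    }
  }
})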
records per batch number does not change.
BTW, how many kafka partitions are you using, and how many actually
have data for a given batch?
3 partitions.
All of them have more than maxRatePerPartition records (my topic has hundreds of millions of records).
On Thu, Oct 13, 2016 at 4:33 AM, Samy
This partially answers the question: http://stackoverflow.com/a/35449563/604041
On 10/04/2016 03:10 PM, Samy Dindane wrote:
Hi,
I have the following schema:
-root
|-timestamp
|-date
|-year
|-month
|-day
|-some_column
|-some_other_column
I'd like to achieve either of these:
like this:
https://i.imgsafe.org/e730492453.png
notice the cutover point
On Wed, Oct 12, 2016 at 11:00 AM, Samy Dindane wrote:
I am 100% sure.
println(conf.get("spark.streaming.backpressure.enabled")) prints true.
On 10/12/2016 05:48 PM, Cody Koeninger wrote:
Just to make 100% sur
Hi,
I'd like a specific job to fail if there's another instance of it already
running on the cluster (Spark Standalone in my case).
How to achieve this?
Thank you.
I am 100% sure.
println(conf.get("spark.streaming.backpressure.enabled")) prints true.
On 10/12/2016 05:48 PM, Cody Koeninger wrote:
Just to make 100% sure, did you set
spark.streaming.backpressure.enabled
to true?
On Wed, Oct 12, 2016 at 10:09 AM, Samy Dindane wrote:
On 10/
/2016 06:08 PM, Cody Koeninger wrote:
http://spark.apache.org/docs/latest/configuration.html
"This rate is upper bounded by the values
spark.streaming.receiver.maxRate and
spark.streaming.kafka.maxRatePerPartition if they are set (see
below)."
On Tue, Oct 11, 2016 at 10:57 AM, Samy Dinda
You don't overwrite fileA and fileB, because they already have correct data and offsets. You just write fileC.
Then once you've recovered, you go on about your job as normal, starting at topic-0 offsets 60, topic-1 offsets 66.
Clear as mud?
On Mon, Oct 10, 2016 at 5:36 PM, Samy Dindane wrote:
On
Hi,
Is it possible to limit the size of the batches returned by the Kafka consumer
for Spark Streaming?
I am asking because the first batch I get has hundreds of millions of records and it takes ages to process and checkpoint them.
Thank you.
Samy
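spark.streaming.kafka.maxRatePerPartition is the usual knob for bounding this, optionally combined with backpressure; a minimal sketch with illustrative values (the 10-second batch and the 10,000 records/s/partition cap are assumptions, not recommendations):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("rate-limited-stream")
  // Let Spark adapt the ingestion rate to the observed processing speed...
  .set("spark.streaming.backpressure.enabled", "true")
  // ...and cap each Kafka partition, which also bounds the very first batch.
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")

val ssc = new StreamingContext(conf, Seconds(10))
// With 3 partitions, a batch then holds at most 3 * 10000 * 10 = 300,000 records.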
nt in the
future.
Yes I don't see what you mean. :)
I really appreciate your help. Thanks a lot.
On Mon, Oct 10, 2016 at 12:12 PM, Samy Dindane wrote:
I just noticed that you're the author of the code I linked in my previous
email. :) It's helpful.
When using `foreachPartition`
Is that right?
Thank you.
Samy
On 10/10/2016 04:58 PM, Samy Dindane wrote:
Hi Cody,
I am writing a Spark job that reads records from a Kafka topic and writes them to the file system.
This would be straightforward if it weren't for the custom checkpointing logic
I want to have; Spark's
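The approach discussed in this thread boils down to pulling the offset ranges out of the direct stream and persisting them together with each partition's output; a hedged sketch, assuming a Kafka direct stream named stream and a hypothetical writeRecordsWithOffsets helper:

import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // Offset ranges must be captured on the driver, before any shuffle.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { records =>
    // For a direct stream, the partition index matches the offset range index.
    val range = offsetRanges(TaskContext.get.partitionId)
    // Hypothetical helper: store the records plus (topic, partition,
    // fromOffset, untilOffset) so recovery knows where to resume.
    writeRecordsWithOffsets(records, range)
  }
}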
ones running?
On 10/10/2016 11:19 AM, Samy Dindane wrote:
Hi,
I am writing a streaming job that reads a Kafka topic.
As far as I understand, Spark does a 1:1 mapping between its executors and
Kafka partitions.
In order to correctly implement my checkpoint logic, I'd like to know what
exactly happens when an executor crashes.
Also, is it possible to
Hi,
I have the following schema:
-root
|-timestamp
|-date
|-year
|-month
|-day
|-some_column
|-some_other_column
I'd like to achieve either of these:
1) Use the timestamp field to partition by year, month and day.
This looks weird though, as Spark wouldn't magically know how to lo
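A sketch of what option 1 could look like on the write side: derive year, month and day from the timestamp column and partition the output by them (the DataFrame name df and the output path are assumptions):

import org.apache.spark.sql.functions.{col, dayofmonth, month, year}

val withParts = df
  .withColumn("year",  year(col("timestamp")))
  .withColumn("month", month(col("timestamp")))
  .withColumn("day",   dayofmonth(col("timestamp")))

withParts.write
  .partitionBy("year", "month", "day")
  .parquet("/data/events")   // illustrative output path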