think about side effects to any application, including Spark (memory consumption, etc.).
On 21 Nov 2016, at 18:26, Samy Dindane wrote:
Hi,
I'd like to extend the file:// file system and add some custom logic to the API
that lists files.
I think I need to extend FileSystem or LocalFileSystem from
org.apache.hadoop.fs, but I am not sure how to go about it exactly.
How to write a custom file system and make it usable by Spark?
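For reference, a minimal sketch of what such an extension might look like, assuming Hadoop's org.apache.hadoop.fs API; the class name, the "filtered" scheme, and the *.tmp filter are illustrative only:

import org.apache.hadoop.fs.{FileStatus, LocalFileSystem, Path}

// Hypothetical example: a local file system whose listing hides *.tmp files.
class FilteredFileSystem extends LocalFileSystem {
  // Expose the file system under its own scheme, e.g. "filtered://".
  override def getScheme: String = "filtered"

  // Custom listing logic goes here.
  override def listStatus(path: Path): Array[FileStatus] =
    super.listStatus(path).filterNot(_.getPath.getName.endsWith(".tmp"))
}

Spark should then pick it up through the usual Hadoop configuration, e.g. setting spark.hadoop.fs.filtered.impl to the class's fully qualified name, after which paths like filtered:///some/dir go through the custom listing.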
'll query it using Spark, you'll always get the latest version.
Daniel
On Thu, Nov 17, 2016 at 9:05 PM, Samy Dindane <s...@dindane.com> wrote:
Hi,
I have some data partitioned this way:
/data/year=2016/month=9/version=0
/data/year=2016/month=10/version=0
/data/year=2016/month=10/version=1
/data/year=2016/month=10/version=2
/data/year=2016/month=10/version=3
/data/year=2016/month=11/version=0
/data/year=2016/month=11/version=1
When usi
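One way to read that layout and keep only the newest version per month is a window over the partition columns; a rough sketch, assuming Parquet files under /data and an existing SparkSession named spark (both assumptions, not stated in the thread):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max}

// year, month and version become columns through partition discovery.
val df = spark.read.parquet("/data")

val byMonth = Window.partitionBy("year", "month")
val latest = df
  .withColumn("max_version", max(col("version")).over(byMonth))
  .where(col("version") === col("max_version"))
  .drop("max_version")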
Hi,
In order to impersonate a user when submitting a job with `spark-submit`, the
`proxy-user` option is used.
Is there a similar feature when running a job inside a Scala program? Maybe by
specifying some configuration value?
Thanks.
Samy
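One common way to do this outside of spark-submit is through Hadoop's UserGroupInformation; a hedged sketch, assuming the launching user is allowed to proxy other users (hadoop.proxyuser.* settings) and with "alice" standing in for the user to impersonate:

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.sql.SparkSession

val proxyUser = UserGroupInformation.createProxyUser(
  "alice", UserGroupInformation.getCurrentUser)

proxyUser.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    // Everything created inside doAs, including the SparkSession, acts as "alice".
    val spark = SparkSession.builder().appName("impersonated-job").getOrCreate()
    try {
      // job logic here
    } finally {
      spark.stop()
    }
  }
})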
records per batch number does not change.
BTW, how many kafka partitions are you using, and how many actually
have data for a given batch?
3 partitions.
All of them have more than maxRatePerPartition records (my topic has hundreds of millions of records).
On Thu, Oct 13, 2016 at 4:33 AM, Samy
This partially answers the question: http://stackoverflow.com/a/35449563/604041
On 10/04/2016 03:10 PM, Samy Dindane wrote:
Hi,
I have the following schema:
-root
|-timestamp
|-date
|-year
|-month
|-day
|-some_column
|-some_other_column
I'd like to achieve either of these:
like this:
https://i.imgsafe.org/e730492453.png
notice the cutover point
On Wed, Oct 12, 2016 at 11:00 AM, Samy Dindane wrote:
I am 100% sure.
println(conf.get("spark.streaming.backpressure.enabled")) prints true.
On 10/12/2016 05:48 PM, Cody Koeninger wrote:
Just to make 100% sur
Hi,
I'd like a specific job to fail if there's another instance of it already
running on the cluster (Spark Standalone in my case).
How to achieve this?
Thank you.
I am 100% sure.
println(conf.get("spark.streaming.backpressure.enabled")) prints true.
On 10/12/2016 05:48 PM, Cody Koeninger wrote:
Just to make 100% sure, did you set
spark.streaming.backpressure.enabled
to true?
On Wed, Oct 12, 2016 at 10:09 AM, Samy Dindane wrote:
On 10/
/2016 06:08 PM, Cody Koeninger wrote:
http://spark.apache.org/docs/latest/configuration.html
"This rate is upper bounded by the values
spark.streaming.receiver.maxRate and
spark.streaming.kafka.maxRatePerPartition if they are set (see
below)."
On Tue, Oct 11, 2016 at 10:57 AM, Samy Dinda
You don't overwrite fileA and fileB, because they already have correct data and offsets. You just write fileC.
Then once you've recovered, you go on about your job as normal, starting at topic-0 offsets 60, topic-1 offsets 66.
Clear as mud?
On Mon, Oct 10, 2016 at 5:36 PM, Samy Dindane wrote:
On
Hi,
Is it possible to limit the size of the batches returned by the Kafka consumer
for Spark Streaming?
I am asking because the first batch I get has hundreds of millions of records and it takes ages to process and checkpoint them.
Thank you.
Samy
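spark.streaming.kafka.maxRatePerPartition is the usual knob for bounding this, optionally combined with backpressure; a minimal sketch with illustrative values (the 10-second batch and the 10,000 records/s/partition cap are assumptions, not recommendations):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("rate-limited-stream")
  // Let Spark adapt the ingestion rate to the observed processing speed...
  .set("spark.streaming.backpressure.enabled", "true")
  // ...and cap each Kafka partition, which also bounds the very first batch.
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")

val ssc = new StreamingContext(conf, Seconds(10))
// With 3 partitions, a batch then holds at most 3 * 10000 * 10 = 300,000 records.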
nt in the
future.
Yes I don't see what you mean. :)
I really appreciate your help. Thanks a lot.
On Mon, Oct 10, 2016 at 12:12 PM, Samy Dindane wrote:
I just noticed that you're the author of the code I linked in my previous
email. :) It's helpful.
When using `foreachPartition`
Is that right?
Thank you.
Samy
On 10/10/2016 04:58 PM, Samy Dindane wrote:
Hi Cody,
I am writing a Spark job that reads records from a Kafka topic and writes them to the file system.
This would be straightforward if it weren't for the custom checkpointing logic
I want to have; Spark's
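The approach discussed in this thread boils down to pulling the offset ranges out of the direct stream and persisting them together with each partition's output; a hedged sketch, assuming a Kafka direct stream named stream and a hypothetical writeRecordsWithOffsets helper:

import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // Offset ranges must be captured on the driver, before any shuffle.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { records =>
    // For a direct stream, the partition index matches the offset range index.
    val range = offsetRanges(TaskContext.get.partitionId)
    // Hypothetical helper: store the records plus (topic, partition,
    // fromOffset, untilOffset) so recovery knows where to resume.
    writeRecordsWithOffsets(records, range)
  }
}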
ones running?
On 10/10/2016 11:19 AM, Samy Dindane wrote:
Hi,
I am writing a streaming job that reads a Kafka topic.
As far as I understand, Spark does a 1:1 mapping between its executors and
Kafka partitions.
In order to correctly implement my checkpoint logic, I'd like to know what
exactly happens when an executor crashes.
Also, is it possible to
Hi,
I have the following schema:
-root
|-timestamp
|-date
|-year
|-month
|-day
|-some_column
|-some_other_column
I'd like to achieve either of these:
1) Use the timestamp field to partition by year, month and day.
This looks weird though, as Spark wouldn't magically know how to lo
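A sketch of what option 1 could look like on the write side: derive year, month and day from the timestamp column and partition the output by them (the DataFrame name df and the output path are assumptions):

import org.apache.spark.sql.functions.{col, dayofmonth, month, year}

val withParts = df
  .withColumn("year",  year(col("timestamp")))
  .withColumn("month", month(col("timestamp")))
  .withColumn("day",   dayofmonth(col("timestamp")))

withParts.write
  .partitionBy("year", "month", "day")
  .parquet("/data/events")   // illustrative output path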