Hi,
I have a weird issue: a Spark Streaming application fails once or twice a day
with java.io.OptionalDataException. It happens when deserializing a task.
The problem appeared after migrating from Spark 1.6 to Spark 2.2; the cluster
runs the latest CDH distro in YARN mode.
I have no clue what to blame and w
I am trying to connect to the AWS managed ES service using the Spark ES Connector,
but am not able to.
I am passing es.nodes and es.port along with es.nodes.wan.only set to true,
but it fails with the error below:
34 ERROR NetworkClient: Node [x.x.x.x:443] failed (The server x.x.x.x
failed to respond); no ot
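For reference, a minimal sketch of how those settings are usually passed to the elasticsearch-spark connector when pointing at a WAN-only endpoint such as AWS managed ES; the domain endpoint and index name are placeholders, and `es.net.ssl` is an assumption since port 443 implies HTTPS:
```
import org.apache.spark.sql.SparkSession

// Minimal sketch; endpoint and index are placeholders, es.net.ssl is an assumption.
val spark = SparkSession.builder()
  .appName("es-wan-only")
  .config("es.nodes", "my-domain.us-east-1.es.amazonaws.com") // placeholder AWS ES endpoint
  .config("es.port", "443")
  .config("es.nodes.wan.only", "true") // talk only to the published endpoint, not the data nodes
  .config("es.net.ssl", "true")        // assumption: port 443 means HTTPS
  .getOrCreate()

val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .load("myindex/mytype") // placeholder index/type
```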
I have no idea about this
---Original---
From: "serkan ta?0?6"
Date: 2017/7/31 16:42:59
To: "pandees waran";"??"<1427357...@qq.com>;
Cc: "user@spark.apache.org";
Subject: Re: Spark parquet file read problem !
Thank you very much.
Schema merge fixed the structure problem b
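As a point of reference, a minimal sketch of the Parquet schema-merge option the reply appears to refer to; the app name and path are placeholders:
```
import org.apache.spark.sql.SparkSession

// Minimal sketch: read Parquet with schema merging enabled; the path is a placeholder.
val spark = SparkSession.builder().appName("parquet-merge-schema").getOrCreate()
val df = spark.read
  .option("mergeSchema", "true")
  .parquet("hdfs:///data/events")
```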
Calling cache/persist fails all our jobs (I have posted 2 threads on this),
and we're giving up hope of finding a solution.
So I'd like to find a workaround for that:
If I save an RDD to hdfs and read it back, can I use it in more than one
operation?
Example: (using cache)
// do a whole bunch
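A minimal sketch of that save-and-reload workaround, assuming an existing SparkContext `sc`; the path, the record type `MyRecord`, and `op1`/`op2` are placeholders standing in for the real pipeline:
```
// Minimal sketch: write the RDD to HDFS once, then reuse it from disk
// instead of cache/persist. Path, record type, and op1/op2 are placeholders.
myrdd.saveAsObjectFile("hdfs:///tmp/myrdd-snapshot")
val reloaded = sc.objectFile[MyRecord]("hdfs:///tmp/myrdd-snapshot")
val result1 = reloaded.map(op1(_)).count()
val result2 = reloaded.map(op2(_)).count()
```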
Hi Jeff, that looks sane to me. Do you have additional details?
On 1 August 2017 at 11:05, jeff saremi wrote:
> Calling cache/persist fails all our jobs (i have posted 2 threads on
> this).
>
> And we're giving up hope in finding a solution.
> So I'd like to find a workaround for that:
>
> If
Very likely, much of the potential duplication is already being avoided
even without calling cache/persist. When running the above code without
`myrdd.cache`, have you looked at the Spark web UI for the Jobs? For at
least one of them you will likely see that many Stages are marked as
"skipped", whi
I am trying to use PySpark to read a Kafka stream and then write it to
Redis. However, PySpark does not have support for a ForEach sink. So, I am
thinking of reading the Kafka stream into a DataFrame in Python and then
sending that DataFrame into a Scala application to be written to Redis. Is
there
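If the Scala side of that pipeline is the open question, a rough sketch of a structured-streaming ForeachWriter that pushes rows to Redis could look like this; the Jedis client, host/port, and the key/value column names are all assumptions:
```
import org.apache.spark.sql.{ForeachWriter, Row}
import redis.clients.jedis.Jedis // assumption: Jedis is used as the Redis client

// Minimal sketch: write each row's "key"/"value" columns (assumed names) to Redis.
class RedisWriter(host: String, port: Int) extends ForeachWriter[Row] {
  @transient private var jedis: Jedis = _

  override def open(partitionId: Long, version: Long): Boolean = {
    jedis = new Jedis(host, port) // one connection per partition/epoch
    true
  }

  override def process(row: Row): Unit =
    jedis.set(row.getAs[String]("key"), row.getAs[String]("value"))

  override def close(errorOrNull: Throwable): Unit =
    if (jedis != null) jedis.close()
}

// Usage sketch: streamingDf.writeStream.foreach(new RedisWriter("localhost", 6379)).start()
```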
Here are the threads that describe the problems we're experiencing. These
problems are exacerbated when we use cache/persist:
https://www.mail-archive.com/user@spark.apache.org/msg64987.html
https://www.mail-archive.com/user@spark.apache.org/msg64986.html
So I am looking for a way to reproduce the same
You can use `.checkpoint()`:
```
val sc: SparkContext
sc.setCheckpointDir("hdfs:///tmp/checkpointDirectory")
myrdd.checkpoint()
val result1 = myrdd.map(op1(_))
result1.count() // Will save `myrdd` to HDFS and do map(op1…
val result2 = myrdd.map(op2(_))
result2.count() // Will load `myrdd` from HDFS
```
Thanks Vadim. I'll try that
From: Vadim Semenov
Sent: Tuesday, August 1, 2017 12:05:17 PM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: How can i remove the need for calling cache
You can use `.checkpoint()`:
```
val sc: SparkContext
sc.setCheckpointDir(
Thanks Mark. I'll examine the status more carefully to observe this.
From: Mark Hamstra
Sent: Tuesday, August 1, 2017 11:25:46 AM
To: user@spark.apache.org
Subject: Re: How can i remove the need for calling cache
Very likely, much of the potential duplication is
I ran the PySpark streaming example queue_streaming.py but ran into the
following error. Does anyone know what might be wrong? Thanks
ERROR [2017-08-02 08:29:20,023] ({Stop-StreamingContext}
Logging.scala[logError]:91) - Cannot connect to Python process. It's
probably dead. Stopping StreamingContext