I need to write a Spark Structured Streaming pipeline that involves
multiple aggregations, splitting data into multiple sub-pipes and unioning
them. It also needs to have stateful aggregation with a timeout.
Spark Structured Streaming supports all of the required functionality, but
not as one stream. I di
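For what it's worth, the timeout bookkeeping that such a stateful aggregation needs can be sketched in plain Python (class and field names are hypothetical; in Spark this logic would sit inside `mapGroupsWithState`/`flatMapGroupsWithState` with a `GroupStateTimeout`, not in driver-side code):

```python
# Plain-Python sketch of per-key stateful aggregation with a timeout:
# keep a running sum per key and evict keys that have received no event
# within `timeout_s`. Hypothetical names; illustrative only.

class TimeoutState:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.state = {}  # key -> (running_sum, last_seen_time)

    def update(self, key, value, now):
        """Fold a new event into the key's running aggregate."""
        total, _ = self.state.get(key, (0, now))
        self.state[key] = (total + value, now)

    def expire(self, now):
        """Emit and drop every key whose last event is older than the timeout."""
        expired = {k: s for k, (s, seen) in self.state.items()
                   if now - seen >= self.timeout_s}
        for k in expired:
            del self.state[k]
        return expired
```

In Spark the same shape appears as the update function passed to the stateful operator, with the timeout driven by `GroupState.setTimeoutDuration` rather than an explicit clock argument.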
The cluster mode doesn't upload jars to the driver node. This is a known
issue: https://issues.apache.org/jira/browse/SPARK-4160
On Wed, Dec 27, 2017 at 1:27 AM, Geoff Von Allmen
wrote:
> I’ve tried it both ways.
>
> Uber jar gives me the following:
>
>- Caused by: java.lang.ClassNo
Hi Jeroen,
Can you try to use EMR version 5.10 or EMR version 5.11 instead?
Can you please try selecting a subnet which is in a different availability
zone?
If possible, just try to increase the number of task instances and see the
difference?
Also, in case you are using caching, tr
We extensively use PubMed & clinical trial databases for our work, and it
involves making a large number of parametric REST API queries; usually, if the
data download is large, the requests get timed out and we have to run queries
in very small batches. We also extensively use a large number (thousands) of
On 28 Dec 2017, at 19:42, Gourav Sengupta wrote:
> In the EMR cluster, what other applications have you enabled (like HIVE,
> FLUME, Livy, etc.)?
Nothing that I can think of, just a Spark step (unless EMR is doing fancy stuff
behind my back).
> Are you using SPARK Session?
Yes.
>
On 28 Dec 2017, at 19:40, Maximiliano Felice
wrote:
> I experienced a similar issue a few weeks ago. The situation was a result of
> a mix of speculative execution and OOM issues in the container.
Interesting! However I don't have any OOM exception in the logs. Does that rule
out your hypothes
On 28 Dec 2017, at 19:25, Patrick Alwell wrote:
> You are using groupByKey() have you thought of an alternative like
> aggregateByKey() or combineByKey() to reduce shuffling?
I am aware of this indeed. I do have a groupByKey() that is difficult to avoid,
but the problem occurs afterwards.
> Dy
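The shuffle-volume difference behind the `aggregateByKey()`/`combineByKey()` suggestion can be sketched in plain Python (illustrative only; the real operators live on Spark RDDs and the partition lists here stand in for RDD partitions):

```python
from collections import defaultdict

# Why aggregateByKey/combineByKey shuffle less than groupByKey: each
# partition is reduced to one partial value per key *before* the shuffle,
# so only the partials cross the network instead of every raw record.

def group_by_key(partitions):
    """groupByKey-style: every (key, value) pair crosses the shuffle."""
    shuffled = defaultdict(list)
    for part in partitions:
        for k, v in part:
            shuffled[k].append(v)          # all raw values move
    records_shuffled = sum(len(p) for p in partitions)
    return {k: sum(vs) for k, vs in shuffled.items()}, records_shuffled

def aggregate_by_key(partitions):
    """aggregateByKey-style: combine locally (seqOp), then merge partials (combOp)."""
    partials = []
    for part in partitions:
        local = defaultdict(int)
        for k, v in part:                  # seqOp: fold into a local combiner
            local[k] += v
        partials.append(local)
    merged = defaultdict(int)
    records_shuffled = 0
    for local in partials:
        for k, v in local.items():         # combOp: only partials move
            merged[k] += v
            records_shuffled += 1
    return dict(merged), records_shuffled
```

Both produce the same per-key sums, but the second moves one record per (partition, key) rather than one per input row, which is exactly the map-side combine that `groupByKey()` skips.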
Hi Jeroen,
Can I get a few pieces of additional information please?
In the EMR cluster, what other applications have you enabled (like HIVE,
FLUME, Livy, etc.)?
Are you using SPARK Session? If yes, is your application running in cluster
mode or client mode?
Have you read the EC2 service leve
Hi Jeroen,
I experienced a similar issue a few weeks ago. The situation was a result
of a mix of speculative execution and OOM issues in the container.
First of all, when an executor takes too much time in Spark, it is handled
by YARN's speculative execution, which will launch a new executor an
Jeroen,
Anytime there is a shuffle over the network, Spark moves to a new stage. It
seems like you are having issues either pre- or post-shuffle. Have you looked
at a resource-management tool like Ganglia to determine if this is a memory or
thread-related issue? The Spark UI?
You are using groupBy
On 28 Dec 2017, at 17:41, Richard Qiao wrote:
> Are you able to specify which path of data filled up?
I can narrow it down to a bunch of files but it's not so straightforward.
> Any logs not rolled over?
I have to manually terminate the cluster but there is nothing more in the
driver's log whe
Dear Sparkers,
Once again in times of desperation, I leave what remains of my mental sanity to
this wise and knowledgeable community.
I have a Spark job (on EMR 5.8.0) which had been running daily for months, if
not the whole year, with absolutely no supervision. This changed all of a sudden
for
Hi,
I would like to build a PySpark application which searches for sequential items
or events of a time series in CSV files.
What are the best data structures for this purpose? A PySpark or pandas
DataFrame, an RDD, SQL, or something else?
---
Esa
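Whichever structure is chosen, the detection itself is a sliding-window match over time-ordered rows; a minimal sketch in plain Python (hypothetical schema: each row is a `(timestamp, event)` tuple):

```python
# Find every consecutive run of events matching `pattern` in a
# time-ordered series. Pure-Python sketch; in PySpark the same check
# maps to window functions (e.g. lead() over an ordered window).

def find_sequences(rows, pattern):
    """Return the start timestamps of every consecutive match of `pattern`."""
    rows = sorted(rows)                        # order by timestamp
    events = [e for _, e in rows]
    hits = []
    for i in range(len(events) - len(pattern) + 1):
        if events[i:i + len(pattern)] == list(pattern):
            hits.append(rows[i][0])
    return hits
```

A pandas DataFrame fits when one machine holds the data; a PySpark DataFrame with window functions is the natural fit once the CSV files outgrow a single node.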
Hello,
Thanks for your answer.
And what do you think about the approach of querying data using
OpenTSDB/KairosDB piece by piece, creating a DataFrame for each piece,
and then making a union out of them?
This would enable us to store and query data as a time series and process
it using Spark.
Bes
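The piecewise approach can be sketched in plain Python, with lists of rows standing in for DataFrames (`fetch_slice` and the slice bounds are hypothetical; in PySpark the last step would be `reduce(DataFrame.union, frames)`):

```python
from functools import reduce

# Query the time-series store one slice at a time, build one "frame"
# per slice, then union them into a single result.

def fetch_slice(start, end):
    """Stand-in for one OpenTSDB/KairosDB query returning (ts, value) rows."""
    return [(t, t * 10) for t in range(start, end)]

def union_slices(bounds):
    """Fetch each [start, end) slice and union the per-slice frames."""
    frames = [fetch_slice(s, e) for s, e in bounds]
    return reduce(lambda a, b: a + b, frames, [])
```

One caveat worth checking: each `union` adds a lineage step, so with thousands of slices the real PySpark plan may benefit from checkpointing or from unioning in batches.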