First thing would be that Scala supports them. Beyond that, someone
might need to redesign the Spark source code to leverage modules - this could
be a rather handy feature to have: a small but very well designed core (core,
ml, graph, etc.) around which others write useful modules.
Hi all,
we have a dataframe with 1000 partitions and we need to write the
dataframe into MySQL using this command:
df.coalesce(20)
df.write.jdbc(url=url,
table=table,
mode=mode,
properties=properties)
and we get these errors randomly:
java
How many rows do you have in total?
> On 16. May 2018, at 11:36, Davide Brambilla
> wrote:
>
> Hi all,
>we have a dataframe with 1000 partitions and we need to write the
> dataframe into a MySQL using this command:
>
> df.coalesce(20)
> df.write.jdbc(url=url,
> table=table,
> mode=mode,
> properties=properties)
We implemented a streaming query with aggregation on event time with a
watermark. I'm wondering why the aggregation state is not cleaned up. According
to the documentation, old aggregation state should be cleared when using
watermarks. We also don't see any condition [1] for why the state should not be
cleaned up.
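For reference, a minimal sketch of an event-time aggregation with a watermark, where the state of a window should be dropped once the watermark passes the window end (the rate source, column names, and intervals are assumptions, not the original query):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("watermark-agg").getOrCreate()
import spark.implicits._

// Hypothetical source; the real job would read from Kafka, files, etc.
val events = spark.readStream
  .format("rate")
  .load()
  .withColumnRenamed("timestamp", "eventTime")

// With a watermark, state for a window should be dropped once the watermark
// (max event time seen minus 10 minutes) moves past the end of that window.
val counts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window($"eventTime", "5 minutes"))
  .count()

counts.writeStream
  .outputMode("update")
  .format("console")
  .start()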
Hi,
we have 2 million rows in total, on an EMR cluster with 8 m4.4xlarge
machines, each with 100 GB of EBS storage.
Davide B.
Davide Brambilla
ContentWise R&D
ContentWise
davide.brambi...@contentwise.tv
I have been doing some testing with aggregations, but I seem to hit a wall on
two issues.
Example:
val avg = areaStateDf.groupBy($"plantKey").avg("sensor")
1) How can I use the result from an aggregation within the same stream, to do
further calculations?
2) It seems to be very slow. If I want a moving window
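A sketch of what a moving (sliding) window average could look like for this stream, with further calculations chained onto the aggregated result within the same query (the eventTime column, watermark, and window sizes are assumptions; areaStateDf, plantKey, and sensor come from the example above):

import org.apache.spark.sql.functions.{avg, col, window}

// Sliding window: 10-minute windows, advancing every 1 minute.
val movingAvg = areaStateDf
  .withWatermark("eventTime", "15 minutes")   // assumed event-time column
  .groupBy(window(col("eventTime"), "10 minutes", "1 minute"), col("plantKey"))
  .agg(avg("sensor").as("avgSensor"))

// Further calculations on the aggregated result, still within the same stream.
val flagged = movingAvg.withColumn("tooHigh", col("avgSensor") > 100)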
Hi,
I am using Spark to read XML / JSON files, create a dataframe and
save it as a Hive table.
Sample XML file:
[XML markup stripped in the plain-text digest; the sample's element values were 101, 45 and COMMAND]
Note field 'validation-timeout' under testexecutioncontroller.
Below is the schema inferred for the DataFrame after reading the XML file:
|-- id:
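A minimal sketch of how such a file can be read into a DataFrame and saved as a Hive table with the spark-xml package (the row tag, file path, package version, and table name are assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("xml-to-hive")
  .enableHiveSupport()
  .getOrCreate()

// Requires the spark-xml package, e.g. --packages com.databricks:spark-xml_2.11:0.4.1
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "testexecutioncontroller")   // assumed row tag
  .load("/path/to/sample.xml")                   // assumed path

df.printSchema()   // shows the inferred schema (|-- id: ..., etc.)
df.write.mode("overwrite").saveAsTable("test_execution")   // assumed table name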
Upon downsizing to 20 partitions some of your partitions become too big;
I also see that you're doing caching, and the executors try to write the big
partitions to disk, but fail because they exceed 2 GiB:
> Caused by: java.lang.IllegalArgumentException: Size exceeds
> Integer.MAX_VALUE
> at sun.nio.ch.FileChann
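One way to follow that advice is to keep more, smaller partitions for the JDBC write instead of coalescing down to 20. A sketch, where the partition count, batch size, and credentials are assumptions to be tuned (df, url and table are the names from the original post):

import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", "dbuser")       // assumed credentials
connectionProperties.put("password", "dbpass")

// Repartition into enough partitions that each stays well under the 2 GiB limit,
// instead of coalescing down to 20 large partitions.
df.repartition(200)
  .write
  .mode("append")
  .option("batchsize", "10000")   // rows per JDBC batch insert; tune as needed
  .jdbc(url, table, connectionProperties)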
Hello,
I have a structured streaming job that consumes messages from Kafka and does
some stateful associations using flatMapGroupsWithState. Every time I submit
the job, it runs fine for around 2 hours and then stops abruptly without any
error messages. All I can see in the debug logs is the below m
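For context, a minimal sketch of a flatMapGroupsWithState query with a state timeout (the case classes, key column, timeout, and the events Dataset are assumptions, not the poster's actual code):

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(deviceId: String, value: Double)         // assumed input type
case class DeviceState(count: Long, sum: Double)          // assumed state type
case class DeviceSummary(deviceId: String, avg: Double)   // assumed output type

def updateState(
    deviceId: String,
    events: Iterator[Event],
    state: GroupState[DeviceState]): Iterator[DeviceSummary] = {
  if (state.hasTimedOut) {
    // Emit a final summary and drop the state once the timeout fires.
    val s = state.get
    state.remove()
    Iterator(DeviceSummary(deviceId, if (s.count == 0) 0.0 else s.sum / s.count))
  } else {
    val old = state.getOption.getOrElse(DeviceState(0L, 0.0))
    val batch = events.toSeq
    state.update(DeviceState(old.count + batch.size, old.sum + batch.map(_.value).sum))
    state.setTimeoutDuration("30 minutes")   // keeps per-key state from growing forever
    Iterator.empty
  }
}

// Assumes `events: Dataset[Event]` read from Kafka and `import spark.implicits._` in scope.
val summaries = events
  .groupByKey(_.deviceId)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.ProcessingTimeTimeout)(updateState)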
Yes, the workaround is to create multiple StringIndexers as you described.
OneHotEncoderEstimator is only available from Spark 2.3.0; on earlier versions
you will have to use just OneHotEncoder.
On Tue, May 15, 2018, 8:40 AM Mina Aslani wrote:
> Hi,
>
> So, what is the workaround? Should I create multiple indexers (one for each
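A sketch of that workaround: one StringIndexer per categorical column, followed by one OneHotEncoder each, assembled into a Pipeline (the column names are assumptions):

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val categoricalCols = Seq("color", "size")   // assumed categorical columns

val indexers = categoricalCols.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")
}
val encoders = categoricalCols.map { c =>
  new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_vec")
}

val stages: Array[PipelineStage] = (indexers ++ encoders).toArray
val pipeline = new Pipeline().setStages(stages)
// val model = pipeline.fit(trainingDf)   // trainingDf is assumed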
Hi,
I would go for a regular MySQL bulk load, i.e. writing an output that
MySQL is able to load in one process. I'd say Spark JDBC is fine for
small fetches/loads; when it comes to large RDBMS writes, using the
database's regular optimized bulk-load path works out better than JDBC.
2018-05-16 16:18 GMT+02:00 Vadim Semenov
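A sketch of that approach: export the DataFrame as CSV and bulk-load it with MySQL's LOAD DATA rather than inserting row by row over JDBC (the output path and table name are assumptions):

// 1) Write the DataFrame out as CSV on storage the MySQL loader host can reach.
df.write
  .mode("overwrite")
  .option("header", "false")
  .csv("/data/export/ratings_csv")   // assumed output path; produces part-*.csv files

// 2) Each part file can then be loaded in one pass with MySQL's bulk loader, e.g.:
//    mysql> LOAD DATA LOCAL INFILE '/data/export/ratings_csv/part-00000-<uuid>.csv'
//           INTO TABLE ratings FIELDS TERMINATED BY ',';
//    or with the mysqlimport command-line tool.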
Hi Spark-users,
I want to submit as many Spark applications as the resources permit. I am
using cluster mode on a YARN cluster. YARN can queue and launch these
applications without problems. The problem lies with spark-submit itself:
spark-submit starts a JVM, which could fail due to insufficient memory.
You can either:
- set spark.yarn.submit.waitAppCompletion=false, which will make
spark-submit go away once the app starts in cluster mode.
- use the (new in 2.3) InProcessLauncher class + some custom Java code
to submit all the apps from the same "launcher" process.
On Wed, May 16, 2018 at 1:45 P
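A sketch (in Scala) of the second option using the launcher API; the jar path, main class, and config values are assumptions:

import org.apache.spark.launcher.{InProcessLauncher, SparkAppHandle}

// One "launcher" JVM submits many applications without forking a spark-submit
// process per app (InProcessLauncher is available from Spark 2.3).
val handle = new InProcessLauncher()
  .setMaster("yarn")
  .setDeployMode("cluster")
  .setAppResource("/path/to/my-app.jar")   // assumed jar path
  .setMainClass("com.example.MyApp")       // assumed main class
  .setConf("spark.executor.memory", "4g")
  .startApplication(new SparkAppHandle.Listener {
    override def stateChanged(h: SparkAppHandle): Unit =
      println(s"App ${h.getAppId} is now ${h.getState}")
    override def infoChanged(h: SparkAppHandle): Unit = ()
  })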
How about using Livy to submit jobs?
On Thu, 17 May 2018 at 7:24 am, Marcelo Vanzin wrote:
> You can either:
>
> - set spark.yarn.submit.waitAppCompletion=false, which will make
> spark-submit go away once the app starts in cluster mode.
> - use the (new in 2.3) InProcessLauncher class + some custom Java code
>   to submit all the apps from the same "launcher" process.
We would like to maintain an arbitrary state between invocations of the
iterations of Structured Streaming in Python.
How can we maintain a static DataFrame that acts as state between the
iterations?
Several options that may be relevant:
1. in Spark memory (distributed across the workers