I have a job that is running into intermittent errors with [SparkDriver]
java.lang.OutOfMemoryError: Java heap space. Before I was getting this
error, I was getting errors saying the result size exceeded
spark.driver.maxResultSize.
This does not make sense to me, as there are no actions in m
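For reference, both limits mentioned above are configurable on the driver side; a sketch of the relevant spark-defaults.conf entries (the values here are purely illustrative, not a recommendation):

```
# spark-defaults.conf (illustrative values only)
spark.driver.memory          8g
spark.driver.maxResultSize   4g
```

Raising spark.driver.maxResultSize without also raising driver memory just moves the failure from the size check to the heap, which may be what is happening here.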
> There has to be a difference in classpaths in yarn-client and yarn-cluster
> mode. Perhaps a good starting point would be to print the classpath as the
> first thing in SimpleApp.main. It should give clues as to why it works in
> yarn-cluster mode.
>
> Thanks,
> Aniket
>
>
Hi,
I have a problem trying to get a fairly simple app working which makes use
of the native Avro libraries. The app runs fine on my local machine and in
yarn-cluster mode, but when I try to run it on EMR in yarn-client mode I get
the error below. I'm aware this is a version problem, as EMR runs an
ear
.take(20).foreach(println)
On Thu, Jun 4, 2015 at 12:05 PM Tom Seddon wrote:
Hi,
I've worked out how to use explode on my input avro dataset with the
following structure
root
 |-- pageViewId: string (nullable = false)
 |-- components: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- name: string (nullable = false)
 |    |    |-- loadT
Hi,
I've searched but can't seem to find a PySpark example. How do I write
compressed text file output to S3 using PySpark saveAsTextFile?
Thanks,
Tom
I'm trying to set up a PySpark ETL job that takes in JSON log files and
spits out fact table files for upload to Redshift. Is there an efficient
way to send different event types to different outputs without having to
just read the same cached RDD twice? I have my first RDD which is just a
json p
Yes, please can you share? I am getting this error after expanding my
application to include a large broadcast variable. It would be good to know
if it can be fixed with configuration.
On 23 October 2014 18:04, Michael Campbell
wrote:
> Can you list what your fix was so others can benefit?
>
> On W
Hi,
Just wondering if anyone has any advice about this issue, as I am
experiencing the same thing. I'm working with multiple broadcast variables
in PySpark, most of which are small, but one is around 4.5GB. I'm using 10
workers at 31GB memory each and a driver with the same spec. It's not running
out of m
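For anyone hitting the same wall, these are the settings usually worth checking first for multi-GB broadcasts; the values below are illustrative only, not a tested recommendation for this workload:

```
# spark-defaults.conf (illustrative values only)
spark.driver.memory        24g
spark.executor.memory      24g
spark.broadcast.blockSize  4m
```

A broadcast variable is materialized in full on the driver and on every executor, so both driver and executor heaps need comfortable headroom over the broadcast's in-memory size.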