Thanks for the response, JayeshLalwani. Clearly, in my case the issue was with
my approach, not with the memory.
The job was taking much longer even for a smaller dataset.
Thanks again!
---
I was able to solve it by writing a Java method (to slice and dice the data) and
invoking that method from spark.map. This transformed the data much
faster than my previous approach.
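Roughly, the shape of it looks like the sketch below (the field names and offsets are made up for illustration, not the real layout):

import java.io.Serializable;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

public class FixedWidthParser {

    // Simple bean holding the parsed fields (names and widths are illustrative).
    public static class Record implements Serializable {
        private String id;
        private String name;
        private String amount;

        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public String getAmount() { return amount; }
        public void setAmount(String amount) { this.amount = amount; }
    }

    // Plain Java method that slices one fixed-length line into its fields.
    public static Record slice(String line) {
        Record r = new Record();
        r.setId(line.substring(0, 10).trim());
        r.setName(line.substring(10, 40).trim());
        r.setAmount(line.substring(40, 52).trim());
        return r;
    }

    // Invoke the Java method from Dataset.map so each line is parsed in one pass.
    public static Dataset<Record> parse(Dataset<String> raw) {
        return raw.map((MapFunction<String, Record>) FixedWidthParser::slice,
                Encoders.bean(Record.class));
    }
}

Calling FixedWidthParser.parse(rawDataset) then gives a typed Dataset<Record> that can be registered as a view or written out directly.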
Thanks geoHeil for the pointer.
---
Is your input data partitioned? How much memory have you assigned to your
executor? Have you looked at how much time is being spent in GC in the
executor? Is Spark spilling the data to disk?
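In case it helps, a few quick ways to check those points (assuming the Dataset<String> read from the file is called rawDataset, as in the original post):

import org.apache.spark.sql.Dataset;

// How many partitions did the input produce? bzip2 is splittable, but the
// resulting partitions can still be large.
int numPartitions = rawDataset.javaRDD().getNumPartitions();

// Force smaller partitions before the heavy transformation; 200 is only an example.
Dataset<String> repartitioned = rawDataset.repartition(200);

// Executor memory is set at submit time, e.g.:
//   spark-submit --executor-memory 8g ...
// GC time and spill are visible in the Spark UI (Executors and Stages tabs).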
It is likely that the partition is too big. Spark tries to read the whole
partition into the memory of a single executor.
---
I am not 100% sure if Spark is smart enough to achieve this in a single
pass over the data. If not, you could create a Java UDF for this which
correctly parses all the columns at once.
Otherwise you could enable Tungsten off-heap memory, which might speed
things up.
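A minimal sketch of the single-pass UDF idea (assuming the temporary view is called raw_view and exposes the single value column that textFile produces; column names and offsets are illustrative):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

StructType parsedSchema = new StructType()
        .add("id", DataTypes.StringType)
        .add("name", DataTypes.StringType)
        .add("amount", DataTypes.StringType);

// One UDF call per line: every column is sliced in a single pass over the string.
sparkSession.udf().register("parse_fixed_width",
        (UDF1<String, Row>) line -> RowFactory.create(
                line.substring(0, 10).trim(),
                line.substring(10, 40).trim(),
                line.substring(40, 52).trim()),
        parsedSchema);

Dataset<Row> parsed = sparkSession.sql(
        "SELECT parsed.* FROM (SELECT parse_fixed_width(value) AS parsed FROM raw_view) t");

If you try the off-heap route, the relevant settings are spark.memory.offHeap.enabled and spark.memory.offHeap.size.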
lsn24 wrote on Fri, 13 Apr
Hello,
We are running into issues while trying to process fixed-length files using
Spark.
The approach we took is as follows:
1. Read the .bz2 file from HDFS into a dataset using the
spark.read().textFile() API and create a temporary view.
Dataset<String> rawDataset = sparkSession.read().textFile(filePath);