Thanks, Holden!
On Thu, Aug 3, 2017 at 4:02 AM, Holden Karau wrote:
> The memory overhead is based less on the total amount of data and more on
> what you end up doing with the data (e.g. if you're doing a lot of off-heap
> processing or using Python you need to increase it). Honestly most people
The memory overhead is based less on the total amount of data and more on
what you end up doing with the data (e.g. if you're doing a lot of off-heap
processing or using Python you need to increase it). Honestly most people
find this number for their job "experimentally" (e.g. they try a few
different values).
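
As a rough illustration of where that setting goes, here is a minimal Scala sketch, assuming a YARN deployment and the Spark 2.x-era config name; the app name, executor memory, and 3072 MB overhead are illustrative placeholders, not recommendations.

  import org.apache.spark.sql.SparkSession

  // Minimal sketch: spark.yarn.executor.memoryOverhead (in MB) is a
  // launch-time setting, so it must be in place before the SparkSession is
  // created (or be passed via --conf on spark-submit). All values below are
  // illustrative placeholders only.
  val spark = SparkSession.builder()
    .appName("overhead-tuning-sketch")                    // hypothetical app name
    .config("spark.executor.memory", "8g")                // executor heap
    .config("spark.yarn.executor.memoryOverhead", "3072") // off-heap headroom, MB
    .enableHiveSupport()
    .getOrCreate()

In practice people tend to raise the overhead in steps (say 512 MB or 1 GB at a time) after YARN kills containers for exceeding their memory limit, rather than deriving it from the input size.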
Ryan,
Thank you for your reply.
For 2 TB of data, what should the value of
spark.yarn.executor.memoryOverhead be?
With regard to this, I see an issue in Spark's tracker,
https://issues.apache.org/jira/browse/SPARK-18787 , but I am not sure whether
it works or not on Spark 2.0.1!
Can you elaborate more on spark.memor
Chetan,
When you're writing to a partitioned table, you want to use a shuffle to
avoid the situation where each task has to write to every partition. You
can do that either by adding a repartition by your table's partition keys,
or by adding an order by with the partition keys and then columns you
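
As a concrete illustration of the two options Ryan mentions, a minimal Scala sketch follows; the table and column names (hbase_staging, events, event_date, rowkey) are hypothetical, and it assumes an existing Hive table partitioned by event_date with dynamic-partition inserts enabled.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
  import spark.implicits._

  val df = spark.table("hbase_staging")   // hypothetical source DataFrame

  // Option 1: repartition by the table's partition key, so each write task
  // holds rows for only a few dates and therefore opens files in only a few
  // Hive partitions.
  df.repartition($"event_date")
    .write
    .mode("append")
    .insertInto("events")                 // Hive table partitioned by event_date

  // Option 2: a global sort by the partition key (plus any columns you want
  // ordered within each file) clusters the data the same way.
  df.orderBy($"event_date", $"rowkey")
    .write
    .mode("append")
    .insertInto("events")

Either way, the shuffle is what prevents every task from writing a small file into every partition.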
Can anyone please guide me with the above issue.
On Wed, Aug 2, 2017 at 6:28 PM, Chetan Khatri wrote:
> Hello Spark Users,
>
> I am reading from an HBase table and writing to a Hive managed table, where
> I applied partitioning by a date column. That worked fine, but it has
> generated a large number of files in a
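
For context, the write pattern being described probably looks something like the sketch below (all names hypothetical); without a repartition or sort first, every task writes its own file into each date partition it touches, which is what multiplies the file count.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

  // DataFrame previously loaded from HBase (source table name is hypothetical).
  val hbaseDf = spark.table("hbase_staging")

  // Each task holding rows for a given date writes its own file into that
  // date's partition, so the file count grows roughly as
  // (number of write tasks) x (number of dates seen per task).
  hbaseDf.write
    .mode("overwrite")
    .partitionBy("event_date")
    .saveAsTable("hive_managed_events")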