I think TableInputFormat will try to maintain as much locality as possible,
assigning one Spark partition per region and trying to schedule that
partition on a YARN container/executor on the same node (assuming you're
running Spark on YARN). So the reason for the uneven distribution could be
that you
by the output of the dfsadmin command, so I am still trying to track that
down. The total allocated disk space of 28 TB should still be more than
enough.
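
For reference, this is roughly how I have been checking how the regions map
to Spark partitions and which hosts each split prefers (just a sketch;
dumpPartitionLocality is a helper name I made up, and it assumes the RDD was
built from the snapshot as in the job):

import org.apache.spark.rdd.RDD

// Print how many partitions the input produced and which hosts each split
// prefers; with one split per region this shows the region -> host spread.
def dumpPartitionLocality(rdd: RDD[_]): Unit = {
  println(s"total partitions: ${rdd.partitions.length}")
  rdd.partitions.foreach { p =>
    val hosts = rdd.preferredLocations(p).mkString(", ")
    println(s"partition ${p.index} -> preferred hosts: [$hosts]")
  }
}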
Saad
On Sat, Apr 7, 2018 at 2:40 PM, Saad Mufti wrote:
> Thanks. I checked and it is using another S3 folder for the tempor
> Unfortunately, some InputFormats need a (local) tmp directory. Sometimes
> this cannot be avoided.
>
> See also the source:
>
> https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapred/TableSnapshotInputFormat.java
>
> On 7. Apr 2018
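
To make the tmp/restore directory point above concrete, the snapshot input
setup in my job looks roughly like this (a sketch only; the bucket, paths,
and snapshot name are placeholders, and this uses the
org.apache.hadoop.hbase.mapreduce variant of TableSnapshotInputFormat rather
than the mapred one linked above):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
import org.apache.hadoop.mapreduce.Job

val conf = HBaseConfiguration.create()
// Root dir of the HBase files in S3 via EMRFS (placeholder bucket).
conf.set("hbase.rootdir", "s3://my-bucket/hbase")

val job = Job.getInstance(conf)
// restoreDir is where the snapshot is restored before the splits are read;
// here it is another S3 folder used only for temporary storage.
TableSnapshotInputFormat.setInput(
  job, "my_snapshot", new Path("s3://my-bucket/tmp/snapshot-restore"))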
Hi,
I have a simple ETL Spark job running on AWS EMR with Spark 2.2.1. The
input data is HBase files in AWS S3 using EMRFS, but there is no HBase
running on the Spark cluster itself. It is restoring the HBase snapshot
into files on disk in another S3 folder used for temporary storage, then
creati
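
In case it helps to see the shape of it, the read side of such a job
typically looks roughly like this (a sketch with placeholder names; job here
is the Hadoop Job configured with TableSnapshotInputFormat.setInput against
the temporary S3 restore folder, as sketched earlier in the thread):

import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hbase-snapshot-etl").getOrCreate()

// One (ImmutableBytesWritable, Result) pair per HBase row, read from the
// snapshot files restored into the temporary S3 folder.
val hbaseRdd = spark.sparkContext.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[TableSnapshotInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])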