> On 21 Apr 2017, at 19:36, Paul Tremblay <paulhtremb...@gmail.com> wrote:
> 
> We are tasked with loading a big file (possibly 2TB) into a data warehouse. 
> In order to do this efficiently, we need to split the file into smaller files.
> 
> I don't believe there is a way to do this with Spark, because in order for 
> Spark to distribute the file to the worker nodes, it first has to be split 
> up, right? 

If it is in HDFS, it has already been broken up by block size and scattered
around the filesystem, probably into 128/256MB blocks, each replicated 3x,
offering lots of places for data-local reads.
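
As a minimal sketch of what that means on the Spark side (Scala, with a hypothetical path): textFile() builds one partition per HDFS block by default, so there is nothing to pre-split yourself:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("block-split-demo").getOrCreate()
  val sc = spark.sparkContext

  // textFile() uses Hadoop's TextInputFormat; each HDFS block becomes (at least)
  // one partition that a worker can read locally.
  val lines = sc.textFile("hdfs:///data/warehouse/big_input.txt")
  println(s"partitions from HDFS blocks: ${lines.getNumPartitions}")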

If it's in another FS, different strategies may apply, including no data locality at all.

> 
> We ended up using a single machine with a single thread to do the splitting. 
> I just want to make sure I am not missing something obvious.
> 

You don't need to explicitly split the file into separate files if you can run different
workers against different parts of the same file; what you do need is a way to calculate
those parts (the splits).

This is what org.apache.hadoop.mapreduce.InputFormat.getSplits() does: you will
need to define an input format for your data source and provide the split
calculation.
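
A rough sketch of that route (Scala, hypothetical class name, assuming newline-delimited records); FileInputFormat already implements getSplits() by carving the file into block-sized splits, so a subclass mostly decides whether splitting is safe and which RecordReader to use:

  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
  import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, LineRecordReader}

  class WarehouseInputFormat extends FileInputFormat[LongWritable, Text] {

    // getSplits(), inherited from FileInputFormat, cuts the file at block boundaries;
    // return false here only if records cannot be read starting mid-file.
    override protected def isSplitable(context: JobContext, file: Path): Boolean = true

    override def createRecordReader(split: InputSplit,
                                    context: TaskAttemptContext): RecordReader[LongWritable, Text] =
      new LineRecordReader()
  }

  // From Spark, every split produced by getSplits() becomes one partition:
  // val rdd = sc.newAPIHadoopFile("hdfs:///data/warehouse/big_input.txt",
  //   classOf[WarehouseInputFormat], classOf[LongWritable], classOf[Text])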

> Thanks!
> 
> -- 
> Paul Henry Tremblay
> Attunix

