Re: splitting a huge file

2017-04-24 Thread Steve Loughran
> On 21 Apr 2017, at 19:36, Paul Tremblay wrote:
>
> We are tasked with loading a big file (possibly 2TB) into a data warehouse.
> In order to do this efficiently, we need to split the file into smaller files.
>
> I don't believe there is a way to do this with Spark, because in order for
> Spark …

Re: splitting a huge file

2017-04-21 Thread Roger Marin
If the file is in HDFS already you can use Spark to read the file using a specific input format (depending on the file type) to split it:

http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html

On Sat, Apr 22, 2017 at 4:36 AM, Paul Tremblay wrote:

> We are tasked with loading a big file (possibly 2TB) into a data warehouse. …
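
A minimal sketch of that approach, assuming the file is plain text and using the newer mapreduce TextInputFormat; the HDFS paths, the 256 MB split cap, and the output location are illustrative assumptions, and this only works if the file is uncompressed or in a splittable codec:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object SplitHugeFile {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("split-huge-file"))

    // Cap each input split at 256 MB so a ~2 TB file is read as ~8000 tasks.
    // (Assumption: uncompressed or splittable input; gzip cannot be split.)
    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.split.maxsize",
      (256L << 20).toString)

    val lines = sc.newAPIHadoopFile(
      "hdfs:///data/bigfile.txt",            // assumed input path
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text])
      .map { case (_, line) => line.toString }

    // Each partition is written out as one smaller file under the target dir.
    lines.saveAsTextFile("hdfs:///data/bigfile-split")
    sc.stop()
  }
}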

Re: splitting a huge file

2017-04-21 Thread Jörn Franke
What is your DWH technology? If the file is on HDFS then, depending on the format, Spark can read parts of it in parallel.

> On 21. Apr 2017, at 20:36, Paul Tremblay wrote:
>
> We are tasked with loading a big file (possibly 2TB) into a data warehouse.
> In order to do this efficiently, we need to split the file into smaller files. …
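
For instance, a sketch with the DataFrame API: splittable formats (plain text, CSV, Parquet, ORC) are read as many parallel partitions, and repartitioning before the write produces smaller files a warehouse loader can ingest. The CSV format, paths, and partition count here are assumptions for illustration, not the poster's method:

import org.apache.spark.sql.SparkSession

object SplitForWarehouse {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("split-for-dwh").getOrCreate()

    // A splittable source is read in parallel; a gzip'd text file, by
    // contrast, would arrive as a single partition.
    val df = spark.read
      .option("header", "true")
      .csv("hdfs:///data/bigfile.csv")       // assumed format and path

    // Write ~2000 smaller files for the warehouse loader to pick up.
    df.repartition(2000)
      .write
      .option("header", "true")
      .csv("hdfs:///data/bigfile-split")

    spark.stop()
  }
}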