On 5 Nov 2015, at 02:03, Younes Naguib 
<younes.nag...@tritondigital.com> wrote:

Hi all,

I’m reading large text files from S3, each between 30GB and 40GB.
Every stage runs in 8-9s, except the last 32, which jump to 1-2 min for some reason!
Here is my sample code:
    val myDF = sc.textFile(input_file).map{
      x =>
        val p = x.split("\t", -1)
        new zzzzzzzz(....)
    }.toDF()

    myDF.registerTempTable("tbl")
    sqlContext.sql("select count(1) from tbl").collect()

Any help/idea?

Thanks,
Younes Naguib
Triton Digital | 1440 Ste-Catherine W., Suite 1200 | Montreal, QC  H3G 1R8
Tel.: +1 514 448 4037 x2688 | Tel.: +1 866 448 4037 x2688 | 
younes.nag...@tritondigital.com


There's a performance problem in s3n on Hadoop 2.6, where the jets3t library 
scans through the tail of the file on close(). s3a on Hadoop 2.7+ doesn't 
have this problem.
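
If you can move to Hadoop 2.7+, switching the read to s3a might look roughly 
like the sketch below. It assumes the hadoop-aws module is on the classpath 
and that credentials are supplied through fs.s3a.access.key / 
fs.s3a.secret.key (or the environment). The object name S3aCount and the 
helper toS3a are made up for illustration; they are not part of your code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object S3aCount {
  // Hypothetical helper: swap a leading s3n:// scheme for s3a://,
  // leaving any other path untouched.
  def toS3a(path: String): String = path.replaceFirst("^s3n://", "s3a://")

  def main(args: Array[String]): Unit = {
    // args(0) is the input path, passed on the spark-submit command line.
    val inputFile = toS3a(args(0))
    val sc = new SparkContext(new SparkConf().setAppName("s3a-count"))
    try {
      // Same read as in the original snippet, only through the s3a connector.
      println(sc.textFile(inputFile).count())
    } finally {
      sc.stop()
    }
  }
}
```

This only changes the URL scheme; the rest of the job (the map, toDF() and 
the SQL count) would stay as it is.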
