Hi Selvam,

If your report lines are prefixed with some comment character (e.g. #), you can
skip them via the comment option [1].
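
For example, with the built-in CSV reader in Spark 2.0 it would look roughly like
this (untested sketch; the comment character and path are placeholders):

val df = spark.read
  .option("header", "true")   // first non-report line is the header
  .option("comment", "#")     // skip every line that starts with '#'
  .csv("path/to/report.csv")

This only helps if every report line really does start with that character.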

If you are using Spark 1.x, then you might be able to do this by manually
skipping the lines in the RDD and then converting it into a DataFrame, as below:

I haven’t tested this but I think this should work.

import com.databricks.spark.csv.CsvParser

val rdd = sparkContext.textFile("...")
// Drop the first 10 report lines, assuming they all fall within the first partition.
val filteredRdd = rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) {
    iter.drop(10)
  } else {
    iter
  }
}
// spark-csv's CsvParser can build a DataFrame straight from an RDD[String].
val df = new CsvParser().csvRdd(sqlContext, filteredRdd)
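
If the line right after the skipped block is the header, I believe CsvParser can
be told to use it (again untested and from memory, so the method name might differ):

val df = new CsvParser().withUseHeader(true).csvRdd(sqlContext, filteredRdd)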

If you are using Spark 2.0, it seems there is no way to do this by manually
modifying the source data, because loading an existing RDD or Dataset[String]
into the CSV reader as a DataFrame is not yet supported.
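
Once that support is added, I would expect something along these lines to work
(untested sketch; the exact API may end up different):

import spark.implicits._

// Read the raw lines, drop the first 10 report lines, then hand the rest to the CSV reader.
val lines = spark.read.textFile("...")
val filtered = lines.rdd
  .mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(10) else iter }
  .toDS()
val df = spark.read.option("header", "true").csv(filtered)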

There is an open issue for this [2]. I hope this is helpful.

Thanks.

[1]
https://github.com/apache/spark/blob/27209252f09ff73c58e60c6df8aaba73b308088c/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L369
[2] https://issues.apache.org/jira/browse/SPARK-15463

On 10 Sep 2016 6:14 p.m., "Selvam Raman" <sel...@gmail.com> wrote:

> Hi,
>
> I am using spark csv to read a csv file. The issue is my file's first n lines
> contain some report, followed by the actual data (header and rest of the
> data).
>
> So how can I skip the first n lines in spark csv? I don't have any specific
> comment character in the first byte.
>
> Please give me some idea.
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>
