Re: Question about SPARK-11374 (skip.header.line.count)

Dongjin Lee Thu, 08 Dec 2016 21:57:07 -0800

+1 for this idea. I need it too.
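
For reference, the preprocessing workaround mentioned in the quoted message below might look roughly like this. It is only a minimal sketch: `dropHeader` is a hypothetical helper name, not a Spark API, and it assumes the input is a single text file whose header sits at the start of partition 0. With several files that each carry a header, this would only strip the first one.

```scala
// Hypothetical helper (not part of Spark): drop the first `headerLines` lines,
// but only in partition 0, where the start of a single input file lands.
def dropHeader(partitionIndex: Int, lines: Iterator[String], headerLines: Int): Iterator[String] =
  if (partitionIndex == 0) lines.drop(headerLines) else lines

// Local check on plain iterators, no cluster needed.
val firstPartition = Iterator("id,value", "1,a", "2,b")
println(dropHeader(0, firstPartition, 1).toList)  // List(1,a, 2,b)
```

In spark-shell, something like `spark.sparkContext.textFile("/data").mapPartitionsWithIndex(dropHeader(_, _, 1))` should apply it to an RDD of lines, under the single-file assumption above.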

Regards,
Dongjin


On Fri, Dec 9, 2016 at 8:59 AM, Dongjoon Hyun <dongj...@apache.org> wrote:

> Hi, All.
>
> Could you give me your opinions?
>
> There is an old Spark issue, SPARK-11374, about removing header lines from
> text files.
> Currently, Spark supports removing CSV header lines in the following way.
>
> ```
> scala> spark.read.option("header","true").csv("/data").show
> +---+---+
> | c1| c2|
> +---+---+
> |  1|  a|
> |  2|  b|
> +---+---+
> ```
>
> In the SQL world, we could support this the way Hive does, with
> `skip.header.line.count`.
>
> ```
> scala> sql("""CREATE TABLE t1 (id INT, value VARCHAR(10)) ROW FORMAT
>      | DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/data'
>      | TBLPROPERTIES('skip.header.line.count'='1')""")
> scala> sql("SELECT * FROM t1").show
> +---+-----+
> | id|value|
> +---+-----+
> |  1|    a|
> |  2|    b|
> +---+-----+
> ```
>
> Although I made a PR for this based on the JIRA issue, I want to know whether
> this is really a needed feature.
> Is it needed for your use cases? Or is it enough for you to remove the header
> lines in a preprocessing stage?
> If this is too old and no longer relevant, I'll close the PR and
> JIRA issue as WON'T FIX.
>
> Thank you all in advance!
>
> Best regards,
> Dongjoon.


-- 
*Dongjin Lee*


*Software developer in Line+. So interested in massive-scale machine learning.*
facebook: www.facebook.com/dongjin.lee.kr
linkedin: kr.linkedin.com/in/dongjinleekr
github: github.com/dongjinleekr
twitter: www.twitter.com/dongjinleekr
