Hi, I am using Spark 1.6.1 with Hive 2.
I agree this may be a case that needs to be resolved; I just happened to work around it. That first blank line causes

val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/")

to crash.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 28 March 2016 at 23:19, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Could I ask which version you are using?
>
> It looks like the cause is the empty line right after the header (because that case is not covered in the tests).
>
> However, empty lines before the header or inside the data are tested:
>
> https://raw.githubusercontent.com/databricks/spark-csv/master/src/test/resources/ages.csv
> https://raw.githubusercontent.com/databricks/spark-csv/master/src/test/resources/cars.csv
>
> So I think it should be able to read that case as well, and this might be an issue.
>
> It can easily be handled by ETL, but I think the behaviour should be consistent.
>
> Maybe it would be better if this issue were opened and discussed?
>
> On 29 Mar 2016 6:54 a.m., "Ashok Kumar" <ashok34...@yahoo.com.invalid> wrote:
>
>> Thanks a ton sir. Very helpful.
>>
>> On Monday, 28 March 2016, 22:36, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Pretty straightforward:
>>
>> #!/bin/ksh
>> DIR="hdfs://<hostname>:9000/data/stg/accounts/nw/xxxxxxxxx"
>> #
>> ## Remove the blank header line from the spreadsheets and compress them
>> #
>> echo `date` " ""======= Started removing blank header line and compressing all csv files"
>> for FILE in `ls *.csv`
>> do
>>   sed '1d' ${FILE} > ${FILE}.tmp
>>   mv -f ${FILE}.tmp ${FILE}
>>   /usr/bin/bzip2 ${FILE}
>> done
>> #
>> ## Clear out the hdfs staging directory
>> #
>> echo `date` " ""======= Started deleting old files from hdfs staging directory ${DIR}"
>> hdfs dfs -rm -r ${DIR}/*.bz2
>> echo `date` " ""======= Started putting bz2 files to hdfs staging directory ${DIR}"
>> for FILE in `ls *.bz2`
>> do
>>   hdfs dfs -copyFromLocal ${FILE} ${DIR}
>> done
>> echo `date` " ""======= Checking that all files are moved to hdfs staging directory"
>> hdfs dfs -ls ${DIR}
>> exit 0
>>
>> HTH
>>
>> On 28 March 2016 at 22:24, Ashok Kumar <ashok34...@yahoo.com> wrote:
>>
>> Hello Mich,
>>
>> If you can accommodate it, can you please share your approach to steps 1-3 above?
>>
>> Best regards
>>
>> On Sunday, 27 March 2016, 14:53, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Pretty simple; as usual it is a combination of ETL and ELT.
>>
>> Basically the csv files are loaded into a staging directory on the host and compressed before being pushed into hdfs:
>>
>> 1. ETL --> Get rid of the blank header line in the csv files
>> 2. ETL --> Compress the csv files
>> 3. ETL --> Put the compressed csv files into the hdfs staging directory
>> 4. ELT --> Use the databricks spark-csv package to load the csv files
>> 5. ELT --> Use Spark FP (functional programming) to process the csv data
>> 6. ELT --> Register it as a temporary table
>> 7. ELT --> Create an ORC table, compressed with zlib, in a named Hive database
>> 8. ELT --> Insert/select from the temporary table into the Hive table
>>
>> So the data ends up stored in an ORC table and one can do whatever analysis is needed using Spark, Hive, etc.
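For reference, steps 4 to 8 above might look roughly like the following in Spark 1.6 Scala. This is only a minimal sketch: the staging path, the database and table names (accounts, ac_orc), and the renamed column list are illustrative, and it assumes sqlContext is a HiveContext and that the files have the seven columns shown in the sample data.

// Step 4: load the compressed csv files from the hdfs staging directory
// (the .bz2 files are decompressed transparently by the Hadoop input format)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("hdfs://<hostname>:9000/data/stg/accounts/nw/xxxxxxxxx")

// Step 5: rename the columns to Hive-friendly names (no spaces)
val staged = df.toDF("TransactionDate", "TransactionType", "Description",
  "Value", "Balance", "AccountName", "AccountNumber")

// Step 6: register it as a temporary table
staged.registerTempTable("tmp_accounts")

// Step 7: create an ORC table, compressed with zlib, in a named Hive database
sqlContext.sql("CREATE DATABASE IF NOT EXISTS accounts")
sqlContext.sql(
  """CREATE TABLE IF NOT EXISTS accounts.ac_orc (
    |  TransactionDate string, TransactionType string, Description string,
    |  Value double, Balance double, AccountName string, AccountNumber string)
    |STORED AS ORC
    |TBLPROPERTIES ('orc.compress'='ZLIB')""".stripMargin)

// Step 8: insert/select from the temporary table into the Hive table
sqlContext.sql("INSERT INTO TABLE accounts.ac_orc SELECT * FROM tmp_accounts")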
>> On 27 March 2016 at 03:05, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> To me this is expected behaviour that I would not want fixed, but if you look at the recent commits for spark-csv there is one that deals with this...
>>
>> On Mar 26, 2016 21:25, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
>>
>> Hi,
>>
>> I have a standard csv file (saved as csv in HDFS) that has a blank first line before the header (and another blank line right after it), as follows:
>>
>> [blank line]
>> Date, Type, Description, Value, Balance, Account Name, Account Number
>> [blank line]
>> 22/03/2011,SBT,"'FUNDS TRANSFER , FROM A/C 1790999",200.00,200.00,"'BROWN AE","'638585-60125663",
>>
>> When I read this file using the standard
>>
>> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/")
>>
>> it crashes:
>>
>> java.util.NoSuchElementException
>>         at java.util.ArrayList$Itr.next(ArrayList.java:794)
>>
>> If I manually delete the first blank line, the same call works OK:
>>
>> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/")
>>
>> df: org.apache.spark.sql.DataFrame = [Date: string, Type: string, Description: string, Value: double, Balance: double, Account Name: string, Account Number: string]
>>
>> I can easily write a shell script to get rid of the blank line. I was wondering whether the databricks spark-csv package has a flag to skip a leading blank line in the csv file?
>>
>> P.S. If the file is stored as a DOS text file, this problem goes away.
>>
>> Thanks
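A possible Spark-side workaround, as an untested sketch rather than a confirmed fix: read the files as plain text first, drop blank lines, and hand the cleaned RDD to spark-csv. This assumes the CsvParser builder and its csvRdd method available in spark-csv 1.x, plus the usual sc (SparkContext) and sqlContext of a spark-shell session; the path is the one from the example above.

import com.databricks.spark.csv.CsvParser

// Read the raw files and drop empty lines, including the blank line before the header
val raw = sc.textFile("hdfs://rhes564:9000/data/stg/accounts/ac/")
val nonBlank = raw.filter(_.trim.nonEmpty)

// Parse the cleaned lines with spark-csv, with header handling and schema inference as before
val df = new CsvParser()
  .withUseHeader(true)
  .withInferSchema(true)
  .csvRdd(sqlContext, nonBlank)

This avoids the pre-processing step on the local filesystem, at the cost of reading the data as text first; the sed-based script above remains the simpler option if the files are already being staged locally.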