Hi, I am using Spark 1.6.1 with Hive 2.
I agree this may be a case that needs to be resolved; I just happened to work around it. That first blank line causes

val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/")

to crash.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 28 March 2016 at 23:19, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Could I ask which version you are using?
>
> It looks like the cause is the empty line right after the header (because that case is not covered in the tests).
>
> However, empty lines before the header or inside the data are tested:
>
> https://raw.githubusercontent.com/databricks/spark-csv/master/src/test/resources/ages.csv
> https://raw.githubusercontent.com/databricks/spark-csv/master/src/test/resources/cars.csv
>
> So I think it should be able to read that case as well, and this might be an issue.
>
> It can easily be handled by ETL, but I think the behaviour should be consistent.
>
> Maybe it would be better if this issue were opened and discussed?
>
> On 29 Mar 2016 6:54 a.m., "Ashok Kumar" <ashok34...@yahoo.com.invalid> wrote:
>
>> Thanks a ton sir. Very helpful.
>>
>> On Monday, 28 March 2016, 22:36, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Pretty straightforward:
>>
>> #!/bin/ksh
>> DIR="hdfs://<hostname>:9000/data/stg/accounts/nw/xxxxxxxxx"
>> #
>> ## Remove the blank header line from the spreadsheets and compress them
>> #
>> echo `date` " ""======= Started removing blank header line and compressing all csv files"
>> for FILE in `ls *.csv`
>> do
>>   sed '1d' ${FILE} > ${FILE}.tmp
>>   mv -f ${FILE}.tmp ${FILE}
>>   /usr/bin/bzip2 ${FILE}
>> done
>> #
>> ## Clear out the hdfs staging directory
>> #
>> echo `date` " ""======= Started deleting old files from hdfs staging directory ${DIR}"
>> hdfs dfs -rm -r ${DIR}/*.bz2
>> echo `date` " ""======= Started putting bz2 files to hdfs staging directory ${DIR}"
>> for FILE in `ls *.bz2`
>> do
>>   hdfs dfs -copyFromLocal ${FILE} ${DIR}
>> done
>> echo `date` " ""======= Checking that all files are moved to hdfs staging directory"
>> hdfs dfs -ls ${DIR}
>> exit 0
>>
>> HTH
>>
>> On 28 March 2016 at 22:24, Ashok Kumar <ashok34...@yahoo.com> wrote:
>>
>> Hello Mich,
>>
>> If you can accommodate it, can you please share your approach to steps 1-3 above?
>>
>> Best regards
>>
>> On Sunday, 27 March 2016, 14:53, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> Pretty simple; as usual it is a combination of ETL and ELT.
>>
>> Basically the csv files are loaded into a staging directory on the host and compressed before being pushed into hdfs:
>>
>> 1. ETL --> Get rid of the blank header line in the csv files
>> 2. ETL --> Compress the csv files
>> 3. ETL --> Put the compressed csv files into the hdfs staging directory
>> 4. ELT --> Use the databricks spark-csv package to load the csv files
>> 5. ELT --> Use Spark FP (functional programming) to process the csv data
>> 6. ELT --> Register it as a temporary table
>> 7. ELT --> Create an ORC table, compressed with zlib, in a named Hive database
>> 8. ELT --> Insert/select from the temporary table into the Hive table
>>
>> So the data ends up stored in an ORC table and one can do whatever analysis is needed using Spark, Hive, etc.
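For reference, steps 4 to 8 above might look roughly like the following in Spark 1.6 Scala. This is only a minimal sketch: the staging path, the database and table names (accounts, ac_orc), and the renamed column list are illustrative, and it assumes sqlContext is a HiveContext and that the files have the seven columns shown in the sample data.

// Step 4: load the compressed csv files from the hdfs staging directory
// (the .bz2 files are decompressed transparently by the Hadoop input format)
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("hdfs://<hostname>:9000/data/stg/accounts/nw/xxxxxxxxx")

// Step 5: rename the columns to Hive-friendly names (no spaces)
val staged = df.toDF("TransactionDate", "TransactionType", "Description",
  "Value", "Balance", "AccountName", "AccountNumber")

// Step 6: register it as a temporary table
staged.registerTempTable("tmp_accounts")

// Step 7: create an ORC table, compressed with zlib, in a named Hive database
sqlContext.sql("CREATE DATABASE IF NOT EXISTS accounts")
sqlContext.sql(
  """CREATE TABLE IF NOT EXISTS accounts.ac_orc (
    |  TransactionDate string, TransactionType string, Description string,
    |  Value double, Balance double, AccountName string, AccountNumber string)
    |STORED AS ORC
    |TBLPROPERTIES ('orc.compress'='ZLIB')""".stripMargin)

// Step 8: insert/select from the temporary table into the Hive table
sqlContext.sql("INSERT INTO TABLE accounts.ac_orc SELECT * FROM tmp_accounts")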
>> On 27 March 2016 at 03:05, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> To me this is expected behaviour that I would not want fixed, but if you look at the recent commits for spark-csv there is one that deals with this...
>>
>> On Mar 26, 2016 21:25, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
>>
>> Hi,
>>
>> I have a standard csv file (saved as csv in HDFS) that has a blank first line before the header (and another blank line right after it), as follows:
>>
>> [blank line]
>> Date, Type, Description, Value, Balance, Account Name, Account Number
>> [blank line]
>> 22/03/2011,SBT,"'FUNDS TRANSFER , FROM A/C 1790999",200.00,200.00,"'BROWN AE","'638585-60125663",
>>
>> When I read this file using the standard
>>
>> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/")
>>
>> it crashes:
>>
>> java.util.NoSuchElementException
>>         at java.util.ArrayList$Itr.next(ArrayList.java:794)
>>
>> If I manually delete the first blank line, the same call works OK:
>>
>> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/")
>>
>> df: org.apache.spark.sql.DataFrame = [Date: string, Type: string, Description: string, Value: double, Balance: double, Account Name: string, Account Number: string]
>>
>> I can easily write a shell script to get rid of the blank line. I was wondering whether the databricks spark-csv package has a flag to skip a leading blank line in the csv file?
>>
>> P.S. If the file is stored as a DOS text file, this problem goes away.
>>
>> Thanks
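A possible Spark-side workaround, as an untested sketch rather than a confirmed fix: read the files as plain text first, drop blank lines, and hand the cleaned RDD to spark-csv. This assumes the CsvParser builder and its csvRdd method available in spark-csv 1.x, plus the usual sc (SparkContext) and sqlContext of a spark-shell session; the path is the one from the example above.

import com.databricks.spark.csv.CsvParser

// Read the raw files and drop empty lines, including the blank line before the header
val raw = sc.textFile("hdfs://rhes564:9000/data/stg/accounts/ac/")
val nonBlank = raw.filter(_.trim.nonEmpty)

// Parse the cleaned lines with spark-csv, with header handling and schema inference as before
val df = new CsvParser()
  .withUseHeader(true)
  .withInferSchema(true)
  .csvRdd(sqlContext, nonBlank)

This avoids the pre-processing step on the local filesystem, at the cost of reading the data as text first; the sed-based script above remains the simpler option if the files are already being staged locally.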