Pretty straight forward #!/bin/ksh DIR="hdfs://<hostname>:9000/data/stg/accounts/nw/xxxxxxxxx" # ## Remove the blank header line from the spreadsheets and compress them # echo `date` " ""======= Started Removing blank header line and Compressing all csv FILEs" for FILE in `ls *.csv` do sed '1d' ${FILE} > ${FILE}.tmp mv -f ${FILE}.tmp ${FILE} /usr/bin/bzip2 ${FILE} done # ## Clear out hdfs staging directory # echo `date` " ""======= Started deleting old files from hdfs staging directory ${DIR}" hdfs dfs -rm -r ${DIR}/*.bz2 echo `date` " ""======= Started Putting bz2 fileS to hdfs staging directory ${DIR}" for FILE in `ls *.bz2` do hdfs dfs -copyFromLocal ${FILE} ${DIR} done echo `date` " ""======= Checking that all files are moved to hdfs staging directory" hdfs dfs -ls ${DIR} exit 0
HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 28 March 2016 at 22:24, Ashok Kumar <ashok34...@yahoo.com> wrote: > Hello Mich > > If you accommodate can you please share your approach to steps 1-3 above. > > Best regards > > > On Sunday, 27 March 2016, 14:53, Mich Talebzadeh < > mich.talebza...@gmail.com> wrote: > > > Pretty simple as usual it is a combination of ETL and ELT. > > Basically csv files are loaded into staging directory on host, compressed > before pushing into hdfs > > > 1. ETL --> Get rid of the header blank line on the csv files > 2. ETL --> Compress the csv files > 3. ETL --> Put the compressed CVF files into hdfs staging directory > 4. ELT --> Use databricks to load the csv files > 5. ELT --> Spark FP to prcess the csv data > 6. ELT --> register it as a temporary table > 7. ELT --> Create an ORC table in a named database in compressed zlib2 > format in Hive database > 8. ELT --> Insert/select from temporary table to Hive table > > > So the data is stored in an ORC table and one can do whatever analysis > using Spark, Hive etc > > > > Dr Mich Talebzadeh > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > http://talebzadehmich.wordpress.com > > > On 27 March 2016 at 03:05, Koert Kuipers <ko...@tresata.com> wrote: > > To me this is expected behavior that I would not want fixed, but if you > look at the recent commits for spark-csv it has one that deals this... > On Mar 26, 2016 21:25, "Mich Talebzadeh" <mich.talebza...@gmail.com> > wrote: > > > Hi, > > I have a standard csv file (saved as csv in HDFS) that has first line of > blank at the header > as follows > > [blank line] > Date, Type, Description, Value, Balance, Account Name, Account Number > [blank line] > 22/03/2011,SBT,"'FUNDS TRANSFER , FROM A/C 1790999",200.00,200.00,"'BROWN > AE","'638585-60125663", > > When I read this file using the following standard > > val df = > sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", > "true").option("header", > "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/") > > it crashes. > > java.util.NoSuchElementException > at java.util.ArrayList$Itr.next(ArrayList.java:794) > > If I go and manually delete the first blank line it works OK > > val df = > sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", > "true").option("header", > "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/") > > df: org.apache.spark.sql.DataFrame = [Date: string, Type: string, > Description: string, Value: double, Balance: double, Account Name: > string, Account Number: string] > > I can easily write a shell script to get rid of blank line. I was > wondering if databricks does have a flag to get rid of the first blank line > in csv file format? > > P.S. If the file is stored as DOS text file, this problem goes away. > > Thanks > > Dr Mich Talebzadeh > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > http://talebzadehmich.wordpress.com > > > > > >