OK, a bit of a challenge. Have you tried the Databricks spark-csv package? It can read compressed files and might work here.
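The same load() should also cope with gzipped input, as the read goes through the normal Hadoop text input and the codec is picked up from the file extension. A minimal sketch (the .gz path below is made up):

val dfGz = sqlContext.read.format("com.databricks.spark.csv").
  option("header", "true").
  load("hdfs://rhes564:9000/data/stg/accounts/nw/10124772.csv.gz")   // hypothetical gzipped copy of the file

For the plain file, the full flow, reading it and registering a temp table, would be along these lines: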
import org.apache.spark.sql.functions.col
import sqlContext.implicits._

val df = sqlContext.read.format("com.databricks.spark.csv").
  option("inferSchema", "true").
  option("header", "true").
  load("hdfs://rhes564:9000/data/stg/accounts/nw/10124772")

case class Accounts(TransactionDate: String, TransactionType: String, Description: String,
                    Value: Double, Balance: Double, AccountName: String, AccountNumber: String)

//
// Map the columns to names, dropping rows with an empty Date column (name taken from the CSV header)
//
val a = df.filter(col("Date") > "").map(p =>
  Accounts(p(0).toString, p(1).toString, p(2).toString,
           p(3).toString.toDouble, p(4).toString.toDouble,
           p(5).toString, p(6).toString))

//
// Create a Spark temporary table
//
a.toDF.registerTempTable("tmp")

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 17 June 2016 at 21:02, Everett Anderson <ever...@nuna.com> wrote:

> On Fri, Jun 17, 2016 at 12:44 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> Are these mainly in csv format?
>
> Alas, no -- lots of different formats. Many are fixed-width files, where I
> have outside information to know which byte ranges correspond to which
> columns. Some have odd null representations or non-comma delimiters (though
> many of those cases might fit within the configurability of the spark-csv
> package).
>
>> On 17 June 2016 at 20:38, Everett Anderson <ever...@nuna.com.invalid> wrote:
>>
>>> Hi,
>>>
>>> I have a system with files in a variety of non-standard input formats,
>>> though they're generally flat text files. I'd like to dynamically create
>>> DataFrames of string columns.
>>>
>>> What's the best way to go from an RDD<String> to a DataFrame of
>>> StringType columns?
>>>
>>> My current plan is:
>>>
>>> - Call map() on the RDD<String> with a function to split the String
>>>   into columns and call RowFactory.create() with the resulting array,
>>>   creating an RDD<Row>
>>> - Construct a StructType schema using column names and StringType
>>> - Call SQLContext.createDataFrame(RDD, schema) to create the result
>>>
>>> Does that make sense?
>>>
>>> I looked through the spark-csv package a little and noticed that it's
>>> using baseRelationToDataFrame(), but BaseRelation looks like it might be a
>>> restricted developer API. Anyone know if it's recommended for use?
>>>
>>> Thanks!
>>>
>>> - Everett
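As for the plan described further down the thread (map each line to a Row, build a StructType of StringType columns, then createDataFrame): that is pretty much the standard route. A rough sketch in Scala against the Spark 1.x API for one of the fixed-width cases -- the path, column names and byte offsets below are placeholders, substitute the real layout:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical fixed-width layout: column name -> (start, end) offsets within each line
val layout = Seq(
  "TransactionDate" -> (0, 10),
  "TransactionType" -> (10, 14),
  "Description"     -> (14, 54)
)

// Every column comes in as a string; casts can be applied later in SQL or with cast()
val schema = StructType(layout.map { case (name, _) => StructField(name, StringType, nullable = true) })

val lines = sc.textFile("hdfs://rhes564:9000/data/stg/accounts/nw/some_fixed_width_file")   // RDD[String]

// Cut each line into its columns; assumes every line is at least as long as the last offset
val rows = lines.map(line =>
  Row.fromSeq(layout.map { case (_, (start, end)) => line.substring(start, end).trim }))

val rawDf = sqlContext.createDataFrame(rows, schema)
rawDf.registerTempTable("accounts_raw")

Since the schema is supplied explicitly to createDataFrame, there should be no need to go through BaseRelation or baseRelationToDataFrame() for this.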