Re: Best way to go from RDD to DataFrame of StringType columns

2016-06-17 Thread Jason
We do the exact same approach you proposed for converting horrible text formats (VCF in the bioinformatics domain) into DataFrames. This involves creating the schema dynamically based on the header of the file too. It's simple and easy, but if you need something higher performance you might need t
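For readers who want to see the shape of that approach, here is a minimal sketch, assuming the Spark 1.x `SQLContext` API used elsewhere in this thread. The helper name `toStringDataFrame` and its parameters are hypothetical and not taken from anyone's actual code. It assumes the header has already been parsed into column names and that every record has been split into the same number of string fields.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper: `header` holds column names parsed from the file header,
// and `rows` holds one Seq[String] per record, already split into fields.
// Assumption: every record has exactly header.length fields.
def toStringDataFrame(sqlContext: SQLContext,
                      header: Seq[String],
                      rows: RDD[Seq[String]]): DataFrame = {
  // Build the schema dynamically: one nullable StringType column per header entry.
  val schema = StructType(
    header.map(name => StructField(name, StringType, nullable = true)))

  // Convert each record to a Row whose field order matches the schema.
  val rowRdd: RDD[Row] = rows.map(fields => Row.fromSeq(fields))

  sqlContext.createDataFrame(rowRdd, schema)
}
```

Because every column is a string, any casting to numeric or date types would happen afterwards on the resulting DataFrame.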

Re: Best way to go from RDD to DataFrame of StringType columns

2016-06-17 Thread Everett Anderson
On Fri, Jun 17, 2016 at 1:17 PM, Mich Talebzadeh wrote:
> Ok a bit of a challenge.
>
> Have you tried using databricks stuff?. they can read compressed files and
> they might work here?
>
> val df =
> sqlContext.read.format("com.databricks.spark.csv").option("inferSchema",
> "true").option("heade

Re: Best way to go from RDD to DataFrame of StringType columns

2016-06-17 Thread Mich Talebzadeh
OK, a bit of a challenge. Have you tried the Databricks spark-csv package? It can read compressed files and might work here.

val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/nw/10124772") c

Re: Best way to go from RDD to DataFrame of StringType columns

2016-06-17 Thread Everett Anderson
On Fri, Jun 17, 2016 at 12:44 PM, Mich Talebzadeh wrote:
> Are these mainly in csv format?

Alas, no -- lots of different formats. Many are fixed width files, where I have outside information to know which byte ranges correspond to which columns. Some have odd null representations or non-comma
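
The fixed-width case mentioned above could be handled with something like the following sketch, written for the Scala REPL or spark-shell. The `FixedWidthColumn` type, the `sliceLine` helper, and the sample layout are hypothetical illustrations. The sketch also uses character offsets rather than byte ranges, which are only equivalent for single-byte encodings.

```scala
// Hypothetical column layout: a name plus [start, end) character offsets.
case class FixedWidthColumn(name: String, start: Int, end: Int)

// Slice one line into trimmed string fields.
// Offsets are clamped to the line length, so short lines yield empty strings
// instead of throwing StringIndexOutOfBoundsException.
def sliceLine(line: String, columns: Seq[FixedWidthColumn]): Seq[String] =
  columns.map { c =>
    val from = math.min(c.start, line.length)
    val to = math.min(c.end, line.length)
    line.substring(from, to).trim
  }

// Made-up layout and record, for illustration only.
val layout = Seq(FixedWidthColumn("id", 0, 5), FixedWidthColumn("name", 5, 15))
val fields = sliceLine("00042Ada       ", layout)
```

Applied inside `map()` over an RDD of lines, this produces one sequence of strings per record, and `layout.map(_.name)` supplies the column names for a string-only schema.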

Re: Best way to go from RDD to DataFrame of StringType columns

2016-06-17 Thread Mich Talebzadeh
Are these mainly in CSV format?

Best way to go from RDD to DataFrame of StringType columns

2016-06-17 Thread Everett Anderson
Hi,

I have a system with files in a variety of non-standard input formats, though they're generally flat text files. I'd like to dynamically create DataFrames of string columns.

What's the best way to go from an RDD to a DataFrame of StringType columns?

My current plan is

- Call map() on the