Hi, I have a system with files in a variety of non-standard input formats, though they're generally flat text files. I'd like to dynamically create DataFrames of string columns.
What's the best way to go from an RDD<String> to a DataFrame of StringType columns? My current plan is:

- Call map() on the RDD<String> with a function that splits each String into columns and calls RowFactory.create() on the resulting array, producing an RDD<Row>
- Construct a StructType schema from the column names, with every field as StringType
- Call SQLContext.createDataFrame(RDD, schema) to produce the result

(A rough sketch of these steps is below.) Does that make sense? I looked through the spark-csv package a little and noticed that it uses baseRelationToDataFrame(), but BaseRelation looks like it might be a restricted developer API. Does anyone know whether it's recommended for use?

Thanks!
- Everett
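Here's a minimal sketch of the three steps against the Spark 1.x SQLContext API, just to make the plan concrete. The tab delimiter, the input path, and the column names are placeholders I made up for illustration; the real files are in various non-standard formats.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class StringColumnsExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("StringColumnsExample");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);

    // Placeholder input: tab-delimited lines are assumed here only for illustration.
    JavaRDD<String> lines = sc.textFile("hdfs:///path/to/input");

    // Step 1: split each line into fields and wrap them in a Row.
    JavaRDD<Row> rows = lines.map(line -> {
      String[] fields = line.split("\t", -1);
      return RowFactory.create((Object[]) fields);
    });

    // Step 2: build a StructType where every column is a nullable StringType.
    // Column names are made up; in practice they'd come from per-format metadata.
    List<String> columnNames = Arrays.asList("col1", "col2", "col3");
    List<StructField> structFields = new ArrayList<>();
    for (String name : columnNames) {
      structFields.add(DataTypes.createStructField(name, DataTypes.StringType, true));
    }
    StructType schema = DataTypes.createStructType(structFields);

    // Step 3: combine the RDD<Row> and the schema into a DataFrame.
    DataFrame df = sqlContext.createDataFrame(rows, schema);
    df.printSchema();

    sc.stop();
  }
}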