OK, a bit of a challenge. Have you tried the Databricks spark-csv package? It can read compressed files and might work here.
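The same load() should also cope with gzipped input, as the read goes through the normal Hadoop text input and the codec is picked up from the file extension. A minimal sketch (the .gz path below is made up):

val dfGz = sqlContext.read.format("com.databricks.spark.csv").
  option("header", "true").
  load("hdfs://rhes564:9000/data/stg/accounts/nw/10124772.csv.gz")   // hypothetical gzipped copy of the file

For the plain file, the full flow, reading it and registering a temp table, would be along these lines: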
import org.apache.spark.sql.functions.col
import sqlContext.implicits._

val df = sqlContext.read.format("com.databricks.spark.csv").
  option("inferSchema", "true").
  option("header", "true").
  load("hdfs://rhes564:9000/data/stg/accounts/nw/10124772")

case class Accounts(TransactionDate: String, TransactionType: String, Description: String,
                    Value: Double, Balance: Double, AccountName: String, AccountNumber: String)

//
// Map the columns to names, dropping rows with an empty Date column (name taken from the CSV header)
//
val a = df.filter(col("Date") > "").map(p =>
  Accounts(p(0).toString, p(1).toString, p(2).toString,
           p(3).toString.toDouble, p(4).toString.toDouble,
           p(5).toString, p(6).toString))

//
// Create a Spark temporary table
//
a.toDF.registerTempTable("tmp")

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 17 June 2016 at 21:02, Everett Anderson <ever...@nuna.com> wrote:

> On Fri, Jun 17, 2016 at 12:44 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> Are these mainly in csv format?
>
> Alas, no -- lots of different formats. Many are fixed-width files, where I
> have outside information to know which byte ranges correspond to which
> columns. Some have odd null representations or non-comma delimiters (though
> many of those cases might fit within the configurability of the spark-csv
> package).
>
>> On 17 June 2016 at 20:38, Everett Anderson <ever...@nuna.com.invalid> wrote:
>>
>>> Hi,
>>>
>>> I have a system with files in a variety of non-standard input formats,
>>> though they're generally flat text files. I'd like to dynamically create
>>> DataFrames of string columns.
>>>
>>> What's the best way to go from an RDD<String> to a DataFrame of
>>> StringType columns?
>>>
>>> My current plan is:
>>>
>>> - Call map() on the RDD<String> with a function to split the String
>>>   into columns and call RowFactory.create() with the resulting array,
>>>   creating an RDD<Row>
>>> - Construct a StructType schema using column names and StringType
>>> - Call SQLContext.createDataFrame(RDD, schema) to create the result
>>>
>>> Does that make sense?
>>>
>>> I looked through the spark-csv package a little and noticed that it's
>>> using baseRelationToDataFrame(), but BaseRelation looks like it might be a
>>> restricted developer API. Anyone know if it's recommended for use?
>>>
>>> Thanks!
>>>
>>> - Everett
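As for the plan described further down the thread (map each line to a Row, build a StructType of StringType columns, then createDataFrame): that is pretty much the standard route. A rough sketch in Scala against the Spark 1.x API for one of the fixed-width cases -- the path, column names and byte offsets below are placeholders, substitute the real layout:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical fixed-width layout: column name -> (start, end) offsets within each line
val layout = Seq(
  "TransactionDate" -> (0, 10),
  "TransactionType" -> (10, 14),
  "Description"     -> (14, 54)
)

// Every column comes in as a string; casts can be applied later in SQL or with cast()
val schema = StructType(layout.map { case (name, _) => StructField(name, StringType, nullable = true) })

val lines = sc.textFile("hdfs://rhes564:9000/data/stg/accounts/nw/some_fixed_width_file")   // RDD[String]

// Cut each line into its columns; assumes every line is at least as long as the last offset
val rows = lines.map(line =>
  Row.fromSeq(layout.map { case (_, (start, end)) => line.substring(start, end).trim }))

val rawDf = sqlContext.createDataFrame(rows, schema)
rawDf.registerTempTable("accounts_raw")

Since the schema is supplied explicitly to createDataFrame, there should be no need to go through BaseRelation or baseRelationToDataFrame() for this.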