Umm... I think the premise is that you need to "know" beforehand which columns are numeric... Unless you know it, how would you apply the schema?
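(If the end schema is already known up front, as Sam says downthread, the numeric-column list does not have to be known by a human at all; it can be derived from the schema itself. A minimal Scala sketch of that idea, where endSchema stands in for whatever StructType the connector is handed:)

import org.apache.spark.sql.types._

// Stand-in for the known target schema (Sam's example further down).
val endSchema = StructType(Array(
  StructField("customerid", LongType, true),
  StructField("foo", StringType, true)))

// Derive the numeric column names from the schema instead of hand-writing them.
val numericColumns = endSchema.fields.collect {
  case StructField(name, _: NumericType, _, _) => name
}
// numericColumns: Array[String] = Array(customerid)

// A matching "parse schema" that reads those same fields as strings first,
// per Michael's cast-after-load suggestion downthread.
val parseSchema = StructType(endSchema.fields.map { f =>
  if (f.dataType.isInstanceOf[NumericType]) f.copy(dataType = StringType) else f
})

That way the schema carries the information, and no per-job column list is needed.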
On Mon, Feb 6, 2017 at 7:54 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:

> Thanks ayan, but I meant how to derive the list automatically.
>
> In your example you are specifying the numeric columns, and I would like
> it to be applied to any schema, if that makes sense.
>
> On Mon, 6 Feb 2017 at 08:49, ayan guha <guha.a...@gmail.com> wrote:
>
>> Simple (pyspark) example:
>>
>> >>> df = sqlContext.read.json("/user/l_aguha/spark_qs.json")
>> >>> df.printSchema()
>> root
>>  |-- customerid: string (nullable = true)
>>  |-- foo: string (nullable = true)
>>
>> >>> numeric_field_list = ['customerid']
>>
>> >>> for k in numeric_field_list:
>> ...     df = df.withColumn(k, df[k].cast("long"))
>> ...
>> >>> df.printSchema()
>> root
>>  |-- customerid: long (nullable = true)
>>  |-- foo: string (nullable = true)
>>
>> On Mon, Feb 6, 2017 at 6:56 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:
>>
>> Ok thanks Michael!
>>
>> Can I get an idea on where to start? Assuming I have the end schema and
>> the current dataframe...
>> How can I loop through it and create a new dataframe using withColumn?
>>
>> Am I iterating through the dataframe or the schema?
>>
>> I'm assuming it's easier to iterate through the columns in the old df.
>> For each column, cast it correctly and generate a new df?
>>
>> Would you recommend that?
>>
>> Regards
>> Sam
>>
>> On Mon, 6 Feb 2017 at 01:12, Michael Armbrust <mich...@databricks.com> wrote:
>>
>> If you already have the expected schema, and you know that all numbers
>> will always be formatted as strings in the input JSON, you could probably
>> derive this list automatically.
>>
>> Wouldn't it be simpler to just regex replace the numbers to remove the
>> quotes?
>>
>> I think this is likely to be a slower and less robust solution. You
>> would have to make sure that you got all the corner cases right (i.e.
>> escaping and whatnot).
>>
>> On Sun, Feb 5, 2017 at 3:13 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:
>>
>> I see, so for the connector I need to pass in an array/list of numerical
>> columns?
>>
>> Wouldn't it be simpler to just regex replace the numbers to remove the
>> quotes?
>>
>> Regards
>> Sam
>>
>> On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust
>> <mich...@databricks.com> wrote:
>>
>> Specifying the schema when parsing JSON will only let you pick between
>> similar datatypes (i.e. should this be a short, long, float, double, etc.).
>> It will not let you perform conversions like string <-> number. This has
>> to be done with explicit casts after the data has been loaded.
>>
>> I think you can make a solution that uses select or withColumn generic.
>> Just load the dataframe with a "parse schema" that treats numbers as
>> strings. Then construct a list of columns that should be numbers and apply
>> the necessary conversions.
>>
>> import org.apache.spark.sql.functions.col
>>
>> var df = spark.read.schema(parseSchema).json("...")
>> numericColumns.foreach { columnName =>
>>   df = df.withColumn(columnName, col(columnName).cast("long"))
>> }
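(A fully generic variant of the same idea, not from the thread, only a sketch: rather than maintaining a numeric-column list, cast every column to whatever type the end schema declares in a single select. Casting a column to the type it already has is a no-op, so non-numeric columns pass through untouched:)

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Cast each column to the type its StructField declares; columns that
// already have the right type pass through unchanged.
def castToSchema(df: DataFrame, endSchema: StructType): DataFrame =
  df.select(endSchema.fields.map(f => col(f.name).cast(f.dataType).as(f.name)): _*)

// usage sketch:
// val typed = castToSchema(spark.read.schema(parseSchema).json("..."), endSchema)

This applies the cast-after-load advice above without any per-job column list, which is the generic behaviour Sam is after.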
>> On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:
>>
>> Thanks Michael
>>
>> I've been spending the past few days researching this.
>>
>> The problem is the generated json has double quotes on fields that are
>> numbers, because the producing datastore doesn't want to lose precision.
>>
>> I can change the data type, true, but that would be specific to a job
>> rather than a generic streaming job. I'm writing a structured streaming
>> connector and I have the schema the generated dataframe should match.
>>
>> Unfortunately using withColumn won't help me here, since the solution
>> needs to be generic.
>>
>> To summarise, assume I have the following json:
>>
>> [{
>>   "customerid": "535137",
>>   "foo": "bar"
>> }]
>>
>> and I know the schema should be:
>>
>> StructType(Array(StructField("customerid",LongType,true),
>>   StructField("foo",StringType,true)))
>>
>> What's the best way of solving this?
>>
>> My current approach is to iterate over the JSON, identify which fields
>> are numbers and which aren't, then recreate the json.
>>
>> But to be honest that doesn't seem like the cleanest approach, so happy
>> for advice on this.
>>
>> Regards
>> Sam
>>
>> On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mich...@databricks.com> wrote:
>>
>> -dev
>>
>> You can use withColumn to change the type after the data has been loaded
>> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>.
>>
>> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hussam.ela...@gmail.com> wrote:
>>
>> Hi Dirceu
>>
>> Thanks, you're right! That did work.
>>
>> But now I'm facing an even bigger problem: since I don't have access to
>> change the underlying data, I just want to apply a schema over something
>> that was written via sparkContext.newAPIHadoopRDD.
>>
>> Basically I am reading in an RDD[JsonObject] and would like to convert it
>> into a dataframe, passing the schema in.
>>
>> What's the best way to do this?
>>
>> I doubt removing all the quotes in the JSON is the best solution, is it?
>>
>> Regards
>> Sam
>>
>> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
>> dirceu.semigh...@gmail.com> wrote:
>>
>> Hi Sam
>> Remove the " from the number and it will work.
>>
>> On 4 Feb 2017 at 11:46 AM, "Sam Elamin" <hussam.ela...@gmail.com> wrote:
>>
>> Hi All
>>
>> I would like to specify a schema when reading from a json, but when trying
>> to map a number to a Double it fails; I tried FloatType and IntType with
>> no joy!
>>
>> When inferring the schema, customerid is set to String, and I would like
>> to cast it as Double.
>>
>> So df1 is corrupted (the value comes back null) while df2 shows the value.
>>
>> Also FYI, I need this to be generic as I would like to apply it to any
>> json; I specified the below schema as an example of the issue I am facing.
>>
>> import org.apache.spark.sql.types.{BinaryType, StringType, StructField,
>>   DoubleType, FloatType, StructType, LongType, DecimalType}
>> val testSchema = StructType(Array(StructField("customerid", DoubleType)))
>> val df1 = spark.read.schema(testSchema)
>>   .json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>> val df2 = spark.read
>>   .json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>> df1.show(1)
>> df2.show(1)
>>
>> Any help would be appreciated; I am sure I am missing something obvious,
>> but for the life of me I can't tell what it is!
>>
>> Kind Regards
>> Sam
>>
>> --
>> Best Regards,
>> Ayan Guha

--
Best Regards,
Ayan Guha
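(For completeness, a runnable sketch of the cast-after-load approach discussed in this thread, applied to the original failing example; it assumes a Spark 2.x session with the same spark and sc values used above:)

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Parse the quoted number as a string first, then cast it explicitly.
val parseSchema = StructType(Array(StructField("customerid", StringType)))
val df = spark.read.schema(parseSchema)
  .json(sc.parallelize(Array("""{"customerid":"535137"}""")))
  .withColumn("customerid", col("customerid").cast("double"))
df.show(1) // prints 535137.0 rather than the null that df1 produces above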