Re: specifying schema on dataframe

Sam Elamin Sun, 05 Feb 2017 14:10:10 -0800

Thanks Michael

I've spent the past few days researching this.


The problem is that the generated JSON has double quotes around fields that
are numbers, because the producing datastore doesn't want to lose precision.

True, I can change the data type, but that would be specific to one job
rather than a generic streaming job. I'm writing a structured streaming
connector and I have the schema the generated dataframe should match.

Unfortunately, using withColumn won't help me here since the solution needs
to be generic.

To summarise, assume I have the following JSON:

[{
"customerid": "535137",
"foo": "bar"
}]


and I know the schema should be:
StructType(Array(StructField("customerid",LongType,true),StructField("foo",StringType,true)))

What's the best way of solving this?

My current approach is to iterate over the JSON, identify which fields
are numbers and which aren't, and then recreate the JSON.

But to be honest, that doesn't seem like the cleanest approach, so I'm happy
to take advice on this.
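
One generic alternative to rewriting the JSON, as a minimal sketch only:
read every field as a string, then cast each column to the type the known
schema declares for it. The helper name castToSchema is hypothetical (it is
not a Spark API), the sketch only handles top-level fields, and a value that
cannot be cast comes back as null under Spark's default (non-ANSI) settings.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Hypothetical helper (not part of Spark): cast each top-level column of
// `df` to the type that `target` declares for the field of the same name.
def castToSchema(df: DataFrame, target: StructType): DataFrame =
  df.select(target.fields.map(f => col(f.name).cast(f.dataType).as(f.name)): _*)

// In spark-shell this returns the existing session
val spark = SparkSession.builder.master("local[1]").appName("schema-cast-sketch").getOrCreate()

// The schema the dataframe should end up with (taken from the example above)
val target = StructType(Array(
  StructField("customerid", LongType, true),
  StructField("foo", StringType, true)))

// Same field names, but every field read as a string, so quoted numbers load as-is
val stringSchema = StructType(target.fields.map(f => StructField(f.name, StringType, true)))

val jsonRdd = spark.sparkContext.parallelize(Seq("""{"customerid":"535137","foo":"bar"}"""))
val raw = spark.read.schema(stringSchema).json(jsonRdd)

// Apply the real types column by column, driven only by the schema
val typed = castToSchema(raw, target)
typed.printSchema()
typed.show(1)
```

Because the casts are derived from the StructType itself, nothing here is
specific to one job; nested structs and arrays would need extra handling
that this sketch does not attempt.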

Regards
Sam

On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mich...@databricks.com>
wrote:

> -dev
>
> You can use withColumn to change the type after the data has been loaded
> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>
> .
>
> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hussam.ela...@gmail.com>
> wrote:
>
> Hi Dirceu
>
> Thanks, you're right! That did work.
>
>
> But now I'm facing an even bigger problem, since I don't have access to
> change the underlying data. I just want to apply a schema over something
> that was read via sparkContext.newAPIHadoopRDD.
>
> Basically I am reading in an RDD[JsonObject] and would like to convert it
> into a dataframe that I pass the schema into.
>
> What's the best way to do this?
>
> I doubt removing all the quotes in the JSON is the best solution, is it?
>
> Regards
> Sam
>
> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com> wrote:
>
> Hi Sam
> Remove the " from the number and it will work
>
> On 4 Feb 2017 at 11:46 AM, "Sam Elamin" <hussam.ela...@gmail.com>
> wrote:
>
> Hi All
>
> I would like to specify a schema when reading from JSON, but when trying
> to map a number to a Double it fails. I tried FloatType and IntegerType
> with no joy!
>
>
> When inferring the schema, customerid is set to String, and I would like
> to cast it to Double
>
> so df1 is corrupted while df2 shows
>
>
> Also, FYI, I need this to be generic as I would like to apply it to any
> JSON. I specified the schema below only as an example of the issue I am
> facing.
>
> import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
> val testSchema = StructType(Array(StructField("customerid", DoubleType)))
> val json = sc.parallelize(Array("""{"customerid":"535137"}"""))
> // Explicit schema: the quoted number is not read as a Double
> val df1 = spark.read.schema(testSchema).json(json)
> // Inferred schema: customerid comes back as a String
> val df2 = spark.read.json(json)
> df1.show(1)
> df2.show(1)
>
>
> Any help would be appreciated. I am sure I am missing something obvious,
> but for the life of me I can't tell what it is!
>
>
> Kind Regards
> Sam
>
>
>
>
