Umm... I think the premise is that you need to "know" beforehand which columns are numeric... Unless you know it, how would you apply the schema?
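(If the end schema is already known up front, as Sam says downthread, the numeric-column list does not have to be known by a human at all; it can be derived from the schema itself. A minimal Scala sketch of that idea, where endSchema stands in for whatever StructType the connector is handed:)

import org.apache.spark.sql.types._

// Stand-in for the known target schema (Sam's example further down).
val endSchema = StructType(Array(
  StructField("customerid", LongType, true),
  StructField("foo", StringType, true)))

// Derive the numeric column names from the schema instead of hand-writing them.
val numericColumns = endSchema.fields.collect {
  case StructField(name, _: NumericType, _, _) => name
}
// numericColumns: Array[String] = Array(customerid)

// A matching "parse schema" that reads those same fields as strings first,
// per Michael's cast-after-load suggestion downthread.
val parseSchema = StructType(endSchema.fields.map { f =>
  if (f.dataType.isInstanceOf[NumericType]) f.copy(dataType = StringType) else f
})

That way the schema carries the information, and no per-job column list is needed.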
On Mon, Feb 6, 2017 at 7:54 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:

> Thanks ayan, but I meant how to derive the list automatically.
>
> In your example you are specifying the numeric columns, and I would like
> it to be applied to any schema, if that makes sense.
>
> On Mon, 6 Feb 2017 at 08:49, ayan guha <guha.a...@gmail.com> wrote:
>
>> Simple (pyspark) example:
>>
>> >>> df = sqlContext.read.json("/user/l_aguha/spark_qs.json")
>> >>> df.printSchema()
>> root
>>  |-- customerid: string (nullable = true)
>>  |-- foo: string (nullable = true)
>>
>> >>> numeric_field_list = ['customerid']
>>
>> >>> for k in numeric_field_list:
>> ...     df = df.withColumn(k, df[k].cast("long"))
>> ...
>> >>> df.printSchema()
>> root
>>  |-- customerid: long (nullable = true)
>>  |-- foo: string (nullable = true)
>>
>> On Mon, Feb 6, 2017 at 6:56 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:
>>
>> Ok thanks Michael!
>>
>> Can I get an idea on where to start? Assuming I have the end schema and
>> the current dataframe...
>> How can I loop through it and create a new dataframe using withColumn?
>>
>> Am I iterating through the dataframe or the schema?
>>
>> I'm assuming it's easier to iterate through the columns in the old df.
>> For each column, cast it correctly and generate a new df?
>>
>> Would you recommend that?
>>
>> Regards
>> Sam
>>
>> On Mon, 6 Feb 2017 at 01:12, Michael Armbrust <mich...@databricks.com> wrote:
>>
>> If you already have the expected schema, and you know that all numbers
>> will always be formatted as strings in the input JSON, you could probably
>> derive this list automatically.
>>
>> Wouldn't it be simpler to just regex replace the numbers to remove the
>> quotes?
>>
>> I think this is likely to be a slower and less robust solution. You
>> would have to make sure that you got all the corner cases right (i.e.
>> escaping and whatnot).
>>
>> On Sun, Feb 5, 2017 at 3:13 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:
>>
>> I see, so for the connector I need to pass in an array/list of numerical
>> columns?
>>
>> Wouldn't it be simpler to just regex replace the numbers to remove the
>> quotes?
>>
>> Regards
>> Sam
>>
>> On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust
>> <mich...@databricks.com> wrote:
>>
>> Specifying the schema when parsing JSON will only let you pick between
>> similar datatypes (i.e. should this be a short, long, float, double, etc.).
>> It will not let you perform conversions like string <-> number. This has
>> to be done with explicit casts after the data has been loaded.
>>
>> I think you can make a solution that uses select or withColumn generic.
>> Just load the dataframe with a "parse schema" that treats numbers as
>> strings. Then construct a list of columns that should be numbers and apply
>> the necessary conversions.
>>
>> import org.apache.spark.sql.functions.col
>>
>> var df = spark.read.schema(parseSchema).json("...")
>> numericColumns.foreach { columnName =>
>>   df = df.withColumn(columnName, col(columnName).cast("long"))
>> }
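(A fully generic variant of the same idea, not from the thread, only a sketch: rather than maintaining a numeric-column list, cast every column to whatever type the end schema declares in a single select. Casting a column to the type it already has is a no-op, so non-numeric columns pass through untouched:)

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Cast each column to the type its StructField declares; columns that
// already have the right type pass through unchanged.
def castToSchema(df: DataFrame, endSchema: StructType): DataFrame =
  df.select(endSchema.fields.map(f => col(f.name).cast(f.dataType).as(f.name)): _*)

// usage sketch:
// val typed = castToSchema(spark.read.schema(parseSchema).json("..."), endSchema)

This applies the cast-after-load advice above without any per-job column list, which is the generic behaviour Sam is after.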
>> On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:
>>
>> Thanks Michael
>>
>> I've been spending the past few days researching this.
>>
>> The problem is the generated json has double quotes on fields that are
>> numbers, because the producing datastore doesn't want to lose precision.
>>
>> I can change the data type, true, but that would be specific to a job
>> rather than a generic streaming job. I'm writing a structured streaming
>> connector and I have the schema the generated dataframe should match.
>>
>> Unfortunately using withColumn won't help me here, since the solution
>> needs to be generic.
>>
>> To summarise, assume I have the following json:
>>
>> [{
>>   "customerid": "535137",
>>   "foo": "bar"
>> }]
>>
>> and I know the schema should be:
>>
>> StructType(Array(StructField("customerid",LongType,true),
>>   StructField("foo",StringType,true)))
>>
>> What's the best way of solving this?
>>
>> My current approach is to iterate over the JSON, identify which fields
>> are numbers and which aren't, then recreate the json.
>>
>> But to be honest that doesn't seem like the cleanest approach, so happy
>> for advice on this.
>>
>> Regards
>> Sam
>>
>> On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mich...@databricks.com> wrote:
>>
>> -dev
>>
>> You can use withColumn to change the type after the data has been loaded
>> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>.
>>
>> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hussam.ela...@gmail.com> wrote:
>>
>> Hi Dirceu
>>
>> Thanks, you're right! That did work.
>>
>> But now I'm facing an even bigger problem: since I don't have access to
>> change the underlying data, I just want to apply a schema over something
>> that was written via sparkContext.newAPIHadoopRDD.
>>
>> Basically I am reading in an RDD[JsonObject] and would like to convert it
>> into a dataframe, passing the schema in.
>>
>> What's the best way to do this?
>>
>> I doubt removing all the quotes in the JSON is the best solution, is it?
>>
>> Regards
>> Sam
>>
>> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
>> dirceu.semigh...@gmail.com> wrote:
>>
>> Hi Sam
>> Remove the " from the number and it will work.
>>
>> On 4 Feb 2017 at 11:46 AM, "Sam Elamin" <hussam.ela...@gmail.com> wrote:
>>
>> Hi All
>>
>> I would like to specify a schema when reading from a json, but when trying
>> to map a number to a Double it fails; I tried FloatType and IntType with
>> no joy!
>>
>> When inferring the schema, customerid is set to String, and I would like
>> to cast it as Double.
>>
>> So df1 is corrupted (the value comes back null) while df2 shows the value.
>>
>> Also FYI, I need this to be generic as I would like to apply it to any
>> json; I specified the below schema as an example of the issue I am facing.
>>
>> import org.apache.spark.sql.types.{BinaryType, StringType, StructField,
>>   DoubleType, FloatType, StructType, LongType, DecimalType}
>> val testSchema = StructType(Array(StructField("customerid", DoubleType)))
>> val df1 = spark.read.schema(testSchema)
>>   .json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>> val df2 = spark.read
>>   .json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>> df1.show(1)
>> df2.show(1)
>>
>> Any help would be appreciated; I am sure I am missing something obvious,
>> but for the life of me I can't tell what it is!
>>
>> Kind Regards
>> Sam
>>
>> --
>> Best Regards,
>> Ayan Guha

--
Best Regards,
Ayan Guha
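(For completeness, a runnable sketch of the cast-after-load approach discussed in this thread, applied to the original failing example; it assumes a Spark 2.x session with the same spark and sc values used above:)

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Parse the quoted number as a string first, then cast it explicitly.
val parseSchema = StructType(Array(StructField("customerid", StringType)))
val df = spark.read.schema(parseSchema)
  .json(sc.parallelize(Array("""{"customerid":"535137"}""")))
  .withColumn("customerid", col("customerid").cast("double"))
df.show(1) // prints 535137.0 rather than the null that df1 produces above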