Hello:
I am using a UDF to convert input data based on a JSON schema: whenever a key in the schema has "type": "number", I need to convert the corresponding input value to a float. For example, if "income" has type "number" and the input value is 100, the output should be 100.0. The problem is that when the original value is an integer, the output becomes null instead; in the example above, the output is null.

My temporary workaround is to traverse the schema, find every key whose "type" is "number", store the path from the root to that key in a list, and then traverse the input data and convert each value found along those paths to a float. The problem with this algorithm is that when a key is "items", the value is nested inside an array, so there can be more than one number under that path, and there are further cases such as "items" under "items" or "items" under "properties". This algorithm cannot handle all of the corner cases. Could I get suggestions on other solutions for this UDF integer-to-float conversion problem?

=======================

To add context:

- We have data records loaded into Python dictionaries from JSON, and some fields (e.g. "income") have mixed values: in some records "income" is parsed as an integer (e.g. 100) and in others as a float (e.g. 100.0).
- JSON {"income": 100}, {"income": 100.0} -> Python {"income": 100}, {"income": 100.0}
- We load these records as JSON strings into a DataFrame, then convert them into a StructType using pyspark.sql.functions.udf. The int/float mixed numerical fields are declared as FloatType().
- Python {"income": 100}, {"income": 100.0} -> StructType(FloatType())
- We have observed that when PySpark 2.3 casts a Python int to FloatType, it coerces integer values like 100 to null instead of to 100.0 (a minimal reproduction is sketched after the signature below).
- Observed: Python 100 -> FloatType null
- Desired: Python 100 -> FloatType 100.0
- This behavior may also be true in Scala.
- We are currently trying to patch this problem by adding logic inside the Python function to recursively convert any integers to floats before returning from the UDF (also sketched below).
- We would rather not introduce this kind of custom, error-prone logic.

Have members of this community encountered this issue in PySpark or in Scala? If so, how have you solved it? Is there a way in PySpark to enable implicit conversion of Python integers (e.g. 100) to PySpark FloatType (e.g. 100.0)?

Thanks a lot!

Sincerely,
Danni
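
P.S. In case it helps, here is a minimal sketch of what our pipeline does, trimmed down to the "income" field only. The field and variable names are just illustrative, and the comments describe what we observe on our PySpark 2.3 setup:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StructType, StructField, FloatType
    import json

    spark = SparkSession.builder.getOrCreate()

    # Two JSON-string records: "income" parses as a Python int in one
    # and as a Python float in the other.
    df = spark.createDataFrame(
        [('{"income": 100}',), ('{"income": 100.0}',)],
        ["raw"],
    )

    out_schema = StructType([StructField("income", FloatType())])

    # The UDF just parses the JSON string; no int -> float conversion.
    parse = udf(lambda s: json.loads(s), out_schema)

    df.select(parse("raw").alias("parsed")).show()
    # What we see on PySpark 2.3:
    #   the record where income was the int 100   -> parsed.income is null
    #   the record where income was the float 100.0 -> parsed.income is 100.0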
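
And this is roughly the custom patch we are currently applying inside the Python function before returning from the UDF. It blindly turns every int in the nested structure into a float, which works for our data but is exactly the kind of error-prone logic we would like to avoid:

    def floatify(value):
        # bool is a subclass of int in Python, so leave booleans alone
        if isinstance(value, bool):
            return value
        # convert plain ints to floats so FloatType fields do not come back null
        if isinstance(value, int):
            return float(value)
        # recurse into objects and arrays
        if isinstance(value, dict):
            return {k: floatify(v) for k, v in value.items()}
        if isinstance(value, list):
            return [floatify(v) for v in value]
        return value

    # inside the UDF: return floatify(json.loads(s)) instead of json.loads(s)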