[
https://issues.apache.org/jira/browse/KUDU-2371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Fengling Wang updated KUDU-2371:
--------------------------------
Fix Version/s: (was: 1.6.0)
1.8.0
> Allow Kudu-Spark upsert API to ignore NULL column values
> --------------------------------------------------------
>
> Key: KUDU-2371
> URL: https://issues.apache.org/jira/browse/KUDU-2371
> Project: Kudu
> Issue Type: Improvement
> Components: spark
> Affects Versions: 1.6.0
> Reporter: Fengling Wang
> Assignee: Fengling Wang
> Priority: Major
> Fix For: 1.8.0
>
>
> We've seen cases where users use Spark streaming to process JSON and upsert
> the results into Kudu. The JSON input may contain rows in which not all
> fields are specified. In that case, Spark sets the missing column values to
> NULL, and when those records are upserted, existing row values are replaced
> by NULL. This is correct Kudu-Spark behavior, but not what users want to
> see.
>
> {noformat}
> // This is the original Kudu table.
> scala> df.printSchema
> root
> |-- key: long (nullable = false)
> |-- int_val: integer (nullable = true)
> |-- string_val: string (nullable = true)
> scala> df.show()
> +---+-------+----------+
> |key|int_val|string_val|
> +---+-------+----------+
> |123| 200| foo|
> +---+-------+----------+
>
> // Put JSON string into dataframe with matching schema.
> scala> val json_str_with_partial_columns = "{\"key\" : 123, \"int_val\" : 1}"
> scala> val json_rdd = sc.parallelize(Seq(json_str_with_partial_columns))
> scala> val df_from_json = sqlContext.read.schema(df.schema).json(json_rdd)
> scala> df_from_json.show()
> +---+-------+----------+
> |key|int_val|string_val|
> +---+-------+----------+
> |123|      1|      null|
> +---+-------+----------+
> scala> kuduContext.upsertRows(df_from_json, "kudu_table")
> // Below is the actual result.
> scala> df.show()
> +---+-------+----------+
> |key|int_val|string_val|
> +---+-------+----------+
> |123| 1| null|
> +---+-------+----------+
> // Below is the desired result.
> scala> df.show()
> +---+-------+----------+
> |key|int_val|string_val|
> +---+-------+----------+
> |123| 1| foo|
> +---+-------+----------+{noformat}
>
> To avoid this situation, it's suggested to add an extra flag/option to the
> upsertRows() API that treats NULL as unset/omitted. With this flag set,
> unspecified column values will stay unchanged in the Kudu table.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)