[ 
https://issues.apache.org/jira/browse/KUDU-2371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengling Wang updated KUDU-2371:
--------------------------------
    Fix Version/s:     (was: 1.6.0)
                   1.8.0

> Allow Kudu-Spark upsert API to ignore NULL column values
> --------------------------------------------------------
>
>                 Key: KUDU-2371
>                 URL: https://issues.apache.org/jira/browse/KUDU-2371
>             Project: Kudu
>          Issue Type: Improvement
>          Components: spark
>    Affects Versions: 1.6.0
>            Reporter: Fengling Wang
>            Assignee: Fengling Wang
>            Priority: Major
>             Fix For: 1.8.0
>
>
> We've seen cases where users use Spark Streaming to process JSON and upsert 
> the results into Kudu. The JSON input may contain rows in which not all 
> fields are specified. In this case, Spark sets the missing column values to 
> NULL, so when those records are upserted, some existing row values are 
> replaced by NULL. This is correct Kudu-Spark behavior, but not what users 
> expect to see.
>  
> {noformat}
> // This is the original Kudu table.
> scala> df.printSchema
> root
>  |-- key: long (nullable = false)
>  |-- int_val: integer (nullable = true)
>  |-- string_val: string (nullable = true)
> scala> df.show()
> +---+-------+----------+
> |key|int_val|string_val|
> +---+-------+----------+
> |123|    200|       foo|
> +---+-------+----------+
>  
> // Put JSON string into dataframe with matching schema.
> scala> val json_str_with_partial_columns = "{\"key\" : 123, \"int_val\" : 1}"
> scala> val json_rdd = sc.parallelize(Seq(json_str_with_partial_columns))
> scala> val df_from_json = sqlContext.read.schema(df.schema).json(json_rdd)
> scala> df_from_json.show()
> +---+-------+----------+
> |key|int_val|string_val|
> +---+-------+----------+
> |123|      1|      null|
> +---+-------+----------+
> scala> kuduContext.upsertRows(df_from_json, "kudu_table")
> // Below is the actual result.
> scala> df.show()
> +---+-------+----------+
> |key|int_val|string_val|
> +---+-------+----------+
> |123|      1|      null|
> +---+-------+----------+
> // Below is the desired result.
> scala> df.show()
> +---+-------+----------+
> |key|int_val|string_val|
> +---+-------+----------+
> |123|      1|       foo|
> +---+-------+----------+{noformat}
>   
> To avoid this situation, it is suggested to add an extra flag/option to the 
> upsertRows() API that allows treating NULL as unset/omitted. With this 
> option set, unspecified column values would remain unchanged in the Kudu 
> table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
