Sounds very good. Is there a JIRA for this? Would be cool to have it in
1.4, because currently you cannot use the dataframe.describe function with
NaN values and have to filter all the columns manually.
Thanks,
Peter Rudenko
On 2015-04-02 21:18, Reynold Xin wrote:
Incidentally, we were discussing this yesterday. Here are some
thoughts on null handling in SQL/DataFrames. Would be great to get
some feedback.
1. Treat floating point NaN and null as the same "null" value. This
would be consistent with most SQL databases, and with Pandas. It would
also require some inbound conversion (sketched after this list).
2. Internally, when we see a NaN value, we should mark the null bit as
true, and keep the NaN value. When we see a null value for a floating
point field, we should mark the null bit as true, and update the field
to store NaN.
3. Externally, for floating point values, return NaN when the value is
null.
4. For all other types, return null for null values.
5. For UDFs, if the argument is a primitive type only (i.e. it cannot
represent null) and not a floating point field, simply evaluate the
expression to null. This is consistent with most SQL UDFs and most
programming languages' treatment of NaN.
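To make (1) concrete, here is roughly what a user would observe if NaN
and null were unified. This is the proposed behavior, not what current
releases do, and the column name "x" is just for illustration:

    // Proposed semantics (not current behavior): a NaN sets the null bit,
    // so aggregates skip it exactly as they skip null.
    val df = sc.parallelize(Seq(1.0, Double.NaN, 3.0)).toDF("x")
    df.agg(avg("x")).first            // would yield [2.0]; the NaN is ignored
    df.filter(df("x").isNull).count() // would yield 1, counting the NaN row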
Any thoughts on this semantics?
On Thu, Apr 2, 2015 at 5:51 AM, Dean Wampler <deanwamp...@gmail.com> wrote:
I'm afraid you're a little stuck. In Scala, the types Int, Long, Float,
Double, Byte, and Boolean look like reference types in source code, but
they are compiled to the corresponding JVM primitive types, which can't
be null. That's why you get the warning about ==.
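One way to at least make the null check compile is to declare the UDF
parameter as the boxed java.lang.Double rather than the primitive Double.
Whether the runtime then hands the function a real null rather than
coercing it to 0.0 first is worth verifying, so treat this as a sketch
(meanValue is assumed to be computed elsewhere):

    // Sketch: a boxed java.lang.Double parameter can legally be
    // compared to null, unlike the primitive Double.
    // Assumes import org.apache.spark.sql.functions._ and
    // org.apache.spark.sql.types.DoubleType, as in your example.
    val filled = df.withColumn("d2",
      callUDF({ (value: java.lang.Double) =>
        if (value == null) meanValue else value.doubleValue
      }, DoubleType, df("d")))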
It might be that your best choice is to use NaN as the placeholder for
null, then create one DF using a filter that removes those values. Use
that DF to compute the mean. Then apply a map step to the original DF to
translate the NaNs to the mean, along the lines of the sketch below.
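In code, something along these lines, reusing the callUDF style from
your example (the column name "d" is assumed):

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types.{BooleanType, DoubleType}

    // 1. Drop the NaN placeholders and compute the mean over the rest.
    val noNaN = df.filter(callUDF({ (v: Double) => !v.isNaN }, BooleanType, df("d")))
    val mean  = noNaN.agg(avg("d")).first.getDouble(0)

    // 2. Translate NaN back to the mean in the original DF.
    val imputed = df.withColumn("d",
      callUDF({ (v: Double) => if (v.isNaN) mean else v }, DoubleType, df("d")))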
dean
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com
On Thu, Apr 2, 2015 at 7:54 AM, Peter Rudenko <petro.rude...@gmail.com> wrote:
> Hi, I need to implement MeanImputor - impute missing values with the
> mean. If I set missing values to null, then dataframe aggregation works
> properly, but a UDF treats null values as 0.0. Here's an example:
>
> |val df = sc.parallelize(Array(1.0, 2.0, null, 3.0, 5.0, null)).toDF
> df.agg(avg("_1")).first // res45: org.apache.spark.sql.Row = [2.75]
> df.withColumn("d2", callUDF({(value: Double) => value}, DoubleType, df("d"))).show()
> //    d   d2
> //  1.0  1.0
> //  2.0  2.0
> // null  0.0
> //  3.0  3.0
> //  5.0  5.0
> // null  0.0
> val df = sc.parallelize(Array(1.0, 2.0, Double.NaN, 3.0, 5.0, Double.NaN)).toDF
> df.agg(avg("_1")).first // res46: org.apache.spark.sql.Row = [Double.NaN]|
>
> In a UDF I cannot compare Scala's Double to null:
>
> |comparing values of types Double and Null using `==' will always yield false
> [warn] if (value == null) meanValue else value|
>
> With Double.NaN instead of null I can compare in the UDF, but
> aggregation doesn't work properly. Maybe it's related to:
> https://issues.apache.org/jira/browse/SPARK-6573
>
> Thanks,
> Peter Rudenko
>