Dear Michael, dear all,
motivation:
object OtherEntities {
  case class Record(x: Int, a: String)
  case class Mapping(x: Int, y: Int)
  val records = Seq(Record(1, "hello"), Record(2, "bob"))
  val mappings = Seq(Mapping(2, 5))
}
Now I want to perform a *left outer join* on records and mappings
(with the join criterion recordDF("x") === mappingDF("x")); my
shorthand for this is
*leftOuterJoinWithRemovalOfEqualColumn*:
val sqlContext = new SQLContext(sc)
// used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
val recordDF = sc.parallelize(OtherEntities.records, 4).toDF()
val mappingDF = sc.parallelize(OtherEntities.mappings, 4).toDF()
val joinedDF = recordDF.leftOuterJoinWithRemovalOfEqualColumn(mappingDF, "x")
joinedDF.filter(joinedDF("y").isNotNull).show
Currently, the output is
+-+-----+----+
|x| a| y|
+-+-----+----+
|1|hello|null|
|2| bob| 5|
+-+-----+----+
instead of
+-+---+-+
|x| a|y|
+-+---+-+
|2|bob|5|
+-+---+-+
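To make the intended semantics explicit, here is the same left outer join plus "y is not null" filter written on the plain Scala collections (no Spark involved; an Option[Int] stands in for the nullable column y):

```scala
// The same data as in OtherEntities above.
case class Record(x: Int, a: String)
case class Mapping(x: Int, y: Int)

val records  = Seq(Record(1, "hello"), Record(2, "bob"))
val mappings = Seq(Mapping(2, 5))

// Left outer join on x: every record is kept; y becomes Some(...) on a
// match and None otherwise.
val byX    = mappings.map(m => m.x -> m.y).toMap
val joined = records.map(r => (r.x, r.a, byX.get(r.x)))

// The "y is not null" filter should then keep only the matched row,
// i.e. (2, "bob", Some(5)) -- which is exactly the second output above.
val nonNull = joined.filter(_._3.isDefined)
```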
The last output can be achieved by changing nullable=false to
nullable=true as described in my first post.
*Thus, I need this schema modification to make outer joins work.*
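For completeness, the nullable flip itself is just a map over the schema's fields. Modeled with a plain case class standing in for Spark's StructField (so the snippet runs without Spark; the real helper is in my first post below):

```scala
// Stand-in for org.apache.spark.sql.types.StructField: (name, type, nullable).
case class Field(name: String, dataType: String, nullable: Boolean)

// Flip the nullable flag of one column, leaving all other fields untouched.
def setNullable(schema: Seq[Field], cn: String, nullable: Boolean): Seq[Field] =
  schema.map {
    case f if f.name == cn => f.copy(nullable = nullable)
    case f                 => f
  }

// The schema from the example: x and y are non-nullable.
val schema = Seq(
  Field("x", "integer", nullable = false),
  Field("a", "string",  nullable = true),
  Field("y", "integer", nullable = false)
)

// After the fix, y is nullable, so the outer-join nulls behave as expected.
val fixed = setNullable(schema, "y", nullable = true)
```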
Cheers and thanks,
Martin
2015-07-30 20:23 GMT+02:00 Michael Armbrust <[email protected]>:
> We don't yet updated nullability information based on predicates as we
> don't actually leverage this information in many places yet. Why do you
> want to update the schema?
>
> On Thu, Jul 30, 2015 at 11:19 AM, martinibus77 <
> [email protected]> wrote:
>
>> Hi all,
>>
>> 1. *Columns in dataframes can be nullable and not nullable. Having a
>> nullable column of Doubles, I can use the following Scala code to filter
>> all
>> "non-null" rows:*
>>
>> val df = ..... // some code that creates a DataFrame
>> df.filter(df("columnname").isNotNull)
>>
>> +-+-----+----+
>> |x| a| y|
>> +-+-----+----+
>> |1|hello|null|
>> |2| bob| 5|
>> +-+-----+----+
>>
>> root
>> |-- x: integer (nullable = false)
>> |-- a: string (nullable = true)
>> |-- y: integer (nullable = true)
>>
>> And with the filter expression applied:
>> +-+---+-+
>> |x| a|y|
>> +-+---+-+
>> |2|bob|5|
>> +-+---+-+
>>
>>
>> Unfortunately, while this is true for a nullable column (according to
>> df.printSchema), it is not true for a column that is not nullable:
>>
>>
>> +-+-----+----+
>> |x| a| y|
>> +-+-----+----+
>> |1|hello|null|
>> |2| bob| 5|
>> +-+-----+----+
>>
>> root
>> |-- x: integer (nullable = false)
>> |-- a: string (nullable = true)
>> |-- y: integer (nullable = false)
>>
>> +-+-----+----+
>> |x| a| y|
>> +-+-----+----+
>> |1|hello|null|
>> |2| bob| 5|
>> +-+-----+----+
>>
>> such that the output is not affected by the filter. Is this intended?
>>
>>
>> 2. *What is the cheapest way (in terms of performance) to turn a
>> non-nullable column into a nullable column?
>> I came up with this:*
>>
>> import org.apache.spark.sql.DataFrame
>> import org.apache.spark.sql.types.{StructField, StructType}
>>
>> /**
>>  * Set whether a column is nullable.
>>  * @param df source DataFrame
>>  * @param cn the column name to change
>>  * @param nullable the flag to set, such that the column is either
>>  *                 nullable or not
>>  */
>> def setNullableStateOfColumn(df: DataFrame, cn: String,
>>     nullable: Boolean): DataFrame = {
>>   val schema = df.schema
>>   val newSchema = StructType(schema.map {
>>     case StructField(c, t, _, m) if c.equals(cn) =>
>>       StructField(c, t, nullable = nullable, m)
>>     case y: StructField => y
>>   })
>>   df.sqlContext.createDataFrame(df.rdd, newSchema)
>> }
>>
>> Is there a cheaper solution?
>>
>> 3. *Any comments?*
>>
>> Cheers and thx in advance,
>>
>> Martin
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-DataFrame-Nullable-column-and-filtering-tp24087.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>