I am not sure if the long hours are getting to me, but I am seeing some unexpected behavior in Spark 2.2.0.
I have created a toy example as below:

    toy_df = spark.createDataFrame(
        [['p1', 'a'], ['p1', 'b'], ['p1', 'c'],
         ['p2', 'a'], ['p2', 'b'], ['p2', 'd']],
        schema=['patient', 'drug'])

I create another dataframe:

    mdf = toy_df.filter(toy_df.drug == 'c')

As you know, mdf would be:

    mdf.show()
    +-------+----+
    |patient|drug|
    +-------+----+
    |     p1|   c|
    +-------+----+

Now, if I do this:

    toy_df.join(mdf, ["patient"], "left").select(
        toy_df.patient.alias("P1"), toy_df.drug.alias("D1"),
        mdf.patient, mdf.drug).show()

surprisingly I get:

    +---+---+-------+----+
    | P1| D1|patient|drug|
    +---+---+-------+----+
    | p2|  a|     p2|   a|
    | p2|  b|     p2|   b|
    | p2|  d|     p2|   d|
    | p1|  a|     p1|   a|
    | p1|  b|     p1|   b|
    | p1|  c|     p1|   c|
    +---+---+-------+----+

But if I use:

    toy_df.join(mdf, ["patient"], "left").show()

I do see the expected behavior:

    +-------+----+----+
    |patient|drug|drug|
    +-------+----+----+
    |     p2|   a|null|
    |     p2|   b|null|
    |     p2|   d|null|
    |     p1|   a|   c|
    |     p1|   b|   c|
    |     p1|   c|   c|
    +-------+----+----+

And if I use an alias expression on one of the dataframes, I also get the expected behavior:

    toy_df.join(mdf.alias('D'), on=["patient"], how="left").select(
        toy_df.patient.alias("P1"), toy_df.drug.alias("D1"), 'D.drug').show()
    +---+---+----+
    | P1| D1|drug|
    +---+---+----+
    | p2|  a|null|
    | p2|  b|null|
    | p2|  d|null|
    | p1|  a|   c|
    | p1|  b|   c|
    | p1|  c|   c|
    +---+---+----+

So my question is: what is the best way to select columns after a join, and is this behavior normal?