I am not sure if the long hours are getting to me, but I am seeing some unexpected behavior in Spark 2.2.0.
I have created a toy example as below:

    toy_df = spark.createDataFrame(
        [['p1', 'a'], ['p1', 'b'], ['p1', 'c'],
         ['p2', 'a'], ['p2', 'b'], ['p2', 'd']],
        schema=['patient', 'drug'])

I create another dataframe:

    mdf = toy_df.filter(toy_df.drug == 'c')

As you know, mdf would be:

    mdf.show()
    +-------+----+
    |patient|drug|
    +-------+----+
    |     p1|   c|
    +-------+----+

Now, if I do this:

    toy_df.join(mdf, ["patient"], "left").select(
        toy_df.patient.alias("P1"), toy_df.drug.alias("D1"),
        mdf.patient, mdf.drug).show()

surprisingly I get:

    +---+---+-------+----+
    | P1| D1|patient|drug|
    +---+---+-------+----+
    | p2|  a|     p2|   a|
    | p2|  b|     p2|   b|
    | p2|  d|     p2|   d|
    | p1|  a|     p1|   a|
    | p1|  b|     p1|   b|
    | p1|  c|     p1|   c|
    +---+---+-------+----+

But if I use:

    toy_df.join(mdf, ["patient"], "left").show()

I do see the expected behavior:

    +-------+----+----+
    |patient|drug|drug|
    +-------+----+----+
    |     p2|   a|null|
    |     p2|   b|null|
    |     p2|   d|null|
    |     p1|   a|   c|
    |     p1|   b|   c|
    |     p1|   c|   c|
    +-------+----+----+

And if I use an alias expression on one of the dataframes, I also get the expected behavior:

    toy_df.join(mdf.alias('D'), on=["patient"], how="left").select(
        toy_df.patient.alias("P1"), toy_df.drug.alias("D1"), 'D.drug').show()
    +---+---+----+
    | P1| D1|drug|
    +---+---+----+
    | p2|  a|null|
    | p2|  b|null|
    | p2|  d|null|
    | p1|  a|   c|
    | p1|  b|   c|
    | p1|  c|   c|
    +---+---+----+

So my question is: what is the best way to select columns after a join, and is this behavior normal?