python version: 2.7.9
os: ubuntu 14.04
spark: 1.5.2

```
import pyspark
from pyspark.sql import Row
from pyspark.sql.types import StructType, IntegerType

sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)

schema1 = StructType() \
    .add('a', IntegerType()) \
    .add('b', IntegerType())

schema2 = StructType() \
    .add('b', IntegerType()) \
    .add('a', IntegerType())

print(schema1 == schema2)

r1 = Row(a=1, b=2)
r2 = Row(b=2, a=1)
print(r1 == r2)

data = [r1, r2]

df1 = sqlc.createDataFrame(data, schema1)
df1.show()

df2 = sqlc.createDataFrame(data, schema2)
df2.show()
```

Intuitively, I thought df1 and df2 should contain the same data; however, the output is:

```
False
True
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  1|  2|
+---+---+

+---+---+
|  b|  a|
+---+---+
|  1|  2|
|  1|  2|
+---+---+
```

After tracing the source code, I found:

1. The schema (StructType) uses a list to store its fields, so it is order-sensitive:
   https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L459
2. Row sorts its fields by field name at construction time:
   https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1204
3. It seems (though I am not 100% sure) that createDataFrame accesses the fields of each Row by position, not by field name:
   https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L422

So with schema2, the values of r2 (already sorted to (1, 2) for the fields (a, b)) are paired positionally with the columns (b, a), and end up mislabeled.

This behavior is a little tricky. Maybe it could be mentioned in the documentation?

Thanks.

--
張雅軒
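The interaction of points 2 and 3 above can be reproduced without Spark. Below is a minimal plain-Python sketch of the suspected mechanism; `make_row` and `label_by_position` are illustrative names of my own, not Spark APIs:

```python
def make_row(**kwargs):
    # Mimics pyspark.sql.Row: field names are sorted alphabetically,
    # and the values are stored in that sorted order.
    names = sorted(kwargs)
    return tuple(kwargs[n] for n in names)

def label_by_position(schema_names, row_values):
    # Mimics the suspected createDataFrame behavior: pair schema
    # columns with row values by position, ignoring the row's own
    # field names.
    return dict(zip(schema_names, row_values))

values = make_row(b=2, a=1)                   # sorted to fields (a, b) -> (1, 2)
print(label_by_position(['a', 'b'], values))  # {'a': 1, 'b': 2} -- as intended
print(label_by_position(['b', 'a'], values))  # {'b': 1, 'a': 2} -- mislabeled
```

With the schema in alphabetical order the positional pairing happens to be right; with any other order the values land under the wrong column names, which matches the df2 output above.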