python version: 2.7.9
os: ubuntu 14.04
spark: 1.5.2
```
import pyspark
from pyspark.sql import Row
from pyspark.sql.types import StructType, IntegerType
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
schema1 = StructType() \
    .add('a', IntegerType()) \
    .add('b', IntegerType())
schema2 = StructType() \
    .add('b', IntegerType()) \
    .add('a', IntegerType())
print(schema1 == schema2)
r1 = Row(a=1, b=2)
r2 = Row(b=2, a=1)
print(r1 == r2)
data = [r1, r2]
df1 = sqlc.createDataFrame(data, schema1)
df1.show()
df2 = sqlc.createDataFrame(data, schema2)
df2.show()
```
Intuitively, I thought df1 and df2 should contain the same data. However,
the output is:
```
False
True
+---+---+
| a| b|
+---+---+
| 1| 2|
| 1| 2|
+---+---+
+---+---+
| b| a|
+---+---+
| 1| 2|
| 1| 2|
+---+---+
```
After tracing the source code, I found:
1. The schema (StructType) uses a list to store its fields, so it is order-sensitive:
https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L459
2. Row sorts its fields by field name at construction time:
https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1204
3. It seems (not 100% sure) that createDataFrame accesses the fields
of a Row by position, not by field name:
https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L422
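Taken together, these three points explain the mismatch. A minimal sketch in plain Python (a simplified model, not the real pyspark classes) of how sort-at-construction plus positional matching produces the swapped values:

```python
# Simplified model of the interaction (NOT the real pyspark classes):
# a Row sorts its keyword arguments by field name at construction,
# while the schema keeps its fields in declaration order, and the
# values are then matched up positionally.

def make_row(**kwargs):
    # mimic pyspark's Row: fields sorted alphabetically by name
    return tuple(kwargs[name] for name in sorted(kwargs))

schema2_fields = ['b', 'a']      # declaration order, like schema2 above
row = make_row(b=2, a=1)         # stored sorted as (a=1, b=2) -> (1, 2)

# positional matching, like createDataFrame: 'b' gets 1, 'a' gets 2
print(dict(zip(schema2_fields, row)))   # {'b': 1, 'a': 2}
```

With schema1 ('a', 'b') the sorted Row happens to line up, which is why df1 looks right while df2 has its values swapped.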
This behavior is a little bit tricky. Maybe it could be mentioned
in the documentation?
Thanks.
--
-- 張雅軒