python version: 2.7.9
os: ubuntu 14.04
spark: 1.5.2

```
import pyspark
from pyspark.sql import Row
from pyspark.sql.types import StructType, IntegerType

sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)

schema1 = StructType() \
    .add('a', IntegerType()) \
    .add('b', IntegerType())

schema2 = StructType() \
    .add('b', IntegerType()) \
    .add('a', IntegerType())

print(schema1 == schema2)

r1 = Row(a=1, b=2)
r2 = Row(b=2, a=1)
print(r1 == r2)

data = [r1, r2]

df1 = sqlc.createDataFrame(data, schema1)
df1.show()

df2 = sqlc.createDataFrame(data, schema2)
df2.show()
```

Intuitively, I thought df1 and df2 should contain the same data; however, the output is:

```
False
True
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  1|  2|
+---+---+

+---+---+
|  b|  a|
+---+---+
|  1|  2|
|  1|  2|
+---+---+
```

After tracing the source code, I found:

1. The schema (StructType) uses a list to store its fields, so it is order-sensitive:
   https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L459
2. Row sorts its fields by field name at construction time:
   https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1204
3. It seems (though I am not 100% sure) that createDataFrame accesses the fields of each Row by position, not by field name:
   https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L422

So with schema2, the values of r2 (already sorted to (1, 2) for the fields (a, b)) are paired positionally with the columns (b, a), and end up mislabeled.

This behavior is a little tricky. Maybe it could be mentioned in the documentation?

Thanks.

--
張雅軒
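The interaction of points 2 and 3 above can be reproduced without Spark. Below is a minimal plain-Python sketch of the suspected mechanism; `make_row` and `label_by_position` are illustrative names of my own, not Spark APIs:

```python
def make_row(**kwargs):
    # Mimics pyspark.sql.Row: field names are sorted alphabetically,
    # and the values are stored in that sorted order.
    names = sorted(kwargs)
    return tuple(kwargs[n] for n in names)

def label_by_position(schema_names, row_values):
    # Mimics the suspected createDataFrame behavior: pair schema
    # columns with row values by position, ignoring the row's own
    # field names.
    return dict(zip(schema_names, row_values))

values = make_row(b=2, a=1)                   # sorted to fields (a, b) -> (1, 2)
print(label_by_position(['a', 'b'], values))  # {'a': 1, 'b': 2} -- as intended
print(label_by_position(['b', 'a'], values))  # {'b': 1, 'a': 2} -- mislabeled
```

With the schema in alphabetical order the positional pairing happens to be right; with any other order the values land under the wrong column names, which matches the df2 output above.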