Is there some way around this? For example, can Row just be an implementation of namedtuple throughout?
from collections import namedtuple

class Row(namedtuple):
    ...

From a user perspective, it's confusing that there are 2 different
implementations of the Row class with the same name.

In my case, I was writing a method to recursively convert a Row to a dict
(since a Row can contain other Rows). I couldn't directly check
type(obj) == pyspark.sql.types.Row, so I ended up having to do it like this:

def row_to_dict(obj):
    """
    Take a PySpark Row and convert it, and any of its nested Row objects,
    into Python dictionaries.
    """
    if isinstance(obj, list):
        return [row_to_dict(x) for x in obj]
    else:
        try:
            # We can't reliably check that this is a row object
            # due to some weird bug.
            d = obj.asDict()
            return {k: row_to_dict(v) for k, v in d.iteritems()}
        except:
            return obj

That comment about a "weird bug" was my initial reaction, though now I
understand that we have 2 implementations of Row. (A duck-typing alternative
to that bare except is sketched at the end of this message.)

Isn't this worth fixing? It's just going to confuse people, IMO.

Nick

On Tue, May 12, 2015 at 10:22 PM Davies Liu <dav...@databricks.com> wrote:

> The class (called Row) for rows from Spark SQL is created on the fly and is
> different from pyspark.sql.Row (which is a public API for users to create
> Rows).
>
> The reason we did it this way is that we want better performance when
> accessing the columns. Basically, the rows are just named tuples (called
> `Row`).
>
> --
> Davies Liu
> Sent with Sparrow <http://www.sparrowmailapp.com/?sig>
>
> On Tuesday, May 12, 2015, at 4:49 AM, Nicholas Chammas wrote:
>
> This is really strange.
>
> # Spark 1.3.1
> print type(results)
>
> <class 'pyspark.sql.dataframe.DataFrame'>
>
> a = results.take(1)[0]
>
> print type(a)
>
> <class 'pyspark.sql.types.Row'>
>
> print pyspark.sql.types.Row
>
> <class 'pyspark.sql.types.Row'>
>
> print type(a) == pyspark.sql.types.Row
>
> False
>
> print isinstance(a, pyspark.sql.types.Row)
>
> False
>
> If I set a as follows, then the type checks pass fine.
>
> a = pyspark.sql.types.Row('name')('Nick')
>
> Is this a bug? What can I do to narrow down the source?
>
> results is a massive DataFrame of spark-perf results.
>
> Nick
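
P.S. For what it's worth, here is a rough sketch of the duck-typing check I
may switch to instead of the bare except. It assumes that both flavors of
Row (the public pyspark.sql.Row and the classes Spark SQL generates on the
fly) are tuple subclasses that expose asDict(), which matches what I'm
seeing in practice, though I haven't verified that against the Spark source:

def looks_like_row(obj):
    # Duck-typing check: rather than comparing against an exact class
    # (which fails because the SQL Row classes are generated on the fly),
    # accept any tuple that exposes an asDict() method.
    return isinstance(obj, tuple) and hasattr(obj, 'asDict')

def row_to_dict(obj):
    if isinstance(obj, list):
        return [row_to_dict(x) for x in obj]
    elif looks_like_row(obj):
        return {k: row_to_dict(v) for k, v in obj.asDict().iteritems()}
    else:
        return obj

This avoids swallowing unrelated exceptions the way the bare except does,
at the cost of relying on an undocumented detail of how the rows are built.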