add it with a note in the docstring
>>
>> I am not sure about this because force evaluation could be something that
>> has side effects. For example, df.count() can realize a cache, and if we
>> implement __len__ to call df.count(), then len(df) would end up populating
>
That all sounds reasonable, but I think in the case of 4, and maybe also 3, I
would rather see it implemented to raise an error whose message explains
what’s going on and suggests the explicit operation that would do the most
equivalent thing. And perhaps raise a warning (using the warnings module)
for
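
To sketch what that could look like (the class and method names here are
hypothetical, not an existing API):

    import warnings

    class DistributedFrame:
        # Hypothetical wrapper around a Spark DataFrame (sdf).
        def __init__(self, sdf):
            self._sdf = sdf

        def __len__(self):
            # Refuse to silently force a full evaluation; point the user
            # at the explicit operation that does the equivalent thing.
            raise TypeError(
                "len() would force evaluation of the whole distributed "
                "frame; call .count() explicitly if that is what you want."
            )

        def count(self):
            # Explicit opt-in: warn that this triggers a Spark job.
            warnings.warn("count() triggers a Spark job", UserWarning)
            return self._sdf.count()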
I agree with Reynold, at some point you’re going to run into the parts of
the pandas API that aren’t distributable. More feature parity will be good,
but users are still eventually going to hit a feature cliff. Moreover, it’s
not just the pandas API that people want to use, but also the set of
libr
Hey there,
Here’s something I proposed recently that’s in this space.
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-24258
It’s motivated by working with a user who wanted to do some custom
statistics for which they could write the numpy code, and knew in what
dimensions they c
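
For reference, the sort of thing that’s possible today with a grouped-map
pandas UDF (Spark 2.3+); the schema, column names, and the df variable are
illustrative, not taken from the proposal:

    import numpy as np
    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf("group string, p90 double", PandasUDFType.GROUPED_MAP)
    def p90(pdf):
        # Arbitrary numpy code over one group's values.
        return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                             "p90": [np.percentile(pdf["value"].values, 90)]})

    result = df.groupby("group").apply(p90)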
I’m with you on JSON being more readable than Parquet, but we’ve had
success using pyarrow’s parquet reader and have been quite happy with it so
far. If your target is Python (and probably if not now, then soon, R), you
should look into it.
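
For example (the file name is just a placeholder):

    import pyarrow.parquet as pq

    table = pq.read_table("data.parquet")  # -> pyarrow.Table
    df = table.to_pandas()                 # to a pandas DataFrame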
On Mon, May 21, 2018 at 16:52 Joseph Bradley wrote:
>
I filed an SPIP for this at
https://issues.apache.org/jira/browse/SPARK-24258. Let’s discuss!
On Wed, Apr 18, 2018 at 23:33 Leif Walsh wrote:
> I agree we should reuse as much as possible. For PySpark, I think the
> obvious choices already made of Breeze and numpy arrays make a lot of
> …ate effort from expanding linear algebra primitives.
> * It would be valuable to discuss external types as UDTs (which can be
> hacked with numpy and scipy types now) vs. adding linear algebra types to
> native Spark SQL.
>
>
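
For context, this is roughly what the UDT route looks like today with the
existing pyspark.ml types (a minimal sketch; the data is made up):

    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    # VectorUDT is applied implicitly when a column holds ml Vectors.
    df = spark.createDataFrame(
        [(1, Vectors.dense([1.0, 2.0])),
         (2, Vectors.sparse(2, [0], [3.0]))],
        ["id", "features"])
    df.printSchema()  # features: vector (nullable = true)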
> On Wed, Apr 11, 2018 at 7:53 PM, Leif Walsh wrote:
Hi all,
I’ve been playing around with the Vector and Matrix UDTs in pyspark.ml and
I’ve found myself wanting more.
There is a minor issue in that, with Arrow serialization enabled, these
types don’t serialize properly in Python UDF calls or in toPandas(). There’s
a natural representation for the
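
A minimal repro sketch of the serialization issue (assuming the Spark 2.3
config name for enabling Arrow):

    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    df = spark.createDataFrame([(Vectors.dense([1.0, 2.0]),)], ["v"])
    # Arrow has no mapping for VectorUDT, so this either errors or falls
    # back to the non-Arrow path, depending on the Spark version.
    pdf = df.toPandas()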