I filed an SPIP for this at https://issues.apache.org/jira/browse/SPARK-24258. Let’s discuss!
On Wed, Apr 18, 2018 at 23:33 Leif Walsh <leif.wa...@gmail.com> wrote:

> I agree we should reuse as much as possible. For PySpark, the obvious
> choices already made (Breeze and numpy arrays) make a lot of sense; I’m
> not sure about the other language bindings and would defer to others.
>
> I was under the impression that UDTs were gone and (probably?) not coming
> back. Did I miss something, and are they actually going to be better
> supported in the future? I think your second point (about separating the
> expansion of the primitives from the expansion of SQL support) is only
> really true if we’re getting UDTs back.
>
> You’ve obviously seen more of the history here than I have. Do you have a
> sense of why the efforts you mentioned never went anywhere? I don’t think
> this is strictly about “mllib local”; it’s more about generic linalg, so
> SPARK-19653 feels like the closest to what I’m after, but it looks to me
> like that one just fizzled out rather than producing a real back and
> forth.
>
> Does this just need something like a persistent product manager to scope
> out the effort, champion it, and push it forward?
>
> On Wed, Apr 18, 2018 at 20:02 Joseph Bradley <jos...@databricks.com>
> wrote:
>
>> Thanks for the thoughts! We've gone back and forth quite a bit about
>> local linear algebra support in Spark. For reference, there have been
>> some discussions here:
>> https://issues.apache.org/jira/browse/SPARK-6442
>> https://issues.apache.org/jira/browse/SPARK-16365
>> https://issues.apache.org/jira/browse/SPARK-19653
>>
>> Overall, I like the idea of improving linear algebra support, especially
>> given the rise of Python numerical processing and deep learning. But
>> some considerations I'd list include:
>> * There are great linear algebra libraries out there, and it would be
>> ideal to reuse those as much as possible.
>> * SQL support for linear algebra can be a separate effort from expanding
>> linear algebra primitives.
>> * It would be valuable to discuss external types as UDTs (which can be
>> hacked together with numpy and scipy types now) vs. adding linear
>> algebra types to native Spark SQL.
>>
>> On Wed, Apr 11, 2018 at 7:53 PM, Leif Walsh <leif.wa...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I’ve been playing around with the Vector and Matrix UDTs in pyspark.ml,
>>> and I’ve found myself wanting more.
>>>
>>> There is a minor issue: with arrow serialization enabled, these types
>>> don’t serialize properly in Python UDF calls or in toPandas. There’s a
>>> natural representation for them in numpy.ndarray, and I’ve started a
>>> conversation with the arrow community about supporting tensor-valued
>>> columns, but that might be a ways out. In the meantime, I think we can
>>> fix this by using the FixedSizeBinary column type in arrow, together
>>> with some metadata describing the tensor shape (a list of dimension
>>> sizes).
>>>
>>> The larger issue, for which I intend to submit an SPIP soon, is that
>>> these types could be better supported at the API layer, regardless of
>>> serialization. In the limit, we could consider the entire numpy ndarray
>>> surface area as a target. At a minimum, I’m thinking these types should
>>> support column operations like matrix multiply, transpose, and inner
>>> and outer products, and should have a more ergonomic construction API,
>>> like df.withColumn(‘feature’, Vectors.of(‘list’, ‘of’, ‘cols’)); the
>>> VectorAssembler API is kind of clunky.
>>>
>>> One possibility here is to restrict the tensor column types such that
>>> every value must have the same shape, e.g. a 2x2 matrix. This would
>>> allow operations to check validity before execution; for example, a
>>> matrix multiply could check that dimensions match and fail fast.
>>> However, there might be use cases for a column containing
>>> variable-shape tensors; I’m open to discussion here.
>>>
>>> What do you all think?
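The FixedSizeBinary-plus-metadata idea above can be illustrated without arrow at all. The sketch below is a pure-Python stand-in, not an arrow or Spark API: it packs a fixed-shape float64 tensor into a fixed-size byte string (8 bytes per element, so the byte width is fully determined by the shape) and carries the shape separately, the way column-level metadata would.

```python
import struct
from typing import List, Tuple

def pack_tensor(values: List[float], shape: Tuple[int, ...]) -> bytes:
    """Flatten a fixed-shape float64 tensor into a fixed-size byte string.

    The byte width (8 * number of elements) is fully determined by the
    shape, which is what would let a FixedSizeBinary column hold it; the
    shape itself travels separately as column-level metadata.
    """
    n = 1
    for dim in shape:
        n *= dim
    if len(values) != n:
        raise ValueError(f"expected {n} values for shape {shape}, got {len(values)}")
    return struct.pack(f"<{n}d", *values)

def unpack_tensor(data: bytes) -> List[float]:
    """Inverse of pack_tensor: recover the flat float64 values."""
    n = len(data) // 8
    return list(struct.unpack(f"<{n}d", data))

# Round-trip a 2x2 matrix: 4 float64 values -> 32 fixed-size bytes -> back.
matrix = [1.0, 2.0, 3.0, 4.0]
blob = pack_tensor(matrix, (2, 2))
assert len(blob) == 32            # fixed size, derivable from the shape metadata
assert unpack_tensor(blob) == matrix
```

Every row of a given column produces the same byte width, which is exactly the invariant FixedSizeBinary needs; the receiving side would combine the bytes with the shape metadata to rebuild a numpy.ndarray.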
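The fail-fast benefit of fixed-shape columns can also be sketched concretely. If every value in a column is known to be, say, 2x3, a matrix-multiply operator can reject a mismatched right-hand shape before touching any data. This is a hypothetical illustration of that planning-time check, not an existing Spark API:

```python
from typing import Tuple

def matmul_output_shape(left: Tuple[int, int], right: Tuple[int, int]) -> Tuple[int, int]:
    """Validate two (rows, cols) shapes for multiplication and return the result shape.

    With fixed-shape tensor columns, a check like this could run once at
    planning time, failing fast instead of erroring per-row at execution.
    """
    (m, k1), (k2, n) = left, right
    if k1 != k2:
        raise ValueError(
            f"cannot multiply {left} by {right}: inner dimensions {k1} != {k2}"
        )
    return (m, n)

assert matmul_output_shape((2, 3), (3, 4)) == (2, 4)
try:
    matmul_output_shape((2, 3), (2, 4))
except ValueError:
    pass  # mismatch caught up front, before any per-row work
```

With variable-shape tensor columns this check would instead have to happen per value at execution time, which is the trade-off the paragraph above leaves open for discussion.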
>>> --
>>> Cheers,
>>> Leif
>>
>> --
>> Joseph Bradley
>> Software Engineer - Machine Learning
>> Databricks, Inc.
>> http://databricks.com
>
> --
> Cheers,
> Leif

--
Cheers,
Leif