Kenta Murata <m...@mrkn.jp> writes:

> Hi Jed,
>
> I'd like to describe the current status of the implementation of
> SparseTensor.  I hope the following explanation will help you.
>
> First of all, I designed the current SparseTensor format as a first,
> interim implementation.  At the time, I used scipy.sparse as a
> reference.

Excellent, thanks for your reply.

> The reason why I started with the two formats, COO and CSR, is that I
> thought they were commonly used and appropriate for a first
> implementation of SparseTensor.
>
> For 1:
> The current SparseTensor uses int64_t for its index value type.
> If the "long" you mentioned is the one in SparseTensor.fbs, that
> "long" is a Flatbuffers type, which is a 64-bit signed integer.

My bad. I thought it would correspond to C "long". That's great.

> I hope we can improve the current SparseTensor so that users can
> specify whether to use int32_t or int64_t as the index data type.
> Then we can use int64_t to handle larger matrices, as well as improve
> performance with 32-bit indices.  Moreover, we could share
> not-too-large canonicalized scipy sparse matrices without conversion.
>
> For 2:
> There are two reasons why the current COO format is limited to sorted
> indices:
>
> a) I wanted to start with a simple implementation
> b) I thought that Arrow is often used to exchange sparse matrices
>    that are canonicalized for computation
>
> I think it would be better for Arrow to support unsorted indices, and
> even better to be able to handle sparse matrices with duplicated
> entries.  Since scipy.sparse can handle such data, I think that
> supporting non-canonicalized formats is necessary for adding scipy
> integration.

That's fair. Sorting is most relevant for matrix-matrix operations
and/or parallelism. It doesn't affect the code for sequential
matrix-vector products, though it does affect cache reuse for the
vector. But someone who wants high performance for such operations
would rarely use COO.
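To illustrate that point with a minimal sketch (plain Python/NumPy of
my own, not Arrow code): a sequential CSR matvec treats each row as a
flat sum, so the within-row index order changes the access pattern
into x, but not the result (up to floating-point rounding).

    import numpy as np

    def csr_matvec(indptr, indices, data, x):
        # y[i] = sum of data[k] * x[indices[k]] for k in
        # [indptr[i], indptr[i+1]).  The inner loop is a plain sum, so
        # the order of column indices within a row does not change the
        # result; it only changes how we gather from x (cache reuse).
        y = np.zeros(len(indptr) - 1, dtype=data.dtype)
        for i in range(len(indptr) - 1):
            for k in range(indptr[i], indptr[i + 1]):
                y[i] += data[k] * x[indices[k]]
        return y

    # The 2x3 matrix [[1, 0, 2], [0, 3, 4]] with row 1 stored out of
    # order (column 2 listed before column 1):
    indptr  = np.array([0, 2, 4])
    indices = np.array([0, 2, 2, 1])
    data    = np.array([1.0, 2.0, 4.0, 3.0])
    print(csr_matvec(indptr, indices, data, np.ones(3)))  # [3. 7.]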
> I forgot to note that the CSR indices are supposed to be sorted.
> I'll issue a PR to fix that.

Great, thanks!

> If SparseTensor supports unsorted indices or duplicated values, it
> would be nice to add the ability to canonicalize them as well.
> I am busy with preparations for RubyKaigi in mid-April now, but after
> RubyKaigi I will return to improving the sparse matrix
> implementation.  Until then, I would like to help as much as I can if
> someone else is working on it.

Agreed; see the postscript below for a sketch of what such
canonicalization could look like, using scipy's behavior as a
reference.

> For 3:
> Since I have only used sparse matrices to create and process
> bag-of-words matrices in natural language processing, I have no
> experience partitioning sparse matrices for distributed computing.
> I think it would be great if Arrow could handle such a use case, so I
> would like to cooperate if there is something I can do toward
> realizing it.

I'd be happy to discuss and plan whenever you have time/interest. I
haven't seen a clear statement of Arrow's intentions in this space,
particularly in terms of what should be defined in the Arrow format
versus what metadata should be exchanged independently of Arrow.
Current parallel graph partitioning libraries (like ParMETIS and
SCOTCH) use a different approach than solver libraries (like PETSc,
Hypre, and others). I don't have an immediate need, but it would
significantly lower the bar for creative new packages if they could
say "we support parallel Arrow matrices" and immediately be callable
by the bigger packages that have many users. That said, the diversity
of auxiliary information in this space is unbounded, because we're
trying to produce robust O(N) methods for problems that are O(N^2) by
black-box methods.
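P.S. As a concrete reference for the canonicalization point above,
here is what scipy.sparse does today (a small sketch with made-up
data; the comments on index dtype are my reading of current scipy
behavior):

    import numpy as np
    from scipy.sparse import coo_matrix

    # Non-canonical COO input: unsorted entries plus a duplicate
    # at (0, 1).
    row  = np.array([1, 0, 0])
    col  = np.array([0, 1, 1])
    vals = np.array([5.0, 2.0, 3.0])
    a = coo_matrix((vals, (row, col)), shape=(2, 2))

    # scipy keeps the raw triplets until asked to canonicalize.
    # sum_duplicates() sorts entries by (row, col) and merges
    # duplicates by summation, in place.
    a.sum_duplicates()
    print(a.row, a.col, a.data)   # [0 1] [1 0] [5. 5.]

    # scipy uses 32-bit indices when the shape and nnz permit, which
    # is why a user-selectable int32/int64 index type would allow
    # zero-copy exchange in both directions.
    print(a.row.dtype)

An Arrow-level canonicalize could follow the same recipe:
lexicographically sort the index tuples, then merge duplicates by
summation.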