Hi Jed,

I'd like to describe the current status of the SparseTensor implementation. I hope the following explanation is helpful.
First of all, I designed the current SparseTensor format as an interim first implementation, using scipy.sparse as a reference. The reason I started with the two formats COO and CSR is that they are commonly used and seemed appropriate for a first implementation of SparseTensor.

For 1: The current SparseTensor uses int64_t as its index value type. If the "long" you mentioned is the one in SparseTensor.fbs, that "long" is a Flatbuffers type, which is a 64-bit signed integer. I hope we can improve SparseTensor so that users can specify whether to use int32_t or int64_t as the index data type. Then we can use int64_t to handle larger matrices, and int32_t to improve performance when sizes allow. Moreover, we could then share not-too-large canonicalized scipy sparse matrices without conversion.

For 2: There are two reasons why the current COO format is restricted to sorted indices: a) I wanted to start with a simple implementation, and b) I thought Arrow would often be used to exchange sparse matrices that are already canonicalized for computation. I agree it would be better for Arrow to support unsorted indices, and even better to be able to handle sparse matrices with duplicated entries. Since scipy.sparse can handle such data, I think supporting non-canonicalized formats is necessary for scipy integration. I also forgot to note that CSR indices are supposed to be sorted; I'll open a PR to fix that. If SparseTensor supports unsorted indices or duplicated entries, it would be nice to add the ability to canonicalize them as well.

I am busy preparing for RubyKaigi in mid-April, but after RubyKaigi I will return to improving the sparse matrix implementation. Until then, I would like to support as much as I can if someone else is working on it.
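To make points 1 and 2 concrete, here is a minimal sketch using scipy.sparse (the reference implementation mentioned above). It canonicalizes a COO matrix that has unsorted, duplicated entries, and works out the per-nonzero cost difference between int32 and int64 CSR indices. The specific matrix is just an illustration, not anything from the Arrow code.

```python
import numpy as np
import scipy.sparse as sp

# A 3x3 COO matrix with unsorted indices and a duplicated entry at (0, 0) --
# the kind of non-canonicalized input scipy.sparse accepts.
row = np.array([2, 0, 0, 1])
col = np.array([2, 0, 0, 1])
data = np.array([3.0, 1.0, 1.0, 2.0], dtype=np.float32)
m = sp.coo_matrix((data, (row, col)), shape=(3, 3))

# Canonicalize in place: duplicates are summed and the result is left
# in sorted (canonical) order.
m.sum_duplicates()
print(m.nnz)  # 3 -- the two (0, 0) entries were merged into one

# Per-nonzero CSR cost with 4-byte values: one value plus one column
# index (the row-pointer array is amortized over rows, not nonzeros).
csr = m.tocsr()
cost32 = data.itemsize + np.dtype(np.int32).itemsize  # 4 + 4 = 8 bytes
cost64 = data.itemsize + np.dtype(np.int64).itemsize  # 4 + 8 = 12 bytes
print(cost32, cost64)  # 8 12
```

Note that scipy itself already picks int32 indices when the matrix is small enough, which is the behavior I'd like SparseTensor to be able to match.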
For 3: Since I have only used sparse matrices to build and process bag-of-words matrices in natural language processing, I have no experience partitioning sparse matrices for distributed computing. I think it would be great if Arrow could handle such a use case, so I would like to cooperate if there is something I can do toward its realization.

Thanks,
Kenta Murata

On Tue, Mar 12, 2019 at 2:17 Jed Brown <j...@jedbrown.org> wrote:
>
> Thanks.  I'm new to the Arrow community so was hoping to get feedback if
> any of these are controversial or subject to constraints that I'm likely
> not familiar with.  Point 2 is likely simplest and I can start with
> that.
>
> Point 3 isn't coherent as a PR concept, but is a potential audience
> whose relation to Arrow I don't understand (could be an explicit
> non-goal for all I know).
>
> Wes McKinney <wesmck...@gmail.com> writes:
>
> > hi Jed,
> >
> > Would you like to submit a pull request to propose the changes or
> > additions you are describing?
> >
> > Thanks
> > Wes
> >
> > On Sat, Mar 9, 2019 at 11:32 PM Jed Brown <j...@jedbrown.org> wrote:
> >>
> >> Wes asked me to bring this discussion here.  I'm a developer of PETSc
> >> and, with Arrow getting into the sparse representation space, would
> >> like for it to interoperate as well as possible.
> >>
> >> 1. Please guarantee support for 64-bit offsets and indices.  The current
> >> spec uses "long", which is 32-bit on some common platforms (notably
> >> Windows).  Specifying int64_t would bring that size support to LLP64
> >> architectures.
> >>
> >> Meanwhile, there is a significant performance benefit to using 32-bit
> >> indices when sizes allow it.  If using 4-byte floats, a CSR format costs
> >> 12 bytes per nonzero with int64, versus 8 bytes per nonzero with int32.
> >> Sparse matrix operations are typically dominated by bandwidth, so this
> >> is a substantial performance impact.
> >>
> >> 2. This relates to sorting of indices for CSR.
> >> Many algorithms need to
> >> operate on upper vs. lower-triangular parts of matrices, which is much
> >> easier if the CSR column indices are sorted within rows.  Typically, one
> >> finds the diagonal (storing its location in each row, if it exists).
> >> Given that the current spec says COO entries are sorted, it would be
> >> simple to specify this also for CSR.
> >>
> >> https://github.com/apache/arrow/blob/master/format/SparseTensor.fbs
> >>
> >>
> >> 3. This is more nebulous and relates to partitioned representations of
> >> matrices, such as arise when using distributed memory or when optimizing
> >> for locality when using threads.  Some packages store "global" indices
> >> in the CSR representation (in which case you can have column indices
> >> that are larger than the specified sizes) -- I discourage this except as
> >> intermediate representations in matrix-matrix computations.  Others
> >> (including PETSc) store local matrices plus a "local-to-global"
> >> translation of local indices to the distributed setting.  This may be a
> >> no-op for Arrow (no support, but don't prevent), but I'm curious whether
> >> Arrow has a philosophy regarding collective use of distributed memory
> >> (e.g., via MPI).
> >>
> >>
> >> Finally, there are many other matrix formats to at least be aware of.
> >> These include blocked CSR (fewer indices to store), sliced Ellpack-style
> >> formats (better vectorization for some sparse matrix operations), and
> >> compressed row offsets for CSR (good if many rows are empty).
> >> Conventions regarding diagonals and how to express parallel matrices
> >> vary by package, but often require some conversion (which can sometimes
> >> be done in-place) to use different packages.  PETSc has interfaces to
> >> many such packages, so we have many examples of such conversions.
> >>
> >> Thanks for reading.  I'm happy to discuss further.

--
Kenta Murata
OpenPGP FP = 1D69 ADDE 081C 9CC2 2E54 98C1 CEFE 8AFB 6081 B062

I wrote a book!!
『Ruby 逆引きレシピ』 http://www.amazon.co.jp/dp/4798119881/mrkn-22
E-mail: m...@mrkn.jp
twitter: http://twitter.com/mrkn/
blog: http://d.hatena.ne.jp/mrkn/