Oops, for some reason I forgot to add Sajjadur to the email list! Aditya
On Sun, Dec 8, 2019 at 3:14 PM Aditya Parameswaran <adity...@berkeley.edu> wrote: > Michael, > > Great to connect with you. Sajjadur forwarded your email to me. > > Just a bit of context about the work that you're referring to -- > > I'm Aditya Parameswaran, an assistant professor at UC Berkeley. Along > with Prof. Karrie Karahalios at the University of Illinois and many Ph.D. > student researchers, we've been working on developing a scalable > spreadsheet system, DataSpread (http://dataspread.github.io), for about > half a decade now. We've published a number of papers on this topic as well > as developed a prototype system. Sajjadur led the effort on benchmarking > that you referenced in your email and a navigation plugin (more on that in > a bit), while Mangesh (also cc-ed, when he was at the University of > Illinois) led the efforts to integrate spreadsheets and databases and to > build a more interactive (asynchronous) formula execution engine when he > was at Illinois. > > We'd be very keen to collaborate to see if some of the ideas that we've > developed and opportunities we've identified would make sense in Calc. Our > ultimate aim is to percolate some of these ideas back into popular > spreadsheet systems like Calc, so I'm excited to have this opportunity. > > Answers to some of your questions in-line below: > >> >> I was interested in a number of things: particularly whether we >> can get >> your test sheets / macros so that we can run the tests under a profiler >> & of course see what stupid things jump out that we can fix =) >> > > Yes, of course. Sajjadur, with Kelly's help, is looking into packaging > this and sending it your way. > >> >> We do have a columnular layout; checkout: >> >> >> https://gerrit.libreoffice.org/plugins/gitiles/core/+/master/sc/inc/column.hxx#111 >> >> ... assuming you have reasonably uniform, contiguous data types down a >> column your test shouldn't have concluded: >> >> "Takeaway: Spreadsheet systems do not employ a columnar >> data layout to improve computational (e.g., aggregation) performance" >> > > Looking at Figure 10 in our preprint, Calc does seem to be somewhat > faster for sequential read than random read at least on larger sizes. > > So I am not sure why we concluded outright that none of the spreadsheet > systems employ a columnar layout -- this is a good catch; we will fix. > > That said, looking at Figure 10, it is surprising that the gains for the > sequential read are not a lot more; and the gains should increase > proportionally. So something funky is going on. Worth investigating. > > > >> For incremental updates - re-computation is frequently chosen >> over more >> smarts since correctness is a far more dominating concern than >> performance generally, and there are plenty of know performance >> optimizations that can be done before we try to complicate things >> further. Also the assumption that after deleting A100: >> >> SUM(A1:A100) - A100 === SUM(A1:A99) >> >> falls foul of potential precision problems. >> > > Makes sense. > > >> >> The idea of converting to SQL queries is an interesting one but I >> find >> it very hard to believe it would provide any performance advantage at >> the same memory footprint. Furthermore - I'd be interested to know how >> you do other spreadsheet operations: row & column insertion, addressing, >> and dependency work on top of a SQL database with any efficiency. >> > > We started by having the relational database be a simple persistent > storage layer, when coupled with an index to retrieve data by position, can > allow us to scroll through large datasets of billions of rows at ease. We > developed a new positional index to handle insertions and deletions in > O(log(n)) -- https://arxiv.org/pdf/1708.06712.pdf. I agree that pushing > the computation to the relational database does have overheads; but at the > same time, it allows for scaling to arbitrarily large datasets. > > Happy to explain this on the phone. > > >> It is also worth bearing in mind that dependency management and >> tracking of what to update when something changes consumes a far larger >> proportion of a spreadsheet than is commonly expected: what to >> re-calculate having changed A1 is far from trivial for large, twisty >> real-world sheets. >> > > Ah, for this, we have a new asynchronous formula computation capability > (led by Mangesh) that looks something like this: > https://twitter.com/adityagp/status/1146105708024229888?s=20 (gif in > tweet). When a change is made, we use a compact dependency maintenance > structure to identify dependent cells that are hidden away with a progress > bar, and then compute them in the background. > >> >> Anyhow - would be happy to have a chat with your team at some >> stage if >> you're interested in helping us to improving things here: a good start >> would be just getting more representative benchmarks for the workload >> you're interested in would be really useful. >> > > Would love to chat and see if any of the work that we're doing can > translate into Calc, and how we can contribute. > > One other project that may be of interest is one where we're trying to > build a spreadsheet summarization and navigation tool, which can be > especially helpful on very large spreadsheets. > http://srahman7.web.engr.illinois.edu/papers/NOAH.pdf > > >> >> And finally of course, you used a really old version. I'd >> recommend >> using the 6.4 snapshots, or 6.3 if you must. >> >> >> https://gerrit.libreoffice.org/plugins/gitiles/core/+/46d0afba738d8ee7c9b63384fef513f42ee587f3 >> >> https://gerrit.libreoffice.org/plugins/gitiles/core/+/845e1cdca3349c72e3083186502285d5b776abbe >> ... >> > > Agreed. We started the benchmarking effort a couple years ago, and the old > version was the new version back then :-) > > > >> And lots of others. Indeed, we have a customer who has large >> ~100k+ row >> spreadsheets with complex sorts across many rows who claims that Calc >> sorts the data in second to minutes vs. Excels' hours - so I was >> surprised to see your sort results; would be good to inspect what you do. >> > I hope that helps, would love to get in touch with you & your team, >> there is plenty of fun stuff to do here to make things quicker & better =) >> > > Again, happy to share what we know! Let's find a time to chat. I see > that you're in Europe, so mornings for us (PT/CT) may work better? > Sajjadur is traveling, so I'm not entirely sure if he's around, but I > should be able to find time to chat early in the morning any day next week. > > Aditya > >> >
_______________________________________________ LibreOffice mailing list LibreOffice@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/libreoffice