Welcome to the community, Bob. On Tue, Sep 22, 2020 at 12:27 PM Bob Tinsman <bobt...@pacbell.net> wrote:
> I'd like to introduce myself, because I've had an interest in Arrow for a
> long time and now I have a chance to help out. Up until now, I haven't
> really contributed much in open source, although I've been an avid
> consumer, so I'd like to change that!
>
> My main areas of work have been performance optimization, Java, databases
> (mostly relational), and optimizing/refactoring architecture, but I also
> have some C/C++ background, and I'm a quick learner of new languages.
>
> The reason I'm so interested in Arrow is that I've already created two
> in-memory columnar dataset implementations for two different companies,
> so I'm a believer in the power of this model, although I came to it from
> a different perspective. I was just watching this discussion with Wes and
> Jacques: "Starting Apache Arrow", in which our CTO Jacques Nadeau sat
> down for a fireside chat with Wes McKinney, discussing the past, present,
> and future...
>
> Wes lays out two phases of Arrow:
> - Phase one: Arrow used as a common format
> - Phase two: Arrow used for actual calculation
> Because I was working on my own, I skipped to phase two.
>
> I worked for an online marketing survey company called MarketTools in the
> early 00's. Survey results were stored in SQL Server, and we had to
> implement crosstabs on the data; for example, if you wanted to see
> answers to survey questions broken down by age, gender, income range, etc.
>
> The original implementation would generate some pretty hairy SQL, which
> got pretty slow if there were a lot of questions on the crosstab. I
> thought, "Why are we asking the DB to run multiple queries on the same
> data when we could pull it into memory once, then do aggregate
> calculations there?" That managed to produce a 5x speedup in running the
> crosstabs.
>
> In my most recent company, I created a new in-memory dataset
> implementation as the basis for an interactive data analysis tool.
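The idea Bob describes — fetch the rows from the database once, then compute every crosstab aggregate in memory — can be sketched in a few lines. A minimal illustration, assuming hypothetical survey rows (these field names and data are made up, not from MarketTools' actual system):

```python
from collections import Counter

# Hypothetical survey rows, fetched from the database exactly once.
rows = [
    {"age_range": "18-24", "gender": "F", "answer": "yes"},
    {"age_range": "18-24", "gender": "M", "answer": "no"},
    {"age_range": "25-34", "gender": "F", "answer": "yes"},
    {"age_range": "25-34", "gender": "F", "answer": "no"},
]

def crosstab(rows, row_key, col_key):
    """Count responses broken down by two attributes, entirely in memory.

    Any number of crosstabs can be computed from the same `rows` without
    issuing another query, which is where the speedup comes from.
    """
    return Counter((r[row_key], r[col_key]) for r in rows)

# Break answers down by age range; other breakdowns reuse the same rows.
by_age = crosstab(rows, "age_range", "answer")
by_gender = crosstab(rows, "gender", "answer")
```

Each additional breakdown (age, gender, income range, ...) is just another in-memory pass, rather than another round trip to SQL Server.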
> Again I was working with mostly relational databases. I was able to push
> the scalability of the in-memory columns a lot more using dictionaries.
> I also developed a hybrid engine combining SQL generation and in-memory
> calculation, sort of like what Spark is doing. If I had known about
> Arrow, I would definitely have used it, but it wasn't around yet. You
> guys have accomplished a lot; congrats on your 1.0.0 release, by the way!
>
> I'm starting out by grokking all the source and docs, and looking at
> JIRA issues that I could potentially work on, but I'm looking forward to
> helping out however I can.
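The dictionary trick Bob mentions is the same idea behind Arrow's dictionary-encoded arrays: store each distinct value once and keep only small integer indices per row, which shrinks wide string columns dramatically. A toy sketch of the encoding step (illustrative only; Arrow's `DictionaryArray` provides this natively):

```python
def dict_encode(values):
    """Dictionary-encode a column: distinct values once, integer indices per row."""
    dictionary = []   # distinct values, in first-seen order
    index_of = {}     # value -> position in `dictionary`
    indices = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices

column = ["CA", "NY", "CA", "CA", "TX", "NY"]
dictionary, indices = dict_encode(column)
# dictionary == ["CA", "NY", "TX"], indices == [0, 1, 0, 0, 2, 1]
```

With low-cardinality columns the index array is far smaller than the raw values, and aggregations can group by the integer indices directly.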