I'd like to introduce myself, because I've been interested in Arrow for a long time and now I have a chance to help out. Up until now I haven't contributed much to open source, although I've been an avid consumer, so I'd like to change that! My main areas of work have been performance optimization, Java, databases (mostly relational), and optimizing/refactoring architecture, but I also have some C/C++ background, and I'm a quick learner of new languages.
The reason I'm so interested in Arrow is that I've already created two in-memory columnar dataset implementations for two different companies, so I'm a believer in the power of this model, although I came to it from a different perspective.
I was just watching this discussion with Wes and Jacques: "Starting Apache Arrow", a fireside chat between Jacques Nadeau and Wes McKinney.
Wes lays out two phases of Arrow:
- Phase one: Arrow used as a common format
- Phase two: Arrow used for actual calculation
Because I was working on my own, I skipped to phase two.
I worked for an online marketing survey company called MarketTools in the early 2000s. Survey results were stored in SQL Server, and we had to implement crosstabs on the data; for example, if you wanted to see answers to survey questions broken down by age, gender, income range, etc. The original implementation would generate some pretty hairy SQL, which got pretty slow if there were a lot of questions on the crosstab.
I thought, "Why are we asking the DB to run multiple queries on the same data when we could pull it into memory once, then do aggregate calculations there?"
That managed to produce a 5x speedup in running the crosstabs.
In my most recent company, I created a new in-memory dataset implementation as the basis for an interactive data analysis tool. Again I was working with mostly relational databases. I was able to push the scalability of the in-memory columns a lot further using dictionaries. I also developed a hybrid engine combining SQL generation and in-memory calculation, sort of like what Spark is doing.
If I had known about Arrow, I would definitely have used it, but it wasn't around yet.
You guys have accomplished a lot. Congrats on your 1.0.0 release, by the way! I'm starting out by grokking all the source and docs and looking at JIRA issues that I could potentially work on, and I'm looking forward to helping out however I can.