I'd like to introduce myself, because I've had an interest in Arrow for a long 
time and now I have a chance to help out. Up until now, I haven't really 
contributed much to open source, although I've been an avid consumer, so I'd 
like to change that!
My main areas of work have been performance optimization, Java, databases 
(mostly relational), and optimizing/refactoring architecture, but I also have 
some C/C++ background, and I'm a quick learner of new languages.

The reason I'm so interested in Arrow is that I've already created two 
in-memory columnar dataset implementations for two different companies, so I'm 
a believer in the power of this model, although I came to it from a different 
perspective. I was just watching this discussion with Wes and Jacques: 
"Starting Apache Arrow," a fireside chat between Jacques Nadeau and Wes 
McKinney discussing the past, present, and future.

Wes lays out two phases of Arrow:

- Phase one: Arrow used as a common format
- Phase two: Arrow used for actual calculation

Because I was working on my own, I skipped to phase two.
I worked for an online marketing survey company called MarketTools in the early 
00's. Survey results were stored in SQL Server, and we had to implement 
crosstabs on the data; for example, if you wanted to see answers to survey 
questions broken down by age, gender, income range, etc.
The original implementation would generate some pretty hairy SQL, which got 
pretty slow if there were a lot of questions on the crosstab. I thought, "Why 
are we asking the DB to run multiple queries over the same data when we could 
pull it into memory once, then do the aggregate calculations there?" That 
managed to produce a 5x speedup in running the crosstabs.
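
As a rough illustration of that idea (a hypothetical sketch of my own, not code 
from that project): once the columns are in memory, every breakdown is just 
another pass over the same arrays rather than another round trip to the 
database.

    import java.util.HashMap;
    import java.util.Map;

    // Minimal sketch of an in-memory crosstab: load the columns once,
    // then compute counts for any (demographic, answer) pair with a
    // single pass instead of a separate SQL query per breakdown.
    public class CrosstabSketch {
        public static void main(String[] args) {
            // Hypothetical survey data, stored column-wise.
            String[] gender = {"F", "M", "F", "F", "M"};
            String[] answer = {"Yes", "No", "Yes", "No", "Yes"};

            // One pass over the columns builds the whole crosstab.
            Map<String, Integer> counts = new HashMap<>();
            for (int row = 0; row < answer.length; row++) {
                String key = gender[row] + " / " + answer[row];
                counts.merge(key, 1, Integer::sum);
            }

            counts.forEach((cell, n) -> System.out.println(cell + ": " + n));
        }
    }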
In my most recent company, I created a new in-memory dataset implementation as 
the basis for an interactive data analysis tool. Again, I was working mostly 
with relational databases. I was able to push the scalability of the in-memory 
columns a lot further using dictionaries (sketched below). I also developed a 
hybrid engine combining SQL generation and in-memory calculation, sort of like 
what Spark is doing.
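
By "dictionaries" I mean plain dictionary encoding: each distinct value is 
stored once and every row keeps only a small integer code, which is also 
roughly how Arrow's dictionary arrays work. A minimal sketch of the idea 
(hypothetical, not the actual implementation):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of a dictionary-encoded string column: distinct values are
    // stored once, and each row holds only a small integer code.
    public class DictionaryColumn {
        private final List<String> dictionary = new ArrayList<>();
        private final Map<String, Integer> codes = new HashMap<>();
        private final List<Integer> rows = new ArrayList<>();

        public void append(String value) {
            // Reuse the existing code for this value, or assign the next one.
            Integer code = codes.computeIfAbsent(value, v -> {
                dictionary.add(v);
                return dictionary.size() - 1;
            });
            rows.add(code);
        }

        public String get(int row) {
            return dictionary.get(rows.get(row));
        }

        public static void main(String[] args) {
            DictionaryColumn country = new DictionaryColumn();
            for (String v : new String[]{"US", "DE", "US", "US", "FR"}) {
                country.append(v);
            }
            // Five rows, but only three distinct strings are stored.
            System.out.println(country.get(2) + ", distinct values: "
                    + country.dictionary.size());
        }
    }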
If I had known about Arrow, I would definitely have used it, but it wasn't 
around yet. You guys have accomplished a lot; congrats on your 1.0.0 release, 
by the way!

I'm starting out by grokking all the source and docs, and looking at JIRA 
issues that I could potentially work on, but I'm looking forward to helping out 
however I can.
