Hi Folks,
I am Hari, a developer with a company called ThoughtWorks. We've been
developing data pipelines on Hadoop, Spark, etc. for a while now. From our
experiences with different customers we've noticed a recurring need to carry
out tasks such as data preparation, data anonymization, etc. on large datasets
using Java MR and Spark. Based on this experience, we have been working on
building a couple of libraries targeted at data preparation and data protection
to begin with. They are hosted under an umbrella project called Data Commons at the
moment (inspired by the Apache Commons project which is organized around a
similar theme).
At the moment this is a fledgling project and its contributions are driven by
our data team. However, we are very keen on making this part of the larger
Apache collective and making it a community-driven effort.
Hence, I am reaching out to you folks for advice on what could be the best way
forward for this effort. We are also open to exploring collaborations with other
existing projects that are already part of Apache. Please share your thoughts
and advice.
-- Hari