Hi Folks,
I am Hari, a developer with a company called ThoughtWorks. We've been 
developing data pipelines on Hadoop, Spark, etc. for a while now. From our 
experience with different customers, we've noticed a recurring need to carry 
out tasks such as data preparation and data anonymization on large datasets 
using Java MR and Spark. Based on this experience, we have been working on 
building a couple of libraries targeted at data preparation and data protection 
to begin with. They are hosted under an umbrella project called Data Commons at the 
moment (inspired by the Apache Commons project, which is organized around a 
similar theme).
At the moment this is a fledgling project, and contributions are driven by 
our data team. However, we are very keen on making this part of the larger 
Apache collective and making it a community-driven effort. 
Hence, I am reaching out to you folks for advice on the best way 
forward for this effort. We are also open to exploring collaborations with 
existing projects that are already part of Apache. Please share your thoughts 
and advice.
-- Hari
