Hi, I know this is a broad question. If this is not the right forum, I'd appreciate it if you could point me to other sites/areas that may be helpful.

Before posting this question, I did search Google, but filtering the results down to what I actually need hasn't been easy.

Who I am:
- Have done data processing and analytics, but relatively new to the Spark world

What I am looking for:
- Architecture/design of an ML system using Spark
- In particular, best practices that can support/bridge both Engineering and Data Science teams

Engineering:
- Build a system that meets typical engineering needs: data processing, scalability, reliability, availability, fault tolerance, etc.
- System monitoring, etc.

Data Science:
- Build a system for the Data Science team to do data exploration
- Develop models using supervised learning and tweak models

Data:
- Batch and incremental updates
- Mostly structured or semi-structured (some data from transaction systems, weblogs, clickstream, etc.)
- Streaming in the near term, but not to begin with

Data Storage:
- Data is expected to grow on a daily basis, so the system should be able to support and handle big data
- After further analysis, there may be a possibility/need to archive some of the data; it all depends on how the ML models are built and how the results are stored/used in the future

Data Analysis:
- The obvious data-related aspects, such as data cleansing, data transformation, data partitioning, etc.
- Maybe run models on windows of data, e.g. the last 1 year, 2 years, etc.

ML models:
- Ability to store model versions and previous results
- Compare results of different variants of models

Consumers:
- RESTful web service clients to look at the results

*So, the questions I have are:*

1) Are there architectural and design patterns that I can use, based on industry best practices? In particular:
- data ingestion
- data storage (e.g. go with HDFS or not)
- data partitioning, especially in the Spark world
- running parallel ML models and combining results, etc.
- consumption of final results by clients (e.g. by pushing results to Cassandra, NoSQL DBs, etc.)

Again, I know this is a broad question. Pointers to best practices in some of these areas, if not all, would be highly appreciated. I'm open to purchasing any books that may have relevant information.

Thanks much, folks,
Vasu.