I was hoping someone could answer this question, as it resonates with many
developers who are new to Spark and trying to adopt it at work.
Regards
Pradeep

On Dec 3, 2016, at 9:00 AM, Vasu Gourabathina <vgour...@gmail.com> wrote:

Hi,

I know this is a broad question. If this is not the right forum, I would
appreciate pointers to other sites/areas that may be helpful.

Before posing this question I did use our friend Google, but sifting the
results for my particular needs hasn't been easy.

Who I am:
   - Have done data processing and analytics, but am relatively new to the Spark world

What I am looking for:
  - Architecture/Design of a ML system using Spark
  - In particular, looking for best practices that can support/bridge both 
Engineering and Data Science teams

Engineering:
   - Build a system that meets typical engineering needs: data processing,
scalability, reliability, availability, fault tolerance, etc.
   - System monitoring, etc.
Data Science:
   - Build a system for Data Science team to do data exploration activities
   - Develop models using supervised learning and tweak models

Data:
  - Batch and incremental updates - mostly structured or semi-structured (some
data from transaction systems, weblogs, clickstream, etc.)
  - Streaming in the near term, but not to begin with

Data Storage:
  - Data is expected to grow on a daily basis, so the system should be able to
support and handle big data
  - After further analysis, there may be a need to archive some of the data; it
all depends on how the ML models are built and how results are stored/used in
the future

Data Analysis:
  - The obvious data-related aspects, such as data cleansing, data
transformation, data partitioning, etc.
  - Models may be run on windows of data, for example the last 1 year, 2 years, etc.
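To make the windowing idea concrete outside Spark (where the same thing would typically be a DataFrame filter on a date column), here is a minimal plain-Python sketch; the record layout and the `event_date` field name are assumptions for the example, not anything from the original question:

```python
from datetime import date, timedelta

def window_records(records, years, today=None):
    """Keep only records whose 'event_date' falls within the last `years` years.

    `records` is a list of dicts with an 'event_date' key (datetime.date).
    A real Spark job would express this as a DataFrame filter instead.
    """
    today = today or date.today()
    cutoff = today - timedelta(days=365 * years)
    return [r for r in records if r["event_date"] >= cutoff]

# Example: keep only the last 1-year window, as of a fixed "today".
records = [
    {"id": 1, "event_date": date(2016, 11, 1)},
    {"id": 2, "event_date": date(2014, 6, 1)},
]
recent = window_records(records, years=1, today=date(2016, 12, 3))
```

In Spark the same predicate pushes down to the storage layer when the data is partitioned by date, which is what makes "last 1-year" jobs cheap.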

ML models:
  - Ability to store model versions and previous results
  - Compare results of different variants of models
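As a sketch of the model-versioning and comparison requirement: spark.ml pipelines can be persisted with `model.save(path)`, and a small registry can map version ids to saved paths and evaluation metrics. The class below is a toy in-memory illustration (the registry shape, paths, and the `auc` metric name are assumptions), not a real Spark API:

```python
class ModelRegistry:
    """Toy in-memory registry mapping model versions to metadata and metrics.

    In practice `path` would point at a persisted spark.ml PipelineModel
    (model.save(path)), and the registry would live in a database, not a dict.
    """

    def __init__(self):
        self._versions = {}

    def register(self, version, path, metrics):
        self._versions[version] = {"path": path, "metrics": metrics}

    def best(self, metric, higher_is_better=True):
        """Return the version id with the best value for `metric`."""
        pick = max if higher_is_better else min
        return pick(self._versions,
                    key=lambda v: self._versions[v]["metrics"][metric])

    def compare(self, metric):
        """Return {version: metric_value} for side-by-side comparison."""
        return {v: m["metrics"][metric] for v, m in self._versions.items()}

# Example: two variants of the same model, compared on a hypothetical AUC metric.
registry = ModelRegistry()
registry.register("v1", "/models/churn/v1", {"auc": 0.81})
registry.register("v2", "/models/churn/v2", {"auc": 0.86})
```

Keeping the metrics alongside the saved-model path is what lets you answer "which variant won, and where is its artifact?" in one lookup.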

Consumers:
  - RESTful webservice clients to look at the results

So, the questions I have are:
1) Are there architectural and design patterns, based on industry best
practices, that I can use? In particular:
      - data ingestion
      - data storage (e.g., whether or not to go with HDFS)
      - data partitioning, especially in the Spark world
      - running parallel ML models and combining results, etc.
      - consumption of final results by clients (e.g., by pushing results to
Cassandra or other NoSQL DBs)
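On the partitioning question specifically, a common pattern is to store data partitioned by a date-derived key so that time-windowed jobs read only the partitions they need; in Spark this maps to `df.write.partitionBy("year", "month")`. A minimal plain-Python sketch of deriving and grouping by such keys (field names are assumptions for the example):

```python
from collections import defaultdict
from datetime import date

def partition_key(event_date):
    """Derive a (year, month) partition key, analogous to the columns
    you would pass to DataFrameWriter.partitionBy() in Spark."""
    return (event_date.year, event_date.month)

def bucket(records):
    """Group records into partitions keyed by (year, month)."""
    parts = defaultdict(list)
    for r in records:
        parts[partition_key(r["event_date"])].append(r)
    return dict(parts)

records = [
    {"id": 1, "event_date": date(2016, 11, 1)},
    {"id": 2, "event_date": date(2016, 11, 20)},
    {"id": 3, "event_date": date(2016, 12, 3)},
]
parts = bucket(records)
```

With a layout like this, a "last month only" job touches one partition instead of scanning the whole dataset, which is the main lever for keeping windowed batch runs cheap as the data grows daily.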

Again, I know this is a broad question. Pointers to best practices in some of
the areas, if not all, would be highly appreciated. I am open to purchasing any
books that have relevant information.

Thanks much folks,
Vasu.



