It really depends on your needs and your data.
Do you want to store 1 TB, 1 PB, or far more? Do you only need to read the data back, retrieve it and do a little processing on it, or run a complex machine learning pipeline? Depending on the workload, the ratio between cores and storage will vary.

Start with a subset of your data and run some tests on your own computer or, better, on a small cluster of 3 nodes. This will help you find your storage-to-core ratio and estimate the memory you would need when processing not just the subset but the whole dataset you have (or can have). Then, using that information together with the guidance on the Spark website (http://spark.apache.org/docs/latest/hardware-provisioning.html), you will be able to specify the hardware of one node and how many nodes you need (at least 3). A rough sketch of such a sizing test is included after the quoted message below.

Yohann Jardin

On 4/30/2017 at 10:26 AM, rakesh sharma wrote:
Hi
I would like to know the details of implementing a cluster. What kind of machines would one require, how many nodes, number of cores, etc.?
thanks
rakesh
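
Here is a minimal Scala sketch of the kind of sizing test I mean, assuming a Parquet sample of your data at a hypothetical path (/data/sample) and a placeholder column name ("category"); swap the aggregation for whatever is representative of your workload (an ML pipeline stage, for example), then read task times, shuffle sizes and peak memory off the Spark UI to extrapolate to the full dataset.

import org.apache.spark.sql.SparkSession

object SizingTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sizing-test")
      // Hypothetical settings; adjust them to the hardware you are testing on.
      .config("spark.executor.memory", "4g")
      .config("spark.executor.cores", "2")
      .getOrCreate()

    // "/data/sample" and "category" are placeholders for your own data.
    val df = spark.read.parquet("/data/sample")

    val start = System.nanoTime()
    // A representative job: replace with the kind of work you actually expect.
    df.groupBy("category").count()
      .write.mode("overwrite").parquet("/tmp/sizing-test-out")
    val elapsedSec = (System.nanoTime() - start) / 1e9

    println(s"Processed ${df.count()} rows in $elapsedSec s")
    spark.stop()
  }
}

Run it with spark-submit first locally, then on the 3-node test cluster with different executor memory/core settings, and compare the numbers before deciding on the hardware for the full cluster.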