It really depends on your needs and your data.

Do you want to store 1 TB, 1 PB, or far more? Do you mostly read the data back 
and do only light processing on it, or do you run a complex machine learning 
pipeline? Depending on the workload, the ratio between cores and storage will 
vary.


First, start with a subset of your data and run some tests on your own computer 
or, better, on a small cluster of 3 nodes. This will help you find your 
storage-to-cores ratio and estimate the memory you will need once you move from 
the subset to the whole dataset you have (or can have).
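As a rough illustration, you could run a representative job on that test 
cluster with fixed executor resources and then compare the Spark UI's storage 
and executor metrics against those settings. The sketch below is only an 
example; the path, memory, and core values are placeholders, not 
recommendations.

    // Minimal sketch (hypothetical path and sizes): run a representative job
    // with pinned executor resources, then inspect the Spark UI to see how
    // much memory and CPU the subset actually consumed.
    import org.apache.spark.sql.SparkSession

    object SizingTest {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("sizing-test")
          .config("spark.executor.memory", "4g") // placeholder: memory per executor
          .config("spark.executor.cores", "2")   // placeholder: cores per executor
          .getOrCreate()

        // Placeholder path: a subset of the real dataset
        val df = spark.read.parquet("hdfs:///data/sample/")
        df.cache()
        // count() forces the cache, so the footprint shows up in the Storage tab
        println(s"Sample row count: ${df.count()}")

        spark.stop()
      }
    }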


Then, using this information and the guidance on the Spark website 
(http://spark.apache.org/docs/latest/hardware-provisioning.html), you will be 
able to specify the hardware of one node and how many nodes you need (at 
least 3).
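To extrapolate from the test to the full dataset, a simple back-of-envelope 
calculation is usually enough. The numbers below are all placeholders you 
would replace with your own measurements and candidate node specs.

    // Rough sizing arithmetic (all values are placeholders): scale the memory
    // observed on the subset up to the full data volume, then divide by the
    // usable memory of one candidate node.
    object NodeEstimate {
      def main(args: Array[String]): Unit = {
        val subsetSizeGb    = 50.0    // size of the test subset on disk
        val subsetCachedGb  = 120.0   // memory it occupied once cached/processed
        val fullDataSizeGb  = 5000.0  // size of the complete dataset
        val memoryPerNodeGb = 128.0   // candidate node spec
        val usableFraction  = 0.6     // headroom for OS, shuffle, overhead

        val expansionFactor = subsetCachedGb / subsetSizeGb
        val neededMemoryGb  = fullDataSizeGb * expansionFactor
        val nodes = math.ceil(neededMemoryGb / (memoryPerNodeGb * usableFraction)).toInt
        println(s"Estimated nodes (minimum 3): ${math.max(nodes, 3)}")
      }
    }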


Yohann Jardin

On 4/30/2017 at 10:26 AM, rakesh sharma wrote:

Hi

I would like to know the details of implementing a cluster.

What kind of machines would one require, how many nodes, how many cores, etc.?


thanks

rakesh
