Hi Saurabh,

You can have a Hadoop cluster running YARN as the scheduler and configure Spark to run on that same YARN setup. Then you need R on only one node, and from there you connect to the cluster using SparkR. Two rough sketches below illustrate the idea.
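Something along these lines (untested; assumes a Spark 1.6-style SparkR API, and that HADOOP_CONF_DIR / YARN_CONF_DIR on the R node point at your cluster config -- the app name and executor count are just placeholders):

    # on the single node that has R installed
    library(SparkR)

    # yarn-client mode: the driver runs locally, executors run on the YARN cluster
    sc <- sparkR.init(master = "yarn-client",
                      appName = "etl-example",
                      sparkEnvir = list(spark.executor.instances = "6"))
    sqlContext <- sparkRSQL.init(sc)

    # ... SparkR jobs go here ...

    sparkR.stop()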
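And to illustrate Jörn's point below about keeping the data as Parquet/ORC in HDFS instead of Postgres, reading it back through SparkR looks roughly like this (the HDFS path and column names are made up):

    # assumes sc / sqlContext from the previous snippet
    df <- read.df(sqlContext, "hdfs:///data/events.parquet", source = "parquet")

    # push filtering/aggregation down to Spark; only collect small results into R
    by_node <- agg(groupBy(df, df$node_id), total_value = sum(df$value))
    local_result <- collect(by_node)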
Thanks
Deepak

On Mon, May 30, 2016 at 12:12 PM, Jörn Franke <[email protected]> wrote:

> Well, if you require R then you need to install it (including all
> additional packages) on each node. I am not sure why you store the data
> in Postgres. Storing it in Parquet or ORC in HDFS (sorted on the relevant
> columns) is sufficient, and you use the SparkR libraries to access it.
>
> On 30 May 2016, at 08:38, Kumar, Saurabh 5. (Nokia - IN/Bangalore)
> <[email protected]> wrote:
>
> Hi Team,
>
> I am using Apache Spark to build a scalable analytics engine. My setup is
> as follows.
>
> Flow of processing:
>
> Raw files -> stored to HDFS -> processed by Spark and stored to
> Postgres-XL database -> R processes data from Postgres-XL in distributed
> mode.
>
> I have a 6-node cluster set up for ETL operations, with:
>
> 1. Spark slaves installed on all 6 nodes.
> 2. HDFS data nodes on each of the 6 nodes, with replication factor 2.
> 3. A Postgres-XL 9.5 database coordinator on each of the 6 nodes.
> 4. R installed on all nodes, processing data from Postgres-XL in a
>    distributed manner.
>
> Can you please guide me on the pros and cons of this setup?
> Is installing all components on every machine recommended, or are there
> drawbacks?
> Should R run on the Spark cluster?
>
> Thanks & Regards
> Saurabh Kumar
> R&D Engineer, T&I TED Technology Exploration & Disruption
> Nokia Networks
> L5, Manyata Embassy Business Park, Nagavara, Bangalore, India 560045
> Mobile: +91-8861012418
> http://networks.nokia.com/

--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
