Hi Saurabh,
You can have a Hadoop cluster running YARN as the scheduler.
Configure Spark to run on the same YARN setup.
Then you need R on only one node, and you can connect to the cluster using
SparkR.
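
For example, connecting from that one node could look like this (a minimal
sketch, assuming the Spark 1.6 SparkR API and that SPARK_HOME points at your
Spark installation; the app name is just a placeholder):

    # Load the SparkR package bundled with the Spark distribution
    library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

    # Connect to YARN; "yarn-client" keeps the driver on this node
    sc <- sparkR.init(master = "yarn-client", appName = "analytics-engine")
    sqlContext <- sparkRSQL.init(sc)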

Thanks
Deepak

On Mon, May 30, 2016 at 12:12 PM, Jörn Franke <[email protected]> wrote:

>
> Well, if you require R then you need to install it (including all
> additional packages) on each node. I am not sure why you store the data in
> Postgres. Storing it in Parquet or ORC in HDFS (sorted on the relevant
> columns) is sufficient, and you can use the SparkR libraries to access it.
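>
> For instance (a minimal sketch in the Spark 1.6 SparkR API, assuming a
> SparkR sqlContext has been initialised; the HDFS paths and the sort column
> are made up for illustration):
>
>     # Sort on a commonly filtered column and persist as Parquet in HDFS
>     df <- read.df(sqlContext, "hdfs:///raw/events", source = "json")
>     sorted <- arrange(df, df$event_time)
>     write.df(sorted, "hdfs:///warehouse/events", source = "parquet",
>              mode = "overwrite")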
>
> On 30 May 2016, at 08:38, Kumar, Saurabh 5. (Nokia - IN/Bangalore) <
> [email protected]> wrote:
>
> Hi Team,
>
> I am using Apache Spark to build a scalable analytics engine. My setup is
> as follows.
>
> The flow of processing is:
>
> Raw files > store to HDFS > process with Spark and store to the Postgres-XL
> database > R reads data from Postgres-XL and processes it in distributed mode.
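>
> (For illustration, if that last step went through SparkR's generic JDBC
> data source rather than a direct R connection, it could look like the
> sketch below; the URL, table name, and credentials are placeholders, and
> the PostgreSQL JDBC driver must be on the Spark classpath.)
>
>     events <- read.df(sqlContext, source = "jdbc",
>                       url = "jdbc:postgresql://coordinator:5432/analytics",
>                       dbtable = "events", user = "etl", password = "***")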
>
> I have a 6-node cluster set up for ETL operations, which has:
>
>
>    1. Spark slaves installed on all 6 nodes.
>    2. HDFS data nodes on each of the 6 nodes, with replication factor 2.
>    3. A Postgres-XL 9.5 database coordinator on each of the 6 nodes.
>    4. R installed on all nodes, which processes data from Postgres-XL in a
>    distributed manner.
>
> Can you please guide me on the pros and cons of this setup?
> Is installing every component on every machine recommended, or are there
> any drawbacks?
> Should R run on the Spark cluster?
>
> Thanks & Regards
> Saurabh Kumar
> R&D Engineer, T&I TED Technology Exploration & Disruption
> Nokia Networks
> L5, Manyata Embassy Business Park, Nagavara, Bangalore, India 560045
> Mobile: +91-8861012418
> http://networks.nokia.com/
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
