Thanks for the reply and your valuable suggestions.

I have about 10 GB of data generated every day, and I need to write it to my database. The data is schema-based, but the schema changes frequently, so please treat it as unstructured data. At times I may have to serve 10,000 writes/sec with 4 m1.xlarge machines. Will Spark SQL with the Hive Thrift server be good enough for that? As I understand it, Spark SQL works on SchemaRDD; won't that be a problem when the schema changes?
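To put rough numbers on the load, here is a back-of-the-envelope sketch in plain Python. It assumes 1 GB = 2**30 bytes and that writes are spread evenly over the 4 machines; both are my assumptions, and real traffic will be burstier than the daily average suggests.

```python
# Back-of-the-envelope sizing from the figures quoted in this message.
DAILY_BYTES = 10 * 1024**3        # 10 GB per day (assumes 1 GB = 2**30 bytes)
SECONDS_PER_DAY = 24 * 60 * 60
PEAK_WRITES_PER_SEC = 10_000      # stated peak write rate
MACHINES = 4                      # stated cluster size

# Average ingest rate if the daily volume arrived uniformly over the day.
avg_bytes_per_sec = DAILY_BYTES / SECONDS_PER_DAY

# Peak write rate each machine must absorb, assuming an even spread.
peak_writes_per_machine = PEAK_WRITES_PER_SEC / MACHINES

print(f"average ingest: {avg_bytes_per_sec / 1024:.1f} KiB/s")
print(f"peak writes per machine: {peak_writes_per_machine:.0f}/s")
```

So the sustained volume is small; the peak write rate per machine is the figure that matters when comparing the storage options.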
Also, I have complex queries for real-time analytics, such as AND queries across multiple fields, for example "list all users who bought flats in Mumbai in the last 30 minutes". If I use HBase or Cassandra, I need to set up a NoSQL cluster, so I would end up with two clusters: one for Spark and another for NoSQL. Wouldn't it be better to start with HDP?

On 23 July 2015 at 11:33, fightf...@163.com <fightf...@163.com> wrote:

> Hi there,
>
> For your analytics and real-time recommendation requirements, I would
> recommend using Spark SQL and the Hive Thrift server to store and process
> your Spark Streaming data. The Thrift server runs as a long-running
> application, so it would be quite feasible to consume data periodically
> and serve some analytical requirements.
>
> On the other hand, HBase or Cassandra would also be sufficient, and I think
> you may want to integrate Spark SQL with HBase / Cassandra for your data
> digesting. You could deploy a CDH or HDP platform to support your
> production environment. I suggest you first deploy a Spark standalone
> cluster to run some integration tests, and you can also consider running
> Spark on YARN for later development use cases.
>
> Best,
> Sun.
>
> ------------------------------
> fightf...@163.com
>
> *From:* Jeetendra Gangele <gangele...@gmail.com>
> *Date:* 2015-07-23 13:39
> *To:* user <user@spark.apache.org>
> *Subject:* Re: Need help in setting up spark cluster
> Can anybody help here?
>
> On 22 July 2015 at 10:38, Jeetendra Gangele <gangele...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am trying to capture user activities for a real estate portal.
>>
>> I am using a combination of RabbitMQ and Spark Streaming: I push all
>> events to RabbitMQ and then consume them with Spark Streaming in
>> 1-second micro-batches.
>>
>> Later on, I am thinking of storing the consumed data for analytics or
>> near-real-time recommendations.
>>
>> Where should I store this data? Should I keep it in Spark RDDs so that
>> people can query it with Spark SQL for analytics or real-time
>> recommendations? The data is not huge; currently it is 10 GB per day.
>>
>> Another alternative would be either HBase or Cassandra. Which one would
>> be better?
>>
>> Any suggestions?
>>
>> Also, for this use case, should I use an existing big data platform like
>> Hortonworks, or can I deploy a standalone Spark cluster?
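
P.S. To make the kind of AND query from my message concrete, here is a minimal sketch in plain Python (not Spark or any particular store). The field names (`user`, `action`, `property_type`, `city`, `ts`) and the sample events are hypothetical, made up only for illustration. The `.get()` lookups show one way to tolerate records that lack a field, since the schema changes often.

```python
from datetime import datetime, timedelta

def recent_buyers(events, city, now, window=timedelta(minutes=30)):
    """Return users who bought a flat in `city` within `window` before `now`."""
    cutoff = now - window
    users = []
    for e in events:
        # All conditions must hold (an AND query over several fields).
        # .get() returns None for missing fields, so older or newer
        # record layouts are skipped instead of raising KeyError.
        if (e.get("action") == "buy"
                and e.get("property_type") == "flat"
                and e.get("city") == city
                and e.get("ts") is not None
                and cutoff <= e["ts"] <= now):
            users.append(e.get("user"))
    return users

# Hypothetical sample events; field names are illustrative only.
now = datetime(2015, 7, 23, 12, 0)
events = [
    {"user": "u1", "action": "buy", "property_type": "flat",
     "city": "mumbai", "ts": now - timedelta(minutes=10)},
    {"user": "u2", "action": "buy", "property_type": "flat",
     "city": "mumbai", "ts": now - timedelta(minutes=45)},   # too old
    {"user": "u3", "action": "view", "property_type": "flat",
     "city": "mumbai", "ts": now - timedelta(minutes=5)},    # not a purchase
    {"user": "u4", "action": "buy", "property_type": "flat",
     "city": "pune", "ts": now - timedelta(minutes=5),
     "budget": 5000000},                                     # extra field, other city
    {"user": "u5", "action": "buy", "city": "mumbai"},       # missing fields
]

print(recent_buyers(events, "mumbai", now))
```

Whatever store I pick would need to answer this kind of multi-field, time-windowed filter quickly while new events keep arriving.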