I think the relational database will be faster for ordinal data (e.g. where you answer on a scale of 1 to x). For free-text fields I would recommend Solr or Elasticsearch, because they have far more text-analytics capabilities than a relational database or MongoDB, and those capabilities are not likely to appear there in the near future.
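For the ordinal data points themselves, the Spark side can stay very simple. A minimal sketch in PySpark, assuming the answers sit in a relational table called answers with survey_id, question_id and rating columns (all of these names, and the JDBC connection details, are made up here):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("survey-datapoints").getOrCreate()

# Hypothetical table, columns and connection string; adjust to the real schema.
answers = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/surveys")
    .option("dbtable", "answers")
    .option("user", "report_user")
    .option("password", "change-me")  # placeholder credentials
    .load())

# One pass computes the per-question data points: average, min and max rating.
data_points = (answers
    .groupBy("survey_id", "question_id")
    .agg(F.avg("rating").alias("avg_rating"),
         F.min("rating").alias("min_rating"),
         F.max("rating").alias("max_rating")))

data_points.show()

At your volumes a single aggregation pass like this covers most of the per-question statistics; the free-text analysis is where a dedicated search engine earns its keep.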
> On 06 Mar 2016, at 18:25, Guillaume Bilodeau <guillaume.bilod...@gmail.com> wrote:
>
> The data is currently stored in a relational database, but a migration to a
> document-oriented database such as MongoDB is something we are definitely
> considering. How does this factor in?
>
>> On Sun, Mar 6, 2016 at 12:23 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>> Hi,
>>
>> That depends on a lot of things, but as a starting point I would ask whether
>> you are planning to store your data in JSON format?
>>
>> Regards,
>> Gourav Sengupta
>>
>>> On Sun, Mar 6, 2016 at 5:17 PM, Laumegui Deaulobi <guillaume.bilod...@gmail.com> wrote:
>>> Our problem space is survey analytics. Each survey comprises a set of
>>> questions, with each question having a set of possible answers. Survey
>>> fill-out tasks are sent to users, who have until a certain date to complete
>>> them. Based on these survey fill-outs, reports need to be generated. Each
>>> report deals with a subset of the survey fill-outs and comprises a set of
>>> data points (average rating for question 1, min/max for question 2, etc.).
>>>
>>> We are dealing with rather large data sets - although reading the internet
>>> we get the impression that everyone is analyzing petabytes of data...
>>>
>>> Users: up to 100,000
>>> Surveys: up to 100,000
>>> Questions per survey: up to 100
>>> Possible answers per question: up to 10
>>> Survey fill-outs / user: up to 10
>>> Reports: up to 100,000
>>> Data points per report: up to 100
>>>
>>> Data is currently stored in a relational database, but a migration to a
>>> different kind of store is possible.
>>>
>>> The naive algorithm for report generation can be summed up as this:
>>>
>>> for each report to be generated {
>>>   for each report data point to be calculated {
>>>     calculate data point
>>>     add data point to report
>>>   }
>>>   publish report
>>> }
>>>
>>> In order to deal with the upper limits of these values, we will need to
>>> distribute this algorithm to a compute / data cluster as much as possible.
>>>
>>> I've read about frameworks such as Apache Spark but also Hadoop, GridGain,
>>> Hazelcast and several others, and am still confused as to how each of these
>>> can help us and how they fit together.
>>>
>>> Is Spark the right framework for us?
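On scale: taking the quoted upper bounds, the raw answer data tops out around 100,000 users x 10 fill-outs x 100 questions = 100 million answer rows, which is large but comfortably within reach of a modest Spark cluster (or even a beefy single node). The quoted outer loop also maps naturally onto a distributed collection of report definitions. A rough sketch, where report_specs and generate_report are placeholders I am inventing for illustration, not a real API:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("report-generation").getOrCreate()
sc = spark.sparkContext

# Hypothetical report definitions; in practice these would be loaded from the
# reports table: which fill-outs each report covers and which data points it needs.
report_specs = [
    {"report_id": 1, "survey_id": 42, "data_points": ["avg_q1", "minmax_q2"]},
    {"report_id": 2, "survey_id": 43, "data_points": ["avg_q1"]},
]

def generate_report(spec):
    # Placeholder for "calculate data point / add data point to report".
    # This runs on the executors, so it must not touch the SparkSession;
    # fetch the fill-outs for spec["survey_id"] directly from the store instead.
    return {"report_id": spec["report_id"], "points": {}}

# The "for each report" loop becomes a distributed map over report definitions.
reports = sc.parallelize(report_specs).map(generate_report).collect()

If most data points are per-question aggregates, it is often simpler to compute them all in one groupBy pass (as in the earlier sketch) and assemble the individual reports from that result, rather than issuing one query per report.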