I think a relational database will be faster for ordinal data (e.g. questions 
answered on a scale from 1 to x). For free-text fields I would recommend Solr 
or Elasticsearch, because they offer text-analytics capabilities that do not 
exist in a relational database or MongoDB and are unlikely to appear there in 
the near future.
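
For example, a single Elasticsearch query can combine relevance-ranked full-text
matching with aggregations. A minimal sketch with the Python client (the index,
field and query values are made up for illustration):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Full-text relevance search over an analyzed text field, plus a breakdown
# of matching answers per question in the same round trip.
response = es.search(
    index="survey_answers",          # assumed index of survey fill-outs
    body={
        "query": {"match": {"free_text_answer": "delivery was late"}},
        "aggs": {"by_question": {"terms": {"field": "question_id"}}},
        "size": 5,
    },
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["free_text_answer"])

Getting even this much out of SQL typically means LIKE scans or bolting on a
separate full-text index.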

> On 06 Mar 2016, at 18:25, Guillaume Bilodeau <guillaume.bilod...@gmail.com> 
> wrote:
> 
> The data is currently stored in a relational database, but a migration to a 
> document-oriented database such as MongoDB is something we are definitely 
> considering.  How does this factor in?
> 
>> On Sun, Mar 6, 2016 at 12:23 PM, Gourav Sengupta <gourav.sengu...@gmail.com> 
>> wrote:
>> Hi,
>> 
>> That depends on a lot of things, but as a starting point I would ask whether 
>> you are planning to store your data in JSON format.
>> 
>> 
>> Regards,
>> Gourav Sengupta
>> 
>>> On Sun, Mar 6, 2016 at 5:17 PM, Laumegui Deaulobi 
>>> <guillaume.bilod...@gmail.com> wrote:
>>> Our problem space is survey analytics.  Each survey comprises a set of
>>> questions, with each question having a set of possible answers.  Survey
>>> fill-out tasks are sent to users, who have until a certain date to complete
>>> them.  Based on these survey fill-outs, reports need to be generated.  Each
>>> report deals with a subset of the survey fill-outs and comprises a set of
>>> data points (average rating for question 1, min/max for question 2, etc.).
>>> 
>>> We are dealing with rather large data sets, although reading the internet
>>> we get the impression that everyone is analyzing petabytes of data...
>>> 
>>> Users: up to 100,000
>>> Surveys: up to 100,000
>>> Questions per survey: up to 100
>>> Possible answers per question: up to 10
>>> Survey fill-outs / user: up to 10
>>> Reports: up to 100,000
>>> Data points per report: up to 100
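>>> 
>>> Back-of-the-envelope, that works out to at most 100,000 users x 10 fill-outs
>>> x 100 questions = roughly 100 million answer rows, and 100,000 reports x 100
>>> data points = 10 million data points, so tens to hundreds of millions of
>>> rows rather than petabytes.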
>>> 
>>> The data is currently stored in a relational database, but a migration to a
>>> different kind of store is possible.
>>> 
>>> The naive algorithm for report generation can be summed up as follows:
>>> 
>>> for each report to be generated {
>>>   for each report data point to be calculated {
>>>     calculate data point
>>>     add data point to report
>>>   }
>>>   publish report
>>> }
>>> 
>>> In order to deal with the upper limits of these values, we will need to
>>> distribute this algorithm across a compute / data cluster as much as possible.
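>>> 
>>> As an illustration of what we are hoping for (a purely hypothetical sketch,
>>> assuming Spark DataFrames; the table paths and column names are made up),
>>> data points such as the average/min/max rating per question would ideally
>>> become a few distributed aggregations instead of per-report loops:
>>> 
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql import functions as F
>>> 
>>> spark = SparkSession.builder.appName("survey-reports").getOrCreate()
>>> 
>>> # Assumed schemas: answers(fillout_id, question_id, rating) and
>>> # report_fillouts(report_id, fillout_id) mapping each report to its fill-outs.
>>> answers = spark.read.parquet("/data/answers")
>>> report_fillouts = spark.read.parquet("/data/report_fillouts")
>>> 
>>> data_points = (report_fillouts
>>>     .join(answers, "fillout_id")
>>>     .groupBy("report_id", "question_id")
>>>     .agg(F.avg("rating").alias("avg_rating"),
>>>          F.min("rating").alias("min_rating"),
>>>          F.max("rating").alias("max_rating")))
>>> 
>>> data_points.write.parquet("/data/report_data_points")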
>>> 
>>> I've read about frameworks such as Apache Spark, but also Hadoop, GridGain,
>>> Hazelcast and several others, and am still confused as to how each of these
>>> can help us and how they fit together.
>>> 
>>> Is Spark the right framework for us?
>>> 
>>> 
>>> 
>>> --
>>> View this message in context: 
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-right-for-us-tp26412.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
> 
