Hi everyone,

First, thanks for taking some time on your Sunday to reply. Some points, in no particular order:
- The feedback from everyone tells me that I have a lot of reading to do first. Thanks for all the pointers.
- The data is currently stored in a row-oriented database (SQL Server 2012, to be precise), but as I said, we're open to moving it to a different kind of data store (column-oriented, document-oriented, etc.).
- I don't have precise numbers for the size of the database, but I would guess the larger ones hold around 100 GB of data. To us, this is huge; obviously, for companies such as Google, it's a second's worth of data.
- For this particular issue, we're talking about ordinal data, not free-text fields.
- I agree that Spark is tooling, but I also see it as an implementation of a specific design, namely distributed computing over a distributed data store, if I understand correctly.
- For sure, we would like to avoid introducing a new technology into the mix, so reusing the current infrastructure in a more optimal way would be our first choice.
- Our main issue is that we'd like to scale by distributing work instead of adding more memory to this single database. The current computations are done using SQL queries, and the data set does not fit in memory. So yes, we could distribute query construction and result aggregation, but the database would still be the bottleneck. That's why I'm wondering if we should investigate technologies such as Spark or Hadoop; but maybe I'm completely mistaken and we can leverage our current infrastructure.

Thanks,
GB

On Mon, Mar 7, 2016 at 3:05 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> I think the relational database will be faster for ordinal data (e.g. where
> you answer from 1..x). For free-text fields I would recommend Solr or
> Elasticsearch, because they have a lot more text-analytics capabilities
> that do not exist in a relational database or MongoDB and are not likely to
> be there in the near future.
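To make the "distribute query construction and result aggregation" idea concrete, here is a minimal pure-Python sketch of the scatter/gather shape. The names (`partition_ranges`, `run_partition_query`) and the in-memory rows are hypothetical stand-ins for range-restricted SQL queries against the real database:

```python
# Scatter/gather aggregation sketch: split the id space into ranges, run a
# partial aggregate per range, then merge the partials. The in-memory list
# stands in for SQL Server; all names and data here are hypothetical.

def partition_ranges(max_id, num_partitions):
    """Split [0, max_id) into contiguous half-open ranges."""
    step = max_id // num_partitions
    return [(i * step, max_id if i == num_partitions - 1 else (i + 1) * step)
            for i in range(num_partitions)]

def run_partition_query(rows, lo, hi):
    """Partial aggregate for one range: (sum, count) of ratings."""
    ratings = [r["rating"] for r in rows if lo <= r["id"] < hi]
    return (sum(ratings), len(ratings))

def merge(partials):
    """Gather step: combine partial (sum, count) pairs into one average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None

rows = [{"id": i, "rating": (i % 10) + 1} for i in range(1000)]
partials = [run_partition_query(rows, lo, hi)
            for lo, hi in partition_ranges(1000, 4)]
print(merge(partials))  # 5.5 -- same result as one query over all rows
```

As I understand it, this scatter step is essentially what Spark's JDBC reader does with its `partitionColumn`/`lowerBound`/`upperBound`/`numPartitions` options; but note the database still serves every read, which is exactly the bottleneck I described above.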
> On 06 Mar 2016, at 18:25, Guillaume Bilodeau <guillaume.bilod...@gmail.com> wrote:
>
> The data is currently stored in a relational database, but a migration to
> a document-oriented database such as MongoDB is something we are definitely
> considering. How does this factor in?
>
> On Sun, Mar 6, 2016 at 12:23 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
>> Hi,
>>
>> That depends on a lot of things, but as a starting point I would ask
>> whether you are planning to store your data in JSON format?
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Sun, Mar 6, 2016 at 5:17 PM, Laumegui Deaulobi <guillaume.bilod...@gmail.com> wrote:
>>
>>> Our problem space is survey analytics. Each survey comprises a set of
>>> questions, with each question having a set of possible answers. Survey
>>> fill-out tasks are sent to users, who have until a certain date to
>>> complete them. Based on these survey fill-outs, reports need to be
>>> generated. Each report deals with a subset of the survey fill-outs and
>>> comprises a set of data points (average rating for question 1, min/max
>>> for question 2, etc.).
>>>
>>> We are dealing with rather large data sets - although reading the
>>> internet we get the impression that everyone is analyzing petabytes of
>>> data...
>>>
>>> Users: up to 100,000
>>> Surveys: up to 100,000
>>> Questions per survey: up to 100
>>> Possible answers per question: up to 10
>>> Survey fill-outs per user: up to 10
>>> Reports: up to 100,000
>>> Data points per report: up to 100
>>>
>>> Data is currently stored in a relational database but a migration to a
>>> different kind of store is possible.
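To put those upper bounds in perspective, a quick back-of-envelope estimate of the raw answer volume (the 50-bytes-per-row figure is my guess, covering a few ids plus an ordinal value plus row overhead):

```python
# Back-of-envelope: maximum number of individual answers, using the upper
# bounds quoted in the original post. bytes_per_row is an assumption.
users = 100_000
fillouts_per_user = 10
questions_per_survey = 100  # each fill-out answers at most this many questions

answers = users * fillouts_per_user * questions_per_survey
print(answers)  # 100000000 -- i.e. 10^8 answer rows at the upper bound

bytes_per_row = 50  # assumed: survey/user/question ids + ordinal answer + overhead
print(answers * bytes_per_row / 1e9)  # 5.0 -- roughly 5 GB of raw answer data
```

Under these assumptions the raw ordinal answers are on the order of single-digit gigabytes; the ~100 GB database size presumably includes indexes and other tables.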
>>> The naive algorithm for report generation can be summed up as:
>>>
>>> for each report to be generated {
>>>     for each report data point to be calculated {
>>>         calculate data point
>>>         add data point to report
>>>     }
>>>     publish report
>>> }
>>>
>>> In order to deal with the upper limits of these values, we will need to
>>> distribute this algorithm across a compute/data cluster as much as
>>> possible.
>>>
>>> I've read about frameworks such as Apache Spark, but also Hadoop,
>>> GridGain, Hazelcast and several others, and am still confused as to how
>>> each of these can help us and how they fit together.
>>>
>>> Is Spark the right framework for us?
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-right-for-us-tp26412.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
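One observation on the naive loop quoted above: it is embarrassingly parallel across reports, since each report depends only on its own subset of fill-outs. A minimal pure-Python sketch of that shape, with hypothetical fill-out data and metrics:

```python
# Each report is independent, so the outer loop of the naive algorithm can
# become a parallel map over report ids. Data and metrics are hypothetical.
from statistics import mean

fillouts = [
    {"report": r, "q1": (i % 5) + 1, "q2": (i % 3) + 1}
    for r in range(3) for i in range(10)
]

def generate_report(report_id):
    """Compute all data points for one report from its own fill-outs."""
    subset = [f for f in fillouts if f["report"] == report_id]
    return {
        "report": report_id,
        "avg_q1": mean(f["q1"] for f in subset),
        "max_q2": max(f["q2"] for f in subset),
    }

# Serial form of the naive loop; the same map parallelizes trivially,
# e.g. with multiprocessing.Pool.map locally, or on a Spark cluster with
# sc.parallelize(report_ids).map(generate_report).
reports = [generate_report(r) for r in range(3)]
```

If I understand the frameworks correctly, this per-report independence is what makes the problem a good fit for a map-style engine, provided each worker can fetch its subset of fill-outs without all reads funnelling through one database.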