Hi,


What is the current size of your relational database?



Are we talking about a row-based RDBMS (Oracle, Sybase) or a columnar one
(Teradata/Sybase IQ)?



I assume that you will be using SQL wherever you migrate to. The
SQL-on-Hadoop tools range from well-thought-out solutions like Hive,
which can genuinely serve as your Data Warehouse infrastructure,
through relational database replacements, to plain query engines.
Many SQL query engines, whether Impala, Drill, Spark SQL or Presto,
have varying capabilities to query data held in Hive. In that role
Spark is effectively a query engine; however, you still have to
migrate your data to it first. Sqoop makes migrating data from your
RDBMS to Hive pretty straightforward (it will do table creation and
population in Hive via JDBC) - see the example below. You mentioned
Hazelcast, but that is just a Data Grid, much like Oracle Coherence.
You can of course push your data from your RDBMS to JMS or something
similar in XML format using triggers or a replication server
(GoldenGate/SAP Replication Server), and once it is in the Data Grid
you will eventually want to land that data somewhere in Big Data. I
have explained that architecture here
<https://www.linkedin.com/pulse/data-grid-big-architecture-hadoop-hive-mich-talebzadeh-ph-d-?trk=pulse_spock-articles>
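
As an illustration only (the connection string, schema and table names
below are made up for this example), a typical Sqoop import into Hive
looks something like this:

sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username app_reader -P \
  --table SALES \
  --hive-import \
  --create-hive-table \
  --hive-table staging.sales \
  --num-mappers 4

It creates the Hive table and populates it over JDBC in one pass; you
would normally run one import per source table (or script it over a
list of tables) and point --hive-table at a staging database.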


So there are a few points to consider:



   1. Choose a Data Warehouse in Big Data. The likely candidate is
   something like Hive, which supports ACID properties and is the closest
   thing to ANSI SQL on Big Data. Your users will be productive on it,
   assuming they know SQL (which they ought to).
   2. Once you have chosen your target Data Warehouse, consider the
   various query tools. Spark provides the spark-shell and spark-sql tools
   among other things, offering a SQL interface plus functional programming
   through Scala. It is a pretty impressive query engine with in-memory
   computation and DAG execution - see the sketch after this list.
   3. You can also put visualisation tools like Tableau on top for the
   user interface.
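
To give a flavour of point 2, here is a minimal sketch in Scala (Spark
1.6 style, and assuming the staging.sales Hive table from the Sqoop
example above; all names are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// In spark-shell a SparkContext (sc) and a Hive-aware sqlContext are
// typically created for you; standalone code builds them explicitly.
val sc = new SparkContext(new SparkConf().setAppName("SurveyReports"))
val hiveContext = new HiveContext(sc)

// Plain SQL against the Hive table, executed by Spark's in-memory DAG engine
val avgByRegion = hiveContext.sql(
  "SELECT region, AVG(amount) AS avg_amount FROM staging.sales GROUP BY region")
avgByRegion.show()

The same SELECT can be typed verbatim into spark-sql; the point is that
the SQL your users already know carries over, with Scala available when
they need more than SQL.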


HTH



Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 6 March 2016 at 19:14, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi,
>
> SPARK is just tooling, and it's not even tooling. You can consider SPARK a
> distributed operating system like YARN. You should read books like HADOOP
> Application Architecture, Big Data (Nathan Marz) and other disciplines
> before starting to consider how the solution is built.
>
> Most of the big data projects (like any other BI projects) do not deliver
> value or turn extremely expensive to maintain because the approach is that
> tools solve the problem.
>
>
> Regards,
> Gourav Sengupta
>
> On Sun, Mar 6, 2016 at 5:25 PM, Guillaume Bilodeau <
> guillaume.bilod...@gmail.com> wrote:
>
>> The data is currently stored in a relational database, but a migration to
>> a document-oriented database such as MongoDb is something we are definitely
>> considering.  How does this factor in?
>>
>> On Sun, Mar 6, 2016 at 12:23 PM, Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> That depends on a lot of things, but as a starting point I would ask
>>> whether you are planning to store your data in JSON format?
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Sun, Mar 6, 2016 at 5:17 PM, Laumegui Deaulobi <
>>> guillaume.bilod...@gmail.com> wrote:
>>>
>>>> Our problem space is survey analytics.  Each survey comprises a set of
>>>> questions, with each question having a set of possible answers.  Survey
>>>> fill-out tasks are sent to users, who have until a certain date to
>>>> complete
>>>> it.  Based on these survey fill-outs, reports need to be generated.
>>>> Each
>>>> report deals with a subset of the survey fill-outs, and comprises a set
>>>> of
>>>> data points (average rating for question 1, min/max for question 2,
>>>> etc.)
>>>>
>>>> We are dealing with rather large data sets - although reading the
>>>> internet
>>>> we get the impression that everyone is analyzing petabytes of data...
>>>>
>>>> Users: up to 100,000
>>>> Surveys: up to 100,000
>>>> Questions per survey: up to 100
>>>> Possible answers per question: up to 10
>>>> Survey fill-outs / user: up to 10
>>>> Reports: up to 100,000
>>>> Data points per report: up to 100
>>>>
>>>> Data is currently stored in a relational database but a migration to a
>>>> different kind of store is possible.
>>>>
>>>> The naive algorithm for report generation can be summed up as this:
>>>>
>>>> for each report to be generated {
>>>>   for each report data point to be calculated {
>>>>     calculate data point
>>>>     add data point to report
>>>>   }
>>>>   publish report
>>>> }
>>>>
>>>> In order to deal with the upper limits of these values, we will need to
>>>> distribute this algorithm to a compute / data cluster as much as
>>>> possible.
>>>>
>>>> I've read about frameworks such as Apache Spark but also Hadoop,
>>>> GridGain,
>>>> HazelCast and several others, and am still confused as to how each of
>>>> these
>>>> can help us and how they fit together.
>>>>
>>>> Is Spark the right framework for us?
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-right-for-us-tp26412.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>
>>
>
