Hi,

The healthcare industry can do wonderful things with Apache Spark.  But
there is already a very large base of data and applications firmly rooted in
the relational paradigm, and they are resistant to change - stuck on Oracle.

**
QUESTION 1 - Should we migrate legacy relational data (plus new
transactions) to distributed storage?

DISCUSSION 1 - The primary advantage I see is avoiding the lengthy (1+
year) process of building a relational data warehouse and cubes.  Instead,
just store the data in a distributed system and "analyze first" in memory
with Spark.
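
For illustration, here is a minimal "analyze first" sketch, assuming the
DataStax spark-cassandra-connector; the healthcare.claims keyspace/table
and the patient_id column are hypothetical names, not anything from our
actual schema:

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._  // adds sc.cassandraTable

    object AnalyzeFirst {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("AnalyzeFirst")
          .set("spark.cassandra.connection.host", "127.0.0.1")
        val sc = new SparkContext(conf)

        // Pull the raw table into an RDD and cache it in memory -
        // no warehouse or cube build step required.
        val claims = sc.cassandraTable("healthcare", "claims").cache()

        // Ad hoc aggregation: claim count per patient.
        val perPatient = claims
          .map(row => (row.getString("patient_id"), 1))
          .reduceByKey(_ + _)

        perPatient.take(20).foreach(println)
        sc.stop()
      }
    }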

**
QUESTION 2 - Will we have to rewrite the enormous amount of logic that is
already built for the old relational system?

DISCUSSION 2 - If we move the data to distributed storage, can we simply
run that existing relational logic as Spark SQL queries?  [existing SQL -->
Spark Context --> Cassandra --> process in Spark SQL --> display in existing
UI].  Can we create an RDD that uses the existing SQL, or do we need to
rewrite all of it?
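
My understanding (worth verifying) is that ANSI-style SELECTs largely carry
over, while Oracle-specific constructs (PL/SQL, CONNECT BY, Oracle analytic
extensions) would need rewriting.  A rough sketch of the pipeline above,
assuming the DataStax connector's Cassandra-aware SQLContext; the
healthcare.visits table and the query are made-up examples:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.cassandra.CassandraSQLContext

    object LegacySqlOnCassandra {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("LegacySqlOnCassandra")
          .set("spark.cassandra.connection.host", "127.0.0.1")
        val sc = new SparkContext(conf)

        // The connector's SQLContext resolves keyspace.table names
        // directly against Cassandra.
        val cc = new CassandraSQLContext(sc)

        // An existing relational query, reused as-is because it
        // stays within the SQL subset Spark SQL supports.
        val results = cc.sql(
          """SELECT patient_id, COUNT(*) AS visit_count
            |FROM healthcare.visits
            |GROUP BY patient_id""".stripMargin)

        // Hand the rows back to the existing UI layer.
        results.collect().foreach(println)
        sc.stop()
      }
    }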

**
DATA SIZE - We are adding many new data sources to a system that already
manages healthcare data for over a million people.  The number of rows may
not be enormous right now compared to, say, the advertising industry, but
the number of dimensions runs well into the thousands.  If we add IoT data
for each patient - billions of events per day - the row count grows by
orders of magnitude.  We would like to be prepared to handle that huge-data
scenario.
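
For that IoT volume, one pattern I have seen described is Spark Streaming
writing time-bucketed events into Cassandra.  A rough sketch, assuming
Kafka as the event source and the DataStax connector; the topic, keyspace,
table, and CSV layout are all hypothetical:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import com.datastax.spark.connector.streaming._  // saveToCassandra on DStreams

    object IotIngest {
      // The day bucket keeps Cassandra partitions bounded as
      // event volume grows.
      case class Event(deviceId: String, day: String, ts: Long, value: Double)

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("IotIngest")
          .set("spark.cassandra.connection.host", "127.0.0.1")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Consume raw CSV events from a Kafka topic.
        val lines = KafkaUtils.createStream(
          ssc, "zk-host:2181", "iot-ingest", Map("patient-events" -> 4)).map(_._2)

        val events = lines.map { line =>
          val Array(id, day, ts, v) = line.split(",")
          Event(id, day, ts.toLong, v.toDouble)
        }

        // Append straight into a time-bucketed Cassandra table.
        events.saveToCassandra("healthcare", "iot_events")

        ssc.start()
        ssc.awaitTermination()
      }
    }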


