>> We service templated queries from the appserver, i.e. user fills
>> out some forms, dropdowns: we translate to a query.

and

>> The target data size is about a billion records, 20-ish fields,
>> distributed throughout a year (about 50 GB on disk as CSV, uncompressed).

tells me that a proprietary in-memory app will be the best option for you.
I do not see a need for either Spark or Redshift in your case.

On Tue, Nov 4, 2014 at 5:41 PM, agfung <agf...@gmail.com> wrote:

> Sounds like context would help, I just didn't want to subject people to
> a wall of text if it wasn't necessary :)
>
> Currently we use neither Spark SQL (nor anything else in the Hadoop
> stack) nor Redshift. We service templated queries from the appserver,
> i.e. the user fills out some forms and dropdowns, and we translate that
> into a query.
>
> The data is "basically" one table containing thousands of independent
> time series, with one or two tables of reference data to join to. A
> typical query: the median value of Field1 from Table1 where Field2 from
> Table2 matches filter X, T1 and T2 joining on a surrogate key, grouped
> by a different Field3 (a sketch of such a query follows after this
> message). The data structure is a little dynamic: a user can upload any
> CSV, as long as they tell us the name and programmatic type of each
> column. The target data size is about a billion records, 20-ish fields,
> distributed throughout a year (about 50 GB on disk as uncompressed CSV).
>
> So we're currently doing "historical" analytics (e.g. seeing analytic
> results over yesterday's data or older, but seeing them "quickly"). We
> eventually intend to do "realtime" (or "streaming") analytics, i.e.
> seeing the impact of new data on analytics "quickly". Machine learning
> is also on the roadmap.
>
> One proposition is Spark SQL as a complete replacement for Redshift.
> It would simplify the architecture, since our long-term strategy is to
> handle data intake and ETL on HDFS (regardless of Redshift or Spark
> SQL). Which other parts of the Hadoop family would come into play for
> ETL is undetermined right now. Spark SQL appears to have relational
> ability, and if we're going to use the Hadoop stack for ML and
> streaming analytics anyway, and it has the ability, why not do it all
> on one stack and avoid shoveling data around? Also, lots of people are
> talking about it.
>
> The other proposition is Redshift as the historical analytics solution,
> and something else (could be Spark, doesn't matter) for streaming
> analytics and ML. If we need to relate the two, we'll have an API or a
> process to stitch them together. I've read about the "lambda
> architecture", which more or less describes this approach. The
> motivation is that Redshift has the AWS reliability, scalability, and
> operational concerns worked out and offers a richer query language (SQL
> and pgsql functions designed for slice-and-dice analytics), so we can
> spend our coding time elsewhere, plus a measure of safety against
> design issues and bugs: Spark only came out of incubator status this
> year, and it's much easier to find people on the web raving about
> Redshift in real-world usage (i.e. as part of a live, client-facing
> system) than about Spark.
>
> category_theory's observation that most of the speed comes from fitting
> in memory is helpful. It's what I would have surmised from the AMPLab
> Big Data benchmark, but confirmation from the hands-on community is
> invaluable, thank you.
>
> I understand a lot of it simply comes down to what-do-you-value-more
> weightings, and we'll do prototypes/benchmarks if we have to; I just
> wasn't sure whether there were any other key assumptions, requirements,
> or gotchas to consider.
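
For concreteness, here is a rough sketch of how the dynamic-schema CSV
intake and the templated query described above could look on Spark SQL.
Everything here is illustrative: the table, field, and path names are
placeholders invented from the description, percentile_approx stands in
for an exact median, and the sketch assumes the modern SparkSession API
(the Spark SQL of late 2014 worked with SchemaRDDs instead):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, LongType, DoubleType, StringType}

object TemplatedQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("templated-query-sketch").getOrCreate()

    // The user declares each column's name and programmatic type at
    // upload time; that declaration maps directly onto an explicit
    // schema, so no inference pass over the 50 GB of CSV is needed.
    val table1Schema = StructType(Seq(
      StructField("surrogate_key", LongType),
      StructField("field1", DoubleType),
      StructField("field3", StringType)))
    val table2Schema = StructType(Seq(
      StructField("surrogate_key", LongType),
      StructField("field2", StringType)))

    spark.read.schema(table1Schema).csv("hdfs:///uploads/table1/")
      .createOrReplaceTempView("t1")
    spark.read.schema(table2Schema).csv("hdfs:///uploads/table2/")
      .createOrReplaceTempView("t2")

    // The templated query: median of Field1, filtered on Field2 from the
    // reference table, joined on the surrogate key, grouped by Field3.
    val result = spark.sql("""
      SELECT t1.field3,
             percentile_approx(t1.field1, 0.5) AS median_field1
      FROM t1
      JOIN t2 ON t1.surrogate_key = t2.surrogate_key
      WHERE t2.field2 = 'X'
      GROUP BY t1.field3
    """)
    result.show()
  }
}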