Andres,

A couple points:
1) If you look at my post, you can see that you could use Spark for
low-latency queries - many queries could be executed in under a second
with the right technology. It really depends on the definition of "real
time", but I believe low latency is definitely possible.

2) Akka-http over SparkContext - this is essentially what Spark Job
Server does. (It uses Spray, which is the predecessor to akka-http; we
will upgrade once Spark 2.0 is incorporated.)

3) Someone else can probably speak to Ignite, but it is based on a
distributed object cache. You define your objects as Java POJOs,
annotate which fields you want indexed, upload your jars, and then you
can execute queries. It's a different use case than typical OLAP. There
is some Spark integration, but then you would have the same bottlenecks
going through Spark.

On Fri, Mar 11, 2016 at 5:02 AM, Andrés Ivaldi <iaiva...@gmail.com> wrote:
> Nice discussion. I have a question about web services with Spark.
>
> What could be the problem with using Akka-http as the web service
> (like Play does), with one SparkContext created, and the queries over
> akka-http using only that SparkContext instance?
>
> Also, about analytics: we are working on real-time analytics, and as
> Hemant said, Spark is not a solution for low-latency queries. What
> about using Ignite for that?
>
> On Fri, Mar 11, 2016 at 6:52 AM, Hemant Bhanawat <hemant9...@gmail.com>
> wrote:
>>
>> Spark-jobserver is an elegant product that builds concurrency on top
>> of Spark. But the current design of the DAGScheduler prevents Spark
>> from becoming a truly concurrent solution for low-latency queries;
>> the DAGScheduler will turn out to be a bottleneck for such workloads.
>> The Sparrow project was an effort to make Spark more suitable for
>> such scenarios, but it never made it into the Spark codebase. If
>> Spark is to become a highly concurrent solution, scheduling has to
>> be distributed.
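To make point 2 concrete, here is a toy sketch of the long-running
shared-context pattern in plain Python (no Spark involved). A thread
pool stands in for the shared SparkContext, and the two overhead
constants are invented for illustration; none of the names below are
Spark Job Server's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

# Invented numbers, purely to contrast the two submission models.
APP_STARTUP_COST_S = 10.0  # spin up a full Spark application per request
JOB_OVERHEAD_S = 0.05      # submit a lightweight job on a shared context

def overhead_per_query(shared_context: bool) -> float:
    """Scheduling overhead paid by one query under each model."""
    if shared_context:
        return JOB_OVERHEAD_S
    return APP_STARTUP_COST_S + JOB_OVERHEAD_S

class SharedContextServer:
    """Long-running process owning one pool, the way a job server
    owns one SparkContext across many web requests."""
    def __init__(self, cores: int):
        self._pool = ThreadPoolExecutor(max_workers=cores)

    def submit_job(self, fn, *args):
        # Each request becomes a lightweight job, not a new application.
        return self._pool.submit(fn, *args)

server = SharedContextServer(cores=4)
results = [server.submit_job(lambda x: x * x, i).result() for i in range(8)]
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
print(overhead_per_query(True) < overhead_per_query(False))  # True
```

The point of the pattern is simply that the expensive startup is paid
once, so per-request latency is dominated by the job itself.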
>>
>> Hemant Bhanawat
>> www.snappydata.io
>>
>> On Fri, Mar 11, 2016 at 7:02 AM, Chris Fregly <ch...@fregly.com> wrote:
>>>
>>> Great discussion, indeed.
>>>
>>> Mark Hamstra and I spoke offline just now.
>>>
>>> Below is a quick recap of our discussion on how they've achieved
>>> acceptable performance from Spark on the user request/response path
>>> (@mark - feel free to correct/comment).
>>>
>>> 1) There is a big difference in request/response latency between
>>> submitting a full Spark Application (heavyweight) and having a
>>> long-running Spark Application (like Spark Job Server) that submits
>>> lighter-weight Jobs using a shared SparkContext. Mark is obviously
>>> using the latter - a long-running Spark App.
>>>
>>> 2) There are some enhancements to Spark that are required to achieve
>>> acceptable user request/response times. Some links that Mark
>>> provided are as follows:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-11838
>>> https://github.com/apache/spark/pull/11036
>>> https://github.com/apache/spark/pull/11403
>>> https://issues.apache.org/jira/browse/SPARK-13523
>>> https://issues.apache.org/jira/browse/SPARK-13756
>>>
>>> Essentially, a deeper level of caching at the shuffle-file layer to
>>> reduce compute and memory between queries.
>>>
>>> Note that Mark is running a slightly modified version of stock
>>> Spark. (He's mentioned this in prior posts, as well.)
>>>
>>> And I have to say that I'm, personally, seeing more and more
>>> slightly modified versions of Spark being deployed to production to
>>> work around outstanding PRs and JIRAs.
>>>
>>> This may not be what people want to hear, but it's a trend I'm
>>> seeing lately as more and more teams customize Spark to their
>>> specific use cases.
>>>
>>> Anyway, thanks for the good discussion, everyone! This is why we
>>> have these lists, right!
:)
>>>
>>> On Thu, Mar 10, 2016 at 7:51 PM, Evan Chan <velvia.git...@gmail.com>
>>> wrote:
>>>>
>>>> One of the premises here is that if you can restrict your workload
>>>> to fewer cores - which is easier with FiloDB and careful data
>>>> modeling - you can make this work for much higher concurrency and
>>>> lower latency than most typical Spark use cases.
>>>>
>>>> The reason why it typically does not work in production is that
>>>> most people are using HDFS and files. These data sources are
>>>> designed for running queries and workloads on all your cores
>>>> across many workers, not for filtering your workload down to only
>>>> one or two cores.
>>>>
>>>> There is actually nothing inherent in Spark that prevents people
>>>> from using it as an app server. However, the insistence on using
>>>> it with HDFS is what kills concurrency. This is why FiloDB is
>>>> important.
>>>>
>>>> I agree there are more optimized stacks for running app servers,
>>>> but of the choices you mentioned: ES is targeted at text search;
>>>> Cassandra and HBase by themselves are not fast enough for the
>>>> analytical queries that the OP wants; and MySQL is great but not
>>>> scalable. Something like VectorWise, HANA, or Vertica would
>>>> probably work well, but those are mostly not free solutions. Druid
>>>> could work too if the use case is right.
>>>>
>>>> Anyways, great discussion!
>>>>
>>>> On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly <ch...@fregly.com> wrote:
>>>> > You are correct, Mark. I misspoke. Apologies for the confusion.
>>>> >
>>>> > So the problem is even worse, given that a typical job requires
>>>> > multiple tasks/cores.
>>>> >
>>>> > I have yet to see this particular architecture work in
>>>> > production. I would love for someone to prove otherwise.
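Evan's premise above is just arithmetic: on a fixed cluster, the number
of queries you can run at once is bounded by how few cores each query
can be confined to. A back-of-the-envelope sketch in plain Python (the
cluster size and per-query core counts are invented for illustration):

```python
# Concurrency on a fixed cluster is total cores divided by the cores
# each query monopolizes. Careful data modeling (e.g. with FiloDB)
# shrinks the denominator.
def max_concurrent_queries(total_cores: int, cores_per_query: int) -> int:
    return total_cores // cores_per_query

CLUSTER_CORES = 128  # hypothetical cluster

# A scan-everything HDFS query fans out across most of the cluster:
print(max_concurrent_queries(CLUSTER_CORES, 64))  # 2 at a time

# A well-filtered query touching one or two partitions needs few cores:
print(max_concurrent_queries(CLUSTER_CORES, 2))   # 64 at a time
```

Same hardware, 32x the concurrency - which is why the data layout, not
Spark itself, is the lever here.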
>>>> >
>>>> > On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra
>>>> > <m...@clearstorydata.com> wrote:
>>>> >>>
>>>> >>> For example, if you're looking to scale out to 1000 concurrent
>>>> >>> requests, this is 1000 concurrent Spark jobs. This would
>>>> >>> require a cluster with 1000 cores.
>>>> >>
>>>> >> This doesn't make sense. A Spark Job is a driver/DAGScheduler
>>>> >> concept without any 1:1 correspondence between Worker cores and
>>>> >> Jobs. Cores are used to run Tasks, not Jobs. So, yes, a
>>>> >> 1000-core cluster can run at most 1000 simultaneous Tasks, but
>>>> >> that doesn't really tell you anything about how many Jobs are or
>>>> >> can be concurrently tracked by the DAGScheduler, which will be
>>>> >> apportioning the Tasks from those concurrent Jobs across the
>>>> >> available Executor cores.
>>>> >>
>>>> >> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly <ch...@fregly.com>
>>>> >> wrote:
>>>> >>>
>>>> >>> Good stuff, Evan. Looks like this is utilizing the in-memory
>>>> >>> capabilities of FiloDB, which is pretty cool. Looking forward
>>>> >>> to the webcast, as I don't know much about FiloDB.
>>>> >>>
>>>> >>> My personal thought here is to remove Spark from the user
>>>> >>> request/response hot path.
>>>> >>>
>>>> >>> I can't tell you how many times I've had to unroll that
>>>> >>> architecture at clients - and replace it with a real database
>>>> >>> like Cassandra, ElasticSearch, HBase, or MySQL.
>>>> >>>
>>>> >>> Unfortunately, Spark - and Spark Streaming, especially - leads
>>>> >>> you to believe that Spark could be used as an application
>>>> >>> server. This is not a good use case for Spark.
>>>> >>>
>>>> >>> Remember that every job that is launched by Spark requires 1
>>>> >>> CPU core, some memory, and an available Executor JVM to provide
>>>> >>> the CPU and memory.
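Mark's Jobs-vs-Tasks distinction can be sketched with a toy model in
plain Python (no Spark; a thread pool stands in for the executor cores,
and the job/task counts are invented). Jobs are driver-side
bookkeeping; only tasks occupy cores, so far more jobs than cores can
be in flight at once:

```python
from concurrent.futures import ThreadPoolExecutor
import threading

CORES = 8        # stands in for total executor cores
running = 0      # tasks currently occupying a "core"
peak = 0         # high-water mark of simultaneous tasks
lock = threading.Lock()

def task():
    """A task consumes a core slot while it runs; a job does not."""
    global running, peak
    with lock:
        running += 1
        peak = max(peak, running)
    # ... the actual work would happen here ...
    with lock:
        running -= 1

pool = ThreadPoolExecutor(max_workers=CORES)
# 100 concurrent "jobs" of 2 "tasks" each, all tracked at once:
jobs = [[pool.submit(task), pool.submit(task)] for _ in range(100)]
for job in jobs:
    for t in job:
        t.result()

print(len(jobs))       # 100 jobs tracked concurrently
print(peak <= CORES)   # True: never more than 8 tasks ran at once
```

The scheduler interleaves the 200 tasks over 8 slots; at no point does
tracking 100 jobs require 100 cores.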
>>>> >>>
>>>> >>> Yes, you can horizontally scale this because of the distributed
>>>> >>> nature of Spark; however, it is not an efficient scaling
>>>> >>> strategy.
>>>> >>>
>>>> >>> For example, if you're looking to scale out to 1000 concurrent
>>>> >>> requests, this is 1000 concurrent Spark jobs. This would
>>>> >>> require a cluster with 1000 cores. This is just not
>>>> >>> cost-effective.
>>>> >>>
>>>> >>> Use Spark for what it's good for - ad-hoc, interactive, and
>>>> >>> iterative (machine learning, graph) analytics. Use an
>>>> >>> application server for what it's good for - managing a large
>>>> >>> number of concurrent requests. And use a database for what it's
>>>> >>> good for - storing/retrieving data.
>>>> >>>
>>>> >>> And any serious production deployment will need failover,
>>>> >>> throttling, back pressure, auto-scaling, and service discovery.
>>>> >>>
>>>> >>> While Spark supports these to varying levels of
>>>> >>> production-readiness, Spark is a batch-oriented system and not
>>>> >>> meant to be put on the user request/response hot path.
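Of the production concerns listed above, throttling and back pressure
are the easiest to illustrate. A minimal sketch in plain Python, using
a semaphore as an admission throttle in front of a shared backend (the
handler, limit, and response strings are all hypothetical, not any
particular framework's API):

```python
import threading

MAX_IN_FLIGHT = 4  # invented capacity of the shared backend
slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def handle_request(query):
    """Admit the request only if a backend slot is free; otherwise
    shed load immediately instead of queueing without bound."""
    if not slots.acquire(blocking=False):
        return "503 overloaded, retry later"
    try:
        return f"result for {query}"  # run the (fake) query
    finally:
        slots.release()

print(handle_request("q1"))  # result for q1

# Simulate MAX_IN_FLIGHT requests already holding the backend:
for _ in range(MAX_IN_FLIGHT):
    slots.acquire(blocking=False)
print(handle_request("q5"))  # 503 overloaded, retry later
```

Failing fast like this is what protects a capacity-limited backend
(such as a shared SparkContext) from a flood of concurrent requests.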
>>>> >>>
>>>> >>> For the failover, throttling, back pressure, and auto-scaling
>>>> >>> that I mentioned above, it's worth checking out the suite of
>>>> >>> Netflix OSS - particularly Hystrix, Eureka, Zuul, Karyon, etc:
>>>> >>> http://netflix.github.io/
>>>> >>>
>>>> >>> Here's my GitHub project that incorporates a lot of these:
>>>> >>> https://github.com/cfregly/fluxcapacitor
>>>> >>>
>>>> >>> Here's a Netflix Skunkworks GitHub project that packages these
>>>> >>> up in Docker images:
>>>> >>> https://github.com/Netflix-Skunkworks/zerotodocker
>>>> >>>
>>>> >>> On Thu, Mar 10, 2016 at 1:40 PM, velvia.github
>>>> >>> <velvia.git...@gmail.com> wrote:
>>>> >>>>
>>>> >>>> Hi,
>>>> >>>>
>>>> >>>> I just wrote a blog post which might be really useful to you
>>>> >>>> -- I have just benchmarked being able to achieve 700 queries
>>>> >>>> per second in Spark. So, yes, web-speed SQL queries are
>>>> >>>> definitely possible. Read my new blog post:
>>>> >>>>
>>>> >>>> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
>>>> >>>>
>>>> >>>> and feel free to email me (at vel...@gmail.com) if you would
>>>> >>>> like to follow up.
>>>> >>>>
>>>> >>>> -Evan
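The core idea behind Hystrix, mentioned above, is the circuit breaker:
after enough consecutive failures, stop calling a sick backend and fail
fast with a fallback. A hypothetical miniature in plain Python (real
Hystrix adds timeouts, half-open probing, and metrics; the threshold
and names here are invented):

```python
class CircuitBreaker:
    """Opens after `threshold` consecutive failures; while open,
    returns the fallback immediately instead of calling the backend."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def call(self, fn):
        if self.open:
            return "fallback"  # fail fast, give the backend a rest
        try:
            result = fn()
            self.failures = 0  # any success resets the count
            return result
        except Exception:
            self.failures += 1
            return "fallback"

def flaky():
    raise RuntimeError("backend down")

cb = CircuitBreaker(threshold=3)
print([cb.call(flaky) for _ in range(3)])  # three real failures
print(cb.open)  # True: further calls skip the backend entirely
```

On the user request/response path, this converts a slow, cascading
failure into an immediate, cheap one.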
>>>> >>>>
>>>> >>>> ---------------------------------------------------------------------
>>>> >>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> >>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>> >>>
>>>> >>> --
>>>> >>> Chris Fregly
>>>> >>> Principal Data Solutions Engineer
>>>> >>> IBM Spark Technology Center, San Francisco, CA
>>>> >>> http://spark.tc | http://advancedspark.com
>
> --
> Ing. Ivaldi Andres