Andres,

A couple points:
1) If you look at my post, you can see that you could use Spark for
low-latency queries - many queries could be executed in under a second
with the right technology. It really depends on the definition of "real
time", but I believe low latency is definitely possible.

2) Akka-http over SparkContext - this is essentially what Spark Job
Server does. (It uses Spray, which is the predecessor to akka-http; we
will upgrade once Spark 2.0 is incorporated.)

3) Someone else can probably speak to Ignite, but it is based on a
distributed object cache. You define your objects as Java POJOs,
annotate which fields you want indexed, upload your jars, and then you
can execute queries. It's a different use case than typical OLAP. There
is some Spark integration, but then you would have the same bottlenecks
going through Spark.

On Fri, Mar 11, 2016 at 5:02 AM, Andrés Ivaldi <iaiva...@gmail.com> wrote:
> Nice discussion. I have a question about web services with Spark.
>
> What could be the problem with using Akka-http as the web service
> (like Play does), with one SparkContext created, and the queries over
> akka-http using only that SparkContext instance?
>
> Also, about analytics: we are working on real-time analytics, and as
> Hemant said, Spark is not a solution for low-latency queries. What
> about using Ignite for that?
>
> On Fri, Mar 11, 2016 at 6:52 AM, Hemant Bhanawat <hemant9...@gmail.com>
> wrote:
>>
>> Spark-jobserver is an elegant product that builds concurrency on top
>> of Spark. But the current design of the DAGScheduler prevents Spark
>> from becoming a truly concurrent solution for low-latency queries;
>> the DAGScheduler will turn out to be a bottleneck for such workloads.
>> The Sparrow project was an effort to make Spark more suitable for
>> such scenarios, but it never made it into the Spark codebase. If
>> Spark is to become a highly concurrent solution, scheduling has to
>> be distributed.
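To make point 2 concrete, here is a toy sketch of the long-running
shared-context pattern in plain Python (no Spark involved). A thread
pool stands in for the shared SparkContext, and the two overhead
constants are invented for illustration; none of the names below are
Spark Job Server's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

# Invented numbers, purely to contrast the two submission models.
APP_STARTUP_COST_S = 10.0  # spin up a full Spark application per request
JOB_OVERHEAD_S = 0.05      # submit a lightweight job on a shared context

def overhead_per_query(shared_context: bool) -> float:
    """Scheduling overhead paid by one query under each model."""
    if shared_context:
        return JOB_OVERHEAD_S
    return APP_STARTUP_COST_S + JOB_OVERHEAD_S

class SharedContextServer:
    """Long-running process owning one pool, the way a job server
    owns one SparkContext across many web requests."""
    def __init__(self, cores: int):
        self._pool = ThreadPoolExecutor(max_workers=cores)

    def submit_job(self, fn, *args):
        # Each request becomes a lightweight job, not a new application.
        return self._pool.submit(fn, *args)

server = SharedContextServer(cores=4)
results = [server.submit_job(lambda x: x * x, i).result() for i in range(8)]
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
print(overhead_per_query(True) < overhead_per_query(False))  # True
```

The point of the pattern is simply that the expensive startup is paid
once, so per-request latency is dominated by the job itself.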
>>
>> Hemant Bhanawat
>> www.snappydata.io
>>
>> On Fri, Mar 11, 2016 at 7:02 AM, Chris Fregly <ch...@fregly.com> wrote:
>>>
>>> Great discussion, indeed.
>>>
>>> Mark Hamstra and I spoke offline just now.
>>>
>>> Below is a quick recap of our discussion on how they've achieved
>>> acceptable performance from Spark on the user request/response path
>>> (@mark - feel free to correct/comment).
>>>
>>> 1) There is a big difference in request/response latency between
>>> submitting a full Spark Application (heavyweight) and having a
>>> long-running Spark Application (like Spark Job Server) that submits
>>> lighter-weight Jobs using a shared SparkContext. Mark is obviously
>>> using the latter - a long-running Spark App.
>>>
>>> 2) There are some enhancements to Spark that are required to achieve
>>> acceptable user request/response times. Some links that Mark
>>> provided are as follows:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-11838
>>> https://github.com/apache/spark/pull/11036
>>> https://github.com/apache/spark/pull/11403
>>> https://issues.apache.org/jira/browse/SPARK-13523
>>> https://issues.apache.org/jira/browse/SPARK-13756
>>>
>>> Essentially, a deeper level of caching at the shuffle-file layer to
>>> reduce compute and memory between queries.
>>>
>>> Note that Mark is running a slightly modified version of stock
>>> Spark. (He's mentioned this in prior posts, as well.)
>>>
>>> And I have to say that I'm, personally, seeing more and more
>>> slightly modified versions of Spark being deployed to production to
>>> work around outstanding PRs and JIRAs.
>>>
>>> This may not be what people want to hear, but it's a trend I'm
>>> seeing lately as more and more teams customize Spark to their
>>> specific use cases.
>>>
>>> Anyway, thanks for the good discussion, everyone! This is why we
>>> have these lists, right!
:)
>>>
>>> On Thu, Mar 10, 2016 at 7:51 PM, Evan Chan <velvia.git...@gmail.com>
>>> wrote:
>>>>
>>>> One of the premises here is that if you can restrict your workload
>>>> to fewer cores - which is easier with FiloDB and careful data
>>>> modeling - you can make this work for much higher concurrency and
>>>> lower latency than most typical Spark use cases.
>>>>
>>>> The reason why it typically does not work in production is that
>>>> most people are using HDFS and files. These data sources are
>>>> designed for running queries and workloads on all your cores
>>>> across many workers, not for filtering your workload down to only
>>>> one or two cores.
>>>>
>>>> There is actually nothing inherent in Spark that prevents people
>>>> from using it as an app server. However, the insistence on using
>>>> it with HDFS is what kills concurrency. This is why FiloDB is
>>>> important.
>>>>
>>>> I agree there are more optimized stacks for running app servers,
>>>> but of the choices you mentioned: ES is targeted at text search;
>>>> Cassandra and HBase by themselves are not fast enough for the
>>>> analytical queries that the OP wants; and MySQL is great but not
>>>> scalable. Something like VectorWise, HANA, or Vertica would
>>>> probably work well, but those are mostly not free solutions. Druid
>>>> could work too if the use case is right.
>>>>
>>>> Anyways, great discussion!
>>>>
>>>> On Thu, Mar 10, 2016 at 2:46 PM, Chris Fregly <ch...@fregly.com> wrote:
>>>> > You are correct, Mark. I misspoke. Apologies for the confusion.
>>>> >
>>>> > So the problem is even worse, given that a typical job requires
>>>> > multiple tasks/cores.
>>>> >
>>>> > I have yet to see this particular architecture work in
>>>> > production. I would love for someone to prove otherwise.
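Evan's premise above is just arithmetic: on a fixed cluster, the number
of queries you can run at once is bounded by how few cores each query
can be confined to. A back-of-the-envelope sketch in plain Python (the
cluster size and per-query core counts are invented for illustration):

```python
# Concurrency on a fixed cluster is total cores divided by the cores
# each query monopolizes. Careful data modeling (e.g. with FiloDB)
# shrinks the denominator.
def max_concurrent_queries(total_cores: int, cores_per_query: int) -> int:
    return total_cores // cores_per_query

CLUSTER_CORES = 128  # hypothetical cluster

# A scan-everything HDFS query fans out across most of the cluster:
print(max_concurrent_queries(CLUSTER_CORES, 64))  # 2 at a time

# A well-filtered query touching one or two partitions needs few cores:
print(max_concurrent_queries(CLUSTER_CORES, 2))   # 64 at a time
```

Same hardware, 32x the concurrency - which is why the data layout, not
Spark itself, is the lever here.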
>>>> >
>>>> > On Thu, Mar 10, 2016 at 5:44 PM, Mark Hamstra
>>>> > <m...@clearstorydata.com> wrote:
>>>> >>>
>>>> >>> For example, if you're looking to scale out to 1000 concurrent
>>>> >>> requests, this is 1000 concurrent Spark jobs. This would
>>>> >>> require a cluster with 1000 cores.
>>>> >>
>>>> >> This doesn't make sense. A Spark Job is a driver/DAGScheduler
>>>> >> concept without any 1:1 correspondence between Worker cores and
>>>> >> Jobs. Cores are used to run Tasks, not Jobs. So, yes, a
>>>> >> 1000-core cluster can run at most 1000 simultaneous Tasks, but
>>>> >> that doesn't really tell you anything about how many Jobs are or
>>>> >> can be concurrently tracked by the DAGScheduler, which will be
>>>> >> apportioning the Tasks from those concurrent Jobs across the
>>>> >> available Executor cores.
>>>> >>
>>>> >> On Thu, Mar 10, 2016 at 2:00 PM, Chris Fregly <ch...@fregly.com>
>>>> >> wrote:
>>>> >>>
>>>> >>> Good stuff, Evan. Looks like this is utilizing the in-memory
>>>> >>> capabilities of FiloDB, which is pretty cool. Looking forward
>>>> >>> to the webcast, as I don't know much about FiloDB.
>>>> >>>
>>>> >>> My personal thought here is to remove Spark from the user
>>>> >>> request/response hot path.
>>>> >>>
>>>> >>> I can't tell you how many times I've had to unroll that
>>>> >>> architecture at clients - and replace it with a real database
>>>> >>> like Cassandra, ElasticSearch, HBase, or MySQL.
>>>> >>>
>>>> >>> Unfortunately, Spark - and Spark Streaming, especially - leads
>>>> >>> you to believe that Spark could be used as an application
>>>> >>> server. This is not a good use case for Spark.
>>>> >>>
>>>> >>> Remember that every job that is launched by Spark requires 1
>>>> >>> CPU core, some memory, and an available Executor JVM to provide
>>>> >>> the CPU and memory.
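Mark's Jobs-vs-Tasks distinction can be sketched with a toy model in
plain Python (no Spark; a thread pool stands in for the executor cores,
and the job/task counts are invented). Jobs are driver-side
bookkeeping; only tasks occupy cores, so far more jobs than cores can
be in flight at once:

```python
from concurrent.futures import ThreadPoolExecutor
import threading

CORES = 8        # stands in for total executor cores
running = 0      # tasks currently occupying a "core"
peak = 0         # high-water mark of simultaneous tasks
lock = threading.Lock()

def task():
    """A task consumes a core slot while it runs; a job does not."""
    global running, peak
    with lock:
        running += 1
        peak = max(peak, running)
    # ... the actual work would happen here ...
    with lock:
        running -= 1

pool = ThreadPoolExecutor(max_workers=CORES)
# 100 concurrent "jobs" of 2 "tasks" each, all tracked at once:
jobs = [[pool.submit(task), pool.submit(task)] for _ in range(100)]
for job in jobs:
    for t in job:
        t.result()

print(len(jobs))       # 100 jobs tracked concurrently
print(peak <= CORES)   # True: never more than 8 tasks ran at once
```

The scheduler interleaves the 200 tasks over 8 slots; at no point does
tracking 100 jobs require 100 cores.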
>>>> >>>
>>>> >>> Yes, you can horizontally scale this because of the distributed
>>>> >>> nature of Spark; however, it is not an efficient scaling
>>>> >>> strategy.
>>>> >>>
>>>> >>> For example, if you're looking to scale out to 1000 concurrent
>>>> >>> requests, this is 1000 concurrent Spark jobs. This would
>>>> >>> require a cluster with 1000 cores. This is just not
>>>> >>> cost-effective.
>>>> >>>
>>>> >>> Use Spark for what it's good for - ad-hoc, interactive, and
>>>> >>> iterative (machine learning, graph) analytics. Use an
>>>> >>> application server for what it's good for - managing a large
>>>> >>> number of concurrent requests. And use a database for what it's
>>>> >>> good for - storing/retrieving data.
>>>> >>>
>>>> >>> And any serious production deployment will need failover,
>>>> >>> throttling, back pressure, auto-scaling, and service discovery.
>>>> >>>
>>>> >>> While Spark supports these to varying levels of
>>>> >>> production-readiness, Spark is a batch-oriented system and not
>>>> >>> meant to be put on the user request/response hot path.
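Of the production concerns listed above, throttling and back pressure
are the easiest to illustrate. A minimal sketch in plain Python, using
a semaphore as an admission throttle in front of a shared backend (the
handler, limit, and response strings are all hypothetical, not any
particular framework's API):

```python
import threading

MAX_IN_FLIGHT = 4  # invented capacity of the shared backend
slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def handle_request(query):
    """Admit the request only if a backend slot is free; otherwise
    shed load immediately instead of queueing without bound."""
    if not slots.acquire(blocking=False):
        return "503 overloaded, retry later"
    try:
        return f"result for {query}"  # run the (fake) query
    finally:
        slots.release()

print(handle_request("q1"))  # result for q1

# Simulate MAX_IN_FLIGHT requests already holding the backend:
for _ in range(MAX_IN_FLIGHT):
    slots.acquire(blocking=False)
print(handle_request("q5"))  # 503 overloaded, retry later
```

Failing fast like this is what protects a capacity-limited backend
(such as a shared SparkContext) from a flood of concurrent requests.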
>>>> >>>
>>>> >>> For the failover, throttling, back pressure, and auto-scaling
>>>> >>> that I mentioned above, it's worth checking out the suite of
>>>> >>> Netflix OSS - particularly Hystrix, Eureka, Zuul, Karyon, etc:
>>>> >>> http://netflix.github.io/
>>>> >>>
>>>> >>> Here's my GitHub project that incorporates a lot of these:
>>>> >>> https://github.com/cfregly/fluxcapacitor
>>>> >>>
>>>> >>> Here's a Netflix Skunkworks GitHub project that packages these
>>>> >>> up in Docker images:
>>>> >>> https://github.com/Netflix-Skunkworks/zerotodocker
>>>> >>>
>>>> >>> On Thu, Mar 10, 2016 at 1:40 PM, velvia.github
>>>> >>> <velvia.git...@gmail.com> wrote:
>>>> >>>>
>>>> >>>> Hi,
>>>> >>>>
>>>> >>>> I just wrote a blog post which might be really useful to you
>>>> >>>> -- I have just benchmarked being able to achieve 700 queries
>>>> >>>> per second in Spark. So, yes, web-speed SQL queries are
>>>> >>>> definitely possible. Read my new blog post:
>>>> >>>>
>>>> >>>> http://velvia.github.io/Spark-Concurrent-Fast-Queries/
>>>> >>>>
>>>> >>>> and feel free to email me (at vel...@gmail.com) if you would
>>>> >>>> like to follow up.
>>>> >>>>
>>>> >>>> -Evan
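The core idea behind Hystrix, mentioned above, is the circuit breaker:
after enough consecutive failures, stop calling a sick backend and fail
fast with a fallback. A hypothetical miniature in plain Python (real
Hystrix adds timeouts, half-open probing, and metrics; the threshold
and names here are invented):

```python
class CircuitBreaker:
    """Opens after `threshold` consecutive failures; while open,
    returns the fallback immediately instead of calling the backend."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def call(self, fn):
        if self.open:
            return "fallback"  # fail fast, give the backend a rest
        try:
            result = fn()
            self.failures = 0  # any success resets the count
            return result
        except Exception:
            self.failures += 1
            return "fallback"

def flaky():
    raise RuntimeError("backend down")

cb = CircuitBreaker(threshold=3)
print([cb.call(flaky) for _ in range(3)])  # three real failures
print(cb.open)  # True: further calls skip the backend entirely
```

On the user request/response path, this converts a slow, cascading
failure into an immediate, cheap one.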
>>>> >>>>
>>>> >>>> ---------------------------------------------------------------------
>>>> >>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> >>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>> >>>
>>>> >>> --
>>>> >>> Chris Fregly
>>>> >>> Principal Data Solutions Engineer
>>>> >>> IBM Spark Technology Center, San Francisco, CA
>>>> >>> http://spark.tc | http://advancedspark.com
>
> --
> Ing. Ivaldi Andres