1. reading from Kafka has exactly-once guarantees - we are using it in
production today (with the direct receiver)
  * you will probably have 2 topics; loading both into Spark and joining
/ unioning as needed is not an issue (see the sketch below)
  * there are tons of optimizations you can do there, assuming everything
else works
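here's a minimal sketch of the direct approach with the two topics - topic
names, broker list and batch interval are my assumptions, adjust for your
setup:

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val conf = new SparkConf().setAppName("portfolio-streaming")
  val ssc = new StreamingContext(conf, Seconds(1))
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

  // direct streams track Kafka offsets in Spark itself, which is where the
  // exactly-once read semantics come from (no write-ahead log needed)
  val trades = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("trades"))
  val ticks = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("ticks"))
  // union/join the two as needed, e.g. trades.union(ticks) or a keyed join

note this covers exactly-once on the read side; for end-to-end exactly-once
you still want idempotent or transactional writes on the output side.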
2. for ad-hoc queries I would say you absolutely need to look at external
storage
  * querying the DStream or Spark's RDDs directly should be done mostly
for aggregates/metrics, not by users
  * if you look at HBase or Cassandra for storage then 50k writes/sec are
not a problem at all, especially combined with a smart client that does batch
puts (like asynchbase<https://github.com/OpenTSDB/asynchbase>) - see the
sketch below
  * you could also consider writing the updates to another Kafka topic and
having a different component that updates the DB, if you can think of other
optimisations there
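a rough sketch of batched puts from a streaming app with asynchbase - the
table layout, row key and ZK quorum below are made up for illustration:

  import org.hbase.async.{HBaseClient, PutRequest}

  // positions: DStream[(String, String, Double)] of (user, symbol, gainOrLoss)
  positions.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      val client = new HBaseClient("zk1,zk2,zk3")  // in practice reuse one client per JVM
      partition.foreach { case (user, symbol, gainOrLoss) =>
        // puts are asynchronous; asynchbase batches them under the hood
        client.put(new PutRequest(
          "positions".getBytes, s"$user|$symbol".getBytes,
          "d".getBytes, "gol".getBytes, gainOrLoss.toString.getBytes))
      }
      client.flush().joinUninterruptibly()    // drain buffered puts before closing
      client.shutdown().joinUninterruptibly()
    }
  }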
3. by stats I assume you mean metrics (operational or business)
  * there are multiple ways to do this, however I would not encourage you
to query Spark directly, especially if you need an archive/history of your
datapoints
  * we are using OpenTSDB (we already have an HBase cluster) + Grafana for
dashboarding
  * collecting the metrics is a bit hairy in a streaming app - we have
experimented with both accumulators and RDDs specific to metrics, and chose
the RDDs that write to OpenTSDB using foreachRDD (sketch below)
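the foreachRDD metrics pattern looks roughly like this - the metric name, tag
and the telnet-style put on port 4242 are assumptions about your OpenTSDB
setup:

  import java.io.PrintWriter
  import java.net.Socket

  // updates: the DStream whose throughput you want to chart
  updates.foreachRDD { rdd =>
    val count = rdd.count()  // one datapoint per micro-batch, computed on the cluster
    val ts = System.currentTimeMillis / 1000
    // the body of foreachRDD runs on the driver, so a short-lived
    // socket per batch is tolerable here
    val sock = new Socket("opentsdb-host", 4242)
    val out = new PrintWriter(sock.getOutputStream, true)
    out.println(s"put app.updates.count $ts $count source=driver")
    out.close(); sock.close()
  }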
-adrian
________________________________
From: Thúy Hằng Lê <[email protected]>
Sent: Sunday, September 20, 2015 7:26 AM
To: Jörn Franke
Cc: [email protected]
Subject: Re: Using Spark for portfolio manager app
Thanks Adrian and Jörn for the answers.
Yes, you're right, there are a lot of things I need to consider if I want to
use Spark for my app.
I still have a few concerns/questions about your information:
1/ I need to combine the trading stream with the tick stream; I am planning to
use Kafka for that.
If I am using approach #2 (Direct Approach) from this tutorial:
https://spark.apache.org/docs/latest/streaming-kafka-integration.html
will I receive exactly-once semantics? Or do I have to add some logic in my
code to achieve that?
As per your suggestion of using delta updates, exactly-once semantics are
required for this application.
2/ For ad-hoc queries, I must write the output of Spark to external storage
and query on that, right?
Is there any way to do ad-hoc queries on Spark directly? My application could
have 50k updates per second at peak time.
Persisting to external storage could lead to high latency in my app.
3/ How do I get real-time statistics out of Spark?
In most of the Spark Streaming examples, the statistics are echoed to stdout.
However, I want to display those statistics on a GUI; is there any way to
retrieve the data from Spark directly, without using external storage?
2015-09-19 16:23 GMT+07:00 Jörn Franke <[email protected]>:
If you want to be able to let your users query their portfolios then you may
want to think about storing the current state of the portfolios in
HBase/Phoenix; alternatively, a cluster of relational databases can make
sense. For the rest you may use Spark. (See the sketch below for the Phoenix
option.)
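A minimal sketch of the Phoenix option over JDBC - the table and column names
are invented for illustration:

  import java.sql.DriverManager

  val conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3")
  val stmt = conn.prepareStatement(
    "UPSERT INTO PORTFOLIO (USR, SYMBOL, GAIN_OR_LOSS) VALUES (?, ?, ?)")
  stmt.setString(1, "A")
  stmt.setString(2, "IBM")
  stmt.setBigDecimal(3, new java.math.BigDecimal("100"))
  stmt.executeUpdate()
  conn.commit()   // Phoenix buffers mutations until commit
  conn.close()

The users' ad-hoc query then becomes a plain
SELECT SUM(GAIN_OR_LOSS) FROM PORTFOLIO WHERE USR='A'
against Phoenix, rather than a query against Spark.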
On Sat, Sep 19, 2015 at 4:43 AM, Thúy Hằng Lê <[email protected]> wrote:
Hi all,
I am going to build a financial application for Portfolio Manager, where each
portfolio contains a list of stocks, the number of shares purchased, and the
purchase price.
Another source of information is stock prices from market data. The
application needs to calculate the real-time gain or loss of each stock in
each portfolio (compared to the purchase price).
I am new to Spark; I know that using Spark Streaming I can aggregate portfolio
positions in real-time, for example:
user A contains:
- 100 IBM stock with transactionValue=$15000
- 500 AAPL stock with transactionValue=$11400
Now, given that stock prices change in real-time too, e.g. if IBM is priced at
151, I want to update its gain or loss: gainOrLoss(IBM) = 151*100 - 15000 = $100
My questions are:
  * What is the best method to combine 2 real-time streams (transactions
made by users and market pricing data) in Spark?
  * How can I run real-time ad-hoc SQL against the portfolio's positions?
Is there any way I can do SQL on the output of Spark Streaming?
For example:
select sum(gainOrLoss) from portfolio where user='A';
  * What are the preferred external storages for Spark in this use case?
  * Is Spark the right choice for my use case?