RE: RE: Fast write datastore...

yohann jardin Thu, 16 Mar 2017 02:45:00 -0700

Hello everyone,

I'm also really interested in the answers, as I will be facing the same issue 
soon.
Muthu, if you evaluate Apache Ignite again, could you share your results? I 
also noticed Alluxio, which can keep Spark results in memory and might be 
worth investigating.

In my case I want to use them to back a real-time dashboard (or at least one 
that takes only a few seconds to refine), and that use case seems similar to 
your filtering/aggregating of previously computed Spark results.

Regards,
Yohann

________________________________
From: Rick Moritz <rah...@gmail.com>
Sent: Thursday, March 16, 2017 10:37
To: user
Subject: Re: RE: Fast write datastore...

If you have enough RAM/SSDs available, tiered HDFS storage with Parquet 
might also be an option. Of course, management-wise it has much more overhead 
than using ES, since you need to manually define partitions and buckets, which 
is suboptimal. On the other hand, for querying, you can probably get decent 
performance by hooking up Impala, Presto, or Hive LLAP if Spark is too slow or 
cumbersome.
Depending on your particular access patterns, this may not be very practical, 
but as a general approach it might be one way to get intermediate results 
more quickly, and with less of a storage zoo than some alternatives.
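
As a rough illustration of the manual partition layout mentioned above, here 
is a minimal PySpark-style sketch. The helper name write_partitioned_parquet, 
the path, and the column name in the usage comment are hypothetical; the sketch 
assumes df is a Spark DataFrame and uses only the standard DataFrameWriter 
methods mode, partitionBy, and parquet.

```python
def write_partitioned_parquet(df, path, partition_cols, mode="overwrite"):
    """Write a Spark DataFrame as Parquet, one directory per partition value."""
    if not partition_cols:
        raise ValueError("at least one partition column is required")
    # partitionBy lays the files out as <path>/<col>=<value>/..., so query
    # engines can skip whole directories when a filter is on those columns.
    (df.write
       .mode(mode)
       .partitionBy(*partition_cols)
       .parquet(path))


# Hypothetical usage, assuming results_df has an "event_date" column:
# write_partitioned_parquet(results_df, "hdfs:///results/agg", ["event_date"])
```

The partition columns have to be chosen by hand to match the expected filters, 
which is the management overhead referred to above.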

On Thu, Mar 16, 2017 at 7:57 AM, Shiva Ramagopal <tr.s...@gmail.com> wrote:
I do think Kafka is overkill in this case. There are no streaming use cases 
that need a queue to do pub-sub.

On 16-Mar-2017 11:47 AM, "vvshvv" <vvs...@gmail.com> wrote:
Hi,

>> A slightly over-kill solution may be Spark to Kafka to ElasticSearch?

I do not think so. In this case you will still be able to process the Parquet 
files as usual, and Kafka will let your Elasticsearch cluster stay stable 
regardless of the number of rows.

Regards,
Uladzimir

On Mar 16, 2017 7:52 AM, jasbir.s...@accenture.com wrote:
Hi,

Would MongoDB not fit this use case?

From: Vova Shelgunov [mailto:vvs...@gmail.com]
Sent: Wednesday, March 15, 2017 11:51 PM
To: Muthu Jayakumar <bablo...@gmail.com>
Cc: vincent gromakowski <vincent.gromakow...@gmail.com>; Richard Siebeling 
<rsiebel...@gmail.com>; user <user@spark.apache.org>; Shiva Ramagopal 
<tr.s...@gmail.com>
Subject: Re: Fast write datastore...

Hi Muthu,

I could not tell from your message: what performance do you expect from the 
subsequent queries?

Regards,
Uladzimir

On Mar 15, 2017 9:03 PM, "Muthu Jayakumar" <bablo...@gmail.com> wrote:
Hello Uladzimir / Shiva,

From the ElasticSearch documentation (I still have to look at the logical plan 
of a query to confirm), the richness of filters (regex, etc.) is pretty good 
compared to Cassandra. As for aggregates, I think Spark DataFrames are rich 
enough to handle them.
Let me know your thoughts.

Thanks,
Muthu

On Wed, Mar 15, 2017 at 10:55 AM, vvshvv <vvs...@gmail.com> wrote:
Hi Muthu,

I agree with Shiva. Cassandra also supports SASI indexes, which can partially 
replace Elasticsearch functionality.

Regards,
Uladzimir

On Mar 15, 2017 5:57 PM, Shiva Ramagopal <tr.s...@gmail.com> wrote:
Cassandra is probably a good choice if you are mainly looking for a datastore 
that supports fast writes. You can ingest the data into a table and define one 
or more materialized views on top of it to support your queries. Since you 
mention that your queries are going to be simple, you can define the keys of 
the materialized views according to how you want to query the data.
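
To make the materialized-view suggestion concrete, here is a small sketch that 
only builds the CQL text for such a view; it does not talk to a cluster. The 
function name and the keyspace, table, and column names are all hypothetical, 
and SELECT * is used purely for brevity.

```python
def materialized_view_cql(keyspace, base_table, view_name, base_key, view_partition_key):
    """Build a CREATE MATERIALIZED VIEW statement for a new query pattern."""
    # The view's primary key must include every primary-key column of the
    # base table; the column to query by goes first as the partition key.
    key_cols = [view_partition_key] + [c for c in base_key if c != view_partition_key]
    # Every column in the view's primary key needs an IS NOT NULL restriction.
    not_null = " AND ".join(f"{c} IS NOT NULL" for c in key_cols)
    primary_key = ", ".join(key_cols)
    return "\n".join([
        f"CREATE MATERIALIZED VIEW {keyspace}.{view_name} AS",
        f"SELECT * FROM {keyspace}.{base_table}",
        f"WHERE {not_null}",
        f"PRIMARY KEY ({primary_key});",
    ])


# Hypothetical example: a base table keyed by id, to be queried by region.
print(materialized_view_cql("analytics", "results", "results_by_region", ["id"], "region"))
```

Each additional way of querying the data needs its own view defined in advance, 
so this approach suits queries that are known up front.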
Thanks,
Shiva

On Wed, Mar 15, 2017 at 7:58 PM, Muthu Jayakumar <bablo...@gmail.com> wrote:
Hello Vincent,

Cassandra may not fit the bill if I need to define my partitions and other 
indexes upfront. Is this right?

Hello Richard,

Let me evaluate Apache Ignite. I did evaluate it three months ago, and back 
then the connector to Apache Spark did not support Spark 2.0.

Another drastic thought may be to repartition the result down to 1 partition 
(though I have to be cautious to make sure I don't run into heap issues if the 
result is too large to fit into an executor) and write to a relational 
database like MySQL / Postgres. But I believe I can do the same using 
ElasticSearch too.
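
The reduce-partitions-then-write idea above could be sketched roughly as 
follows. This assumes df is a Spark DataFrame; the helper name 
write_results_jdbc and the URL and table in the usage comment are hypothetical. 
It relies on DataFrame.coalesce and DataFrameWriter.jdbc, and on the fact that 
Spark writes each partition over its own JDBC connection.

```python
def write_results_jdbc(df, url, table, max_connections=1, properties=None):
    """Write a Spark DataFrame over JDBC using at most max_connections writers."""
    if max_connections < 1:
        raise ValueError("max_connections must be at least 1")
    # coalesce() merges partitions without a full shuffle. Each remaining
    # partition is written by one task over its own JDBC connection, so this
    # bounds the number of concurrent connections to the database.
    (df.coalesce(max_connections)
       .write
       .jdbc(url=url, table=table, mode="append", properties=properties or {}))


# Hypothetical usage:
# write_results_jdbc(results_df, "jdbc:postgresql://dbhost/analytics",
#                    "agg_results", max_connections=4,
#                    properties={"user": "spark", "driver": "org.postgresql.Driver"})
```

Using a small number greater than 1 keeps each partition smaller than a single 
partition would be, which eases the heap concern while still limiting the load 
on the database.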

A slightly over-kill solution may be Spark to Kafka to ElasticSearch?

More thoughts welcome please.

Thanks,
Muthu

On Wed, Mar 15, 2017 at 4:53 AM, Richard Siebeling <rsiebel...@gmail.com> 
wrote:
Maybe Apache Ignite fits your requirements.

On 15 March 2017 at 08:44, vincent gromakowski 
<vincent.gromakow...@gmail.com> wrote:
Hi
If the queries are static and the filters are on the same columns, Cassandra 
is a good option.

On Mar 15, 2017 7:04 AM, "muthu" <bablo...@gmail.com> wrote:
Hello there,

I have one or more Parquet files to read and perform some aggregate queries on
using Spark DataFrames. I would like to find a reasonably fast datastore that
allows me to write the results for subsequent (simpler) queries.
I did attempt to use ElasticSearch to write the query results using the
ElasticSearch Hadoop connector, but I am running into connector write issues
if there are too many Spark executors for ElasticSearch to handle.
In the schema sense, though, this seems a great fit, as ElasticSearch has smarts
in place to discover the schema. Also, in the query sense, I can perform
simple filters and sorts using ElasticSearch, and for more complex aggregates
Spark DataFrames can come back to the rescue :).
Please advise on other possible datastores I could use.
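
One possible way to ease the connector write issues described above is to cap 
the number of concurrent writers and shrink the bulk requests before handing 
the DataFrame to the ElasticSearch Hadoop connector. The sketch below assumes 
df is a Spark DataFrame and that the connector is on the classpath; the helper 
name write_results_es and all numeric values are illustrative assumptions, not 
tuned recommendations. The option keys are es-hadoop configuration settings.

```python
def write_results_es(df, resource, nodes, max_writers=4,
                     batch_entries=500, retry_count=6):
    """Write a Spark DataFrame to ElasticSearch with a bounded number of writers."""
    if max_writers < 1:
        raise ValueError("max_writers must be at least 1")
    # Fewer partitions means fewer tasks sending bulk requests at once.
    (df.coalesce(max_writers)
       .write
       .format("org.elasticsearch.spark.sql")
       .option("es.nodes", nodes)
       # Smaller bulk requests and more retries give a busy cluster room
       # to absorb the writes instead of rejecting them.
       .option("es.batch.size.entries", str(batch_entries))
       .option("es.batch.write.retry.count", str(retry_count))
       .mode("append")
       .save(resource))


# Hypothetical usage:
# write_results_es(results_df, "agg_results/doc", "es-host:9200", max_writers=8)
```

Whether this is enough depends on the cluster size and the volume of results, 
so the values would need to be adjusted by experiment.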

Thanks,
Muthu

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Fast-write-datastore-tp28497.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
