From the information you provided, I would tackle this as a batch problem, because that way you have access to more sophisticated techniques and more flexibility (maybe HDFS and a Spark job, but also think about a datastore offering good indexes for the kinds of data types and values your keys take, so you can benefit from filter push-downs).
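For example, since there are fewer than 10K filter pairs, you could load them into a small DataFrame and broadcast-join it against the records, instead of generating a giant OR predicate. A minimal sketch (paths, formats, and column names are placeholders, not your actual setup):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("pair-filter").getOrCreate()

    // Records as (key1, key2, value); the input path and CSV layout are
    // placeholders for however the text files actually look.
    val records = spark.read
      .csv("hdfs:///input/records")
      .toDF("key1", "key2", "value")

    // The ~3K filter pairs, reloaded every 2-3 days from the database or a
    // dump of it (again, source and format are placeholders).
    val filterPairs = spark.read
      .option("header", "true")
      .csv("hdfs:///input/filter_pairs.csv")
      .select("key1", "key2")

    // Broadcasting the small side turns the 3K-clause OR into a hash lookup
    // per record, so the SQL parser never sees a huge predicate.
    val matched = records.join(broadcast(filterPairs), Seq("key1", "key2"))

    // Report occurrences per (key1, key2) pair.
    matched.groupBy("key1", "key2").count().show()

With fewer than 10K pairs, the broadcast side easily fits in memory on every executor.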
I personally use streaming only when real-time ingestion is needed.

Hth,
Alessandro

On 15 May 2018 at 09:11, onmstester onmstester <onmstes...@zoho.com> wrote:

>> How many distinct key1 (resp. key2) values do you have? Are these values
>> reasonably stable over time?
>
> Less than 10 thousand, and these filters would change every 2-3 days. They
> would be written to and loaded from a database.
>
>> Are these records ingested in real-time or are they loaded from a
>> datastore?
>
> Records would be loaded from text files that are copied into some
> directory over and over.
>
> Are you suggesting that I don't need to use spark-streaming?
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
>
> ---- On Tue, 15 May 2018 11:26:42 +0430 Alessandro Solimando
> <alessandro.solima...@gmail.com> wrote ----
>
> Hi,
> I am not familiar with ATNConfigSet, but here are some thoughts that might
> help.
>
> How many distinct key1 (resp. key2) values do you have? Are these values
> reasonably stable over time?
>
> Are these records ingested in real-time, or are they loaded from a
> datastore?
>
> In the latter case, the DB might be able to perform the filtering
> efficiently, especially if equipped with a proper index over key1/key2 (or
> a composite one). The filter push-down could then be very effective (I
> didn't get whether you just need to count or do something more with the
> matching records).
>
> Alternatively, you could group by (key1, key2) and then filter (again, it
> depends on the kind of output you have in mind).
>
> If the datastore/stream is distributed and supports partitioning, you
> could partition your records by key1, key2, or key1+key2, so they are
> already "separated" and can be consumed more efficiently (e.g., the
> group-by could then be local to a single partition).
>
> Best regards,
> Alessandro
>
> On 15 May 2018 at 08:32, onmstester onmstester <onmstes...@zoho.com>
> wrote:
>
>> Hi,
>> I need to run some queries on a huge number of input records, arriving
>> at a rate of 100K records/second.
>> A record is like (key1, key2, value), and the application should report
>> occurrences of key1 == something && key2 == somethingElse.
>> The problem is that there are too many filters in my query: more than 3
>> thousand (key1, key2) pairs have to be matched.
>> I was simply putting 1 million records at a time in a temp table and
>> running an SQL query over it with spark-sql:
>>
>> select * from mytemptable where (key1 = something and key2 =
>> somethingElse) or (key1 = someOtherThing and key2 = someAnotherThing)
>> or ... (3 thousand or's!!!)
>>
>> And I hit a StackOverflowError at ATNConfigSet.java line 178.
>>
>> So I have two options, IMHO:
>> 1. Put all the key1/key2 filter pairs in another temp table and do a
>> join between the two temp tables, or
>> 2. Use spark-streaming, which I'm not familiar with, and I don't know
>> whether it could handle 3K filters.
>>
>> Which way do you suggest? What is the best solution for my problem,
>> performance-wise?
>>
>> Thanks in advance
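To make option 1 from the question concrete: the same join can be written as plain spark-sql over two temp tables, which sidesteps the parser StackOverflowError entirely. A sketch reusing the records and filterPairs DataFrames from the snippet above (table and column names are illustrative):

    // Register both sides as temp views and join them, instead of emitting
    // 3 thousand OR clauses into a single WHERE.
    records.createOrReplaceTempView("mytemptable")
    filterPairs.createOrReplaceTempView("filter_pairs")

    val occurrences = spark.sql("""
      SELECT r.key1, r.key2, COUNT(*) AS occurrences
      FROM mytemptable r
      JOIN filter_pairs f
        ON r.key1 = f.key1 AND r.key2 = f.key2
      GROUP BY r.key1, r.key2
    """)
    occurrences.show()

Since filter_pairs is tiny, Spark will typically choose a broadcast join here on its own, so the two variants should perform similarly.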