From the information you provided, I would tackle this as a batch problem, because that way you have access to more sophisticated techniques and more flexibility (maybe HDFS and a Spark job, but also think about a datastore offering good indexes for the kinds of data types and values your keys take, so you can benefit from filter push-downs).
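For example, since there are fewer than 10K filter pairs, you could load them into a small DataFrame and broadcast-join it against the records, instead of generating a giant OR predicate. A minimal sketch (paths, formats, and column names are placeholders, not your actual setup):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("pair-filter").getOrCreate()

    // Records as (key1, key2, value); the input path and CSV layout are
    // placeholders for however the text files actually look.
    val records = spark.read
      .csv("hdfs:///input/records")
      .toDF("key1", "key2", "value")

    // The ~3K filter pairs, reloaded every 2-3 days from the database or a
    // dump of it (again, source and format are placeholders).
    val filterPairs = spark.read
      .option("header", "true")
      .csv("hdfs:///input/filter_pairs.csv")
      .select("key1", "key2")

    // Broadcasting the small side turns the 3K-clause OR into a hash lookup
    // per record, so the SQL parser never sees a huge predicate.
    val matched = records.join(broadcast(filterPairs), Seq("key1", "key2"))

    // Report occurrences per (key1, key2) pair.
    matched.groupBy("key1", "key2").count().show()

With fewer than 10K pairs, the broadcast side easily fits in memory on every executor.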
I personally use streaming only when real-time ingestion is needed.

Hth,
Alessandro

On 15 May 2018 at 09:11, onmstester onmstester <onmstes...@zoho.com> wrote:

>> How many distinct key1 (resp. key2) values do you have? Are these values
>> reasonably stable over time?
>
> Less than 10 thousand, and these filters would change every 2-3 days. They
> would be written to and loaded from a database.
>
>> Are these records ingested in real-time or are they loaded from a
>> datastore?
>
> Records would be loaded from text files that are copied into some
> directory over and over.
>
> Are you suggesting that I don't need to use spark-streaming?
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
>
> ---- On Tue, 15 May 2018 11:26:42 +0430 Alessandro Solimando
> <alessandro.solima...@gmail.com> wrote ----
>
> Hi,
> I am not familiar with ATNConfigSet, but here are some thoughts that might
> help.
>
> How many distinct key1 (resp. key2) values do you have? Are these values
> reasonably stable over time?
>
> Are these records ingested in real-time, or are they loaded from a
> datastore?
>
> In the latter case, the DB might be able to perform the filtering
> efficiently, especially if equipped with a proper index over key1/key2 (or
> a composite one). The filter push-down could then be very effective (I
> didn't get whether you just need to count or do something more with the
> matching records).
>
> Alternatively, you could group by (key1, key2) and then filter (again, it
> depends on the kind of output you have in mind).
>
> If the datastore/stream is distributed and supports partitioning, you
> could partition your records by key1, key2, or key1+key2, so they are
> already "separated" and can be consumed more efficiently (e.g., the
> group-by could then be local to a single partition).
>
> Best regards,
> Alessandro
>
> On 15 May 2018 at 08:32, onmstester onmstester <onmstes...@zoho.com>
> wrote:
>
>> Hi,
>> I need to run some queries on a huge number of input records, arriving
>> at a rate of 100K records/second.
>> A record is like (key1, key2, value), and the application should report
>> occurrences of key1 == something && key2 == somethingElse.
>> The problem is that there are too many filters in my query: more than 3
>> thousand (key1, key2) pairs have to be matched.
>> I was simply putting 1 million records at a time in a temp table and
>> running an SQL query over it with spark-sql:
>>
>> select * from mytemptable where (key1 = something and key2 =
>> somethingElse) or (key1 = someOtherThing and key2 = someAnotherThing)
>> or ... (3 thousand or's!!!)
>>
>> And I hit a StackOverflowError at ATNConfigSet.java line 178.
>>
>> So I have two options, IMHO:
>> 1. Put all the key1/key2 filter pairs in another temp table and do a
>> join between the two temp tables, or
>> 2. Use spark-streaming, which I'm not familiar with, and I don't know
>> whether it could handle 3K filters.
>>
>> Which way do you suggest? What is the best solution for my problem,
>> performance-wise?
>>
>> Thanks in advance
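To make option 1 from the question concrete: the same join can be written as plain spark-sql over two temp tables, which sidesteps the parser StackOverflowError entirely. A sketch reusing the records and filterPairs DataFrames from the snippet above (table and column names are illustrative):

    // Register both sides as temp views and join them, instead of emitting
    // 3 thousand OR clauses into a single WHERE.
    records.createOrReplaceTempView("mytemptable")
    filterPairs.createOrReplaceTempView("filter_pairs")

    val occurrences = spark.sql("""
      SELECT r.key1, r.key2, COUNT(*) AS occurrences
      FROM mytemptable r
      JOIN filter_pairs f
        ON r.key1 = f.key1 AND r.key2 = f.key2
      GROUP BY r.key1, r.key2
    """)
    occurrences.show()

Since filter_pairs is tiny, Spark will typically choose a broadcast join here on its own, so the two variants should perform similarly.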