Spark structured streaming - efficient way to do lots of aggregations on the same input files

2021-01-21 Thread Filip
Hi, I'm considering using Apache Spark for the development of an application. This would replace a legacy program which reads CSV files and does lots (tens/hundreds) of aggregations on them. The aggregations are fairly simple: counts, sums, etc., while applying some filtering conditions on some of
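The workload described here (many simple filtered counts and sums over the same input rows) can be sketched in plain Java, independent of Spark; the `Row` fields and the two filter conditions are invented for illustration:

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the workload: many simple filtered aggregations
// (counts and sums) evaluated in a single pass over the rows. The Row
// layout and filters are hypothetical, not from the original post.
public class MultiAggregation {
    record Row(String category, double amount) {}

    // Returns {countA, sumA, countBig, sumBig} after one scan of the input.
    static double[] aggregate(List<Row> rows) {
        long countA = 0, countBig = 0;
        double sumA = 0.0, sumBig = 0.0;
        for (Row r : rows) {                         // single scan of the data
            if (r.category().equals("A")) { countA++;   sumA   += r.amount(); }
            if (r.amount() > 6.0)         { countBig++; sumBig += r.amount(); }
        }
        return new double[] { countA, sumA, countBig, sumBig };
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("A", 10.0), new Row("B", 5.0), new Row("A", 7.5));
        System.out.println(Arrays.toString(aggregate(rows)));
    }
}
```

The point of the sketch is that all aggregations share one scan of the input, which is the property the poster wants to preserve when moving to Spark.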

Re: Spark structured streaming - efficient way to do lots of aggregations on the same input files

2021-01-22 Thread Filip
Hi, I don't have any code for the forEachBatch approach, I mentioned it due to this response to my question on SO: https://stackoverflow.com/a/65803718/1017130 I have added some very simple code below that I think shows what I'm trying to do: val schema = StructType( Array( StructFiel
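The forEachBatch idea from the linked Stack Overflow answer boils down to: receive each micro-batch once, then run every aggregation against that single batch. A plain-Java sketch of that control flow (the `BatchHandler` interface and the sample batches are invented for illustration; this is not Spark's API):

```java
import java.util.List;
import java.util.function.Function;

// Sketch of the foreachBatch control flow: one callback per micro-batch,
// with all aggregations applied to the same batch. Names are illustrative.
public class ForEachBatchSketch {
    // Stands in for Spark's foreachBatch callback.
    interface BatchHandler { void handle(long batchId, List<Double> batch); }

    static void runBatches(List<List<Double>> batches, BatchHandler handler) {
        long id = 0;
        for (List<Double> b : batches) handler.handle(id++, b);
    }

    // Each aggregation is just a function over the batch.
    static double count(List<Double> b) { return b.size(); }
    static double sum(List<Double> b) {
        return b.stream().mapToDouble(Double::doubleValue).sum();
    }

    public static void main(String[] args) {
        List<Function<List<Double>, Double>> aggregations = List.of(
            ForEachBatchSketch::count, ForEachBatchSketch::sum);

        runBatches(List.of(List.of(1.0, 2.0), List.of(3.0)), (id, batch) -> {
            // In Spark you would cache the batch DataFrame here, then run
            // every registered aggregation against the cached data.
            for (Function<List<Double>, Double> agg : aggregations)
                System.out.println("batch " + id + ": " + agg.apply(batch));
        });
    }
}
```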

Developing a spark streaming application

2014-08-27 Thread Filip Andrei
Hey guys, so the problem I'm trying to tackle is the following: - I need a data source that emits messages at a certain frequency - There are N neural nets that need to process each message individually - The outputs from all neural nets are aggregated and only when all N outputs for each message
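The "aggregate only when all N outputs have arrived" requirement can be sketched as a small barrier that buffers per-message outputs and emits once the set is complete. The class, the use of the mean as the aggregate, and the message-id scheme are all assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Collects the N per-net outputs for each message id and emits an
// aggregate (here: the mean) only once all N have arrived. Names and
// the choice of aggregate are illustrative, not from the original post.
public class SyncBarrier {
    private final int n;
    private final Map<Long, List<Double>> pending = new HashMap<>();

    SyncBarrier(int n) { this.n = n; }

    // Returns the aggregated value when this output completes the set,
    // or null while outputs for the message are still missing.
    Double offer(long messageId, double output) {
        List<Double> outs = pending.computeIfAbsent(messageId, k -> new ArrayList<>());
        outs.add(output);
        if (outs.size() < n) return null;
        pending.remove(messageId);          // free the buffer for this message
        return outs.stream().mapToDouble(Double::doubleValue).average().orElseThrow();
    }

    public static void main(String[] args) {
        SyncBarrier barrier = new SyncBarrier(3);
        System.out.println(barrier.offer(1, 2.0));  // null: 1 of 3 arrived
        System.out.println(barrier.offer(1, 4.0));  // null: 2 of 3 arrived
        System.out.println(barrier.offer(1, 6.0));  // mean once all 3 arrived
    }
}
```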

Ensuring object in spark streaming runs on specific node

2014-08-29 Thread Filip Andrei
Say you have a Spark Streaming setup such as JavaReceiverInputDStream<...> rndLists = jssc.receiverStream(new JavaRandomReceiver(...)); rndLists.map(new NeuralNetMapper(...)) .foreach(new JavaSyncBarrier(...)); Is there any way of ensuring that, say, a JavaRandomReceiver and Java

Odd error when using a rdd map within a stream map

2014-09-18 Thread Filip Andrei
Here I wrote a simpler version of the code to get an understanding of how it works: final List nns = new ArrayList(); for(int i = 0; i < numberOfNets; i++){ nns.add(NeuralNet.createFrom(...)); } final JavaRDD nnRdd = sc.parallelize(nns); JavaDStream results = rndLists.flatMap(new FlatMapFu
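The usual pitfall in this pattern is that an RDD (`nnRdd`) is referenced inside the function of another map: RDD operations cannot be nested inside tasks. A common workaround is to keep the nets as an ordinary local (or broadcast) collection and iterate over it inside the closure. A plain-Java sketch of that fix, with a stand-in `NeuralNet` type (the interface, method names, and the sum as the combiner are assumptions):

```java
import java.util.List;

// Instead of parallelizing the nets into an RDD and using that RDD inside
// another map (which Spark forbids: RDDs cannot be used within tasks),
// hold the nets in an ordinary collection captured by the closure.
// NeuralNet is a stand-in for the real model type.
public class NestedMapFix {
    interface NeuralNet { double process(double input); }

    // Applies every net to one message and sums the outputs, all locally
    // within what would be a single Spark task.
    static double processMessage(List<NeuralNet> nets, double message) {
        double total = 0.0;
        for (NeuralNet net : nets) total += net.process(message);
        return total;
    }

    public static void main(String[] args) {
        List<NeuralNet> nets = List.of(x -> x * 2, x -> x + 1);
        System.out.println(processMessage(nets, 3.0));
    }
}
```

In actual Spark code the equivalent move would be to broadcast the list of nets rather than call `sc.parallelize(nns)` and reference the resulting RDD inside the stream's map function.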

Re: Odd error when using a rdd map within a stream map

2014-09-19 Thread Filip Andrei
Hey, I don't think that's the issue; foreach is called on 'results', which is a DStream of floats, so naturally it passes RDDs to its function. And either way, changing the code in the first mapper to comment out the map-reduce process on the RDD Float f = 1.0f; //nnRdd.map(new Function() {