Bloom Filter to filter huge dataframes with PySpark

2020-09-23 Thread Breno Arosa
(https://spark.apache.org/docs/2.4.3/api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions), but there is no corresponding function in PySpark. Is there any way to call this from within PySpark? I'm using Spark 2.4.3. Thanks, Breno Arosa. PS: I'm not sharing the real query but here is a very s…
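Since the thread notes that `DataFrameStatFunctions.bloomFilter` (Scala) has no PySpark counterpart in 2.4, one workaround is to build a Bloom filter on the driver, broadcast it, and filter rows through a UDF. The sketch below is a minimal pure-Python Bloom filter; the class and parameter names are mine, not from the thread, and the sizing formulas are the standard optimal-m/optimal-k choices:

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter: a bit array plus k index functions
    derived from one MD5 digest via double hashing."""

    def __init__(self, expected_items, fpp):
        # optimal bit count m and hash count k for the target false-positive rate
        self.m = max(1, int(-expected_items * math.log(fpp) / (math.log(2) ** 2)))
        self.k = max(1, int(round(self.m / expected_items * math.log(2))))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # double hashing: index_i = (h1 + i * h2) mod m
        digest = hashlib.md5(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # force odd so strides vary
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely absent; True means "probably present"
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))
```

In a PySpark job one could broadcast this object with `spark.sparkContext.broadcast(bf)` and test membership inside a UDF. It may also be possible to reach the JVM implementation directly via `df._jdf.stat().bloomFilter(...)` through py4j, but that returns a Java object and relies on internals, so it is worth verifying against your Spark version.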

Re: Submitting Spark Job thru REST API?

2020-09-02 Thread Breno Arosa
Maybe there are other ways, but I think the most common path is using Apache Livy (https://livy.apache.org/). On 02/09/2020 17:58, Eric Beabes wrote: Under Spark 2.4, is it possible to submit a Spark job through a REST API, just like a Flink job? Here's the use case: we need to submit a Spark Job…
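Livy's batch endpoint accepts a JSON payload describing what `spark-submit` would run. The sketch below uses only the standard library; the `/batches` path and payload fields follow Livy's documented REST API, while the URL, bucket, and job names are placeholders:

```python
import json
import urllib.request

def build_batch_payload(file, class_name=None, args=None, conf=None):
    """Assemble the JSON body for Livy's POST /batches endpoint."""
    payload = {"file": file}          # jar or .py file, e.g. on S3/HDFS
    if class_name:
        payload["className"] = class_name  # main class, for jar submissions
    if args:
        payload["args"] = args
    if conf:
        payload["conf"] = conf             # extra Spark conf key/values
    return payload

def submit_batch(livy_url, payload):
    """POST the payload to Livy; the response includes the batch id and state."""
    req = urllib.request.Request(
        livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A caller would then do something like `submit_batch("http://livy-host:8998", build_batch_payload("s3://bucket/job.py", args=["--date", "2020-09-02"]))` and poll `GET /batches/{id}` for the job state.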

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Breno Arosa
Kafka Connect (https://docs.confluent.io/current/connect/index.html) may be an easier solution for this use case of just dumping Kafka topics. On 17/06/2020 18:02, Jungtaek Lim wrote: Just in case anyone prefers ASF projects, there are other alternative projects in ASF as well, alphabeti…
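With the Kafka Connect route, dumping a topic to S3 amounts to posting a sink-connector configuration to the Connect REST API. A hedged sketch of such a config is below; the property names follow Confluent's S3 sink connector, but the connector name, topic, bucket, region, and flush size are placeholders to adapt:

```json
{
  "name": "s3-sink-events",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "1",
    "topics": "events",
    "s3.bucket.name": "my-datalake-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "10000"
  }
}
```

This would typically be submitted with a POST to the Connect worker's `/connectors` endpoint; the connector then writes topic records to S3 in batches of `flush.size`.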

Re: Standard practices for building dashboards for spark processed data

2020-02-26 Thread Breno Arosa
I have been using Athena/Presto to read the parquet files in the data lake; if you are already saving data to S3, I think this is the easiest option. Then I use Redash or Metabase to build dashboards (they have different limitations); both are very intuitive to use and easy to set up with Docker. -- S…
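The Docker setup mentioned above can be as small as a single service. A minimal docker-compose sketch for Metabase is below (image and port are the project defaults; the volume path is a placeholder for persisting Metabase's application database):

```yaml
version: "3"
services:
  metabase:
    image: metabase/metabase
    ports:
      - "3000:3000"          # web UI at http://localhost:3000
    volumes:
      - ./metabase-data:/metabase-data
    environment:
      MB_DB_FILE: /metabase-data/metabase.db
```

Once running, Metabase can be pointed at Presto or Athena through the appropriate database driver to query the parquet files directly.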