Re: Structured Streaming Spark 3.0.1

2021-01-21 Thread Gabor Somogyi
I've double-checked this and came to the same conclusion as Jungtaek. I've added a comment to the Stack Overflow post to reach more people with the answer. G On Thu, Jan 21, 2021 at 6:53 AM Jungtaek Lim wrote: > I quickly looked into the attached log in SO post, and the problem doesn't

Query on entrypoint.sh Kubernetes spark

2021-01-21 Thread Sachit Murarka
Hi All, To run Spark on Kubernetes, I see the following lines in the entrypoint.sh script: case "$1" in driver) shift 1 CMD=( "$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client Could you please suggest why d

Re: Query on entrypoint.sh Kubernetes spark

2021-01-21 Thread Jacek Laskowski
Hi, I'm a beginner in Spark on Kubernetes so bear with me and watch out for possible mistakes :) The key to understanding entrypoint.sh and --deploy-mode client is to think about the environment the script is executed in. That's k8s already, where the Docker image is brought to life as a contain

Re: Only one Active task in Spark Structured Streaming application

2021-01-21 Thread Jacek Laskowski
Hi, I'd look at stages and jobs as it's possible that the only task running is the missing one in a stage of a job. Just guessing... Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books Follow me on https://twitter.com/jacekla

Re: Application Timeout

2021-01-21 Thread Jacek Laskowski
Hi Brett, No idea why it happens, but got curious about this "Cores" column being 0. Is this always the case? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books Follow me on https://twitter.com/jaceklaskowski

Re: Process each kafka record for structured streaming

2021-01-21 Thread Jacek Laskowski
Hi, Can you use console sink and make sure that the pipeline shows some progress? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books Follow me on https://twitter.com/jaceklaskowski On
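
A minimal sketch of the console-sink check being suggested, assuming a Kafka source (the broker and topic names are placeholders, and the spark-sql-kafka integration is assumed to be on the classpath):

    # Swap the real sink for the console sink to confirm the pipeline makes progress.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("console-sink-check").getOrCreate()

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
          .option("subscribe", "input-topic")                   # placeholder topic
          .load())

    query = (df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
             .writeStream
             .format("console")            # print each micro-batch to stdout
             .option("truncate", "false")
             .start())

    query.awaitTermination()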

Re: Column-level encryption in Spark SQL

2021-01-21 Thread Jacek Laskowski
Hi, Never heard of it (and have once been tasked to explore a similar use case). I'm curious how you'd like it to work? (no idea how Hive does this either) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books Follow me on http

Re: Only one Active task in Spark Structured Streaming application

2021-01-21 Thread Jungtaek Lim
I'm not sure how many people could even guess possible reasons - I'd say there's not enough information. No driver/executor logs, no job/stage/executor information, no code. On Thu, Jan 21, 2021 at 8:25 PM Jacek Laskowski wrote: > Hi, > > I'd look at stages and jobs as it's possible that the onl

Re: Only one Active task in Spark Structured Streaming application

2021-01-21 Thread Eric Beabes
I see a lot of messages such as this in the driver log even though this is not the first batch; the job has been running for more than 3 days. Jan 21, 2021 @ 17:09:42.484 21/01/21 11:39:34 WARN state.HDFSBackedStateStoreProvider: The state for version 43405 doesn't exist in loadedMaps. Reading
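
For context, this warning usually means the requested state version was evicted from the provider's in-memory cache and has to be re-read from the checkpoint files. A hedged sketch of the knobs involved (the config names are standard Spark SQL settings; the values are purely illustrative, not recommendations):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("state-store-tuning")
             # number of recent state versions kept in memory (the loadedMaps cache)
             .config("spark.sql.streaming.maxBatchesToRetainInMemory", "4")
             # number of batches whose state files are retained in the checkpoint
             .config("spark.sql.streaming.minBatchesToRetain", "100")
             .getOrCreate())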

Facing memory leak with Pyarrow enabled and toPandas()

2021-01-21 Thread Divyanshu Kumar
Hi, I am facing this issue while using toPandas() and PyArrow simultaneously: pandas - toPandas() giving memory leak error with PySpark arrow enabled - Stack Overflow
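
For reference, a small sketch of the pattern in question, Arrow-accelerated toPandas() (the config key is the Spark 3.x name; the DataFrame and the row cap are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("arrow-topandas").getOrCreate()
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

    df = spark.range(1_000_000).selectExpr("id", "id * 2 AS doubled")

    # Everything returned by toPandas() lands in driver memory, so cap the result
    # (or raise spark.driver.maxResultSize) before collecting.
    pdf = df.limit(100_000).toPandas()
    print(pdf.shape)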

Re: Only one Active task in Spark Structured Streaming application

2021-01-21 Thread Sean Owen
Is your app accumulating a lot of streaming state? That's one reason something could slow down after a long time. A memory leak in your app putting GC/memory pressure on the JVM could, too. On Thu, Jan 21, 2021 at 5:13 AM Eric Beabes wrote: > Hello, > > My Spark Structured Streaming application

Re: Structured Streaming Spark 3.0.1

2021-01-21 Thread Gabor Somogyi
If you know the exact version of the Google connector that is used, the source can be checked to see what really happened: https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/83a6c9809ad49a44895d59558e666e5fc183e0bf/gcs/src/main/java/com/google/cloud/hadoop/fs/gcs/GoogleHadoop

Re: Column-level encryption in Spark SQL

2021-01-21 Thread Mich Talebzadeh
Most enterprise databases provide data encryption in some form. For example, Introduction to Transparent Data Encryption (oracle.com). As far as I know Hive supports text and sequence file column
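
As far as I know, Spark SQL itself (as of 3.0.x) has no built-in column-level encryption, so a common workaround is wrapping a symmetric cipher in a UDF. A rough sketch using the third-party cryptography package, with deliberately naive key handling (data and column names are made up):

    from cryptography.fernet import Fernet
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("column-encryption-sketch").getOrCreate()

    key = Fernet.generate_key()   # in practice: fetch from a KMS / secret store

    @udf(returnType=StringType())
    def encrypt_col(value):
        # Only the key (plain bytes) is captured in the closure and shipped to executors.
        if value is None:
            return None
        return Fernet(key).encrypt(value.encode("utf-8")).decode("utf-8")

    df = spark.createDataFrame([("alice", "4111-1111"), ("bob", "4222-2222")],
                               ["name", "card_number"])
    df.withColumn("card_number", encrypt_col("card_number")).show(truncate=False)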

Re: Column-level encryption in Spark SQL

2021-01-21 Thread Gourav Sengupta
Hi John, as always I would start by asking what is it that you are trying to achieve here? What is the exact security requirement? We can then start looking at the options available. Regards, Gourav Sengupta On Thu, Jan 21, 2021 at 1:59 PM Mich Talebzadeh wrote: > Most enterprise databases provi

Pyspark How to groupBy -> fit

2021-01-21 Thread Riccardo Ferrari
Hi list, I am looking for an efficient solution to apply a training pipeline to each group of a DataFrame.groupBy. This is very easy if you're using a pandas UDF (i.e. groupBy().apply()), but I am not able to find the equivalent for a Spark pipeline. The ultimate goal is to fit multiple models, one
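
For reference, the grouped-map pandas UDF route looks roughly like this in Spark 3.x (applyInPandas); note it only helps if a single-node library such as scikit-learn is acceptable per group, since a Spark ML Pipeline cannot be fit inside the UDF. Column names and the model are illustrative:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("groupby-fit-sketch").getOrCreate()

    df = spark.createDataFrame(
        [("a", 1.0, 2.0), ("a", 2.0, 4.1), ("b", 1.0, 3.0), ("b", 2.0, 5.9)],
        ["group", "x", "y"])

    def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
        # Train one small model per group and return its coefficient as a row.
        model = LinearRegression().fit(pdf[["x"]], pdf["y"])
        return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                             "coef": [float(model.coef_[0])]})

    (df.groupBy("group")
       .applyInPandas(fit_group, schema="group string, coef double")
       .show())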

Re: Spark standalone - reading kerberos hdfs

2021-01-21 Thread Sudhir Babu Pothineni
Any other insights into this issue? I tried multiple ways to supply the keytab to the executor. Does Spark standalone not support Kerberos? > On Jan 8, 2021, at 1:53 PM, Sudhir Babu Pothineni > wrote: > > In case of Spark on YARN, the Application Master shares the token. > > I think in case of spa

Spark structured streaming - efficient way to do lots of aggregations on the same input files

2021-01-21 Thread Filip
Hi, I'm considering using Apache Spark for the development of an application. This would replace a legacy program which reads CSV files and does lots (tens/hundreds) of aggregations on them. The aggregations are fairly simple: counts, sums, etc. while applying some filtering conditions on some of
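
If the aggregations really are simple counts/sums under different filter conditions, many of them can be collapsed into a single pass with conditional aggregation; a small sketch (the path and column names are made up):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("multi-agg-sketch").getOrCreate()

    df = (spark.read.option("header", "true").option("inferSchema", "true")
          .csv("/data/input/*.csv"))                    # placeholder path

    # One scan of the input, many filtered aggregates.
    summary = df.agg(
        F.count("*").alias("total_rows"),
        F.sum(F.when(F.col("status") == "ERROR", 1).otherwise(0)).alias("error_rows"),
        F.sum(F.when(F.col("amount") > 100, F.col("amount"))).alias("large_amount_sum"),
    )
    summary.show()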

Re: Pyspark How to groupBy -> fit

2021-01-21 Thread Sean Owen
If you mean you want to train N models in parallel, you wouldn't be able to do that with a groupBy first. You apply logic to the result of groupBy with Spark, but can't use Spark within Spark. You can run N Spark jobs in parallel on the driver but you'd have to have each read the subset of data tha
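
A hedged sketch of that second option, submitting N independent fits from driver-side threads (the path, columns and pipeline stages are placeholders):

    from concurrent.futures import ThreadPoolExecutor
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parallel-fits-sketch").getOrCreate()

    df = spark.read.parquet("/data/training")   # placeholder path
    df.cache()                                   # avoid re-reading the input per group

    groups = [r["group"] for r in df.select("group").distinct().collect()]

    def fit_for(group_value):
        # Build a fresh pipeline per thread and fit it on that group's subset.
        pipeline = Pipeline(stages=[
            VectorAssembler(inputCols=["x"], outputCol="features"),
            LinearRegression(featuresCol="features", labelCol="y"),
        ])
        return group_value, pipeline.fit(df.filter(df["group"] == group_value))

    with ThreadPoolExecutor(max_workers=4) as pool:
        models = dict(pool.map(fit_for, groups))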

Re: Pyspark How to groupBy -> fit

2021-01-21 Thread Mich Talebzadeh
I guess one drawback would be that the data cannot be processed and stored in Pandas DataFrames as these DataFrames store data in RAM. If you are going to run multiple parallel jobs then a single machine may not be viable? On Thu, 21 Jan 2021 at 16:29, Sean Owen wrote: > If you mean you want

Re: Connection to Presto via Spark

2021-01-21 Thread Gourav Sengupta
Terribly fascinating. Any insights into why we are not trying to use Spark itself? Regards Gourav On Wed, 13 Jan 2021, 12:46 Vineet Mishra, wrote: > Hi, > > I am trying to connect to Presto via Spark shell using the following > connection string, however ending up with an exception > > *-bash-4.2$

Re: Only one Active task in Spark Structured Streaming application

2021-01-21 Thread Eric Beabes
Yes. For this particular use case the state size could be big but I doubt if there's a leak. Maybe adding more memory would help. On Thu, Jan 21, 2021 at 5:55 PM Sean Owen wrote: > Is your app accumulating a lot of streaming state? that's one reason > something could slow down after a long time.

Re: Only one Active task in Spark Structured Streaming application

2021-01-21 Thread Lalwani, Jayesh
If you are doing aggregations, you need to watermark the data. Depending on what aggregations you are doing, state might keep accumulating until failure. From: Eric Beabes Date: Thursday, January 21, 2021 at 12:19 PM To: Sean Owen Cc: spark-user Subject: RE: [EXTERNAL] Only one Active task in Sp
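
A minimal sketch of a watermarked streaming aggregation, which lets Spark drop old state instead of accumulating it indefinitely (the rate source, window size and watermark threshold are illustrative):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("watermark-sketch").getOrCreate()

    events = (spark.readStream
              .format("rate")                    # built-in test source
              .option("rowsPerSecond", "10")
              .load()
              .withColumnRenamed("timestamp", "event_time"))

    counts = (events
              .withWatermark("event_time", "10 minutes")  # state older than this is dropped
              .groupBy(F.window("event_time", "5 minutes"))
              .count())

    query = counts.writeStream.outputMode("update").format("console").start()
    query.awaitTermination()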

Re: Pyspark How to groupBy -> fit

2021-01-21 Thread Riccardo Ferrari
Thanks for the answers. I am trying to avoid reading the same data multiple times (once per model). One approach I can think of is filtering on the column I want to split on and training each model. I was hoping to find a more elegant approach. On Thu, Jan 21, 2021 at 5:28 PM Sean Owen wrote:

unsubscribe

2021-01-21 Thread Aironman DirtDiver
-- Alonso Isidoro Roman https://about.me/alonso.isidoro.roman

Re: Structured Streaming Spark 3.0.1

2021-01-21 Thread gshen
I am now testing streaming into a Delta table. Interestingly I have gotten it working within a community version of Databricks, which leads me to think it might have something to do with my dependencies. I am checkpointing to ADLS Gen2, adding the following dependencies: delta-core_2.12-0.7.0
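
For reference, the setup being described looks roughly like this with delta-core 0.7.0 on Spark 3.0 (the abfss:// container/account paths are placeholders, and the Delta and ADLS jars are assumed to be on the classpath as listed above):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("delta-stream-sketch")
             # session settings documented for delta-core 0.7.0 on Spark 3.0
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    stream = spark.readStream.format("rate").load()

    query = (stream.writeStream
             .format("delta")
             .option("checkpointLocation",
                     "abfss://container@account.dfs.core.windows.net/checkpoints/demo")
             .start("abfss://container@account.dfs.core.windows.net/tables/demo"))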

Re: Pyspark How to groupBy -> fit

2021-01-21 Thread Sean Owen
Yep that's one approach. That may not really re-read the data N times; for example if the filtering aligns with partitioning, you'd be reading subsets each time. You can also cache the input first to avoid I/O N times. But again I wonder if you are at a scale that really needs distributed training.

Re: Structured Streaming Spark 3.0.1

2021-01-21 Thread Jungtaek Lim
Looks like it's a driver-side error log, and I think the executor log would have many more warning/error logs, probably with stack traces. I'd also suggest excluding external dependencies wherever possible while experimenting/investigating. If you're suspecting Apache Spark I'd rather say you'll

Re: Structured Streaming Spark 3.0.1

2021-01-21 Thread gshen
Thanks for the tips! I think I figured out what might be causing it. It's the checkpointing to Microsoft Azure Data Lake Storage (ADLS). When I use "local checkpointing" it works, but when I checkpoint to ADLS it fails whenever there's a groupBy in the stream. Weirdly, it works when there is no groupBy clause in the st

Re: Structured Streaming Spark 3.0.1

2021-01-21 Thread Gabor Somogyi
The most interesting part is that you've added this: kafka-clients-0.10.2.2.jar. Spark 3.0.1 uses Kafka clients 2.4.1, and downgrading with such a big step doesn't help. Please remove that as well, together with the Spark-Kafka dependency. G On Thu, 21 Jan 2021, 22:45 gshen, wrote: > Thanks for the tips! > >
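
For reference, one way to pick up a matching client is to let Spark resolve the official Kafka integration itself rather than pinning jars by hand; a sketch using the Spark 3.0.1 / Scala 2.12 coordinates (the same can be done with spark-submit --packages):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("kafka-deps-sketch")
             # pulls in the matching kafka-clients 2.4.1 transitively
             .config("spark.jars.packages",
                     "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1")
             .getOrCreate())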