Re: Question about 'maxOffsetsPerTrigger'

2020-06-30 Thread Jungtaek Lim
As Spark uses micro-batches for streaming, it's unavoidable to tune the batch size properly to achieve your desired trade-off between throughput and latency. In particular, since Spark uses a global watermark which doesn't change during a micro-batch, you'd want to make the batch relatively small to make waterm
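A minimal sketch of the option in question, assuming a Kafka source (servers, topic, and the cap value are all illustrative):

  // Hedged sketch: bound each micro-batch so the global watermark can advance
  // more often; every option value below is illustrative.
  val stream = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .option("maxOffsetsPerTrigger", "10000") // cap on offsets consumed per trigger
    .load()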

Re: REST Structured Steaming Sink

2020-07-01 Thread Jungtaek Lim
…t bet would be simply implementing your own with foreachBatch, but someone might jump in and provide a pointer if there is something in the Spark ecosystem. Thanks, Jungtaek Lim (HeartSaVioR) On Thu, Jul 2, 2020 at 3:21 AM Sam Elamin wrote: > Hi All, > > > We ingest a lot of restful A
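A minimal sketch of such a hand-rolled sink, assuming a streaming DataFrame `streamingDf`; postToRestEndpoint is a hypothetical helper standing in for your HTTP client:

  // Hedged sketch: REST "sink" via foreachBatch; postToRestEndpoint is a
  // hypothetical helper. collect() is only viable for small batches - push
  // the writes into the executors (e.g. foreachPartition) for anything sizable.
  import org.apache.spark.sql.DataFrame

  val postBatch: (DataFrame, Long) => Unit = (batchDf, batchId) => {
    batchDf.toJSON.collect().foreach(json => postToRestEndpoint(json))
  }
  streamingDf.writeStream.foreachBatch(postBatch).start()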

Re: Failure Threshold in Spark Structured Streaming?

2020-07-02 Thread Jungtaek Lim
Structured Streaming basically follows SQL semantics, which have no notion of a "max allowance of failures". If you'd like to tolerate malformed data, please read with a raw format (string or binary), which won't fail on such data, and try converting; e.g. from_json() will produce nu
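A minimal sketch of that pattern, assuming a raw source DataFrame `rawDf` and an illustrative schema:

  // Hedged sketch: read malformed-prone data as raw strings, then convert;
  // from_json yields null for rows that fail to parse.
  import org.apache.spark.sql.functions.{col, from_json}
  import org.apache.spark.sql.types.{LongType, StringType, StructType}

  val schema = new StructType().add("id", LongType).add("name", StringType)
  val parsed = rawDf
    .selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), schema).as("data"))
    .filter(col("data").isNotNull) // drop, or route aside, malformed rows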

Re: Spark streaming with Kafka

2020-07-02 Thread Jungtaek Lim
I can't reproduce it. Could you please make sure you're running spark-shell with the official Spark 3.0.0 distribution? Please try changing the directory and using a relative path like "./spark-shell". On Thu, Jul 2, 2020 at 9:59 PM dwgw wrote: > Hi > I am trying to stream a kafka topic from spark shel

Re: Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-05 Thread Jungtaek Lim
…the progress from Kafka's point of view, especially the gap between the highest offset and the committed offset. Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) On Mon, Jul 6, 2020 at 2:53 AM Gabor Somogyi wrote: > In 3.0 the community just added it. > > On Sun, 5 Jul 2020, 14:28 K

Re: Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-06 Thread Jungtaek Lim
d for some time. > > Thanks, > Asmath > > On Sun, Jul 5, 2020 at 6:22 PM Jungtaek Lim > wrote: > >> There're sections in SS programming guide which exactly answer these >> questions: >> >> >> http://spark.apache.org/docs/latest/structured-s

Re: OOM while processing read/write to S3 using Spark Structured Streaming

2020-07-19 Thread Jungtaek Lim
Please provide logs and a dump file for the OOM case - otherwise no one can say what the cause is. Add JVM options to driver/executor => -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath="...dir..." On Sun, Jul 19, 2020 at 6:56 PM Rachana Srivastava wrote: > *Issue:* I am trying to process 5000+
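For reference, a hedged sketch naming the conf keys that carry those JVM options (dump path illustrative); in practice these are normally supplied at launch, e.g. via spark-submit, since driver-side JVM options can't be applied after the JVM has started:

  // Hedged sketch: conf keys for the heap-dump options; values are illustrative.
  val spark = org.apache.spark.sql.SparkSession.builder()
    .config("spark.driver.extraJavaOptions",
      "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dumps")
    .config("spark.executor.extraJavaOptions",
      "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dumps")
    .getOrCreate()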

Re: Lazy Spark Structured Streaming

2020-07-27 Thread Jungtaek Lim
…tputs on OutputMode.Append compared to OutputMode.Update. Unfortunately there's no mechanism in SSS to move the watermark forward without actual input, so if you want to test some behavior on OutputMode.Append you would need to add a dummy record to move the watermark forward. Hope this helps.

Re: Pyspark: Issue using sql in foreachBatch sink

2020-07-31 Thread Jungtaek Lim
Python doesn't allow omitting the parentheses on a no-argument call, whereas Scala does. Use `write()`, not `write`. On Wed, Jul 29, 2020 at 9:09 AM muru wrote: > In a pyspark SS job, trying to use sql instead of sql functions in > foreachBatch sink > throws AttributeError: 'JavaMember' object has no attribute '

Re: Lazy Spark Structured Streaming

2020-08-02 Thread Jungtaek Lim
56> refer if not fixing > the "need to add a dummy record to move watermark forward"? > > Kind regards, > > Phillip > > > > > On Mon, Jul 27, 2020 at 11:41 PM Jungtaek Lim < > kabhwan.opensou...@gmail.com> wrote: > >> I'm not sure

Re: [SPARK-STRUCTURED-STREAMING] IllegalStateException: Race while writing batch 4

2020-08-12 Thread Jungtaek Lim
ngs to maintain the table in good shape) Thanks, Jungtaek Lim (HeartSaVioR) On Sat, Aug 8, 2020 at 4:19 AM Amit Joshi wrote: > Hi, > > I have 2spark structure streaming queries writing to the same outpath in > object storage. > Once in a while I am getting the "IllegalStateEx

Re: Structured Streaming metric for count of delayed/late data

2020-08-20 Thread Jungtaek Lim
One more thing to note: unfortunately, the number is not accurate relative to the input rows on streaming aggregation, because Spark aggregates locally first and counts dropped inputs based on "pre-locally-aggregated" rows. You may want to treat the number as an indicator of whether inputs are being dropped at all, rather than an exact count.

Re: Structured Streaming metric for count of delayed/late data

2020-08-22 Thread Jungtaek Lim
Is there any workaround for this limitation of inaccurate count, maybe by > adding some additional streaming operation in SS job without impacting perf > too much ? > > > > Regards, > > Rajat > > > > *From: *Jungtaek Lim > *Date: *Friday, 21 August 2020 at 12

Re: [Spark Kafka Structured Streaming] Adding partition and topic to the kafka dynamically

2020-08-28 Thread Jungtaek Lim
Hi Amit, if I remember correctly, you don't need to restart the query to pick up a newly added topic or partition, as long as your subscription covers it (e.g. a subscribe pattern). Please try it out. Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) On Fri, Aug 28, 2020 at 1:56 PM
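A minimal sketch of a pattern subscription (servers and pattern are illustrative):

  // Hedged sketch: subscribePattern lets a running query pick up newly created
  // topics matching the pattern without restarting the query.
  val df = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribePattern", "events-.*")
    .load()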

Re: Keeping track of how long something has been in a queue

2020-09-06 Thread Jungtaek Lim
te. Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) On Fri, Sep 4, 2020 at 11:21 PM Hamish Whittal wrote: > Sorry, I moved a paragraph, > > (2) If Ms green.th was first seen at 13:04:04, then at 13:04:05 and >> finally at 13:04:17, she's been in the queue for 13 seconds (ignoring the >> ms). >> >

Re: Elastic Search sink showing -1 for numOutputRows

2020-09-07 Thread Jungtaek Lim
I don't know about the ES sink. The availability of "numOutputRows" depends on the API version the sink implements (DSv1 vs DSv2), so you may be better off asking the author of the ES sink to confirm the case. On Tue, Sep 8, 2020 at 5:15 AM jainshasha wrote: > Hi, > > Using structured sp

Re: Query around Spark Checkpoints

2020-09-27 Thread Jungtaek Lim
reading it, and evaluate whether your target storage can fulfill the requirement. Thanks, Jungtaek Lim (HeartSaVioR) On Mon, Sep 28, 2020 at 3:04 AM Amit Joshi wrote: > Hi, > > As far as I know, it depends on whether you are using spark streaming or > structured streaming. > In

Re: Query around Spark Checkpoints

2020-09-29 Thread Jungtaek Lim
…ind a workaround then probably something is going wrong. Please feel free to share it. Thanks, Jungtaek Lim (HeartSaVioR) On Wed, Sep 30, 2020 at 1:14 AM, Bryan Jeffrey wrote: > Jungtaek, > > How would you contrast stateful streaming with checkpoint vs. the idea of > writing updates to a Delta Lak

Re: [Spark Core] Why no spark.read.delta / df.write.delta?

2020-10-05 Thread Jungtaek Lim
andra connector leverages it. I see some external data sources starting to support catalog, and in Spark itself there's some effort to support catalog for JDBC. https://databricks.com/fr/session_na20/datasource-v2-and-cassandra-a-whole-new-world Hope this helps. Thanks, Jungtaek Lim (HeartS

Re: Arbitrary stateful aggregation: updating state without setting timeout

2020-10-05 Thread Jungtaek Lim
…emove state for the group (key). Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) On Mon, Oct 5, 2020 at 6:16 PM Yuri Oleynikov (יורי אולייניקוב) < yur...@gmail.com> wrote: > Hi all, I have the following question: > > What happens to the state (in terms of expiration) if I'm up

Re: Excessive disk IO with Spark structured streaming

2020-10-05 Thread Jungtaek Lim
…ame (in short, due to how checkpointing works in SS, the temp file is atomically renamed to the final file), and as a workaround (SPARK-28025 [3]) Spark tries to delete the crc file, so two additional operations (exists -> delete) may occur per crc file. Hope this helps. Thanks, Jungtaek Lim

Re: Excessive disk IO with Spark structured streaming

2020-10-05 Thread Jungtaek Lim
en on the next cycle Spark/Hadoop does > not re-use the knowledge of a previously found utility location, and > repeats the search from the very start causing useless file system search > operations over and over again. > > This may or may not matter when HDFS is used for checkpoint store &

Re: [Spark Core] Why no spark.read.delta / df.write.delta?

2020-10-05 Thread Jungtaek Lim
…") > > and in Python: > > from gresearch.spark.dgraph.connector import * > triples = spark.read.dgraph.triples("localhost:9080") > > I agree that 3rd parties should also support the official > spark.read.format() and the new catalog approaches. > > Enrico > > On 05.10.20

Re: Excessive disk IO with Spark structured streaming

2020-10-07 Thread Jungtaek Lim
I can't spend too much time explaining one by one. I strongly encourage you to do a deep-dive instead of just looking around, since you want to know the "details" - that's how open source works. I'll go through a general explanation instead of replying inline; probably I'd write a blog doc if the

Re: States get dropped in Structured Streaming

2020-10-23 Thread Jungtaek Lim
Unfortunately your information doesn't provide any hint as to whether rows in the state are evicted correctly on watermark advance or there's an unknown bug where some rows in state are silently dropped. I haven't heard of a case of the latter - you'd probably like to double check it, focusin

Re: Cannot perform operation after producer has been closed

2020-11-02 Thread Jungtaek Lim
Which Spark version do you use? There's a known issue with the Kafka producer pool in Spark 2.x which was fixed in Spark 3.0, so you'll want to check whether your case is bound to the known issue or not. https://issues.apache.org/jira/browse/SPARK-21869 On Tue, Nov 3, 2020 at 1:53 AM Eric Beabes wrote

Re: Best way to emit custom metrics to Prometheus in spark structured streaming

2020-11-02 Thread Jungtaek Lim
You can try out "Dataset.observe" added in Spark 3, which enables arbitrary metrics to be logged and exposed to streaming query listeners. On Tue, Nov 3, 2020 at 3:25 AM meetwes wrote: > Hi I am looking for the right approach to emit custom metrics for spark > structured streaming job. *Actual S
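A minimal sketch of the API (the metric name and columns are illustrative); the observed values surface per micro-batch in the progress events seen by a StreamingQueryListener:

  // Hedged sketch: attach named aggregates to the streaming Dataset.
  import org.apache.spark.sql.functions.{col, count, lit, sum}

  val observed = streamDf.observe("my_metrics",
    count(lit(1)).as("row_count"),
    sum(col("value")).as("value_sum"))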

Re: Excessive disk IO with Spark structured streaming

2020-11-05 Thread Jungtaek Lim
FYI, SPARK-30294 has been merged and will be available in Spark 3.1.0. This halves the number of temp files for the state store when you use streaming aggregation. 1. https://issues.apache.org/jira/browse/SPARK-30294 On Thu, Oct 8, 2020 at 11:55 AM Jungtaek Lim wrote: > I can't s

Re: Structured Streaming Checkpoint Error

2020-12-02 Thread Jungtaek Lim
In theory it would work, but it checkpoints very inefficiently. If I understand correctly, it will write the content to a temp file on S3, and the rename then reads the temp file back from S3 and writes its content to the final path on S3. Compared to checkpointing with

Re: Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread Jungtaek Lim
Please refer to my previous answer - https://lists.apache.org/thread.html/r7dfc9e47cd9651fb974f97dde756013fd0b90e49d4f6382d7a3d68f7%40%3Cuser.spark.apache.org%3E Probably we may want to add it to the SS guide doc. We didn't need it before, as it just didn't work with an eventually consistent model, and now it wo

Re: Data source v2 streaming sinks does not support Update mode

2021-01-12 Thread Jungtaek Lim
Which exact Spark version did you use? Did you make sure the version for Spark and the version for spark-sql-kafka artifact are the same? (I asked this because you've said you've used Spark 3.0 but spark-sql-kafka dependency pointed to 3.1.0.) On Tue, Jan 12, 2021 at 11:04 PM Eric Beabes wrote:

Re: Data source v2 streaming sinks does not support Update mode

2021-01-12 Thread Jungtaek Lim
ed 3.1.0. > > On Wed, Jan 13, 2021 at 11:35 AM Jungtaek Lim < > kabhwan.opensou...@gmail.com> wrote: > >> Which exact Spark version did you use? Did you make sure the version for >> Spark and the version for spark-sql-kafka artifact are the same? (I asked >> this b

Re: Data source v2 streaming sinks does not support Update mode

2021-01-18 Thread Jungtaek Lim
…> <groupId>org.apache.spark</groupId> >>>> <artifactId>spark-sql-kafka-0-10_${scala.binary.version}</artifactId> >>>> <version>${spark.version}</version> >>>> >>>> >>>> <groupId>org.slf4j</groupId> >>>> <artifactId>slf4j-log4j12</artifactId> >>>> <version>1.7.7</version>

Re: Data source v2 streaming sinks does not support Update mode

2021-01-18 Thread Jungtaek Lim
ore releasing Spark 3.0.0. Clearing Spark artifacts in your .m2 cache may work. On Tue, Jan 19, 2021 at 8:50 AM Jungtaek Lim wrote: > And also include some test data as well. I quickly looked through the code > and the code may require a specific format of the record. > > On Tue, Jan

Re: Structured Streaming Spark 3.0.1

2021-01-20 Thread Jungtaek Lim
I quickly looked into the attached log in the SO post, and the problem doesn't seem to be related to Kafka. The error stack trace is from checkpointing to GCS, and the implementation of OutputStream for GCS seems to be provided by Google. Could you please elaborate on the stack trace or upload the log

Re: Only one Active task in Spark Structured Streaming application

2021-01-21 Thread Jungtaek Lim
I'm not sure how many people could even guess at possible reasons - I'd say there's not enough information: no driver/executor logs, no job/stage/executor information, no code. On Thu, Jan 21, 2021 at 8:25 PM Jacek Laskowski wrote: > Hi, > > I'd look at stages and jobs as it's possible that the onl

Re: Structured Streaming Spark 3.0.1

2021-01-21 Thread Jungtaek Lim
Looks like it's a driver-side error log; I think the executor log would have many more warning/error logs, probably with stack traces. I'd also suggest excluding external dependencies wherever possible while experimenting/investigating. If you're suspecting Apache Spark, I'd rather say you'll

Re: How to handle spark state which is growing too big even with timeout set.

2021-02-14 Thread Jungtaek Lim
…" which lets users migrate their state from state store provider A to B. The hopeful plan is to support migration between any arbitrary pair of providers. Thanks, Jungtaek Lim (HeartSaVioR) On Thu, Feb 11, 2021 at 5:01 PM Kuttaiah Robin wrote: > Hello, > > I have a use case where I need to r

Re: KafkaUtils module not found on spark 3 pyspark

2021-02-17 Thread Jungtaek Lim
…ave willingness to maintain DStream. Honestly, contributions to DStream have been quite rare.) Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) On Wed, Feb 17, 2021 at 4:19 PM aupres wrote: > I use hadoop 3.3.0 and spark 3.0.1-bin-hadoop3.2. And my python ide is > eclipse version 2020-12. I

Re: [Spark SQL] - Not able to consume Kafka topics

2021-02-18 Thread Jungtaek Lim
(Dropping the Kafka user mailing list, as this is more likely a Spark issue.) Do you have a full stack trace for the log message? It would help make clear where the issue lies. On Thu, Feb 18, 2021 at 8:01 PM Rathore, Yashasvini wrote: > Hello, > > Issues : > > * I and my team are trying to consum

Re: Controlling Spark StateStore retention

2021-02-20 Thread Jungtaek Lim
…ed. Even in the upcoming Spark 3.1, a query having such a pattern is disallowed unless end users explicitly set the config to force it to run. Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) On Sun, Feb 21, 2021 at 8:49 AM Sergey Oboguev wrote: > I am trying to write a Spark Structured Streaming

Re: Structured streaming, Writing Kafka topic to BigQuery table, throws error

2021-02-23 Thread Jungtaek Lim
If your code doesn't require "end to end exactly-once", then you could leverage foreachBatch, which enables you to use a batch sink. If your code requires "end to end exactly-once", then, well, that's a different story. I'm not familiar with BigQuery and have no idea how the sink is implemented, but
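For the non-exactly-once path, a hedged sketch of reusing a batch sink through foreachBatch; the "bigquery" format and its "table" option are assumptions about the connector you ship, not a confirmed API:

  // Hedged sketch: foreachBatch hands each micro-batch to a batch-capable writer.
  import org.apache.spark.sql.DataFrame

  val writeBatch: (DataFrame, Long) => Unit = (batchDf, batchId) => {
    batchDf.write
      .format("bigquery")                     // assumption: connector-provided format
      .option("table", "my_dataset.my_table") // illustrative
      .mode("append")
      .save()
  }
  streamingDf.writeStream.foreachBatch(writeBatch).start()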

Re: Spark 2.3 Stream-Stream Join with left outer join lost left stream value

2021-02-27 Thread Jungtaek Lim
We figured out an edge case in stream-stream left/right outer join in Spark 2.x and fixed it in Spark 3.0.0. Please refer to SPARK-26154 for more details. The fix brought another regression, which was fixed in 3.0.1, so you may want to move to Spark 3.0.1

Re: Spark job crashing - Spark Structured Streaming with Kafka

2021-03-02 Thread Jungtaek Lim
I feel this lacks quite a bit of information. Full stack traces from the driver/executors are essential at least to determine what was happening. On Tue, Mar 2, 2021 at 5:26 PM Sachit Murarka wrote: > Hi All, > > My spark job is crashing (Structured stream). Can anyone help please. I > am using spark 3.0.1

Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Jungtaek Lim
Thanks Hyukjin for driving the huge release, and thanks everyone for contributing to the release! On Wed, Mar 3, 2021 at 6:54 PM angers zhu wrote: > Great work, Hyukjin ! > > Bests, > Angers > > Wenchen Fan wrote on Wed, Mar 3, 2021 at 5:02 PM: >> Great work and congrats! >> >> On Wed, Mar 3, 2021 at 3:51 PM K

Re: Spark job crashing - Spark Structured Streaming with Kafka

2021-03-03 Thread Jungtaek Lim
c480-4d00-866b-0fbd88e9520e, runId = > 8f1f1756-da8d-4983-9f76-dc1af626ad84] > Current Committed Offsets: {} > Current Available Offsets: {KafkaV2[Subscribe[test-topic]]: > {"test-topic":{"0":4628}}} > Current State: ACTIVE > Thread State: RUNNABLE > Logical Plan: > WriteToMicro

Re: FlatMapGroupsWithStateFunction is called thrice - Production use case.

2021-03-11 Thread Jungtaek Lim
…that others aren't interested in your own code even if they are interested in the problematic behavior itself. It'd be nice if you could minimize the hurdle of debugging. Thanks, Jungtaek Lim (HeartSaVioR) On Thu, Mar 11, 2021 at 4:54 PM Kuttaiah Robin wrote: > Hello, > > I have

Re: Detecting latecomer events in Spark structured streaming

2021-03-11 Thread Jungtaek Lim
…"minority", but nothing has come to the conclusion that it's worth the effort. (If your business logic requires it, you could be a hacker and try to deal with this, and share if you succeed in making it.) I'd skip answering the questions, as I explained you'd be stuck even before raising these que

Re: Using Spark as a fail-over platform for Java app

2021-03-12 Thread Jungtaek Lim
That's what resource managers provide for you. So you can code against and deal with resource managers directly, but I assume you're looking for ways to not deal with resource managers and let Spark do it instead. I admit I have no experience here (I did something similar with Apache Storm on a standalone setup 5+ years ag

Re: Spark Structured Streaming and Kafka message schema evolution

2021-03-15 Thread Jungtaek Lim
rage the part of schema you've integrated, and the latest schema is "backward compatible" with the integrated schema. Hope this helps. Thanks Jungtaek Lim (HeartSaVioR) On Mon, Mar 15, 2021 at 9:25 PM Mich Talebzadeh wrote: > This is just a query. > > In general Kafka-connect

Re: Submitting insert query from beeline failing on executor server with java 11

2021-03-16 Thread Jungtaek Lim
Hadoop 2.x doesn't support JDK 11. See Hadoop's Java version compatibility page: https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions That said, you'll need to use Spark 3.x with the Hadoop 3.1 profile to make Spark work with JDK 11. On Tue, Mar 16, 2021 at 10:06 PM Sean Owen w

Re: Submitting insert query from beeline failing on executor server with java 11

2021-03-16 Thread Jungtaek Lim
…layer), but worth knowing in any case that it's not officially supported by the Hadoop community. On Wed, Mar 17, 2021 at 6:54 AM Jungtaek Lim wrote: > Hadoop 2.x doesn't support JDK 11. See Hadoop Java version compatibility > with JDK: > > https://cwiki.apache.org/confluence

Re: ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener EventLoggingListener threw an exception

2021-03-18 Thread Jungtaek Lim
We've fixed the single case for "onJobStart", please check SPARK-34731 [1]. The patch will be available in Spark 3.1.2 / 3.2.0, but if someone reports the same for lower version lines I think we could port back to lower version lines as well. 1. https://issues.apache.org/jira/browse/SPARK-34731 O

Re: How to generate unique incrementing identifier in a structured streaming dataframe

2021-07-13 Thread Jungtaek Lim
Theoretically, a value composed of batchId + monotonically_increasing_id() would achieve the goal. The major downside is that you'll need to deal with "deduplication" of output based on batchId, as monotonically_increasing_id() is nondeterministic. You need to ensure there's NO overlap on output ag
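A hedged sketch of that composition (column names and output path are illustrative); the deduplication duty described above stays with the writer:

  // Hedged sketch: compose batchId with monotonically_increasing_id() inside
  // foreachBatch. The id is NOT stable across recomputation, so output must be
  // written idempotently per batchId (e.g. overwrite-by-batch, as below).
  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.{concat_ws, lit, monotonically_increasing_id}

  val writeWithIds: (DataFrame, Long) => Unit = (batchDf, batchId) => {
    val withId = batchDf.withColumn("uid",
      concat_ws("-", lit(batchId), monotonically_increasing_id()))
    withId.write.mode("overwrite").parquet(s"/out/batch=$batchId") // illustrative path
  }
  streamingDf.writeStream.foreachBatch(writeWithIds).start()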

Re: Appending a static dataframe to a stream create Parquet file fails

2021-09-02 Thread Jungtaek Lim
…will only read the files which were written by the streaming query. There are 3rd-party projects dealing with transactional writes from multiple writers, (alphabetically) Apache Iceberg, Delta Lake, and so on. You may want to check them out. Thanks, Jungtaek Lim (HeartSaVioR) On Thu, Sep 2, 2021 at

Re: Appending a static dataframe to a stream create Parquet file fails

2021-09-06 Thread Jungtaek Lim
from a static dataframe. > > Anyhow, best regards > Eugen > > On Fri, 2021-09-03 at 11:44 +0900, Jungtaek Lim wrote: > > Hi, > > The file stream sink maintains the metadata in the output directory. The > metadata retains the list of files written by the streaming query, a

Re: [ANNOUNCE] Apache Spark 3.2.0

2021-10-19 Thread Jungtaek Lim
Thanks to Gengliang for driving this huge release! On Wed, Oct 20, 2021 at 1:50 AM Dongjoon Hyun wrote: > Thank you so much, Gengliang and all! > > Dongjoon. > > On Tue, Oct 19, 2021 at 8:48 AM Xiao Li wrote: > >> Thank you, Gengliang! >> >> Congrats to our community and all the contributors! >

Re: DataStreamReader cleanSource option

2022-02-03 Thread Jungtaek Lim
1.) If it doesn't help, please turn on the DEBUG log level for the package "org.apache.spark.sql.execution.streaming" and grep the log messages from SourceFileArchiver & SourceFileRemover. Thanks, Jungtaek Lim (HeartSaVioR) On Thu, Jan 27, 2022 at 9:56 PM Gabriela Dvořáková wrot

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-19 Thread Jungtaek Lim
I have no context on ML, but your "streaming" query exposes the possibility of memory issues. flattenedNER.registerTempTable("df") >>> >>> >>> querySelect = "SELECT col as entity, avg(sentiment) as sentiment, count(col) as count FROM df GROUP BY col" >>> finalDF = spark.sql(querySele

Re: Creating a Spark 3 Connector

2022-11-23 Thread Jungtaek Lim
onnectors. It's encouraged to look at reference implementations like Kafka and understand interfaces. Each interface has its own documentation so it will guide you to implement your own. Please post any question on dev@ mailing list if you have doubts or are stuck with implementing it.

Re: [ANNOUNCE] Apache Spark 3.2.3 released

2022-11-30 Thread Jungtaek Lim
Thanks Chao for driving the release! On Wed, Nov 30, 2022 at 6:03 PM Wenchen Fan wrote: > Thanks, Chao! > > On Wed, Nov 30, 2022 at 1:33 AM Chao Sun wrote: > >> We are happy to announce the availability of Apache Spark 3.2.3! >> >> Spark 3.2.3 is a maintenance release containing stability fixes

Re: Create a Jira account

2022-12-01 Thread Jungtaek Lim
There is a guide on the page - send the request mail to priv...@spark.apache.org. On Thu, Dec 1, 2022 at 10:07 PM ideal wrote: > hello > i need to open a Jira ticket for spark about thrift server operation > log output is empty. but i do not have an ASF Jira account. recently Infra > ended

Re: Slack for PySpark users

2023-03-30 Thread Jungtaek Lim
I'm reading through the page "Briefing: The Apache Way", and in the section of "Open Communications", restriction of communication inside ASF INFRA (mailing list) is more about code and decision-making. https://www.apache.org/theapacheway/#what-makes-the-apache-way-so-hard-to-define It's unavoidab

Re: Slack for PySpark users

2023-04-03 Thread Jungtaek Lim
fficial Slack channel has 602 subscribers. > > May I ask if the users prefer to use the ASF Official Slack channel > than the user mailing list? > > Dongjoon. > > > > On Thu, Mar 30, 2023 at 9:10 PM Jungtaek Lim > wrote: > >> I'm reading through the pa

Re: Slack for PySpark users

2023-04-03 Thread Jungtaek Lim
there. On Tue, Apr 4, 2023 at 7:04 AM Jungtaek Lim wrote: > The number of subscribers doesn't give any meaningful value. Please look > into the number of mails being sent to the list. > > https://lists.apache.org/list.html?user@spark.apache.org > The latest month there were m

Re: [Structured Streaming] Joins after aggregation don't work in streaming

2023-10-26 Thread Jungtaek Lim
Hi, we are aware of your ticket and plan to look into it. We can't give an ETA, but just wanted to let you know that we are going to look into it. Thanks for reporting! Thanks, Jungtaek Lim (HeartSaVioR) On Fri, Oct 27, 2023 at 5:22 AM Andrzej Zera wrote: > Hey All, > >

Re: How exactly does dropDuplicatesWithinWatermark work?

2023-11-21 Thread Jungtaek Lim
…is new API will ensure that these duplicated writes are deduplicated once users provide the max distance in time (max - min) among duplicated events as the delay threshold of the watermark. Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) On Mon, Nov 20, 2023 at 10:18 AM Perfect Stranger wrote: > Hel
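A minimal sketch of the API (available from Spark 3.5; column names and the threshold are illustrative):

  // Hedged sketch: the watermark delay should cover the max event-time distance
  // between duplicates of the same record.
  val deduped = events
    .withWatermark("eventTime", "10 minutes")
    .dropDuplicatesWithinWatermark("eventId")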

Re: Spark structured streaming tab is missing from spark web UI

2023-11-24 Thread Jungtaek Lim
The feature was added in Spark 3.0. Btw, you may want to check out the EOL date for Apache Spark releases - https://endoflife.date/apache-spark 2.x is already EOLed. On Fri, Nov 24, 2023 at 11:13 PM mallesh j wrote: > Hi Team, > > I am trying to test the performance of a spark streaming applica

Re: [Structured Streaming] Avoid one microbatch delay with multiple stateful operations

2024-01-11 Thread Jungtaek Lim
more late events to be accepted. Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) On Thu, Jan 11, 2024 at 6:13 AM Andrzej Zera wrote: > I'm struggling with the following issue in Spark >=3.4, related to > multiple stateful operations. > > When spark.sql.streaming.s

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-11 Thread Jungtaek Lim
If you use the RocksDB state store provider, you can turn on changelog checkpointing to put a single changelog file per partition per batch. With changelog checkpointing disabled, Spark uploads newly created SST files and some log files, and if compaction has happened, most SST files have to be re-uploaded. U
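A hedged sketch of enabling that combination (config keys as of Spark 3.4; verify them against your version's docs):

  // Hedged sketch: RocksDB state store with changelog checkpointing, so each
  // batch uploads a small changelog instead of re-uploading SST files.
  spark.conf.set("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
  spark.conf.set(
    "spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled", "true")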

Re: Issue in Creating Temp_view in databricks and using spark.sql().

2024-01-31 Thread Jungtaek Lim
Hi, a streaming query clones the spark session - when you create a temp view from a DataFrame, the temp view is created under the cloned session. You will need to use micro_batch_df.sparkSession to access the cloned session. Thanks, Jungtaek Lim (HeartSaVioR) On Wed, Jan 31, 2024 at 3:29 PM
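The thread is PySpark, but the same shape in Scala, as a minimal sketch (view name and query are illustrative):

  // Hedged sketch: register the view on the micro-batch DataFrame's (cloned)
  // session and run SQL through that same session.
  import org.apache.spark.sql.DataFrame

  val runSql: (DataFrame, Long) => Unit = (batchDf, batchId) => {
    batchDf.createOrReplaceTempView("updates") // lands in the cloned session
    batchDf.sparkSession.sql("SELECT count(*) FROM updates").show()
  }
  streamingDf.writeStream.foreachBatch(runSql).start()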

[ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread Jungtaek Lim
…you. Jungtaek Lim ps. Yikun is helping us with releasing the official docker image for Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally available.

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Jungtaek Lim
s to update the version. (For automatic bumping I don't have a good idea.) I'll look into it. Please expect some delay during the holiday weekend in S. Korea. Thanks again. Jungtaek Lim (HeartSaVioR) On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun wrote: > BTW, Jungtaek. > > PySp

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-03 Thread Jungtaek Lim
…0 versions? What are the criteria for pruning versions? Unless we have good answers to these questions, I think it's better to revert the functionality - it missed various considerations. On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim wrote: > Thanks for reporting - this is odd - the dr

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Jungtaek Lim
ark/pull/42428? > > cc @Yang,Jie(INF) > > On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim > wrote: > >> Shall we revisit this functionality? The API doc is built with individual >> versions, and for each individual version we depend on other released >> versions. Thi

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
> > https://github.com/apache/spark/pull/42881 > > So, we need to manually update this file. I can manually submit an update > first to get this feature working. > -- > *From:* Jungtaek Lim > *Sent:* March 4, 2024 6:34:42 > *To:* Dongjoon Hyun > *

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
…it's > also possible. > > Only by sharing the same version.json file in each version. > -- > *From:* Jungtaek Lim > *Sent:* March 5, 2024 16:47:30 > *To:* Pan,Bingkun > *Cc:* Dongjoon Hyun; dev; user > *Subject:* Re: [ANNOUNCE] Apache Spark

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
…y, I see. > > Perhaps we can solve this confusion by sharing the same file `version.json` > across `all versions` in the `Spark website repo`? Make each version of > the document display the `same` data in the dropdown menu. > ------ > *From:* Jungtaek Lim > *Sent

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Jungtaek Lim
(removing dev@ as I don't think this is dev@ related thread but more about "question") My understanding is that Apache Spark does not support Materialized View. That's all. IMHO it's not a proper expectation that all operations in Apache Hive will be supported in Apache Spark. They are different p

Re: can we use mapGroupsWithState in raw sql?

2018-04-17 Thread Jungtaek Lim
…regation requires Update/Complete mode but join requires Append mode. (The Structured Streaming guide page clearly explains this limitation: "Cannot use streaming aggregation before joins.") If you can achieve it with mapGroupsWithState, you may want to stick with that. Btw, when you deal with

Re: can we use mapGroupsWithState in raw sql?

2018-04-17 Thread Jungtaek Lim
…even putting all five records together to the socket (by nc), two micro-batches handled the records and provided two results. --- Batch: 0 --- +---+------+------------+ | ID|AMOUNT|MY_TIMESTAMP| +---+------+------------+

Re: can we use mapGroupsWithState in raw sql?

2018-04-17 Thread Jungtaek Lim
P") .groupBy($"ID") .agg(maxrow(col("AMOUNT"), col("MY_TIMESTAMP")).as("maxrow")) .selectExpr("ID", "maxrow.st.AMOUNT", "maxrow.st.MY_TIMESTAMP") .writeStream .format("console") .trigger(Trigger.ProcessingTim

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread Jungtaek Lim
…columnName: String)org.apache.spark.sql.Column (e: org.apache.spark.sql.Column)org.apache.spark.sql.Column cannot be applied to (org.apache.spark.sql.ColumnName, org.apache.spark.sql.Column) Could you check your code to see whether it works with Spark 2.3 (via spark-shell or whatever)? Thanks! Ju

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread Jungtaek Lim
…").agg(max(struct($"AMOUNT", $"*")).as("data")) .select($"data.*") .writeStream .format("console") .trigger(Trigger.ProcessingTime("1 seconds")) .outputMode(OutputMode.Update()) .start() It still has a minor

Re: is it ok to make I/O calls in UDF? other words is it a standard practice ?

2018-04-24 Thread Jungtaek Lim
Another thing you may want to be aware of: if the call's result is not idempotent, your query result is also not idempotent, since for fault tolerance there's a chance for a record (row) to be replayed (recomputed). -Jungtaek Lim (HeartSaVioR) On Tue, Apr 24, 2018 at 2:07 PM, Jörn Franke wrote: > What is

Re: OOM: Structured Streaming aggregation state not cleaned up properly

2018-05-22 Thread Jungtaek Lim
1. Could you share your Spark version? 2. Could you reduce "spark.sql.ui.retainedExecutions" and see whether it helps? This configuration is available in 2.3.0, and the default value is 1000. Thanks, Jungtaek Lim (HeartSaVioR) On Tue, May 22, 2018 at 4:29 PM, weand wrote: > You can see it e

Re: OOM: Structured Streaming aggregation state not cleaned up properly

2018-05-23 Thread Jungtaek Lim
The issue looks to be fixed in https://issues.apache.org/jira/browse/SPARK-23670, and 2.3.1 will likely include the fix. -Jungtaek Lim (HeartSaVioR) On Wed, May 23, 2018 at 7:12 PM, weand wrote: > Thanks for clarification. So it really seems a Spark UI OOM Issue. > > After setting:

Re: RepartitionByKey Behavior

2018-06-21 Thread Jungtaek Lim
It is not possible because the cardinality of the partitioning key is non-deterministic, while the partition count should be fixed. There's a chance that cardinality > partition count, and then the system can't ensure the requirement. Thanks, Jungtaek Lim (HeartSaVioR) On Jun 22, 2018

Re: [Spark Structured Streaming] Measure metrics from CsvSink for Rate source

2018-06-21 Thread Jungtaek Lim
…6541967934 1529640103,1.0606060606060606 1529640113,0.9997000899730081 Could you add a streaming query listener and see the values of sources -> numInputRows, inputRowsPerSecond, processedRowsPerSecond? They should provide some valid numbers. Thanks, Jungtaek Lim (HeartSaVioR) On Fri, Jun 22, 2018 at

Re: [Structured Streaming] Metrics or logs of events that are ignored due to watermark

2018-07-02 Thread Jungtaek Lim
…and provide late input rows based on this. So I think this is valuable to address, and I'm planning to try to address it, but it would be OK for someone else to address it earlier. Thanks, Jungtaek Lim (HeartSaVioR) On Tue, Jul 3, 2018 at 3:39 AM, subramgr wrote: > Hi all, > > Do we have som

Re: Error while doing stream-stream inner join (java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access)

2018-07-03 Thread Jungtaek Lim
Could you please describe the version of Spark and how you ran your app? If you don't mind sharing a minimal app which can reproduce this, it would be really great. - Jungtaek Lim (HeartSaVioR) On Mon, 2 Jul 2018 at 7:56 PM kant kodali wrote: > Hi All, > > I get the below error qu

Re: [Structured Streaming] Custom StateStoreProvider

2018-07-12 Thread Jungtaek Lim
Girish, I think reading through the implementation of HDFSBackedStateStoreProvider, as well as the relevant traits, should give you the idea of how to implement a custom one. HDFSBackedStateStoreProvider is not that complicated to read and understand. You just need to deal with your underlying storage engine

Re: groupBy and then coalesce impacts shuffle partitions in unintended way

2018-08-08 Thread Jungtaek Lim
…". Not sure there's any available trick to achieve it without calling repartition. Thanks, Jungtaek Lim (HeartSaVioR) 1. https://github.com/apache/spark/blob/a40806d2bd84e9a0308165f0d6c97e9cf00aa4a3/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2918-L2937 On Thu, Aug 9, 2018 at 5:55 AM

Re: groupBy and then coalesce impacts shuffle partitions in unintended way

2018-08-09 Thread Jungtaek Lim
…ss tasks. We just can't apply coalesce to an individual operator in a narrow dependency. -Jungtaek Lim (HeartSaVioR) On Thu, Aug 9, 2018 at 3:07 PM, Koert Kuipers wrote: > well, an interesting side effect of this is that I can now control the > number of partitions for every shuffle in a dataframe

Re: Structured Streaming doesn't write checkpoint log when I use coalesce

2018-08-09 Thread Jungtaek Lim
$"mod", $"word") .agg(max("value").as("max_value"), min("value").as("min_value"), avg("value").as("avg_value")) .coalesce(8) val query = outDf.writeStream .format("memory") .optio

Re: getting error: value toDF is not a member of Seq[columns]

2018-09-05 Thread Jungtaek Lim
You may need to import implicits from your spark session, like below (the code below is borrowed from https://spark.apache.org/docs/latest/sql-programming-guide.html): import org.apache.spark.sql.SparkSession val spark = SparkSession .builder() .appName("Spark SQL basic example") .config("spark.s

Re: getting error: value toDF is not a member of Seq[columns]

2018-09-05 Thread Jungtaek Lim
…[T] = { DatasetHolder(_sqlContext.createDataset(rdd)) } You can see lots of Encoder implementations in the Scala code. If your type doesn't match any of them, it may not work and you'll need to provide a custom Encoder. -Jungtaek Lim (HeartSaVioR) On Wed, Sep 5, 2018 at 5:24 PM, Mich Talebzadeh wrote: >

Re: getting error: value toDF is not a member of Seq[columns]

2018-09-05 Thread Jungtaek Lim
Sorry, I guess I pasted another method. The code is: implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] = { DatasetHolder(_sqlContext.createDataset(s)) } On Wed, Sep 5, 2018 at 5:30 PM, Jungtaek Lim wrote: > I guess you need to have an encoder for the type of result

Re: getting error: value toDF is not a member of Seq[columns]

2018-09-05 Thread Jungtaek Lim
…-of-org-apache/m-p/29994/highlight/true#M973 And which Spark version do you use? On Wed, Sep 5, 2018 at 5:32 PM, Jungtaek Lim wrote: > Sorry I guess I pasted another method. the code is... > > implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): > DatasetHolder[T] = { >

Re: getting error: value toDF is not a member of Seq[columns]

2018-09-06 Thread Jungtaek Lim
…: string ... 2 more fields] We may need to know the actual types of key, ticker, timeissued, and price from your variables. Jungtaek Lim (HeartSaVioR) On Thu, Sep 6, 2018 at 5:57 PM, Mich Talebzadeh wrote: > I am trying to understand why spark cannot convert a simple comma > separated columns as DF. &
