Appreciate a second opinion – Metadata Analysis of PDF Files

2025-04-16 Thread Mich Talebzadeh
iginal .msg file and metadata screenshots for review. Please feel free to reply or DM if interested. Regards, Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

Re: Spark Shuffle - in kubeflow spark operator installation on k8s

2025-03-31 Thread Mich Talebzadeh
Yes, Apache Celeborn may be useful; you need to do some research though. https://celeborn.apache.org/ Have a look at this link as well: Spark Executor Shuffle Storage Options <https://iomete.com/resources/k8s/spark-executor-shuffle-storage-options> HTH Dr Mich Talebzadeh, Architect | Data S
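
For reference, a minimal sketch of wiring a Spark job to a Celeborn remote shuffle service. The shuffle-manager class and config keys below follow the Apache Celeborn documentation as best I recall, and the master endpoint is a placeholder; verify both against the Celeborn release you deploy.

from pyspark.sql import SparkSession

# Sketch only: verify the shuffle-manager class and config keys
# against your Celeborn version before relying on them.
spark = (
    SparkSession.builder
    .appName("celeborn-shuffle-sketch")
    # Route shuffle writes/reads through the Celeborn client
    .config("spark.shuffle.manager",
            "org.apache.spark.shuffle.celeborn.SparkShuffleManager")
    # Placeholder Celeborn master service address inside the k8s cluster
    .config("spark.celeborn.master.endpoints", "celeborn-master-svc:9097")
    # The built-in external shuffle service is not used with Celeborn
    .config("spark.shuffle.service.enabled", "false")
    .getOrCreate()
)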

Re: Apply pivot only on some columns in pyspark

2025-03-09 Thread Mich Talebzadeh
ur specific requirements in Spark. HTH <https://medium.com/@manutej/mastering-sql-window-functions-guide-e6dc17eb1995#:~:text=Window%20functions%20can%20perform%20a,related%20to%20the%20current%20row.> Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR

Re: Apply pivot only on some columns in pyspark

2025-03-09 Thread Mich Talebzadeh
False) Output root |-- code: integer (nullable = true) |-- AB_amnt: long (nullable = true) |-- AA_amnt: long (nullable = true) |-- AC_amnt: long (nullable = true) |-- load_date: date (nullable = true) ++---+---+---+--+ |code|AB_amnt|AA_amnt|AC_amnt|load_date | ++---+---+---+------+ |1

Re: Apply pivot only on some columns in pyspark

2025-03-09 Thread Mich Talebzadeh
|load_date | ++---+---+---+--+ |1 |12 |22 |11 |2022-01-01| |2 |22 |28 |25 |2022-02-01| ++---+---+---+--+ Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile
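
For readers of this thread, a minimal sketch that reproduces the output quoted above. The column names and values come from the snippet; the long-format input layout is an assumption.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pivot-sketch").getOrCreate()

# Assumed long-format input: one row per (code, type, amnt, load_date)
df = spark.createDataFrame(
    [(1, "AB", 12, "2022-01-01"), (1, "AA", 22, "2022-01-01"), (1, "AC", 11, "2022-01-01"),
     (2, "AB", 22, "2022-02-01"), (2, "AA", 28, "2022-02-01"), (2, "AC", 25, "2022-02-01")],
    ["code", "type", "amnt", "load_date"],
).withColumn("load_date", F.to_date("load_date"))

# Pivot only the 'type' column; code and load_date stay as plain grouping keys
pivoted = (
    df.groupBy("code", "load_date")
      .pivot("type", ["AB", "AA", "AC"])   # restrict pivot values explicitly
      .agg(F.sum("amnt"))
)

# Rename the pivoted columns to the AB_amnt / AA_amnt / AC_amnt style shown above
for c in ["AB", "AA", "AC"]:
    pivoted = pivoted.withColumnRenamed(c, f"{c}_amnt")

pivoted.show(truncate=False)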

Re: Apache - GSOC'25 projects / Contributions

2025-02-24 Thread Mich Talebzadeh
more informed knowledge. HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Mon, 24 Feb 2025 at 19:13, D. Mohith Akshay wrote: > Hello Everyone, &g

Re: Spark connect: Table caching for global use?

2025-02-16 Thread Mich Talebzadeh
ollect data to it. - Temporary views and caching/persisting are different mechanisms with different memory implications. HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5
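
A short sketch of the distinction drawn above, with assumed names: a temporary view only registers a name for the logical plan, while cache()/persist() materialises the data on the first action and keeps it for reuse.

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("view-vs-cache-sketch").getOrCreate()

df = spark.range(1_000_000)          # stand-in for a real source table

# 1) Temporary view: just a name for the logical plan; nothing is materialised
#    until a query runs against it.
df.createOrReplaceTempView("my_view")

# 2) Cache/persist: materialises the data on the first action and keeps it
#    in executor memory (spilling to disk here) for subsequent reuse.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()                           # first action populates the cache

spark.sql("SELECT COUNT(*) FROM my_view").show()   # re-plans against the view
df.count()                                          # served from the cache
df.unpersist()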

Re: Spark connect: Table caching for global use?

2025-02-16 Thread Mich Talebzadeh
yep. created on driver memory. watch for OOM if the size becomes too large spark-submit --driver-memory 8G ... HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-520

Re: Drop Python 2 support from GraphFrames?

2025-01-31 Thread Mich Talebzadeh
+1 long overdue Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Sat, 1 Feb 2025 at 02:16, Russell Jurney wrote: > Oh, wonderful! That should about

Re: [Spark SQL]: Are SQL User-Defined Functions on the Roadmap?

2025-01-31 Thread Mich Talebzadeh
Hi Frank, I think this would be for the Spark dev team. I have added to the email. HTH Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Fri, 31 Jan 2025

Re: Help choose a GraphFrames logo

2025-01-31 Thread Mich Talebzadeh
Hi Russell, Has this been finalised, as I ticked my preference today (could be too late)? Dr Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Sun, 19 Jan 2025


Re: [Spark Stream]: Batch processing time reduce over time causing Kafka Lag

2025-01-29 Thread Mich Talebzadeh
the entire stream processing pipeline. Let me think about your design and come back. HTH Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Mon, 27 Jan 2025 at

Re: [Spark Stream]: Batch processing time reduce over time causing Kafka Lag

2025-01-26 Thread Mich Talebzadeh
low or skewed tasks that might be overloading specific executors. Optimize these tasks or redistribute them for better load balancing. - Investigate Network Timeouts: Address the root cause of the Kafka connection timeout exceptions to prevent delays during message publishing. HTH Mi

Re: [start-connect-server.sh] connecting with org.apache.spark.deploy.worker.Worker

2025-01-24 Thread Mich Talebzadeh
OK great, it is sorted out. Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> On Fri, 24 Jan 2025 at 15:27, Andrew Petersen wrote: > Thank you Mich > > It

Re: Feature store in bigquery

2025-01-23 Thread Mich Talebzadeh
27; AND '2024-01-01' BETWEEN effective_from AND effective_to; use as_of_date and the effective_from and effective_to range to retrieve the correct feature value for a given date. HTH Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Lin
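
A hedged reconstruction of the point-in-time lookup described above; the table and column names (feature_store, entity_id, feature_value, effective_from, effective_to) are assumptions used only to illustrate the pattern.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("point-in-time-sketch").getOrCreate()

# Hypothetical slowly-changing feature table: one row per validity interval
spark.createDataFrame(
    [("cust_1", 0.42, "2023-06-01", "2023-12-31"),
     ("cust_1", 0.57, "2024-01-01", "2024-06-30")],
    ["entity_id", "feature_value", "effective_from", "effective_to"],
).createOrReplaceTempView("feature_store")

as_of_date = "2024-01-01"

# Pick the feature value that was effective on the as-of date
spark.sql(f"""
    SELECT entity_id, feature_value
    FROM   feature_store
    WHERE  '{as_of_date}' BETWEEN effective_from AND effective_to
""").show(truncate=False)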

Re: [start-connect-server.sh] connecting with org.apache.spark.deploy.worker.Worker

2025-01-23 Thread Mich Talebzadeh
Connect server. Instead, client applications connect to the Spark Connect server, which then interacts with the Spark cluster on their behalf. HTH Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mi

Re: Re: Increasing Shading & Relocating for 4.0

2025-01-19 Thread Mich Talebzadeh
Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> *Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed .

Re: [Spark Core][BlockManager] Spark job fails if blockmgr dirs are cleaned up

2025-01-17 Thread Mich Talebzadeh
. HTH Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London> London, United Kingdom view my Linkedin pr

Re: LLM based data pre-processing

2025-01-04 Thread Mich Talebzadeh
Hi Russell, Spark's GPU scheduling capabilities have improved significantly with the advent of tools like the NVIDIA RAPIDS Accelerator for Spark. The NVIDIA RAPIDS Accelerator for Spark is directly relevant to

Re: LLM based data pre-processing

2025-01-04 Thread Mich Talebzadeh
Let us add some more detail to DFD diagram Data for the Entire Pipeline as attached Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia

Re: LLM based data pre-processing

2025-01-03 Thread Mich Talebzadeh
put. HTH Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London> London, United Kingdom view my Li

Re: Spark 2.4 to Spark 3.5 migration - waiting for HMS

2024-12-05 Thread Mich Talebzadeh
e London <https://en.wikipedia.org/wiki/Imperial_College_London> London, United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information provided is correct to the best of my kno

Re: Spark 2.4 to Spark 3.5 migration - waiting for HMS

2024-12-04 Thread Mich Talebzadeh
calls and improve performance. 2) check this link <https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html> for more info. HTH Mich Talebzadeh, Architect | Data Science | Financial Crime | GDPR & Compliance Specialist PhD <https://en.wikipedia.org/wiki/Doctor_of_Ph

Re: [Spark Structured Streaming] How to delete old data that was created by Spark Structured Streaming?

2024-12-03 Thread Mich Talebzadeh
Yes but your SSS job has to be stopped gracefully. Originally I raised this SPIP request https://issues.apache.org/jira/browse/SPARK-42485 Then I requested "Adding pause() method to pyspark.sql.streaming.StreamingQuery" I believe they are still open. HTH Mich Talebzadeh, Archit
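
In the absence of a built-in pause(), a common workaround is to stop the query only once it reports no available data and no active trigger, so the last micro-batch commits to the checkpoint. A sketch with a placeholder rate source and console sink:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("graceful-stop-sketch").getOrCreate()

# Placeholder query: a rate source feeding a console sink
query = (
    spark.readStream.format("rate").option("rowsPerSecond", 10).load()
         .writeStream.format("console")
         .option("checkpointLocation", "/tmp/ckpt_graceful_stop")
         .start()
)

# Stop when the query is idle (no new data, no in-flight trigger), or give up
# after a deadline; with a real source such as Kafka this exits once the feed
# has drained.
deadline = time.time() + 300
while time.time() < deadline:
    status = query.status
    if not status["isDataAvailable"] and not status["isTriggerActive"]:
        break
    time.sleep(10)
query.stop()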

Re: Which shuffle operations trigger AQE and which don't?

2024-11-12 Thread Mich Talebzadeh
n> London, United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed . It is essentia

Re: [ANNOUNCE] Apache Spark 3.4.4 released

2024-10-27 Thread Mich Talebzadeh
Upgraded from Spark 3.4.0 to 3.4.4 Looks good with the following versions I have tested - openjdk 11.0.8 - hadoop-3.1.0 - hive-3.1.1 - hbase-1.2.6 - GoogleBigQuery with spark-3.4-bigquery-0.41.0.jar HTH Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime

Re: Issue with VAE Model - KerasTensor Incompatibility with TensorFlow Functions

2024-10-21 Thread Mich Talebzadeh
? Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London> London, United Kingdom view my Linkedin profile <https://www.

Issue with VAE Model - KerasTensor Incompatibility with TensorFlow Functions

2024-10-21 Thread Mich Talebzadeh
inside appropriate Keras layers. It is becoming very time consuming. If anyone has faced a similar issue or has recommendations on best practices for handling it, I would appreciate it. Thanks Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia

Re: Spark Docker image with added packages

2024-10-15 Thread Mich Talebzadeh
/CD pipelines are all equally useful for managing Spark JARs. HTH Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London> Lo

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-05 Thread Mich Talebzadeh
business and technical realities. HTH Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London> London, United Kingdom view

Re: Structured Streaming and Spark Connect

2024-09-23 Thread Mich Talebzadeh
<https://www.linkedin.com/pulse/building-event-driven-real-time-data-processor-spark-mich-zy3ef/?trackingId=RIwY%2FePi0jslLiXqOP8mxQ%3D%3D> HTH, Mich Talebzadeh Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Impe

Compatibility Issue: Spark 3.5.2 Schema Recognition vs. Spark 3.4.0 with Hive Metastore (Case Sensitivity)

2024-09-21 Thread Mich Talebzadeh
, particularly with metastore interactions and case sensitivity. HTH Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London> London, Un

Re: Question about Releases and EOL

2024-08-29 Thread Mich Talebzadeh
ement declaring Spark 2.4.0 as the final minor release, the fact that 2.4.8 is still being maintained suggests it might be an LTS release. This is likely due to its continued usage? HTH Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.

Re: Redundant(?) shuffle after join

2024-08-16 Thread Mich Talebzadeh
determine if it was influencing the shuffle. - Force Specific Partitioning: Repartition the DataFrame explicitly by key_co` before applying the window function to see if this prevents the shuffle. HTH Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.
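
A sketch of the "force specific partitioning" suggestion: repartition by the same key the window uses so the window operator can reuse that distribution instead of adding another shuffle. key_col comes from the thread; the data and window spec are illustrative.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("repartition-before-window").getOrCreate()

# Stand-in for the joined DataFrame discussed in the thread
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key_col", "value"])

# Explicitly partition by the window key before applying the window function
df = df.repartition("key_col")

w = Window.partitionBy("key_col").orderBy("value")
df = df.withColumn("rn", F.row_number().over(w))

df.explain()   # check the plan: ideally a single exchange hash-partitioned on key_col
df.show()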

Re: Redundant(?) shuffle after join

2024-08-15 Thread Mich Talebzadeh
F df = spark.table("your_bucketed_table") df = df.withColumn("approx_count", F.approx_count_distinct("key_col")) df.groupBy("key_col").agg(F.avg("approx_count").alias("avg_count")).show() HTH Mich Talebzadeh, Architect | Data Engineer |

Re: dynamically infer json data not working as expected

2024-08-05 Thread Mich Talebzadeh
I gave an answer in SO HTH Mich Talebzadeh, Architect | Data Engineer | Data Science | Financial Crime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London> London, United Kingdom view my Link

Feature Engineering for Data Engineers: Building Blocks for ML Success

2024-08-03 Thread Mich Talebzadeh
dIn <https://www.linkedin.com/pulse/feature-engineering-data-engineers-building-blocks-ml-mich-ektwe/> Mich Talebzadeh, Architect | Data Engineer | Data Science | Writer PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_Colle

Re: [Issue] Spark SQL - broadcast failure

2024-07-16 Thread Mich Talebzadeh
It will help if you mention the Spark version and the piece of problematic code HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia

Re: Help in understanding Exchange in Spark UI

2024-06-20 Thread Mich Talebzadeh
OK, I gave an answer in StackOverflow. Happy reading Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London> Lo

Re: Update mode in spark structured streaming

2024-06-15 Thread Mich Talebzadeh
|4 |8 |2024-06-15 16:26:23.642| |9 |9 |2024-06-15 16:21:23.642| |5 |5 |2024-06-15 16:17:23.642| |1 |2 |2024-06-15 16:23:23.642| |3 |6 |2024-06-15 16:25:23.642| |6 |6 |2024-06-15 16:18:23.642| |7 |7 |2024-06-15 16:19:23.642| +---+

Re: Re: OOM issue in Spark Driver

2024-06-11 Thread Mich Talebzadeh
GUI HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime PhD Imperial College London London, United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh Disclaimer: The information provided is correct to the best of my

Re: Kubernetes cluster: change log4j configuration using uploaded `--files`

2024-06-06 Thread Mich Talebzadeh
configuration file is not found at the time the JVM is looking for it. In summary, you need to ensure the file is in place before the Spark driver or executor JVM starts. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime PhD <https://en.wikipedia.org/w
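
A hedged sketch of the settings involved: ship the logging config with the job and point each JVM at the copy in its container working directory. Paths are placeholders, and the JVM property depends on the log4j version your Spark build uses (log4j 1.x: -Dlog4j.configuration=...; log4j2 in Spark 3.3+: -Dlog4j2.configurationFile=...). Driver-side options must be supplied at submit time, not from inside an already-running driver.

from pyspark.sql import SparkSession

# Sketch only: shown as builder configs to document the keys; in practice pass
# them via spark-submit --conf so they take effect before the driver JVM starts.
spark = (
    SparkSession.builder
    .appName("custom-logging-sketch")
    # Distribute the file to the working directory of driver and executor pods
    .config("spark.files", "/local/path/log4j2.properties")
    # Point each JVM at the copy it will find in its working directory
    .config("spark.driver.extraJavaOptions",
            "-Dlog4j2.configurationFile=file:./log4j2.properties")
    .config("spark.executor.extraJavaOptions",
            "-Dlog4j2.configurationFile=file:./log4j2.properties")
    .getOrCreate()
)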

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Mich Talebzadeh
mit(load_segment, segment) for segment in segments] for future in as_completed(futures): try: df_segment = future.result() all_dfs.append(df_segment) except Exception as e: print(f"Error: {e}") # Union all DataFrames into a single Dat

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Mich Talebzadeh
Process df_segment as needed except Exception as e: print(f"Error: {e}") ThreadPoolExecutor enables parallel execution of tasks using multiple threads. Each thread can be responsible for loading a segment of the data. HTH Mich Talebzadeh, Technologist | Architect | Data Eng
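
A fuller sketch of the pattern outlined above, with hypothetical connection details and segment predicates; each thread submits one JDBC read and the segment DataFrames are unioned at the end.

from concurrent.futures import ThreadPoolExecutor, as_completed
from functools import reduce
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.appName("parallel-jdbc-sketch").getOrCreate()

jdbc_url = "jdbc:postgresql://dbhost:5432/mydb"          # hypothetical
props = {"user": "app_user", "password": "***",
         "driver": "org.postgresql.Driver"}

# Hypothetical segment predicates over an id column
segments = ["id BETWEEN 1 AND 100000",
            "id BETWEEN 100001 AND 200000",
            "id BETWEEN 200001 AND 300000"]

def load_segment(predicate: str) -> DataFrame:
    # Push the predicate down as a derived table so each thread reads one slice
    query = f"(SELECT * FROM big_table WHERE {predicate}) AS t"
    return spark.read.jdbc(url=jdbc_url, table=query, properties=props)

all_dfs = []
with ThreadPoolExecutor(max_workers=len(segments)) as pool:
    futures = [pool.submit(load_segment, s) for s in segments]
    for future in as_completed(futures):
        try:
            all_dfs.append(future.result())
        except Exception as e:
            print(f"Error: {e}")

# Union all segment DataFrames into a single DataFrame
df = reduce(DataFrame.unionByName, all_dfs)
print(df.count())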

Re: [s3a] Spark is not reading s3 object content

2024-05-31 Thread Mich Talebzadeh
Tell Spark to read from a single file data = spark.read.text("s3a://test-bucket/testfile.csv") This clarifies to Spark that you are dealing with a single file and avoids any bucket-like interpretation. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Mich Talebzadeh
ot be essential for your current operation, and Spark's attempt to read it could be causing the issue. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Mich Talebzadeh
. Review Code: 3. Check Spark UI: HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London> London, United Kingdom

Re: OOM concern

2024-05-28 Thread Mich Talebzadeh
to handle the data transfer rate, regardless of the service you choose. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_Col

Re: OOM concern

2024-05-28 Thread Mich Talebzadeh
linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed . It is essential to note that, as with any advice, quote "one test result is wo

Re: [Spark SQL]: Does Spark support processing records with timestamp NULL in stateful streaming?

2024-05-27 Thread Mich Talebzadeh
it myself, so I cannot provide a definitive answer. Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London> London, Un

Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Mich Talebzadeh
memory usage. 5. Spark UI Monitoring: Utilize the Spark UI to monitor memory usage throughout your job execution and identify potential memory bottlenecks. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin

Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
ts shuffle data when caching is involved. Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh Disclaimer: The information provided is correct to the best of my kno

Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
UI's display, not necessarily a bug in the Spark framework itself. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh Disclaimer: The information provid

Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
actual number of records processed. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Discl

Re: Can Spark Catalog Perform Multimodal Database Query Analysis

2024-05-24 Thread Mich Talebzadeh
VIEW hive1_table1 AS SELECT * FROM hive1.table1") spark.sql("CREATE TEMPORARY VIEW hive2_table2 AS SELECT * FROM hive2.table2") spark.sql("CREATE TEMPORARY VIEW mysql_table1 AS SELECT * FROM mysql.table1") HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | G

Re: [s3a] Spark is not reading s3 object content

2024-05-23 Thread Mich Talebzadeh
; ") \ # ensure this is apace .csv("s3a://input/testfile.csv") # Show the data df.show(n=1) except AnalysisException as e: print(f"AnalysisException: {e}") except Exception as e: print(f"Error: {e}") finally: # Stop the Spark sessi

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
storage you had with DStreams.* HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disc

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
\ .start() HTH <https://www.linkedin.com/pulse/processing-change-data-capture-spark-structured-talebzadeh-ph-d-/> <https://www.linkedin.com/pulse/processing-change-data-capture-spark-structured-talebzadeh-ph-d-/> Mich Talebzadeh, Technologist | Architect | Data Engineer

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
t a direct equivalent of DStream HasOffsetRanges in Spark Structured Streaming. However, Structured Streaming provides mechanisms to achieve similar functionality: HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin prof

Re: A handy tool called spark-column-analyser

2024-05-21 Thread Mich Talebzadeh
;: "string", "null_count": 21921, "null_percentage": 23.48, "distinct_count": 38726, "distinct_percentage": 41.49 } } Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United King

A handy tool called spark-column-analyser

2024-05-21 Thread Mich Talebzadeh
umn_analyzer.git>* The details are in the attached README file. Let me know what you think! Feedback is always welcome. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile <https://www.linkedin.com/i

Re: pyspark dataframe join with two different data type

2024-05-14 Thread Mich Talebzadeh
a| | b| | d| +---+ df_1: +-+ | data| +-+ |[a, b, c]| | []| +-+ Result: ++ |data| ++ | a| | b| ++ HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin

Re: [Spark Streaming]: Save the records that are dropped by watermarking in spark structured streaming

2024-05-08 Thread Mich Talebzadeh
utes", "5 minutes")). \ avg('temperature') - Write to Sink: Write the filtered records (dropped records) to a separate Kafka topic. - Consume and Store: Consume the dropped records topic with another streaming job and store them in a Postgres t

Re: ********Spark streaming issue to Elastic data**********

2024-05-06 Thread Mich Talebzadeh
think about another way and revert HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disc

Spark Materialized Views: Improve Query Performance and Data Management

2024-05-03 Thread Mich Talebzadeh
at the ticket and add your comments. Thanks Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh Disclaimer: The information provided is correct to the best

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
nCrime London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed . It is essentia

Re: ********Spark streaming issue to Elastic data**********

2024-05-03 Thread Mich Talebzadeh
My recommendation is that using materialized views (MVs) created in Hive with Spark Structured Streaming and Change Data Capture (CDC) is a good combination for efficiently streaming view data updates in your scenario. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
t of was that using materialized views with Spark Structured Streaming and Change Data Capture (CDC) is a potential solution for efficiently streaming view data updates in this scenario. Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom

Issue with Materialized Views in Spark SQL

2024-05-02 Thread Mich Talebzadeh
similar issue or if there are any insights into why this discrepancy exists between Spark SQL and Hive. Thanks Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzade

Re: [spark-graphframes]: Generating incorrect edges

2024-05-01 Thread Mich Talebzadeh
monotonically_increasing_id() sequence might restart from the beginning. This could again cause duplicate IDs if other Spark applications are running concurrently or if data is processed across multiple runs of the same application. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer

Re: spark.sql.shuffle.partitions=auto

2024-04-30 Thread Mich Talebzadeh
spark.sql.shuffle.partitions=auto is not honoured by vanilla Apache Spark. This configuration option is specific to Databricks, with their managed Spark offering; it allows Databricks to automatically determine an optimal number of shuffle partitions for your workload. HTH Mich Talebzadeh
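
For open-source Spark the nearest equivalents are a fixed setting plus Adaptive Query Execution's partition coalescing; a sketch with illustrative values:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-partitions-sketch")
    # Fixed upper bound on shuffle partitions (the default is 200)
    .config("spark.sql.shuffle.partitions", "400")
    # Let AQE merge small shuffle partitions at runtime
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)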

Re: Spark on Kubernetes

2024-04-30 Thread Mich Talebzadeh
tors HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information prov

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Mich Talebzadeh
are limited (memory, CPU). b) Data Skew: Uneven distribution of values in certain columns could lead to imbalanced processing across machines. Check Spark UI (4040) on staging and execution tabs HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London Uni

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Mich Talebzadeh
creation? Are joins matching columns correctly? 4) Specific Edge Issues: Can you share examples of vertex IDs with incorrect connections? Is this related to ID generation or edge creation logic? HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI, FinCrime London United

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Mich Talebzadeh
Interesting. My concern is the infinite loop in *foreachRDD*: the *while(true)* loop within foreachRDD creates an infinite loop within each Spark executor. This might not be the most efficient approach, especially since offsets are committed asynchronously. HTH Mich Talebzadeh, Technologist

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-13 Thread Mich Talebzadeh
ssages. HTH Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information

Spark column headings, camelCase or snake case?

2024-04-11 Thread Mich Talebzadeh
| Now I recently saw a note (if I recall correctly) that Spark should be using camelCase in new Spark-related documents. What are the accepted views, or does it matter? Thanks Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread Mich Talebzadeh
ROM streaming_df GROUP BY provinceId, window(start, '1 hour', '30 minutes') ORDER BY window.start """ # Write the aggregated results to Kafka sink stream = session.sql(query) \ .writeStream \ .format("kafka") \ .option("checkpointLocation", "

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Mich Talebzadeh
, numBytes) => host }.toArray } } HTH Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *

Re: How to get db related metrics when use spark jdbc to read db table?

2024-04-08 Thread Mich Talebzadeh
with your work. say with Oracle (as an example), utilise tools like OEM, VM StatPack, SQL*Plus scripts etc or third-party monitoring tools to collect detailed database health metrics directly from the Oracle database server. HTH Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer

Re: External Spark shuffle service for k8s

2024-04-08 Thread Mich Talebzadeh
anks Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information provided is cor

Re: Idiomatic way to rate-limit streaming sources to avoid OutOfMemoryError?

2024-04-07 Thread Mich Talebzadeh
ta within the trigger interval, preventing backlogs and potential OOM issues. >From Spark UI, look at the streaming tab. There are various statistics there. In general your Processing Time has to be less than your batch interval. The scheduling Delay and Total Delay are additional indicato
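
A sketch of capping per-trigger intake from Kafka so processing time stays below the trigger interval; brokers, topic and the cap are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rate-limit-sketch").getOrCreate()

stream = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "events")
         # Cap how many Kafka offsets one micro-batch may consume,
         # so a backlog cannot overwhelm a single trigger.
         .option("maxOffsetsPerTrigger", 50000)
         .load()
)

query = (
    stream.writeStream.format("console")
          .trigger(processingTime="30 seconds")
          .option("checkpointLocation", "/tmp/ckpt_rate_limit")
          .start()
)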

Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
Thanks Cheng for the heads up. I will have a look. Cheers Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywi

Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
a Kubernetes cluster. They can include these configurations in the Spark application code or pass them as command-line arguments or environment variables during application submission. HTH Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view

Re: External Spark shuffle service for k8s

2024-04-06 Thread Mich Talebzadeh
better performance and scalability for handling larger datasets efficiently. Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

External Spark shuffle service for k8s

2024-04-06 Thread Mich Talebzadeh
file systems come into it. I will be interested in hearing more about any progress on this. Thanks. Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
, "10 seconds") # Watermark on window-generated 'start' # Rest of the code remains the same streaming_df.createOrReplaceTempView("streaming_df") spark.sql(""" SELECT window.start, window.end, provinceId, totalPayAmount FROM streaming_df ORDER BY window.

Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
# Define the watermark here # Create a temporary view from the streaming DataFrame with watermark streaming_df.createOrReplaceTempView("michboy") # Execute SQL queries on the temporary view result_df = (spark.sql(""" SELECT window.start, window.end, provinceId,
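
Putting the snippets in this thread together, a minimal end-to-end sketch: set the watermark on the DataFrame, register a temp view, then aggregate with window() in SQL. The rate source and column names are stand-ins for the real Kafka feed.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-watermark-sketch").getOrCreate()

# Stand-in source with an event-time column, a provinceId and a payAmount
streaming_df = (
    spark.readStream.format("rate").option("rowsPerSecond", 5).load()
         .select(F.col("timestamp").alias("eventTime"),
                 (F.col("value") % 10).alias("provinceId"),
                 (F.col("value") * 1.5).alias("payAmount"))
         .withWatermark("eventTime", "10 seconds")   # define the watermark here
)

# The watermark travels with the plan into the temporary view
streaming_df.createOrReplaceTempView("streaming_df")

result_df = spark.sql("""
    SELECT window.start, window.end, provinceId,
           SUM(payAmount) AS totalPayAmount
    FROM streaming_df
    GROUP BY provinceId, window(eventTime, '1 hour', '30 minutes')
""")

query = (
    result_df.writeStream.outputMode("update").format("console")
             .option("checkpointLocation", "/tmp/ckpt_sql_watermark")
             .start()
)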

Re: Feature article: Leveraging Generative AI with Apache Spark: Transforming Data Engineering

2024-03-22 Thread Mich Talebzadeh
Sorry, use this link instead: Leveraging Generative AI with Apache Spark: Transforming Data Engineering | LinkedIn <https://www.linkedin.com/pulse/leveraging-generative-ai-apache-spark-transforming-mich-lxbte/?trackingId=aqZMBOg4O1KYRB4Una7NEg%3D%3D> Mich Talebzadeh, Technologist | Data | Generat

Feature article: Leveraging Generative AI with Apache Spark: Transforming Data Engineering

2024-03-22 Thread Mich Talebzadeh
You may find this link of mine in Linkedin for the said article. We can use Linkedin for now. Leveraging Generative AI with Apache Spark: Transforming Data Engineering | LinkedIn Mich Talebzadeh, Technologist | Data | Generative AI | Financial Fraud London United Kingdom view my Linkedin

Re:

2024-03-21 Thread Mich Talebzadeh
g("MDVariables.targetDataset"), config.getString("MDVariables.targetTable")) df.unpersist() // println("wrote to DB") } else { println("DataFrame df is empty") } } If the DataFrame is empty, it prints a message indicating that the DataFrame is empty. You

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Mich Talebzadeh
n entertain this idea. They seem to have a well defined structure for hosting topics. Let me know your thoughts Thanks <https://community.databricks.com/t5/knowledge-sharing-hub/bd-p/Knowledge-Sharing-Hub> Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kin

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
the information (topics) are provided as best efforts and cannot be guaranteed. Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywi

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
- Databricks <https://community.databricks.com/t5/knowledge-sharing-hub/bd-p/Knowledge-Sharing-Hub> Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/&

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
+1 for me Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information provided is correct

Re: [GraphX]: Prevent recomputation of DAG

2024-03-18 Thread Mich Talebzadeh
iling tools like Spark UI or third-party libraries for this purpose. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh Disclaimer: The information provided is correct to the be

A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
uld not be that difficult. If anyone is supportive of this proposal, let the usual +1, 0, -1 decide HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh Disclaimer: The informat

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Mich Talebzadeh
pipelining transformations and removing unnecessary computations. "I may need something like that for synthetic data for testing. Any way to do that?" Have a look at this: https://github.com/joke2k/faker <https://github.com/joke2k/faker> HTH Mich Talebzadeh, Dad | Technologist | Sol

Python library that generates fake data using Faker

2024-03-16 Thread Mich Talebzadeh
fraudulent transactions to build a machine learning model to detect fraudulent transactions using PySpark's MLlib library. You can install it via pip install Faker Details from https://github.com/joke2k/faker HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London U
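
A small sketch of turning Faker output into a PySpark DataFrame for testing; the field names and row count are arbitrary.

from faker import Faker
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("faker-sketch").getOrCreate()
fake = Faker()

# Generate synthetic transaction-like rows on the driver
rows = [
    (fake.uuid4(), fake.name(),
     float(fake.pydecimal(left_digits=4, right_digits=2, positive=True)),
     fake.date_this_year().isoformat())
    for _ in range(1000)
]

df = spark.createDataFrame(rows, ["txn_id", "customer", "amount", "txn_date"])
df.show(5, truncate=False)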

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Mich Talebzadeh
You can get additional info from Spark UI default port 4040 tabs like SQL and executors - Spark uses Catalyst optimiser for efficient execution plans. df.explain("extended") shows both logical and physical plans HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect |
