Hi,
Just need some advice.
- When we have multiple Spark nodes running code, under what conditions does a
repartition make sense?
- Can we repartition and cache the result --> df = spark.sql("select from
...").repartition(4).cache()
- If we choose a repartition(4), will that repartition ap
Hi gurus,
I have knowledge of Java and Scala, and good enough knowledge of Spark, Spark SQL
and Spark functional programming with Scala.
I have started using Python with Spark (PySpark).
Wondering, in order to be proficient in PySpark, how much knowledge of
Python programming is needed? I know the a
Hello,
When we design a typical Spark streaming process, the focus is on gathering
functional requirements.
However, I have been asked to provide non-functional requirements as well.
Likely things I can consider are fault tolerance and reliability (component
failures). Is there a standard list of no
17:16, ashok34...@yahoo.com.INVALID
wrote:
Hello,
When we design a typical Spark streaming process, the focus is on gathering
functional requirements.
However, I have been asked to provide non-functional requirements as well.
Likely things I can consider are fault tolerance and reliability (comp
Greetings,
This is a scenario for which we need to come up with comprehensive answers,
please.
Say we have 6 Spark VMs, each running two executors via spark-submit.
- we have two VM failures at the hardware level (a rack failure)
- we lose 4 of the 12 Spark executors
- this happens halfway
idempotent; i.e., rerunning them shouldn't
change the outcome. Streaming jobs have checkpointing, and they will start from
the last microbatch. This means that they might have to repeat the last
microbatch.
From: "ashok34...@yahoo.com.INVALID"
Date: Friday, June 25, 2021 at 10:38 AM
Hello team
Someone asked me about comparing well-developed Python code using Pandas
DataFrames with PySpark.
In what situations would one choose PySpark instead of Python and Pandas?
Appreciate
AK
Gurus,
I have an RDD in PySpark that I can convert to DF through
df = rdd.toDF()
However, when I do
df.printSchema()
I see the columns as nullable = true by default:
root
 |-- COL-1: long (nullable = true)
 |-- COl-2: double (nullable = true)
 |-- COl-3: string (nullable = true)
What would be the e
arise from relying
on this email's technical content is explicitly disclaimed. The author will in
no case be liable for any monetary damages arising from such loss, damage or
destruction.
On Thu, 14 Oct 2021 at 12:50, ashok34...@yahoo.com.INVALID
wrote:
Gurus,
I have an RDD in PySpa
Hi gurus,
I am trying to understand the role of Spark in an event-driven architecture. I
know Spark deals with massive parallel processing. However, does Spark follow an
event-driven architecture like Kafka as well? Say, handling producers, filtering
and pushing the events to consumers like database
Hello,
I know some operators in Spark are expensive because of shuffle.
This document describes shuffle:
https://www.educba.com/spark-shuffle/
and says: "More shufflings in numbers are not always bad. Memory constraints and
other impossibilities can be overcome by shuffling."
In RDD, the below are a
Thanks Mich. Very insightful.
AK
On Monday, 14 February 2022, 11:18:19 GMT, Mich Talebzadeh
wrote:
Good question. However, we ought to look at what options we have, so to speak.
Let us consider Spark on Dataproc, Spark on Kubernetes and Spark on Dataflow.
Spark on Dataproc is proven a
Thanks for all these useful info
Hi all
What is the current trend? Is it Spark with Scala in IntelliJ, or Spark with
Python in PyCharm?
I am curious because I have moderate experience with Spark on both Scala and
Python, and want to focus on Scala OR Python going forward with the intention of
jo
Hi,
Worth checking this link
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
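For reference, the dynamic-allocation settings described on that page are passed like this on spark-submit (the executor counts and the application name are placeholders to adapt):

```shell
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  my_app.py
```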
On Saturday, 28 January 2023 at 06:18:28 GMT, Lingzhe Sun
wrote:
Hello gurus,
Does Spark arrange online webinars for special topics like Spark on K8s, data
science and Spark Structured Streaming?
I would be most grateful if experts could share their experience with learners
with intermediate knowledge like myself. Hopefully we will find the practical
experienc
On Tue, 7 Mar 2023 at 19:17, ashok34...@yahoo.com.INVALID
wrote:
Hello gurus,
Does Spark arrange online webinars for special topics like Spark on K8s, data
scienc
Hello team
Is it possible to use a Spark Docker image built on GCP on AWS, without
rebuilding it from scratch on AWS?
Will that work, please?
AK
Hello,
In Spark windowing, can a call with Window().partitionBy() cause a
shuffle to take place?
If so, what is the performance impact, if any, if the resulting data set is large.
Thanks
, 2023, 18:48 ashok34...@yahoo.com.INVALID
wrote:
Hello,
In Spark windowing, can a call with Window().partitionBy() cause a
shuffle to take place?
If so, what is the performance impact, if any, if the resulting data set is large.
Thanks
Hello Mich,
Thank you for providing this useful feedback and these responses.
We appreciate your contribution to this community forum. I myself find your
posts insightful.
+1 for me
Best,
AK
On Wednesday, 6 September 2023 at 18:34:27 BST, Mich Talebzadeh
wrote:
Hi Varun,
In answer t
021836|2023-03-01 04:44:14|
| 84.183.253.20| 7.707176860385722|2021-08-26 23:24:31|
|218.163.165.232| 9.458673015973213|2021-02-22 12:13:15|
| 62.57.20.153|1.5764916247359229|2021-11-06 12:41:59|
| 98.171.202.249| 3.546118349483626|2022-07-05 10:55:26|
|180.140.248.193|0.9512956363005021|2021-
Hello gurus,
I have a Hive table created as below (there are more columns)
CREATE TABLE hive.sample_data (
  incoming_ip STRING,
  time_in TIMESTAMP,
  volume INT
);
Data is stored in that table
In PySpark, I want to select the top 5 incoming IP addresses with the highest
total volume of data tran
Hello team
1) In Spark Structured Streaming, does a commit mean the streaming data has been
delivered to the sink, such as Snowflake?
2) If a sink like Snowflake cannot absorb or digest the streaming data in a
timely manner, will there be an impact on Spark streaming itself?
Thanks
AK
On Sun, 8 Oct 2023 at 19:50, ashok34...@yahoo.com.INVALID
wrote:
Hello team
1) In Spark Structured Streaming does commit mean streaming
Hey Mich,
Thanks for this introduction to your forthcoming proposal "Spark Structured
Streaming and Flask REST API for Real-Time Data Ingestion and Analytics". I
recently came across an article by Databricks titled "Scalable Spark
Structured Streaming for REST API Destinations". Their use cas
Good idea. Will be useful
+1
On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh
wrote:
Some of you may be aware that the Databricks community
have just launched a knowledge sharing hub. I thought it would be a
good idea for the Apache Spark user group to have the sa
Great work. Very handy for identifying problems
thanks
On Tuesday 21 May 2024 at 18:12:15 BST, Mich Talebzadeh
wrote:
A colleague kindly pointed out about giving an example of output, which will be
added to the README.
Doing analysis for column Postcode
JSON formatted output
{ "Postcode":
Hello,
what options are you considering yourself?
On Wednesday 22 May 2024 at 07:37:30 BST, Anil Dasari
wrote:
Hello,
We are on Spark 3.x, using Spark DStream + Kafka, and planning to move to
Structured Streaming + Kafka. Is there an equivalent of DStream HasOffsetRanges
in structure s
28 matches