Can’t write to PVC in K8S

2021-08-30 Thread Bjørn Jørgensen
Hi, I have built and am running Spark on k8s. A link to my repo: https://github.com/bjornjorgensen/jlpyk8s Everything seems to be running fine, but I can't save to the PVC. If I convert the dataframe to pandas, then I can save it. from pyspark.sql import SparkSession spark = SparkSession.builder \
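A minimal sketch of mounting a PersistentVolumeClaim into both driver and executors and writing to it. The claim name "data-pvc" and mount path "/mnt/data" are placeholders; the claim must be mountable by every pod (ReadWriteMany, or a single-node setup) and writable by the container user, which is the UID/permissions point raised later in this thread.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Placeholder claim name and mount path; adjust to your cluster.
    .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName", "data-pvc")
    .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path", "/mnt/data")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName", "data-pvc")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path", "/mnt/data")
    .getOrCreate()
)

df = spark.range(10)
# Executors write the part files, so the volume must be mounted on them too and be
# writable by the Spark container user.
df.write.mode("overwrite").parquet("/mnt/data/out.parquet")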

Re: Can’t write to PVC in K8S

2021-08-30 Thread Bjørn Jørgensen

Re: Can’t write to PVC in K8S

2021-08-31 Thread Bjørn Jørgensen
However, once your parquet file is written to the work-dir, how are you going to utilise it? HTH

Re: Can’t write to PVC in K8S

2021-09-02 Thread Bjørn Jørgensen
Holden Karau wrote: You can change the UID of one of them to match, or you could add them both to a group and set permissions to 770. On Tue, Aug 31, 2021 at 12:18 PM Bjørn Jørgensen wrote: Hi and thanks for

Problems with update function in koalas - pyspark pandas.

2021-09-11 Thread Bjørn Jørgensen
Hi, I am using "from pyspark import pandas as ps" in a master build from yesterday. I have some columns that I need to join into one. In pandas I use update. 54 FD_OBJECT_SUPPLIES_SERVICES_OBJECT_SUPPLY_SERVICE_ADDITIONAL_INFORMATION
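For reference, a minimal sketch of DataFrame.update in the pandas API on Spark, with two hypothetical frames; non-null values in the other frame overwrite the matching cells, which is the usual way to fold several partially-filled columns into one.

from pyspark import pandas as ps

# May be required when an operation combines two different frames.
ps.set_option("compute.ops_on_diff_frames", True)

df = ps.DataFrame({"info": ["a", None, "c"]})
other = ps.DataFrame({"info": [None, "b", "z"]})

# update() modifies df in place; df keeps its own value wherever `other` is null.
df.update(other)
print(df)
#   info
# 0    a
# 1    b
# 2    z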

Re: Problems with update function in koalas - pyspark pandas.

2021-09-12 Thread Bjørn Jørgensen
https://issues.apache.org/jira/browse/SPARK-36722 https://github.com/apache/spark/pull/33968 On 2021/09/11 10:06:50, Bjørn Jørgensen wrote: Hi, I am using "from pyspark import pandas as ps" in a master build yesterday. I do have some columns that I need to join to one. In pandas I u

Re: Choice of IDE for Spark

2021-10-06 Thread Bjørn Jørgensen
I use jupyterlab on k8s with minio as s3 storage. https://github.com/bjornjorgensen/jlpyk8s With this code to start it all :) from pyspark import pandas as ps import re import numpy as np import pandas as pd from pyspark.sql import SparkSession from pyspark.sql.functions import concat, concat
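A minimal sketch of a SparkSession wired to MinIO through the s3a connector, assuming the hadoop-aws and AWS SDK bundles are on the classpath; the endpoint, bucket and credentials below are placeholders.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("jupyter-minio")
    # Placeholder endpoint and credentials for a MinIO service reachable from the pods.
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "minio-secret")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

df = spark.range(100)
df.write.mode("overwrite").parquet("s3a://my-bucket/demo.parquet")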

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Bjørn Jørgensen
.option("inferSchema", "true") \ >>> .load("/home/.../Documents/test_excel.xlsx") >>> >>> It is giving me the below error message: >>> >>> java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager >>> >>> I tried several Jars for this error but no luck. Also, what would be the >>> efficient way to load it? >>> >>> Thanks, >>> Sid >>> >> -- Bjørn Jørgensen Vestre Aspehaug 4, 6010 Ålesund Norge +47 480 94 297

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Bjørn Jørgensen
…'t be able to achieve spark functionality while loading the file in a distributed manner. Thanks, Sid. On Wed, Feb 23, 2022 at 7:38 PM Bjørn Jørgensen wrote: from pyspark import pandas as ps — ps.read_excel? "Support b
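A minimal sketch of reading an Excel workbook with the pandas API on Spark, as suggested above; it assumes the openpyxl package is installed on the driver, and the path and sheet name are placeholders.

from pyspark import pandas as ps

# read_excel loads the workbook via pandas/openpyxl and returns a pandas-on-Spark DataFrame.
psdf = ps.read_excel("/home/user/Documents/test_excel.xlsx", sheet_name=0)

# Convert to a regular Spark DataFrame if the rest of the pipeline needs one.
sdf = psdf.to_spark()
sdf.printSchema()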

Re: One click to run Spark on Kubernetes

2022-02-23 Thread Bjørn Jørgensen
On Wed, 23 Feb 2022 at 04:06, bo yang wrote: Hi Spark Community, we built an open source tool to deploy and run Spark on Kubernetes with a one-click command. For example, on AWS, it could automatically create an EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will be able to use curl or a CLI tool to submit a Spark application. After the deployment, you could also install Uber Remote Shuffle Service to enable Dynamic Allocation on Kubernetes. Anyone interested in using or working together on such a tool? Thanks, Bo

Re: Issue while creating spark app

2022-02-27 Thread Bjørn Jørgensen
On Sat, 26 Feb 2022 at 22:48, Sean Owen wrote: I don't think any of that is related, no. How are your dependencies set up? Manually with IJ, or in a build file (Maven, Gradle)? Normally you do the latter and dependencies are taken care of for you, but your app would definitely have to express a dependency on Scala libs. On Sat, Feb 26, 2022 at 4:25 PM Bitfox wrote: Java SDK installed? On Sun, Feb 27, 2022 at 5:39 AM Sachit Murarka wrote: Hello, thanks for replying. I have installed the Scala plugin in IntelliJ first, and it is still giving the same error: Cannot find project Scala library 2.12.12 for module SparkSimpleApp. Thanks, Rajat. On Sun, Feb 27, 2022 at 00:52 Bitfox wrote: You need to install Scala first; the current version for Spark is 2.12.15. I would suggest you install Scala by sdk, which works great. On Sun, Feb 27, 2022 at 12:10 AM rajat kumar wrote: Hello Users, I am trying to create a Spark application using Scala (IntelliJ). I have installed the Scala plugin in IntelliJ and still get the error: Cannot find project Scala library 2.12.12 for module SparkSimpleApp. Could anyone please help with what I am doing wrong? Thanks, Rajat

Re: Difference between windowing functions and aggregation functions on big data

2022-02-27 Thread Bjørn Jørgensen
…as Department, e.name as Employee, e.salary as Salary, dense_rank() over (partition by d.name order by e.salary desc) as rnk from Department d join Employee e on e.departmentId = d.id) a where rnk <= 3 — Time taken: 1212 ms. But as per my understanding, the aggregation should have run faster. So my whole point is: if the dataset is huge, should I force some kind of map-reduce job, like the df.groupby().reduceByGroups() option? I think the aggregation query is taking more time since the dataset here is smaller, and as we all know, map-reduce works faster when there is a huge volume of data. I haven't tested it yet on big data but needed some expert guidance here. Please correct me if I am wrong. TIA, Sid

Re: Issue while creating spark app

2022-02-27 Thread Bjørn Jørgensen
On Sun, 27 Feb 2022 at 20:12, Bjørn Jørgensen wrote: Mitch: You are using scala 2.11 to do this. Have a look at Building Spark <https://spark.apache.org

Re: Issue while creating spark app

2022-02-27 Thread Bjørn Jørgensen

Re: pivoting panda dataframe

2022-03-15 Thread Bjørn Jørgensen

Re: pivoting panda dataframe

2022-03-15 Thread Bjørn Jørgensen
I think it will try to pull the entire dataframe into the driver's memory. Kind regards, Andy. P.S. My real problem is that Spark does not allow you to bind columns. You can use union() to bind rows. I could get the equivalent of cbind() usin

Re: pivoting panda dataframe

2022-03-15 Thread Bjørn Jørgensen

Re: pivoting panda dataframe

2022-03-15 Thread Bjørn Jørgensen
…g for a pyspark data frame column_bind() solution for several months. Hopefully pyspark.pandas works. The only other solution I was aware of was to use spark.dataframe.join(). This does not scale for obvious reasons. Andy. From: Bjørn J
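A minimal sketch of a cbind-style column bind with the pandas API on Spark, using ps.concat along axis=1; the frames df1 and df2 are hypothetical and are aligned on their index, so they need a common (or default) index rather than a join key.

from pyspark import pandas as ps

# Combining two different frames may require this option.
ps.set_option("compute.ops_on_diff_frames", True)

df1 = ps.DataFrame({"a": [1, 2, 3]})
df2 = ps.DataFrame({"b": [10, 20, 30]})

# axis=1 concatenates column-wise, aligning rows by index (cbind-like).
wide = ps.concat([df1, df2], axis=1)
print(wide)
#    a   b
# 0  1  10
# 1  2  20
# 2  3  30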

Re: [EXTERNAL] Re: GraphX Support

2022-03-25 Thread Bjørn Jørgensen
…possible solution. Would someone be able to speak to the support of this Spark feature? Is there active development or is GraphX in maintenance mode (e.g. updated to ensure functionality with new Spark releases)? Thanks in advance for your help!

Re: GraphX Support

2022-03-25 Thread Bjørn Jørgensen
…to the support of this Spark feature? Is there active development or is GraphX in maintenance mode (e.g. updated to ensure functionality with new Spark releases)? Thanks in advance for your help!

Re: Question for so many SQL tools

2022-03-25 Thread Bjørn Jørgensen
…itfox: Just a question — why are there so many SQL-based tools for data jobs? The ones I know: Spark, Flink, Ignite, Impala, Drill, Hive, … They are doing similar jobs IMO. Thanks

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-03-30 Thread Bjørn Jørgensen
…runs on a healthy Spark 2.4 and was optimized already to come to a stable job in terms of spark-submit resource parameters (driver-memory / num-executors / executor-memory / executor-cores / spark.locality.wait). Any clue how to "really" clear the memory in between jobs? So basically, currently I can loop 10x and then need to restart my cluster so all memory is cleared completely. Thanks for any info!

Re: how to change data type for columns of dataframe

2022-04-02 Thread Bjørn Jørgensen

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-06 Thread Bjørn Jørgensen
…t using df) ... dfx = spark.sql(complex statement using df x-1) ... dfx15.write() — What exactly is meant by "closing resources"? Is it just unpersis
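One common way to release memory between iterations of such a loop is to unpersist the intermediate DataFrames and clear the cache explicitly; a minimal sketch, assuming hypothetical df handles and that some intermediates were cached earlier.

# Inside the loop, after dfx15.write() has materialised the result:
for intermediate in (df, df1, dfx):          # hypothetical handles
    intermediate.unpersist(blocking=True)    # drop cached blocks for this DataFrame

# Or drop every cached table/DataFrame in the session in one go.
spark.catalog.clearCache()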

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-07 Thread Bjørn Jørgensen
I am using pyspark. Basically my code (simplified) is: df = spark.read.csv(hdfs://somehdfslocation) df1 = spark.sql(complex statement using df) ... dfx =

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-07 Thread Bjørn Jørgensen

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually failure

2022-04-07 Thread Bjørn Jørgensen

Re: Spark Write BinaryType Column as continuous file to S3

2022-04-08 Thread Bjørn Jørgensen
In the new Spark 3.3 there will be an SQL function https://github.com/apache/spark/commit/25dd4254fed71923731fd59838875c0dd1ff665a — hope this can help you. On Fri, 8 Apr 2022 at 17:14, Philipp Kraus <philipp.kraus.flashp...@gmail.com> wrote: Hello, I have got a data frame with numerical data in

Re: Spark Write BinaryType Column as continuous file to S3

2022-04-09 Thread Bjørn Jørgensen
…n of LAS format specification, see http://www.asprs.org/wp-content/uploads/2019/07/LAS_1_4_r15.pdf section 2.6, Table 7. Thank

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-18 Thread Bjørn Jørgensen
…ntiment) for t in df.select("sentiment").collect()]
counts = [int(row.asDict()['count']) for row in df.select("count").collect()]
print(entities, sentiments, counts)
At first I tried with other NER models from Flair; they have the same effect: after printing the first batch, memory use starts increasing until it fails and stops the execution because of the memory error. When applying a "simple" function instead of the NER model, such as return words.split() on the UDF, there is no such error, so the data ingested should not be what's causing the overload but the model. Is there a way to prevent the excessive RAM consumption? Why is there only the driver executor and no other executors are generated? How could I prevent it from collapsing when applying the NER model? Thanks in advance!

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-19 Thread Bjørn Jørgensen
…mat('console').start() query.awaitTermination() — Spark version is 3.2.1 and SparkNLP version is 3.4.3, while the Java version is 8. I've tried with a different model but the error is still the same, so what could be causing it? If this error is solved

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-19 Thread Bjørn Jørgensen
https://github.com/JohnSnowLabs/spark-nlp#packages-cheatsheet — change spark = sparknlp.start() to spark = sparknlp.start(spark32=True). On Tue, 19 Apr 2022 at 21:10, Bjørn Jørgensen wrote: Yes, there are some that have that issue. Please open a new issue at https:

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-20 Thread Bjørn Jørgensen
…rue)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Bjørn Jørgensen
I could replace the row ids and column names with integers if needed, and restore them later. Maybe I would be better off using many small machines? I assume memory is the limiting resource, not CPU. I notice that memory usage will reach 100%. I added several TBs of local SSD; I am not convinced that Spark is using the local disk. Will this perform better than join? The rows before the final pivot will be very, very wide (over 5 million columns). There will only be 10114 rows before the pivot. I assume the pivots will shuffle all the data. I assume the column vectors are trivial. The file table pivot will be expensive, however it will only need to be done once. Comments and suggestions appreciated. Andy

Re: Vulnerabilities in htrace-core4-4.1.0-incubating.jar jar used in spark.

2022-04-26 Thread Bjørn Jørgensen
…CVE-2019-16335, CVE-2019-14893, CVE-2019-14892, CVE-2019-14540, CVE-2019-14439, CVE-2019-14379, CVE-2019-12086, CVE-2018-7489, CVE-2018-5968, …

Re: Vulnerabilities in htrace-core4-4.1.0-incubating.jar jar used in spark.

2022-04-26 Thread Bjørn Jørgensen
…CVE-2019-14893, CVE-2019-14892, CVE-2019-14540, CVE-2019-14439, CVE-2019-14379, CVE-2019-12086, CVE-2018-7489, CVE-2018-5968, CVE-2018-14719, CVE-2018-14718, CVE-2018-12022, CVE-2018-11307, CVE-2017-7525, CVE-2017-17485, CVE-2017-15095. Kind Regards, Harsh Takkar

Re: Dealing with large number of small files

2022-04-26 Thread Bjørn Jørgensen
df = spark.read.json("/*.json") — use the *.json glob. On Tue, 26 Apr 2022 at 16:44, Sid wrote: Hello, can somebody help me with the below problem? https://stackoverflow.com/questions/72015557/dealing-with-large-number-of-small-json-files-using-pyspark
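A minimal sketch of the glob approach for many small JSON files, plus compacting the result into fewer output files; the paths are placeholders, and multiLine is only needed if each file holds a pretty-printed JSON document.

# Read every .json file under the directory in one job instead of looping per file.
df = spark.read.option("multiLine", "true").json("/data/incoming/*.json")

# Optionally compact the small files into a handful of larger ones downstream.
df.coalesce(8).write.mode("overwrite").parquet("/data/compacted/")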

Re: Dealing with large number of small files

2022-04-26 Thread Bjørn Jørgensen

Re: Dealing with large number of small files

2022-04-26 Thread Bjørn Jørgensen
…it using the below script: find . -name '*.txt' -exec cat '{}' + | jq -s '.' > output.txt — Thanks, Sid. On Tue, Apr 26, 2022 at 9:37 PM Bjørn Jørgensen wrote: and the bash script seems to read txt fi

Re: Spark error with jupyter

2022-05-03 Thread Bjørn Jørgensen
I am working on Spark in Jupyter but I get a small error on each run. Does anyone have the same error or a solution? Please tell me.

Re: Count() action leading to errors | Pyspark

2022-05-07 Thread Bjørn Jørgensen
…? Also, what could be the possible reason for that simple count error? Environment: AWS Glue 1.X, 10 workers, Spark 2.4.3. Thanks, Sid

Re: Complexity with the data

2022-05-25 Thread Bjørn Jørgensen
Sid, dump one of your files. https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/ On Wed, 25 May 2022 at 23:04, Sid wrote: I have 10 columns with me, but in the dataset I observed that some records have 11 columns of data (for the additional column it is marked as null).

Re: Complexity with the data

2022-05-26 Thread Bjørn Jørgensen
…names. PFB link: https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark Thanks, Sid. On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen wrote: Sid, dump one of your files. htt

Re: Complexity with the data

2022-05-26 Thread Bjørn Jørgensen
…an escape character. Can you check if this may cause any issues? Regards, Apostolos. On 26/5/22 16:31, Sid wrote: Thanks for opening the issue, Bjorn. However, could you help me to

Re: Complexity with the data

2022-05-26 Thread Bjørn Jørgensen
Yes, but how do you read it with Spark? On Thu, 26 May 2022 at 18:30, Sid wrote: I am not reading it through pandas. I am using Spark, because when I tried to use the pandas that comes under import pyspark.pandas, it gives me an error. On Thu, May 26, 2022 at 9:52 PM Bjørn Jør
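For a CSV where some records carry an extra column or embedded quote/escape characters, the reader options below are a hedged sketch; the separator, escape character and path are assumptions to adapt to the actual file.

df = (
    spark.read
    .option("header", "true")
    .option("sep", ",")            # field separator (assumption)
    .option("quote", '"')          # quoting character
    .option("escape", "\\")        # escape character inside quoted fields
    .option("multiLine", "true")   # allow quoted fields that span lines
    .option("inferSchema", "true")
    .csv("/data/complex.csv")
)
df.printSchema()
df.show(5, truncate=False)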

Re: to find Difference of locations in Spark Dataframe rows

2022-06-09 Thread Bjørn Jørgensen
….getOrCreate()
val housingDataDF = spark.read.csv("~/Downloads/real-estate-sample-data.csv")
// searching for the property by `ref_id`
val searchPropertyDF = housingDataDF.filter(col("ref_id") === search_property_id)
// Similar house in the same city (same postal code) and group one condition
val similarHouseAndSameCity = housingDataDF.join(searchPropertyDF, groupThreeCriteria ++ groupOneCriteria, "inner")
// Similar house not in the same city but 10km range

Re: Glue is serverless? how?

2022-06-26 Thread Bjørn Jørgensen
https://en.m.wikipedia.org/wiki/Serverless_computing On Sun, 26 Jun 2022 at 10:26, Sid wrote: Hi Team, I am developing a Spark job in Glue and have read that Glue is serverless. I know that using Glue Studio we can autoscale the workers. However, I want to understand how it is serverless

Re: How reading works?

2022-07-05 Thread Bjørn Jørgensen
…but I am getting the issue of the duplicate column which was present in the old dataset. So, I am trying to understand how Spark reads the data. Does it read the full dataset and filter on the basis of the last saved timestamp, or does it filter only what is required? If the second case is true, then it should have read the data, since the latest data is correct. So just trying to understand. Could anyone help here? Thanks, Sid

Re: How reading works?

2022-07-05 Thread Bjørn Jørgensen
Ehh.. What is a "duplicate column"? I don't think Spark supports that. duplicate column = duplicate rows. On Tue, 5 Jul 2022 at 22:13, Bjørn Jørgensen wrote: "but I am getting the issue of the duplicate column which was present in the old dataset."

Re: How use pattern matching in spark

2022-07-14 Thread Bjørn Jørgensen
…r now, i.e. CSV, .DAT file and .TXT file. So, as per me, I could do validation for all these 3 file formats using spark.read.text().rdd and performing the intended operations on RDDs — just the validation part. Therefore, I wanted to understand: is there any better

Pyspark and multiprocessing

2022-07-20 Thread Bjørn Jørgensen
…tItem(f).alias(str(col_name + sep + f)), keys))
drop_column_list = [col_name]
df = df.select([col_name for col_name in df.columns if col_name not in drop_column_list] + key_cols)
# recompute remaining Complex Fields in Schema
complex_fields =

Fwd: Pyspark and multiprocessing

2022-07-20 Thread Bjørn Jørgensen
So now I have tried to run this function in a ThreadPool, but it doesn't seem to work. ---------- Forwarded message --------- From: Sean Owen, Date: Wed, 20 Jul 2022 at 22:43, Subject: Re: Pyspark and multiprocessing, To: Bjørn Jørgensen — I don't think you eve

Re: Pyspark and multiprocessing

2022-07-21 Thread Bjørn Jørgensen
…need 160 cores in total as each will need 16 CPUs IMHO. Wouldn't that create a CPU bottleneck? Also, on a side note, why do you need Spark if you use that on local only? Spark's power can only (mainly) be observed in a cluster env. I have achieved great parallelism using pandas
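A minimal sketch of driving several independent Spark jobs from one driver with a thread pool, which is the pattern discussed in this thread; the process_file function and the file list are hypothetical, and each call only submits jobs to the shared SparkSession rather than creating new executors.

from concurrent.futures import ThreadPoolExecutor

files = ["/data/a.json", "/data/b.json", "/data/c.json"]   # hypothetical inputs

def process_file(path: str) -> int:
    # Each thread submits its own Spark jobs; the cluster scheduler interleaves them.
    df = spark.read.json(path)
    df.write.mode("overwrite").parquet(path + ".parquet")
    return df.count()

# Threads (not processes) are used so they can share the driver's SparkSession.
with ThreadPoolExecutor(max_workers=3) as pool:
    counts = list(pool.map(process_file, files))

print(counts)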

Re: Jupyter notebook on Dataproc versus GKE

2022-09-05 Thread Bjørn Jørgensen

Re: Jupyter notebook on Dataproc versus GKE

2022-09-06 Thread Bjørn Jørgensen

Re: Jupyter notebook on Dataproc versus GKE

2022-09-14 Thread Bjørn Jørgensen
…th for scheduling. On Tue, Sep 6, 2022 at 10:01 AM Mich Talebzadeh wrote: Thank you all. Has anyone used Argo for k8s scheduler by any chance? On Tue, 6 Sep 2022 at 13:41, Bjørn Jørgensen wrote: "Jupy

Re: Reply: [how to]RDD using JDBC data source in PySpark

2022-09-19 Thread Bjørn Jørgensen
Is there some way to let an RDD use a JDBC data source in PySpark? I want to get data from MySQL, but in PySpark there is no supported JDBCRDD like in Java/Scala, and I searched the docs on the web site with no answer. So I need yo

Re: Re: [how to]RDD using JDBC data source in PySpark

2022-09-20 Thread Bjørn Jørgensen
dbc") is good way to resolved it. > But in some reasons, i can't using DataFrame API, only can use RDD API in > PySpark. > ...T_T... > > thanks all you guys help. but still need new idea to resolve it. XD > > > > > > ------ >

Re: Issue with SparkContext

2022-09-20 Thread Bjørn Jørgensen
…JavaError while running SparkContext. Can you please help me to resolve this issue?

Re: [Spark Core][Release]Can we consider add SPARK-39725 into 3.3.1 or 3.3.2 release?

2022-10-04 Thread Bjørn Jørgensen
…022-2048 (High), which was set to the 3.4.0 release, but that will happen Feb 2023. Is it possible to have it in any earlier release such as 3.3.1 or 3.3.2?

Re: spark - local question

2022-11-04 Thread Bjørn Jørgensen
…rio is as follows: Our team wants to develop an ETL component based on Python. Data can be transferred between various data sources. If there is no YARN environment, can we read data from database A and write it to database B in local mode? Will this function be guaranteed to be stable and available? Thanks, look forward to your reply.

Re: spark - local question

2022-11-05 Thread Bjørn Jørgensen
…}
df = df.withColumnRenamed("id", "itemid").withColumnRenamed("category", "cateid") \
    .withColumnRenamed('weight', 'score').withColumnRenamed('tag', 'item_tags') \
    .withColumnRenamed(
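A minimal local-mode sketch of moving a table from one database to another over JDBC, as asked above; the URLs, table names and credentials are placeholders, and the relevant JDBC driver jars must be on the classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("etl-a-to-b").getOrCreate()

src = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host-a:5432/db_a")   # placeholder source
    .option("dbtable", "public.items")
    .option("user", "reader").option("password", "secret")
    .load()
)

(
    src.write.format("jdbc")
    .option("url", "jdbc:mysql://host-b:3306/db_b")        # placeholder target
    .option("dbtable", "items_copy")
    .option("user", "writer").option("password", "secret")
    .mode("append")
    .save()
)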

Re: Creating a Spark 3 Connector

2022-11-23 Thread Bjørn Jørgensen

Re: How can I use backticks in column names?

2022-12-05 Thread Bjørn Jørgensen
MungingData: Avoiding Dots / Periods in PySpark Column Names <https://mungingdata.com/pyspark/avoid-dots-periods-column-names/> On Mon, 5 Dec 2022 at 06:56, 한승후 wrote: Spark throws an exception if there are backticks in the column name. Please help me.
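Two hedged workarounds that usually apply here: wrap a problematic name in backticks when selecting it, and rename columns up front so later expressions never need quoting; the column names below are made up.

from pyspark.sql import functions as F

df = spark.createDataFrame([(1, 2)], ["a.b", "plain"])

# A dot in a name needs backtick quoting, otherwise Spark parses a.b as struct access.
df.select(F.col("`a.b`")).show()

# Renaming columns up front (dropping dots/backticks) avoids quoting issues later.
clean = df.toDF(*[c.replace("`", "").replace(".", "_") for c in df.columns])
clean.printSchema()   # a_b, plain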

Re: Unable to run Spark Job(3.3.2 SNAPSHOT) with Volcano scheduler in Kubernetes

2022-12-16 Thread Bjørn Jørgensen
at scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57)
at scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275)
at scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133)
at scala.reflect.internal.Trees.itransform(Trees.scala:1409)
at scala.reflect.internal.Trees.itransform$(Trees.scala:1400)
at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
at scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
... (the same frames repeat) ...
Thanks, Gnana

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Bjørn Jørgensen
On Mon, 19 Dec 2022 at 15:28, Oliver Ruebenacker <oliv...@broadinstitute.org> wrote: Hello, how can I retain from each group only the row for which one value is the maximum of the group? For example, imagine a DataFrame containing all major cities in the world, with three columns: (1) city name, (2) country, (3) population. How would I get a DataFrame that only contains the largest city in each country? Thanks! Best, Oliver — Oliver Ruebenacker, Ph.D. (he), Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad Institute <http://www.broadinstitute.org/>

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Bjørn Jørgensen
…uple. On Mon, Dec 19, 2022 at 2:10 PM Bjørn Jørgensen wrote: We have pandas API on Spark <https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html>, which is very good. from pyspark import pandas as ps

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Bjørn Jørgensen
https://github.com/apache/spark/pull/39134 On Tue, 20 Dec 2022 at 22:42, Oliver Ruebenacker <oliv...@broadinstitute.org> wrote: Thank you for the suggestion. This would, however, involve converting my Dataframe to an RDD (and back later), which involves additional costs. On Tue, Dec 20, 2022
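A standard way to keep only the top row per group without leaving the DataFrame API is a window with row_number; a minimal sketch for the cities example, with made-up data.

from pyspark.sql import Window
from pyspark.sql import functions as F

cities = spark.createDataFrame(
    [("Oslo", "NO", 709_000), ("Bergen", "NO", 286_000), ("Paris", "FR", 2_161_000)],
    ["city", "country", "population"],
)

# Rank rows within each country by population and keep the top one.
w = Window.partitionBy("country").orderBy(F.col("population").desc())
largest = (
    cities.withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn")
)
largest.show()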

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Bjørn Jørgensen
https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp On Fri, 6 Jan 2023 at 16:01, Oliver Ruebenacker <oliv...@broadinstitute.org> wrote: Hello, I'm trying to install SciPy using a bootstrap script and then use it

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Bjørn Jørgensen
…linux Type "help", "copyright", "credits" or "license" for more information. >>> from scipy.stats import norm >>> On Fri, 6 Jan 2023 at 18:12, Oliver Ruebenacker <oliv...@broadinstitute.org> wrote: Thank you for the link. I alre

Re: Spark SQL question

2023-01-28 Thread Bjørn Jørgensen
Since there's no such field as data, I thought the SQL has to look like this: select 1 as `data.group` from tbl group by `data.group` — but that gives an error (cannot resolve '`data.group`') ... I'm no expert in SQL, but it feels like strange behavior... does anybody have a good explanation for it? Thanks — Kohki Nishio

Re: How to explode array columns of a dataframe having the same length

2023-02-16 Thread Bjørn Jørgensen
Hello guys, I have the following dataframe (one row, three array columns):

col1 = ["A","B","null"]
col2 = ["C","D","null"]
col3 = ["E","null","null"]

I want to explode it to the following dataframe (three rows):

col1    col2    col3
"A"     "C"     "E"
"B"     "D"     "null"
"null"  "null"  "null"

How to do that (preferably in Java) using the explode() method? Knowing that something like the following won't yield correct output:

for (String colName: dataset.columns())
    dataset = dataset.withColumn(colName, explode(dataset.col(colName)));
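The thread asks for Java, but as a hedged sketch in PySpark the same effect comes from zipping the arrays and exploding once, so the three arrays stay aligned; arrays_zip names the struct fields after the source columns.

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(["A", "B", "null"], ["C", "D", "null"], ["E", "null", "null"])],
    ["col1", "col2", "col3"],
)

# Zip the arrays element-wise, explode once, then unpack the struct fields.
exploded = (
    df.select(F.explode(F.arrays_zip("col1", "col2", "col3")).alias("z"))
      .select("z.col1", "z.col2", "z.col3")
)
exploded.show()
# +----+----+----+
# |col1|col2|col3|
# |   A|   C|   E|
# |   B|   D|null|
# |null|null|null|
# +----+----+----+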

Re: Graceful shutdown SPARK Structured Streaming

2023-02-19 Thread Bjørn Jørgensen
…ame if name == 'md': print(f"Terminating streaming process {name}") e.stop() else: print("DataFrame newtopic is empty") — This seems to work, as I checked it to ensure that in this case data was written and saved to the target sink (BigQuery table). It will wait until data is written completely, meaning the current streaming message is processed, and there is a latency there (waiting for graceful completion). This is the output: Terminating streaming process md / wrote to DB (this is the flag I added to ensure the current micro-batch was completed) / 2021-04-23 09:59:18,029 ERROR streaming.MicroBatchExecution: Query md [id = 6bbccbfe-e770-4fb0-b83d-0dedd0ee571b, runId = 2ae55673-6bc2-4dbe-af60-9fdc0447bff5] terminated with error. The various termination processes are described in the Structured Streaming Programming Guide - Spark 3.1.1 Documentation <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#managing-streaming-queries>. This is the idea I came up with, which allows ending the streaming process with the least cost. HTH. On Wed, 5 May 2021 at 17:30, Gourav Sengupta wrote: Hi, just thought of reaching out once again and seeking your kind help to find out the best way to stop Spark streaming gracefully. Do we still use the method of creating a file as in Spark 2.4.x, which is a several-years-old method, or do we have a better approach in Spark 3.1? Regards, Gourav Sengupta. Forwarded message from Gourav Sengupta, Wed, Apr 21, 2021: Dear friends, is there any documentation available for gracefully stopping Spark Structured Streaming in 3.1.x? I am referring to articles which are 4 to 5 years old and was wondering whether there is a better way available today to gracefully shut down a Spark streaming job. Thanks a ton in advance for all your kind help. Regards, Gourav Sengupta

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Bjørn Jørgensen
art") - >>>> col("position"), col("position") - col("end"), 0)) >>>> ``` >>>> >>>> Basically, the distance is the maximum of three terms. >>>> >>>> This line causes an obscure error: >>>> >

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Bjørn Jørgensen
…performance using Pandas API on Spark? How to tune them in addition to the conventional Spark tuning methods applied to Spark SQL users. 6. Spark internals and/or compari

Re: Topics for Spark online classes & webinars

2023-03-28 Thread Bjørn Jørgensen
On Mon, 13 Mar 2023 at 16:29, asma zgolli wrote: Hello Mich, can you please provide the link for the confluence page? Many thanks, Asma, Ph.D. in Big Data - Applied Machine Learning. On Mon, 13 Mar 2023 at 17:21, Mich Talebzadeh wrote: Apologies, I missed the list. To move forward I selected these topics from the thread "Online classes for spark topics". To take this further I propose a confluence page to be set up.
1. Spark UI
2. Dynamic allocation
3. Tuning of jobs
4. Collecting spark metrics for monitoring and alerting
5. For those who prefer to use Pandas API on Spark since the release of Spark 3.2, what are some important notes for those users? For example, what are the additional factors affecting the Spark performance using Pandas API on Spark? How to tune them in addition to the conventional Spark tuning methods applied to Spark SQL users.
6. Spark internals and/or comparing spark 3 and 2
7. Spark Streaming & Spark Structured Streaming
8. Spark on notebooks
9. Spark on serverless (for example Spark on Google Cloud)
10. Spark on k8s
Opinions and how-to are welcome. -- Asma ZGOLLI, PhD in Big Data - Applied Machine Learning

Re: Slack for PySpark users

2023-03-30 Thread Bjørn Jørgensen
On Tue, 28 Mar 2023 at 03:52, asma zgolli wrote: +1, good idea, I'd like to join as well. On Tue, 28 Mar 2023 at 04:09, Winston Lai wrote: Please let us know when the channel is created. I'd like to join :) From: Denny Lee, Tuesday, March 28, 2023: +1, I think this is a great idea! On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon wrote: Yeah, actually I think we should better have a slack channel so we can easily discuss with users and developers. On Tue, 28 Mar 2023 at 03:08, keen wrote: Hi all, I really like Slack as a communication channel for a tech community. There is a Slack workspace for delta lake users (https://go.delta.io/slack) that I enjoy a lot. I was wondering if there is something similar for PySpark users. If not, would there be anything wrong with creating a new Slack workspace for PySpark users (when explicitly mentioning that this is not officially part of Apache Spark)? Cheers, Martin

Re: Looping through a series of telephone numbers

2023-04-02 Thread Bjørn Jørgensen
On Sat, 1 Apr 2023 at 19:32, Philippe de Rochambeau wrote: Hello, I'm looking for an efficient way in Spark to search for a series of telephone numbers, contained in a CSV file, in a data set column. In pseudo code: for tel in [tel1, tel2, … tel40,000]: search for tel in dataset using .like("%tel%"). I'm using the like function because the telephone numbers in the data set may contain prefixes, such as "+"; e.g., "+331222". Any suggestions would be welcome. Many thanks. Philippe
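A hedged sketch of doing this as a single broadcast join instead of a 40,000-iteration loop; the file names and column names are placeholders, and contains() stands in for like('%tel%') since the stored numbers may carry extra prefixes.

from pyspark.sql import functions as F

phones = spark.read.csv("phones.csv", header=True).select("tel")   # ~40k search numbers
records = spark.read.parquet("/data/records")                       # placeholder dataset

# Broadcast the small list and keep every record whose phone column contains one of them.
matches = records.join(
    F.broadcast(phones),
    F.col("phone_col").contains(F.col("tel")),
    "inner",
)
matches.show(5)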

Re: Slack for PySpark users

2023-04-04 Thread Bjørn Jørgensen
…and medium size groups it is good and affordable. Alternatives have been suggested as well, so those who like investigative search can agree and come

Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Bjørn Jørgensen
Yes, it looks inside the docker container's folders. It will work if you are using s3 or gs. On Wed, 12 Apr 2023 at 18:02, Mich Talebzadeh wrote: Hi, in my spark-submit to the EKS cluster, I use the standard code to submit to the cluster as below: spark-submit --verbose \ --master k8s://$

Re: Non string type partitions

2023-04-15 Thread Bjørn Jørgensen
Hi Team, we are running into the below error when we are trying to run a simple query on a partitioned table in Spark: MetaException(message: Filtering is supported only on partition keys of type strin

Re: Change column values using several when conditions

2023-05-01 Thread Bjørn Jørgensen
…e list, the call to withColumn() gets ignored. How to do exactly that in a more efficient way using Spark in Java?
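The thread is about Java, but as a hedged PySpark sketch, a chain of when()/otherwise() built in a loop applies many conditional replacements in one expression rather than one withColumn() call per condition; the mapping below is invented.

from pyspark.sql import functions as F

df = spark.createDataFrame([("NY",), ("CA",), ("TX",)], ["state"])
mapping = {"NY": "New York", "CA": "California"}   # hypothetical replacements

# Build one CASE WHEN ... chain instead of calling withColumn() per condition.
expr = None
for key, value in mapping.items():
    cond = F.col("state") == key
    expr = F.when(cond, value) if expr is None else expr.when(cond, value)
expr = expr.otherwise(F.col("state"))   # unmatched values keep the original

df.withColumn("state", expr).show()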

Re: maven with Spark 3.4.0 fails compilation

2023-05-28 Thread Bjørn Jørgensen

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Bjørn Jørgensen
…com/a/26411339/19476830> can be a pitfall for you. -- Best Regards! Lingzhe Sun, Hirain Technology. From: Mich Talebzadeh, Date: 2023-05-29 17:55, To: Bjørn Jør

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Bjørn Jørgensen
<dependency>
  <groupId>org.scala-lang</groupId>
  <artifactId>scala-library</artifactId>
  <version>2.13.8</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.13</artifactId>
  <version>3.4.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.13</artifactId>
  <version>3.4.0</version>
  <scope>provided</scope>
</dependency>

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen
On Tue, 20 Jun 2023 at 13:07, Bjørn Jørgensen wrote: This is pandas API on spark — from

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen

Re: Rename columns without manually setting them all

2023-06-21 Thread Bjørn Jørgensen
…{col}` as `Status`' for col in date_columns])}) as (`Date`, `Status`)"])
result = df.groupby("Date", "Status").count()
On 21 Jun 2023, at 11:45, John Paul Jayme wrote: Hi, this is currently my column definition:
Employee ID | Name | Client | Project | Team | 01/01/2022 | 02/01/2022 | 03/01/2022 | 04/01/2022 | 05/01/2022
12345 | Dummy x | Dummy a | abc | team a | OFF | WO | WH | WH | WH
As you can see, the outer columns are just daily attendance dates. My goal is to count the employees who were OFF / WO / WH on said dates. I need to transpose them so it would look like this: I am still new to pandas. Can you guide me on how to produce this? I am reading about melt() and set_index() but I am not sure if they are the correct functions to use.
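A hedged sketch of the melt() route mentioned above, using the pandas API on Spark: unpivot the date columns into (Date, Status) pairs and then count per date and status. The frame psdf is assumed to be the attendance table already loaded as a pandas-on-Spark DataFrame, and the column names follow the sample table.

from pyspark import pandas as ps

id_cols = ["Employee ID", "Name", "Client", "Project", "Team"]
date_cols = [c for c in psdf.columns if c not in id_cols]   # the daily columns

# melt() turns each date column into a row with variable/value columns.
long = ps.melt(psdf, id_vars=id_cols, value_vars=date_cols,
               var_name="Date", value_name="Status")

# Count employees per date and status (OFF / WO / WH).
summary = long.groupby(["Date", "Status"]).size()
print(summary.head(10))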

Re: [PySpark][UDF][PickleException]

2023-08-10 Thread Bjørn Jørgensen
…ist[float]]:
    return np.pad(arr, [(n, 0), (0, 0)], "constant", constant_values=0.0).tolist()
But this works:
@udf("array<array<float>>")
def pad(arr, n):
    padded_arr = []
    for i in range(n):
        padded_arr.append([0.0] * len(arr[0]))

Re: Spark Vulnerabilities

2023-08-14 Thread Bjørn Jørgensen

Re: Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Bjørn Jørgensen
…util.ShutdownHookManager: Deleting directory /tmp/spark-9375452d-1989-4df5-9d85-950f751ce034
2023-08-20T19:45:19,691 INFO [shutdown-hook-0] util.ShutdownHookManager: Deleting directory /tmp/localPyFiles-6c113b2b-9ac3-45e3-9032-d1c83419aa64

Re: Okio Vulnerability in Spark 3.4.1

2023-08-31 Thread Bjørn Jørgensen

Re: Filter out 20% of rows

2023-09-15 Thread Bjørn Jørgensen
…Sep 2023 at 20:14, ashok34...@yahoo.com.INVALID wrote: Hi team, I am using PySpark 3.4. I have a table of a million rows that has a few columns, among them incoming IPs, what is known as gbps (Gigabytes per second), and the date and time of the incoming IP. I want to filter out the 20% of low-active IPs and work on the remainder of the data. How can I do this in PySpark? Thanks
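A hedged sketch of one way to do this: aggregate activity per IP, find the 20th percentile with approxQuantile, and keep only IPs above it; the column names (incoming_ip, gbps) are assumptions from the description.

from pyspark.sql import functions as F

# Total activity per IP (column names are assumptions).
per_ip = df.groupBy("incoming_ip").agg(F.sum("gbps").alias("total_gbps"))

# Approximate 20th percentile of activity; the last argument is the relative error.
threshold = per_ip.approxQuantile("total_gbps", [0.2], 0.01)[0]

# Keep the IPs above the bottom 20% and join back to the original rows.
active_ips = per_ip.filter(F.col("total_gbps") > threshold).select("incoming_ip")
remainder = df.join(active_ips, "incoming_ip", "inner")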

Re: Spark stand-alone mode

2023-09-15 Thread Bjørn Jørgensen
…er. As this is local mode, we are facing a performance issue (as there is only one executor) when it comes to dealing with large datasets. Can I convert these 4 nodes into a Spark standalone cluster? We don't have Hadoop, so YARN mode is out of scope.

Re: Filter out 20% of rows

2023-09-16 Thread Bjørn Jørgensen

Re: Filter out 20% of rows

2023-09-16 Thread Bjørn Jørgensen
On Sat, 16 Sept 2023 at 11:46, Mich Talebzadeh wrote: Happy Saturday coding 😁

Re: Filter out 20% of rows

2023-09-16 Thread Bjørn Jørgensen
EDIT: I don't think that the question asker will have only returned the top 25 percentages. On Sat, 16 Sep 2023 at 21:54, Bjørn Jørgensen wrote: percentile_approx returns the approximate percentile(s) <https://github.com/apache/spark/pull/14868>. The memory consumption

Re: Discrepancy sample standard deviation pyspark and Excel

2023-09-19 Thread Bjørn Jørgensen

Re: Spark join produce duplicate rows in resultset

2023-10-22 Thread Bjørn Jørgensen
…rev.prin = p.prin
rev.scode = p.bcode
The item has two rows which have common attributes, and the final join should result in 2 rows. But I am seeing 4 rows instead.
left join item I
on rev.sys = i.sys
rev.custumer_id = I.custumer_id
rev.scode = I.scode
Regards, Meena
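When the right-hand table legitimately carries several rows per join key, each left row matches each of them, so 2 x 2 = 4 output rows is expected; a hedged sketch of de-duplicating the right side on the join keys before joining, with made-up DataFrame names and the key columns taken from the quoted SQL.

# Keep one row per join key on the right side before joining (assumed key columns).
item_dedup = item.dropDuplicates(["sys", "custumer_id", "scode"])

result = rev.join(
    item_dedup,
    on=["sys", "custumer_id", "scode"],
    how="left",
)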
