Re: add an auto_increment column

2022-02-08 Thread Bitfox
Maybe col func is not even needed here. :)

>>> df.select(F.dense_rank().over(wOrder).alias("rank"), "fruit","amount").show()
+----+------+------+
|rank| fruit|amount|
+----+------+------+
|   1|cherry|     5|
|   2| apple|     3|
|   2|tomato|     3|
|   3|orange|     2|
+----+------+------+
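
For readers following the thread in Scala, a minimal sketch of the same dense_rank-over-window idea; the sample data and the wOrder window spec are reconstructed from this thread, so treat the names as illustrative (assumes a spark-shell session where spark and sc are available):

import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank}

// sample data matching the output above
val df = Seq(("cherry", 5), ("apple", 3), ("tomato", 3), ("orange", 2))
  .toDF("fruit", "amount")

// rank by amount, highest first; ties (apple, tomato) share a rank with no gaps
val wOrder = Window.orderBy(col("amount").desc)
df.select(dense_rank().over(wOrder).alias("rank"), col("fruit"), col("amount")).show()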

Re: add an auto_increment column

2022-02-08 Thread Gourav Sengupta
Hi, so do you want to rank apple and tomato both as 2? Not quite clear on the use case here though.

Regards,
Gourav Sengupta

On Tue, Feb 8, 2022 at 7:10 AM wrote:
> Hello Gourav
> As you see here orderBy has already given the solution for "equal amount":
> >>> df = sc.parallel

Re: add an auto_increment column

2022-02-08 Thread capitnfrakass
I have got the answer from Mich's reply. Thank you both.

frakass

On 08/02/2022 16:36, Gourav Sengupta wrote:
> Hi, so do you want to rank apple and tomato both as 2? Not quite clear on the use case here though. Regards, Gourav Sengupta
> On Tue, Feb 8, 2022 at 7:10 AM wrote: Hello Gourav A

question on the different way of RDD to dataframe

2022-02-08 Thread capitnfrakass
Hello

I am converting some py code to scala. This works in python:

>>> rdd = sc.parallelize([('apple',1),('orange',2)])
>>> rdd.toDF(['fruit','num']).show()
+------+---+
| fruit|num|
+------+---+
| apple|  1|
|orange|  2|
+------+---+

And in scala:

scala> rdd.toDF("fruit","num").show()
+------+---+
|
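
For comparison, a sketch of the Scala side of the same conversion (assuming a spark-shell session, where spark.implicits._ provides toDF on an RDD of tuples):

import spark.implicits._

val rdd = sc.parallelize(Seq(("apple", 1), ("orange", 2)))
rdd.toDF("fruit", "num").show()
// +------+---+
// | fruit|num|
// +------+---+
// | apple|  1|
// |orange|  2|
// +------+---+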

Re: question on the different way of RDD to dataframe

2022-02-08 Thread Sean Owen
It's just a possibly tidier way to represent objects with named, typed fields, in order to specify a DataFrame's contents.

On Tue, Feb 8, 2022 at 4:16 AM wrote:
> Hello
> I am converting some py code to scala.
> This works in python:
> >>> rdd = sc.parallelize([('apple',1),('orange',2)])
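
A small sketch of the case-class approach Sean describes, with illustrative names (assumes a spark-shell session):

case class Fruit(name: String, num: Int)   // named, typed fields

import spark.implicits._
val ds = sc.parallelize(Seq(Fruit("apple", 1), Fruit("orange", 2))).toDS()
ds.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- num: integer (nullable = false)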

Does spark support something like the bind function in R?

2022-02-08 Thread Andrew Davidson
I need to create a single table by selecting one column from thousands of files. The columns are all of the same type and have the same number of rows and row names. I am currently using join. I get OOM on a mega-mem cluster with 2.8 TB. Does spark have something like cbind()? “Take a sequence of ve

Does spark have something like rowsum() in R?

2022-02-08 Thread Andrew Davidson
As part of my data normalization process I need to calculate row sums. The following code works on smaller test data sets. It does not work on my big tables. When I run on a table with over 10,000 columns I get an OOM on a cluster with 2.8 TB. Is there a better way to implement this? Kind regards
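
Since the original code is cut off in this preview, here is one common way to express a row-wise sum in Spark Scala; it is only a sketch, and the id column name gene_id is a placeholder for whatever holds the row names:

import org.apache.spark.sql.functions.col

// sum every column except the (assumed) id column: builds col1 + col2 + ...
val sumCols = df.columns.filter(_ != "gene_id").map(col)
val withRowSum = df.withColumn("row_sum", sumCols.reduce(_ + _))
withRowSum.select("gene_id", "row_sum").show(5)

With more than 10,000 columns this builds one very wide expression, which can itself stress planning and memory; an array-based variant is sketched after Sean's reply further down.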

Re: question on the different way of RDD to dataframe

2022-02-08 Thread Mich Talebzadeh
As Sean mentioned, a Scala case class is a handy way of representing objects with names and types. For example, if you are reading a csv file with spaced column names like "counter party" and you want a more compact column name like counterparty:

scala> val location="hdfs://rhes75:9000/tmp/c
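
A minimal sketch of the renaming idea; the path and the spaced header name below are illustrative, not copied from Mich's truncated example:

val location = "hdfs://namenode:9000/tmp/sample.csv"   // illustrative path
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(location)

// give the spaced header a compact name
val renamed = raw.withColumnRenamed("counter party", "counterparty")
renamed.printSchema()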

Help With unstructured text file with spark scala

2022-02-08 Thread Danilo Sousa
Hi, I have to transform unstructured text to a dataframe. Could anyone please help with Scala code? The dataframe needs to be like:

operadora filial unidade contrato empresa plano codigo_beneficiario nome_beneficiario

The raw text looks like:

Relação de Beneficiários Ativos e Excluídos Carteira em#27/12/2019##Todos os Beneficiários O

Re: Does spark have something like rowsum() in R?

2022-02-08 Thread Sean Owen
That seems like a fine way to do it. Why you're running out of mem is probably more a function of your parallelism, cluster size, and the fact that R is a memory hog. I'm not sure there are great alternatives in R and Spark; in other languages you might more directly get the array of (numeric?) row
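
If the array-of-row-values route Sean hints at is attractive, a sketch in Scala using the higher-order aggregate function (Spark 3.0+); the column names are placeholders:

import org.apache.spark.sql.functions.{aggregate, array, col, lit}

// pack the numeric columns into one array column and fold it to a sum,
// instead of building a col1 + col2 + ... expression over thousands of columns;
// array() needs the columns to share one numeric type, so cast first if mixed
val numCols = df.columns.filter(_ != "gene_id").map(col)
val withRowSum = df.withColumn(
  "row_sum",
  aggregate(array(numCols: _*), lit(0.0), (acc, x) => acc + x))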

Re: Help With unstructured text file with spark scala

2022-02-08 Thread Lalwani, Jayesh
You will need to provide more info. Does the data contain records? Are the records "homogeneous", i.e., do they have the same fields? What is the format of the data? Are records separated by lines/separators? Is the data sharded across multiple files? How big is each shard?

On 2/8/22, 11:50 AM,

Re: StructuredStreaming - foreach/foreachBatch

2022-02-08 Thread Mich Talebzadeh
BTW you can check this LinkedIn article of mine on Processing Change Data Capture with Spark Structured Streaming. It covers the concept of triggers includin
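
Since this sub-thread is about foreach/foreachBatch and triggers, a generic sketch of that combination in Scala may help; every path, interval, and source here is a placeholder, not something taken from the article:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

// placeholder source: the built-in "rate" test stream, one row per second
val streamingDF = spark.readStream.format("rate").load()

val query = streamingDF.writeStream
  .trigger(Trigger.ProcessingTime("30 seconds"))          // micro-batch trigger
  .option("checkpointLocation", "/tmp/stream_chk")        // placeholder path
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // arbitrary per-micro-batch logic; here just appending to a placeholder path
    batchDF.write.mode("append").parquet("/tmp/stream_out")
  }
  .start()

query.awaitTermination()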

Re: Does spark support something like the bind function in R?

2022-02-08 Thread ayan guha
Hi, in Python, or in general in Spark, you can just "read" the files and select the column. I am assuming you are reading each file individually into separate dataframes and joining them. Instead, you can read all the files into a single dataframe and select one column.

On Wed, Feb 9, 2022 at 2:55 AM Andr
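
A sketch of that suggestion in Scala, tagging each row with its source file so the per-file values can still be told apart; the glob pattern and column name are illustrative:

import org.apache.spark.sql.functions.{col, input_file_name}

// one read over all files instead of thousands of separate reads + joins
val all = spark.read
  .option("header", "true")
  .csv("/data/counts/*.csv")                       // illustrative glob
  .select(input_file_name().alias("source_file"), col("sample_count"))
all.show(5, truncate = false)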

flatMap for dataframe

2022-02-08 Thread frakass
Hello

For an RDD I can apply the flatMap method:

>>> sc.parallelize(["a few words","ba na ba na"]).flatMap(lambda x: x.split(" ")).collect()
['a', 'few', 'words', 'ba', 'na', 'ba', 'na']

But for a dataframe, how can I flatMap like that?

>>> df.show()
++ | val

Re: question on the different way of RDD to dataframe

2022-02-08 Thread frakass
I know that using a case class I can control the data types strictly.

scala> val rdd = sc.parallelize(List(("apple",1),("orange",2)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:23

scala> rdd.toDF.printSchema
root
 |-- _1: string (nullable = true)
 |

Re: flatMap for dataframe

2022-02-08 Thread oliver dd
Hi,

You can achieve your goal by:

df.flatMap(row => row.getAs[String]("value").split(" "))

Best Regards,
oliverdding

Re: flatMap for dataframe

2022-02-08 Thread frakass
Is this the Scala syntax? Yes, in Scala I know how to do it by converting the df to a dataset. How to do it in pyspark? Thanks

On 2022/2/9 10:24, oliver dd wrote:
> df.flatMap(row => row.getAs[String]("value").split(" "))
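
One route that sidesteps the typed Dataset API is split plus explode in the DataFrame functions. The sketch below is Scala and assumes the string column is called value; the same split/explode functions exist in pyspark.sql.functions, so the shape carries over to Python directly:

import org.apache.spark.sql.functions.{col, explode, split}

// split the string column into an array of words, then emit one row per word
val words = df.select(explode(split(col("value"), " ")).alias("word"))
words.show()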

Re: question on the different way of RDD to dataframe

2022-02-08 Thread frakass
I think it's better as:

df1.map { case (w, x, y, z) => columns(w, x, y, z) }

Thanks

On 2022/2/9 12:46, Mich Talebzadeh wrote:
> scala> val df2 = df1.map(p => columns(p(0).toString, p(1).toString, p(2).toString, p(3).toString.toDouble)) // map those columns
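
For what it's worth, the pattern-match form compiles when df1 is a typed Dataset of tuples rather than a DataFrame of Rows; a small self-contained sketch with illustrative field names and data (assumes a spark-shell session):

case class Columns(a: String, b: String, c: String, d: Double)

import spark.implicits._
val ds = Seq(("w1", "x1", "y1", "1.5")).toDS()    // Dataset[(String, String, String, String)]
val typed = ds.map { case (w, x, y, z) => Columns(w, x, y, z.toDouble) }
typed.printSchema()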

Re: Help With unstructured text file with spark scala

2022-02-08 Thread Bitfox
Hello

You can treat it as a csv file and load it from spark:

>>> df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").option("sep","#").load(csv_file)
>>> df.show()
++---+-+ | Plano|Código Benefic
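
Since the original question asked for Scala, the equivalent read there would look roughly like this (the file path is a placeholder):

val df = spark.read
  .format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("sep", "#")
  .load("/path/to/beneficiarios.txt")   // placeholder path
df.show()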