Hi,
With Spark 3.2 and Hadoop 3.2, using YARN cluster mode, if one wants to read a
dataset that is present on one node of the cluster and not on the others, how
does one tell Spark that?
I expect this would go through DataFrameReader, using a path like
*IP:port/pathOnLocalNode*
PS: loading the dataset into HDFS is not an option.
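Since the thread is cut off here, a rough sketch of one possible workaround, assuming the
file is small and readable from the machine where the driver runs (the path
/pathOnLocalNode/data.csv and the header option are only illustrations): read the file on
the driver and hand the lines to Spark's CSV parser.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("local-file-read").getOrCreate();
// Read the file in the driver JVM only, so it does not need to exist on every node.
List<String> lines = Files.readAllLines(Paths.get("/pathOnLocalNode/data.csv"));
// Distribute the lines and let Spark parse them as CSV.
Dataset<String> raw = spark.createDataset(lines, Encoders.STRING());
Dataset<Row> df = spark.read().option("header", "true").csv(raw);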
Hello
I want to change values of a column in a dataset according to a mapping
list that maps original values of that column to other new values. Each
element of the list (colMappingValues) is a string that separates the
original values from the new values using a ";".
So for a given column (in th
Hello to you Sparkling community :)
I want to change values of a column in a dataset according to a mapping
list that maps original values of that column to other new values. Each
element of the list (colMappingValues) is a string that separates the
original values from the new values using a ";".
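The message is truncated here, so this is only a guess at what was being asked; a minimal
Java sketch under the assumption that each entry of colMappingValues looks like
"originalValue;newValue" and that colName names the column to rewrite (both names are
placeholders):

import static org.apache.spark.sql.functions.*;
import java.util.List;
import org.apache.spark.sql.Column;

// Build a chained when(...).otherwise(...) expression: values that match a mapping entry
// are replaced, everything else keeps the original column value.
Column remapped = col(colName);
for (String mapping : colMappingValues) {
    String[] parts = mapping.split(";", 2); // parts[0] = original value, parts[1] = new value
    remapped = when(col(colName).equalTo(lit(parts[0])), lit(parts[1])).otherwise(remapped);
}
dataset = dataset.withColumn(colName, remapped);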
Reasoning in terms of files (vs. datasets, as I first thought of this question), I
think this is more adequate in Spark:
> org.apache.spark.util.Utils.getFileLength(new File("filePath"), null);
it will yield the same result as
> new File("filePath").length();
On Sun, Jun 19, 2022 at 11:11 AM, Enrico Minack wrote:
Hi,
I found this (
https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html)
that may be helpful; I use Java:
> org.apache.spark.util.SizeEstimator.estimate(dataset);
On Sat, Jun 18, 2022 at 10:33 PM, mbreuer wrote:
> Hello Community,
>
> I am working on opt
n.
Anyways thanks guys!
On Fri, Jun 17, 2022 at 10:35 PM, marc nicole wrote:
> String dateString = String.format("%d-%02d-%02d", 2012, 02, 03);
> Date sqlDate = java.sql.Date.valueOf(dateString);
> dataset=
> dataset.where(to_date(dataset.col("Date"), "MM-dd-yyyy")
than
this....
On Fri, Jun 17, 2022 at 10:13 PM, marc nicole wrote:
> @Stelios: to_date requires a column type
> @Sean: how to parse a literal to a date, lit("02-03-2012").cast("date")?
>
> On Fri, Jun 17, 2022 at 10:07 PM, Stelios Philippou wrote:
>
>>
2012",
> "MM-dd-"));
>
> On Fri, 17 Jun 2022, 22:51 marc nicole, wrote:
>
>> dataset =
>> dataset.where(to_date(dataset.col("Date"), "MM-dd-yyyy").geq("02-03-2012").cast("date"));
>> ?
>> This is returning
second part and don't forget to cast it as well
>
> On Fri, 17 Jun 2022, 22:08 marc nicole, wrote:
>
>> should I cast the target date to date then? For example maybe:
>>
>> dataset =
>>> dataset.where(to_date(dataset.col("Date"), "MM-dd-yyyy")
Sean Owen wrote:
> Look at your query again. You are comparing dates to strings. The dates
> widen back to strings.
>
> On Fri, Jun 17, 2022, 1:39 PM marc nicole wrote:
>
>> I also tried:
>>
>> dataset =
>>> dataset.where(to_date(dataset.col("Date"),
5 as a
> string is before 02-03-2012.
> You apply date functions to dates, not strings.
> You have to parse the dates properly, which was the problem in your last
> email.
>
> On Fri, Jun 17, 2022 at 12:58 PM marc nicole wrote:
>
>> Hello,
>>
>> I have a dataset
Hello,
I have a dataset containing a column of dates, which I want to use for
filtering. Nothing, from what I have tried, seems to return exactly the right
result.
Here's my input:
+----------+
|      Date|
+----------+
|02-08-2019|
|02-07-2019|
+----------+
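Pulling together Sean's advice further down the thread (compare dates to dates, use MM for
months, and parse/cast the literal too), a sketch of the filter that should behave as
intended; the cutoff 02-03-2012 is the one quoted in the follow-up messages:

import static org.apache.spark.sql.functions.*;

// Parse both sides with the same MM-dd-yyyy pattern (MM = month, mm = minute),
// so the comparison happens between dates rather than between strings.
dataset = dataset.where(
    to_date(dataset.col("Date"), "MM-dd-yyyy")
        .geq(to_date(lit("02-03-2012"), "MM-dd-yyyy")));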
Finally solved it with the MM-for-months format recommendation. Thanks!
On Tue, Jun 14, 2022 at 11:02 PM, marc nicole wrote:
> I changed the format to yyyy-mm-dd for the example
>
> On Tue, Jun 14, 2022 at 10:52 PM, Sean Owen wrote:
>
>> Look at your data - doesn't match date format you give
I changed the format to yyyy-mm-dd for the example
On Tue, Jun 14, 2022 at 10:52 PM, Sean Owen wrote:
> Look at your data - doesn't match date format you give
>
> On Tue, Jun 14, 2022, 3:41 PM marc nicole wrote:
>
>> for the input (I changed the format):
>>
>
+----------+
|      Date|
+----------+
|2022-02-08|
+----------+
the output was 2012-01-03
Note that for my code below to work I cast the resulting min column to
string.
On Tue, Jun 14, 2022 at 9:12 PM, Sean Owen wrote:
> You haven't shown your input or the result
>
> On Tue, Jun 14, 2022 at 1:40 PM marc
ically.
>
> Small note, MM is month. mm is minute. You have to fix that for this to
> work. These are Java format strings.
>
> On Tue, Jun 14, 2022, 12:32 PM marc nicole wrote:
>
>> Hi,
>>
>> I want to identify a column of dates as such, the column has formatted
>
Hi,
I want to identify a column of dates as such: the column has formatted
strings like "06-14-2022" (the format being mm-dd-yyyy), and I want to get
the minimum of those dates.
I tried in Java as follows:
if (dataset.filter(org.apache.spark.sql.functions.to_date(
> dataset.col(colName), "mm-dd
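The snippet is cut off above; following Sean's mm-vs-MM note elsewhere in the thread, a
possible way to get the minimum date, assuming colName holds the column name:

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Row;

// Parse with MM (months) rather than mm (minutes), then take the minimum date.
Row minRow = dataset
    .select(min(to_date(col(colName), "MM-dd-yyyy")).alias("minDate"))
    .first();
java.sql.Date minDate = minRow.getDate(0);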
that '+' refers
> to "no value":
>
> spark.read
> .option("inferSchema", "true")
> .option("header", "true")
> .option("nullvalue", "+")
> .csv("path")
>
> Enrico
>
.csv() method.
Any better idea to do that?
On Sat, Jun 4, 2022 at 6:40 PM, Enrico Minack wrote:
> Can you provide an example string (row) and the expected inferred schema?
>
> Enrico
>
>
> On 04.06.22 at 18:36, marc nicole wrote:
>
> How to do just that? I thought we only can
truing
> the entire row as a string like "Row[foo=bar, baz=1]"
>
> On Sat, Jun 4, 2022 at 10:32 AM marc nicole wrote:
>
>> Hi Sean,
>>
>> Thanks, actually I have a dataset where I want to inferSchema after
>> discarding the specific String value of "+"
ctly this way.
> You can use a UDF to call .toString on the Row of course, but, again
> what are you really trying to do?
>
> On Sat, Jun 4, 2022 at 7:35 AM marc nicole wrote:
>
>> Hi,
>> How to convert a Dataset<Row> to a Dataset<String>?
>> What I have tried is:
>>
>>
Hi,
How to convert a Dataset<Row> to a Dataset<String>?
What I have tried is:
List<String> list = dataset.as(Encoders.STRING()).collectAsList();
Dataset<String> datasetSt = spark.createDataset(list, Encoders.STRING());
// But this line raises a org.apache.spark.sql.AnalysisException: Try to
map struct... to Tuple1, but failed
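Sean's reply above suggests calling .toString on each Row; a sketch of that in Java (the
variable names mirror the ones in the message):

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Map each Row to its string form explicitly instead of asking Spark to encode a
// multi-column struct as a String (which raises the AnalysisException above).
Dataset<String> datasetSt = dataset.map(
    (MapFunction<Row, String>) Row::toString,
    Encoders.STRING());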
I have a dataset where I want to count the distinct values of a column for
each group of other columns. I do it like so:
processedDataset = processedDataset.withColumn("freq",
approx_count_distinct("col1").over(Window.partitionBy(groupCols.toArray(new
Column[groupCols.size()]))));
but even when I have duplicates
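The message is cut off, so the actual problem is unclear; if an exact per-group count is
what is wanted, a possible alternative to the approximate window version (groupCols is the
same list of grouping columns as above):

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Exact distinct count of col1 per group, instead of the approximate window function.
Dataset<Row> freqs = processedDataset
    .groupBy(groupCols.toArray(new Column[0]))
    .agg(countDistinct(col("col1")).alias("freq"));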
So sorry, the matching pattern is rather '^\d*[.]\d*$'
On Sun, May 29, 2022 at 7:58 PM, marc nicole wrote:
> Hi,
>
> I think this part of your first line of code*
> ...regexp_replace(col("annual_salary"), "\.", "") *is messing things up,
>
Hi,
I think this part of your first line of code*
...regexp_replace(col("annual_salary"), "\.", "") *is messing things up, so
try to remove it.
Also try to use this numerical matching pattern '^[0-9]*$' in your code
instead
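To illustrate the suggestion, a small sketch of how that pattern could be used to keep
only purely numeric salary values (the column name is taken from the quoted code, the
dataset name is a placeholder):

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Keep only rows whose annual_salary is purely numeric, per the pattern above.
Dataset<Row> numericRows = dataset.filter(col("annual_salary").rlike("^[0-9]*$"));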
On Sun, May 29, 2022 at 7:24 PM, Sid wrote:
> Hi Team,
>
> I need help
Hi Spark devs,
Anybody willing to check my code implementing *k-anonymity*?
public static Dataset<Row> kAnonymizeBySuppression(SparkSession
sparksession, Dataset<Row> initDataset, List<String> qidAtts, Integer
k_anonymity_constant) {
Dataset<Row> anonymizedDF = sparksession.empt
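The code under review is cut off here. For readers of the archive, a rough sketch of what
k-anonymity by suppression can look like in Spark, written as an illustration with the
same signature as above (it is not the author's code, and it assumes the quasi-identifier
columns are string-typed): rows whose combination of quasi-identifier attributes occurs
fewer than k times get those attributes replaced with "*".

import static org.apache.spark.sql.functions.*;

import java.util.List;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// Illustrative sketch only, not the code under review.
public static Dataset<Row> kAnonymizeBySuppression(SparkSession sparksession,
        Dataset<Row> initDataset, List<String> qidAtts, Integer k_anonymity_constant) {
    Column[] qidCols = qidAtts.stream().map(a -> col(a)).toArray(Column[]::new);

    // Attach the size of each quasi-identifier group to every row via a window.
    WindowSpec byQid = Window.partitionBy(qidCols);
    Dataset<Row> withSize = initDataset.withColumn("qid_group_size", count(lit(1)).over(byQid));

    // Suppress the quasi-identifier values of rows whose group is smaller than k.
    Dataset<Row> result = withSize;
    for (String att : qidAtts) {
        result = result.withColumn(att,
                when(col("qid_group_size").lt(lit(k_anonymity_constant)), lit("*"))
                        .otherwise(col(att)));
    }
    return result.drop("qid_group_size");
}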
Hi all,
Sorry for posting this twice,
I need to know how to group by several column attributes (e.g., List<String> groupByAttributes) a dataset (dataset) and then count the occurrences of
the associated grouped rows. How do I achieve that?
I tried through the following code:
> Dataset<Row> groupedRows = dataset.w
On Tue, Apr 19, 2022 at 2:06 PM, Sean Owen wrote:
> Just .groupBy(...).count() ?
>
> On Tue, Apr 19, 2022 at 6:24 AM marc nicole wrote:
>
>> Hello guys,
>>
>> I want to group by certain column attributes (e.g., List<String>
>> groupByQidAttributes) a dataset (initDataset) and th
Hello guys,
I want to group by certain column attributes (e.g., List<String>
groupByQidAttributes) a dataset (initDataset) and then count the
occurrences of the associated grouped rows. How do I achieve that neatly?
I tried through the following code:
Dataset<Row> groupedRowsDF = initDataset.withColumn("qidsFreqs",
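Following Sean's ".groupBy(...).count()" suggestion in the reply above, a short sketch of
how that could look in Java for this question (groupByQidAttributes and initDataset are
the names from the message, and the list is assumed to hold column names):

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Group by the chosen attributes and count the rows in each group; the result keeps
// the grouping columns and adds a "count" column with the number of occurrences.
Column[] groupCols = groupByQidAttributes.stream().map(a -> col(a)).toArray(Column[]::new);
Dataset<Row> counts = initDataset.groupBy(groupCols).count();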
Hello Guys,
I would like you to review my code, available in this GitHub repo:
https://github.com/MNicole12/AlgorithmForReview/blob/main/codeReview.java
Thanks in advance for your comments and improvements.
Marc.