Re: What could be the cause of an execution freeze on Hadoop for small datasets?

2023-03-11 Thread sam smith

What could be the cause of an execution freeze on Hadoop for small datasets?

2023-03-11 Thread sam smith
Hello guys, I am launching a Spark program through code (client mode) to run on Hadoop. If I execute methods such as show(), count(), or collectAsList() on the dataset (these are displayed in the Spark UI) after performing heavy transformations on the columns, then the mentioned methods wi…
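
One frequent cause of such freezes (not quoted from any reply here) is the very long lineage that heavy column transformations build up, which can stall job planning or trigger costly recomputation when an action finally runs. A minimal sketch of breaking the lineage with a checkpoint, assuming a SparkSession named spark and a writable HDFS path (both hypothetical):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // hypothetical checkpoint directory; must be reachable by all executors
    spark.sparkContext().setCheckpointDir("hdfs:///tmp/checkpoints");

    Dataset<Row> df = spark.read().option("header", "true").csv("hdfs:///data/input.csv");
    // ... heavy column transformations here ...

    // checkpoint() materializes the data and truncates the lineage,
    // so later actions re-plan from the materialized result
    df = df.checkpoint();
    df.show();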

How to allocate vcores to driver (client mode)

2023-03-10 Thread sam smith
Hi, I am launching a Spark program through code (client mode) to run on Hadoop. Whenever I check the executors tab of the Spark UI, I always get 0 as the number of vcores for the driver. I tried to change that using *spark.driver.cores*, or also *spark.yarn.am.cores*, in the SparkSession configuration b…
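
A note on the mechanics, hedged since no reply is quoted here: in YARN client mode the driver runs inside the already-started submitting JVM, outside YARN's resource management, so the executors tab can report 0 vcores for it regardless of what the SparkSession config sets. Driver resources are normally fixed at launch time instead; a sketch (values and class name are placeholders):

    # in cluster mode YARN allocates the driver container,
    # so --driver-cores actually takes effect
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --driver-cores 2 \
      --class com.example.MyApp myapp.jar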

How to share a dataset file across nodes

2023-03-09 Thread sam smith
Hello, I use YARN client mode to submit my driver program to Hadoop. The dataset I load is on the local file system; when I invoke load("file://path"), Spark complains that the CSV file cannot be found, which I totally understand, since the dataset is not on any of the workers or the application…
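
The usual fix is to put the file on storage every node can reach, such as HDFS, and load it from there. A sketch with placeholder paths, assuming a SparkSession named spark:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // after copying the file into HDFS, e.g. with:
    //   hdfs dfs -put /local/path/data.csv /data/data.csv
    // every executor can resolve the same path:
    Dataset<Row> df = spark.read()
        .option("header", "true")
        .csv("hdfs:///data/data.csv");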

Re: How to explode array columns of a dataframe having the same length

2023-02-16 Thread sam smith
quot;,"C","E"), List("B","D","null"), List("null","null","null")) > and use flatmap with that method. > > In Scala, this would read: > > df.flatMap { row => (row.getSeq[String](0), row.getSeq[String](1), &g

How to explode array columns of a dataframe having the same length

2023-02-14 Thread sam smith
Hello guys, I have the following dataframe:

  col1               col2               col3
  ["A","B","null"]   ["C","D","null"]   ["E","null","null"]

I want to explode it to the following dataframe:

  col1     col2     col3
  "A"      "C"      "E"
  "B"      "D"      "null"
  "null"   "null"   "null"

How to do that (preferably in Java) using…
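
The replies in this thread use flatMap; an alternative sketch in Java (assuming Spark 2.4+ and that col1, col2, col3 are array columns) zips the arrays and explodes the result:

    import static org.apache.spark.sql.functions.*;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // arrays_zip pairs up the i-th elements of the three arrays into structs;
    // explode then emits one row per struct
    Dataset<Row> exploded = df
        .withColumn("z", explode(arrays_zip(col("col1"), col("col2"), col("col3"))))
        .select(col("z.col1"), col("z.col2"), col("z.col3"));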

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-13 Thread sam smith
…because it's an aggregate function. You have to groupBy() > (group by nothing) to make that work, but you can't assign that as a > column. Folks, those approaches don't make sense semantically in SQL or > Spark or anything. > They just mean using threads to collect() distinct val…

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread sam smith
…withColumn(columnName, > collect_set(col(columnName)).as(columnName)); > } > > Then you have a single DataFrame that computes all columns in a single > Spark job. > > But this reads all distinct values into a single partition, which has the > same downside as collect, so…
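
A hypothetical end-to-end reconstruction of that quoted suggestion in Java (not the poster's exact code; df is assumed):

    import static org.apache.spark.sql.functions.*;
    import java.util.Arrays;
    import org.apache.spark.sql.Column;
    import org.apache.spark.sql.Row;

    // one collect_set per column, all evaluated in a single Spark job
    Column[] aggs = Arrays.stream(df.columns())
        .map(c -> collect_set(col(c)).as(c))
        .toArray(Column[]::new);
    Row distinctPerColumn = df.agg(aggs[0], Arrays.copyOfRange(aggs, 1, aggs.length)).first();

As the reply warns, this still funnels every distinct value through one aggregation, so very high-cardinality columns keep the same memory caveat as collect().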

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread sam smith
…not clear. > > On Sun, Feb 12, 2023 at 10:59 AM sam smith > wrote: > >> @Enrico Minack Thanks for "unpivot" but I am >> using version 3.3.0 (you are taking it way too far as usual :) ) >> @Sean Owen Please then show me how it can be improved by >> code. …

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread sam smith
…
> def distinctValuesPerColumn(df: DataFrame): immutable.Map[String, immutable.Seq[Any]] = {
>   df.schema.fields
>     .groupBy(_.dataType)
>     .mapValues(_.map(_.name))
>     .par
>     .map { case (dataType, columns) => df.select(columns.map(col): _*) }
>     .ma…

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread sam smith
…similar to > what you do here. Just need to do the cols one at a time. Your current code > doesn't do what you want. > > On Fri, Feb 10, 2023, 3:46 PM sam smith > wrote: > >> Hi Sean, >> >> "You need to select the distinct values of each col one at a tim…

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread sam smith
…number of > distinct values is also large. Thus, you should keep your data in > dataframes or RDDs, and store them as csv files, parquet, etc. > > a.p. > > > On 10/2/23 23:40, sam smith wrote: > > I want to get the distinct values of each column in a List (is it good > pr…
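
That advice sketched in Java, with placeholder column and path names: keep the values distributed and persist them instead of collecting:

    // distinct values of one column, written out without collecting to the driver
    df.select("someCol").distinct()
      .write().mode("overwrite")
      .parquet("hdfs:///output/distinct_someCol");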

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread sam smith
…collect() the > result as you do here. > > On Fri, Feb 10, 2023, 3:34 PM sam smith > wrote: > >> I want to get the distinct values of each column in a List (is it good >> practice to use List here?), that contains as first element the column >> name, and the other ele…

How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread sam smith
I want to get the distinct values of each column in a List (is it good practice to use List here?) that contains the column name as its first element and that column's distinct values as the remaining elements, so that for a dataset we get a list of lists. I do it this way (in my opinion not so fast): List<List<String>> finalList = …
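
Following the replies above (one column at a time), a sketch of the straightforward version; df is assumed, and collectAsList() still pulls every distinct value to the driver:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.sql.Row;

    List<List<String>> finalList = new ArrayList<>();
    for (String name : df.columns()) {
        List<String> values = new ArrayList<>();
        values.add(name); // first element: the column name
        for (Row r : df.select(name).distinct().collectAsList()) {
            values.add(String.valueOf(r.get(0))); // remaining elements: distinct values
        }
        finalList.add(values);
    }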

Can we upload a csv dataset into Hive using SparkSQL?

2022-12-10 Thread sam smith
Hello, I want to create a table in Hive and then load a CSV file's content into it, all by means of Spark SQL. I saw in the docs the example with the .txt file, BUT can we instead do something like the following to accomplish what I want?: String warehouseLocation = new File("spark-warehouse").getAbsolutePath()…
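
A sketch of the whole round trip; the options, path, and table name are assumptions, not taken from the thread:

    import java.io.File;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
    SparkSession spark = SparkSession.builder()
        .appName("csv-to-hive")
        .config("spark.sql.warehouse.dir", warehouseLocation)
        .enableHiveSupport()
        .getOrCreate();

    // read the CSV with a header row, then persist it as a Hive table
    Dataset<Row> csv = spark.read()
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("/path/to/data.csv");
    csv.write().mode("overwrite").saveAsTable("my_table");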

Re: Aggregate over a column: the proper way to do

2022-04-10 Thread sam smith
Exactly: one row, and two columns. On Sat, Apr 9, 2022 at 17:44, Sean Owen wrote: > But it only has one row, right? > > On Sat, Apr 9, 2022, 10:06 AM sam smith > wrote: > >> Yes. Returns the number of rows in the Dataset as *long*, but in my case >> the aggrega…

Re: Aggregate over a column: the proper way to do

2022-04-09 Thread sam smith
Yes. Returns the number of rows in the Dataset as *long*, but in my case the aggregation returns a table of two columns. On Fri, Apr 8, 2022 at 14:12, Sean Owen wrote: > Dataset.count() returns one value directly? > > On Thu, Apr 7, 2022 at 11:25 PM sam smith > wrote: > …

Re: Aggregate over a column: the proper way to do

2022-04-07 Thread sam smith
…is pointless. > > On Thu, Apr 7, 2022, 11:10 PM sam smith > wrote: > >> What if I do avg instead of count? >> >> On Fri, Apr 8, 2022 at 05:32, Sean Owen wrote: >> >>> Wait, why groupBy at all? After the filter, only rows with myCol equal to >>…

Re: Aggregate over a column: the proper way to do

2022-04-07 Thread sam smith
What if I do avg instead of count? On Fri, Apr 8, 2022 at 05:32, Sean Owen wrote: > Wait, why groupBy at all? After the filter, only rows with myCol equal to > your target are left. There is only one group. Don't group, just count after > the filter? > > On Thu, Apr 7, 2022…
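
For avg the same simplification from the reply applies: filter, then aggregate directly, with no groupBy. A sketch, where the numeric column is hypothetical:

    import static org.apache.spark.sql.functions.*;
    import org.apache.spark.sql.Row;

    // average over the filtered rows; there is only one group, so no groupBy
    Row row = dataset
        .filter(col("myCol").equalTo("myTargetVal"))
        .agg(avg(col("someNumericCol"))) // hypothetical numeric column
        .first();
    double result = row.getDouble(0);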

Aggregate over a column: the proper way to do

2022-04-07 Thread sam smith
I want to aggregate a column by counting the number of rows having the value "myTargetValue" and return the result. I am doing it like the following, in Java: > long result = > dataset.filter(dataset.col("myCol").equalTo("myTargetVal")).groupBy(col("myCol")).agg(count(dataset.col("myCol"))).select("c…
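
As the replies point out, the groupBy buys nothing here; a minimal equivalent based on that advice:

    import static org.apache.spark.sql.functions.col;

    // filter to the target value, then count; no grouping needed
    long result = dataset
        .filter(col("myCol").equalTo("myTargetVal"))
        .count();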

Re: Spark execution on Hadoop cluster (many nodes)

2022-01-24 Thread sam smith
…can't answer until this is > cleared up. > > On Mon, Jan 24, 2022 at 10:57 AM sam smith > wrote: > >> I mean the DAG order is somehow altered when executing on Hadoop >> >> On Mon, Jan 24, 2022 at 17:17, Sean Owen wrote: > >>> Code is not e…

Re: Spark execution on Hadoop cluster (many nodes)

2022-01-24 Thread sam smith
…files, but you can order data. Still not sure what > specifically you are worried about here, but I don't think the kind of > thing you're contemplating can happen, no. > > On Mon, Jan 24, 2022 at 9:28 AM sam smith > wrote: > >> I am aware of that, but whenever the c…

Re: Spark execution on Hadoop cluster (many nodes)

2022-01-24 Thread sam smith
…would > something, what, modify the byte code? No. > > On Mon, Jan 24, 2022, 9:07 AM sam smith > wrote: > >> My point is: could Hadoop go wrong about one Spark execution? Meaning >> that it gets confused (given the concurrent distributed tasks) and then >> adds wrong instr…

Re: Spark execution on Hadoop cluster (many nodes)

2022-01-24 Thread sam smith
…alternatives here? Program execution order is still program execution > order. You are not guaranteed anything about the order of concurrent tasks. Failed tasks can be re-executed, so they should be idempotent. I think the answer > is 'no', but I am not sure what you are thinking of here. > >

Spark execution on Hadoop cluster (many nodes)

2022-01-24 Thread sam smith
Hello guys, I hope my question does not sound weird, but could a Spark execution on a Hadoop cluster give different output than the program actually specifies? I mean by that: could the execution order be messed up by Hadoop, or an instruction be executed twice…? Thanks for your enlightenment.

Re: About some Spark technical help

2021-12-24 Thread sam smith
…implementation compared to the original. > > Also, a verbal description of the algorithm would be helpful. > > Happy Holidays > > Andy > > On Fri, Dec 24, 2021 at 3:17 AM sam smith > wrote: > >> Hi Gourav, >> >> Good question! That's the programming la…

Re: About some Spark technical help

2021-12-24 Thread sam smith
…curiosity, why JAVA? > > Regards, > Gourav Sengupta > > On Thu, Dec 23, 2021 at 5:10 PM sam smith > wrote: > >> Hi Andrew, >> >> Thanks, here's the GitHub repo with the code and the publication: >> https://github.com/SamSmithDevs10/paperReplicationForReview >>…

Re: About some Spark technical help

2021-12-23 Thread sam smith
…could you send us the URL of the > publication? > > Kind regards > > Andy > > *From: *sam smith > *Date: *Wednesday, December 22, 2021 at 10:59 AM > *To: *"user@spark.apache.org" > *Subject: *About some Spark technical help > >…

dataset partitioning algorithm implementation help

2021-12-23 Thread sam smith
Hello All, I am replicating a paper's algorithm about a partitioning approach to anonymizing datasets with Spark / Java, and want to ask you for some help reviewing my 150 lines of code. My GitHub repo, linked below, contains both my Java class and the related paper: https://github.com/SamSmithDe…

About some Spark technical help

2021-12-22 Thread sam smith
Hello guys, I am replicating a paper's algorithm in Spark / Java, and want to ask for some assistance to validate / review about 150 lines of code. My GitHub repo contains both my Java class and the related paper. Any interested reviewers here? Thanks.

Re: About some Spark technical assistance

2021-12-13 Thread sam smith
You were added to the repo as a contributor, thanks. I included the Java class and the paper I am replicating. On Mon, Dec 13, 2021 at 04:27, wrote: > GitHub URL please. > > On 2021-12-13 01:06, sam smith wrote: > > Hello guys, > > > > I am replicating a paper'…

About some Spark technical assistance

2021-12-12 Thread sam smith
Hello guys, I am replicating a paper's algorithm (a graph coloring algorithm) in Spark under Java, and thought about asking for some assistance to validate / review my 600 lines of code. Any volunteers to share the code with? Thanks.