Why would CSV or a temp table change anything here? You don't need
windowing for distinct values either.
On Fri, Feb 10, 2023, 6:01 PM Mich Talebzadeh
wrote:
> Off the top of my head, create a dataframe by reading the CSV file.
>
> This is Python
>
> listing_df =
> spark.read.format("com.databricks.spark.cs
Off the top of my head, create a dataframe by reading the CSV file.
This is Python:
listing_df = (
    spark.read.format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(csv_file)
)
listing_df.printSchema()
listing_df.createOrReplaceTempView("temp")
## do your distinct colum
I am not sure I understand "Just need to do the cols one at a time".
Plus, I think Apostolos is right: this needs a dataframe approach, not a list
approach.
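
One possible reading of a "dataframe approach" is sketched below: build one lazy, distinct DataFrame per column and only pull values to the driver when a column turns out to be small. The function names `distinct_frames` and `take_if_small` and the `max_values` cap are hypothetical, not something proposed in the thread; the code only assumes the standard DataFrame attributes `columns`, `select`, `distinct`, `limit` and `collect`.

```python
def distinct_frames(df):
    """Map each column name to a DataFrame holding that column's distinct values.

    Nothing is computed here: select() and distinct() are lazy, so the
    values stay distributed until an action is called on a frame.
    """
    return {name: df.select(name).distinct() for name in df.columns}


def take_if_small(frame, max_values):
    """Collect a single-column frame only if it has at most max_values rows.

    limit(max_values + 1) bounds what reaches the driver, so a column with
    millions of distinct values is never fully collected. Returns None when
    the column is too large (hypothetical convention for this sketch).
    """
    rows = frame.limit(max_values + 1).collect()
    if len(rows) > max_values:
        return None
    return [row[0] for row in rows]
```

With a real DataFrame `df`, `distinct_frames(df)["some_col"]` could be written out or joined like any other DataFrame, while `take_if_small(frame, 1000)` would give a local list only for low-cardinality columns.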
On Fri, Feb 10, 2023 at 22:47, Sean Owen wrote:
> For each column, select only that col and get distinct values. Similar to
> what
Hi Apostolos,
Can you suggest a better approach while keeping values within a dataframe?
On Fri, Feb 10, 2023 at 22:47, Apostolos N. Papadopoulos <
papad...@csd.auth.gr> wrote:
> Dear Sam,
>
> you are assuming that the data fits in the memory of your local machine.
> You are using as a basis a
Hi Sean,
"You need to select the distinct values of each col one at a time": how?
On Fri, Feb 10, 2023 at 22:40, Sean Owen wrote:
> That gives you all distinct tuples of those col values. You need to select
> the distinct values of each col one at a time. Sure just collect() the
> result as
Dear Sam,
you are assuming that the data fits in the memory of your local machine.
You are using as a basis a dataframe, which potentially can be very
large, and then you are storing the data in local lists. Keep in mind
that the number of distinct elements in a column may be very large
That gives you all distinct tuples of those col values. You need to select
the distinct values of each col one at a time. Sure just collect() the
result as you do here.
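
A minimal sketch of the column-at-a-time idea, in Python rather than the Java of the original question. The function name `distinct_values_per_column` is hypothetical, and it assumes each column's distinct values fit in driver memory; it relies only on the DataFrame's `columns`, `select`, `distinct` and `collect`.

```python
def distinct_values_per_column(df):
    """Return [[column_name, value1, value2, ...], ...] for a Spark DataFrame.

    Assumption: the distinct values of every column are few enough to fit
    in driver memory, because collect() brings them all to the driver.
    """
    result = []
    for name in df.columns:
        # One Spark job per column: project the column, deduplicate, collect.
        rows = df.select(name).distinct().collect()
        # Each collected Row has a single field; row[0] is the column value.
        result.append([name] + [row[0] for row in rows])
    return result
```

Calling `df.select("a", "b").distinct()` instead would return distinct (a, b) tuples, which is the behaviour described above as not what is wanted. The order of the values inside each inner list is not guaranteed.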
On Fri, Feb 10, 2023, 3:34 PM sam smith wrote:
> I want to get the distinct values of each column in a List (is it good
> pract
I want to get the distinct values of each column in a List (is it good
practice to use List here?) that contains the column name as its first
element and the column's distinct values as the remaining elements, so that
for a dataset we get a list of lists. I do it this way (in my opinion not so fast):
List> finalList =