Off the top of my head, create a DataFrame by reading the CSV file. This is Python:
listing_df = spark.read.format("com.databricks.spark.csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load(csv_file)
listing_df.printSchema()
listing_df.createOrReplaceTempView("temp")
## do your distinct columns using windowing functions on the temp table
## with SQL (see the sketch at the end of this message)

HTH

View my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

On Fri, 10 Feb 2023 at 21:59, sam smith <qustacksm2123...@gmail.com> wrote:

> I am not sure I understand "Just need to do the cols one at a time".
> Plus I think Apostolos is right, this needs a DataFrame approach, not a
> list approach.
>
> On Fri, 10 Feb 2023 at 22:47, Sean Owen <sro...@gmail.com> wrote:
>
>> For each column, select only that col and get distinct values. Similar
>> to what you do here. Just need to do the cols one at a time. Your
>> current code doesn't do what you want.
>>
>> On Fri, Feb 10, 2023, 3:46 PM sam smith <qustacksm2123...@gmail.com>
>> wrote:
>>
>>> Hi Sean,
>>>
>>> "You need to select the distinct values of each col one at a time",
>>> how?
>>>
>>> On Fri, 10 Feb 2023 at 22:40, Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> That gives you all distinct tuples of those col values. You need to
>>>> select the distinct values of each col one at a time. Sure, just
>>>> collect() the result as you do here.
>>>>
>>>> On Fri, Feb 10, 2023, 3:34 PM sam smith <qustacksm2123...@gmail.com>
>>>> wrote:
>>>>
>>>>> I want to get the distinct values of each column in a List (is it
>>>>> good practice to use a List here?) that contains the column name as
>>>>> its first element and the distinct values as the remaining elements,
>>>>> so that for a dataset we get a list of lists. I do it this way (in my
>>>>> opinion not so fast):
>>>>>
>>>>> List<List<String>> finalList = new ArrayList<List<String>>();
>>>>> Dataset<Row> df = spark.read().format("csv")
>>>>>     .option("header", "true").load("/pathToCSV");
>>>>> String[] columnNames = df.columns();
>>>>> for (int i = 0; i < columnNames.length; i++) {
>>>>>     List<String> columnList = new ArrayList<String>();
>>>>>     columnList.add(columnNames[i]);
>>>>>     List<Row> columnValues = df.filter(
>>>>>         org.apache.spark.sql.functions.col(columnNames[i]).isNotNull())
>>>>>         .select(columnNames[i]).distinct().collectAsList();
>>>>>     for (int j = 0; j < columnValues.size(); j++)
>>>>>         columnList.add(columnValues.get(j).apply(0).toString());
>>>>>     finalList.add(columnList);
>>>>> }
>>>>>
>>>>> How to improve this?
>>>>>
>>>>> Also, can I get the results in JSON format?
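
A minimal sketch of what Sean is describing, in the same Java API as the
question above: loop over the columns and run one distinct per column.
The toJSON() call on a Dataset also answers the JSON question. The path
/pathToCSV is carried over from the original snippet; the class and app
names are made up for illustration.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class DistinctPerColumn {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("distinct-per-column").getOrCreate();
        Dataset<Row> df = spark.read().format("csv")
                .option("header", "true").load("/pathToCSV");

        // One job per column, as suggested: select just that column,
        // drop nulls, take the distinct values, collect.
        List<List<String>> finalList = new ArrayList<>();
        for (String name : df.columns()) {
            List<String> columnList = new ArrayList<>();
            columnList.add(name);
            for (Row r : df.select(col(name))
                    .filter(col(name).isNotNull())
                    .distinct().collectAsList()) {
                columnList.add(r.get(0).toString());
            }
            finalList.add(columnList);
        }

        // For JSON output, each distinct Dataset can be serialized
        // directly with toJSON(); shown here for the first column only.
        String first = df.columns()[0];
        List<String> asJson = df.select(col(first))
                .filter(col(first).isNotNull())
                .distinct().toJSON().collectAsList();
        asJson.forEach(System.out::println);
    }
}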
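
And a minimal sketch of the window-function route suggested at the top of
this message, again in Java, against the "temp" view: ROW_NUMBER()
partitioned by a column keeps exactly one row per distinct value. The
column name "price" and the class name are hypothetical.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DistinctViaWindow {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("distinct-via-window").getOrCreate();

        // Same load as in the Python snippet above, registered as "temp".
        spark.read().format("csv")
                .option("inferSchema", "true").option("header", "true")
                .load("/pathToCSV")
                .createOrReplaceTempView("temp");

        // ROW_NUMBER() partitioned by the column assigns 1 to one row per
        // distinct value; keeping only rn = 1 deduplicates that column.
        Dataset<Row> distinctPrices = spark.sql(
              "SELECT price FROM ("
            + "  SELECT price, ROW_NUMBER() OVER"
            + "    (PARTITION BY price ORDER BY price) AS rn"
            + "  FROM temp WHERE price IS NOT NULL"
            + ") t WHERE rn = 1");
        distinctPrices.show();
    }
}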