I am not sure I understand "Just need to do the cols one at a time". Also, I think Apostolos is right: this needs a DataFrame approach, not a list approach.
On Fri, Feb 10, 2023 at 22:47, Sean Owen <sro...@gmail.com> wrote:

> For each column, select only that col and get distinct values. Similar to
> what you do here. Just need to do the cols one at a time. Your current code
> doesn't do what you want.
>
> On Fri, Feb 10, 2023, 3:46 PM sam smith <qustacksm2123...@gmail.com>
> wrote:
>
>> Hi Sean,
>>
>> "You need to select the distinct values of each col one at a time" — how?
>>
>> On Fri, Feb 10, 2023 at 22:40, Sean Owen <sro...@gmail.com> wrote:
>>
>>> That gives you all distinct tuples of those col values. You need to
>>> select the distinct values of each col one at a time. Sure, just collect()
>>> the result as you do here.
>>>
>>> On Fri, Feb 10, 2023, 3:34 PM sam smith <qustacksm2123...@gmail.com>
>>> wrote:
>>>
>>>> I want to get the distinct values of each column in a List (is it good
>>>> practice to use List here?) that contains the column name as its first
>>>> element and the column's distinct values as the remaining elements, so
>>>> that for a dataset we get a list of lists. I do it this way (in my
>>>> opinion not so fast):
>>>>
>>>> List<List<String>> finalList = new ArrayList<List<String>>();
>>>> Dataset<Row> df = spark.read().format("csv").option("header",
>>>>         "true").load("/pathToCSV");
>>>> String[] columnNames = df.columns();
>>>> for (int i = 0; i < columnNames.length; i++) {
>>>>     List<String> columnList = new ArrayList<String>();
>>>>     columnList.add(columnNames[i]);
>>>>     List<Row> columnValues = df
>>>>             .filter(org.apache.spark.sql.functions.col(columnNames[i]).isNotNull())
>>>>             .select(columnNames[i]).distinct().collectAsList();
>>>>     for (int j = 0; j < columnValues.size(); j++)
>>>>         columnList.add(columnValues.get(j).apply(0).toString());
>>>>     finalList.add(columnList);
>>>> }
>>>>
>>>> How to improve this?
>>>>
>>>> Also, can I get the results in JSON format?
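Sean's suggestion — select each column on its own, take distinct, and collect — can be sketched in plain Java. This is not Spark code: the in-memory rows and the `toJson` helper below are hypothetical stand-ins so the logic is self-contained; the corresponding Spark calls are noted in comments.

```java
import java.util.*;

public class DistinctPerColumn {

    // Per-column distinct values, one column at a time, keeping first-seen order.
    // With a Spark Dataset<Row> the loop body would roughly be:
    //   df.filter(col(name).isNotNull()).select(name).distinct().collectAsList()
    static Map<String, List<String>> distinctPerColumn(List<Map<String, String>> rows,
                                                       List<String> columns) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String name : columns) {
            Set<String> seen = new LinkedHashSet<>();
            for (Map<String, String> row : rows) {
                String v = row.get(name);
                if (v != null) {          // mirrors .isNotNull()
                    seen.add(v);
                }
            }
            result.put(name, new ArrayList<>(seen));
        }
        return result;
    }

    // Naive JSON rendering: {"col":["v1","v2"],...}. A real program would use a
    // JSON library (or Spark's own Dataset.toJSON) instead of string concatenation,
    // since this does not escape quotes inside values.
    static String toJson(Map<String, List<String>> distinct) {
        StringBuilder sb = new StringBuilder("{");
        boolean firstCol = true;
        for (Map.Entry<String, List<String>> e : distinct.entrySet()) {
            if (!firstCol) sb.append(",");
            firstCol = false;
            sb.append("\"").append(e.getKey()).append("\":[");
            boolean firstVal = true;
            for (String v : e.getValue()) {
                if (!firstVal) sb.append(",");
                firstVal = false;
                sb.append("\"").append(v).append("\"");
            }
            sb.append("]");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = List.of(
                Map.of("city", "Paris", "year", "2022"),
                Map.of("city", "Paris", "year", "2023"),
                Map.of("city", "Lyon",  "year", "2023"));
        Map<String, List<String>> d = distinctPerColumn(rows, List.of("city", "year"));
        System.out.println(toJson(d));
        // → {"city":["Paris","Lyon"],"year":["2022","2023"]}
    }
}
```

The point of doing it per column is that `df.select(c1, c2, ...).distinct()` gives distinct *tuples* across all the columns together, which is what the original code's caller would get wrong; one small `select(col).distinct()` job per column gives the per-column value sets the thread is after.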