I am not sure I understand "Just need to do the cols one at a time". Also, I think Apostolos is right: this needs a DataFrame approach, not a list approach.
On Fri, Feb 10, 2023 at 22:47, Sean Owen <sro...@gmail.com> wrote:

> For each column, select only that col and get distinct values. Similar to
> what you do here. Just need to do the cols one at a time. Your current code
> doesn't do what you want.
>
> On Fri, Feb 10, 2023, 3:46 PM sam smith <qustacksm2123...@gmail.com>
> wrote:
>
>> Hi Sean,
>>
>> "You need to select the distinct values of each col one at a time" — how?
>>
>> On Fri, Feb 10, 2023 at 22:40, Sean Owen <sro...@gmail.com> wrote:
>>
>>> That gives you all distinct tuples of those col values. You need to
>>> select the distinct values of each col one at a time. Sure, just collect()
>>> the result as you do here.
>>>
>>> On Fri, Feb 10, 2023, 3:34 PM sam smith <qustacksm2123...@gmail.com>
>>> wrote:
>>>
>>>> I want to get the distinct values of each column in a List (is it good
>>>> practice to use List here?) that contains the column name as its first
>>>> element and the column's distinct values as the remaining elements, so
>>>> that for a dataset we get a list of lists. I do it this way (in my
>>>> opinion not so fast):
>>>>
>>>> List<List<String>> finalList = new ArrayList<List<String>>();
>>>> Dataset<Row> df = spark.read().format("csv").option("header",
>>>>         "true").load("/pathToCSV");
>>>> String[] columnNames = df.columns();
>>>> for (int i = 0; i < columnNames.length; i++) {
>>>>     List<String> columnList = new ArrayList<String>();
>>>>     columnList.add(columnNames[i]);
>>>>     List<Row> columnValues = df
>>>>             .filter(org.apache.spark.sql.functions.col(columnNames[i]).isNotNull())
>>>>             .select(columnNames[i]).distinct().collectAsList();
>>>>     for (int j = 0; j < columnValues.size(); j++)
>>>>         columnList.add(columnValues.get(j).apply(0).toString());
>>>>     finalList.add(columnList);
>>>> }
>>>>
>>>> How to improve this?
>>>>
>>>> Also, can I get the results in JSON format?
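Sean's suggestion — select each column on its own, take distinct, and collect — can be sketched in plain Java. This is not Spark code: the in-memory rows and the `toJson` helper below are hypothetical stand-ins so the logic is self-contained; the corresponding Spark calls are noted in comments.

```java
import java.util.*;

public class DistinctPerColumn {

    // Per-column distinct values, one column at a time, keeping first-seen order.
    // With a Spark Dataset<Row> the loop body would roughly be:
    //   df.filter(col(name).isNotNull()).select(name).distinct().collectAsList()
    static Map<String, List<String>> distinctPerColumn(List<Map<String, String>> rows,
                                                       List<String> columns) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String name : columns) {
            Set<String> seen = new LinkedHashSet<>();
            for (Map<String, String> row : rows) {
                String v = row.get(name);
                if (v != null) {          // mirrors .isNotNull()
                    seen.add(v);
                }
            }
            result.put(name, new ArrayList<>(seen));
        }
        return result;
    }

    // Naive JSON rendering: {"col":["v1","v2"],...}. A real program would use a
    // JSON library (or Spark's own Dataset.toJSON) instead of string concatenation,
    // since this does not escape quotes inside values.
    static String toJson(Map<String, List<String>> distinct) {
        StringBuilder sb = new StringBuilder("{");
        boolean firstCol = true;
        for (Map.Entry<String, List<String>> e : distinct.entrySet()) {
            if (!firstCol) sb.append(",");
            firstCol = false;
            sb.append("\"").append(e.getKey()).append("\":[");
            boolean firstVal = true;
            for (String v : e.getValue()) {
                if (!firstVal) sb.append(",");
                firstVal = false;
                sb.append("\"").append(v).append("\"");
            }
            sb.append("]");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = List.of(
                Map.of("city", "Paris", "year", "2022"),
                Map.of("city", "Paris", "year", "2023"),
                Map.of("city", "Lyon",  "year", "2023"));
        Map<String, List<String>> d = distinctPerColumn(rows, List.of("city", "year"));
        System.out.println(toJson(d));
        // → {"city":["Paris","Lyon"],"year":["2022","2023"]}
    }
}
```

The point of doing it per column is that `df.select(c1, c2, ...).distinct()` gives distinct *tuples* across all the columns together, which is what the original code's caller would get wrong; one small `select(col).distinct()` job per column gives the per-column value sets the thread is after.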