Off the top of my head, create a DataFrame by reading the CSV file. This is Python:
listing_df = spark.read.format("com.databricks.spark.csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load(csv_file)
listing_df.printSchema()
listing_df.createOrReplaceTempView("temp")
## do your distinct columns using windowing functions on the temp table
## with SQL (see the sketch at the end of this message)

HTH

View my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

On Fri, 10 Feb 2023 at 21:59, sam smith <qustacksm2123...@gmail.com> wrote:

> I am not sure I understand "Just need to do the cols one at a time".
> Plus I think Apostolos is right, this needs a DataFrame approach, not a
> list approach.
>
> On Fri, 10 Feb 2023 at 22:47, Sean Owen <sro...@gmail.com> wrote:
>
>> For each column, select only that col and get distinct values. Similar
>> to what you do here. Just need to do the cols one at a time. Your
>> current code doesn't do what you want.
>>
>> On Fri, Feb 10, 2023, 3:46 PM sam smith <qustacksm2123...@gmail.com>
>> wrote:
>>
>>> Hi Sean,
>>>
>>> "You need to select the distinct values of each col one at a time",
>>> how?
>>>
>>> On Fri, 10 Feb 2023 at 22:40, Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> That gives you all distinct tuples of those col values. You need to
>>>> select the distinct values of each col one at a time. Sure, just
>>>> collect() the result as you do here.
>>>>
>>>> On Fri, Feb 10, 2023, 3:34 PM sam smith <qustacksm2123...@gmail.com>
>>>> wrote:
>>>>
>>>>> I want to get the distinct values of each column in a List (is it
>>>>> good practice to use a List here?) that contains the column name as
>>>>> its first element and the distinct values as the remaining elements,
>>>>> so that for a dataset we get a list of lists. I do it this way (in my
>>>>> opinion not so fast):
>>>>>
>>>>> List<List<String>> finalList = new ArrayList<List<String>>();
>>>>> Dataset<Row> df = spark.read().format("csv")
>>>>>     .option("header", "true").load("/pathToCSV");
>>>>> String[] columnNames = df.columns();
>>>>> for (int i = 0; i < columnNames.length; i++) {
>>>>>     List<String> columnList = new ArrayList<String>();
>>>>>     columnList.add(columnNames[i]);
>>>>>     List<Row> columnValues = df.filter(
>>>>>         org.apache.spark.sql.functions.col(columnNames[i]).isNotNull())
>>>>>         .select(columnNames[i]).distinct().collectAsList();
>>>>>     for (int j = 0; j < columnValues.size(); j++)
>>>>>         columnList.add(columnValues.get(j).apply(0).toString());
>>>>>     finalList.add(columnList);
>>>>> }
>>>>>
>>>>> How to improve this?
>>>>>
>>>>> Also, can I get the results in JSON format?
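
A minimal sketch of what Sean is describing, in the same Java API as the
question above: loop over the columns and run one distinct per column.
The toJSON() call on a Dataset also answers the JSON question. The path
/pathToCSV is carried over from the original snippet; the class and app
names are made up for illustration.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class DistinctPerColumn {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("distinct-per-column").getOrCreate();
        Dataset<Row> df = spark.read().format("csv")
                .option("header", "true").load("/pathToCSV");

        // One job per column, as suggested: select just that column,
        // drop nulls, take the distinct values, collect.
        List<List<String>> finalList = new ArrayList<>();
        for (String name : df.columns()) {
            List<String> columnList = new ArrayList<>();
            columnList.add(name);
            for (Row r : df.select(col(name))
                    .filter(col(name).isNotNull())
                    .distinct().collectAsList()) {
                columnList.add(r.get(0).toString());
            }
            finalList.add(columnList);
        }

        // For JSON output, each distinct Dataset can be serialized
        // directly with toJSON(); shown here for the first column only.
        String first = df.columns()[0];
        List<String> asJson = df.select(col(first))
                .filter(col(first).isNotNull())
                .distinct().toJSON().collectAsList();
        asJson.forEach(System.out::println);
    }
}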
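
And a minimal sketch of the window-function route suggested at the top of
this message, again in Java, against the "temp" view: ROW_NUMBER()
partitioned by a column keeps exactly one row per distinct value. The
column name "price" and the class name are hypothetical.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DistinctViaWindow {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("distinct-via-window").getOrCreate();

        // Same load as in the Python snippet above, registered as "temp".
        spark.read().format("csv")
                .option("inferSchema", "true").option("header", "true")
                .load("/pathToCSV")
                .createOrReplaceTempView("temp");

        // ROW_NUMBER() partitioned by the column assigns 1 to one row per
        // distinct value; keeping only rn = 1 deduplicates that column.
        Dataset<Row> distinctPrices = spark.sql(
              "SELECT price FROM ("
            + "  SELECT price, ROW_NUMBER() OVER"
            + "    (PARTITION BY price ORDER BY price) AS rn"
            + "  FROM temp WHERE price IS NOT NULL"
            + ") t WHERE rn = 1");
        distinctPrices.show();
    }
}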