Hi Apostolos,

Can you suggest a better approach that keeps the values within a dataframe?

On Fri, 10 Feb 2023 at 22:47, Apostolos N. Papadopoulos <papad...@csd.auth.gr> wrote:

> Dear Sam,
>
> you are assuming that the data fits in the memory of your local machine.
> You are using as a basis a dataframe, which potentially can be very large,
> and then you are storing the data in local lists. Keep in mind that
> the number of distinct elements in a column may be very large (depending on
> the app). I suggest working on a solution that assumes that the number of
> distinct values is also large. Thus, you should keep your data in
> dataframes or RDDs, and store them as csv files, parquet, etc.
>
> a.p.
>
>
> On 10/2/23 23:40, sam smith wrote:
>
> I want to get the distinct values of each column in a List (is it good
> practice to use List here?) that contains the column name as its first
> element and the column's distinct values as the remaining elements, so
> that for a dataset we get a list of lists. I do it this way (in my
> opinion not so fast):
>
> List<List<String>> finalList = new ArrayList<List<String>>();
> Dataset<Row> df = spark.read().format("csv").option("header", "true").load("/pathToCSV");
> String[] columnNames = df.columns();
> for (int i = 0; i < columnNames.length; i++) {
>     List<String> columnList = new ArrayList<String>();
>     columnList.add(columnNames[i]);
>     List<Row> columnValues = df
>         .filter(org.apache.spark.sql.functions.col(columnNames[i]).isNotNull())
>         .select(columnNames[i]).distinct().collectAsList();
>     for (int j = 0; j < columnValues.size(); j++)
>         columnList.add(columnValues.get(j).apply(0).toString());
>     finalList.add(columnList);
> }
>
> How to improve this?
>
> Also, can I get the results in JSON format?
>
> --
> Apostolos N. Papadopoulos, Associate Professor
> Department of Informatics
> Aristotle University of Thessaloniki
> Thessaloniki, GREECE
> tel: ++0030312310991918
> email: papad...@csd.auth.gr
> twitter: @papadopoulos_ap
> web: http://datalab.csd.auth.gr/~apostol
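
To make the advice quoted above concrete (keep the data in dataframes and store the results as files), here is a minimal sketch in Java. It is only one possible reading of that suggestion, not something taken from the thread: the class name `DistinctPerColumn`, the method `writeDistinctValues`, and the output location `/pathToOutput` are hypothetical, and it assumes that column names are simple identifiers (no dots or characters that are unsafe in a path). Each column's distinct non-null values stay in a `Dataset<Row>` and are written out as JSON, so nothing is collected into a local list on the driver.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class DistinctPerColumn {

    // Writes the distinct non-null values of every column of df as JSON,
    // one output directory per column under outputDir (hypothetical layout).
    public static void writeDistinctValues(Dataset<Row> df, String outputDir) {
        // The input is scanned once per column, so cache it to avoid re-reading the source.
        df.cache();
        for (String name : df.columns()) {
            Dataset<Row> distinctValues = df
                    .select(col(name))
                    .where(col(name).isNotNull())
                    .distinct();
            // The executors write the result files; no rows are collected on the driver.
            distinctValues.write().mode("overwrite").json(outputDir + "/" + name);
        }
        df.unpersist();
    }

    public static void main(String[] args) {
        // Assumes the master is supplied externally, e.g. by spark-submit.
        SparkSession spark = SparkSession.builder()
                .appName("DistinctPerColumn")
                .getOrCreate();
        Dataset<Row> df = spark.read().format("csv")
                .option("header", "true")
                .load("/pathToCSV");
        writeDistinctValues(df, "/pathToOutput"); // hypothetical output location
        spark.stop();
    }
}
```

Spark's JSON writer emits one JSON object per line, keyed by the column name, which also covers the question about JSON output. The trade-off is one Spark job per column; whether that is faster than the original loop depends on the data, but driver memory no longer limits the number of distinct values.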