Re: How to improve efficiency of this piece of code (returning distinct column values)

sam smith Fri, 10 Feb 2023 13:51:42 -0800

Hi Apostolos,
Can you suggest a better approach while keeping values within a dataframe?
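
For what it's worth, here is a minimal sketch of one direction I am considering, not a tested solution. It only builds the Spark SQL expressions (one collect_set per column) so that all columns could be aggregated in a single job instead of one job per column. The class name DistinctPerColumn, the sample column names and the output path are hypothetical; the Spark calls appear only in comments because they need a SparkSession and the loaded df.

```java
import java.util.Arrays;

class DistinctPerColumn {

    // Quote a column name for Spark SQL; a backtick inside the name is escaped by doubling it.
    static String quote(String columnName) {
        return "`" + columnName.replace("`", "``") + "`";
    }

    // Build one "collect_set(`c`) AS `c`" expression per column.
    // collect_set keeps each distinct non-null value once, so no isNotNull filter is needed.
    static String[] distinctExprs(String[] columnNames) {
        String[] exprs = new String[columnNames.length];
        for (int i = 0; i < columnNames.length; i++) {
            String q = quote(columnNames[i]);
            exprs[i] = "collect_set(" + q + ") AS " + q;
        }
        return exprs;
    }

    public static void main(String[] args) {
        // Hypothetical column names; in the real job these would come from df.columns().
        String[] exprs = distinctExprs(new String[] {"city", "country"});
        System.out.println(Arrays.toString(exprs));

        // With Spark on the classpath and df loaded as in my original snippet:
        //   Dataset<Row> distinct = df.selectExpr(exprs); // one job, one row, one array per column
        //   distinct.write().json("/pathToOutput");       // hypothetical output path
    }
}
```

The result would stay in a dataframe and could be written as JSON directly. The caveat is that collect_set still gathers each column's distinct values into a single array, so it assumes that set fits in one row; for very high-cardinality columns, writing df.select(c).distinct() per column to its own output is probably the safer route.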


On Fri, 10 Feb 2023 at 22:47, Apostolos N. Papadopoulos <
[email protected]> wrote:

> Dear Sam,
>
> you are assuming that the data fits in the memory of your local machine.
> You are using a dataframe as a basis, which can potentially be very large,
> and then you are storing the data in local lists. Keep in mind that the
> number of distinct elements in a column may be very large (depending on
> the app). I suggest working on a solution that assumes that the number of
> distinct values is also large. Thus, you should keep your data in
> dataframes or RDDs, and store them as CSV files, Parquet, etc.
>
> a.p.
>
>
> On 10/2/23 23:40, sam smith wrote:
>
> I want to get the distinct values of each column in a List (is it good
> practice to use List here?) that contains the column name as its first
> element and the column's distinct values as the remaining elements, so that
> for a dataset we get a list of lists. I do it this way (in my opinion not
> so fast):
>
> List<List<String>> finalList = new ArrayList<List<String>>();
> Dataset<Row> df = spark.read().format("csv").option("header", "true").load("/pathToCSV");
> String[] columnNames = df.columns();
> for (int i = 0; i < columnNames.length; i++) {
>     List<String> columnList = new ArrayList<String>();
>     // First element: the column name
>     columnList.add(columnNames[i]);
>
>     // One Spark job per column: drop nulls, deduplicate, collect to the driver
>     List<Row> columnValues = df
>         .filter(org.apache.spark.sql.functions.col(columnNames[i]).isNotNull())
>         .select(columnNames[i])
>         .distinct()
>         .collectAsList();
>     for (int j = 0; j < columnValues.size(); j++)
>         columnList.add(columnValues.get(j).apply(0).toString());
>
>     finalList.add(columnList);
> }
>
>
> How to improve this?
>
> Also, can I get the results in JSON format?
>
> --
> Apostolos N. Papadopoulos, Associate Professor
> Department of Informatics
> Aristotle University of Thessaloniki
> Thessaloniki, GREECE
> tel: ++0030312310991918
> email: [email protected]
> twitter: @papadopoulos_ap
> web: http://datalab.csd.auth.gr/~apostol
>
>
