Re: [PR] Pivoted CSV export fix [superset]

via GitHub Tue, 19 Nov 2024 03:55:29 -0800


frlm commented on PR #30961:
URL: https://github.com/apache/superset/pull/30961#issuecomment-2485508993


   ### UPDATE: Export Pivot Tables into CSV Format
   
   In the last commit, I made a small change to export pivot tables without 
flattening multi-index rows/columns. This feature, initially implemented by the 
community for managing JSON export, is particularly unhelpful when dealing with 
pivot tables exported into CSV format with a large number of columns 
corresponding to a multi-index.
   
   To better explain the issue, I have attached an example. If I create a pivot 
table with many columns associated with multi-index rows, the flattening 
process transforms them into a single column with a field value obtained by 
concatenating all fields separated by a space. This approach is not very 
effective if we want to use these exports in Excel or other tools.
   
   ##### Example:
   
   I created this pivot table using your example dataset:
   
   
![pivot_table](https://github.com/user-attachments/assets/9f71798c-36d2-4989-b8e4-1f9cc5f790b7)
   
   The AS-IS behavior generates this type of file, where the MultiIndex columns 
get collapsed into one row and MultiIndex Rows get merged into a single column, 
obtained by concatenating all fields separated by a space. This export is not 
very easy to use in Excel for the next steps.
   
   
![image](https://github.com/user-attachments/assets/c142d4b3-2a48-4fbf-afe9-15b48121f9a5)
   
   
[export_as_is_20241119_101212.csv](https://github.com/user-attachments/files/17814564/export_as_is_20241119_101212.csv)
   
   The TO-BE behavior generates this type of file, where the MultiIndex 
rows/columns get preserved:
   
   
![image](https://github.com/user-attachments/assets/7796e2a6-031a-415a-abf5-ed460244bbfb)
   
   
[export_to_be_20241119_100853.csv](https://github.com/user-attachments/files/17814637/export_to_be_20241119_100853.csv)
   
   #### Code Change
   
   ~~~python
   if query["result_format"] == ChartDataResultFormat.JSON:
       # Flatten hierarchical columns/index since they are represented as
       # `Tuple[str]`. Otherwise encoding to JSON later will fail because
       # maps cannot have tuples as their keys in JSON.
       processed_df.columns = [
           " ".join(str(name) for name in column).strip()
           if isinstance(column, tuple)
           else column
           for column in processed_df.columns
       ]
       processed_df.index = [
           " ".join(str(name) for name in index).strip()
           if isinstance(index, tuple)
           else index
           for index in processed_df.index
       ]
       query["data"] = processed_df.to_dict()
       
   elif query["result_format"] == ChartDataResultFormat.CSV:
       buf = StringIO()
       processed_df.to_csv(buf)
       buf.seek(0)
       query["data"] = buf.getvalue()
   ~~~
   
   #### Code Complete
   
   ~~~python
   def apply_post_process(
       result: dict[Any, Any],
       form_data: Optional[dict[str, Any]] = None,
       datasource: Optional[Union["BaseDatasource", "Query"]] = None,
   ) -> dict[Any, Any]:
       form_data = form_data or {}
   
       viz_type = form_data.get("viz_type")
       if viz_type not in post_processors:
           return result
   
       post_processor = post_processors[viz_type]
   
       for query in result["queries"]:
           if query["result_format"] not in (rf.value for rf in 
ChartDataResultFormat):
               raise Exception(  # pylint: disable=broad-exception-raised
                   f"Result format {query['result_format']} not supported"
               )
   
           data = query["data"]
   
           if isinstance(data, str):
               data = data.strip()
   
           if not data:
               # do not try to process empty data
               continue
   
           if query["result_format"] == ChartDataResultFormat.JSON:
               df = pd.DataFrame.from_dict(data)
           elif query["result_format"] == ChartDataResultFormat.CSV:
               df = pd.read_csv(StringIO(data), 
                                delimiter=superset_config.CSV_EXPORT.get('sep'),
                                
encoding=superset_config.CSV_EXPORT.get('encoding'),
                                
decimal=superset_config.CSV_EXPORT.get('decimal'))
           
           # convert all columns to verbose (label) name
           if datasource:
               df.rename(columns=datasource.data["verbose_map"], inplace=True)
   
           processed_df = post_processor(df, form_data, datasource)
   
           query["colnames"] = list(processed_df.columns)
           query["indexnames"] = list(processed_df.index)
           query["coltypes"] = extract_dataframe_dtypes(processed_df, 
datasource)
           query["rowcount"] = len(processed_df.index)
                   
           if query["result_format"] == ChartDataResultFormat.JSON:
               # Flatten hierarchical columns/index since they are represented 
as
               # `Tuple[str]`. Otherwise encoding to JSON later will fail 
because
               # maps cannot have tuples as their keys in JSON.
               processed_df.columns = [
                   " ".join(str(name) for name in column).strip()
                   if isinstance(column, tuple)
                   else column
                   for column in processed_df.columns
               ]
               processed_df.index = [
                   " ".join(str(name) for name in index).strip()
                   if isinstance(index, tuple)
                   else index
                   for index in processed_df.index
               ]
               query["data"] = processed_df.to_dict()
               
           elif query["result_format"] == ChartDataResultFormat.CSV:
               buf = StringIO()
               processed_df.to_csv(buf)
               buf.seek(0)
               query["data"] = buf.getvalue()
   
       return result
   ~~~
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Pivoted CSV export fix [superset]

Reply via email to