korbit-ai[bot] commented on code in PR #34603:
URL: https://github.com/apache/superset/pull/34603#discussion_r2261186635
##########
superset/commands/database/uploaders/csv_reader.py:
##########
@@ -61,23 +65,102 @@
)
@staticmethod
- def _read_csv(file: FileStorage, kwargs: dict[str, Any]) -> pd.DataFrame:
+ def _detect_encoding(file: FileStorage) -> str:
+ """Detect file encoding with fallback options"""
+ sample = file.read(10000)
+ file.seek(0)
+
+ for encoding in ENCODING_FALLBACKS:
+ try:
+ sample.decode(encoding)
+ return encoding
+ except UnicodeDecodeError:
+ continue
+ return DEFAULT_ENCODING
+
+ @staticmethod
+ def _read_csv( # noqa: C901
+ file: FileStorage,
+ kwargs: dict[str, Any],
+ ) -> pd.DataFrame:
+ encoding = kwargs.get("encoding", DEFAULT_ENCODING)
+
+ # PyArrow engine doesn't support iterator/chunksize/nrows
+ # It also has known issues with date parsing and missing values
+ # Default to "c" engine for stability
+ has_unsupported_options = (
+ "chunksize" in kwargs
+ or "iterator" in kwargs
+ or ("nrows" in kwargs and kwargs.get("nrows") is not None)
+ or kwargs.get("parse_dates") # Has bugs with multiple date columns
+ or kwargs.get("na_values") # Has bugs with missing value handling
+ )
+
+        # Use PyArrow engine if feature flag is enabled and options are compatible
+ if (
+ is_feature_enabled("CSV_UPLOAD_PYARROW_ENGINE")
+ and not has_unsupported_options
+ ):
+ # Check if pandas was built with pyarrow
+ if "pyarrow" not in str(pd.__version__):
Review Comment:
### Unreliable pyarrow detection
<details>
<summary>Tell me more</summary>
###### What is the issue?
Checking the pandas version string for 'pyarrow' is an unreliable way to
determine whether pandas was built with pyarrow support.
###### Why this matters
This could lead to incorrect engine selection and potential runtime errors
when attempting to use pyarrow.
###### Suggested change ∙ *Feature Preview*
Use a more reliable method to check pyarrow availability:
```python
def is_pyarrow_available():
    try:
        import pyarrow  # noqa: F401
        return True
    except ImportError:
        return False

kwargs["engine"] = "pyarrow" if is_pyarrow_available() else "c"
```
</details>
<sub>
💬 Looking for more details? Reply to this comment to chat with Korbit.
</sub>
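A minimal, self-contained version of the suggested check that uses `importlib.util.find_spec`, so the package is detected without actually importing it; `is_pyarrow_available` is an illustrative helper name, not part of the Superset codebase:

```python
from importlib import util


def is_pyarrow_available() -> bool:
    """Report whether the pyarrow package is importable, without importing it."""
    return util.find_spec("pyarrow") is not None


# Select the pandas read_csv engine from actual availability,
# not from the pandas version string.
engine = "pyarrow" if is_pyarrow_available() else "c"
```

`find_spec` only consults the import machinery, so this check is cheap and has no import side effects.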
##########
superset/commands/database/uploaders/csv_reader.py:
##########
@@ -61,23 +65,102 @@ def __init__(
)
@staticmethod
- def _read_csv(file: FileStorage, kwargs: dict[str, Any]) -> pd.DataFrame:
+ def _detect_encoding(file: FileStorage) -> str:
+ """Detect file encoding with fallback options"""
+ sample = file.read(10000)
+ file.seek(0)
Review Comment:
### Suboptimal encoding detection strategy
<details>
<summary>Tell me more</summary>
###### What is the issue?
Reading a fixed-size sample for encoding detection may be insufficient for
files with mixed encodings or with encoding issues that only appear later in
the file.
###### Why this matters
This could lead to performance degradation if encoding issues are discovered
only after a significant portion of the file has been processed, forcing a
restart of the entire process.
###### Suggested change ∙ *Feature Preview*
Implement progressive sampling that increases sample size if encoding
detection fails, or use a chardet-like library that's more reliable at encoding
detection:
```python
def _detect_encoding(file: FileStorage, initial_sample_size: int = 10000) -> str:
    for sample_size in [initial_sample_size, 50000, 100000]:
        sample = file.read(sample_size)
        file.seek(0)
        # Try each candidate encoding against this progressively larger sample
        for encoding in ENCODING_FALLBACKS:
            try:
                sample.decode(encoding)
                return encoding
            except UnicodeDecodeError:
                continue
    return DEFAULT_ENCODING
```
</details>
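The progressive-sampling idea can be sketched as runnable code; the fallback list and default below are placeholders standing in for the module's `ENCODING_FALLBACKS` and `DEFAULT_ENCODING` constants:

```python
import io

# Placeholder values; the real module defines these constants.
ENCODING_FALLBACKS = ["utf-8", "utf-16", "latin-1"]
DEFAULT_ENCODING = "utf-8"


def detect_encoding(stream: io.BytesIO, sample_sizes=(10_000, 50_000, 100_000)) -> str:
    """Try each candidate encoding, widening the sample if all of them fail."""
    for size in sample_sizes:
        sample = stream.read(size)
        stream.seek(0)
        for encoding in ENCODING_FALLBACKS:
            try:
                sample.decode(encoding)
                return encoding
            except UnicodeDecodeError:
                continue
    return DEFAULT_ENCODING
```

Note that with latin-1 in the fallback list every byte sequence decodes, so the larger sample sizes only come into play with stricter candidate encodings.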
##########
superset/commands/database/uploaders/csv_reader.py:
##########
@@ -61,23 +65,102 @@
)
@staticmethod
- def _read_csv(file: FileStorage, kwargs: dict[str, Any]) -> pd.DataFrame:
+ def _detect_encoding(file: FileStorage) -> str:
+ """Detect file encoding with fallback options"""
+ sample = file.read(10000)
+ file.seek(0)
+
+ for encoding in ENCODING_FALLBACKS:
+ try:
+ sample.decode(encoding)
+ return encoding
+ except UnicodeDecodeError:
+ continue
+ return DEFAULT_ENCODING
+
+ @staticmethod
+ def _read_csv( # noqa: C901
+ file: FileStorage,
+ kwargs: dict[str, Any],
+ ) -> pd.DataFrame:
+ encoding = kwargs.get("encoding", DEFAULT_ENCODING)
+
+ # PyArrow engine doesn't support iterator/chunksize/nrows
+ # It also has known issues with date parsing and missing values
+ # Default to "c" engine for stability
+ has_unsupported_options = (
+ "chunksize" in kwargs
+ or "iterator" in kwargs
+ or ("nrows" in kwargs and kwargs.get("nrows") is not None)
+ or kwargs.get("parse_dates") # Has bugs with multiple date columns
+ or kwargs.get("na_values") # Has bugs with missing value handling
+ )
+
+        # Use PyArrow engine if feature flag is enabled and options are compatible
+ if (
+ is_feature_enabled("CSV_UPLOAD_PYARROW_ENGINE")
+ and not has_unsupported_options
+ ):
Review Comment:
### Hardcoded engine selection strategy
<details>
<summary>Tell me more</summary>
###### What is the issue?
The engine selection logic is directly embedded in the reader class, making
it difficult to extend or modify the engine selection strategy.
###### Why this matters
This violates the Open/Closed Principle by making it hard to add new engines
or modify engine selection logic without changing the existing code.
###### Suggested change ∙ *Feature Preview*
Extract engine selection into a strategy pattern:
```python
class CSVEngineStrategy:
    def select_engine(self, options: dict[str, Any]) -> str:
        ...  # Engine selection logic


class PyArrowEngineStrategy(CSVEngineStrategy):
    def select_engine(self, options: dict[str, Any]) -> str:
        ...  # PyArrow-specific logic


class CSVReader:
    def __init__(self, engine_strategy: CSVEngineStrategy):
        self.engine_strategy = engine_strategy
```
</details>
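A runnable sketch of the suggested strategy pattern; the option names mirror the ones the diff treats as unsupported, and all class names here are illustrative rather than existing Superset classes:

```python
from typing import Any


class CSVEngineStrategy:
    """Base strategy: always use the stable "c" engine."""

    def select_engine(self, options: dict[str, Any]) -> str:
        return "c"


class PyArrowEngineStrategy(CSVEngineStrategy):
    """Prefer pyarrow unless the caller passed options it cannot handle."""

    UNSUPPORTED = ("chunksize", "iterator", "nrows", "parse_dates", "na_values")

    def select_engine(self, options: dict[str, Any]) -> str:
        # Fall back to "c" when any pyarrow-incompatible option is set.
        if any(options.get(key) for key in self.UNSUPPORTED):
            return "c"
        return "pyarrow"
```

New engines can then be added by subclassing the base strategy instead of editing the reader's branching logic.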
##########
superset/commands/database/uploaders/csv_reader.py:
##########
@@ -61,23 +65,102 @@
)
@staticmethod
- def _read_csv(file: FileStorage, kwargs: dict[str, Any]) -> pd.DataFrame:
+ def _detect_encoding(file: FileStorage) -> str:
+ """Detect file encoding with fallback options"""
+ sample = file.read(10000)
+ file.seek(0)
+
+ for encoding in ENCODING_FALLBACKS:
+ try:
+ sample.decode(encoding)
+ return encoding
+ except UnicodeDecodeError:
+ continue
+ return DEFAULT_ENCODING
+
+ @staticmethod
+ def _read_csv( # noqa: C901
+ file: FileStorage,
+ kwargs: dict[str, Any],
+ ) -> pd.DataFrame:
+ encoding = kwargs.get("encoding", DEFAULT_ENCODING)
+
+ # PyArrow engine doesn't support iterator/chunksize/nrows
+ # It also has known issues with date parsing and missing values
+ # Default to "c" engine for stability
+ has_unsupported_options = (
+ "chunksize" in kwargs
+ or "iterator" in kwargs
+ or ("nrows" in kwargs and kwargs.get("nrows") is not None)
+ or kwargs.get("parse_dates") # Has bugs with multiple date columns
+ or kwargs.get("na_values") # Has bugs with missing value handling
+ )
+
+        # Use PyArrow engine if feature flag is enabled and options are compatible
+ if (
+ is_feature_enabled("CSV_UPLOAD_PYARROW_ENGINE")
+ and not has_unsupported_options
+ ):
+ # Check if pandas was built with pyarrow
+ if "pyarrow" not in str(pd.__version__):
+ # Try to use pyarrow engine if available
+                kwargs["engine"] = "pyarrow" if util.find_spec("pyarrow") else "c"
+ else:
+ # Pandas has built-in pyarrow, use c engine to avoid conflicts
+ kwargs["engine"] = "c"
+ else:
+ # Default to c engine for reliability
+ kwargs["engine"] = "c"
+
+ kwargs["low_memory"] = False
+
try:
if "chunksize" in kwargs:
- return pd.concat(
- pd.read_csv(
- filepath_or_buffer=file.stream,
- **kwargs,
- )
+ chunks = []
+ chunk_iterator = pd.read_csv(
+ filepath_or_buffer=file.stream,
+ **kwargs,
)
+
+ for chunk in chunk_iterator:
+ chunks.append(chunk)
+ if "nrows" in kwargs and kwargs["nrows"] is not None:
+ if len(pd.concat(chunks)) >= kwargs["nrows"]:
+ break
Review Comment:
### Memory inefficient chunk size validation
<details>
<summary>Tell me more</summary>
###### What is the issue?
Inefficient memory usage when checking chunk size limits. The code
concatenates all chunks on every iteration just to check the total length.
###### Why this matters
This can lead to significant memory overhead and performance degradation
when processing large files, defeating the purpose of chunked reading.
###### Suggested change ∙ *Feature Preview*
Track the total rows processed without concatenating:
```python
total_rows = 0
for chunk in chunk_iterator:
chunks.append(chunk)
total_rows += len(chunk)
if "nrows" in kwargs and kwargs["nrows"] is not None:
if total_rows >= kwargs["nrows"]:
break
```
</details>
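The row-count suggestion can be folded into a small helper; `read_limited` is a hypothetical name, and this sketch assumes plain `pd.read_csv` chunking rather than Superset's wrapper:

```python
import io

import pandas as pd


def read_limited(stream, nrows=None, chunksize=1000):
    """Accumulate chunks while tracking a running row count,
    instead of concatenating everything on each iteration."""
    chunks = []
    total_rows = 0
    for chunk in pd.read_csv(stream, chunksize=chunksize):
        chunks.append(chunk)
        total_rows += len(chunk)
        if nrows is not None and total_rows >= nrows:
            break
    # One concatenation at the end; trim any overshoot from the last chunk.
    df = pd.concat(chunks, ignore_index=True)
    return df.head(nrows) if nrows is not None else df
```

The single `pd.concat` at the end keeps memory use proportional to the rows actually read, rather than re-concatenating on every chunk.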
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
For additional commands, e-mail: [email protected]