korbit-ai[bot] commented on code in PR #34603:
URL: https://github.com/apache/superset/pull/34603#discussion_r2261186635
##########
superset/commands/database/uploaders/csv_reader.py:
##########
@@ -61,23 +65,102 @@
)
@staticmethod
- def _read_csv(file: FileStorage, kwargs: dict[str, Any]) -> pd.DataFrame:
+ def _detect_encoding(file: FileStorage) -> str:
+ """Detect file encoding with fallback options"""
+ sample = file.read(10000)
+ file.seek(0)
+
+ for encoding in ENCODING_FALLBACKS:
+ try:
+ sample.decode(encoding)
+ return encoding
+ except UnicodeDecodeError:
+ continue
+ return DEFAULT_ENCODING
+
+ @staticmethod
+ def _read_csv( # noqa: C901
+ file: FileStorage,
+ kwargs: dict[str, Any],
+ ) -> pd.DataFrame:
+ encoding = kwargs.get("encoding", DEFAULT_ENCODING)
+
+ # PyArrow engine doesn't support iterator/chunksize/nrows
+ # It also has known issues with date parsing and missing values
+ # Default to "c" engine for stability
+ has_unsupported_options = (
+ "chunksize" in kwargs
+ or "iterator" in kwargs
+ or ("nrows" in kwargs and kwargs.get("nrows") is not None)
+ or kwargs.get("parse_dates") # Has bugs with multiple date columns
+ or kwargs.get("na_values") # Has bugs with missing value handling
+ )
+
+        # Use PyArrow engine if feature flag is enabled and options are compatible
+ if (
+ is_feature_enabled("CSV_UPLOAD_PYARROW_ENGINE")
+ and not has_unsupported_options
+ ):
+ # Check if pandas was built with pyarrow
+ if "pyarrow" not in str(pd.__version__):
Review Comment:
### Unreliable pyarrow detection
<details>
<summary>Tell me more</summary>
###### What is the issue?
Checking the pandas version string for 'pyarrow' is an unreliable way to
determine whether pandas was built with pyarrow support.
###### Why this matters
This could lead to incorrect engine selection and potential runtime errors
when attempting to use pyarrow.
###### Suggested change ∙ *Feature Preview*
Use a more reliable method to check pyarrow availability:
```python
def is_pyarrow_available():
    try:
        import pyarrow  # noqa: F401
        return True
    except ImportError:
        return False

kwargs["engine"] = "pyarrow" if is_pyarrow_available() else "c"
```
</details>
<sub>
💬 Looking for more details? Reply to this comment to chat with Korbit.
</sub>
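A minimal, self-contained version of the suggested check that uses `importlib.util.find_spec`, so the package is detected without actually importing it; `is_pyarrow_available` is an illustrative helper name, not part of the Superset codebase:

```python
from importlib import util


def is_pyarrow_available() -> bool:
    """Report whether the pyarrow package is importable, without importing it."""
    return util.find_spec("pyarrow") is not None


# Select the pandas read_csv engine from actual availability,
# not from the pandas version string.
engine = "pyarrow" if is_pyarrow_available() else "c"
```

`find_spec` only consults the import machinery, so this check is cheap and has no import side effects.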
##########
superset/commands/database/uploaders/csv_reader.py:
##########
@@ -61,23 +65,102 @@ def __init__(
)
@staticmethod
- def _read_csv(file: FileStorage, kwargs: dict[str, Any]) -> pd.DataFrame:
+ def _detect_encoding(file: FileStorage) -> str:
+ """Detect file encoding with fallback options"""
+ sample = file.read(10000)
+ file.seek(0)
Review Comment:
### Suboptimal encoding detection strategy
<details>
<summary>Tell me more</summary>
###### What is the issue?
Reading a fixed-size sample for encoding detection may be insufficient for
files with mixed encodings or with encoding issues that only appear later in
the file.
###### Why this matters
This could lead to performance degradation if encoding issues are discovered
only after a significant portion of the file has been processed, forcing a
restart of the entire process.
###### Suggested change ∙ *Feature Preview*
Implement progressive sampling that increases sample size if encoding
detection fails, or use a chardet-like library that's more reliable at encoding
detection:
```python
def _detect_encoding(file: FileStorage, initial_sample_size: int = 10000) -> str:
    for sample_size in [initial_sample_size, 50000, 100000]:
        sample = file.read(sample_size)
        file.seek(0)
        # Try each candidate encoding against this progressively larger sample
        for encoding in ENCODING_FALLBACKS:
            try:
                sample.decode(encoding)
                return encoding
            except UnicodeDecodeError:
                continue
    return DEFAULT_ENCODING
```
</details>
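The progressive-sampling idea can be sketched as runnable code; the fallback list and default below are placeholders standing in for the module's `ENCODING_FALLBACKS` and `DEFAULT_ENCODING` constants:

```python
import io

# Placeholder values; the real module defines these constants.
ENCODING_FALLBACKS = ["utf-8", "utf-16", "latin-1"]
DEFAULT_ENCODING = "utf-8"


def detect_encoding(stream: io.BytesIO, sample_sizes=(10_000, 50_000, 100_000)) -> str:
    """Try each candidate encoding, widening the sample if all of them fail."""
    for size in sample_sizes:
        sample = stream.read(size)
        stream.seek(0)
        for encoding in ENCODING_FALLBACKS:
            try:
                sample.decode(encoding)
                return encoding
            except UnicodeDecodeError:
                continue
    return DEFAULT_ENCODING
```

Note that with latin-1 in the fallback list every byte sequence decodes, so the larger sample sizes only come into play with stricter candidate encodings.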
##########
superset/commands/database/uploaders/csv_reader.py:
##########
@@ -61,23 +65,102 @@
)
@staticmethod
- def _read_csv(file: FileStorage, kwargs: dict[str, Any]) -> pd.DataFrame:
+ def _detect_encoding(file: FileStorage) -> str:
+ """Detect file encoding with fallback options"""
+ sample = file.read(10000)
+ file.seek(0)
+
+ for encoding in ENCODING_FALLBACKS:
+ try:
+ sample.decode(encoding)
+ return encoding
+ except UnicodeDecodeError:
+ continue
+ return DEFAULT_ENCODING
+
+ @staticmethod
+ def _read_csv( # noqa: C901
+ file: FileStorage,
+ kwargs: dict[str, Any],
+ ) -> pd.DataFrame:
+ encoding = kwargs.get("encoding", DEFAULT_ENCODING)
+
+ # PyArrow engine doesn't support iterator/chunksize/nrows
+ # It also has known issues with date parsing and missing values
+ # Default to "c" engine for stability
+ has_unsupported_options = (
+ "chunksize" in kwargs
+ or "iterator" in kwargs
+ or ("nrows" in kwargs and kwargs.get("nrows") is not None)
+ or kwargs.get("parse_dates") # Has bugs with multiple date columns
+ or kwargs.get("na_values") # Has bugs with missing value handling
+ )
+
+        # Use PyArrow engine if feature flag is enabled and options are compatible
+ if (
+ is_feature_enabled("CSV_UPLOAD_PYARROW_ENGINE")
+ and not has_unsupported_options
+ ):
Review Comment:
### Hardcoded engine selection strategy
<details>
<summary>Tell me more</summary>
###### What is the issue?
The engine selection logic is directly embedded in the reader class, making
it difficult to extend or modify the engine selection strategy.
###### Why this matters
This violates the Open/Closed Principle by making it hard to add new engines
or modify engine selection logic without changing the existing code.
###### Suggested change ∙ *Feature Preview*
Extract engine selection into a strategy pattern:
```python
class CSVEngineStrategy:
    def select_engine(self, options: dict[str, Any]) -> str:
        ...  # Engine selection logic


class PyArrowEngineStrategy(CSVEngineStrategy):
    def select_engine(self, options: dict[str, Any]) -> str:
        ...  # PyArrow-specific logic


class CSVReader:
    def __init__(self, engine_strategy: CSVEngineStrategy):
        self.engine_strategy = engine_strategy
```
</details>
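A runnable sketch of the suggested strategy pattern; the option names mirror the ones the diff treats as unsupported, and all class names here are illustrative rather than existing Superset classes:

```python
from typing import Any


class CSVEngineStrategy:
    """Base strategy: always use the stable "c" engine."""

    def select_engine(self, options: dict[str, Any]) -> str:
        return "c"


class PyArrowEngineStrategy(CSVEngineStrategy):
    """Prefer pyarrow unless the caller passed options it cannot handle."""

    UNSUPPORTED = ("chunksize", "iterator", "nrows", "parse_dates", "na_values")

    def select_engine(self, options: dict[str, Any]) -> str:
        # Fall back to "c" when any pyarrow-incompatible option is set.
        if any(options.get(key) for key in self.UNSUPPORTED):
            return "c"
        return "pyarrow"
```

New engines can then be added by subclassing the base strategy instead of editing the reader's branching logic.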
##########
superset/commands/database/uploaders/csv_reader.py:
##########
@@ -61,23 +65,102 @@
)
@staticmethod
- def _read_csv(file: FileStorage, kwargs: dict[str, Any]) -> pd.DataFrame:
+ def _detect_encoding(file: FileStorage) -> str:
+ """Detect file encoding with fallback options"""
+ sample = file.read(10000)
+ file.seek(0)
+
+ for encoding in ENCODING_FALLBACKS:
+ try:
+ sample.decode(encoding)
+ return encoding
+ except UnicodeDecodeError:
+ continue
+ return DEFAULT_ENCODING
+
+ @staticmethod
+ def _read_csv( # noqa: C901
+ file: FileStorage,
+ kwargs: dict[str, Any],
+ ) -> pd.DataFrame:
+ encoding = kwargs.get("encoding", DEFAULT_ENCODING)
+
+ # PyArrow engine doesn't support iterator/chunksize/nrows
+ # It also has known issues with date parsing and missing values
+ # Default to "c" engine for stability
+ has_unsupported_options = (
+ "chunksize" in kwargs
+ or "iterator" in kwargs
+ or ("nrows" in kwargs and kwargs.get("nrows") is not None)
+ or kwargs.get("parse_dates") # Has bugs with multiple date columns
+ or kwargs.get("na_values") # Has bugs with missing value handling
+ )
+
+        # Use PyArrow engine if feature flag is enabled and options are compatible
+ if (
+ is_feature_enabled("CSV_UPLOAD_PYARROW_ENGINE")
+ and not has_unsupported_options
+ ):
+ # Check if pandas was built with pyarrow
+ if "pyarrow" not in str(pd.__version__):
+ # Try to use pyarrow engine if available
+                kwargs["engine"] = "pyarrow" if util.find_spec("pyarrow") else "c"
+ else:
+ # Pandas has built-in pyarrow, use c engine to avoid conflicts
+ kwargs["engine"] = "c"
+ else:
+ # Default to c engine for reliability
+ kwargs["engine"] = "c"
+
+ kwargs["low_memory"] = False
+
try:
if "chunksize" in kwargs:
- return pd.concat(
- pd.read_csv(
- filepath_or_buffer=file.stream,
- **kwargs,
- )
+ chunks = []
+ chunk_iterator = pd.read_csv(
+ filepath_or_buffer=file.stream,
+ **kwargs,
)
+
+ for chunk in chunk_iterator:
+ chunks.append(chunk)
+ if "nrows" in kwargs and kwargs["nrows"] is not None:
+ if len(pd.concat(chunks)) >= kwargs["nrows"]:
+ break
Review Comment:
### Memory inefficient chunk size validation
<details>
<summary>Tell me more</summary>
###### What is the issue?
Inefficient memory usage when checking chunk size limits. The code
concatenates all chunks on every iteration just to check the total length.
###### Why this matters
This can lead to significant memory overhead and performance degradation
when processing large files, defeating the purpose of chunked reading.
###### Suggested change ∙ *Feature Preview*
Track the total rows processed without concatenating:
```python
total_rows = 0
for chunk in chunk_iterator:
chunks.append(chunk)
total_rows += len(chunk)
if "nrows" in kwargs and kwargs["nrows"] is not None:
if total_rows >= kwargs["nrows"]:
break
```
</details>
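The row-count suggestion can be folded into a small helper; `read_limited` is a hypothetical name, and this sketch assumes plain `pd.read_csv` chunking rather than Superset's wrapper:

```python
import io

import pandas as pd


def read_limited(stream, nrows=None, chunksize=1000):
    """Accumulate chunks while tracking a running row count,
    instead of concatenating everything on each iteration."""
    chunks = []
    total_rows = 0
    for chunk in pd.read_csv(stream, chunksize=chunksize):
        chunks.append(chunk)
        total_rows += len(chunk)
        if nrows is not None and total_rows >= nrows:
            break
    # One concatenation at the end; trim any overshoot from the last chunk.
    df = pd.concat(chunks, ignore_index=True)
    return df.head(nrows) if nrows is not None else df
```

The single `pd.concat` at the end keeps memory use proportional to the rows actually read, rather than re-concatenating on every chunk.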
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
For additional commands, e-mail: [email protected]