codeant-ai-for-open-source[bot] commented on code in PR #37740:
URL: https://github.com/apache/superset/pull/37740#discussion_r2772765498
##########
superset/utils/pandas_postprocessing/histogram.py:
##########
@@ -49,7 +49,7 @@ def histogram(
groupby = []
# drop empty values from the target column
- df = df.dropna(subset=[column])
+ df = df.dropna(subset=[column]).copy()
Review Comment:
**Suggestion:** Dropping NaN rows from the entire DataFrame before grouping
removes groups that only contain null values in the target column, so those
groups will disappear from the histogram output instead of being represented
with zero counts, which breaks the documented "each row corresponds to a group"
contract. [logic error]
<details>
<summary><b>Severity Level:</b> Major ⚠️</summary>
```mdx
- ❌ superset/utils/pandas_postprocessing/histogram.histogram() omits groups.
- ⚠️ Grouped histogram outputs have missing group rows.
- ⚠️ Downstream consumers receive fewer rows than expected.
```
</details>
```suggestion
# NaN values in the target column are dropped later in hist_values(), so
keep all rows here
```
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Create a DataFrame file and call the function `histogram()` defined in
`superset/utils/pandas_postprocessing/histogram.py` (function signature at
lines 24-31).
2. Prepare input DataFrame where one group has only nulls, e.g.:
- DataFrame with columns `group` and `value` where group "B" rows exist
but `value` is
NaN for all "B" rows.
3. Call histogram(df, column="value", groupby=["group"]) -> execution
reaches the line
that pre-filters nulls at line 52: `df = df.dropna(subset=[column]).copy()`.
4. Rows for group "B" are removed by that dropna at line 52, so later
grouping (code paths
at lines 81-86: the df.groupby(...) branch) never sees group "B" and no
output row for "B"
is produced. The helper `hist_values` (lines 71-74) which drops NaNs inside
a group's
Series is never given a chance to generate a zero-count vector for an absent
group.
Explanation: The current code intentionally removes rows with NaN at line
52; that causes
groups whose only rows had NaN in the target column to disappear instead of
appearing with
zero-count bins.
```
</details>
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/utils/pandas_postprocessing/histogram.py
**Line:** 52:52
**Comment:**
*Logic Error: Dropping NaN rows from the entire DataFrame before
grouping removes groups that only contain null values in the target column, so
those groups will disappear from the histogram output instead of being
represented with zero counts, which breaks the documented "each row corresponds
to a group" contract.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]