GitHub user yjhjstz created a discussion: Using locale C / COLLATE "C" to 
unlock ORCA performance for TPC-DS benchmarks

## Motivation

When running TPC-DS benchmarks on Cloudberry, we noticed that queries on string 
columns (item descriptions, store names, customer names, etc.) often cause 
**ORCA to fall back to the Postgres planner**. The root cause is that ORCA 
currently does not support columns with `COLLATE "C"`, as tracked in [issue 
#717](https://github.com/apache/cloudberry/issues/717).

Beyond fixing the fallback, there is a broader performance opportunity: string 
comparisons under locale C are significantly faster than under locale-aware 
collations, because C collation compares raw bytes instead of applying locale 
rules.

## Performance Evidence

A detailed benchmark published by depesz, [How much speed you're leaving at 
the table if you use default 
locale](https://www.depesz.com/2024/06/11/how-much-speed-youre-leaving-at-the-table-if-you-use-default-locale/), 
shows:

| Operation | Speedup with locale C vs default locale |
|-----------|------------------------------------------|
| Equality checks on unindexed data | ~50% faster |
| Range queries | up to **107% faster** |
| Sequential scan comparisons | >20% faster |

Key takeaways:
- `libc/C` collation is the fastest across nearly every benchmark
- Even the newer ICU and builtin providers lag significantly behind C
- The overhead of locale-aware collation accumulates heavily in analytical 
workloads like TPC-DS
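
To check these numbers on your own cluster, a minimal sketch along the following lines may help. The table `collate_bench` and the constant `'abc'` are made up for illustration and are not part of the depesz benchmark. The planner is pinned with `SET optimizer = off` so that both queries go through the same planner and only the collation differs.

```sql
-- Hypothetical scratch table: one million pseudo-random strings.
CREATE TABLE collate_bench AS
SELECT md5(i::text) AS v
FROM generate_series(1, 1000000) AS i;

-- Use the Postgres planner for both runs so the ORCA fallback
-- does not skew the comparison.
SET optimizer = off;

-- Range filter using the column's default collation.
EXPLAIN ANALYZE
SELECT count(*) FROM collate_bench WHERE v < 'abc';

-- Same filter, but the comparison is forced to use C collation.
EXPLAIN ANALYZE
SELECT count(*) FROM collate_bench WHERE v < 'abc' COLLATE "C";
```

If the database default is already C, both runs should take about the same time, so the comparison is only meaningful on a cluster initialized with a locale-aware default.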

## Current Behavior in Cloudberry / ORCA

```sql
-- Table with default collation: ORCA handles it
EXPLAIN SELECT * FROM tbl ORDER BY v;
-- Optimizer: Pivotal Optimizer (GPORCA)

-- Table with COLLATE "C": ORCA falls back
EXPLAIN SELECT * FROM tbl_collate_c ORDER BY v;
-- Optimizer: Postgres query optimizer (fallback!)
```

This means that queries on TPC-DS string columns defined with `COLLATE "C"` 
are planned by the Postgres planner, so we lose ORCA's superior join ordering, 
parallel aggregation plans, and better sort/merge strategies.
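
The example above does not show how the two tables are defined. A minimal reproduction might look like the sketch below; the single-column definitions are assumed, not taken from the original report. The `optimizer_trace_fallback` setting is there so the server reports when ORCA gives up, and the exact message text may differ between versions.

```sql
-- Assumed definitions for the two tables in the EXPLAIN example.
CREATE TABLE tbl (v text);                        -- default collation
CREATE TABLE tbl_collate_c (v text COLLATE "C");  -- explicit C collation

SET optimizer = on;                 -- ask for ORCA
SET optimizer_trace_fallback = on;  -- report when ORCA falls back

EXPLAIN SELECT * FROM tbl ORDER BY v;
EXPLAIN SELECT * FROM tbl_collate_c ORDER BY v;
```

The `Optimizer:` line at the bottom of each plan shows which planner produced it.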

## Proposal

We have opened [issue #1603](https://github.com/apache/cloudberry/issues/1603) 
to track the work. The proposed steps are:

1. **Fix ORCA to support COLLATE "C"** (prerequisite: #717) — teach ORCA to 
recognize and use C collation in sort keys, merge keys, and equality operators
2. **Ensure ORCA's string comparison operators are collation-aware** — avoid 
incorrect plans when mixing collations
3. **Recommend C locale for TPC-DS test tables** — document or script the 
best-practice setup so benchmarks reflect ORCA's full optimization capability
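
As a starting point for step 3, there are two ways a benchmark setup could opt into C collation. This is a sketch, not an agreed recommendation. The database name `tpcds_c` and the table name `store_sketch` are hypothetical, and the table lists only a few columns of the TPC-DS `store` table.

```sql
-- Option A: make C the default collation of the whole benchmark database.
-- template0 is required when the locale differs from the template's.
CREATE DATABASE tpcds_c
    TEMPLATE template0
    LC_COLLATE 'C'
    LC_CTYPE 'C';

-- Option B: keep the database default and mark individual string columns.
CREATE TABLE store_sketch (
    s_store_sk   integer NOT NULL,
    s_store_id   char(16) COLLATE "C" NOT NULL,
    s_store_name varchar(50) COLLATE "C"
);
```

Option B uses the explicit per-column form that the fallback described above applies to. Whichever option is used, it is worth checking the plans with `EXPLAIN` before relying on the results.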

## Discussion Questions

- Should Cloudberry's default cluster initialization (`initdb`) recommend or 
default to `LC_COLLATE=C` for analytical workloads?
- Are there correctness concerns with C collation in a distributed MPP setting 
(e.g., segment-level sort merge across different OS locales)?
- What is the right approach for ORCA to handle multiple collations — treat 
them as opaque and fall back, or model them explicitly in the optimizer?
- Has anyone already tested TPC-DS with `LC_COLLATE=C` on Cloudberry? What were 
your findings?
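
For anyone reporting findings, it helps to state which collations were actually in effect. The catalog queries below are one way to check. The last query is a small illustration of how C ordering differs from a typical locale-aware ordering, which is relevant to the correctness question above.

```sql
-- Default collation and ctype of every database in the cluster.
SELECT datname, datcollate, datctype
FROM pg_database
ORDER BY datname;

-- User columns that carry an explicit, non-default collation.
SELECT table_schema, table_name, column_name, collation_name
FROM information_schema.columns
WHERE collation_name IS NOT NULL
  AND table_schema NOT IN ('pg_catalog', 'information_schema');

-- C collation sorts by byte value, so uppercase ASCII comes before
-- lowercase: this returns B, a, c.
SELECT v
FROM (VALUES ('a'), ('B'), ('c')) AS t(v)
ORDER BY v COLLATE "C";
```

Under a locale-aware collation the last query can return a different order, so switching to C changes the results of `ORDER BY` and range predicates on mixed-case data, not just their speed.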

Would love to hear thoughts from the community, especially from contributors 
familiar with ORCA internals and anyone who has run TPC-DS benchmarks on 
Cloudberry.

## References

- [depesz benchmark: locale C 
performance](https://www.depesz.com/2024/06/11/how-much-speed-youre-leaving-at-the-table-if-you-use-default-locale/)
- [Issue #717: ORCA fallbacks for collate 
"C"](https://github.com/apache/cloudberry/issues/717)
- [Issue #1603: Feature request to support locale C in ORCA for 
TPC-DS](https://github.com/apache/cloudberry/issues/1603)


GitHub link: https://github.com/apache/cloudberry/discussions/1604
