Hi all,
I know this thread is 10+ years old, but I wanted to follow up since
the regexp performance discussion is still highly relevant today.
TL;DR: The situation has improved slightly over the years, but the
fundamental performance characteristics haven't changed dramatically.
So I built coregex - an alternative regex engine for Go that addresses
the performance issues discussed here.
What's Changed in Go stdlib (2013-2025)
========================================
The good:
- Bug fixes and stability improvements
- Better Unicode handling
- Minor optimizations here and there
The unchanged:
- Still NFA-based (Pike VM, plus one-pass and bounded-backtracking
fast paths); no lazy DFA
- No SIMD optimizations inside the matching engines
- No prefilters beyond a literal-prefix scan
- Same overall architecture
Go's regexp prioritizes correctness and simplicity over raw performance.
That's a valid design choice - match time is guaranteed to be linear in
the length of the input, which rules out ReDoS via catastrophic
backtracking. But for regex-heavy workloads, the performance gap vs
other languages remains significant.
The Performance Gap Today (2025)
=================================
Benchmarking against Rust's regex crate on patterns like .*error.*connection.*:
- Go stdlib: 12.6ms (250KB input)
- Rust regex: ~20µs (same input)
- Gap: ~600x slower
This isn't a criticism of Go - it's a different set of trade-offs.
But it shows the problem hasn't gone away.
What I Built: coregex
=====================
After hitting regex bottlenecks in production, I spent 6 months building
coregex - a drop-in replacement for Go's regexp.
GitHub: https://github.com/coregx/coregex
Architecture:
- Multi-engine strategy selection (DFA/NFA/specialized engines)
- SIMD-accelerated prefilters (AVX2 assembly)
- Bidirectional search for patterns like .*keyword.*
- Zero allocations in hot paths
Performance (vs stdlib):
- 3-3000x faster depending on pattern
- Maintains O(n) guarantees (no backtracking)
- Drop-in API compatibility
Real benchmarks:
Pattern         Input    stdlib    coregex   Speedup
-------------------------------------------------------
.*\.txt$        1MB      27ms      21µs      1,314x
.*error.*       250KB    12.6ms    4µs       3,154x
(?i)error       32KB     1.23ms    4.7µs     263x
\w+@\w+\.\w+    1KB      688ns     196ns     3.5x
Status: v0.8.0 released, MIT licensed, 88% test coverage
Could This Go Into stdlib?
===========================
That's the interesting question. I've been thinking about this from
several angles:
Challenges:
1. Complexity - Multi-engine architecture is significantly more
complex than the current implementation
2. Maintenance burden - SIMD assembly needs platform-specific
variants (AVX2, NEON, etc.)
3. Binary size - Multiple engines increase compiled binary size
4. API stability - stdlib changes need extreme care
Opportunities:
1. Incremental adoption - Could start with just SIMD primitives
(internal/bytealg improvements)
2. Opt-in optimizations - Keep current implementation as default,
offer regexp/fast package
3. Strategy selection - Add smart path selection without breaking
existing code
4. Knowledge transfer - Techniques from coregex could inform stdlib
improvements
What I'm Proposing
==================
Rather than a direct "merge coregex into stdlib" proposal, I'm suggesting:
1. Short term: Community uses coregex for performance-critical workloads
2. Medium term: Discuss which techniques could benefit stdlib
(SIMD byte search, prefilters)
3. Long term: Potential collaboration on stdlib improvements
(if there's interest)
I'd be happy to:
- Help with stdlib patches for incremental improvements
- Share implementation learnings and benchmarks
- Discuss compatibility considerations
For Those Interested
====================
Try it:
go get github.com/coregx/coregex@v0.8.0
Read more:
- Dev.to article:
https://dev.to/kolkov/gos-regexp-is-slow-so-i-built-my-own-3000x-faster-3i6h
- GitHub repo:
https://github.com/coregx/coregex
- v0.8.0 release:
https://github.com/coregx/coregex/releases/tag/v0.8.0
Feedback welcome on:
- API compatibility issues
- Performance on your specific patterns
- Ideas for stdlib integration
The Bottom Line
===============
The regexp performance discussion from 10+ years ago was valid then and
remains valid now. The good news: we have options today. The better news:
maybe some of these ideas will make their way into stdlib eventually.
In the meantime, coregex is production-ready and MIT-licensed. Use it if
it helps.
Cheers,
Andrey Kolkov
GitHub: https://github.com/kolkov
CoreGX (Production Go Libraries): https://github.com/coregx
On Thursday, 28 April 2011 at 18:13:21 UTC+4 Russ Cox wrote:
> In some areas Go can keep up with Java but when it comes to string
> operations ("regex-dna" benchmark), Go is even much slower than Ruby
> or Python. Is the status quo going to improve anytime soon? And why is
> Go so terribly slow when it comes to string/RegEx operations?

You assume the benchmark is worth something.
First of all, Ruby and Python are using C implementations
of the regexp search, so Go is being beat by C, not by Ruby.
Second, Go is using a different algorithm for regexp matching
than the C implementations in those other languages.
The algorithm Go uses guarantees to complete in time that is
linear in the length of the input. The algorithm that Ruby/Python/etc
are using can take time exponential in the length of the input,
although on trivial cases it typically runs quite fast.
In order to guarantee the linear time bound, Go's algorithm's
best case speed is a little slower than the optimistic Ruby/Python/etc
algorithm. On the other hand, there are inputs for which Go will
return quickly and Ruby/Python/etc need more time than is left
before the heat death of the universe. It's a decent tradeoff.
http://swtch.com/~rsc/regexp/regexp1.html
Russ
--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To view this discussion visit
https://groups.google.com/d/msgid/golang-nuts/ba9bb686-3db1-4d5c-b92a-d5cdd9f6814cn%40googlegroups.com.