Hi all, I know this thread is 10+ years old, but I wanted to follow up since the regexp performance discussion is still highly relevant today.
TL;DR: The situation has improved slightly over the years, but the fundamental performance characteristics haven't changed dramatically. So I built coregex - an alternative regex engine for Go that addresses the performance issues discussed here.

What's Changed in Go stdlib (2013-2025)
========================================

The good:
- Bug fixes and stability improvements
- Better Unicode handling
- Minor optimizations here and there

The unchanged:
- Still uses Thompson's NFA exclusively
- No SIMD optimizations
- No prefilter strategies
- Same single-engine architecture

Go's regexp prioritizes correctness and simplicity over raw performance. That's a valid design choice - it guarantees O(n) time complexity and prevents ReDoS attacks. But for regex-heavy workloads, the performance gap versus other languages remains significant.

The Performance Gap Today (2025)
=================================

Benchmarking against Rust's regex crate on patterns like .*error.*connection.*:

- Go stdlib: 12.6ms (250KB input)
- Rust regex: ~20µs (same input)
- Gap: ~600x slower

This isn't a criticism of Go - it's a different set of trade-offs. But it does show that the problem hasn't gone away.

What I Built: coregex
=====================

After hitting regex bottlenecks in production, I spent 6 months building coregex - a drop-in replacement for Go's regexp.
GitHub: https://github.com/coregx/coregex

Architecture:
- Multi-engine strategy selection (DFA/NFA/specialized engines)
- SIMD-accelerated prefilters (AVX2 assembly)
- Bidirectional search for patterns like .*keyword.*
- Zero allocations in hot paths

Performance (vs stdlib):
- 3-3000x faster depending on pattern
- Maintains O(n) guarantees (no backtracking)
- Drop-in API compatibility

Real benchmarks:

Pattern         Input    stdlib    coregex   Speedup
-------------------------------------------------------
.*\.txt$        1MB      27ms      21µs      1,314x
.*error.*       250KB    12.6ms    4µs       3,154x
(?i)error       32KB     1.23ms    4.7µs     263x
\w+@\w+\.\w+    1KB      688ns     196ns     3.5x

Status: v0.8.0 released, MIT licensed, 88% test coverage

Could This Go Into stdlib?
===========================

That's the interesting question. I've been thinking about this from several angles.

Challenges:
1. Complexity - a multi-engine architecture is significantly more complex than the current implementation
2. Maintenance burden - SIMD assembly needs platform-specific variants (AVX2, NEON, etc.)
3. Binary size - multiple engines increase compiled binary size
4. API stability - stdlib changes need extreme care

Opportunities:
1. Incremental adoption - could start with just SIMD primitives (internal/bytealg improvements)
2. Opt-in optimizations - keep the current implementation as the default and offer a regexp/fast package
3. Strategy selection - add smart path selection without breaking existing code
4. Knowledge transfer - techniques from coregex could inform stdlib improvements

What I'm Proposing
==================

Rather than a direct "merge coregex into stdlib" proposal, I'm suggesting:

1. Short term: the community uses coregex for performance-critical workloads
2. Medium term: discuss which techniques could benefit stdlib (SIMD byte search, prefilters)
3.
Long term: potential collaboration on stdlib improvements (if there's interest)

I'd be happy to:
- Help with stdlib patches for incremental improvements
- Share implementation learnings and benchmarks
- Discuss compatibility considerations

For Those Interested
====================

Try it:

    go get github.com/coregx/[email protected]

Read more:
- Dev.to article: https://dev.to/kolkov/gos-regexp-is-slow-so-i-built-my-own-3000x-faster-3i6h
- GitHub repo: https://github.com/coregx/coregex
- v0.8.0 release: https://github.com/coregx/coregex/releases/tag/v0.8.0

Feedback welcome on:
- API compatibility issues
- Performance on your specific patterns
- Ideas for stdlib integration

The Bottom Line
===============

The regexp performance discussion from 10+ years ago was valid then and remains valid now. The good news: we have options today. The better news: maybe some of these ideas will make their way into stdlib eventually. In the meantime, coregex is production-ready and MIT-licensed. Use it if it helps.

Cheers,
Andrey Kolkov
GitHub: https://github.com/kolkov
CoreGX (Production Go Libraries): https://github.com/coregx

On Thursday, 28 April 2011 at 18:13:21 UTC+4 Russ Cox wrote:
> > In some areas Go can keep up with Java but when it comes to string
> > operations ("regex-dna" benchmark), Go is even much slower than Ruby
> > or Python. Is the status quo going to improve anytime soon? And why is
> > Go so terribly slow when it comes to string/RegEx operations?
>
> You assume the benchmark is worth something.
>
> First of all, Ruby and Python are using C implementations
> of the regexp search, so Go is being beat by C, not by Ruby.
>
> Second, Go is using a different algorithm for regexp matching
> than the C implementations in those other languages.
> The algorithm Go uses guarantees to complete in time that is
> linear in the length of the input.
> The algorithm that Ruby/Python/etc
> are using can take time exponential in the length of the input,
> although on trivial cases it typically runs quite fast.
> In order to guarantee the linear time bound, Go's algorithm's
> best case speed is a little slower than the optimistic Ruby/Python/etc
> algorithm. On the other hand, there are inputs for which Go will
> return quickly and Ruby/Python/etc need more time than is left
> before the heat death of the universe. It's a decent tradeoff.
>
> http://swtch.com/~rsc/regexp/regexp1.html
>
> Russ

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion visit https://groups.google.com/d/msgid/golang-nuts/ba9bb686-3db1-4d5c-b92a-d5cdd9f6814cn%40googlegroups.com.
