> I'm writing a new mode for Emacs that involves a massive
> regular expression, auto-generated from a list of files in
> the directory. If the number of files is too large (c. 1500,
> depending on the length of the filenames), then the regular
> expression that gets built is too big, and Emacs flashes up
> an error: Invalid regexp: "regular expression too big".
>
> So it looks as though this is a known issue, and that the
> solution was just to hardcode a ceiling on regexp size. This
> is a showstopper for us. At the moment, the only workaround
> that we can think of would be to chop the regexp into
> multiple pieces, run them separately, and then somehow
> combine the results. As you can imagine, this is going to be
> much slower, and much much uglier.

> Is there anything that can be done to extend the allowed
> size of the regexp?

Well, you can rewrite regexp.c if you want. Currently it works by
compiling your regexp to a non-deterministic (i.e. backtracking)
byte-code machine, which uses 2-byte offsets to jump around, so it is
difficult to write regexps much larger than about 32KB (after
compilation).

There are various ways to change regexp.c so as to allow larger
regexps. One would be to make the "too large" check more precise
(right now, I believe it just complains as soon as the whole compiled
regexp exceeds 32KB, but we could allow larger ones, as long as all
offsets fit within the ±32KB limit). Or one could add "long jumps"
with 4-byte offsets and try to insert them where needed, or make all
offsets 4 bytes, or rewrite regexp.c completely (ideally just adapting
GNU libc's regexp engine or some other).

But maybe you can circumvent the limit without removing it. Tell us
more about your regexps: maybe we can optimize them.

        Stefan
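
To make the 2-byte offset limit above concrete, here is a minimal
sketch in C. It is not taken from regexp.c: the names store_offset16
and load_offset16 and the little-endian byte order are assumptions
made for illustration only. It shows why a signed 2-byte jump offset
tops out at ±32KB, and what a per-offset check (rather than a check on
the total compiled size) might look like.

```c
#include <stdint.h>

/* Hypothetical helper, not from regexp.c: store a signed jump offset
   in two bytes, low byte first.  Returns 0 on success, or -1 if the
   offset does not fit in 16 bits (i.e. is outside -32768..32767).  */
static int
store_offset16 (unsigned char *dest, long offset)
{
  if (offset < INT16_MIN || offset > INT16_MAX)
    return -1;                       /* this jump would be "too big" */
  unsigned long u = (unsigned long) offset & 0xffffUL;
  dest[0] = (unsigned char) (u & 0xff);
  dest[1] = (unsigned char) ((u >> 8) & 0xff);
  return 0;
}

/* Hypothetical helper: read back a signed 2-byte offset stored by
   store_offset16, sign-extending it by hand.  */
static long
load_offset16 (const unsigned char *src)
{
  long u = (long) src[0] | ((long) src[1] << 8);
  return (u & 0x8000) ? u - 0x10000 : u;
}
```

With a check like this applied to each jump, a compiled regexp could
in principle grow past 32KB as long as no single jump has to span more
than 32767 bytes forward or 32768 bytes backward.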