On Fri, Nov 30, 2012 at 6:23 PM, Heikki Linnakangas <hlinnakan...@vmware.com > wrote:
> On 30.11.2012 13:20, Alexander Korotkov wrote: > >> On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangas<hlinnakangas@** >> vmware.com <hlinnakan...@vmware.com> >> >>> wrote: >>> >> >> Would it be safe to simply stop short the depth-first search on overflow, >>> and proceed with the graph that was constructed up to that point? >>> >> >> For depth-first it's not. But your proposal naturally makes sense. I've >> changed it to breadth-first search. And then it's safe to mark all >> unprocessed states as final when overflow. It means that we assume every >> unprocessed branch to immediately finish with matching (this can give us >> more false positives but no false negatives). >> For overflow of matrix collection, it's safe to do just OR between all the >> trigrams. >> New version of patch is attached. >> > > Thanks, sounds good. > > I've spent quite a long time trying to understand the transformation the > getState/addKeys/addAcrs functions do to the original CNFA graph. I think > that still needs more comments to explain the steps involved in it. > > One thing that occurs to me is that it might be better to delay expanding > colors to characters. You could build and transform the graph, and even > create the paths, all while operating on colors. You would end up with > lists of "color trigrams", consisting of sequences of three colors that > must appear in the source string. Only at the last step you would expand > the color trigrams to real character trigrams. I think that would save a > lot of processing while building the graph, if you have colors that contain > many characters. It would allow us to do better than the fixed small > MAX_COLOR_CHARS limit. For example with MAX_COLOR_CHARS = 4 in the current > patch, it's a shame that we can't do anything with a fairly simple regexp > like "^a[b-g]h$" > Nice idea to delay expanding colors to characters! Obviously, we should delay expanding inly alphanumerical characters. Because non-alphanumberical characters influence graph structure. Trying to implement... ------ With best regards, Alexander Korotkov.