An interesting discussion ... thanks :). I tried "grep -f" on a couple of large text files I had.
I have a file of "words", one word per line, all lower-case ASCII, that I had accumulated from various spell checkers and such. This "words" file has 395,696 unique (more or less) English words in it. I also have a large "hosts" file that I use as part of my spammy-site filtering, with 345,552 lines in it, each line holding the IP address 127.0.0.1 and a host URL.

I tried "grep -f words hosts", using grep (GNU grep) 2.26, and monitored it from another screen. Most lines in the hosts file matched one or more of the shorter words in the words file, but the resident memory footprint kept climbing, non-stop, at a rate of over 1 GByte of RAM per second. In less than a minute, after line 157 of the 345,552 lines in the hosts file had been output as a match, all 32 GBytes of my RAM were consumed, and I spent the next minute regaining control of my system. So far as I know, no useful searching occurred once I ran out of RAM.

I would suggest six things:

1) Something seems broken if the "grep -f" algorithm requires continuously increasing resident RAM even after it has finished whatever initialization it needs and is busy scanning the search space for matches. At the rate it was growing, I would have needed (345,552 / 157) * 32 GBytes, which is about 70 terabytes of RAM, larger than any system that I know of, and it would have taken (wild estimate) 20 hours to complete. Ouch!

2) In those cases where grep -f seemed to take "forever" ... or however long until the user gave up ... I would guess that's because the computer's memory was thrashing to disk. As soon as a problem no longer fits in RAM, it slows down by multiple orders of magnitude. Take my test above: I was getting a few matching lines of output per second ... until I ran out of RAM and could no longer see anything, because my window would not refresh, show further output, or even show my mouse cursor.

3) Yes, the default behaviour should not require the user to choose the algorithm. However, if that behaviour cannot be, or is not yet, provided, then the default should do the best it can, but there should also be options to select the algorithm, if multiple algorithms are coded, for those cases where the user "requires the very best" (or at least cannot tolerate what might be dreadfully bad) for their particular problem.

4) When I have many things to search for, as with the large "words" file above, I look for ways to simplify the space being searched, so that, for example, I only have to look for matching words at the start of a line or after a specific field-separator character. Then I can load my "words" into some sort of tree or associative array (such as a Python dictionary, Patricia trie, Judy tree, or lmdb database) and look each line or field up in that data structure.

5) I've never actually had to solve a problem with both a long list of "words" to search for and a search space in which a matching word could start at almost any character, but if I did, I'd guess that I'd try my approach from (4) above, and just accept that I would have to loop over every character in the search space, starting a lookup into my tree or associative array at each such position. Since the "words" file in my example above is about 4 MBytes, a tree or associative array with those words loaded should fit in 5 to 10 MBytes of RAM, give or take, and the solution would proceed reliably without consuming much more RAM than that ... just with an inner loop executed 12,174,500 times, once for each character in my "hosts" file as a possible starting point. Fortunately, only 6,414,428 of those characters are lower-case letters, so the other 5,760,072 (not lower-case letters) would fall through quickly, not even matching on the first lookup.
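To make (4) and (5) concrete, here is a rough sketch, in plain Python, of what I have in mind, using a simple dict-of-dicts in place of anything as fancy as a Patricia trie or Judy tree. It is only an illustration, not tuned code; the file names "words" and "hosts" are just the ones from my example above.

    # Rough sketch: load the "words" file into a simple dict-of-dicts trie,
    # then try a trie walk starting at every character position of every
    # "hosts" line, printing each line that contains at least one word.

    END = None                          # marker key: "a word ends here"

    trie = {}
    with open('words') as f:
        for word in f:
            word = word.strip()
            if not word:
                continue                # skip blank lines
            node = trie
            for ch in word:
                node = node.setdefault(ch, {})
            node[END] = True

    with open('hosts') as f:
        for line in f:
            line = line.rstrip('\n')
            for i in range(len(line)):  # every character is a possible start
                node = trie
                j = i
                while j < len(line) and line[j] in node:
                    node = node[line[j]]
                    j += 1
                    if END in node:     # some word begins at position i
                        print(line)
                        break
                else:
                    continue            # nothing starts here; try the next i
                break                   # matched; one line of output is enough

In CPython that dict-of-dicts will cost a good deal more memory than a compact Patricia trie or Judy tree would, but the property I care about holds either way: the footprint is fixed once the words are loaded, rather than growing as the search runs, and positions that do not start any known word fall out of the inner loop on the very first lookup.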
6) The problem considered in (4) above is one that I actually run into regularly in some other work that I am doing. I have a command that I use from the shell prompt (along with artful use of many other classic Unix shell commands), implemented using a Python dictionary, that solves such problems much more quickly than "grep -f" can. If you're an ancient Unix shell hacker such as myself, you might find it useful: http://pauljackson.us/dict.py

Using this dict command, I can find the 10,392 entries in my "hosts" file whose URL has its second (from the left) dot-separated component matching one of the 395,696 words in my "words" file, in about 1.4 seconds of CPU time for the dict command and about 4 MBytes of RAM for its dictionary, using the following shell commands:

    < words sed 's/.*/& & MATCH/' > /tmp/words.1
    < hosts awk -F. '/^127.0.0.1/ {print $5, $0}' | dict /tmp/words.1 | grep MATCH | wc

The awk -F. command copies the second (from the left) dot-separated component of the URL on each line to the front of that line; each active line in my hosts file has the IP address 127.0.0.1, hence the /^127.0.0.1/ pattern. I habitually have dict replace any first field that matches with that same field, and/or whatever other replacement value I want, plus the word "MATCH", so that I can then easily grep out just the matching lines.

Yes, that is a lot of detail for this item (6), but when you are routinely solving problems that can be cast in this form, the "1.6 seconds" for even a rather large example is compelling. In my most common use case, I am dealing with over 10 million entries in either the list of words I am looking for or the search space I am looking in. No variant of "grep -f" that I have ever seen can handle such problems.
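The actual dict.py is at the URL above; purely to give the flavour of the thing, here is a minimal sketch of the kind of dictionary join described in (6), with the field and replacement handling simplified. It is an illustration of the idea, not the real script.

    #!/usr/bin/env python
    # Rough sketch of a key-based join, NOT the real dict.py: the first
    # field of each line of the mapping file is the key, and the rest of
    # that line is the replacement.  Lines on stdin whose first field is
    # a known key get that field replaced; other lines pass through, so a
    # later "grep MATCH" keeps only the lines that matched.
    import sys

    mapping = {}
    with open(sys.argv[1]) as f:
        for line in f:
            fields = line.rstrip('\n').split(None, 1)
            if fields:
                mapping[fields[0]] = fields[1] if len(fields) > 1 else ''

    for line in sys.stdin:
        line = line.rstrip('\n')
        fields = line.split(None, 1)
        if fields and fields[0] in mapping:
            rest = fields[1] if len(fields) > 1 else ''
            print(mapping[fields[0]] + ' ' + rest)
        else:
            print(line)

Used in the pipeline above in place of dict, something along these lines does the same kind of key-based join: one dictionary lookup per line, rather than a scan of 395,696 patterns per line.

--
Paul Jackson
p...@usa.net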