bug#24858: URGENT: Question about grep

Paul Jackson Wed, 02 Nov 2016 10:26:22 -0700

Greta asked:
>> So what I have to add in grep command to put the limit of 30 characters?


Eric replied:
>> You can't do it with grep. 

Bruce suggested:
>> cut -c 30 filename | grep ACGTAC

The following grep command seems to work for me. On my system, using a
large dataset I have (some web server logs), it is about 40% faster in terms
of user CPU time than using cut and grep in a pipeline: the extra CPU cost
of the more complex grep expression is more than compensated for by the
reduced copying of the data stream.

grep -E '^.{0,30}GTGTCA'

===

A custom C program could make this dramatically faster, especially if:

it avoided using stdio or any other form of line buffering that copied
each line of data within the application,

it used raw read(2) calls,

it used strchr(3) calls to scan to the end of the current line (hence the start
of the next line), and

it used a mix of strchr and unaligned word compares, say of the 4 bytes
"ACGT", then the 2 bytes "AC", which can be done on CPUs supporting
unaligned word compares.
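
To make the outline above concrete, here is a rough, untuned sketch of such
a scanner. It is only an illustration under some assumptions of mine, not a
benchmarked program: the function names (line_matches, count_in_buffer,
scan_fd) are made up for this example, it uses memchr(3) rather than strchr(3)
because a buffer filled by read(2) is not NUL-terminated, and it uses
memcmp(3) where a hand-tuned version might use unaligned word compares. It
counts lines in which the motif starts at byte offset 0 through max_start,
and it gives up on any line longer than its 1 MiB buffer.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Return 1 if motif starts at a byte offset <= max_start in line[0..len). */
int line_matches(const char *line, size_t len,
                 const char *motif, size_t mlen, size_t max_start)
{
    if (mlen == 0 || len < mlen)
        return 0;
    size_t last = len - mlen;            /* last offset where motif still fits */
    if (last > max_start)
        last = max_start;
    const char *p = line;
    const char *end = line + last + 1;   /* one past the last candidate start */
    while (p < end && (p = memchr(p, motif[0], (size_t)(end - p))) != NULL) {
        if (memcmp(p, motif, mlen) == 0)
            return 1;
        p++;
    }
    return 0;
}

/* Count matching newline-terminated lines in buf[0..len).
 * *consumed is set to the number of bytes belonging to complete lines. */
size_t count_in_buffer(const char *buf, size_t len, const char *motif,
                       size_t max_start, size_t *consumed)
{
    size_t mlen = strlen(motif), count = 0, pos = 0;
    const char *nl;
    while (pos < len && (nl = memchr(buf + pos, '\n', len - pos)) != NULL) {
        size_t linelen = (size_t)(nl - (buf + pos));
        count += (size_t)line_matches(buf + pos, linelen, motif, mlen, max_start);
        pos += linelen + 1;              /* step over the newline */
    }
    *consumed = pos;
    return count;
}

/* Scan a file descriptor with raw read(2) calls, no stdio.
 * Returns the number of matching lines, or -1 on a read error or on a
 * line that does not fit in the buffer (a limitation of this sketch).
 * Not reentrant, because the buffer is static. */
long scan_fd(int fd, const char *motif, size_t max_start)
{
    static char buf[1 << 20];
    size_t have = 0, used = 0;
    long total = 0;
    ssize_t n;

    while ((n = read(fd, buf + have, sizeof buf - have)) > 0) {
        have += (size_t)n;
        total += (long)count_in_buffer(buf, have, motif, max_start, &used);
        if (used == 0 && have == sizeof buf)
            return -1;                   /* line longer than the buffer */
        memmove(buf, buf + used, have - used);  /* keep the partial last line */
        have -= used;
    }
    if (n < 0)
        return -1;
    if (have > 0)                        /* final line without a newline */
        total += line_matches(buf, have, motif, strlen(motif), max_start);
    return total;
}
```

For plain ASCII input, calling scan_fd(STDIN_FILENO, "GTGTCA", 30) from a
small main() should give the same count as grep -c with the expression
above. Whether it actually beats grep would have to be measured.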

Finding a programmer who can code that might be difficult, and
such optimization would only make sense if you're burning lots of
CPU time, or project time, on this particular scan.

-- 
                Paul Jackson
                p...@usa.net
