Quoting Vincent Lefevre ([email protected]):
> On 2015-05-22 21:01:01 -0500, David Wright wrote:
> > However, in https://lists.debian.org/debian-user/2015/04/msg01265.html
> > I was perhaps less ambiguous (point 2):
> >
> > "In which case, if you want to know how come mutt is so fast, take a
> > look at the source. Just to mention one optimisation I would consider:
> > slurp the directory and sort the entries by inode. Open the files in
> > inode order.
> > And another: it's probably faster to slurp bigger chunks of each file
> > (with an intelligent guess of the best buffer size) and use a fast
> > search for \nMessage-ID rather than reading and checking line by line.
> > "
>
> This may be interesting with mmap. Otherwise, one may do unnecessary
> copies.
>
> > > Then I don't think that in the particular case of header validation,
> > > there is much gain applying regexp's on the full header at once; the
> > > reason is that my regexp's use the end of line as a separator (things
> > > like /\n[^:\s]+\s/ and /^Message-ID:.../im). So, when I read the file
> > > line by line, I already do a part of the job of regexp matching.
> >
> > But I would assume that regexp in languages like Perl/Python has code
> > far more optimised than reading files line by line.
>
> This is not clear. All my regexp's are anchored on a newline.
> Reading files line by line allows one to do some factoring.
>
> > So you would search for \nmessage-id:.*?\n (where .*? is
> > non-greedy).
>
> One can do better. The code I used in the second test was:
>
>     $header =~ /^\S+:/ || $header =~ /^From / or die;
>     $header =~ /\n[^:\s]+\s/ and die;
>     $header =~ /^Message-ID:.*^Message-ID:/ims and die;
>     $header =~ /^Message-ID:\s+(<\S+>)( \(added by .*\))?$/im or die;
>
> where $header is the full header.
>
> > > And finally, for each test, the header has to be read several times.
> >
> > I'm not sure why, without knowing the tests to apply (or did I miss
> > seeing them?).
>
> See above.
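
Thanks. To check that I'm reading those four tests correctly, here is a
rough transliteration into Python. It is only a sketch: the function name
check_header and the error messages are my own inventions, not anything from
your script, and Python's \s is slightly broader than Perl's on non-ASCII
text.

```python
import re


def check_header(header):
    """Return the Message-ID found in a full header, or raise ValueError."""
    # 1. The first line must be a header field or an mbox-style "From " line.
    if not (re.match(r"\S+:", header) or header.startswith("From ")):
        raise ValueError("bad first line")
    # 2. A later line that begins with a colon-free word followed by
    #    whitespace is neither a field nor a folded continuation line.
    if re.search(r"\n[^:\s]+\s", header):
        raise ValueError("malformed header line")
    # 3. Two Message-ID fields anywhere in the header.
    if re.search(r"^Message-ID:.*^Message-ID:", header, re.I | re.M | re.S):
        raise ValueError("duplicate Message-ID")
    # 4. One Message-ID of the expected shape, optionally followed by
    #    an "(added by ...)" remark.
    m = re.search(r"^Message-ID:\s+(<\S+>)( \(added by .*\))?$",
                  header, re.I | re.M)
    if not m:
        raise ValueError("missing or malformed Message-ID")
    return m.group(1)
```

As in the Perl, every test scans the whole header string again, which is
presumably what "the header has to be read several times" refers to.
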
> > > In my case, I don't need to deal with folded headers, except validating
> > > the format, which is very easy with a line-by-line parsing.
> >
> > You did mention validating message-id and other headers and checking
> > for missing ones, but do your scripts throw all this work away and,
> > if so, why? For example, if you add your own distinctive Message-ID
> > header to any file that doesn't have one, then that's one test you
> > never have to repeat.
>
> I don't understand.

Well, the discussion in these threads has ranged widely over trying to
speed up the reading of directories and large numbers of files. Every
so often, I think about what you're doing with that huge directory of
emails, all 145k of them.

AIUI, and correct me if I'm wrong, you have to be able to read them
with a mail client (mutt). You have to check that (all) the header
lines are correctly formed and that each email has a single unique
message-id. Every so often (quite frequently) you run Perl scripts
(like those posted) over them and modify the header lines (or flags)
of some of them, then restart mutt so it picks up the modifications.

Not being conversant with the maildir format, I took a look at
http://wiki2.dovecot.org/MailboxFormat/Maildir
to see how filenames are used, and how flags are implemented. I see
one also might have to be careful about preserving timestamps.

Anyway, the questions that pop into my head are things like:

If an email doesn't have a message-id, why not give it one with an
X-header that you recognise as your own? (You could process duplicates
similarly.)

Why not put your X-header as the first line in the file? (In most
cases, it would be a copy of the original message-id.) Then you only
have to read one line to get at your X-header/message-id on every
subsequent occasion that you process the files.

If a header line is malformed, why not fix it up straight away as best
you can (rather than die), perhaps flagging the fact?

Why not do all these things just the once? Process all the existing
messages in however long it takes. Do it when you're not running mutt,
not renaming files etc., so that the directory is static. Then keep
track of an mtime "tidemark" so that you can recognise new messages,
which need their X-header to be added and to be checked over.

Now when you do all your message filtering/flagging, you don't have to
faff around with variable numbers of header lines yet again.
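
To make that a bit more concrete, here is a rough sketch in Python (the
idea carries over to Perl unchanged). Everything named in it is my own
assumption: the header name X-Local-Message-ID, the three function names,
LF line endings, no mbox-style "From " first line, and nothing else
touching the directory while it runs (the temporary file is written next
to the message).

```python
import os
import re

X_HEADER = b"X-Local-Message-ID"  # hypothetical name, pick your own
MSGID_RE = re.compile(rb"^Message-ID:\s+(<\S+>)", re.I | re.M)


def tag_message(path, fallback_id):
    """Put X_HEADER on the first line, keeping the file's timestamps.

    Returns the id that is recorded on the first line afterwards.
    """
    st = os.stat(path)
    with open(path, "rb") as f:
        data = f.read()
    first_line = data.split(b"\n", 1)[0]
    if first_line.startswith(X_HEADER + b":"):
        # Already processed on an earlier run: nothing to do.
        return first_line.split(b":", 1)[1].strip()
    # Only search the header, i.e. everything before the first blank line.
    header = data.split(b"\n\n", 1)[0]
    m = MSGID_RE.search(header)
    msgid = m.group(1) if m else fallback_id
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(X_HEADER + b": " + msgid + b"\n" + data)
    # Restore the original atime/mtime before moving the copy into place.
    os.utime(tmp, ns=(st.st_atime_ns, st.st_mtime_ns))
    os.replace(tmp, path)
    return msgid


def quick_id(path):
    """Recover the recorded id by reading just the first line."""
    with open(path, "rb") as f:
        line = f.readline()
    name, _, value = line.partition(b":")
    return value.strip() if name == X_HEADER else None


def newer_than(dirpath, tidemark_ns):
    """Names of regular files whose mtime is later than the tidemark."""
    with os.scandir(dirpath) as it:
        return sorted(e.name for e in it
                      if e.is_file() and e.stat().st_mtime_ns > tidemark_ns)
```

The one-off pass would call tag_message() on every file and note the
latest mtime seen as the tidemark; later runs would call it only on what
newer_than() returns, and use quick_id() everywhere else. Because the
rewritten files keep their old mtime, they stay below the tidemark.
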

BTW I couldn't help being amused by this paragraph in the dovecot wiki:

"Issues with the specification

Locking

Although maildir was designed to be lockless, Dovecot locks the
maildir while doing modifications to it or while looking for new
messages in it. This is required because otherwise Dovecot might
temporarily see mails incorrectly deleted, which would cause
trouble. Basically the problem is that if one process modifies the
maildir (eg. a rename() to change a message's flag), another process
in the middle of listing files at the same time could skip a file. The
skipping happens because readdir() system call doesn't guarantee that
all the files are returned if the directory is modified between the
calls to it. This problem exists with all the commonly used
filesystems.
"

Cheers,
David.

