Thank you all for your helpful replies. It seems pretty clear that the recommended Notmuch usage
for someone who wants to incorporate a script that classifies a batch of messages is one of two
things. Either write the whole script to use Notmuch from the beginning, with the messages
specified as a list of "id:" IDs or even a general Notmuch query; or, if you are using an existing
script that accepts a list of files, try to extract the message ID from each file at the end, so
that the new tags can be communicated from the script back to Notmuch. Both options seem a little
hacky, especially since it is rather common to receive multiple distinct messages with the same
ID, for example when someone replies to a mailing list post and Cc's me. For security reasons I
would want these to be separately viewable (they are linked together with an "=" in the Mutt
thread view). If Notmuch is meant to function as an abstraction layer over message files stored
on the file system, then why doesn't it provide a standard way to go from file paths to Notmuch
messages?

As for why it would be a security issue to ignore new messages with duplicate 
message IDs, consider that one can apparently play the following game on a 
Notmuch user. (1) Send a private email that the user will never see because it 
contains spam keywords. (2) Send a public email to a mailing list with the same 
ID. The Notmuch user will not see the second email, and everyone will think he 
is unable to reply to the allegations it presumably contains, and that he is 
therefore guilty and should be arrested.

My recommendation would be to split the Notmuch project into three teams: one to work on 
the source code, another on the documentation, and a third on test cases. There should be 
separate Git repositories for each team, so that I can for example run current test cases 
against a fork of the source repo, or use recent manual pages with an older version of 
the source. This way, the documentation team will be able to document deficiencies in 
various source releases as well as standardize proposed new features or syntax. Or, 
someone would submit a pull request to the source team, that would then be discussed on 
the mailing list or in the issue tracker, and someone on another team would then use that 
discussion to write documentation or test cases before the PR is accepted. The teams 
would have a "checks and balances" relationship, like with the three branches 
of government. (I think that all software projects should be run this way, so please 
don't be offended.)

I wrote some Perl scripts a long time ago, which work together to tag mail and 
put links to each message in a tag-specific directory for each of its tags. The 
script would add headers to the message, however, and it rewrote the Message-ID 
if it wasn't unique. It did not create a full-text index like Notmuch does. It 
did seem fairly reliable. I am trying to adapt it to send the tags to Notmuch. 
I am having to use Notmuch because of a third piece of software that depends on 
it. It is somewhat perplexing to me that no one else has had my use case before.

Best wishes,

Frederick

On Sat, Sep 21, 2024 at 11:38:18AM +0200, Michael J Gruber wrote:
On Sat, 21 Sept 2024 at 05:23, Frederick Eaton <frede...@ofb.net> wrote:

Thank you for your response, Pengji.

On Sat, Sep 21, 2024 at 08:25:10AM +0800, Pengji Zhang wrote:
>Hi Frederick,
>
>Frederick Eaton <frede...@ofb.net> writes:
>
>>I am trying to figure out how to adapt a script I wrote for
>>filtering messages, to apply notmuch tags to each message. A
>>difficulty is that the messages are already in the Notmuch database,
>>because another tool has delivered them to a maildir and run
>>"notmuch new".
>>
>>Now, Notmuch can provide me with the paths of all the new
>>(unfiltered) messages, which I can give to my script. The question I
>>have is, once the filter is done, how can the script tell Notmuch
>>which message to apply the tags to?
>
>
>I am not sure if I understand you correctly. If the problem here is to
>distinguish existing messages and new messages, would the config
>option 'new.tags' work? For example, use
>
>   notmuch config set new.tags new
>
>to give all new messages a 'new' tag.

No, I already have that configuration. The first sentence described what I 
already know how to do, the second sentence is what I'm trying to do.

It seems that we're still guessing at what your script is doing or
trying to do. Do you mind sharing a trimmed down version?

It might be useful for the reasons I stated, namely in case the Message-ID does 
not exist or is not unique.

This is probably at the heart of the problem. Within notmuch, a
"message" is something identified by a message-id (mid), and all
information in the notmuch database is tied to a mid.

When you speak about a message, you probably mean the content of an
individual "message file" - which is a natural, but different notion.
A "path:" refers to a message file, a "mid:" to message id.

When "notmuch new" encounters a new message file, it
- checks if it contains a valid "Message-ID" header
- uses that as mid or generates a mid using a sha1 checksum of the message file
- checks whether that mid (!) is in the database already
- adds the path to the existing db entry, or creates a new db entry
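
The steps above can be roughly imitated from a script, for example to predict
which id a given file will be filed under. This is only a sketch, not notmuch's
actual code: the function name guess_mid is made up for illustration, the header
parsing is much cruder than what notmuch does, and both the "notmuch-sha1-"
prefix for generated ids and the availability of sha1sum are assumptions to
check against your own installation.

```shell
# guess_mid FILE: print the message id that FILE would probably get.
guess_mid() {
    mid=$(awk '
        /^\r?$/  { exit }                           # a blank line ends the header block
        /^[ \t]/ { if (cont) val = val $0; next }   # folded continuation line
                 { cont = 0 }
        !seen && tolower($0) ~ /^message-id:/ {
            val = substr($0, 12)                    # text after "Message-ID:"
            seen = 1
            cont = 1
        }
        END {
            # keep only what is between the angle brackets
            if (match(val, /<[^<>]+>/))
                print substr(val, RSTART + 1, RLENGTH - 2)
        }
    ' "$1")

    if [ -n "$mid" ]; then
        printf '%s\n' "$mid"
    else
        # No usable header: fall back to a checksum of the whole file.
        # The prefix is an assumption about how notmuch names such ids.
        printf 'notmuch-sha1-%s\n' "$(sha1sum < "$1" | cut -d' ' -f1)"
    fi
}
```

Calling guess_mid on two files that carry the same Message-ID header prints the
same id twice, which is the case where several paths end up under one db entry.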

So, you may have several files (path entries) for the same mid, and
which one is used for indexing purposes depends on the order of
arrival (or, in the case of reindexing, probably on file system
ordering). notmuch assumes that this makes no difference - same mid
same "message". This assumption can break, for example for list
copies, different headers on sent versus received etc.

I'm elaborating on this because we have to guess about your script -
what is a "new message" for your script, and which kind of information
does it want to process?

Typical processing would be done in a notmuch post-hook, and it would:
- check for new messages (tag:new)
- get their file paths from `notmuch search --output=files mid:XYZ` or such
- do whatever it needs using the file if you really need to parse that yourself

I guess most of us have some sort of script running on new messages as
part of a hook, be it `afew` or something homegrown, and this
typically clears the new tag afterwards.

Michael


On Tue, Sep 24, 2024 at 11:09:26AM +0200, Michael J Gruber wrote:
On Sat, 21 Sept 2024 at 18:24, Panayotis Manganaris
<panos.mangana...@gmail.com> wrote:
...
notmuch search --output=messages 'tag:new' > /tmp/msgs
notmuch search --output=files 'tag:new' |\
    bogofilter -o0.7,0.7 -bt |\
    paste - /tmp/msgs |\
    awk '$1 ~ /S/ { print "-new +spam", "-", $3 }' |\
    notmuch tag --batch

...
This script operates on the assumption that the order of results from notmuch
queries is always the same, which is fortunately true.

It also operates under the assumption that you receive no duplicate
messages with the same message-id (such as list copies,
sent/received), or else `paste` will have a hard time matching lines.

Note that you can loop over the msgs, treat them individually, and
still collect input for `notmuch tag --batch`, which solves both the
problem with duplicate messages and potential ordering instability
while keeping batch efficiency.
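
Such a loop might look like the sketch below. It is a sketch under stated
assumptions, not a tested hook: classify is a hypothetical stand-in for the real
filter (here it merely greps for an X-Spam-Flag header), the function names are
made up, and the policy of tagging a message as spam only when every one of its
files is classified as spam is just one possible choice for duplicates. Ids
containing unusual characters may also need extra quoting in the query.

```shell
# Hypothetical classifier: succeeds if FILE looks like spam.
# Replace the body with a call to your own filter.
classify() {
    grep -qi '^X-Spam-Flag: *yes' "$1"
}

# Read file paths on stdin; succeed only if there is at least one
# path and every one of them is classified as spam.
all_spam() {
    seen=0
    while IFS= read -r f; do
        seen=1
        classify "$f" || return 1
    done
    [ "$seen" -eq 1 ]
}

# Print one line of "notmuch tag --batch" input per new message id.
emit_batch() {
    notmuch search --output=messages tag:new |
    while IFS= read -r mid; do
        # All files known for this id are judged together, so the
        # result does not depend on how two result lists line up.
        if notmuch search --output=files "$mid" </dev/null | all_spam; then
            printf '%s\n' "-new +spam -- $mid"
        else
            printf '%s\n' "-new -- $mid"
        fi
    done
}
```

The output would then be piped on with `emit_batch | notmuch tag --batch`. The
tags still attach to the message id and not to an individual file, so copies
sharing an id cannot receive different tags this way.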

Your instinct to use batch tagging and id: queries is correct. I collect my new
message ids in /tmp/msgs. These ids are unique, they are definitely unique
enough to be used to tag individual messages on a daily basis.

I'm sorry, but either they're unique or not. What's unique enough? I'm
pestering on this because part of the OP's problem is being clear
about the notion of message, which is uniquely identified by a message
id in the notmuch db. I tried to clear that up in my previous answer
in this thread.


> It might be useful for the reasons I stated, namely in case the Message-ID
> does not exist or is not unique.

I think mail that is successfully transmitted through a mail host necessarily
obtains a message id, but I might be wrong. I believe notmuch indexes on both
its own unique thread ids and the message ids, thereby further decreasing the
already minuscule chance of message id collisions.

No. Messages can arrive without mid. In that case, notmuch creates one
(without altering the message file) and uses it for indexing.
A "thread-id" is something completely different from a message-id. It
does not identify a message uniquely, but rather a thread of messages
"joined" by references, and even that only indirectly (such as "root
message of the thread", assuming one root).

Cheers
Michael

_______________________________________________
notmuch mailing list -- notmuch@notmuchmail.org
To unsubscribe send an email to notmuch-le...@notmuchmail.org
