Optimise message deduplication
Multiple options (could be combined):
- Store the last seen Message-ID, UID, and UIDVALIDITY to detect whether new mails have arrived (as outlined by RFC 3501). If the UIDVALIDITY has not changed, any mail with a UID lower than or equal to the last seen one can be disregarded as already processed. That would remove the need to keep a log of every UID we have ever processed (in combination with keeping the Message-IDs).
- Store Message-IDs along with the recipients to whom that mail has been delivered. Furthermore, anonymisation / pseudonymisation could be achieved by hashing the Message-ID and the recipient's address separately.
- Come up with a better-performing storage scheme than the one we have right now. Both the UID file and the Message-ID file will grow indefinitely, increasing lookup times. A storage scheme like the following could (maybe, I am not an expert in this) perform better:
-<Message-ID Domain or Fallback (possibly hashed)>/
--<first char of Non-Domain part of the Message ID (possibly hashed)>/
---<Non-Domain part of the Message ID (possibly hashed)>
The <Non-Domain part of the Message ID (possibly hashed)> file would contain either all recipients (possibly hashed) for which this message has been processed, separated by newlines, or, if a balance between file count and entries per file is to be achieved, one line per delivery holding the (hashed) Message-ID and the (hashed) recipient, separated by e.g. a comma. In the latter case, multiple lines for the same Message-ID may exist in one file, but each combination of Message-ID and recipient must not occur more than once.
For the second variant (several Message-IDs per file), the FS structure could look like this:
-<Message-ID Domain or Fallback (possibly hashed)>/
--<first two characters of Non-Domain part of the Message ID (possibly hashed)>
The Message-ID domain level could of course also be tiered.
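The first option from the list (UID / UIDVALIDITY tracking) could be sketched roughly as follows. The names `MailboxState` and `uids_to_process` are made up for illustration, and the sketch assumes the caller has already obtained the current UIDVALIDITY and a list of candidate UIDs from the server.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class MailboxState:
    uidvalidity: int  # UIDVALIDITY reported when last_uid was recorded
    last_uid: int     # highest UID processed so far


def uids_to_process(
    state: Optional[MailboxState],
    uidvalidity: int,
    uids: Iterable[int],
) -> Tuple[List[int], bool]:
    """Return (UIDs to look at, True if the stored UID watermark was usable)."""
    if state is None or state.uidvalidity != uidvalidity:
        # UIDs may have been reassigned: check every mail and rely on the
        # Message-ID based deduplication instead.
        return sorted(uids), False
    # Same UIDVALIDITY: everything up to and including last_uid is already done.
    return sorted(u for u in uids if u > state.last_uid), True
```

The explicit `u > state.last_uid` filter is deliberate: as far as I understand RFC 3501, a `UID SEARCH UID <last_uid+1>:*` still returns the newest mail when nothing new has arrived, because `*` stands for the highest UID in use and the range is order-independent.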