How to mark email as "Ham" (to train Fastmail spam filter)?

hydrostarr · 25 Jan 2022, 12:12 PM

Summary

In Fastmail:

1. How to mark email as "Ham"? (Seems less trustworthy without this.)
2. Can I apply large collections of (past) known-Ham and known-Spam emails to be "marked as Ham" and "marked as Spam" to best train Fastmail's spam filters?

Details

In Fastmail I can train its spam engine by moving email to the toplevel Spam IMAP folder. How can I train Fastmail to recognize email as "NOT spam" aka "Ham"?

Tuffmail.net had this feature, simply by copying any email I wanted to an Auto-Train/Ham IMAP folder (similar but the opposite purpose of Tuffmail's Auto-Train/Spam folder).

Does Fastmail provide some similar functionality?

Without it, I have much less trust in Fastmail's spam filter to do the right thing, i.e., ensure "Ham" is "not Spam filtered," which to me is just as important as, or maybe more important than, catching-and-filtering Spam.

Further: I still retain all my emails from Tuffmail (for ~15 yrs) that I manually marked as both Spam and Ham. Thousands of emails. I'd love to apply those to my Fastmail account engine. Is that feasible? What are my options (for procedural execution)? Are some procedures potentially more-efficient than others?

Terry · 25 Jan 2022, 01:02 PM

Yes, you can import your Tuffmail emails, but you would be limited to so many per day... also, surely that would put you over your storage limit.

15 years of emails that for me would be a nightmare.

hydrostarr · 25 Jan 2022, 01:44 PM

Quote:

Originally Posted by Terry

Yes, you can import your Tuffmail emails, but you would be limited to so many per day... also, surely that would put you over your storage limit.

15 years of emails that for me would be a nightmare.

I have already imported all of my past emails. It’s not an issue and it was easy to do.

BritTim · 25 Jan 2022, 06:46 PM

The way you quickly train the spam filter (and get your own personal Bayes database) with Fastmail is to set up two folders, specifically designed to facilitate identification of spam and ham. Once set up correctly, those folders are scanned daily for new messages. Once 200 spam and 200 ham messages have been used to train your Bayes database, that is used in place of the global Bayes. For details, read https://www.fastmail.help/hc/en-us/a...pam-protection

Note that it is most effective if you can use messages that you know were mischaracterised in the past.

hydrostarr · 26 Jan 2022, 03:15 PM

This is just the sort of thing I was seeking. I've configured my Fastmail account to take advantage of this. Thank you BritTim for the reference.

Question: does Fastmail offer any reporting output (like a "report log" emailed or stored somewhere) after it does its "daily batch process" for all the notSpam/Spam folder checks?

Quote:

Originally Posted by BritTim

The way you quickly train the spam filter (and get your own personal Bayes database) with Fastmail is to set up two folders, specifically designed to facilitate identification of spam and ham. Once set up correctly, those folders are scanned daily for new messages. Once 200 spam and 200 ham messages have been used to train your Bayes database, that is used in place of the global Bayes. For details, read https://www.fastmail.help/hc/en-us/a...pam-protection

Note that it is most effective if you can use messages that you know were mischaracterised in the past.

hydrostarr · 26 Jan 2022, 03:52 PM

Quote:

Originally Posted by hydrostarr

Question: does Fastmail offer any reporting output (like a "report log" emailed or stored somewhere) after it does its "daily batch process" for all the notSpam/Spam folder checks?

Answering my own question... this Settings web view:

Settings->Filters&Rules->Spam_protection->Advanced_settings

...has a "Personal spam filter" section (currently) at the bottom of the page, with the following counters:

Spam learned
Non-spam learned

Helpful. Does not offer any granular-reporting-detail output, but it's much (much) better than nothing. Helps me to see if Fastmail is actually doing any personal-spam/ham processing.

(I do not yet see this mentioned in Fastmail.help's docs; it would be good if it could be mentioned there in the future.)

BritTim · 26 Jan 2022, 03:54 PM

Quote:

Originally Posted by hydrostarr

This is just the sort of thing I was seeking. I've configured my Fastmail account to take advantage of this. Thank you BritTim for the reference.

Question: does Fastmail offer any reporting output (like a "report log" emailed or stored somewhere) after it does its "daily batch process" for all the notSpam/Spam folder checks?

No detailed feedback is available. All you have is the summary of learned spam and learned ham, together with whether your spam filtering uses the global Bayes or your personal Bayes database.

If you really want to investigate how any specific email was handled on receipt, in terms of spam filtering, several of the message headers (available with Actions->Show raw message) can help. As an example, these might include:

Quote:

X-Spam-known-sender: yes ("Address xxx@example.com in From header is in addressbook");
in-addressbook
X-Spam-sender-reputation: 500 (none)
X-Spam-score: 0.0
X-Spam-hits: ME_SENDERREP_NEUTRAL 0.001, RCVD_IN_DNSWL_NONE -0.0001,
RCVD_IN_MSPIKE_H3 0.001, RCVD_IN_MSPIKE_WL 0.001, RCVD_IN_PBL 0.001,
SPF_HELO_NONE 0.001, SPF_PASS -0.001, LANGUAGES unknown, BAYES_USED none,
SA_VERSION 3.4.2
X-Spam-source: IP='209.85.210.174', Host='mail-pf1-f174.google.com', Country='US',
FromHeader='com', MailFrom='com'
X-Spam-charsets:
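For anyone who wants to pull these headers out of saved raw messages, here is a minimal sketch in Python. It assumes the header names and the comma-separated `NAME value` layout shown in the sample above; the function name `spam_report` and the trimmed sample message are only for illustration, and real messages may carry headers this does not anticipate.

```python
from email import message_from_string

# Trimmed sample based on the headers quoted above (values are illustrative).
RAW = """\
X-Spam-known-sender: yes ("Address xxx@example.com in From header is in addressbook");
 in-addressbook
X-Spam-score: 0.0
X-Spam-hits: ME_SENDERREP_NEUTRAL 0.001, RCVD_IN_DNSWL_NONE -0.0001,
 SPF_PASS -0.001, LANGUAGES unknown, BAYES_USED none, SA_VERSION 3.4.2
Subject: example

body
"""


def spam_report(raw_message):
    """Return the spam score, the BAYES_USED value, and all X-Spam-hits entries."""
    msg = message_from_string(raw_message)
    # Collapse folded header lines into a single space-separated string.
    raw_hits = " ".join((msg.get("X-Spam-hits") or "").split())
    hits = {}
    for item in raw_hits.split(","):
        name, _, value = item.strip().partition(" ")
        if name:
            hits[name] = value
    return {
        "score": float(msg.get("X-Spam-score", "nan")),
        "bayes_used": hits.get("BAYES_USED"),
        "hits": hits,
    }


report = spam_report(RAW)
print(report["score"], report["bayes_used"])
```

The `BAYES_USED` entry is the part that indicates which Bayes database (if any) was applied when the message was scored.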

hydrostarr · 26 Jan 2022, 03:58 PM

Quote:

Originally Posted by BritTim

No detailed feedback is available. All you have is the summary of learned spam and learned ham, together with whether your spam filtering uses the global Bayes or your personal Bayes database.

Right, copy that, thanks. (I provided some details above. Our msg posts might have criss-crossed, fyi.)

Quote:

Originally Posted by BritTim

If you really want to investigate how any specific email was handled on receipt, in terms of spam filtering, several of the message headers (available with Actions->Show raw message) can help. As an example, these might include:

Yes, helpful, I have history/experience analyzing these headers. I also have a X-Spam-score: column displayed in Mozilla Thunderbird (via a "Spam Scores" add-on), which is helpful.

Thanks BritTim for all your help and feedback, super appreciated.

hydrostarr · 30 Jan 2022, 06:50 AM

Update: the Spam/non-Spam processing is going WAY too slow.

After 4 days the spam counters show 6k emails have been processed. That's a ~1.5k/per_day rate. And the counters suggest the daily rate may be _slowing down_. I currently have 120k marked-as-spam-and-not-spam emails in queue to process... and this will most-likely grow every day (possibly dramatically) as I mass-add emails to my "Ham / non-Spam" folder. This will take months at the current rate. I have created a ticket with Fastmail on this.
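The "months" estimate can be checked with a few lines of arithmetic. This sketch treats the 120k figure as the backlog still waiting and assumes the observed rate stays constant (the post suggests it may actually be dropping, so this is an optimistic bound); `days_to_clear` is a made-up helper name.

```python
import math


def days_to_clear(queued, processed, days_elapsed):
    """Days needed to work through `queued` messages at the observed daily rate."""
    rate = processed / days_elapsed  # messages per day
    return math.ceil(queued / rate)


rate_per_day = 6_000 / 4
days = days_to_clear(queued=120_000, processed=6_000, days_elapsed=4)
print(rate_per_day, days)  # 1500.0 80
```

Eighty days is a bit under three months, and that is before any further mail is added to the training folders.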

I doubt they'll have good answers, once they finally send me a meaningful reply. (I'm not yet having great Fastmail-tech-support experiences, fwiw.)

My current idea: run a bayesian-database-generating gizmo on one of my own machines/servers, give them the data, and have them insert said spam-bayes data into whatever mechanism they have. And I'll work to generate compatible data for the "import." However... I'll be surprised if I'll be able to get them to do this.

Fwiw, I have a Fastmail "Professional" membership (the biggest/baddest/most-expensive one they have). I keep asking them if there's ways I can give them more money for more service, features, performance on every tech-support topic I ask about. They have yet to take me up on the offer. Eventually I may come back to EmailDiscussions.com to see who might give me direct access to a smart/capable/authoritatively-enabled Fastmail development/tech-support/operations manager. Until then, I'll work the process a little more to see where I can get.

hydrostarr · 30 Jan 2022, 07:13 AM

fwiw: my team already has an automated-deployment rig setup for email servers (Ubuntu + Postfix + Dovecot) we routinely run for rebuilds of email servers for our teams internally (with NO external-SMTP/MX gateway in or out... which means it's an email server that serves internal-to-VPN-only connections... so we do not have to deal with security-attacks, spam, and the like).

Given this, we figure it's not terribly hard to add Apache SpamAssassin to a test server, run my above emails through it (for a) spam and b) not-spam "programming"), export the resulting database info somehow into a Fastmail-compatible "data package"... and try to get Fastmail to "import" this into their stuff (the last part I anticipate being the most-difficult obstacle).
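As a rough sketch of what the local half of that plan could look like, the Python below wraps SpamAssassin's `sa-learn` tool: train on one spam mbox and one ham mbox, then dump the Bayes data to a text file with `--backup`. The mbox and backup file names are hypothetical, it assumes `sa-learn` is installed on the test server, and nothing in this thread says Fastmail can or will accept such a dump; only the local training and export steps are shown.

```python
import subprocess


def sa_learn_argv(kind, mbox_path):
    """Build the sa-learn command line for training on one mbox file."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--" + kind, "--mbox", mbox_path]


def train_and_export(spam_mbox, ham_mbox, backup_path, run=subprocess.run):
    """Train the local Bayes database, then write a text backup of it."""
    run(sa_learn_argv("spam", spam_mbox), check=True)
    run(sa_learn_argv("ham", ham_mbox), check=True)
    with open(backup_path, "w") as out:
        # `sa-learn --backup` writes the Bayes data to standard output.
        run(["sa-learn", "--backup"], check=True, stdout=out)


# Example (hypothetical file names; requires sa-learn on the PATH):
# train_and_export("tuffmail-spam.mbox", "tuffmail-ham.mbox", "bayes-backup.txt")
print(sa_learn_argv("ham", "tuffmail-ham.mbox"))
```

A backup made this way can be loaded into another SpamAssassin instance with `sa-learn --restore`, which is why the export format itself is unlikely to be the hard part; getting a hosted provider to load it is.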

(We run lots of server apps that are new to us as development+test systems. It's part of our regular work projects. So we're not deterred, daunted, or "scared" by prospect of this. Especially for a one-off effort that we have no intention of running in production, and only for this one particular task.)

QUESTION:

Other than getting Fastmail to cooperate: does anyone see a problem with this line of thinking? And/or can anyone offer a better way/path to solve this overall issue (whether or not we generate our own bayesian spam/nonspam data)?

Quote:

Originally Posted by hydrostarr

The Spam/non-Spam processing is going WAY too slow.
[...]
My current idea: run a bayesian-database-generating gizmo on one of my own machines/servers, give them the data, and have them insert said spam-bayes data into whatever mechanism they have. And I'll work to generate compatible data for the "import." However... I'll be surprised if I'll be able to get them to do this.

BritTim · 30 Jan 2022, 11:13 AM

Quote:

Update: the Spam/non-Spam processing is going WAY too slow.

After 4 days the spam counters show 6k emails have been processed. That's a ~1.5k/per_day rate. And the counters suggest the daily rate may be _slowing down_. I currently have 120k marked-as-spam-and-not-spam emails in queue to process... and this will most-likely grow every day (possibly dramatically) as I mass-add emails to my "Ham / non-Spam" folder. This will take months at the current rate. I have created a ticket with Fastmail on this.

The global Bayes database already provides a good baseline. If well selected (i.e. emails that were originally mischaracterised) processing 1k spam and 1k ham to fine tune your personal Bayes will usually produce great results.

I hope Bill comes by and adds his own thoughts. He has tuned his own account so he can safely discard virtually all spam (no false positives) while allowing almost no spam to reach the Inbox.

I agree that it is important to spend some time to get this right, but careful selection is more important than throwing massive amounts of data at the problem.

hydrostarr · 30 Jan 2022, 12:05 PM

Thanks BritTim for your continued excellent feedback!

First a disclaimer about my long note below: I'm getting all wordy and lengthy here for one main reason: I asked Fastmail tech support to read and stay updated on this EMD thread. I have confidence that BritTim and others at EMD already get where I'm coming from without having to detail everything. I do not yet have that confidence with Fastmail tech support or their systems.

Second, fyi: I switched my team's email domains from Tuffmail.net to Fastmail.com in Dec 2021.

Comments on BritTim's excellent points:

Quote:

Originally Posted by BritTim

I agree that it is important to spend some time to get this right, but careful selection is more important than throwing massive amounts of data at the problem.

Roger that. My problem: I'm a _brand new_ Fastmail user. I have little to no Fastmail-generated "false positive or negative" emails (ie, emails that were originally mischaracterized).

What I do have is a *massive* number of emails (over ~14 years or so) that were categorized (many of them to undo the false positive/negative) over the years by me when Tuffmail.net hosted my email domains and service.

Further: I have little desire to take the time to figure out which email sets (from these 120k+ Tuffmail-spam-and-nonspam-trained emails) represent a better selection ("well-selected emails"), if that's what you mean. That's a huge effort (to select 1k ham and 1k spam emails from 120k+ total ham-and-spam emails), or so it seems (maybe I'm missing something? pls advise if I am).

It seems much easier for me to whip up my own temporary SpamAssassin server, process the existingly-categorized emails, and hand the resulting database over to Fastmail (if Fastmail is willing to do this). Further, I can rerun this paradigm whenever I get a large, new influx of new email characterizations (mostly to mark large sets of existing email folders from my Tuffmail days as "ham"/not-spam)... again if Fastmail is willing to play ball, or simply speed up their spam-processing a bit.

Quote:

The global Bayes database already provides a good baseline. If well selected (i.e. emails that were originally mischaracterised) processing 1k spam and 1k ham to fine tune your personal Bayes will usually produce great results.

<rant #1: Fastmail has eroded my trust in them, starting immediately with the first test cases I ran against their systems>

I'm not sure if it was Bayes related or something else, but there's been a potentially-big problem I've had with existing spam classifications (on Fastmail) and/or email-delivery delays... or something. The whole thing seems ambiguous (Fastmail tech thinks they have it under control; I do _not_ think that). This problem also happened almost immediately when I started testing my Fastmail-served domains (the first few emails I tested broke things and it's still not been "fixed"--it's been a baaaaaad experience). More on this later if the problems/symptoms remain relevant. There may be good explanations for this... or not. It depends. I've not yet decided. It's a deeper topic, not enough time for me to properly introduce and detail right now. (I have interacted with "level 2 tech support" at Fastmail on this. I'm not yet satisfied. They're doing their best to assist me, I'm sure.)

The point: this Fastmail experience of mine has put a big, fat question mark in my mind on the trustworthiness of the Fastmail mx/spam/whatever-is-going-on filters.

And since 2nd-level Fastmail tech support failed to tell me -precisely- what was going on, I do not trust their explanation. Their answer seemed flippant and possibly embedded with a tone suggesting I was an inexperienced user. And while I'm confident it was not their intent, I felt like they blew me off (subjective assessment, granted); this came after I waited over a week to get a response from their "senior tech." I truly appreciate that they are working to do their job the best that they can. Each tech handles hundreds to possibly thousands of these inquiries a week; they do not want to have to linger or spend any extra time on any point more than what's needed.

Instead, what I ask is that some manager at Fastmail recognizes that I'm a special user, and they need to get me on the phone with their smartest tech-operations/developer person they got. Please enable me to blast past all the bureaucracy and red tape. This will solve this issue with max efficiency and minimal fuss. I'm happy to pay whatever extra fees this incurs, within reason. (I've already maxed out the user account to 3 years of "Professional.") I've already offered these "extra payments."

Granted, I do recognize I'm a VERY hard-to-please customer with respect to these issues. I'm not Fastmail's average user. But I'm picky for what I think is a darn good reason: I want my email-communication systems to WORK and be reliable, else business and projects can fail. And I do not like to have to consistently revisit the question of "can I trust my email service provider to not throw my good email away." I want to kill the problem dead, once, and be done with it.

In my teams' computing worlds: there's no such thing as "mostly working." In high-level practical terms, it works or it fails.

Digital-computing systems can be treated this way if you design, test, and implement them correctly. I say this with confidence given decades of experience with all manner of implementations, whether or not the core technology was designed by my team or others. And we've designed some of the-most-complex-and-impactful technology ever built. Please do not "hand wave" over important points and details when trying to gain my trust with computing systems that you provide that effectively might be "eating my data" without my knowing it. (Again, I'm talking to you, Fastmail.)

</rant #1>

<rant #2 = comparing Tuffmail vs Fastmail filtering configurability... granted, not a fair comparison>

With Tuffmail.net: I had confidence that I knew exactly what was happening when John and Derek were running Tuffmail. eg, I knew _exactly_ which mx filter was running for every domain/email.address, because the Tuffmail management interface allowed me to program that entire configuration. I could look at the daily report log for _every single mx filter action_ and easily spot problems.

I also managed our own Sieve inbound scripts--it wasn't hard. (Fastmail's Sieve stuff seems harder; there's a more-complex existing configuration where it's less clear to me where I should input my Sieve programming, or not. Or maybe it's just "new" and I don't want to have to take the effort to figure out Fastmail's Sieve base config. ;-) ).

Tuffmail also allowed more-granular level control of the Bayesian spam filters (separate from the mx-level filters).

Sieve, spam-filter, mx configuration and logging, several other config options: all of this gave me tremendous confidence in Tuffmail's system behavior.

I do not yet have that confidence with Fastmail. The only filter-configuration control I seem to have is the "selected folder marked as spam or ham/non-spam" stuff on top of the "zero/small/medium/large"-ish spam-control radio buttons. Add this to the big, unexplained, "ghost" of a problem mentioned in my rant #1 (above).... and...

I'm hammering on this spam config--since it seems to be the only thing I can control with respect to filtering at Fastmail--to at least get it to the point where I'm more comfortable with its Bayesian spam-filtering and thus trying to trust my email service once again (now that I've switched from Tuffmail to Fastmail in December).

</rant #2>

In short, the bottom line: I'm not yet trusting the global Bayes data running at Fastmail.

Quote:

I hope Bill comes by and adds his own thoughts. He has tuned his own account so he can safely discard virtually all spam (no false positives) while allowing almost no spam to reach the Inbox.

That sounds quite interesting, I too hope to hear from Bill. :-)

JeremyNicoll · 30 Jan 2022, 10:02 PM

Quote:

Originally Posted by hydrostarr

And I do not like to have to consistently revisit the question of "can I trust my email service provider to not throw my good email away." I want to kill the problem dead, once, and be done with it.

I don't think anyone can TOTALLY trust Bayesian spam filtering, which means you always have manually to check anything flagged as spam. That being so, I don't use it. I do though have hundreds of addresses for my incoming mails, one per company / person I deal with, and hundreds of filters matching those addresses. So pretty much any mail that fails to match on (specific incoming address & matching characteristics of the sender) combination is dubious.
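The one-address-per-correspondent approach boils down to a lookup: each incoming address has an expected sender, and anything that does not match the pair is suspect. A minimal Python sketch of that check follows; the addresses, the `EXPECTED` table, and the `is_dubious` name are all invented for illustration, and in practice the matching would live in mail rules rather than a script.

```python
from email.utils import parseaddr

# Hypothetical table: my per-correspondent address -> sender domain expected on it.
EXPECTED = {
    "bank@me.example": "bank.example",
    "shop@me.example": "shop.example",
}


def is_dubious(to_header, from_header):
    """True if the (incoming address, sender) combination is not a known pair."""
    to_addr = parseaddr(to_header)[1].lower()
    sender_domain = parseaddr(from_header)[1].lower().rpartition("@")[2]
    expected = EXPECTED.get(to_addr)
    if expected is None:
        return True  # mail to an address that was never handed out
    # Accept the expected domain itself or any subdomain of it.
    return not (sender_domain == expected or sender_domain.endswith("." + expected))


print(is_dubious("bank@me.example", "Alerts <alerts@bank.example>"))  # False
print(is_dubious("bank@me.example", "Promo <win@spammer.example>"))   # True
```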

I see quite a lot of mail-list mails where previous mails in a thread have been flagged as spam and thus eg Subject-line tags remain in all following replies, and for these it's always clear that someone-else's system misclassified an earlier mail.

Quote:

Originally Posted by hydrostarr

I also managed our own Sieve inbound scripts--it wasn't hard. (Fastmail's Sieve stuff seems harder; there's a more-complex existing configuration where it's less clear to me where I should input my Sieve programming, or not. Or maybe it's just "new" and I don't want to have to take the effort to figure out Fastmail's Sieve base config. ;-) ).

I'd always put my Sieve code ahead of all (apart from the requires) of FM's generated stuff.

But their generator is pretty good ... provided that when setting conditions up, you click the "switch to no-preview rules" option.

I find it irritating that one can't insert a new rule where one wants; instead one lets their system add it then you have to find it and drag it to where you want it. And although I asked for this ages ago, they still don't allow one to clone an existing rule, which (especially if it got defined initially right next to the one it's based on) would save a colossal amount of time when one wants to set up several very similar rules. Maybe creative use of rule export/import would help for this, but I see that they export as JSON so I'd need to be certain I could manipulate those files correctly.

I have a lot of rules, set up in groups of related types of conditions, and between those groups I define dummy rules as a way of inserting comments in the list, eg before the rules for incoming mails from mail-lists hosted by groups.io, I have a rule

if mailing list id is exactly "-------------------------------------------------------- LISTS (groups.io - others)" then move mail to Inbox

(which of course is extremely unlikely ever to happen), but the point is that the literal

"-------------------------------------------------------- LISTS (groups.io - others)"

stands out visually in the list of rules as one scrolls through it.

FM did also add a feature which I requested a while ago, which is nicknames for rules. IIRC I did this because especially with regex ones it's impossible to glance at them and know what they do, and with rules which have lots of conditions in them their scrolling list of rules only shows the lefthand part of the whole rule. The "nickname" can be a whole line of text and so it too can contain a sensible comment. One could make one's "nicknames" have structured/meaningful contents. For example one of my "nickname"s is

ID posts with no URLs in plain text part, ie 'added' etc preceded by someone's name (not end-span tag).

and that whole line of explanation is displayed in the scrolling list of rules.

hydrostarr · 31 Jan 2022, 02:47 AM

@JeremyNicoll - I find your above comments incredibly useful. I'm digesting them carefully. Thank you for taking the time to write up your thoughts.

hadaso · 31 Jan 2022, 05:50 AM

Quote:

Originally Posted by hydrostarr

What I do have is a *massive* number of emails (over ~14 years or so) that were categorized (many of them to undo the false positive/negative) over the years ...

Probably almost all of this stuff is totally useless for detecting new incoming spam. 2022 spam doesn't look like 2008 spam nor like 2018 spam. Almost all of the spammers who sent this have probably been out of business for years, and new ones are in the spam business now. Ham also changes over time.

I have a huge spam collection, some of it dating almost 20 years ago, but I don't think it's worth anything as training material for a spam filter. My spam collection is as useful as my wife's stamp collection. Perhaps in the future some of it would be worth something... I have some rare spam from exotic senders...
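If recency is what matters, one cheap way to cut a large archive down before training is to keep only messages newer than some cutoff. The sketch below does that on raw message strings using the `Date` header; the 365-day default is an arbitrary assumption, not a recommendation from anyone in this thread, and `recent_messages` is a made-up name.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def recent_messages(raw_messages, now, max_age_days=365):
    """Keep only messages whose Date header falls within max_age_days of `now`."""
    cutoff = now - timedelta(days=max_age_days)
    kept = []
    for raw in raw_messages:
        date_header = message_from_string(raw).get("Date")
        if not date_header:
            continue  # no usable date: skip rather than guess
        try:
            sent = parsedate_to_datetime(date_header)
        except (TypeError, ValueError):
            continue  # unparseable date
        if sent.tzinfo is None:
            sent = sent.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
        if sent >= cutoff:
            kept.append(raw)
    return kept


old = "Date: Mon, 14 Jan 2008 10:00:00 +0000\nSubject: old\n\nx\n"
new = "Date: Thu, 20 Jan 2022 10:00:00 +0000\nSubject: new\n\nx\n"
now = datetime(2022, 1, 31, tzinfo=timezone.utc)
print(len(recent_messages([old, new], now)))  # 1
```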
