EmailDiscussions.com  

Old 13 Jul 2007, 06:47 PM   #1
robmueller
Intergalactic Postmaster
 
Join Date: Oct 2001
Location: Melbourne, Australia
Posts: 6,102

Representative of:
Fastmail.FM
Per-user bayes db rolled out

The per-user bayes db work has been rolled out to all users now. Yay!

There were a few glitches while rolling out the second server, so for about 1 hour a bunch of emails weren't using any bayes db and would have had the "BAYES_USED none" X-Spam-hit, but that should all be fixed now.

There was also a period immediately after the roll out where the new databases were being fetched that caused the spam servers to get vastly overloaded, resulting in a half-hour delay on all delivered email that lasted about an hour. I blogged about it on the status blog. Everything seems to have been fine since then.

If you look at a recently delivered email and check the X-Spam-hits header, you should see one of these two hits:

BAYES_USED global

or

BAYES_USED user

Which tells you whether the global or per-user bayes db was used. Currently for the per-user one to be used, you must have reported at least 200 spams and 200 non-spams.
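
A minimal sketch of checking this header programmatically, in Python. The comma-separated "NAME value" layout is assumed from the X-Spam-hits line quoted later in this thread; the function names parse_spam_hits and bayes_db_used are hypothetical and not part of any FastMail tool.

```python
from email import message_from_string


def parse_spam_hits(header_value):
    """Split an X-Spam-hits value into a {rule_name: value} dict.

    Assumes entries are comma-separated and each entry is a rule name
    optionally followed by one value (a score, or user/global/none).
    """
    hits = {}
    for item in header_value.split(","):
        parts = item.split()
        if not parts:
            continue
        hits[parts[0]] = parts[1] if len(parts) > 1 else ""
    return hits


def bayes_db_used(raw_message):
    """Return the BAYES_USED value of a raw message, or None if absent."""
    msg = message_from_string(raw_message)
    header = msg.get("X-Spam-hits")
    if header is None:
        return None
    return parse_spam_hits(header).get("BAYES_USED")


# Header line copied from a post further down this thread.
sample = (
    "X-Spam-hits: AXB_XMID_1212 3.496, BAYES_00 -1.3, BAYES_USED user\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
print(bayes_db_used(sample))  # prints "user"
```

A result of None means the header or the BAYES_USED entry was missing; the value "none" would correspond to the glitch period described above.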

Once I have the new folders screen ironed out, I'll be adding an option so you can specify folders as "spam" or "not-spam" folders, and emails in those folders will regularly be scanned and learned as specified to your database. In the meantime, if you're getting "BAYES_USED global", you might want to just report a bunch of non-spam messages as "non-spam" and spam messages as "spam" via the web interface. Note that after reporting, there is a cache period before the updated db is used, so it may take up to 15 minutes after learning before any change is propagated to the spam checking servers.

It would be nice to show somewhere how many spams/non-spams are actually in your bayes database, but unfortunately getting hold of that information isn't as easy as I'd like, so I can't add it easily at the moment. You'll just have to report spam/not-spam till the "BAYES_USED user" spam hit appears.

Rob

Old 13 Jul 2007, 07:59 PM   #2
eggman
Essential Contributor
 
Join Date: Jun 2002
Location: AU
Posts: 471
Hi Rob

Thanks for the per user db.

I took my archive of spam (approx 3000 messages) and clicked "report as spam" in the hope that this would "train" the db.

I expected these would be moved to trash as per the FAQ:
Quote:
Report spam - Marks the selected emails as spam in the internal database to train the spam filter on and then moves the email to Trash
I thought they had been permanently deleted; however, these have been moved to Inbox.junk mail.

Can you update the faq?
http://www.fastmail.fm/docs/faqparts....htm#MbxAction

I will keep an eye out to see if the bayes is working for me.

cheers Eggman
Old 13 Jul 2007, 09:34 PM   #3
nighthawk700
Essential Contributor
 
Join Date: Oct 2004
Location: Baltimore, MD Suburbs (US)
Posts: 237
Quote:
Originally Posted by robmueller View Post
Which tells you whether the global or per-user bayes db was used. Currently for the per-user one to be used, you must have reported at least 200 spams and 200 non-spams.
Does this mean we need to have 200 spam messages that are delivered to our regular inbox, that we then mark as spam? Or is there a way to confirm the spam in our Junk folder really is spam? Same with reporting non-spam. Because at the current rate, it'll be a long time before I can report 200 spam if I need to report them from non-spam folders, and even longer to report 200 non-spam in the Junk box. But then again, that probably means that the system in place is working well enough as it is...
Old 13 Jul 2007, 10:37 PM   #4
Berenburger
The "e" in e-mail
 
Join Date: Sep 2004
Location: The Netherlands
Posts: 2,908
Quote:
Originally Posted by robmueller View Post
If you look at a recently delivered email and check the X-Spam-hits header, you should either see a hit:
BAYES_USED global
or
BAYES_USED user
Which tells you whether the global or per-user bayes db was used. Currently for the per-user one to be used, you must have reported at least 200 spams and 200 non-spams.
Mine is BAYES_USED user, but I know for sure that I didn't report 200 spams AND 200 non-spams.
I do have 257 mails in my Junk mail folder since the year 2005. Is this maybe counted in the user db?
Old 14 Jul 2007, 02:56 AM   #5
sflorack
The "e" in e-mail
 
Join Date: Feb 2002
Posts: 2,937
I've received only 7 spam messages in nearly five years. At that rate, I won't get to use this feature for another 142 years.
Old 14 Jul 2007, 06:22 AM   #6
hadaso
The "e" in e-mail
 
Join Date: Oct 2002
Location: Holon, Israel.
Posts: 4,858
Quote:
Originally Posted by sflorack View Post
I've received only 7 spam messages in nearly five years. At that rate, I won't get to use this feature for another 142 years.
Let's hope it stays this way for you!
Usually it's not linear. You have a clean email address and one day it starts getting a little spam. Then more and more until you cannot find your ham anymore...
Old 14 Jul 2007, 07:18 AM   #7
infoghost
Junior Member
 
Join Date: Sep 2005
Posts: 21
This is working great for me, finally all those stupid YOU HAVE WON!!! or CONTACT US NOW!!! spams are out of my inbox and into the junk folder. All that training paid off.

Keep up the great work guys!!
Old 14 Jul 2007, 07:49 AM   #8
Misha
Senior Member
 
Join Date: Nov 2004
Posts: 178
This is exciting! We've been waiting for per-user Bayes for a long time, and it's great to have it.

I trained the bayes db on my past three months' worth of email - all my ham and all my spam (I've been keeping all my mail, in part so I can train the Bayes DB effectively). With the current interface, that was a bit of work (everything's in multiple folders, etc), but not too bad.

Really anecdotally, it seems like the filter's working better now (no false positives or false negatives this afternoon, with nice high scores on all my spam and nice low scores on all my ham). But I guess it's too early to really know.

Thanks, Rob, for getting this working!!
Old 14 Jul 2007, 12:17 PM   #9
Misha
Senior Member
 
Join Date: Nov 2004
Posts: 178
What is this URIBL_WHITE spamassassin rule?

Hey. The new per-user bayes system seems to be working great, at least on my account!

One spam that did sneak through made it past, I think, because of a SpamAssassin rule called URIBL_WHITE.

I looked that up, and there doesn't seem to be any documentation of it anywhere, and most of the hits for it seem to be in fastmail-related threads.

Is it a custom rule here at fastmail? If so - it seems like it might be triggering false negatives...
Old 16 Jul 2007, 01:14 PM   #10
robmueller
Intergalactic Postmaster
 
Join Date: Oct 2001
Location: Melbourne, Australia
Posts: 6,102

Representative of:
Fastmail.FM
The per-user bayes db stuff seems to have been working well so far. I've tweaked a few things for the moment.

1. I removed the URIBL_WHITE rule. It was checking against white.uribl.com, but with the spam coming from hotmail/yahoo these days that includes taglines that link to their own services and are in the white list, it was causing too many false negatives
2. The BAYES_00 -> BAYES_40 scores had been significantly reduced (well, they're negative, so reduced in the sense of "halved") from the default values for a long time while we used the global bayes db. I've increased them (eg larger negative) some more again, though not to their full values. I'll look if I can use the reduced values for the global db, but the full values for a per-user db
3. A while back, I seeded a bunch of user databases by training emails from people's Inboxes and from their Junk Mail folders. That was for some testing on db size, though I never truncated those databases, so that's why some people might be seeing "BAYES_USED user" even though they haven't reported 200 spam + 200 not-spam. In general, these should actually be fine because most emails used for training would have been correct spam/not-spam, and any real problems can now be "corrected" by training the mistakes.
4. I'll update the FAQ docs; moving reported spam to Junk Mail is, I think, the right thing

If you currently see "BAYES_USED global", and you have a bunch of spam messages in your Junk Mail folder, and you have a bunch of non-spam messages in other folders, you can just report them as spam/not-spam as appropriate to get over the 200 of each threshold and enable the per-user database. It's worthwhile.
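
A rough sketch of the threshold rule as described in this thread, in Python. This is not FastMail's actual code: the function names and the idea of passing in report counts are assumptions for illustration, and seeded databases (point 3 above) can show "user" even when the counts reported by hand are lower.

```python
# From the posts above: at least 200 spams and 200 non-spams must be learned.
PER_USER_THRESHOLD = 200


def expected_bayes_db(spam_reported, ham_reported, threshold=PER_USER_THRESHOLD):
    """Return "user" if both counts reach the threshold, else "global"."""
    if spam_reported >= threshold and ham_reported >= threshold:
        return "user"
    return "global"


def still_needed(spam_reported, ham_reported, threshold=PER_USER_THRESHOLD):
    """Return (spam, non-spam) reports still missing before the per-user db applies."""
    return (
        max(0, threshold - spam_reported),
        max(0, threshold - ham_reported),
    )


# Example: 257 messages reported as spam, no non-spam reported yet.
print(expected_bayes_db(257, 0))  # prints "global"
print(still_needed(257, 0))       # prints (0, 200)
```

Both counts have to clear the threshold, so a large Junk Mail folder alone is not enough; an equally large batch of non-spam has to be reported too.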

Rob
Old 16 Jul 2007, 01:45 PM   #11
Misha
Senior Member
 
Join Date: Nov 2004
Posts: 178
Quote:
Originally Posted by robmueller View Post
The per-user bayes db stuff seems to have been working well so far. I've tweaked a few things for the moment.

1. I removed the URIBL_WHITE rule. It was checking against white.uribl.com, but with the spam coming from hotmail/yahoo these days that includes taglines that link to their own services and are in the white list, it was causing too many false negatives
Great!


Quote:
Originally Posted by robmueller View Post
2. The BAYES_00 -> BAYES_40 scores had been significantly reduced (well, they're negative, so reduced in the sense of "halved") from the default values for a long time while we used the global bayes db. I've increased them (eg larger negative) some more again, though not to their full values. I'll look if I can use the reduced values for the global db, but the full values for a per-user db
I wonder whether the existing default spamassassin setup might already anticipate this by, say, being less likely to assign a Bayes_00 score when working w. a GlobalDB. If that's the case, there might be no need to assign separate scores for GlobalDB and UserDB. This seems like it'd be the smart way of doing things, and Spamassassin's usually pretty smart.

(I'd personally love to see the default "high" (ie- "more negative") scores used. From what I've seen so far, the Bayes DB is doing an awesome job of identifying my ham...)

Quote:
Originally Posted by robmueller View Post
3. A while back, I seeded a bunch of user databases by training emails from people's Inboxes and from their Junk Mail folders. That was for some testing on db size, though I never truncated those databases, so that's why some people might be seeing "BAYES_USED user" even though they haven't reported 200 spam + 200 not-spam. In general, these should actually be fine because most emails used for training would have been correct spam/not-spam, and any real problems can now be "corrected" by training the mistakes.
It's not really a concern of mine, since I'm willing to do the work of hand-training my bayes DB. But I can't help thinking that for people who don't want to do that training, any attempt to guess at spam and ham (by, say, assuming there's no old spam in their inbox) seems to be asking for trouble. Maybe in these cases, spamassassin's built-in autolearn functionality would be a better solution.

Just my two cents.

Quote:
Originally Posted by robmueller View Post
If you currently see "BAYES_USED global", and you have a bunch of spam messages in your Junk Mail folder, and you have a bunch of non-spam messages in other folders, you can just report them as spam/not-spam as appropriate to get over the 200 of each threshold and enable the per-user database. It's worthwhile.

Rob
It really is worthwhile!

Rob, thanks again for getting per-user Bayes working! For me, this transforms Fastmail from an amazing service with one nagging problem into just a purely amazing service.
Old 20 Jul 2007, 09:35 AM   #12
GeraldR
Essential Contributor
 
Join Date: Apr 2007
Location: Canada
Posts: 227
How long to keep in Junk Mail folder

How long does spam need to be kept in the Junk Mail folder for it to be learnt as spam?
Old 20 Jul 2007, 03:59 PM   #13
Jeremy Howard
Ultimate Contributor
 
Join Date: Sep 2001
Location: Australia
Posts: 11,501
Quote:
Originally Posted by GeraldR View Post
How long does spam need to be kept in the Junk Mail folder for it to be learnt as spam?
When you click the 'Empty' link next to the Junk Mail folder it's learnt as spam. This will happen automatically at some point in the future - there's already code to scan folders and learn stuff left in Junk Mail for a while as spam, and other (not Inbox/Trash/Sent) as ham, but it's awaiting a UI to allow users to customise it.

For now, you'll need to select the contents of a large (>200 messages) folder and click "Report non-spam", and empty your Junk Mail when it's got >200 messages in it, to get your user Bayes DB initialised.
Old 21 Jul 2007, 12:08 AM   #14
GeraldR
Essential Contributor
 
Join Date: Apr 2007
Location: Canada
Posts: 227
Jeremy,

That is a good design. If it is learnt as spam upon the "Empty" operation, does this mean mail moved there via the IMAP interface also counts? I thought that only mail moved by the "Report Spam" operation was treated as spam.

I don't think all 200 spams have to be present at once. My headers say "BAYES_USED user" and I don't think I ever had 200 at once.
Old 9 Aug 2007, 01:51 PM   #15
elvey
The "e" in e-mail
 
Join Date: Jan 2002
Location: San Francisco
Posts: 2,458
Ok, I've trained enough for the per-user bayes filter (PUBF) to kick in.

Just got this on an email from this forum software:

X-Spam-hits: AXB_XMID_1212 3.496, BAYES_00 -1.3, BAYES_USED user

(Yay! Nice work!)

Last edited by elvey : 9 Aug 2007 at 01:52 PM. Reason: already answered
All times are GMT +9. The time now is 07:54 PM.

 

Copyright EmailDiscussions.com 1998-2022. All Rights Reserved. Privacy Policy