|
FastMail Forum All posts relating to FastMail.FM should go here: suggestions, comments, requests for help, complaints, technical issues etc. |
|
13 Jul 2007, 06:47 PM | #1 |
Intergalactic Postmaster
Join Date: Oct 2001
Location: Melbourne, Australia
Posts: 6,102
Representative of:
Fastmail.FM |
Per-user bayes db rolled out
The per-user bayes db work has been rolled out to all users now. Yay!
There were a few glitches while rolling out the second server, so for about an hour a bunch of emails weren't using any bayes db and would have had the "BAYES_USED none" hit in the X-Spam-hits header, but that should all be fixed now. There was also a period immediately after the rollout, while the new databases were being fetched, when the spam servers got vastly overloaded, resulting in a half-hour delay on all delivered email that lasted about an hour. I blogged about it on the status blog. Everything seems to have been fine since then.

If you look at a recently delivered email and check the X-Spam-hits header, you should see one of these hits:

BAYES_USED global

or

BAYES_USED user

which tells you whether the global or per-user bayes db was used. Currently, for the per-user one to be used, you must have reported at least 200 spams and 200 non-spams.

Once I have the new folders screen ironed out, I'll be adding an option so you can specify folders as "spam" or "not-spam" folders, and emails in those folders will regularly be scanned and learned into your database as specified.

In the meantime, if you're getting "BAYES_USED global", you might want to just report a bunch of non-spam messages as "non-spam" and spam messages as "spam" via the web interface. Note that after reporting, there is a cache period before the updated db is used, so it may take up to 15 minutes after learning before any change is propagated to the spam checking servers.

It would be nice to show somewhere how many spams/non-spams are actually in your bayes database, but unfortunately getting hold of that information isn't as easy as I'd like, so I can't add it easily at the moment. You'll just have to report spam/not-spam till the "BAYES_USED user" spam hit appears.

Rob |
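For anyone who wants to check this on a batch of messages rather than eyeballing headers, here is a minimal sketch of reading the BAYES_USED hit out of an X-Spam-hits value. It assumes the header is a comma-separated list of "RULE value" pairs, which matches the sample header quoted later in this thread; the function names are made up for illustration and are not part of any FastMail tool.

```python
def parse_spam_hits(header_value):
    """Split an X-Spam-hits value into a {rule_name: value} dict.

    Assumes a comma-separated list of "RULE value" items, e.g.
    "BAYES_00 -1.3, BAYES_USED user". Values are kept as strings.
    """
    hits = {}
    for item in header_value.split(","):
        parts = item.split()
        if not parts:
            continue  # skip empty items such as a trailing comma
        name = parts[0]
        value = parts[1] if len(parts) > 1 else None
        hits[name] = value
    return hits


def bayes_db_used(header_value):
    """Return 'user', 'global', 'none', or None if there is no BAYES_USED hit."""
    return parse_spam_hits(header_value).get("BAYES_USED")


# Sample value in the format quoted further down the thread.
example = "AXB_XMID_1212 3.496, BAYES_00 -1.3, BAYES_USED user"
print(bayes_db_used(example))
```

To use it on real mail you would first pull the X-Spam-hits header out of each message (for example with Python's standard email module) and pass its value to bayes_db_used.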
13 Jul 2007, 07:59 PM | #2 | |
Essential Contributor
Join Date: Jun 2002
Location: AU
Posts: 471
|
Hi Rob
Thanks for the per user db. I took my archive of spam (approx 3000 messages) and clicked "report as spam" in the hope that this would "train" the db. I expected these would be moved to trash as per the FAQ: Quote:
Can you update the FAQ? http://www.fastmail.fm/docs/faqparts....htm#MbxAction
I will keep an eye out to see if the bayes is working for me.
cheers
Eggman |
|
13 Jul 2007, 09:34 PM | #3 |
Essential Contributor
Join Date: Oct 2004
Location: Baltimore, MD Suburbs (US)
Posts: 237
|
Does this mean we need to have 200 spam messages that are delivered to our regular inbox, that we then mark as spam? Or is there a way to confirm the spam in our Junk folder really is spam? Same with reporting non-spam. Because at the current rate, it'll be a long time before I can report 200 spam if I need to report them from non-spam folders, and even longer to report 200 non-spam in the Junk box. But then again, that probably means that the system in place is working well enough as it is...
|
13 Jul 2007, 10:37 PM | #4 | |
The "e" in e-mail
Join Date: Sep 2004
Location: The Netherlands
Posts: 2,908
|
Quote:
I have 257 mails in my Junk Mail folder, going back to 2005. Are these maybe counted in the user db? |
|
14 Jul 2007, 02:56 AM | #5 |
The "e" in e-mail
Join Date: Feb 2002
Posts: 2,937
|
I've received only 7 spam messages in nearly five years. At that rate, I won't get to use this feature for another 142 years.
|
14 Jul 2007, 06:22 AM | #6 | |
The "e" in e-mail
Join Date: Oct 2002
Location: Holon, Israel.
Posts: 4,858
|
Quote:
Usually it's not linear. You have a clean email address and one day it starts getting a little spam. Then more and more until you cannot find your ham anymore... |
|
14 Jul 2007, 07:18 AM | #7 |
Junior Member
Join Date: Sep 2005
Posts: 21
|
This is working great for me, finally all those stupid YOU HAVE WON!!! or CONTACT US NOW!!! spams are out of my inbox and into the junk folder. All that training paid off.
Keep up the great work guys!! |
14 Jul 2007, 07:49 AM | #8 |
Senior Member
Join Date: Nov 2004
Posts: 178
|
This is exciting! We've been waiting for per-user Bayes for a long time, and it's great to have it.
I trained the bayes db on my past three months' worth of email - all my ham and all my spam (I've been keeping all my mail, in part so I can train the Bayes DB effectively). With the current interface, that was a bit of work (everything's in multiple folders, etc.), but not too bad. Really anecdotally, it seems like the filter's working better now (no false positives or false negatives this afternoon, with nice high scores on all my spam and nice low scores on all my ham). But I guess it's too early to really know. Thanks, Rob, for getting this working!! |
14 Jul 2007, 12:17 PM | #9 |
Senior Member
Join Date: Nov 2004
Posts: 178
|
What is this URIBL_WHITE spamassassin rule?
Hey. The new per-user bayes system seems to be working great, at least on my account!
One spam that did sneak through made it past, I think, because of a SpamAssassin rule called URIBL_WHITE. I looked that up, and there doesn't seem to be any documentation of it anywhere, and most of the hits for it seem to be in FastMail-related threads. Is it a custom rule here at FastMail? If so - it seems like it might be triggering false negatives... |
16 Jul 2007, 01:14 PM | #10 |
Intergalactic Postmaster
Join Date: Oct 2001
Location: Melbourne, Australia
Posts: 6,102
Representative of:
Fastmail.FM |
The per-user bayes db stuff seems to have been working well so far. I've tweaked a few things for the moment.
1. I removed the URIBL_WHITE rule. It was checking against white.uribl.com, but with the spam coming from hotmail/yahoo these days including taglines that link to their own services, which are in the white list, it was causing too many false negatives.

2. The BAYES_00 -> BAYES_40 scores had been significantly reduced (well, they're negative, so reduced in the sense of "halved") from the default values for a long time while we used the global bayes db. I've increased them (i.e. made them more negative) some more again, though not to their full values. I'll see whether I can use the reduced values for the global db, but the full values for a per-user db.

3. A while back, I seeded a bunch of user databases by training emails from people's Inboxes and from their Junk Mail folders. That was for some testing on db size, though I never truncated those databases, so that's why some people might be seeing "BAYES_USED user" even though they haven't reported 200 spam + 200 not-spam. In general, these should actually be fine because most emails used for training would have been correct spam/not-spam, and any real problems can now be "corrected" by training the mistakes.

4. I'll update the FAQ docs; moving to Junk Mail is, I think, the right thing.

If you currently see "BAYES_USED global", and you have a bunch of spam messages in your Junk Mail folder, and you have a bunch of non-spam messages in other folders, you can just report them as spam/not-spam as appropriate to get over the 200-of-each threshold and enable the per-user database. It's worthwhile.

Rob |
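To make the threshold and the score idea in point 2 concrete, here is a rough sketch in Python. It is only an illustration of the rules as described in this thread, not the actual FastMail code: the function names, the 0.5 scale factor (taken from the "halved" remark) and the example score of -2.0 are all assumptions, and the per-db scoring is something Rob says he will look into, not something confirmed as implemented.

```python
# Thresholds stated in the thread: 200 spams and 200 non-spams.
MIN_SPAM = 200
MIN_HAM = 200


def choose_bayes_db(spam_learned, ham_learned):
    """Pick which bayes db would be used for a given user.

    The counts are messages learned into the user's db. Seeded
    databases (point 3 above) can already exceed the threshold even
    if the user has never reported anything.
    """
    if spam_learned >= MIN_SPAM and ham_learned >= MIN_HAM:
        return "user"
    return "global"


def bayes_rule_score(full_score, db, global_scale=0.5):
    """Hypothetical scoring: full score for a per-user db, reduced for global.

    global_scale=0.5 mirrors the "halved" description; the real
    values used on the servers are not given in this thread.
    """
    if db == "user":
        return full_score
    return full_score * global_scale


# Illustrative only: -2.0 is a placeholder, not a real SpamAssassin default.
db = choose_bayes_db(spam_learned=257, ham_learned=150)
print(db, bayes_rule_score(-2.0, db))
```

In this example the user has enough spam but not enough non-spam, so the global db would still be chosen and the placeholder score is scaled down.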
16 Jul 2007, 01:45 PM | #11 | ||||
Senior Member
Join Date: Nov 2004
Posts: 178
|
Quote:
Quote:
(I'd personally love to see the default "high" (ie- "more negative") scores used. From what I've seen so far, the Bayes DB is doing an awesome job of identifying my ham...) Quote:
Just my two cents. Quote:
Rob, thanks again for getting per-user Bayes working! For me, this transforms Fastmail from an amazing service with one nagging problem into just a purely amazing service. |
||||
20 Jul 2007, 09:35 AM | #12 |
Essential Contributor
Join Date: Apr 2007
Location: Canada
Posts: 227
|
How long to keep in Junk Mail folder
How long does spam need to be kept in the Junk Mail folder for it to be learnt as spam?
|
20 Jul 2007, 03:59 PM | #13 | |
Ultimate Contributor
Join Date: Sep 2001
Location: Australia
Posts: 11,501
|
Quote:
For now, you'll need to select the contents of a large (>200 messages) folder and click "Report non-spam", and empty your Junk Mail when it's got >200 messages in it, to get your user Bayes DB initialised. |
|
21 Jul 2007, 12:08 AM | #14 |
Essential Contributor
Join Date: Apr 2007
Location: Canada
Posts: 227
|
Jeremy,
That is a good design. If it is learnt as spam upon the "Empty" operation does this mean mail moved there by the IMAP interface also counts? I thought that only mail moved by the "Report Spam" operation was treated as spam. I don't think all 200 spams have to be present at once. My headers say "BAYES_USED user" and I don't think I ever had 200 at once. |
9 Aug 2007, 01:51 PM | #15 |
The "e" in e-mail
Join Date: Jan 2002
Location: San Francisco
Posts: 2,458
|
Ok, I've trained enough for the per-user bayes filter (PUBF) to kick in.
Just got this on an email from this forum software:
X-Spam-hits: AXB_XMID_1212 3.496, BAYES_00 -1.3, BAYES_USED user
(Yay! Nice work!)
Last edited by elvey : 9 Aug 2007 at 01:52 PM. Reason: already answered |