EmailDiscussions.com - SpamAssassin autolearn=ham?

Page 1 of 2

carverrn

4 May 2004 06:37 AM

SpamAssassin autolearn=ham?

What does it mean if the SpamAssassin header says "autolearn=ham"? I noticed that the SpamAssassin headers of a spam message that slipped through said "autolearn=ham".

Regards,
Rich

BKB	4 May 2004 07:10 AM

Hi Rich,
Just googled it. Here is what I got from this page:

http://spamassassin.rediris.es/doc/s...20for%20ham%20

DEFAULT TAGGING FOR HAM (NON-SPAM) MAILS
X-Spam-Status: header
A string, No, hits=nn required=nn tests=xxx,xxx autolearn=(ham|spam|no) is set in this header to reflect the filter status.
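For anyone who wants to pull these fields out of a message programmatically, here is a minimal Python sketch that parses the header layout quoted above. The function name `parse_spam_status` is made up for illustration, and the handling of a literal `none` in the tests list is an assumption about what some versions write when no rule fired.

```python
import re

# Matches the layout quoted above:
#   No, hits=nn required=nn tests=xxx,xxx autolearn=(ham|spam|no)
# Any word is accepted after autolearn= so that unexpected values do not
# make the whole header unparseable.
STATUS_RE = re.compile(
    r"^(?P<verdict>Yes|No),\s*"
    r"hits=(?P<hits>-?\d+(?:\.\d+)?)\s+"
    r"required=(?P<required>-?\d+(?:\.\d+)?)\s+"
    r"tests=(?P<tests>.*?)\s*"
    r"autolearn=(?P<autolearn>\w+)"
)

def parse_spam_status(value):
    """Parse the value of an X-Spam-Status header into a dict, or None."""
    # Long headers are folded over several lines; collapse the whitespace.
    flat = " ".join(value.split())
    m = STATUS_RE.match(flat)
    if m is None:
        return None
    # Assumption: "none" is a placeholder for "no tests hit", not a rule name.
    tests = [t.strip() for t in m.group("tests").split(",")]
    tests = [t for t in tests if t and t != "none"]
    return {
        "is_spam": m.group("verdict") == "Yes",
        "hits": float(m.group("hits")),
        "required": float(m.group("required")),
        "tests": tests,
        "autolearn": m.group("autolearn"),
    }
```

Pass it everything after `X-Spam-Status:`; it returns `None` when the value does not follow this layout.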

Looks like Ham is the opposite of Spam. Does that answer your question?

BKB.

carverrn

4 May 2004 08:41 AM

Thanks. I guess I'm more curious about why it was "ham" and not "no" like it usually is?

Rich

jbs	4 May 2004 01:04 PM

I vaguely recall that Ham is the email you explicitly designate as being good, wanted mail. It is the opposite of spam, but more specifically it's the content you use to train a Bayesian filter to recognize the mail that you want.

I wonder if the SpamAssassin filter is saying that that particular message is not just "Not Spam" but even above and beyond that it closely matches a message which it's been explicitly told is a good message.

Would that make sense? Was this a message that someone somewhere could have actually wanted? Or is it potentially a sign of someone trying to trick SA into thinking this was a good message?

--Jason

carverrn

4 May 2004 03:41 PM

The message was SPAM that was not flagged as SPAM by SpamAssassin. What I don't understand is why "autolearn" was set to "ham". As far as I know Runbox isn't using the "autolearn" function yet. That's why it should have said "autolearn=no".

Rich

carverrn

5 May 2004 09:48 PM

Jason, you were right about the autolearn. I found some details at the SpamAssassin Wiki:

Why isn't autolearning working for me? (aka: "autolearn=no")

If it says "autolearn=no" then it means that SpamAssassin has not learned whether or not the message is spam or ham. If it says "autolearn=ham", then SpamAssassin has been trained to recognize the message as "ham". If it says "autolearn=spam", then SpamAssassin has been trained to recognize the message as "spam".

I have a message that is 100% SPAM yet SpamAssassin has tagged it with "autolearn=ham" which means it's been trained to recognize the message as "ham".

My question for the Runbox crew is when did you start training SpamAssassin, who's training it and how do you "retrain" it when it's been trained wrong?

Regards,
Rich

tore	7 May 2004 06:30 PM

Quote:

Originally posted by carverrn

I have a message that is 100% SPAM yet SpamAssassin has tagged it with "autolearn=ham" which means it's been trained to recognize the message as "ham".

My question for the Runbox crew is when did you start training SpamAssassin, who's training it and how do you "retrain" it when it's been trained wrong?

SpamAssassin is training SpamAssassin. If the mail gets a very low score, it is learned as ham, and if it gets a very high score it is learned as spam. Else, the message is not learned at all. The score contributed by the built-in Bayesian classifier is ignored when deciding whether or not the mail should be learned.
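As a rough illustration of that rule, here is a simplified Python sketch, not SpamAssassin's actual code. The threshold numbers are illustrative assumptions (SpamAssassin exposes them as the `bayes_auto_learn_threshold_nonspam` and `bayes_auto_learn_threshold_spam` settings, and the values used on this system are not stated), the rule names other than the `BAYES_*` ones are made up, and how a score exactly on a threshold is treated is a guess.

```python
# Illustrative thresholds only; the real values depend on the configuration.
HAM_THRESHOLD = 0.1
SPAM_THRESHOLD = 12.0

def autolearn_decision(rule_scores, ham_threshold=HAM_THRESHOLD,
                       spam_threshold=SPAM_THRESHOLD):
    """Return 'ham', 'spam' or 'no' for a dict of {rule name: score}."""
    # Leave out the Bayes rules so the classifier is not trained on its
    # own verdict; only the other tests count towards the decision.
    score = sum(points for name, points in rule_scores.items()
                if not name.startswith("BAYES_"))
    if score < ham_threshold:
        return "ham"
    if score > spam_threshold:
        return "spam"
    return "no"

# A message whose only hit is a Bayes rule has a non-Bayes score of 0,
# which is below the ham threshold, so it would be learned as ham.
print(autolearn_decision({"BAYES_00": -4.9}))
```

This would explain how a spam message that trips no other tests can still end up tagged autolearn=ham: with the Bayes score ignored, nothing is left to push it above the ham threshold.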

The database is system-wide, and kept in memory only (it will have to be re-built if the system is rebooted). Today there is no way for you to "correct" any mail learned wrong or explicitly learn any mail that had autolearn=no.

Of course, this classifier is in no way as effective as an actively maintained per-user Bayesian filter would be. However, it was deemed to be better than nothing when we started using SpamAssassin, even though incorrect learnings such as the one you experienced may occur sometimes. Look for the BAYES_* tests in the spam report to see the effect the classifier has on scoring your incoming mails.

Geir	7 May 2004 09:52 PM

We are currently working to expand the spam filter with individual "intelligent" filtering (more specifically CRM-114), allowing a user to tell the filter when it has classified a message erroneously - thus teaching it to catch spam more accurately.

We're hoping to launch this within a couple of weeks.

- Geir

carverrn

7 May 2004 10:46 PM

Hi Tore,

Quote:

SpamAssassin is training SpamAssassin. If the mail gets a very low score, it is learned as ham, and if it gets a very high score it is learned as spam. Else, the message is not learned at all.

I kind of figured it was on auto-pilot.

Quote:

Today there is no way for you to "correct" any mail learned wrong or explicitly learn any mail that had autolearn=no.

What about the sa-learn tool mentioned in this SpamAssassin Wiki:

If a message has been learned incorrectly, what do I need to do to fix it?

Quote:

Look for the BAYES_* tests in the spam report to see the effect the classifier has on scoring your incoming mails.

Here's the SA information from the message:
X-Spam-Checker-Version: SpamAssassin 2.60 (1.212-2003-09-23-exp) on
    gaspode.runbox.com
X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham
    version=2.60
X-Spam-Level:

The message is for Vicodin (line reads "0,rder V~ic0-din 0nline Anytime"). It's all text (no HTML). About 50% of it is unrelated (part random words, part a joke about a freshman being tossed into a frog pond).

I guess the spammers are figuring out how to get past the Bayesian filters.

Rich

carverrn

7 May 2004 11:04 PM

Hi Geir,

Quote:

Originally posted by Geir
We are currently working to expand the spam filter with individual "intelligent" filtering (more specifically CRM-114), allowing a user to tell the filter when it has classified a message erroneously - thus teaching it to catch spam more accurately.

I don't know about others but an improved Whitelist would probably work better for me. I would say that 99% of my valid mail comes from people/addresses I know.

The ones that are valid that I don't know usually have other information in the headers I can key off of with filters.

I also like the "challenge message" ideas other services use.

Rich

tore	9 May 2004 02:13 AM

Quote:

Originally posted by carverrn

What about the sa-learn tool mentioned in this SpamAssassin Wiki:

We're well aware of that too. However, it's not that easy. First off, the web interface doesn't have any functionality to re-learn these messages today. Secondly, the "auto-pilot" bayesian databases aren't shared between the MXes - there would need to be a system to distribute the sa-learn invocation on all the MXes, too.

Of course - *I* could take the message in question and feed it to sa-learn on all the machines, but once that is necessary to make the Bayesian classifier work it's better to just disable the whole thing.
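To make the distribution problem concrete, here is a hedged Python sketch of that manual step. The host names are placeholders (the MX machines are not named in this thread), the helper `relearn_commands` is hypothetical, and it assumes the message has already been saved at the same path on every machine and that ssh access exists. It only builds the command lines and does not run anything.

```python
import shlex

# Placeholder host names; the real MX machines are not named here.
MX_HOSTS = ["mx1.example.com", "mx2.example.com"]

def relearn_commands(message_path, as_spam, hosts=MX_HOSTS):
    """Build one ssh command per MX that feeds a saved message to sa-learn."""
    # sa-learn takes --spam or --ham to say what the message should be.
    flag = "--spam" if as_spam else "--ham"
    # Quote the path because ssh hands the string to a remote shell.
    remote = "sa-learn {} {}".format(flag, shlex.quote(message_path))
    return [["ssh", host, remote] for host in hosts]

for cmd in relearn_commands("/tmp/missed-spam.eml", as_spam=True):
    print(" ".join(cmd))
```

Each resulting list could be handed to `subprocess.run(cmd, check=True)`. As far as I know, sa-learn re-learns a message that was previously learned as the other type, but that is worth checking against the sa-learn documentation for the version in use.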

Quote:

I guess the spammers are figuring out how to get past the Bayesian filters.

Yes. That's why the auto-pilot classifier isn't that efficient any longer and might well be disabled in the near future. The per-user bayesian databases you can correct in case of error should be much more efficient than the automatic one (IFF you do take care to correct it whenever it's made a mistake). That feature is on its way, I think.

tore	9 May 2004 02:36 AM

Quote:

Originally posted by carverrn

I also like the "challenge message" ideas other services use.

For what it's worth, I detest these. They place the burden of maintaining your white list on everyone *but* you. Have you ever received a stupid, unsolicited junk bounce from some mail system that is "kindly" informing you that you sent some user a virus, even if you're 110 percent certain you've done no such thing? Well - if C-R systems become as common as moronic virus filters, expect the amount of that junk mail to double. Same goes for spam sent from forged addresses.

A C-R system places the burden on the alleged sender. But as there is absolutely *no* way today to verify the authenticity of an e-mail's sender, a C-R system will direct its challenges to unrelated third parties - thus making them unsolicited, or in other words: spam. It's a very, very asocial, arrogant, and selfish way to deal with one's spam problems, IMNSHO.

That said - C-R *is* very effective at preventing spam. However, it'll also prevent legit e-mail, as I, and many others, refuse to reply to such challenges on principle alone. If you want to keep a white-list of senders, fine, but don't expect me (and every other e-mail-using individual on the planet) to maintain *your* whitelist *for* you.

Also see Karsten M. Self's thoughts on the subject, most of which I very much agree with.

(It is of course not my decision whether or not Runbox should implement such a feature.)

jedilizagain

9 May 2004 07:33 AM

SpamAssassin is failing to identify a lot of SPAM for me. And I'm having problems with setting up filters. One of them has the "Html tags" thing, and when mail goes through that filter, sometimes Runbox list emails end up in a "possibly spam" folder I have set up.

How do the -# filters work? Do those get looked at first?

carverrn

9 May 2004 01:17 PM

Hi Tore,

OK, you win. Maybe C-R isn't as good as it sounded.

Rich

carverrn

9 May 2004 01:33 PM

Quote:

Originally posted by jedilizagain
SpamAssassin is failing to identify a lot of SPAM for me. And I'm having problems with setting up filters. One of them has the "Html tags" thing, and when mail goes through that filter, sometimes Runbox list emails end up in a "possibly spam" folder I have set up.

How do the -# filters work? Do those get looked at first?

Runbox filter order values go from -99 to 999. With -99 as the highest/first filters and 999 as the lowest/last filters. If you have multiple filters of the same order value they are processed in the order the entries were defined. Basically, just look at your filter list and go from the top to the bottom.
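The ordering rule can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not Runbox's code. The `(order, name)` tuples and the function `processing_order` are invented for the example.

```python
def processing_order(filters):
    """Sort (order, name) pairs: lowest order value first, ties keep
    the order in which the filters were defined."""
    for order, _name in filters:
        if not -99 <= order <= 999:
            raise ValueError("order value out of range: {}".format(order))
    # sorted() is stable, so filters with the same order value stay in
    # their original (definition) order.
    return sorted(filters, key=lambda f: f[0])

defined = [(10, "mailing lists"), (-5, "whitelist"),
           (10, "html tags"), (999, "catch-all")]
for order, name in processing_order(defined):
    print(order, name)
```

So a filter with a negative order value does get looked at before the ones with positive values.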

SpamAssassin has been missing more and more spam lately. But I can see why it is being missed. For me, many of them are very small text-only messages with the spam message buried in the middle of random words or full paragraphs of unrelated text. It's hard for a program to flag these as spam, and they apparently haven't been reported to the traditional blacklists (at least SpamAssassin didn't show that in the scoring).

Rich

All times are GMT +9.
