|
FastMail Forum All posts relating to FastMail.FM should go here: suggestions, comments, requests for help, complaints, technical issues etc. |
14 Jan 2019, 05:40 AM | #1 |
Essential Contributor
Join Date: May 2018
Posts: 478
|
Testing for Russian spam
I got a Russian spam where the From name was in Cyrillic and it's slipping under my spam threshold. Since I'm a bit "obsessed" with sieve these days I thought I would write a test to check for Russian names. The following test condition works:
Code:
header :regex "From" "(^|,)[[:space:]]*\"?[^<]*[а-яА-ЯЁё]+[^\"<]*\"?[[:space:]]*<"
Or, with each letter enumerated:
Code:
header :regex "From" "(^|,)[[:space:]]*\"?[^<]*[аАбБвВгГдДеЕёЁжЖзЗиИйЙкКлЛмМнНоОпПрРсСтТуУфФхХцЦчЧшШщЩъЪыЫьЬэЭюЮяЯ]+[^\"<]*\"?[[:space:]]*<"
The actual full Unicode range for Cyrillic is 0x400-0x4ff, and I would much rather specify this hex range in the regex, but I cannot figure out how to do it. RFC5228, section 2.4.2.4 indicates I can write hex numbers like ${hex:400} or Unicode as ${unicode:400}. But if I attempt to write a regex range like [${unicode:400}-${unicode:4ff}] I get a syntax error. So is there some syntax that allows a hex or Unicode number range in a sieve regex, or am I "stuck" checking the explicit characters? |
14 Jan 2019, 06:22 AM | #2 |
Cornerstone of the Community
Join Date: Jun 2008
Location: Perth
Posts: 664
|
Looks interesting.
Can you explain that code that follows the "From"? |
14 Jan 2019, 09:07 AM | #3 |
Essential Contributor
Join Date: May 2018
Posts: 478
|
If I were to use the UI to generate an organize rule of the form:
The senderʼs name matches glob pattern abc
then the sieve code FM generates for the test is:
Code:
header :regex "From" "(^|,)[[:space:]]*\"?abc\"?[[:space:]]*<"
From: abc <foo@domain.tlc>
The pattern match for a header sieve command starts after the colon. So, starting after the From:, the pattern skips leading spaces (or spaces after a comma), then matches the text to look for, optionally enclosed in quotes (in case the name has spaces of its own), followed by any number of spaces before the email address, which starts after the '<'. Apparently there are cases where more than one name can be specified, comma separated. It's the only reason I can think of for looking for commas.
Using the UI organize rules avoids a lot of mistakes and saves time constructing these things. I always keep a disabled organize rule lying around just for this purpose.
I took this pattern and replaced the abc portion with a pattern match for one or more (the + sign) Cyrillic characters. In other words, [а-яА-ЯЁё]+ or all those characters enumerated. I would prefer to use the entire Unicode Cyrillic range, i.e., 0x400 to 0x4ff, so that was the reason for my question. I don't know if this is even syntactically possible in sieve regex. Certainly what I have tried so far isn't. I posted here hoping someone might know the magic syntax that works, if any.
Update: As I was writing that last paragraph I started wondering whether there are actually Unicode characters across that entire Cyrillic range. Looking at a Unicode table for Cyrillic (here), I discovered that there are. So [Ѐ-ӿ]+ should work. Not sure why the web pages I found through Google didn't show that. Maybe because they showed the hex range instead. I still want to know if I can do that.
Last edited by xyzzy : 14 Jan 2019 at 09:41 AM. |
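For reference, the pattern described in this post can be read piece by piece as in the sketch below. It is only an illustration: the folder name "Junk" is a hypothetical placeholder, and it assumes the server's regex engine treats [Ѐ-ӿ] as a range of UTF-8 characters (U+0400 to U+04FF), which can differ between Sieve implementations.

```sieve
# Sketch only: "Junk" is a hypothetical folder name, not one from the thread.
require ["regex", "fileinto"];

# Pattern pieces, left to right:
#   (^|,)           start of the header value, or a comma between senders
#   [[:space:]]*    optional whitespace
#   \"?             optional opening quote around the display name
#   [^<]*           any part of the name before the Cyrillic characters
#   [Ѐ-ӿ]+          one or more characters in U+0400..U+04FF
#   [^\"<]*         the rest of the display name
#   \"?             optional closing quote
#   [[:space:]]*<   optional whitespace, then the '<' that opens the address
if header :regex "From" "(^|,)[[:space:]]*\"?[^<]*[Ѐ-ӿ]+[^\"<]*\"?[[:space:]]*<" {
    fileinto "Junk";
}
```

Because the Cyrillic class sits between [^<]* and [^\"<]*, a display name that mixes Latin and Cyrillic characters should match as well as an all-Cyrillic one.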
14 Jan 2019, 11:02 AM | #4 |
Cornerstone of the Community
Join Date: Jun 2008
Location: Perth
Posts: 664
|
Thanks for sharing the methodology.
|
14 Jan 2019, 04:45 PM | #5 |
The "e" in e-mail
Join Date: May 2003
Location: mostly in Thailand
Posts: 3,095
|
Did you include
Code:
require "encoded-character"; |
14 Jan 2019, 05:39 PM | #6 |
Essential Contributor
Join Date: May 2018
Posts: 478
|
Thank you BritTim. Good catch! That was the missing piece of the puzzle. I didn't even notice that line in the example in the spec or the comment a little above about the require encoded-character. With it added this is the range format that appears to work (testing all this with Sieve Tester).
Code:
[${unicode:400}-${unicode:4ff}]+
Another reason why I assumed you couldn't add other extensions was that Sieve Tester keeps erroring out on the fcc extension, since it is not implemented in Sieve Tester itself. I wish FM would fix that, since I always need to delete that fcc when I copy/paste my script into there. Yes, I submitted a ticket on it some time ago. Sieve Tester is obviously not very high priority, since they think not many users actually write Sieve stuff. They're probably right too.
I think the reason Sieve Tester errors out on the require fcc but not on encoded-character is that encoded-character is part of the base Sieve standard (RFC5228) and I guess must be implemented, whereas fcc is not in the base standard. Again, thanks. |
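Putting the pieces of the thread together, a complete rule might look like the sketch below. The fileinto action and the "Junk" folder name are assumptions added for illustration; the thread itself only covers the test condition and the require line.

```sieve
# Sketch only: the action and the "Junk" folder name are hypothetical.
# "encoded-character" must be listed, otherwise ${unicode:...} is not
# expanded and the text is passed to the regex engine as written.
require ["regex", "encoded-character", "fileinto"];

# ${unicode:400} and ${unicode:4ff} expand to U+0400 and U+04FF before
# the regex is compiled, so the class covers the range 0x400-0x4ff.
if header :regex "From" "(^|,)[[:space:]]*\"?[^<]*[${unicode:400}-${unicode:4ff}]+[^\"<]*\"?[[:space:]]*<" {
    fileinto "Junk";
}
```

This may also explain the earlier syntax error: without the require, the bracket expression would contain the literal characters "}-$", which is likely rejected as a reversed range, though that depends on the regex engine in use.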