Re: FROM_STARTS_WITH_NUMS matches on text-to-email

Martin Gregorie Tue, 13 Apr 2010 02:22:21 -0700

On Mon, 2010-04-12 at 19:30 -0400, Jason Bertoch wrote:
> On 4/12/2010 4:58 PM, Martin Gregorie wrote:
> > I had quite a bit to do with phone numbers en masse a while back. My
> > initial reaction is that it's not easy: not only do phone numbers vary in
> > length between locales, but even such things as the 'international
> > dialing' and non-local-call prefix vary from country to country.
> That is certainly true with all phone numbers, but I suspect it's not 
> for cell phone numbers using text-to-email.
>
Presumably t2e goes through some sort of server where the user has an
account, in which case its number is likely to be a 'normal' mobile
number starting with the national prefix of the telco that the t2e
service subscribes to and, if the sender uses international roaming
much, this will in turn be prefixed by the international dial prefix
plus the country code.

The international dialing prefix can be stored in a mobile as '+' but
IIRC this is always translated into the international dialing prefix of
the locale where each call originates. In the UK that would be '00'. I
routinely store UK numbers in my mobile with a '+44' prefix so they'll
work no matter where I may be. When I dial, the prefix becomes 0044, which
will be used if I'm out of the UK and ignored if I'm here.
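
As a rough illustration of that '+' translation, here is a minimal
Python sketch. The function name expand_plus and its intl_prefix
parameter are made up for this example; '00' is the UK prefix mentioned
above, and a real handset or network may well handle this differently.

```python
def expand_plus(stored_number, intl_prefix="00"):
    """Replace a leading '+' with the local international dialing prefix.

    intl_prefix defaults to '00' (the UK prefix); pass another value
    for a different locale. Hypothetical helper, for illustration only.
    """
    if stored_number.startswith("+"):
        return intl_prefix + stored_number[1:]
    # Numbers stored without '+' are left exactly as they are.
    return stored_number
```

So a number stored as '+44...' would be dialed as '0044...' from the UK,
while a number stored without the '+' is passed through unchanged.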

Back to the problem: recognising a valid UK mobile number is a matter
of:
- if the prefix is 0044 this is an international dial code for the UK.
  Remove it.
- if the prefix is a single zero, remove it.
- if the next four digits are the dial code for a UK mobile telco and
  are followed by exactly six digits then it's a valid UK mobile number.
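
The three steps above could be sketched in Python roughly as follows.
This is only an outline under stated assumptions: the function name
is_uk_mobile is invented here, and EXAMPLE_UK_MOBILE_CODES is a
placeholder, not a real table of telco dial codes. A usable rule would
need the actual list of UK mobile prefixes.

```python
import re

# Placeholder set of four-digit UK mobile dial codes (leading zero
# already removed). Illustrative assumption only, not a real table.
EXAMPLE_UK_MOBILE_CODES = {"7700"}


def is_uk_mobile(user_id, mobile_codes=EXAMPLE_UK_MOBILE_CODES):
    """Return True if an all-numeric user ID looks like a UK mobile number."""
    # Only all-numeric user IDs are candidates.
    if not re.fullmatch(r"[0-9]+", user_id):
        return False

    number = user_id
    # Step 1: 0044 is the international dial code for the UK. Remove it.
    if number.startswith("0044"):
        number = number[4:]
    # Step 2: remove a single leading zero, if there is one.
    if number.startswith("0"):
        number = number[1:]
    # Step 3: a four-digit mobile dial code followed by exactly six
    # digits, i.e. ten digits in total.
    return len(number) == 10 and number[:4] in mobile_codes
```

With the placeholder table, both '00447700900123' and '07700900123'
pass, while an ID that is too short or has an unknown dial code does
not.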

That's what I meant by a country-specific pattern to validate the number.
To be safe you'd need equivalent rules for every country. All-number
user IDs that match any of these rules can be given a small negative
score. 

IMO you can't do more than that with an all-numeric user ID since we now
know that other all-numeric user IDs may be valid. 

Martin

>   I don't have any non-US 
> examples to verify against, but it really wouldn't make sense for 
> providers to use international dialing codes in this case...at least not 
> a huge variety at any rate.  I'm hoping that those in the non-US 
> community can contribute opinions.  Maybe this problem isn't as complex 
> as it initially sounds.
> 
> On 4/12/2010 5:57 PM, Ted Mittelstaedt wrote:
> > The fundamental flaw
> > here is in the assumption that an all-number mailbox user ID is 
> > virtually certain to be spam.  It is not.  Clearly, the default score 
> > assignment to that rule is too high. 
> 
> That could certainly be true and it may prove that doing the proposed 
> tests just isn't worth the CPU cycles.  Only a test against the corpus 
> will say with any degree of certainty.  Sadly, I don't have the perl 
> skills to make that judgment, hence my appeal to the community for 
> ideas, opinions, and possible code to test the theory.
> 
> /Jason
