On Mon, 2010-04-12 at 19:30 -0400, Jason Bertoch wrote:
> On 4/12/2010 4:58 PM, Martin Gregorie wrote:
> > I had quite a bit to do with phone numbers en masse a while back. My
> > initial reaction is that it's not easy: not only do phone numbers vary in
> > length between locales, but even such things as the 'international
> > dialing' and non-local-call prefix vary from country to country.
>
> That is certainly true with all phone numbers, but I suspect it's not
> for cell phone numbers using text-to-email.
>
Presumably t2e goes through some sort of server where the user has an
account, in which case its number is likely to be a 'normal' mobile
number starting with the national prefix of the telco that the t2e
service subscribes to and, if the sender uses international roaming
much, this will in turn be prefixed by the international dial prefix
plus the country code.

The international dialing prefix can be stored in a mobile as '+', but
IIRC this is always translated into the international dialing prefix of
the locale where each call originates. In the UK that would be '00'. I
routinely store UK numbers in my mobile with a '+44' prefix so they'll
work no matter where I may be. When I dial, the prefix becomes 0044,
which will be used if I'm out of the UK and ignored if I'm here.

Back to the problem: recognising a valid UK mobile number is a matter
of:

- if the prefix is 0044, this is the international dial code for the
  UK. Remove it.
- if the prefix is a single zero, remove it.
- if the next four digits are the dial code for a UK mobile telco and
  are followed by exactly six digits, then it's a valid UK mobile
  number.

That's what I meant by a country-specific pattern to validate the
number. To be safe you'd need equivalent rules for every country.

All-number user IDs that match any of these rules can be given a small
negative score. IMO you can't do more than that with an all-numeric
user ID since we now know that other all-numeric user IDs may be valid.

Martin

> I don't have any non-US
> examples to verify against, but it really wouldn't make sense for
> providers to use international dialing codes in this case...at least not
> a huge variety at any rate. I'm hoping that those in the non-US
> community can contribute opinions. Maybe this problem isn't as complex
> as it initially sounds.
>
> On 4/12/2010 5:57 PM, Ted Mittelstaedt wrote:
> > The fundamental flaw
> > here is in the assumption that an all-number mailbox user ID is
> > virtually certain to be spam. It is not. Clearly, the default score
> > assignment to that rule is too high.
>
> That could certainly be true and it may prove that doing the proposed
> tests just isn't worth the CPU cycles. Only a test against the corpus
> will say with any degree of certainty.
> Sadly, I don't have the Perl
> skills to make that judgment, hence my appeal to the community for
> ideas, opinions, and possible code to test the theory.
>
> /Jason
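
For anyone who wants to try the UK rule described above against a corpus, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the four-digit codes in UK_MOBILE_CODES are placeholders rather than a real list of UK mobile telco codes, the score value is made up, and the names is_uk_mobile, COUNTRY_RULES and phone_like_score are hypothetical. The three steps are applied in the order given, and only all-numeric user IDs are considered.

```python
import re

# Placeholder four-digit codes for illustration only (assumption);
# a real rule would need the current list of UK mobile telco codes.
UK_MOBILE_CODES = {"7700", "7911"}

# Hypothetical "small negative score" (assumption, not a tuned value).
PHONE_LIKE_SCORE = -0.1


def is_uk_mobile(user_id: str) -> bool:
    """Apply the three UK steps, in order, to an all-numeric user ID."""
    if not re.fullmatch(r"[0-9]+", user_id):
        return False  # only all-numeric user IDs are considered
    number = user_id
    if number.startswith("0044"):  # international dial prefix + UK code
        number = number[4:]
    if number.startswith("0"):  # single leading zero
        number = number[1:]
    # Four-digit mobile code followed by exactly six more digits.
    return len(number) == 10 and number[:4] in UK_MOBILE_CODES


# One validator per country; only the UK rule is sketched here.
COUNTRY_RULES = {"UK": is_uk_mobile}


def phone_like_score(user_id: str) -> float:
    """Return the small negative score if any country rule matches."""
    if any(rule(user_id) for rule in COUNTRY_RULES.values()):
        return PHONE_LIKE_SCORE
    return 0.0
```

Other countries would each need their own function added to COUNTRY_RULES, and whether the check is worth the CPU cycles is still something only a run against the corpus can show.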