Or consider a different approach to the problem: figure out which regex patterns fit the data.

# test series ... I think your ANAAAN was supposed to be ANANAN
zipcode <- c("22942-0173", "32601", "N9Y2E6", "S7V 1J9", "0022942-0173", "32-601", "NN9Y2E6", "S7V 1J9")
# test series in data frame
zipdf <- data.frame( Zip=zipcode )
# default condition for category
zipdf$Category <- "Unknown"
# recognize US patterns ... test for "Unknown" is only there for consistency in this first search
zipdf[ with( zipdf, "Unknown"==Category & grepl( "^[[:digit:]]{5}(-[[:digit:]]{4})?$", Zip ) ), "Category" ] <- "US"
# recognize Canada patterns
zipdf[ with( zipdf, "Unknown"==Category & grepl( "^[[:alpha:]][[:digit:]][[:alpha:]] ?[[:digit:]][[:alpha:]][[:digit:]]$", Zip ) ), "Category" ] <- "CA"
# summarize categories
table(zipdf$Category)
# review un-recognized zips
zipdf[ "Unknown"==zipdf$Category, ]

Note that regular expressions are documented in many places; there are whole books on them. The patterns above build in some flexibility (an optional "-NNNN" suffix for US codes, an optional space for Canadian ones). While you are still learning what patterns are in the data, it can be easier to set up several simpler regex patterns that all map to the same category, though making multiple passes is slower, which may be an issue for large amounts of data.

As an example, the US test above could be written as

zipdf[ with( zipdf, "Unknown"==Category & grepl( "^[[:digit:]]{5}$", Zip ) ), "Category" ] <- "US"
zipdf[ with( zipdf, "Unknown"==Category & grepl( "^[[:digit:]]{5}-[[:digit:]]{4}$", Zip ) ), "Category" ] <- "US"

and get the same answer as the single test above, but using roughly twice the processing time.
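If the list of simple patterns grows, it may be easier to keep them in a small lookup table and loop over it rather than repeating the assignment line. The following is only a sketch of that idea: the names zip_rules and classify_zip are made up for illustration, and the sample values are just stand-ins for real data.

```r
# Hypothetical lookup table: several simple patterns may share one category.
zip_rules <- data.frame(
  Pattern  = c("^[[:digit:]]{5}$",
               "^[[:digit:]]{5}-[[:digit:]]{4}$",
               "^[[:alpha:]][[:digit:]][[:alpha:]] ?[[:digit:]][[:alpha:]][[:digit:]]$"),
  Category = c("US", "US", "CA"),
  stringsAsFactors = FALSE
)

# Classify a character vector; the first matching rule wins and
# anything that matches no rule stays "Unknown".
classify_zip <- function(zip, rules = zip_rules) {
  category <- rep("Unknown", length(zip))
  for (i in seq_len(nrow(rules))) {
    hit <- category == "Unknown" & grepl(rules$Pattern[i], zip)
    category[hit] <- rules$Category[i]
  }
  category
}

# Sample values (assumed for illustration)
zipcode <- c("22942-0173", "32601", "N9Y2E6", "S7V 1J9",
             "0022942-0173", "32-601", "NN9Y2E6", "S7V  1J9")
zip_category <- classify_zip(zipcode)
table(zip_category)
```

This still makes one pass over the data per rule, so the speed caveat above applies; the gain is only that adding a newly discovered pattern means adding a row rather than another line of code.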

On Wed, 8 Jan 2014, Frede Aakmann Tøgersen wrote:

Hi

Something like this.

## 4 valid zips + 4 invalid zips
zipcode <- c("22942-0173", "32601", "N9YZE6", "S7V 1J9", "0022942-0173", "32-601", 
"NN9YZE6", "S7V  1J9")

tmp <- gsub("[[:space:]]", "_", zipcode)
tmp <- gsub("[[:alpha:]]", "A", tmp)
tmp <- gsub("[[:digit:]]", "N", tmp)

tmp
## [1] "NNNNN-NNNN"   "NNNNN"        "ANAAAN"       "ANA_NAN"      "NNNNNNN-NNNN"
## [6] "NN-NNN"       "AANAAAN"      "ANA__NAN"

patterns <- c("NNNNN-NNNN", "NNNNN", "ANAAAN", "ANA_NAN")

zipcode[tmp %in% patterns]
## [1] "22942-0173" "32601"      "N9YZE6"     "S7V 1J9"
zipcode[!tmp %in% patterns]
## [1] "0022942-0173" "32-601"       "NN9YZE6"      "S7V  1J9"
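Since the original question was about counting how often each pattern occurs, the same three substitutions can be wrapped in a function and the result passed to table(). This is a sketch only: zip_pattern is a hypothetical helper name, and the sample values are invented for illustration.

```r
# Hypothetical helper wrapping the three substitutions shown above.
# Order matters: letters are replaced before digits, so the "N"
# placeholders are never themselves rewritten to "A".
zip_pattern <- function(x) {
  x <- as.character(x)                # the field may be stored as a factor
  x <- gsub("[[:space:]]", "_", x)
  x <- gsub("[[:alpha:]]", "A", x)
  gsub("[[:digit:]]", "N", x)
}

# Sample values (assumed for illustration)
postal_code <- factor(c("22942-0173", "32601", "90210",
                        "S7V 1J9", "N9Y 2E6", "32-601"))

# Count each pattern, most frequent first; rare patterns are the outliers.
pattern_counts <- sort(table(zip_pattern(postal_code)), decreasing = TRUE)
pattern_counts
```

Other characters, such as an apostrophe, pass through unchanged, which matches what was asked for.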


Yours sincerely / Med venlig hilsen


Frede Aakmann Tøgersen

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
On Behalf Of Jeff Johnson
Sent: 8. januar 2014 00:11
To: r-help@r-project.org
Subject: [R] Patterns on postal codes

Hi all,

I'm pretty new to R and have a question. I have a postal_code field which
can have a variety of values such as:
For US postal codes: 22942-0173 or 32601
For Canada postal codes: N9YZE6 or S7V 1J9

What I want to do is represent these as patterns, such as:
US: NNNNN-NNNN or NNNNN
Canada: ANAAAN or ANA NAN
where N = any number and A = any alpha character, space = space, etc. (other
characters such as ' should be represented as ').

Ultimately I want to count these to see how many have a pattern of
NNNNN-NNNN, ANA NAN, etc so that I can visualize the outliers.

Does anyone know if there is a built-in function in R to do this?
Currently, the str() function on the postal_code field shows a factor with
90,993 levels which isn't particularly helpful.

Thanks in advance!

--
Jeff


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
