[R] Reading data file with both fixed and tab-delimited fields

2010-03-02 Thread Marshall Feldman
Hello R wizards,

What is the best way to read a data file containing both fixed-width and 
tab-delimited fields? (More detail follows.)

_*Details:*_
The U.S. Bureau of Labor Statistics provides local area unemployment 
statistics at ftp://ftp.bls.gov/pub/time.series/la/, and the data are 
documented in the file la.txt 
<ftp://ftp.bls.gov/pub/time.series/la/la.txt>. Each data file has five 
tab-delimited fields:

* series_id
* year
* period (codes for things like quarter or month of year)
* value
* footnote_codes

The series_id consists of five fixed-width subfields (length in 
parentheses):

* survey abbreviation (2)
* seasonal code (1)
* area type code (2)
* area code (6)
* measure code (2)

So an example record might be:

LASPS36040003   1990M01 8.8 L

I want to read in the data in one pass and convert them to a data frame with 
the following columns (actual name, class in parentheses):

Survey abbreviation (survey, character)
Seasonal (seasonal, logical seasonal=T)
Area type (area_type_code, factor)
Area (area_code, factor)
Measure (measure_code, factor)
Year (year, Date)
Period (period, factor)
Value (value, numeric)
Footnote (footnote_codes, character but see note)

(Regarding the Footnote, I have to look at the data more. If there's 
just one code per record, this will be a factor; if there are multiple, 
it will either be character or a list. For now I'm making it only 
character.)

Currently I can read the data just fine using read.table, but this makes 
series_id the first variable. I want to break out the subfields as 
separate columns.

Any suggestions?

Thanks.
 Marsh Feldman
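
A minimal sketch of one base-R approach, not taken from the thread: read the
five tab-delimited fields as character, then cut series_id into its
fixed-width subfields with substr(). The sample lines, the reading of the
seasonal code ("S" = seasonally adjusted), and keeping year as an integer
rather than a Date are all assumptions made for illustration.

```r
# Hypothetical sample lines in the layout described above (tab-delimited).
lines <- c("series_id\tyear\tperiod\tvalue\tfootnote_codes",
           "LASPS36040003\t1990\tM01\t8.8\tL",
           "LAUPS36040003\t1990\tM02\t8.9\t")

# Read everything as character so no subfield loses leading zeros.
raw <- read.delim(text = lines, colClasses = "character", strip.white = TRUE)

# Subfield widths from the post; start/end positions follow from them.
widths <- c(survey = 2, seasonal = 1, area_type_code = 2,
            area_code = 6, measure_code = 2)
ends   <- cumsum(widths)
starts <- ends - widths + 1

parts <- Map(function(s, e) substr(raw$series_id, s, e), starts, ends)
names(parts) <- names(widths)

laus <- data.frame(
  survey         = parts$survey,
  seasonal       = parts$seasonal == "S",   # assumption: "S" means seasonal
  area_type_code = factor(parts$area_type_code),
  area_code      = factor(parts$area_code),
  measure_code   = factor(parts$measure_code),
  year           = as.integer(raw$year),    # assumption: integer, not Date
  period         = factor(raw$period),
  value          = as.numeric(raw$value),
  footnote_codes = raw$footnote_codes,
  stringsAsFactors = FALSE
)
```

Everything here is plain R, so it should behave the same on Mac and Windows
without sed or perl.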




[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Reading data file with both fixed and tab-delimited fields

2010-03-02 Thread Marshall Feldman
Ah, I should have mentioned this. Personally I work on Macs (Leopard) 
and PC's (XP Pro and XP Pro x64). Even though the PC's do have Cygwin, 
I'm trying to make this code portable. So I want to avoid such things as 
sed, perl, etc.

I want to do this in R, even if processing is a bit slower. Eventually, 
I'll hide the code in a class, so the code can be a bit complex.

 Marsh Feldman

On 3/2/2010 12:29 PM, Chidambaram Annamalai wrote:
> I tried to shoehorn the read.* functions and match both the fixed 
> width and the variable width fields
> in the data but it doesn't seem evident to me. (read.fwf reads fixed 
> width data properly but the rest
> of the fields must be processed separately -- maybe insert NULL stubs 
> in the remaining fields and
> fill them in later?)
>
> One way is to sidestep the entire issue and convert the structured 
> data you have into a csv
> file using sed (usually available on  most *nix systems) with 
> something like so:
>
> cat data | sed -r 's/^(..)(.)(..)(.{6})(..)[ \t]*([^ \t]*)[ \t]*([^ 
> \t]*)[ \t]*([^ \t]*)[ \t]*([^ \t]*)[ \t]*([^ 
> \t]*)/\1,\2,\3,\4,\5,\6,\7,\8,\9/' | less
>
> and see if the output is alright and use the resulting .csv file 
> directly in R using read.csv
>
> If that does not satisfy you maybe the R Wizards on the list might be 
> able to point you to a
> native R way of doing this possibly using scan? I'm not sure though.
>
> Hope this helps,
> Chillu
>

-- 
Dr. Marshall Feldman, PhD
Director of Research and Academic Affairs
Center for Urban Studies and Research
The University of Rhode Island
email: marsh @ uri .edu (remove spaces)


  Contact Information:


Kingston:

202 Hart House
Charles T. Schmidt Labor Research Center
The University of Rhode Island
36 Upper College Road
Kingston, RI 02881-0815
tel. (401) 874-5953:
fax: (401) 874-5511


Providence:

206E Shepard Building
URI Feinstein Providence Campus
80 Washington Street
Providence, RI 02903-1819
tel. (401) 277-5218
fax: (401) 277-5464



[R] Do colClasses in readHTMLTable (XML Package) work?

2010-03-17 Thread Marshall Feldman
Hi,

I can't get the colClasses option to work in the readHTMLTable function 
of the XML package. Here's a code fragment:

require("XML")
doc <- "http://www.nber.org/cycles/cyclesmain.html"
# The main table is the second one because it's embedded in the page table.
table <- getNodeSet(htmlParse(doc), "//table")[[2]]
xt <- readHTMLTable(
 table,
 header = c("peak","trough","contraction","expansion","trough2trough","peak2peak"),
 colClasses = c("character","character","character","character","character","character"),
 trim = TRUE
 )

Does anyone know what's wrong?

 Marsh Feldman



[R] Substitute NAs in a data frame

2010-03-18 Thread Marshall Feldman
Excuse me for what I'm sure is a stupid beginner's question, but I've 
given up trying to find the answer to this question from the help, 
RSiteSearch, or any of the usual places.


I have a list that looks like this:
>myList
$first
[1] "--" "18" "8" "32"

$second
[1] "--" "--" "40" "54"

I want a straightforward way to replace "--" with NA so that the list 
looks like:


>myList
$first
[1] NA "18" "8" "32"

$second
[1] NA NA "40" "54"

Now I know I can do something like:

myList$first <- sub("--",NA,myList$first)

but the real list has lots of components. So is there some easy way to 
do something like:


myList <- applier(myList,sub,"--",NA)

where "applier" is a function that will do what I want? I tried using 
lapply, sapply, etc. without luck.
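
A minimal sketch of one way lapply can play the role of "applier", assuming
every component is a character vector and that "--" is the only marker for
missing values. It replaces by logical indexing rather than sub().

```r
myList <- list(first  = c("--", "18", "8", "32"),
               second = c("--", "--", "40", "54"))

# For each component, overwrite the "--" entries with NA and return the vector.
myList <- lapply(myList, function(x) {
  x[x == "--"] <- NA
  x
})
```

Because lapply returns a list with the same names, the result can be assigned
straight back to myList however many components it has.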


Thanks,
Marsh



[R] How to handle missing value as first item in yearmon (zoo package)

2010-03-19 Thread Marshall Feldman

Hi,

Some time series have missing values in the time index. For example 
historical data on business cycles will typically date them from peak to 
trough, but some information may be missing. In most cases, this does 
not cause trouble,
but if the first date is missing (e.g., we know the date of the first 
trough but not the earlier peak), we want the first element in a list of 
dates to be NA. Using yearmon, this causes trouble. Consider this:


> x <- as.yearmon("March 2010","%B %Y")
> y <- c(x,NA)
> z <- c(NA,x)
> y <- yearmon(y)
> z <- yearmon(z)
> y
[1] "Mar 2010" NA
> z
Error in charToDate(x) :
  character string is not in a standard unambiguous format

Can someone explain how to make an object of type yearmon with a missing 
value in its first element?
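
A hedged sketch of one possible workaround, assuming the zoo package is
installed. c() dispatches on its first argument, so c(NA, x) does not go
through the yearmon method; building the numeric values first and applying
yearmon() once at the end avoids relying on that dispatch. Whether this
matches the behaviour of every zoo version is an assumption.

```r
library(zoo)

x <- yearmon(2010 + 2/12)             # March 2010 as year + fraction of a year

# Combine plain numbers, then convert the whole vector in one step.
z <- yearmon(c(NA, as.numeric(x)))    # missing value first
y <- yearmon(c(as.numeric(x), NA))    # missing value last
```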


Thanks.

Marsh Feldman



[R] Preserving both yearmon and numeric data in an xls object

2010-03-28 Thread Marshall Feldman
Hi R gourmets,

I am trying to convert an HTML table into an xts object. The table has 
six columns, with the data of interest in a single row with each cell 
containing a long, \n-delimited character string. Initially, I work with 
these strings as elements in a list. This is necessary because the 
strings in each cell do not translate into a regular matrix with 
equal-length columns. Once I fix the entries, I'm ready to convert them 
into an xts object. The first two columns of the original table contain 
dates, and I want them to be of type "yearmon" in the xts object. The 
other four columns have numeric data.

Here's my problem. If I convert the list into a data frame on the way to 
making it an xts, the first two columns correctly keep the data as the 
yearmon class, but the remaining columns are converted to character 
class. Alternatively, if I convert the list into a matrix on the way to 
making the xts, all six columns become numeric; the first two columns 
lose their yearmon class.

How can I make an xts from a list, such that the first two columns are 
yearmon and the last four are numeric?

Thanks.

 Marsh Feldman

P.S. Here is some sample code:

 > #... input has been converted to a list of length 6; each element is a
 > #    vector, but they themselves are of different length
 > #... fix the 6 vectors to have equal length
 > # Convert the first two elements into yearmon class
 > mylist[1:2] <- lapply(mylist[1:2], as.yearmon)
 > #... Now try one of the following; tindex is the time index in yearmon format
 > # this makes columns 3-6 of myxts all have class = character
 > myxts <- as.xts(as.data.frame(mylist), order.by=tindex)
 > # this makes myxts entirely numeric
 > myxts <- as.xts(matrix(unlist(mylist), ncol=6), order.by=tindex)
 > # I even tried using the following statement afterwards but had no luck
 > myxts[,1] <- as.yearmon(myxts[,1])





[R] Plots don't update with xlab, etc. What am I doing wrong.

2010-04-02 Thread Marshall Feldman
Hi,

I've been struggling with this problem the last few days and finally 
discovered it's happening at a very fundamental level. Going through 
Stephen Turner's tutorial on ggplot2, I entered these base graphics 
commands:

>  with(diamonds, plot(carat,price))
>  with(diamonds, plot(carat,price), xlab="Weight in Carats",
ylab="Price in USD", main="Diamonds are expensive!")

The first command works as expected and draws the plot with labels 
"carat" and "price" and no title. The second command makes R redraw the 
plot (I can see it clear and redraw), but it's identical to the first! 
What am I doing wrong?
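
A minimal sketch of what appears to be going on, using a small hypothetical
data frame in place of diamonds so that only base R is needed. In the second
command the labels sit after plot()'s closing parenthesis, so they are
arguments to with(), which accepts extra arguments through "..." and drops
them.

```r
# Hypothetical stand-in for the diamonds data.
d <- data.frame(carat = c(0.3, 0.7, 1.0), price = c(400, 2500, 5000))

# Extra arguments to with() never reach the inner call.
# Here trim = 0.5 is silently ignored, so both values are the same:
ignored <- with(d, mean(price), trim = 0.5)
plain   <- mean(d$price)

# Moving the labels inside plot() hands them to the plotting function.
pdf(NULL)  # null device so the sketch also runs non-interactively
with(d, plot(carat, price,
             xlab = "Weight in Carats",
             ylab = "Price in USD",
             main = "Diamonds are expensive!"))
invisible(dev.off())
```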

 Marsh Feldman





[R] ggplot2 geom_rect(): What am I missing here

2010-04-04 Thread Marshall Feldman
Hi R fans,

As a newbie following the five-hour rule (after hitting my head against 
the wall for five hours, post to this list), I am appealing for some 
help understanding geom_rect() in ggplot2.

What I want to do is very simple. I want to generate a plot of 
rectangles. Each one represents a business cycle. The x-values will be 
pairs representing the start and end of each cycle. The y-values 
represent the duration of the cycle (in months). In other words, all 
rectangles have coordinates (start, duration) and (end, duration).

I've spent hours trying to figure out the documentation and poring over 
Google and RSeek searches and am at an impasse. The documentation refers 
to xmin, xmax, ymin, and ymax but doesn't say anything about them. The 
only example gives them both as vectors, so I assume they refer to a 
sequence of coordinates in which each rectangle's vertices are given by 
(xmin[i],ymin[i]), (xmin[i],ymax[i]), (xmax[i],ymax[i]), and 
(xmax[i],ymin[i]). But when I try to plot something simple using this 
understanding,  I get a blank plot.

Here's my code:

df <- data.frame(
 xmin = c(1,5),
 xmax = c(2,7),
 ymin = c(0,3),
 ymax = c(2,5)
 )
ggplot(df, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymin)) +
 geom_rect(fill="grey80")

Please help me before I Google again! :-)

Thanks.

Marsh Feldman




[R] ggplot2 geom_rect(): What am I missing here

2010-04-05 Thread Marshall Feldman
Thanks to David Winsemius, Peter Ehlers, and Paul Murrell who pointed 
out my careless error working with ggplot2's geom_rect(). Not to make 
excuses, but when you've done something successfully dozens of times and 
suddenly it doesn't work, you're more likely to look for careless errors 
on your part. When you've never done something before and are unsure that 
you understand the proper use of the tool, you're more likely to think 
you're missing something about the tool's proper use and to overlook 
your own careless errors.
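
For readers following the thread: the slip in the code from the earlier post
looks to be `ymax = ymin` inside aes(), which gives every rectangle zero
height and hence a blank plot. A corrected sketch, assuming ggplot2 is
installed and reusing the same data frame:

```r
library(ggplot2)

df <- data.frame(
  xmin = c(1, 5),
  xmax = c(2, 7),
  ymin = c(0, 3),
  ymax = c(2, 5)
)

# Map ymax to the ymax column (the earlier code mapped it to ymin).
p <- ggplot(df, aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax)) +
  geom_rect(fill = "grey80")
```

Printing p draws one rectangle per row of df, with corners as described in
the original question.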


This list is great! I posted my question, went off to do something else, 
and within a few hours had the answer to my problem.


Thanks again

Marsh Feldman



[R] Combining ggplot2 objects and/or extracting layers

2010-04-09 Thread Marshall Feldman
Hi,

Other than rebuilding the plots, is there any way either (1) to combine 
existing ggplot2 plots or (2) to extract a layer from an existing plot 
so that it can be added to another?

 Thanks.
-- 
Dr. Marshall Feldman, PhD


[R] Beyond reshape: automatically streamlining data

2010-04-09 Thread Marshall Feldman

Hello,

I've been very impressed by the reshape package and how easy it makes 
reorganizing statistical data structures. This makes me wonder if 
there's another package out there that addresses another set of tasks 
that one often does when preparing data for analysis.


For any particular set of analyses, one typically recodes variables and 
deletes cases and variables. It would be really nice to have a package 
that, for example, if one selected cases from a larger data set based on 
the values of certain variables would inspect the resulting data and 
drop any variables that have the same value for all cases. Similarly, if 
any cases are entirely zero or NA, the package could (under user 
control) drop these cases. Finally, it could take a set of data 
transformations and keep them as an object, so that the same 
selection/reshape/streamlining can easily be applied to similar data sets.
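
A minimal base-R sketch of the kind of helper described above, not an
existing package. The function name streamline and its two switches are
hypothetical. It drops rows whose values are all NA (or zero, for numeric
columns) and then drops columns that no longer vary.

```r
streamline <- function(df, drop_empty = TRUE, drop_constant = TRUE) {
  # A cell counts as "blank" if it is NA, or zero in a numeric column.
  is_blank <- function(col) {
    if (is.numeric(col)) is.na(col) | col == 0 else is.na(col)
  }
  if (drop_empty && ncol(df) > 0) {
    empty_row <- Reduce(`&`, lapply(df, is_blank))
    df <- df[!empty_row, , drop = FALSE]
  }
  if (drop_constant && ncol(df) > 0) {
    varies <- vapply(df, function(col) length(unique(col)) > 1, logical(1))
    df <- df[, varies, drop = FALSE]
  }
  df
}

# Hypothetical data: row 2 is entirely zero/NA, and year is constant otherwise.
d <- data.frame(emp  = c(10, 0, 12),
                wage = c(5, NA, 7),
                year = c(2010, NA, 2010))
out <- streamline(d)
```

Since the steps live in one function, the same call can be repeated on the
data for each state.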


My motivation for this came from working with employment data this 
morning. I started out with 11 variables and 35569 cases for Rhode 
Island, a few selections later I had only 420 cases and 3 variables. It 
struck me that the process I went through, which included not only 
making selections but also inspecting the results and deleting 
unnecessary cases/variables, could be automated at least to eliminate 
the inspection step. Also, since I want to do the same thing with data 
for other states, automation would be very nice indeed.


I realize that programming this kind of stuff in R is relatively easy, 
but the reshape package makes me wonder if someone has already done it.


Thanks
Marsh Feldman



Re: [R] Combining ggplot2 objects and/or extracting layers

2010-04-09 Thread Marshall Feldman
Hi Hadley,

Thanks for the terrific package!

If you'd like I could give you my code, but conceptually what I'm trying 
to do is pretty simple.

The chart on this page 
(http://www.businessinsider.com/20-reasons-why-the-us-economy-is-dying-and-is-simply-not-going-to-recover-2010-2#hard-to-find-jobs-3)
is pretty typical. It shows a line chart of time series data against a 
backdrop of shaded bars that indicate periods of recession. This is what 
I'm doing.

The tis package can do this and has a function that works with ggplot2. 
However, I see three problems with the approach in tis. (1) It only adds 
the bars to an existing plot being displayed. I would like to have it as 
a separate object that can be constructed once and added to any number 
of plots whether they are displayed or not.  (2) I'd like to see the 
bars by themselves on a plot. For consistency's sake, once I do this and 
am satisfied with the display, I don't want to have to go and do a separate 
reconstruction. Instead, I want to take the bars from the satisfactory 
display. This way there's less room for accidentally breaking the 
consistency of the plots. (3) The tis plots are fixed in their format. 
They span the y dimension and have widths equal to the durations of the 
recessions. There are instances when one might like something different, 
such as stacked bars or multiple bars of varying heights (patterns, 
etc.) side-by-side that together have a width equal to the recession's 
duration.

Obviously what I'm trying to do can be done with more work, but I'm 
trying to minimize unnecessary repetitions. I already coded a function 
that can draw not only the recession bars but also bars whose 
height represents the value of some variable, with widths equal to 
the durations of the recessions. Once I create a free-standing plot, I'd 
like to be able to use it in various other contexts, including adding it 
to other existing plots. The alternative is to reconstruct the plot as a 
layer and add it to the other plots, but this is time-consuming and 
introduces more room for programming error.
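
A hedged sketch of the layer-based alternative mentioned in the last
sentence, assuming ggplot2 is installed. A layer returned by geom_rect() is
an ordinary R object, so it can be built once and added to several plots.
The recession dates and the two series are made up for illustration.

```r
library(ggplot2)

# Hypothetical recession periods and time series.
recessions <- data.frame(start = c(2001.2, 2007.9), end = c(2001.8, 2009.5))
series <- data.frame(t = seq(2000, 2010, by = 0.5))
series$y <- sin(series$t)
series$z <- cos(series$t)

# Build the shaded bars once as a free-standing layer object.
# inherit.aes = FALSE keeps the layer independent of each plot's own aes().
bars <- geom_rect(
  data = recessions,
  aes(xmin = start, xmax = end, ymin = -Inf, ymax = Inf),
  inherit.aes = FALSE, fill = "grey80", alpha = 0.5
)

# Reuse the same layer in any number of plots.
p1 <- ggplot(series, aes(t, y)) + bars + geom_line()
p2 <- ggplot(series, aes(t, z)) + bars + geom_line()
```

This does not extract a layer from a finished plot; it only avoids rebuilding
the bars for each new plot.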

Thanks for your help.

 Marsh

On 4/9/2010 8:48 AM, hadley wickham wrote:
>> Other than rebuilding the plots, is there any way either (1) to combine
>> existing ggplot2 plots or (2) to extract a layer from an existing plot
>> so that it can be added to another?
>>  
> Not really, although you can always pull apart the plot components.
> Can you give an example of what you are trying to achieve?
>
> Hadley
>
>





[R] Identifying names of matrix columns shared by many matrices

2010-04-19 Thread Marshall Feldman
Greetings R-Geniuses,

What is the most efficient way to handle the problem described below?

Thanks
 Marsh Feldman


Problem description:

Each U.S. state has its own matrix. The rows are dates, the columns are 
industries, and each cell contains total statewide employment at the 
given time and industry. There is a similar matrix for the U.S. as a 
whole. Due to disclosure rules and other limitations, one or more 
industries may be missing from any given matrix (including the national 
one), but industries missing from one matrix are sometimes not missing 
from others. Industry numbers are treated as factors commonly used as 
column names.

I want to do three things:

   1. For any given set of states, find the set of industries present in
  all of them and use this to select this subset of industries from
  each state's matrix.
   2. For any given set of states, find the set of industries present in
  any of the states.
   3. Given that one or more cells in the table may be NA, identify
  those industries that are present in all states and have no NA values.

I can do this using for() statements and %in%, but is there a more 
efficient way? Your thoughts?
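
A minimal sketch of the three tasks without explicit for() loops, assuming
the matrices are held in a named list and that column names are the industry
codes. The two small matrices are hypothetical.

```r
# Hypothetical state matrices: rows are dates, columns are industry codes.
mats <- list(
  RI = matrix(c(1, 2, 3, NA, 5, 6), nrow = 2,
              dimnames = list(c("2009", "2010"), c("11", "21", "23"))),
  MA = matrix(c(7, 8, 9, 10), nrow = 2,
              dimnames = list(c("2009", "2010"), c("21", "23")))
)

# 1. Industries present in all states, and each matrix cut down to them.
common  <- Reduce(intersect, lapply(mats, colnames))
subsets <- lapply(mats, function(m) m[, common, drop = FALSE])

# 2. Industries present in any state.
any_state <- Reduce(union, lapply(mats, colnames))

# 3. Common industries with no NA in any state.
no_na    <- lapply(subsets, function(m) colSums(is.na(m)) == 0)
complete <- common[Reduce(`&`, no_na)]
```

To work with a subset of states, index the list first, for example
mats[c("RI", "MA")].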



Re: [R] Upgrading R using the "global library folder" strategy -, what do you think about it?

2010-04-26 Thread Marshall Feldman

On 4/25/2010 19:39:52, Tal Galili wrote:

*c) R core implementation ?!*
I hope I am not being rude (or jumping into any open doors) in asking this
but...
What do you think about implementing this strategy into the R basic
installation?


   


Tal,

As a general rule, I think R should make upgrading as easy and seamless 
as possible. Upgrading strategies seem system-dependent, but we already 
have to download system-specific versions for Windows, OS X, and Linux. 
To me it appears that doing what you're doing on Windows could easily be 
implemented on *nix based systems with shell scripts. So why not have 
the appropriate scripts ask a few questions upon the first installation 
of R (e.g., "Do you want to configure R with a 'global' library for 
packages to make future upgrading easier?") and at upgrade time ("Your 
previous version of R has a 'global' library; do you want the new 
version to use it?"). I'd even go so far as to have the shell script 
automatically call an R script to run update.packages().


The point is that most users just want to upgrade, and the upgrade 
procedure can and should (a) make this as seamless as possible and (b) 
allow those who may want to run specialized versions of R to opt out of the 
automatic procedure.


Marsh Feldman



Re: [R] R-help Digest, Vol 86, Issue 28

2010-04-27 Thread Marshall Feldman

On 4/26/10 21:45:55 R P Herrold wrote:

Date: Mon, 26 Apr 2010 21:45:55 -0400 (EDT)
From: R P Herrold
To: Marshall Feldman
Cc:r-help@r-project.org
Subject: [R] Upgrading R using the "global library folder" strategy -,
what do you think about it?
Message-ID:
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed

On Mon, 26 Apr 2010, Marshall Feldman wrote:

   

>  So why not have the appropriate
>  scripts ask a few questions upon the first installation of R (e.g., "Do you
>  want to configure R with a "global" library for packages to make future
>  upgrading easier?") and at upgrade time ("Your previous version of R has a
>  "global" library; do you want the new version to use it?). I'd even go so far
>  as to have the shell script automatically call an R script to run
>  update.packages().
 

There is a large body of literature on this -- interactive
questions of non-root users are useless; root user actions
need to be scripted into the package management system
accessible to automation to be scalable, and to attain the
needed administrator level permissions to make changes

   

>  The point is that most users just want to upgrade, and the upgrade procedure
>  can and should (a) make this as seamless as possible and (b) allow those who
>  may want to run specialized versions of R opt out of the automatic procedure.
 

and computers in an environment that has to conform to a
hard specification (think: pharma research for FDA report
preparation; financial service firms) that the IT department
manages, cannot tolerate such diversity

There is no easy answer here, as 'one size cannot fit all'

-- Russ herrold
   


Y'know, I hadn't thought of multi-user machines, network installs, and 
all that. It's been so long since I've worked on a system like that.


Still, this raises some questions. On multi-user installations is there 
a single library shared by all users or does each user have his/her own 
library? If the latter, then the question is moot because installation 
and library configuration are separate things. If the former, then would 
having a standard arrangement that a root user could modify/override work?


Also, assuming a large number of R installations are on machines used by 
a single user (and perhaps others in a relatively unsophisticated 
arrangement, such as a home computer shared by family members), do you 
think having scripts along the lines I suggested would work for a large 
portion of such users?


One could also have switches on an installation command line. I'm not 
trying to impose a one-size-fits-all model, but sometimes 
standardization is good.


The point is that no matter what the nature of the system, suggestions 
as to best practices and automations to accomplish them should be 
present if at all possible. One can always deviate, but it's good to do 
so consciously.


-- Marsh Feldman



[R] Multiple methods in models

2010-05-03 Thread Marshall Feldman

Hi,

The model specification formula language introduced in Chambers and 
Hastie potentially handles rather complex models. Typically the user 
specifies the model and in a separate argument specifies the method. For 
example, one specifies a general linear model with glm(formula,family). 
But with a complex model, one may want to use different methods to 
compute different parts of the model. This seems to imply either 
extending the formula language to integrate method specification as part 
of the model specification or extending the separate method argument 
(family in the case of glm) to include multiple methods along with a way 
to relate them to different parts of the model.


I've looked to see if any packages do such a thing and have found none. 
I've also looked through several documents on R without success.


Does anyone know of a package, document, or other thing in R-land that 
does something like this?


Thanks.
Marsh Feldman



[R] Hierarchical factors

2010-05-03 Thread Marshall Feldman
Hello,

Hierarchical factors are a very common data structure. For instance, one 
might have municipalities within states within countries within 
continents. Other examples include occupational codes, biological 
species, software types (R within statistical software within analytical 
software), etc.

Such data structures commonly use hierarchical coding systems. For 
example, the 2007 North American Industry Classification System (NAICS) 
has twenty two-digit codes (e.g., 42 = Wholesale trade), within each of these 
varying numbers of 3-digit codes (e.g., 423 = Merchant wholesalers, 
durable goods), then varying numbers of 4-digit codes (4231 = Motor 
Vehicle and Motor Vehicle Parts and Supplies Merchant Wholesalers), then 
varying numbers of five-digit codes, varying numbers of six-digit codes, 
etc. At the lowest level (longest code) one can readily tell all the 
higher levels. For example, 441222 is "Boat Dealers" who are part of 
44122, "Motorcycle, Boat, and Other Motor Vehicle Dealers," which is 
part of 4412 (Other Motor Vehicle Dealers), which is part of 441 (Motor 
Vehicle and Parts Dealers), which is part of 44 (Retail Trade). (The US 
Census Bureau has extended the 6-digit NAICS to an even more 
fine-grained 10-digit system.)
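
Because each higher level is a prefix of the full code, the hierarchy can be
recovered with plain string functions. A minimal sketch, assuming codes are
stored as character strings; the employment figures are hypothetical.

```r
# Hypothetical 6-digit codes with made-up employment counts.
naics      <- c("441222", "441221", "423110")
employment <- c(120, 80, 300)

# All ancestors of one code: its 2-, 3-, 4- and 5-digit prefixes.
ancestors <- substring("441222", 1, 2:5)

# Aggregate to the 2-digit sector level by grouping on the prefix.
by_sector <- tapply(employment, substr(naics, 1, 2), sum)
```

The same grouping works at any depth by changing the prefix length passed to
substr().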

I haven't seen any R packages or sample code that handles this kind of 
data, but I don't want to reinvent the wheel and would rather stand on 
the shoulders of you giants. Is there any package or other R-based 
software out there that handles this kind of data structure?

 Thanks,
 Marsh Feldman








Re: [R] Hierarchical factors

2010-05-03 Thread Marshall Feldman
Thanks for getting back so quickly Ista,

I was actually casting about for any examples of R software that deals 
with this kind of structure. But your question is a good one. Here are a 
few things I'd like to be able to do:

* Store data in R at the finest level of detail but easily refer to
  higher levels of aggregation. If the data include such higher
  levels, this is trivial, but otherwise I'd like to aggregate
  fairly easily. The following is not functioning code, but it
  should give you the idea:

start with a data frame (call it d) having row.names equal to the
6-digit NAICS code and columns with various variables; assume one is
named employment.
d[,"employment"]          # Would print all employment data
d["441222","employment"]  # Would print only Boat Dealer employment
d["44","employment"]      # Would print total employment for Retail Trade

* Recursive nesting. I'm not sure how to convey this except with
  examples. Suppose the data frame also has a "wages" column with
  average weekly wages in the industry, and the industry code is
  also a factor variable (industry). So a simple analysis of
  variance might look like:

 w <- aov(wages ~ industry, d)

 But now what I'd like to do is to break this down within
2-digit sectors. Assuming the data frame has another variable,
industry2, this would look like:

 w <- aov(wages ~ industry2/industry, d)

 But what if we either (a) don't want to bother creating
separate variables for each level of aggregation in industry or (b) want
to extend the model formula language to include various nesting
strategies? This might look like:

 # Nest all meaningful levels
 # industry/industry2/industry3/industry4/industry5/industry6. If the
 # coding system skips some levels, R is smart enough to omit the
 # skipped levels.
 w <- aov(wages ~ industry//*)

 # I'm using "//" as a hypothetical extension to the model language
 # that is followed by a "levels" keyword and then a list of levels
 # within the hierarchy. This example would expand
 # to aov(wages ~ industry2/industry4/industry6)
 w <- aov(wages ~ industry//levels 2,4,6)

 One could extend this last example to include a notation
allowing the analysis to be repeated at varying levels of depth (e.g.,
industry||2,6 would repeat the ANOVA for industry2 and industry6).

* Since the factor hierarchy is completely nested (i.e., every
  6-digit industry is below a 5 digit industry), a single function
  can operate on the codes recursively. Three variants come to mind.
  In the first, we'd use some kind of apply function to drill down
  to a certain level and return a list of results, one for each level:

   means <- drill(wages,industry,mean)
 # Would return a list. The first component would be a vector of
 # mean wages for industries at the 2-digit level, the second, a vector
 # for the 3-digit level, etc.
   means <- drill(wages,industry,mean,maxlvl=3)
 # Would stop at the 3rd level of the hierarchy (4-digit code). One could
 # also imagine a maxdigits option as an alternative (maxdigits = y means
 # stop at the y-digit level)

Second, suppose we have a data frame like d, only this time it's a
time series (each row is a different date). Now we might want to
generate vectors of the rate of change in employment at each
industry level. It might look like:

 # For a ts object, lag(x, -1) is the previous period's value
 rate <- function(x) { (x - lag(x, -1))/lag(x, -1) }
 rates <- list()
 i <- 1
 for (j in levels(industry)) {
  # The (hypothetical) levels function parses the hierarchical
  # factor into the various levels of its coding system
  rates[[i]] <- rate(employment[, level(industry) == j])
  # The (hypothetical) level function selects a particular one of
  # these levels
  i <- i + 1
 }

A third variant would be a genuinely recursive function that keeps
on calling itself at each level of the factor until it has either
reached a pre-specified depth or exhausted all levels of the factor.
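
As a rough illustration of the first variant, something like the
hypothetical drill() above can be approximated in base R by truncating
the codes with substr() and aggregating with tapply(). This is only a
sketch: drill() and its digits argument are made-up names, not part of
any package; the wage figures are invented; and it assumes every code is
a character string of the same length.

```r
# Hypothetical helper (not from any package): apply `fun` to `x` within each
# prefix of a hierarchical code, once per requested number of digits.
drill <- function(x, code, fun, digits = 2:6) {
  code <- as.character(code)
  out <- lapply(digits, function(n) tapply(x, substr(code, 1, n), fun))
  names(out) <- paste0("digits", digits)
  out
}

# Invented example data: four 6-digit codes with made-up weekly wages.
d <- data.frame(industry = c("441221", "441222", "441310", "451110"),
                wages    = c(700, 800, 600, 500),
                stringsAsFactors = FALSE)

means <- drill(d$wages, d$industry, mean, digits = 2:4)
means$digits2   # one mean per 2-digit sector
means$digits4   # one mean per 4-digit group
```

Note that these are unweighted means of the rows; a real application
would probably weight by employment.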

I hope this gives you a good idea of the sorts of things one might do 
with hierarchical factors.

 Marsh Feldman



On 5/3/2010 9:57 AM, Ista Zahn wrote:
> Hi Marshell,
> What exactly do you mean by "handles this kind of data structure"?
> What do you want R to do?
>
> Best,
> Ista
>
> On Mon, May 3, 2010 at 9:44 AM, M

[R] Accessing remote data (ftp) over the net

2009-11-24 Thread Marshall Feldman

Hi,

Is there any way to access data remotely over the Internet? In 
particular, I'm starting a project that will use data from the U.S. 
Bureau of Labor Statistics. The Bureau regularly updates various data 
series and publishes them as a series of flat files that can be 
downloaded via ftp (e.g., ftp://ftp.bls.gov/pub/time.series/sm/). 
Since some of these files are rather large and are updated frequently, 
I'd like to leave them on the server and retrieve selected elements 
with SQL queries.


I'm thinking of trying something like: R -> RODBC -> MySQL -> query of 
remote flat-file database and keeping the query as a view.


Has anyone been successful doing something like this? Is there a better 
approach?


Thanks.

   Marsh Feldman
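
One simple route, sketched here under assumptions rather than offered as
a full answer: R's file-reading functions accept URLs (including ftp://),
so a flat file can be read straight from the server and filtered in R.
This does not run SQL on the server; the whole file is still transferred
each time it is read. The function name read_series, its keep argument,
and the assumption that the file is tab-delimited with a header row that
includes a series_id column are all illustrative.

```r
# Hypothetical helper: read a tab-delimited file from a path or URL and
# keep only the rows whose series_id matches a regular expression.
read_series <- function(src, keep = NULL) {
  dat <- read.delim(src, stringsAsFactors = FALSE, strip.white = TRUE)
  if (!is.null(keep)) {
    dat <- dat[grepl(keep, dat$series_id), , drop = FALSE]
  }
  dat
}

# Local stand-in for a remote file, with invented contents.
tmp <- tempfile(fileext = ".txt")
writeLines(c("series_id\tyear\tperiod\tvalue",
             "AAA001\t1990\tM01\t1.5",
             "BBB002\t1990\tM01\t2.5"), tmp)
sel <- read_series(tmp, keep = "^AAA")

# For a remote file, pass the URL instead of a local path, e.g.
# read_series("ftp://ftp.bls.gov/pub/time.series/sm/<file name>", keep = "...")
```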



Re: [R] SAS "datalines" or "cards" statement equivalent in R?

2009-12-07 Thread Marshall Feldman
Regarding the various methods people have suggested, what if a typical 
tab-delimited data line looks like:


 SMS11001 1990 M01 688.0

and the SAS INPUT statement is

   INPUT survey $ 1-2 seasonal $ 3 state $ 4-5 area $ 6-10 supersector 
$ 11-12 @13 industry $8. datatype $ 21-22  year period $ value footnote $ ;


Note that most data lines have no footnote item, as in the sample.

Here (I think) we'd want all the character variables to be read as 
factors, possibly "year" as a date, and "value" as numeric.
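
One way to handle such a line in R, sketched under assumptions: read the
tab-delimited fields first, then cut the fixed-width first field apart
with substr(). The column positions below are taken from the INPUT
statement above; the 22-character series identifier, the data values,
and the column names series_id and footnote are invented for
illustration.

```r
# Hypothetical 22-character identifier built from the widths in the INPUT
# statement: survey(2) seasonal(1) state(2) area(5) supersector(2)
# industry(8) datatype(2).
id <- paste0("SM", "S", "11", "00000", "00", "00000000", "01")
lines <- c(paste(id, "1990", "M01", "688.0", "",  sep = "\t"),
           paste(id, "1990", "M02", "690.2", "P", sep = "\t"))

# Step 1: read the tab-delimited fields.
raw <- read.delim(text = lines, header = FALSE,
                  col.names  = c("series_id", "year", "period",
                                 "value", "footnote"),
                  colClasses = c("character", "integer", "character",
                                 "numeric", "character"))

# Step 2: split the fixed-width identifier into its subfields.
starts <- c(survey = 1, seasonal = 3, state = 4, area = 6,
            supersector = 11, industry = 13, datatype = 21)
stops  <- c(2, 3, 5, 10, 12, 20, 22)
parts  <- mapply(function(s, e) substr(raw$series_id, s, e),
                 starts, stops, SIMPLIFY = FALSE)

# Step 3: combine, turning the subfields and period into factors.
parsed <- cbind(as.data.frame(parts, stringsAsFactors = TRUE), raw[-1])
parsed$period <- factor(parsed$period)
str(parsed)
```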


   Marsh

On Sat, Dec 5, 2009 at 8:11 PM, Gary Miller wrote:

>>  Hi R Users,
>>
>> Is there a equivalent command in R where I can read in raw data? For
>> example
>> I'm looking for equivalent R code for following SAS code:
>>
>> DATA survey;
>>   INPUT id sex $ age inc r1 r2 r3 ;
>>   DATALINES;
>>  1  F  35 17  7 2 2
>> 17  M  50 14  5 5 3
>> 33  F  45  6  7 2 7
>> 49  M  24 14  7 5 7
>> 65  F  52  9  4 7 7
>> 81  M  44 11  7 7 7
>> 2   F  34 17  6 5 3
>> 18  M  40 14  7 5 2
>> 34  F  47  6  6 5 6
>> 50  M  35 17  5 7 5
>> ;
>>
>> Any help would be highly appreciated,
>> Gary




Re: [R] SAS "datalines" or "cards" statement equivalent in R?

2009-12-07 Thread Marshall Feldman
I totally agree with Barry, although it's sometimes convenient to 
include data with analysis code for debugging and/or documentation purposes.

However, the example actually applies equally to separate data files. In 
fact, the example is from the U.S. Bureau of Labor Statistics at 
ftp://ftp.bls.gov/pub/time.series/sm/, which contains nothing but data 
and documentation files. At issue is not where the data come from, but 
rather how to parse relatively complex data organized inconsistently. 
SAS has the built-in ability to parse five different organizations of 
data: list (delimited), modified list, column, formatted, and mixed (see 
http://www.masil.org/sas/input.html). It seems R can parse such data, 
but only with considerable work by the user. It would be great to have a 
function/package that implements something as easy (hah!) and 
flexible as SAS.

Marsh

Barry Rowlingson wrote:
> On Mon, Dec 7, 2009 at 3:53 PM, Marshall Feldman  wrote:
>   
>> Regarding the various methods people have suggested, what if a typical
>> tab-delimited data line looks like:
>>
>> SMS11001 1990 M01 688.0
>>
>> and the SAS INPUT statement is
>>
>>   INPUT survey $ 1-2 seasonal $ 3 state $ 4-5 area $ 6-10 supersector $
>> 11-12 @13 industry $8. datatype $ 21-22  year period $ value footnote $ ;
>>
>> Note that most data lines have no footnote item, as in the sample.
>>
>> Here (I think) we'd want all the character variables to be read as factors,
>> possibly "year" as a date, and "value" as numeric.
>> 
>
>  Actually I'm surprised that nobody has yet said what a clearly
> bonkers thing it is to mix up your data and your analysis code in a
> single file. Now suppose you have another set of data you want to
> analyse with the same code? Are you going to create a new file and
> paste the new data in? You've now got two copies of your analysis code
> - good luck keeping corrections to that code synchronised.
>
>  This just seems like horrendously bad practice, which is one reason
> it's kludgy in R. If it was good practice, someone would surely have
> written a way to do it neatly.
>
>  Keep your data in data files, and your functions in .R function
> files. You'll thank me later.
>
> Barry
>   




Re: [R] SAS "datalines" or "cards" statement equivalent in R?

2009-12-07 Thread Marshall Feldman



Barry Rowlingson wrote:

 I'd love to duplicate this functionality of SAS, however, I fear:

http://www.sas.com/news/preleases/SASsuit.html

  


Amazing, since input statements in SAS bear an uncanny resemblance to 
how PL/I handles input from text files.




Re: [R] Access web content from within R

2010-07-16 Thread Marshall Feldman

On 7/16/2010 6:00 AM, r-help-requ...@r-project.org wrote:

Message: 5
Date: Thu, 15 Jul 2010 03:36:21 -0700 (PDT)
From: Bart Joosen
To:r-help@r-project.org
Subject: [R] Access web content from within R
Message-ID:<1279190181074-2289953.p...@n4.nabble.com>
Content-Type: text/plain; charset=us-ascii


Hi,

I have to search in an online db for registered manufacturers of raw
materials.
Can I use R for the following:
I have a list with monograph numbers eg: l<- c(198, 731,355)

Now I want to make a dataframe, containing the monograph number and the
information listed under COS:
Certificate holder, certificate number, Status, Type

Is this possible with R?

kind regards


Bart

-- View this message in context: 
http://r.789695.n4.nabble.com/Access-web-content-from-within-R-tp2289953p2289953.html 
Sent from the R help mailing list archive at Nabble.com.


You don't describe the format of the database. If it's HTML or XML, the 
scrapeR package may do the trick.


Marsh Feldman



Re: [R] Historical Libor Rates

2010-07-20 Thread Marshall Feldman
Hi Aaditya,

There's a great tool for searching the web, called "Google." I used it 
to find the following web site when I entered "historical libor rates" 
for the search: 
http://www.wsjprimerate.us/libor/libor_rates_history.htm. The site came 
up as the first hit. I suggest you use the scrapeR package to read data 
from the site. Also, to learn more about the terrific Google search 
tool, look at http://www.google.com/.

Good luck.

 Marsh Feldman

On 7/20/2010 6:00 AM, r-help-requ...@r-project.org wrote:
> Date: Mon, 19 Jul 2010 15:21:01 -0400
> From: Aaditya Nanduri
> To:r-help@r-project.org
> Subject: [R] Historical Libor Rates
> Message-ID:
>   
> Content-Type: text/plain
>
> Hello All,
>
> Does anyone know how to download historical LIBOR rates of different
> currencies into R?
>
> Or if anyone knows of a website that holds all this data...I only need up to
> january of 2000.
>
> Also, how can we make the row names the index of a plot (the names of the x
> values)?
>
>   [[alternative HTML version deleted]]
>

-- 
Dr. Marshall Feldman, PhD
Director of Research and Academic Affairs

Center for Urban Studies and Research
The University of Rhode Island



Re: [R] Historical Libor Rates

2010-07-20 Thread Marshall Feldman

Hi Aaditya,

I really wasn't trying to be rude. Sarcastic, yes. Rude, no.

The list is frequented by people ranging from leading statistics 
professors to students looking for someone to do their homework for 
them. I only receive the digest form of the r-help list and saw that 
someone had already answered you by telling you to post your question on 
another list. I thought the LIBOR must be online and did a quick search 
for it, immediately finding the link I sent you. But I thought you might 
be looking for it in a format other than a web page, such as a CSV file 
for download. In other words, I actually thought you had either not 
searched for the data yourself or didn't know how to read a web page 
into R. Rather than let stand the suggestion you take your question to 
another list, I thought I could be more helpful by directing you to a 
data source and giving you the lead on how to read it into R. On the 
chance you had not searched for it yourself, I included the sarcastic humor.


Frankly, I didn't even look to see if the web page had the overnight 
LIBOR but probably should have, because I'm familiar with the LIBOR for 
my own work. ("Probably" because some purposes are better served by data 
covering longer intervals.)


As a general rule, besides being explicit, it's always a good idea to 
tell others on a help list what one has already tried, so they don't do 
unnecessary, duplicate work in their efforts to help. Had you said you'd 
tried searching with Google, I would not have had the opening for 
sarcasm (which I couldn't resist), and you probably would have realized 
you needed to mention the overnight rate in your post.


I appreciate you saying that you should have been more explicit, and I 
hope you'll accept my explanation and apology. I was genuinely trying to 
be helpful yet funny, believing you may not have done the search 
yourself. If you had done the search, then I thought you would either 
just blush and realize you should have been more explicit about a 
missing detail or appreciate that someone had told you how to read the 
web page data into R. I certainly did not intend to offend.


Hopefully this clears the air.

Best wishes,
Marsh Feldman

On 7/20/2010 10:16 AM, Aaditya Nanduri wrote:

Mr. Feldman,

I would love nothing more than to reply to your wonderful email with 
just as much sarcasm.


However, the fault lies with my question; I should have been more 
explicit.


It should have been phrased : Where can I find historical OVERNIGHT 
LIBOR rates?


And surprisingly, we both use the great tool, "Google". What a 
wonderful coincidence.

Via Google, I found this : http://www.econstats.com/r/rlib__d13.htm
However, this site has a lot of missing points and I was really hoping 
for a complete set of data.


But, in all honesty, try to be a little less rude next time.
I've been looking for a good source for a while now and the mailing 
lists are usually my last resort.




Re: [R] how to generate a random data from a empirical, distribition

2010-07-27 Thread Marshall Feldman

On 7/27/2010 6:00 AM, r-help-requ...@r-project.org wrote:

Date: Mon, 26 Jul 2010 11:36:29 -0700 (PDT)
From: xin wei
To:r-help@r-project.org
Subject: [R] how to generate a random data from a empirical
distribition
Message-ID:<1280169389379-2302716.p...@n4.nabble.com>
Content-Type: text/plain; charset=us-ascii


hi, this is more a statistical question than an R question, but I do want to
know how to implement this in R.
I have 10,000 data points. Is there any way to generate an empirical
probability distribution from them (the problem is that I do not know what
distribution they follow: normal, beta?). My ultimate goal is to
generate an additional 20,000 data points from this empirical distribution
created from the existing 10,000 data points.
thank you all in advance.


-- View this message in context: 
http://r.789695.n4.nabble.com/how-to-generate-a-random-data-from-a-empirical-distribition-tp2302716p2302716.html 
Sent from the R help mailing list archive at Nabble.com.


Ah! This brings back memories of the halcyon days of my youth when, as a 
junior in college, I took a course in introductory probability theory 
around this time during the summer in preparation for working as a co-op 
student the coming fall.


Conceptually, why not treat your empirical sample as an "urn" with 
10,000 items. Then take a sample of 20,000 by sampling with equal 
probabilities and replacement (otherwise you'll run out of cases before 
20,000). Remember that all the common distributions (normal, etc.) 
either were derived because they fit certain common situations (e.g., 
binomial), are of particular use (e.g., Student's t), can be derived 
from other distributions (e.g., normal and the Central Limit Theorem), 
or some combination of such things. In other words, whether or not an 
empirical sample fits one of them is always contingent, although 
understanding any underlying processes that generate the sample might 
point in the direction of certain distributions over others. 
Nonetheless, for something like a Monte Carlo simulation, knowledge of 
an underlying distribution is not necessary.
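
In R, the urn approach amounts to a single call to sample() with
replace = TRUE. A minimal sketch, in which the exponential draws merely
stand in for the 10,000 observed values:

```r
set.seed(1)
x <- rexp(10000)   # stand-in for the 10,000 observed data points

# Draw 20,000 values from the empirical distribution of x: each observed
# value is equally likely, and sampling is with replacement.
new_x <- sample(x, size = 20000, replace = TRUE)
summary(new_x)
```

Every simulated value is one of the original observations, so this
reproduces the sample's distribution without smoothing it.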


Also remember that many things in statistics were developed largely 
because they made certain problems mathematically tractable. (Hence, for 
example, the large number of situations involving independent, 
identically distributed random samples or the popularity of ordinary 
least-squares regression.) Today, most of us have more computing power 
at our desks than entire mainframe computing centers had a few decades 
ago. So in many instances, we don't need no stinkin' complex formulas 
anymore.


If you suspect the distribution corresponds to one of the mathematically 
studied distributions, why not fit a curve to a plot of your data points 
and see if it looks familiar? Then do some kind of goodness-of-fit test 
to see if the theoretical distribution is a reasonable approximation.
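
A sketch of that check, assuming for illustration that the candidate is
a normal distribution and using simulated values in place of the real
data. Because the mean and standard deviation are estimated from the
same sample, the Kolmogorov-Smirnov p-value is only a rough guide here.

```r
set.seed(2)
x <- rnorm(10000, mean = 5, sd = 2)   # stand-in for the observed data
m <- mean(x)
s <- sd(x)

# Visual check: kernel density estimate with the fitted normal curve on top.
if (interactive()) {
  plot(density(x))
  curve(dnorm(x, m, s), add = TRUE, lty = 2)
}

# Goodness-of-fit test against the fitted normal distribution.
fit <- ks.test(x, "pnorm", mean = m, sd = s)
fit$statistic
fit$p.value
```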


--
Dr. Marshall Feldman, PhD
Director of Research and Academic Affairs
Center for Urban Studies and Research 
<http://www.uri.edu/prov/research/urbanstudies.html>

The University of Rhode Island <http://www.uri.edu>
email: marsh @ uri .edu (remove spaces)


[R] read.table: skipping trailing delimiters

2010-05-04 Thread Marshall Feldman

Hi,

I am trying to read a tab-delimited file that has trailing tab 
delimiters. It's a simple file with two legitimate fields. I'm using the 
first as row.names, and the second should be the only column in the 
resulting data frame.


Initially, R was filling the last column with NA's, but I was able to 
stop that by setting colClasses=c("character","character",NULL). Still, 
the data frame is coming in with an extra column, only now its values 
are set to "".


Is there any way to skip the trailing delimited field entirely? I've 
searched for an answer without luck.


Thanks.
Marsh Feldman
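
A possible fix, sketched with invented data: c("character","character",NULL)
is just a two-element vector, because c() drops NULL, so the classes are
recycled and the third column is read as character. The read.table
documentation lists the string "NULL" (in quotes) as the colClasses value
that skips a column. Note that colClasses counts the row-name column too.

```r
# Two invented lines, each ending in a trailing tab.
txt <- c("a\tfoo\t", "b\tbar\t")

dat <- read.table(text = txt, sep = "\t", header = FALSE, row.names = 1,
                  colClasses = c("character", "character", "NULL"))
dat
```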



[R] Flushing print buffer

2010-05-04 Thread Marshall Feldman
Hello,

I have a function with these lines:

test <- function(object,...){
  cat("object: has ",nrow(object),"labels\n")
  cat("Head:\n")
  head(object,...)
  cat("\nTail:\n")
  tail(object,...)
  }

If I feed it a data frame object, it only prints out the tail part. If I 
comment out the last two lines of the function, it does print the head 
part. Obviously there's a buffer not being flushed between the head and 
the tail calls, but I don't know how to flush it. Can someone help me?

Thanks.

 Marsh Feldman
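
The likely cause is not a buffer. At the top level R auto-prints the
value of an expression, but inside a function nothing is auto-printed;
only the value of the last expression is returned and then printed by
the caller. Wrapping head() and tail() in explicit print() calls shows
both parts. A minimal sketch (the invisible() return and the example
data frame are additions for illustration):

```r
test <- function(object, ...) {
  cat("object: has", nrow(object), "rows\n")
  cat("Head:\n")
  print(head(object, ...))   # explicit print: no auto-printing in functions
  cat("\nTail:\n")
  print(tail(object, ...))
  invisible(object)          # avoid printing the whole object again
}

out <- capture.output(test(data.frame(x = 1:10), n = 2))
cat(out, sep = "\n")
```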





Re: [R] concatenate values of two columns

2010-05-05 Thread Marshall Feldman

On 5/5/2010 6:00 AM, n.via...@libero.it wrote:

Dear list,
I'm trying to concatenate the values of two columns but im not able to do it:

i have a dataframe with the following two columns:

X   VAR1   VAR2
1          2
2   1
3   2
4          3
5          4
6   4


what i would like to obtain is:
X   VAR3
1   2
2   1
3   2
4   3
5   4
6   4

I try with paste but what I obtain is:
X   VAR3
1   NA2
2   1NA
3   2NA
4   NA3
5   NA4
6   4NA

  Thanks a lot!!

[[alternative HTML version deleted]]

   


Hi,

You don't say what you want to do when both VAR1 and VAR2 have 
non-trivial values. Neither do you indicate what is in the cells that 
are blank in your example. Nonetheless, consider this code:


> X <- data.frame()
> X <- edit(X)
> X
  VAR1 VAR2
1   NA    2
2    1   NA
3    2   NA
4   NA    3
5   NA    4
6    4   NA

> VAR3 <- X$VAR1
> VAR3
[1] NA  1  2 NA NA  4
> VAR3[is.na(VAR3)] <- X$VAR2[!is.na(X$VAR2)]
> VAR3
[1] 2 1 2 3 4 4
> X <- cbind(X,VAR3)
> X
  VAR1 VAR2 VAR3
1   NA    2    2
2    1   NA    1
3    2   NA    2
4   NA    3    3
5   NA    4    4
6    4   NA    4

Q.E.D.
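
A row-by-row alternative, offered as a sketch: ifelse() picks VAR2
wherever VAR1 is missing, so it does not depend on the two columns
having exactly complementary NAs. If both columns have a value in some
row, this version keeps VAR1.

```r
X <- data.frame(VAR1 = c(NA, 1, 2, NA, NA, 4),
                VAR2 = c(2, NA, NA, 3, 4, NA))

# Take VAR2 where VAR1 is missing, otherwise VAR1.
X$VAR3 <- ifelse(is.na(X$VAR1), X$VAR2, X$VAR1)
X
```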

Marsh Feldman



Re: [R] Hierarchical factors

2010-05-05 Thread Marshall Feldman
Thanks for sharing this, Ista.

I've come to the conclusion that R doesn't have what I'm looking for, 
either in the base or the packages.

Although your examples are insightful, the examples we've been 
discussing are deliberately easier than what one would expect in most 
serious applications. Imagine for instance that we're studying wage 
structures of industries in different geographic labor markets. We 
therefore might have four variables: wages, industries, occupations, and 
places. We might want to see if wage differentials are more or less 
constant or if they are higher in some geographic areas than in others. 
Since industries, occupations, and places are typically coded 
hierarchically as we've been discussing, we might want to figure out how 
to examine different wage levels within industries, etc. Doing this 
manually would require lots of w
whereas conceptually  the

On 5/4/2010 6:00 AM,
> Message: 49
> Date: Mon, 3 May 2010 13:22:59 -0400
> From: Ista Zahn
> To: Marshall Feldman
> Cc: r-help@r-project.org
> Subject: Re: [R] Hierarchical factors
> Content-Type: text/plain; charset=ISO-8859-1
>
> Hi Marshall,
>
> I'm not aware of any packages that implement these features as you
> described them. But most of the tasks are already fairly easy in R --
> see below.
>
> On Mon, May 3, 2010 at 11:18 AM, Marshall Feldman wrote:
>> >
>> >  Thanks for getting back so quickly Ista,
>> >
>> >  I was actually casting about for any examples of R software that deals 
>> > with this kind of structure. But your question is a good one. Here are a 
>> > few things I'd like to be able to do:
>> >
>> >  Store data in R at the finest level of detail but easily refer to higher 
>> > levels of aggregation. If the data include such higher levels, this is 
>> > trivial, but otherwise I'd like to aggregate fairly easily. The following 
>> > is not functioning code, but it should give you the idea:
>> >
>> >  start with a data frame (call it d) having row.names = to the 6 digit 
>> > NAICS code and columns w/ various variables, assume one is named 
>> > employment.
>> >  d[,"employment"]          # Would print all employment data
>> >  d["441222","employment"]  # Would print only Boat Dealer employment
>> >  d["44","employment"]      # Would print total employment for
>> > Retail Trade
>>  
> d[,"employment"] #prints all employment data
> d[rownames(d) == "441222","employment"] #prints only boat dealer employment
> sum(d[grep("^44", rownames(d)),"employment"]) # prints total employment for
> retail trade
>
>

Re: [R] Hierarchical factors

2010-05-06 Thread Marshall Feldman

On 5/5/10 [May 5, 10] 11:29 PM, David Winsemius wrote:
I think you are perhaps unintentionally obscuring two issues. One is 
whether R might have the statistical functions to deal with such an 
arrangement, and here "mixed models" would be the phrase you ought to 
be watching for, while the other would be whether it would have 
pre-written data management functions that would directly support the 
particular data layout you might be getting from public-access gov't 
files. The second is what I _thought_ you were soliciting in your 
original posting. I was a bit surprised that no one mentioned the 
survey package, since I have seen it used in such situations,  but I 
cannot track down the citation at the moment. You might want to look 
at Gelman's blogs:


http://www.stat.columbia.edu/~cook/movabletype/archives/2009/07/my_class_on_sur.html 



See also work on nested case within cohort designs:
http://aje.oxfordjournals.org/cgi/content/full/kwp055v1

And Damico's article:
"Transitioning to R: Replicating SAS, Stata, and SUDAAN Analysis 
Techniques in Health Policy Data"

R Journal, 2009, n. 2.
http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Damico.pdf

First, I apologize for my last, somewhat incoherent post. I was 
composing it late at night, grew too tired to think, and thought I left 
it open to finish this morning. Looks as if I should have quit about an 
hour earlier since apparently the garbled message went out anyway.


Dave, you're right, although I would describe my question as combining 
rather than obscuring two issues. My thinking is that first one would 
want the data structure (actually a data type or class). A set of 
functions could then handle conversion to factors, etc. that would allow 
easy use of most existing statistical functions. New statistical 
functions could then be designed, or old ones retrofitted, to handle the 
new data type internally. Eventually, it would be great to integrate it 
into the formula language.


The data type would have an inheritance pattern sort of like this: 
factor -> hierarchy -> specific system. By "specific system" I mean 
either a standard or user-defined coding system that extends the 
hierarchy class. For example, NAICS would be a data type and any 
variable in this class would be both hierarchical and map to the labels 
associated with the industry definitions. The hierarchy class would be 
what I was describing, with information on how to parse individual 
character strings at various levels of aggregation. Finally, although my 
idea would extend R's factor data type, strictly speaking this would not 
be inheritance. Real factors replicate and include labels in the storage 
associated with individual variables. Most hierarchical systems are very 
large, including hundreds of levels and long labels. So factors would 
usually be a very inefficient way to handle them. Imagine, for example, 
an application analyzing Internet routing or airline traffic, with each 
node on a route having a spatial hierarchical code 
(country.state.county.city) and a separate variable for each node. Ugh!


Instead, my idea would be to use an approach similar to SAS's formats, 
where the labels are stored separately and the individual codes map 
through a few relatively simple algorithms. SAS, for example, maps codes 
to labels either 1:1 (a character representation of the code maps to a 
label) or by evaluating the code and mapping it according to a 
predefined range of values. SAS recently implemented a feature that 
allows 1:many mapping so that, for instance, an AGE variable could 
simultaneously map to "Adult" and "Senior Citizen." Some statistical 
procedures in SAS will now repeat the analysis for all the mappings, so 
a single call to describe a variable generates counts of both adults and 
seniors.
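
The idea of keeping labels apart from the data can be sketched in base R
with a named character vector used as a lookup table. The labels below
are the NAICS examples from earlier in this thread; the function name
label_at and its arguments are hypothetical.

```r
# Labels stored once, keyed by code prefix.
naics_labels <- c(
  "44"     = "Retail Trade",
  "441"    = "Motor Vehicle and Parts Dealers",
  "4412"   = "Other Motor Vehicle Dealers",
  "44122"  = "Motorcycle, Boat, and Other Motor Vehicle Dealers",
  "441222" = "Boat Dealers"
)

# Hypothetical helper: label of a code at a given number of digits.
label_at <- function(code, n, lookup) {
  unname(lookup[substr(as.character(code), 1, n)])
}

label_at("441222", 2, naics_labels)
label_at("441222", 6, naics_labels)
```

The data then need to store only the 6-digit codes; labels at any level
are looked up on demand.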


While something similar to SAS formats would itself be a useful addition 
to R (and has been discussed before), my idea extends this by adding the 
ability to parse a hierarchical code at its various levels. This could 
then be integrated into appropriate statistical functions, or the 
analyst could write a function to deparse the code into its levels and 
then call the statistical function as needed. At a minimum, the 
hierarchy class would have to include an as.factor() function.
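
Short of a full class, the "function to deparse the code into its
levels" might look like the following sketch. The name split_levels, the
example codes, and the simulated wages are all invented for
illustration; the nested formula is built from the generated column
names and passed to aov().

```r
# Hypothetical helper: one factor per requested prefix length.
split_levels <- function(code, digits, prefix = "industry") {
  code <- as.character(code)
  out <- lapply(digits, function(n) factor(substr(code, 1, n)))
  names(out) <- paste0(prefix, digits)
  as.data.frame(out)
}

set.seed(42)
d <- data.frame(industry = c("441110", "441120", "441221", "441222",
                             "451110", "451120", "452111", "452112"),
                wages    = round(rnorm(8, mean = 700, sd = 50)),
                stringsAsFactors = FALSE)

lv <- split_levels(d$industry, digits = c(2, 3))
dd <- cbind(d, lv)

# Build wages ~ industry2/industry3 from the generated names.
f <- as.formula(paste("wages ~", paste(names(lv), collapse = "/")))
w <- aov(f, data = dd)
summary(w)
```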


Given R's thousands of packages, I sent my post to find out if something 
like this already existed.


Thanks to everyone for your feedback. This list is great! The answer to 
my question is:


> answer <- little.red.hen(question)

Marsh Feldman



Re: [R] How to re-arrange data in R

2010-05-07 Thread Marshall Feldman
Hi Zablone,

I have a few questions about your data, but think the reshape package is 
ultimately what you want. So just look at it and see if you can get it 
to do what you want.
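
In case it helps, here is a rough base R sketch of the same wide-to-long 
rearrangement, so you can see the steps involved. The toy values, the object 
names (wide, coords, long), and the assumption that the station names in the 
coordinates file match the column names exactly are all mine, not taken from 
your files.

```r
# Toy stand-in for the monthly data: 2 stations instead of 44.
wide <- data.frame(
  Year     = c(1960, 1960),
  Month    = c(1, 2),
  Station1 = c(70, 65),
  Station2 = c(69, 72)
)

# Toy stand-in for the separate Station / Lat / Lon file.
coords <- data.frame(
  Station = c("Station1", "Station2"),
  Lat     = c(-22.9922, -22.9556),
  Lon     = c(-43.2328, -43.1667)
)

station_cols <- setdiff(names(wide), c("Year", "Month"))

# Stack the station columns so each row holds one station-month value.
long <- data.frame(
  Year     = rep(wide$Year,  times = length(station_cols)),
  Month    = rep(wide$Month, times = length(station_cols)),
  Station  = rep(station_cols, each = nrow(wide)),
  Variable = unlist(wide[station_cols], use.names = FALSE)
)

# Attach the coordinates, then sort and order the columns as requested.
long <- merge(long, coords, by = "Station")
long <- long[order(long$Year, long$Month, long$Station),
             c("Year", "Month", "Station", "Lat", "Lon", "Variable")]
rownames(long) <- NULL
```

With the reshape package, melt(wide, id.vars = c("Year", "Month")) should 
replace the stacking step, leaving only the merge() with the coordinates.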

 Marsh Feldman

On Fri, 7 May 2010 01:21:09 -0700, Zablone Owiti wrote:
> Dear users,
>
> I have monthly data for 44 stations covering 45 years, which I have
> read into R using read.table. The data are in the format:
>
> Year  Month  Station1  Station2 ... Station44
>
> (i.e., the column names are in the 1st row). I also have the
> latitude and longitude of the stations in a separate file in R (in the
> format: Station LAT LON).
>
> I wish to rearrange this data to a format:
>
> Year  Month  Station      Lat         Lon          Variable
> 1960  01     station001   -22.992200  -43.232800   70
> 1960  01     station002   -22.955600  -43.166700   69
> 1960  01     station003   -22.931700  -43.221700 7 89
> "
> "
> "
> "
> 2003  12     station043   -23.46473   -47.3836383  183
> 2003  12     station 044  -22.817500  -43.21 7     179
>
> How do I go about the task in R?
>
> Thanks
>   ZABLONE OWITI
>   GRADUATE STUDENT
> College of Atmospheric Science
> Nanjing University of Information, Science and Technology
> Add: 219 Ning Liu Rd, Nanjing, Jiangsu, 21004, P.R. China
>   Tel: +86-25-58731402
> Fax: +86-25-58731456
> Mob. 15077895632
> Website:www.nuist.edu.cn
>

-- 
Dr. Marshall Feldman, PhD
Director of Research and Academic Affairs
Center for Urban Studies and Research
The University of Rhode Island
email: marsh @ uri .edu (remove spaces)


  Contact Information:


Kingston:

202 Hart House
Charles T. Schmidt Labor Research Center
The University of Rhode Island
36 Upper College Road
Kingston, RI 02881-0815
tel. (401) 874-5953
fax: (401) 874-5511


Providence:

206E Shepard Building
URI Feinstein Providence Campus
80 Washington Street
Providence, RI 02903-1819
tel. (401) 277-5218
fax: (401) 277-5464



Re: [R] Data Frame as Hash Table

2010-05-30 Thread Marshall Feldman
Besides data.table, there's the hash package. It does not use data.frame 
type structures but is a bit more flexible.
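
If you would rather stay in base R, an environment behaves like a hash table 
with character keys. The following is only a sketch of the lookup asked about 
in the quoted message; the object names d and h are mine, and the values are 
copied from the quoted output rather than regenerated with rnorm().

```r
# Rebuild the example data frame with the values shown in the quoted post.
d <- data.frame(
  key   = seq(0.5, 3, 0.5),
  value = c(-1.118665122, 0.465122921, -0.529239211,
            -0.147324638, -1.531503795, -0.002720434)
)

# Environments are hashed; keys must be character strings,
# so the numeric keys are converted with as.character().
h <- new.env(hash = TRUE)
for (i in seq_len(nrow(d))) {
  assign(as.character(d$key[i]), d$value[i], envir = h)
}

# Lookup by key in the environment.
from_env <- get("1.5", envir = h)

# Lookup directly in the data frame, without building anything.
from_df <- d$value[match(1.5, d$key)]
```

Both lookups return the value stored for key 1.5. Matching on numeric keys 
works here because multiples of 0.5 are represented exactly; with arbitrary 
decimals, character or integer keys would be safer.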


Marsh Feldman

On 5/30/10 [May 30, 10] 6:00 AM, r-help-requ...@r-project.org wrote:

Message: 40
Date: Sun, 30 May 2010 09:24:22 +0100
From: Patrick Burns
To:r-help@r-project.org,alan@gmail.com
Subject: Re: [R] Data Frame as Hash Table

You might want to investigate the 'data.table'
package.

On 30/05/2010 09:03, Alan Lue wrote:
   

>  I'm interested in using a data frame as if it were a hash table.  For
>  instance if I had the following,
>
>  > (d <- data.frame(key=seq(0.5, 3, 0.5), value=rnorm(6)))
>    key        value
>  1 0.5 -1.118665122
>  2 1.0  0.465122921
>  3 1.5 -0.529239211
>  4 2.0 -0.147324638
>  5 2.5 -1.531503795
>  6 3.0 -0.002720434
>
>  Then I'd like to be able to quickly retrieve the "value" of "key" 1.5
>  to get -0.53.  How would one go about doing this?
>
>  Yours,
>  Alan Lue
>
 
-- Patrick Burns pbu...@pburns.seanet.com http://www.burns-stat.com 
(home of 'Some hints for the R beginner' and 'The R Inferno')

