Send R-help mailing list submissions to
r-help@r-project.org
To subscribe or unsubscribe via the World Wide Web, visit
https://stat.ethz.ch/mailman/listinfo/r-help
or, via email, send a message with subject or body 'help' to
r-help-requ...@r-project.org
You can reach the person managing the list at
r-help-ow...@r-project.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of R-help digest..."
Today's Topics:
1. Re: Bug in R 3.6.0? (Martin Maechler)
2. Re: Bug in R 3.6.0? (o...@free.fr)
3. Time series (trend over time) for irregular sampling dates
and multiple sites (=?UTF-8?Q?Catarina_Serra_Gon=C3=A7alves?=)
4. Re: Time series (trend over time) for irregular sampling
dates and multiple sites (Bert Gunter)
5. Passing formula as parameter to `lm` within `sapply` causes
error [BUG?] (Jens Heumann)
6. (no subject) (Haddison Mureithi)
7. Help with loop for column means into new column by a subset
Factor w/131 levels (Bill Poling)
8. Re: Help with loop for column means into new column by a
subset Factor w/131 levels (Bill Poling)
9. transpose and split dataframe (Matthew)
10. Re: transpose and split dataframe (David L Carlson)
11. Re: Passing formula as parameter to `lm` within `sapply`
causes error [BUG?] (David Winsemius)
12. Fwd: Re: transpose and split dataframe (Matthew)
13. Re: transpose and split dataframe (Jim Lemon)
14. Re: Time series (trend over time) for irregular sampling
dates and multiple sites (Abs Spurdle)
15. Re: Fwd: Re: transpose and split dataframe (David L Carlson)
16. Re: Passing formula as parameter to `lm` within `sapply`
causes error [BUG?] (Duncan Murdoch)
17. Re: Time series (trend over time) for irregular sampling
dates and multiple sites (Abs Spurdle)
18. Re: Time series (trend over time) for irregular sampling
dates and multiple sites (Abs Spurdle)
19. Re: Passing formula as parameter to `lm` within `sapply`
causes error [BUG?] (Jens Heumann)
20. Re: Passing formula as parameter to `lm` within `sapply`
causes error [BUG?] (peter dalgaard)
----------------------------------------------------------------------
Message: 1
Date: Tue, 30 Apr 2019 16:54:10 +0200
From: Martin Maechler <maech...@stat.math.ethz.ch>
To: Morgan Morgan <morgan.email...@gmail.com>
Cc: <r-help@r-project.org>
Subject: Re: [R] Bug in R 3.6.0?
Message-ID: <23752.24978.45927.96...@stat.math.ethz.ch>
Content-Type: text/plain; charset="utf-8"
Morgan Morgan
on Mon, 29 Apr 2019 21:42:36 +0100 writes:
> Hi,
> I am using the R 3.6.0 on windows. The issue that I report below
does not
> exist with previous version of R.
> In order to reproduce the error you must install a package of your
choice
> from source (tar.gz).
> -Create a .Rprofile file with the following command in it :
setwd("D:/")
> -Close your R session and re-open it. Your working directory must be
now set
> to D:
> -Install a package of your choice from source, example :
> install.packages("data.table",type="source")
> In my case the package fail to install and I get the following error
> message:
> ** R
> ** inst
> ** byte-compile and prepare package for lazy loading
> Error in tools:::.read_description(file) :
> file 'DESCRIPTION' does not exist
> Calls: suppressPackageStartupMessages ... withCallingHandlers ->
> .getRequiredPackages -> <Anonymous> -> <Anonymous>
> Execution halted
> ERROR: lazy loading failed for package 'data.table'
> * removing 'C:/Users/Morgan/Documents/R/win-library/3.6/data.table'
> * restoring previous
> 'C:/Users/Morgan/Documents/R/win-library/3.6/data.table'
> Warning in install.packages :
> installation of package ‘data.table’ had non-zero exit status
> Now remove the .Rprofile file, restart your R session and try to
install th
e
> package with the same command.
> In that case everything should be installed just fine.
> FYI the issue happens on macOS as well and I suspect it also does on
all
> linux systems.
> My question: Is this expected or is it a bug?
It is a bug, thank you very much for reporting it.
I've been told privately by Ömer An (thank you!) who's been
affected as well, that this problem seems to affect others, and
that there's a thread about this over at the Rstudio support site
https://support.rstudio.com/hc/en-us/community/posts/200704708-Build-tool-does-not-recognize-DESCRIPTION-file
There, users mention that (all?) packages are affected which
have a multiline 'Description:' field in their DESCRIPTION file.
Of course, many if not most packages have this property.
Indeed, I can reproduce the problem (e.g. with my 'sfsmisc'
package) if I ("silly enough to") add a setwd() call to my
Rprofile file (the one I set via env.var R_PROFILE or R_PROFILE_USER).
This is clearly a bug, and indeed a bad one.
It seems all R core (and other R expert users who have tried R
3.6.0 alpha, beta, and RC versions) have *not* seen the bug as they
are intuitively smart not to mess with R's working directory in
a global R profile file ...
For now you definitively have to work around by not doing what's
the problem : do *NOT* setwd() in your ~/.Rprofile or other
such R init files.
Best,
Martin Maechler
ETH Zurich and R Core Team
------------------------------
Message: 2
Date: Tue, 30 Apr 2019 16:15:46 +0200
From: <o...@free.fr>
To: "'Morgan Morgan'" <morgan.email...@gmail.com>,
<r-help@r-project.org>
Subject: Re: [R] Bug in R 3.6.0?
Message-ID: <002d01d4ff5f$34816be0$9d8443a0$@free.fr>
Content-Type: text/plain; charset="utf-8"
Hello,
I have exactly the same problem when I install one of my own packages:
Error in tools:::.read_description(file) :
file 'DESCRIPTION' does not exist
Calls: suppressPackageStartupMessages ... withCallingHandlers ->
.getRequiredPackages -> <Anonymous> -> <Anonymous>
Exécution arrêtée
ERROR: lazy loading failed for package 'RRegArch'
Best,
Ollivier
-----Message d'origine-----
De : R-help <r-help-boun...@r-project.org> De la part de Morgan Morgan
Envoyé : lundi 29 avril 2019 22:43
À : r-help@r-project.org
Objet : [R] Bug in R 3.6.0?
Hi,
I am using the R 3.6.0 on windows. The issue that I report below does not
exist with previous version of R.
In order to reproduce the error you must install a package of your choice
from source (tar.gz).
-Create a .Rprofile file with the following command in it : setwd("D:/")
-Close your R session and re-open it. Your working directory must be now
set to D:
-Install a package of your choice from source, example :
install.packages("data.table",type="source")
In my case the package fail to install and I get the following error
message:
** R
** inst
** byte-compile and prepare package for lazy loading Error in
tools:::.read_description(file) :
file 'DESCRIPTION' does not exist
Calls: suppressPackageStartupMessages ... withCallingHandlers ->
.getRequiredPackages -> <Anonymous> -> <Anonymous> Execution halted
ERROR: lazy loading failed for package 'data.table'
* removing 'C:/Users/Morgan/Documents/R/win-library/3.6/data.table'
* restoring previous
'C:/Users/Morgan/Documents/R/win-library/3.6/data.table'
Warning in install.packages :
installation of package ‘data.table’ had non-zero exit status
Now remove the .Rprofile file, restart your R session and try to install
the package with the same command.
In that case everything should be installed just fine.
FYI the issue happens on macOS as well and I suspect it also does on all
linux systems.
My question: Is this expected or is it a bug?
Thank you
Best regards,
Morgan
[[alternative HTML version deleted]]
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
------------------------------
Message: 3
Date: Wed, 1 May 2019 00:57:43 +1000
From: =?UTF-8?Q?Catarina_Serra_Gon=C3=A7alves?= <catarin...@gmail.com>
To: r-help@r-project.org
Subject: [R] Time series (trend over time) for irregular sampling
dates and multiple sites
Message-ID:
<
caoqwjbvy+jky80sksmfc8tu-c+5qq-tzwad21xbygvjayyj...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
I have a dataset of marine debris items (number of items standardized per
effort: Items/(number of volunteers*Hours*Lenght)) taken from 2 main
locations (WA and Queensland) in Australia (8 Sub Sites in total: 4 in WA
and 4 in Queensland) at irregular sampling intervals over a period 15
years.
I want to test if there is a change over the years on the amount of debris
in these locations and more specifically a change after the implementation
of a mitigation strategy (in 2013).
Here’s the head of the data:[image: enter image description here]
<https://i.stack.imgur.com/VNIpb.png>Description of each one of the
varables in the dataframe:
*eventid *= each sampling (clean-up) event Location = Queensland and New
South Wales Sites = all the 9 sampling beaches
*Date *= specific dates for the clean-up events (day-month-year)
*Date1 *= specific dates for the clean-up events (day-month-year) on the
POSICXT format Year= Year of sampling event (2004 to 2018)
*Month*= Month of the sampling event (jan to dec)
*nMonth*= a number was determined to the respective month of the sampling
event (1 to 12)
*Day*= Day of sampling (1 to 31) Days = Days since the first date of clean
up = just another way of using the dates
*MARPOL *= before and after implementation (factor with 2 levels)
*DaysC *= days between sampling events for the same sites = number of days
since the previous clean-up event
*DaysI *= Days since intervention, all the dates before implementation are
zero, and after we count the number of days since the implementation date
(1 jan 2013)
*DaysIa*= same as DayI but instead of zero for before the intervention we
have negative values (days)
*Items *= number of fishing and shipping items counted in each clean-up
event
*Hours *= hours spent by all volunteers together at each clean up event
*Lenght *= Lenght of beach sampled by all volunteers together at each clean
up event volunteers = all volunteers at each clean up event
*HoursVolunteer *= hours spent bt each volunteer at each clean up event
(Hours/volunteers)
*Ieffort *= the items standarized by the effort (hours, volunteers and
lenght)
*GrossWeight & **GrossTotal are not relevant *
------------------------------
Problems:
My data has a few problems: (1) I think I will need to fix the effects of
seasonal variation (Monthly) and (2) of possible spatial correlation
(probability of finding an item is higher after finding one since they can
come from the same ship). (3) How do I handle the fact that the
measurements were not taken at a regular interval?
I was trying to use GAMs to analyse the data and see the trends over time.
The model I came across is the following:
m4<- gamm(Ieffort ~ s(DaysIa)+MARPOL+ s(nMonth, bs = "ps", k = 12),
random=list(Site=~1,Location=~1),data = d)
*thank you in advance.*
-
*Catarina Serra Gonçalves *
PhD candidate
Adrift Lab <https://adriftlab.org>
University of Tasmania <http://www.utas.edu.au/> | Institute for Marine
and
Antarctic Studies <http://www.imas.utas.edu.au/>
Launceston, TAS | Australia
Personal website <https://catarinasg.wixsite.com/acserra>
<https://catarinasg.wixsite.com/acserra>| E-mail <acse...@utas.edu.au> |
Twitter <https://twitter.com/CatarinaSerraG>
Research Gate
<https://www.researchgate.net/profile/Catarina_Serra_Goncalves> | Google
Scholar <https://scholar.google.pt/citations?user=8nBrRFwAAAAJ&hl=en>
[[alternative HTML version deleted]]
------------------------------
Message: 4
Date: Tue, 30 Apr 2019 08:28:37 -0700
From: Bert Gunter <bgunter.4...@gmail.com>
To: =?UTF-8?Q?Catarina_Serra_Gon=C3=A7alves?= <catarin...@gmail.com>
Cc: R-help <r-help@r-project.org>
Subject: Re: [R] Time series (trend over time) for irregular sampling
dates and multiple sites
Message-ID:
<CAGxFJbT2YSB1xcs0MajpeqtHbbn4T1ycYoSOBEFvMucFme1t=
g...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
I have 0 expertise, but I suggest that you check out the SPatioTemporal
taskview on CRAN (or possibly others, like environmetrics). You might also
want to move this to the R-Sig-geo list,where you probably are more likely
to find relevant expertise.
Cheers,
Bert
Bert Gunter
"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Tue, Apr 30, 2019 at 8:13 AM Catarina Serra Gonçalves <
catarin...@gmail.com> wrote:
I have a dataset of marine debris items (number of items standardized per
effort: Items/(number of volunteers*Hours*Lenght)) taken from 2 main
locations (WA and Queensland) in Australia (8 Sub Sites in total: 4 in WA
and 4 in Queensland) at irregular sampling intervals over a period 15
years.
I want to test if there is a change over the years on the amount of
debris
in these locations and more specifically a change after the
implementation
of a mitigation strategy (in 2013).
Here’s the head of the data:[image: enter image description here]
<https://i.stack.imgur.com/VNIpb.png>Description of each one of the
varables in the dataframe:
*eventid *= each sampling (clean-up) event Location = Queensland and New
South Wales Sites = all the 9 sampling beaches
*Date *= specific dates for the clean-up events (day-month-year)
*Date1 *= specific dates for the clean-up events (day-month-year) on the
POSICXT format Year= Year of sampling event (2004 to 2018)
*Month*= Month of the sampling event (jan to dec)
*nMonth*= a number was determined to the respective month of the sampling
event (1 to 12)
*Day*= Day of sampling (1 to 31) Days = Days since the first date of
clean
up = just another way of using the dates
*MARPOL *= before and after implementation (factor with 2 levels)
*DaysC *= days between sampling events for the same sites = number of
days
since the previous clean-up event
*DaysI *= Days since intervention, all the dates before implementation
are
zero, and after we count the number of days since the implementation date
(1 jan 2013)
*DaysIa*= same as DayI but instead of zero for before the intervention we
have negative values (days)
*Items *= number of fishing and shipping items counted in each clean-up
event
*Hours *= hours spent by all volunteers together at each clean up event
*Lenght *= Lenght of beach sampled by all volunteers together at each
clean
up event volunteers = all volunteers at each clean up event
*HoursVolunteer *= hours spent bt each volunteer at each clean up event
(Hours/volunteers)
*Ieffort *= the items standarized by the effort (hours, volunteers and
lenght)
*GrossWeight & **GrossTotal are not relevant *
------------------------------
Problems:
My data has a few problems: (1) I think I will need to fix the effects of
seasonal variation (Monthly) and (2) of possible spatial correlation
(probability of finding an item is higher after finding one since they
can
come from the same ship). (3) How do I handle the fact that the
measurements were not taken at a regular interval?
I was trying to use GAMs to analyse the data and see the trends over
time.
The model I came across is the following:
m4<- gamm(Ieffort ~ s(DaysIa)+MARPOL+ s(nMonth, bs = "ps", k = 12),
random=list(Site=~1,Location=~1),data = d)
*thank you in advance.*
-
*Catarina Serra Gonçalves *
PhD candidate
Adrift Lab <https://adriftlab.org>
University of Tasmania <http://www.utas.edu.au/> | Institute for Marine
and
Antarctic Studies <http://www.imas.utas.edu.au/>
Launceston, TAS | Australia
Personal website <https://catarinasg.wixsite.com/acserra>
<https://catarinasg.wixsite.com/acserra>| E-mail <acse...@utas.edu.au>
|
Twitter <https://twitter.com/CatarinaSerraG>
Research Gate
<https://www.researchgate.net/profile/Catarina_Serra_Goncalves> | Google
Scholar <https://scholar.google.pt/citations?user=8nBrRFwAAAAJ&hl=en>
[[alternative HTML version deleted]]
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
[[alternative HTML version deleted]]
------------------------------
Message: 5
Date: Tue, 30 Apr 2019 17:24:33 +0200
From: Jens Heumann <jens.heum...@students.unibe.ch>
To: <r-help@r-project.org>
Subject: [R] Passing formula as parameter to `lm` within `sapply`
causes error [BUG?]
Message-ID: <75abba2b-c528-460e-df92-08f8479ba...@students.unibe.ch>
Content-Type: text/plain; charset="utf-8"; Format="flowed"
Hi,
`lm` won't take formula as a parameter when it is within a `sapply`; see
example below. Please, could anyone either point me to a syntax error or
confirm that this might be a bug?
Best,
Jens
[Disclaimer: This is my first post here, following advice of how to
proceed with possible bugs from here: https://www.r-project.org/bugs.html]
SUMMARY
While `lm` alone accepts formula parameter `FO` well, the same within a
`sapply` causes an error. When putting everything as parameter but
formula `FO`, it's still working, though. All parameters work fine
within a similar `for` loop.
MCVE (see data / R-version at bottom)
> summary(lm(y ~ x, df1, df1[["z"]] == 1, df1[["w"]]))$coef[1, ]
Estimate Std. Error t value Pr(>|t|)
1.6269038 0.9042738 1.7991275 0.3229600
> summary(lm(FO, data, data[[st]] == st1, data[[ws]]))$coef[1, ]
Estimate Std. Error t value Pr(>|t|)
1.6269038 0.9042738 1.7991275 0.3229600
> sapply(unique(df1$z), function(s)
+ summary(lm(y ~ x, df1, df1[["z"]] == s, df1[[ws]]))$coef[1, ])
[,1] [,2] [,3]
Estimate 1.6269038 -0.1404174 -0.010338774
Std. Error 0.9042738 0.4577001 1.858138516
t value 1.7991275 -0.3067890 -0.005564049
Pr(>|t|) 0.3229600 0.8104951 0.996457853
> sapply(unique(data[[st]]), function(s)
+ summary(lm(FO, data, data[[st]] == s, data[[ws]]))$coef[1, ]) # !!!
Error in eval(substitute(subset), data, env) : object 's' not found
> sapply(unique(data[[st]]), function(s)
+ summary(lm(y ~ x, data, data[[st]] == s, data[[ws]]))$coef[1, ])
[,1] [,2] [,3]
Estimate 1.6269038 -0.1404174 -0.010338774
Std. Error 0.9042738 0.4577001 1.858138516
t value 1.7991275 -0.3067890 -0.005564049
Pr(>|t|) 0.3229600 0.8104951 0.996457853
> m <- matrix(NA, 4, length(unique(data[[st]])))
> for (s in unique(data[[st]])) {
+ m[, s] <- summary(lm(FO, data, data[[st]] == s, data[[ws]]))$coef[1, ]
+ }
> m
[,1] [,2] [,3]
[1,] 1.6269038 -0.1404174 -0.010338774
[2,] 0.9042738 0.4577001 1.858138516
[3,] 1.7991275 -0.3067890 -0.005564049
[4,] 0.3229600 0.8104951 0.996457853
# DATA #################################################################
df1 <- structure(list(x = c(1.37095844714667, -0.564698171396089,
0.363128411337339,
0.63286260496104, 0.404268323140999, -0.106124516091484, 1.51152199743894,
-0.0946590384130976, 2.01842371387704), y = c(1.30824434809425,
0.740171482827397, 2.64977380403845, -0.755998096151299, 0.125479556323628,
-0.239445852485142, 2.14747239550901, -0.37891195982917, -0.638031707027734
), z = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), w = c(0.7, 0.8,
1.2, 0.9, 1.3, 1.2, 0.8, 1, 1)), class = "data.frame", row.names = c(NA,
-9L))
FO <- y ~ x; data <- df1; st <- "z"; ws <- "w"; st1 <- 1
########################################################################
> R.version
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 6.0
year 2019
month 04
day 26
svn rev 76424
language R
version.string R version 3.6.0 (2019-04-26)
nickname Planting of a Tree
#########################################################################
NOTE: Question on SO two days ago
(
https://stackoverflow.com/questions/55893189/passing-formula-as-parameter-to-lm-within-sapply-causes-error-bug-confirmation)
brought many views but neither answer nor bug confirmation.
------------------------------
Message: 6
Date: Mon, 29 Apr 2019 21:38:00 +0300
From: Haddison Mureithi <mureithihaddi...@gmail.com>
To: r-help@r-project.org
Subject: [R] (no subject)
Message-ID:
<CABVwvn6y_M2M1o41HryKYp=
lqcbsajdtginyw_rpvf81o4b...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hello guys this problem was never answered and I happened to come across
the same problem , kindly help. This is a simple R program that I have been
trying to run. I keep running into the "singular matrix" error. I end up
with no sensible results. Can anyone suggest any changes or a way around
this?
I am a total rookie when working with R.
Thanks,
Rasika
library(survival)
Loading required package: splines
args(coxph)
function (formula, data, weights, subset, na.action, init, control,
method = c("efron", "breslow", "exact"), singular.ok = TRUE,
robust = FALSE, model = FALSE, x = FALSE, y = TRUE, tt, ...)
NULL
test1<-read.table("S:/FISHDO/03_Phase_I_Field_Work/Data_6_28_2011/Working
Folder/R_files/4SondesJuly24.csv", header=T, sep=",")
sondes<-coxph(Surv(Start, Stop, Depart)~DOLoomis + DOI55 + DODamen,
data=test1)
Warning messages:
1: In fitter(X, Y, strats, offset, init, control, weights = weights, :
Loglik converged before variable 1,2 ; beta may be infinite.
2: In coxph(Surv(Start, Stop, Depart) ~ DOLoomis + DOI55 + DODamen, :
X matrix deemed to be singular; variable 3
summary(sondes)
Call:
coxph(formula = Surv(Start, Stop, Depart) ~ DOLoomis + DOI55 +
DODamen, data = test1)
n= 1737, number of events= 58
(1 observation deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
DOLoomis -2.152e+00 1.163e-01 1.161e+05 0 1
DOI55 4.560e-01 1.578e+00 3.755e+04 0 1
DODamen NA NA 0.000e+00 NA NA
exp(coef) exp(-coef) lower .95 upper .95
DOLoomis 0.1163 8.5995 0 Inf
DOI55 1.5777 0.6338 0 Inf
DODamen NA NA NA NA
Concordance= 0.5 (se = 0 )
Rsquare= 0 (max possible= 0.01 )
Likelihood ratio test= 0 on 2 df, p=1
Wald test = 0 on 2 df, p=1
Score (logrank) test = 0 on 2 df, p=1
[[alternative HTML version deleted]]
------------------------------
Message: 7
Date: Tue, 30 Apr 2019 16:50:48 +0000
From: Bill Poling <bill.pol...@zelis.com>
To: "r-help (r-help@r-project.org)" <r-help@r-project.org>
Subject: [R] Help with loop for column means into new column by a
subset Factor w/131 levels
Message-ID:
<
bn7pr02mb50737455e93f882b58eaa4f4ea...@bn7pr02mb5073.namprd02.prod.outlook.com
Content-Type: text/plain; charset="windows-1252"
Good afternoon.
#RStudio Version 1.1.456
sessionInfo()
#R version 3.5.3 (2019-03-11)
#Platform: x86_64-w64-mingw32/x64 (64-bit)
#Running under: Windows >= 8 x64 (build 9200)
#I have a DF of 8 columns and 14025 rows
str(hcd2tmp2)
# 'data.frame':14025 obs. of 8 variables:
# $ Submitted_Charge: num 21021 15360 40561 29495 7904 ...
# $ Allowed_Amt : num 18393 6254 40561 29495 7904 ...
# $ Submitted_Units : num 60 240 420 45 120 215 215 15 57 2 ...
# $ Procedure_Code1 : Factor w/ 131 levels "A9606","J0129",..: 43 113 117
125 24 85 85 90 86 25 ...
# $ AllowByLimit : num 4.268 0.949 7.913 6.124 3.524 ...
# $ UnitsByDose : num 600 240 420 450 120 215 215 750 570 500 ...
# $ LimitByUnits : num 4310 6591 5126 4816 2243 ...
# $ HCPCSCodeDose1 : num 10 1 1 10 1 1 1 50 10 250 ...
#I would like to create four additional columns that are the mean of four
current columns in the DF.
#Current columns
#Allowed_Amt
#LimitByUnits
#AllowByLimit
#UnitsByDose
#The goal is to be able to identify rows where (for instance) Allowed_Amt
is greater than the average (aka outliers).
#The trick Is I want the means of those columns based on a Factor value
#The Factor is:
#Procedure_Code1 : Factor w/ 131 levels "A9606","J0129"
#So each of my four new columns will have 131 distinct values based on the
mean for the specific Procedure_Code1 grouping
#In SQL it would look something like this:
#SELECT *,
# NewCol1 = mean(Allowed_Amt) OVER (PARTITION BY Procedure_Code1),
# NewCol2 = mean(LimitByUnits) OVER (PARTITION BY Procedure_Code1),
# NewCol3 = mean(AllowByLimit) OVER (PARTITION BY Procedure_Code1),
# NewCol4 = mean(UnitsByDose) OVER (PARTITION BY Procedure_Code1)
#INTO NewTable
#FROM Oldtable
#Here are some sample data
head(hcd2tmp2, n=40)
# Submitted_Charge Allowed_Amt Submitted_Units Procedure_Code1
AllowByLimit UnitsByDose LimitByUnits HCPCSCodeDose1
# 1 21020.70 18393.12 60 J1745
4.2679810 600 4309.56 10
# 2 15360.00 6254.40 240 J9299
0.9488785 240 6591.36 1
# 3 40561.32 40561.32 420 J9306
7.9133539 420 5125.68 1
# 4 29495.25 29495.25 45 J9355
6.1244417 450 4815.99 10
# 5 7904.30 7904.30 120 J0897
3.5243000 120 2242.80 1
# 6 15331.95 10614.31 215 J9034
2.0586686 215 5155.91 1
# 7 15331.95 10614.31 215 J9034
2.0586686 215 5155.91 1
# 8 461.90 0.00 15 J9045
0.0000000 750 46.38 50
# 9 27340.96 15092.21 57 J9035
3.2600227 570 4629.48 10
# 10 768.00 576.00 2 J1190
1.3617343 500 422.99 250
# 11 101.00 38.38 5 J2250
59.9687500 5 0.64 1
# 12 17458.40 0.00 200 J9033
0.0000000 200 5990.00 1
# 13 7885.10 7569.70 1 J1745
105.3835445 10 71.83 10
# 14 2015.00 1155.78 4 J2785
5.0051100 0 230.92 0
# 15 443.72 443.72 12 J9045
11.9601078 600 37.10 50
# 16 113750.00 113750.00 600 J2350
3.3025003 600 34443.60 1
# 17 3582.85 3582.85 10 J2469
30.5573561 250 117.25 25
# 18 5152.65 5152.65 50 J2796
1.4362988 500 3587.45 10
# 19 5152.65 5152.65 50 J2796
1.4362988 500 3587.45 10
# 20 39664.09 0.00 74 J9355
0.0000000 740 7919.63 10
# 21 166.71 102.53 9 J9045
3.6841538 450 27.83 50
# 22 13823.61 9676.53 1 J2505
2.0785247 6 4655.48 6
# 23 90954.00 26436.53 360 J1786
1.7443775 3600 15155.28 10
# 24 4800.00 3494.40 800 J3262
0.8861838 800 3943.20 1
# 25 216.00 105.84 4 J0696
42.3360000 1000 2.50 250
# 26 5300.00 4770.00 1 J0178
4.9677151 1 960.20 1
# 27 35203.00 35203.00 200 J9271
3.5772498 200 9840.80 1
# 28 17589.15 17589.15 300 J3380
2.9696855 300 5922.90 1
# 29 18394.64 17842.79 1 J9355
166.7238834 10 107.02 10
# 30 770.00 731.50 10 J2469
6.2388060 250 117.25 25
# 31 461.90 0.00 15 J9045
0.0000000 750 46.38 50
# 32 8160.00 3342.40 80 J1459
1.0260818 40000 3257.44 500
# 33 1653.48 314.16 6 J9305
0.7661505 60 410.05 10
# 34 13036.50 0.00 194 J9034
0.0000000 194 4652.31 1
# 35 10486.87 0.00 156 J9034
0.0000000 156 3741.04 1
# 36 15360.00 6254.40 240 J9299
0.9488785 240 6591.36 1
# 37 1616.83 1616.83 150 J1453
5.2528590 150 307.80 1
# 38 80685.74 34772.43 96 J9035
4.4597077 960 7797.02 10
# 39 85220.58 35925.13 287 J9299
4.5577715 287 7882.17 1
# 40 3860.17 1627.27 13 J9299
4.5577963 13 357.03 1
#I hope this is enough inforamtion to warrant your support
#Thank you
#WHP
Confidentiality Notice This message is sent from Zelis. ...{{dropped:13}}
------------------------------
Message: 8
Date: Tue, 30 Apr 2019 18:45:40 +0000
From: Bill Poling <bill.pol...@zelis.com>
To: "r-help (r-help@r-project.org)" <r-help@r-project.org>
Subject: Re: [R] Help with loop for column means into new column by a
subset Factor w/131 levels
Message-ID:
<
bn7pr02mb5073d732498ab265872f5750ea...@bn7pr02mb5073.namprd02.prod.outlook.com
Content-Type: text/plain; charset="windows-1252"
I ran this routine but I was thinking there must be a more elegant way of
doing this.
#
https://community.rstudio.com/t/how-to-average-mean-variables-in-r-based-on-the-level-of-another-variable-and-save-this-as-a-new-variable/8764/8
hcd2tmp2_summmary <- hcd2tmp2 %>%
select(.) %>%
group_by(Procedure_Code1) %>%
summarize(average = mean(Allowed_Amt))
# A tibble: 131 x 2
# Procedure_Code1 average
# <fct> <dbl>
# 1 A9606 57785.
# 2 J0129 5420.
# 3 J0178 4700.
# 4 J0180 13392.
# 5 J0202 56328.
# 6 J0256 17366.
# 7 J0257 7563.
# 8 J0485 2450.
# 9 J0490 6398.
# 10 J0585 4492.
# ... with 121 more rows
hcd2tmp2 <- hcd2tmp %>%
group_by(Procedure_Code1) %>%
summarise(Avg_Allowed_Amt = mean(Allowed_Amt))
view(hcd2tmp2)
hcd2tmp3 <- hcd2tmp %>%
group_by(Procedure_Code1) %>%
summarise(Avg_AllowByLimit = mean(AllowByLimit))
view(hcd2tmp3)
hcd2tmp4 <- hcd2tmp %>%
group_by(Procedure_Code1) %>%
summarise(Avg_UnitsByDose = mean(UnitsByDose))
view(hcd2tmp4)
hcd2tmp5 <- hcd2tmp %>%
group_by(Procedure_Code1) %>%
summarise(Avg_LimitByUnits = mean(LimitByUnits))
view(hcd2tmp5)
#Joins----
hcd2tmp <- left_join(hcd2tmp2, hcd2tmp, by =
c("Procedure_Code1"="Procedure_Code1"))
hcd2tmp <- left_join(hcd2tmp3, hcd2tmp, by =
c("Procedure_Code1"="Procedure_Code1"))
hcd2tmp <- left_join(hcd2tmp4, hcd2tmp, by =
c("Procedure_Code1"="Procedure_Code1"))
hcd2tmp <- left_join(hcd2tmp5, hcd2tmp, by =
c("Procedure_Code1"="Procedure_Code1"))
view(hcd2tmp)
hcd2tmp$Avg_LimitByUnits <- round(hcd2tmp$Avg_LimitByUnits, digits = 2)
hcd2tmp$Avg_Allowed_Amt <- round(hcd2tmp$Avg_Allowed_Amt, digits = 2)
hcd2tmp$Avg_AllowByLimit <- round(hcd2tmp$Avg_AllowByLimit, digits = 2)
hcd2tmp$Avg_UnitsByDose <- round(hcd2tmp$Avg_UnitsByDose, digits = 2)
view(hcd2tmp)
#Over under columns----
hcd2tmp$AllowByLimitFlag <- hcd2tmp$AllowByLimit > hcd2tmp$Avg_AllowByLimit
hcd2tmp$LimitByUnitsFlag <- hcd2tmp$LimitByUnits > hcd2tmp$Avg_LimitByUnits
hcd2tmp$Allowed_AmtFlag <- hcd2tmp$Allowed_Amt > hcd2tmp$Avg_Allowed_Amt
hcd2tmp$UnitsByDoseFlag <- hcd2tmp$UnitsByDose > hcd2tmp$Avg_UnitsByDose
view(hcd2tmp)
-----Original Message-----
From: Bill Poling
Sent: Tuesday, April 30, 2019 12:51 PM
To: r-help (r-help@r-project.org) <r-help@r-project.org>
Cc: Bill Poling <bill.pol...@zelis.com>
Subject: Help with loop for column means into new column by a subset
Factor w/131 levels
Good afternoon.
#RStudio Version 1.1.456
sessionInfo()
#R version 3.5.3 (2019-03-11)
#Platform: x86_64-w64-mingw32/x64 (64-bit) #Running under: Windows >= 8
x64 (build 9200)
#I have a DF of 8 columns and 14025 rows
str(hcd2tmp2)
# 'data.frame':14025 obs. of 8 variables:
# $ Submitted_Charge: num 21021 15360 40561 29495 7904 ...
# $ Allowed_Amt : num 18393 6254 40561 29495 7904 ...
# $ Submitted_Units : num 60 240 420 45 120 215 215 15 57 2 ...
# $ Procedure_Code1 : Factor w/ 131 levels "A9606","J0129",..: 43 113 117
125 24 85 85 90 86 25 ...
# $ AllowByLimit : num 4.268 0.949 7.913 6.124 3.524 ...
# $ UnitsByDose : num 600 240 420 450 120 215 215 750 570 500 ...
# $ LimitByUnits : num 4310 6591 5126 4816 2243 ...
# $ HCPCSCodeDose1 : num 10 1 1 10 1 1 1 50 10 250 ...
#I would like to create four additional columns that are the mean of four
current columns in the DF.
#Current columns
#Allowed_Amt
#LimitByUnits
#AllowByLimit
#UnitsByDose
#The goal is to be able to identify rows where (for instance) Allowed_Amt
is greater than the average (aka outliers).
#The trick Is I want the means of those columns based on a Factor value
#The Factor is:
#Procedure_Code1 : Factor w/ 131 levels "A9606","J0129"
#So each of my four new columns will have 131 distinct values based on the
mean for the specific Procedure_Code1 grouping
#In SQL it would look something like this:
#SELECT *,
# NewCol1 = mean(Allowed_Amt) OVER (PARTITION BY Procedure_Code1),
# NewCol2 = mean(LimitByUnits) OVER (PARTITION BY Procedure_Code1),
# NewCol3 = mean(AllowByLimit) OVER (PARTITION BY Procedure_Code1),
# NewCol4 = mean(UnitsByDose) OVER (PARTITION BY Procedure_Code1)
#INTO NewTable
#FROM Oldtable
#Here are some sample data
head(hcd2tmp2, n=40)
# Submitted_Charge Allowed_Amt Submitted_Units Procedure_Code1
AllowByLimit UnitsByDose LimitByUnits HCPCSCodeDose1
# 1 21020.70 18393.12 60 J1745
4.2679810 600 4309.56 10
# 2 15360.00 6254.40 240 J9299
0.9488785 240 6591.36 1
# 3 40561.32 40561.32 420 J9306
7.9133539 420 5125.68 1
# 4 29495.25 29495.25 45 J9355
6.1244417 450 4815.99 10
# 5 7904.30 7904.30 120 J0897
3.5243000 120 2242.80 1
# 6 15331.95 10614.31 215 J9034
2.0586686 215 5155.91 1
# 7 15331.95 10614.31 215 J9034
2.0586686 215 5155.91 1
# 8 461.90 0.00 15 J9045
0.0000000 750 46.38 50
# 9 27340.96 15092.21 57 J9035
3.2600227 570 4629.48 10
# 10 768.00 576.00 2 J1190
1.3617343 500 422.99 250
# 11 101.00 38.38 5 J2250
59.9687500 5 0.64 1
# 12 17458.40 0.00 200 J9033
0.0000000 200 5990.00 1
# 13 7885.10 7569.70 1 J1745
105.3835445 10 71.83 10
# 14 2015.00 1155.78 4 J2785
5.0051100 0 230.92 0
# 15 443.72 443.72 12 J9045
11.9601078 600 37.10 50
# 16 113750.00 113750.00 600 J2350
3.3025003 600 34443.60 1
# 17 3582.85 3582.85 10 J2469
30.5573561 250 117.25 25
# 18 5152.65 5152.65 50 J2796
1.4362988 500 3587.45 10
# 19 5152.65 5152.65 50 J2796
1.4362988 500 3587.45 10
# 20 39664.09 0.00 74 J9355
0.0000000 740 7919.63 10
# 21 166.71 102.53 9 J9045
3.6841538 450 27.83 50
# 22 13823.61 9676.53 1 J2505
2.0785247 6 4655.48 6
# 23 90954.00 26436.53 360 J1786
1.7443775 3600 15155.28 10
# 24 4800.00 3494.40 800 J3262
0.8861838 800 3943.20 1
# 25 216.00 105.84 4 J0696
42.3360000 1000 2.50 250
# 26 5300.00 4770.00 1 J0178
4.9677151 1 960.20 1
# 27 35203.00 35203.00 200 J9271
3.5772498 200 9840.80 1
# 28 17589.15 17589.15 300 J3380
2.9696855 300 5922.90 1
# 29 18394.64 17842.79 1 J9355
166.7238834 10 107.02 10
# 30 770.00 731.50 10 J2469
6.2388060 250 117.25 25
# 31 461.90 0.00 15 J9045
0.0000000 750 46.38 50
# 32 8160.00 3342.40 80 J1459
1.0260818 40000 3257.44 500
# 33 1653.48 314.16 6 J9305
0.7661505 60 410.05 10
# 34 13036.50 0.00 194 J9034
0.0000000 194 4652.31 1
# 35 10486.87 0.00 156 J9034
0.0000000 156 3741.04 1
# 36 15360.00 6254.40 240 J9299
0.9488785 240 6591.36 1
# 37 1616.83 1616.83 150 J1453
5.2528590 150 307.80 1
# 38 80685.74 34772.43 96 J9035
4.4597077 960 7797.02 10
# 39 85220.58 35925.13 287 J9299
4.5577715 287 7882.17 1
# 40 3860.17 1627.27 13 J9299
4.5577963 13 357.03 1
#I hope this is enough inforamtion to warrant your support
#Thank you
#WHP
Confidentiality Notice This message is sent from Zelis. ...{{dropped:13}}
------------------------------
Message: 9
Date: Tue, 30 Apr 2019 15:24:57 -0400
From: Matthew <mccorm...@molbio.mgh.harvard.edu>
To: "r-help (r-help@r-project.org)" <r-help@r-project.org>
Subject: [R] transpose and split dataframe
Message-ID:
<0d6ac524-4291-ab03-6bcb-592b3996c...@molbio.mgh.harvard.edu>
Content-Type: text/plain; charset="utf-8"; Format="flowed"
I have a data frame that is a lot bigger but for simplicity sake we can
say it looks like this:
Regulator hits
AT1G69490 AT4G31950,AT5G24110,AT1G26380,AT1G05675
AT2G55980 AT2G85403,AT4G89223
In other words:
data.frame : 2 obs. of 2 variables
$Regulator: Factor w/ 2 levels
$hits : Factor w/ 6 levels
I want to transpose it so that Regulator is now the column headings
and each of the AGI numbers now separated by commas is a row. So,
AT1G69490 is now the header of the first column and AT4G31950 is row 1
of column 1, AT5G24110 is row 2 of column 1, etc. AT2G55980 is header of
column 2 and AT2G85403 is row 1 of column 2, etc.
I have tried playing around with strsplit(TF2list[2:2]) and
strsplit(as.character(TF2list[2:2]), but I am getting nowhere.
Matthew
------------------------------
Message: 10
Date: Tue, 30 Apr 2019 21:04:50 +0000
From: David L Carlson <dcarl...@tamu.edu>
To: "r-help@r-project.org" <r-help@r-project.org>, Matthew
<mccorm...@molbio.mgh.harvard.edu>
Subject: Re: [R] transpose and split dataframe
Message-ID: <db8cede89a724defb691cea72a25b...@tamu.edu>
Content-Type: text/plain; charset="utf-8"
I neglected to copy this to the list:
I think we need more information. Can you give us the structure of the
data with str(YourDataFrame). Alternatively you could copy a small piece
into your email message by copying and pasting the results of the following
code:
dput(head(YourDataFrame))
The data frame you present could not be a data frame since you say "hits"
is a factor with a variable number of elements. If each value of "hits" was
a single character string, it would only have 2 factor levels not 6 and
your efforts to parse the string would make more sense. Transposing to a
data frame would only be possible if each column was padded with NAs to
make them equal in length. Since your example tries use the name TF2list,
it is possible that you do not have a data frame but a list and you have no
factor levels, just character vectors.
If you are not familiar with R, it may be helpful to tell us what your
overall goal is rather than an intermediate step. Very likely R can easily
handle what you want by doing things a different way.
----------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77843-4352
-----Original Message-----
From: R-help <r-help-boun...@r-project.org> On Behalf Of Matthew
Sent: Tuesday, April 30, 2019 2:25 PM
To: r-help (r-help@r-project.org) <r-help@r-project.org>
Subject: [R] transpose and split dataframe
I have a data frame that is a lot bigger but for simplicity sake we can
say it looks like this:
Regulator hits
AT1G69490 AT4G31950,AT5G24110,AT1G26380,AT1G05675
AT2G55980 AT2G85403,AT4G89223
In other words:
data.frame : 2 obs. of 2 variables
$Regulator: Factor w/ 2 levels
$hits : Factor w/ 6 levels
I want to transpose it so that Regulator is now the column headings
and each of the AGI numbers now separated by commas is a row. So,
AT1G69490 is now the header of the first column and AT4G31950 is row 1
of column 1, AT5G24110 is row 2 of column 1, etc. AT2G55980 is header of
column 2 and AT2G85403 is row 1 of column 2, etc.
I have tried playing around with strsplit(TF2list[2:2]) and
strsplit(as.character(TF2list[2:2]), but I am getting nowhere.
Matthew
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
------------------------------
Message: 11
Date: Tue, 30 Apr 2019 15:03:09 -0600
From: David Winsemius <dwinsem...@comcast.net>
To: Jens Heumann <jens.heum...@students.unibe.ch>
Cc: r-help@r-project.org
Subject: Re: [R] Passing formula as parameter to `lm` within `sapply`
causes error [BUG?]
Message-ID: <924255d4-912e-4c24-8e85-6e313ec50...@comcast.net>
Content-Type: text/plain; charset="utf-8"
Try using do.call
—
David
Sent from my iPhone
On Apr 30, 2019, at 9:24 AM, Jens Heumann <
jens.heum...@students.unibe.ch> wrote:
Hi,
`lm` won't take formula as a parameter when it is within a `sapply`; see
example below. Please, could anyone either point me to a syntax error or
confirm that this might be a bug?
Best,
Jens
[Disclaimer: This is my first post here, following advice of how to
proceed with possible bugs from here: https://www.r-project.org/bugs.html]
SUMMARY
While `lm` alone accepts formula parameter `FO` well, the same within a
`sapply` causes an error. When putting everything as parameter but formula
`FO`, it's still working, though. All parameters work fine within a similar
`for` loop.
MCVE (see data / R-version at bottom)
summary(lm(y ~ x, df1, df1[["z"]] == 1, df1[["w"]]))$coef[1, ]
Estimate Std. Error t value Pr(>|t|)
1.6269038 0.9042738 1.7991275 0.3229600
summary(lm(FO, data, data[[st]] == st1, data[[ws]]))$coef[1, ]
Estimate Std. Error t value Pr(>|t|)
1.6269038 0.9042738 1.7991275 0.3229600
sapply(unique(df1$z), function(s)
+ summary(lm(y ~ x, df1, df1[["z"]] == s, df1[[ws]]))$coef[1, ])
[,1] [,2] [,3]
Estimate 1.6269038 -0.1404174 -0.010338774
Std. Error 0.9042738 0.4577001 1.858138516
t value 1.7991275 -0.3067890 -0.005564049
Pr(>|t|) 0.3229600 0.8104951 0.996457853
sapply(unique(data[[st]]), function(s)
+ summary(lm(FO, data, data[[st]] == s, data[[ws]]))$coef[1, ]) # !!!
Error in eval(substitute(subset), data, env) : object 's' not found
sapply(unique(data[[st]]), function(s)
+ summary(lm(y ~ x, data, data[[st]] == s, data[[ws]]))$coef[1, ])
[,1] [,2] [,3]
Estimate 1.6269038 -0.1404174 -0.010338774
Std. Error 0.9042738 0.4577001 1.858138516
t value 1.7991275 -0.3067890 -0.005564049
Pr(>|t|) 0.3229600 0.8104951 0.996457853
m <- matrix(NA, 4, length(unique(data[[st]])))
for (s in unique(data[[st]])) {
+ m[, s] <- summary(lm(FO, data, data[[st]] == s, data[[ws]]))$coef[1,
]
+ }
m
[,1] [,2] [,3]
[1,] 1.6269038 -0.1404174 -0.010338774
[2,] 0.9042738 0.4577001 1.858138516
[3,] 1.7991275 -0.3067890 -0.005564049
[4,] 0.3229600 0.8104951 0.996457853
# DATA #################################################################
df1 <- structure(list(x = c(1.37095844714667, -0.564698171396089,
0.363128411337339,
0.63286260496104, 0.404268323140999, -0.106124516091484,
1.51152199743894,
-0.0946590384130976, 2.01842371387704), y = c(1.30824434809425,
0.740171482827397, 2.64977380403845, -0.755998096151299,
0.125479556323628,
-0.239445852485142, 2.14747239550901, -0.37891195982917,
-0.638031707027734
), z = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), w = c(0.7, 0.8,
1.2, 0.9, 1.3, 1.2, 0.8, 1, 1)), class = "data.frame", row.names = c(NA,
-9L))
FO <- y ~ x; data <- df1; st <- "z"; ws <- "w"; st1 <- 1
########################################################################
R.version
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 6.0
year 2019
month 04
day 26
svn rev 76424
language R
version.string R version 3.6.0 (2019-04-26)
nickname Planting of a Tree
#########################################################################
NOTE: Question on SO two days ago (
https://stackoverflow.com/questions/55893189/passing-formula-as-parameter-to-lm-within-sapply-causes-error-bug-confirmation)
brought many views but neither answer nor bug confirmation.
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
------------------------------
Message: 12
Date: Tue, 30 Apr 2019 17:31:28 -0400
From: Matthew <mccorm...@molbio.mgh.harvard.edu>
To: "r-help@r-project.org" <r-help@r-project.org>
Subject: [R] Fwd: Re: transpose and split dataframe
Message-ID:
<e4a9e321-b437-eed6-344b-472319e85...@molbio.mgh.harvard.edu>
Content-Type: text/plain; charset="utf-8"
Thanks for your reply. I was trying to simplify it a little, but must
have got it wrong. Here is the real dataframe, TF2list:
str(TF2list)
'data.frame': 152 obs. of 2 variables:
$ Regulator: Factor w/ 87 levels "AT1G02065","AT1G13960",..: 17 6 6 54
54 82 82 82 82 82 ...
$ hits : Factor w/ 97 levels
"AT1G05675,AT3G12910,AT1G22810,AT1G14540,AT1G21120,AT1G07160,AT5G22520,AT1G56250,AT2G31345,AT5G22530,AT4G11170,A"|
__truncated__,..: 65 57 90 57 87 57 56 91 31 17 ...
And the first few lines resulting from dput(head(TF2list)):
dput(head(TF2list))
structure(list(Regulator = structure(c(17L, 6L, 6L, 54L, 54L,
82L), .Label = c("AT1G02065", "AT1G13960", "AT1G18860", "AT1G23380",
"AT1G29280", "AT1G29860", "AT1G30650", "AT1G55600", "AT1G62300",
"AT1G62990", "AT1G64000", "AT1G66550", "AT1G66560", "AT1G66600",
"AT1G68150", "AT1G69310", "AT1G69490", "AT1G69810", "AT1G70510", ...
This is another way of looking at the first 4 entries (Regulator is
tab-separated from hits):
Regulator
hits
1
AT1G69490
AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830
2
AT1G29860
AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135
3
AT1G2986
AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830
So, the goal would be to
first: Transpose the existing dataframe so that the factor Regulator
becomes a column name (column 1 name = AT1G69490, column2 name
AT1G29860, etc.) and the hits associated with each Regulator become
rows. Hits is a comma separated 'list' ( I do not not know if
technically it is an R list.), so it would have to be comma
'unseparated' with each entry becoming a row (col 1 row 1 = AT4G31950,
col 1 row 2 - AT5G24410, etc); like this :
AT1G69490
AT4G31950
AT5G24110
AT1G05675
AT5G64905
... I did not include all the rows)
I think it would be best to actually make the first entry a separate
dataframe ( 1 column with name = AT1G69490 and number of rows depending
on the number of hits), then make the second column (column name =
AT1G29860, and number of rows depending on the number of hits) into a
new dataframe and do a full join of of the two dataframes; continue by
making the third column (column name = AT1G2986) into a dataframe and
full join it with the previous; continue for the 152 observations so
that then end result is a dataframe with 152 columns and number of rows
depending on the entry with the greatest number of hits. The full joins
I can do with dplyr, but getting up to that point seems rather difficult.
This would get me what my ultimate goal would be; each Regulator is a
column name (152 columns) and a given row has either NA or the same hit.
This seems very difficult to me, but I appreciate any attempt.
Matthew
On 4/30/2019 4:34 PM, David L Carlson wrote:
External Email - Use Caution
I think we need more information. Can you give us the structure of the
data with str(YourDataFrame). Alternatively you could copy a small piece
into your email message by copying and pasting the results of the following
code:
dput(head(YourDataFrame))
The data frame you present could not be a data frame since you say
"hits" is a factor with a variable number of elements. If each value of
"hits" was a single character string, it would only have 2 factor levels
not 6 and your efforts to parse the string would make more sense.
Transposing to a data frame would only be possible if each column was
padded with NAs to make them equal in length. Since your example tries use
the name TF2list, it is possible that you do not have a data frame but a
list and you have no factor levels, just character vectors.
If you are not familiar with R, it may be helpful to tell us what your
overall goal is rather than an intermediate step. Very likely R can easily
handle what you want by doing things a different way.
----------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77843-4352
-----Original Message-----
From: R-help<r-help-boun...@r-project.org> On Behalf Of Matthew
Sent: Tuesday, April 30, 2019 2:25 PM
To: r-help (r-help@r-project.org)<r-help@r-project.org>
Subject: [R] transpose and split dataframe
I have a data frame that is a lot bigger but for simplicity sake we can
say it looks like this:
Regulator hits
AT1G69490 AT4G31950,AT5G24110,AT1G26380,AT1G05675
AT2G55980 AT2G85403,AT4G89223
In other words:
data.frame : 2 obs. of 2 variables
$Regulator: Factor w/ 2 levels
$hits : Factor w/ 6 levels
I want to transpose it so that Regulator is now the column headings
and each of the AGI numbers now separated by commas is a row. So,
AT1G69490 is now the header of the first column and AT4G31950 is row 1
of column 1, AT5G24110 is row 2 of column 1, etc. AT2G55980 is header of
column 2 and AT2G85403 is row 1 of column 2, etc.
I have tried playing around with strsplit(TF2list[2:2]) and
strsplit(as.character(TF2list[2:2]), but I am getting nowhere.
Matthew
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guidehttp://
www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
[[alternative HTML version deleted]]
------------------------------
Message: 13
Date: Wed, 1 May 2019 07:46:32 +1000
From: Jim Lemon <drjimle...@gmail.com>
To: Matthew <mccorm...@molbio.mgh.harvard.edu>
Cc: "r-help (r-help@r-project.org)" <r-help@r-project.org>
Subject: Re: [R] transpose and split dataframe
Message-ID:
<CA+8X3fUjv3APb=
ucsnqad61pmosbvoybfsw3cazw7p11ed7...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi Matthew,
Is this what you are trying to do?
mmdf<-read.table(text="Regulator hits
AT1G69490 AT4G31950,AT5G24110,AT1G26380,AT1G05675
AT2G55980 AT2G85403,AT4G89223",header=TRUE,
stringsAsFactors=FALSE)
# split the second column at the commas
hitsplit<-strsplit(mmdf$hits,",")
# define a function that will fill with NAs
NAfill<-function(x,n) return(x[1:n])
# get the maximum length of hits
maxlen<-max(unlist(lapply(hitsplit,length)))
# fill the list with NAs
hitsplit<-lapply(hitsplit,NAfill,maxlen)
# change the names of the list
names(hitsplit)<-mmdf$Regulator
# convert to a data frame
tmmdf<-as.data.frame(hitsplit)
Jim
On Wed, May 1, 2019 at 5:25 AM Matthew <mccorm...@molbio.mgh.harvard.edu>
wrote:
I have a data frame that is a lot bigger but for simplicity sake we can
say it looks like this:
Regulator hits
AT1G69490 AT4G31950,AT5G24110,AT1G26380,AT1G05675
AT2G55980 AT2G85403,AT4G89223
In other words:
data.frame : 2 obs. of 2 variables
$Regulator: Factor w/ 2 levels
$hits : Factor w/ 6 levels
I want to transpose it so that Regulator is now the column headings
and each of the AGI numbers now separated by commas is a row. So,
AT1G69490 is now the header of the first column and AT4G31950 is row 1
of column 1, AT5G24110 is row 2 of column 1, etc. AT2G55980 is header of
column 2 and AT2G85403 is row 1 of column 2, etc.
I have tried playing around with strsplit(TF2list[2:2]) and
strsplit(as.character(TF2list[2:2]), but I am getting nowhere.
Matthew
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
------------------------------
Message: 14
Date: Wed, 1 May 2019 09:58:34 +1200
From: Abs Spurdle <spurdl...@gmail.com>
To: =?UTF-8?Q?Catarina_Serra_Gon=C3=A7alves?= <catarin...@gmail.com>
Cc: r-help <r-help@r-project.org>
Subject: Re: [R] Time series (trend over time) for irregular sampling
dates and multiple sites
Message-ID:
<
cab8pepxhybcxqpx5cauq868kmap80z+zsxh7lhak+xdabjo...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
My data has a few problems: (1) I think I will need to fix the effects of
seasonal variation (Monthly) and (2) of possible spatial correlation
(probability of finding an item is higher after finding one since they
can
come from the same ship). (3) How do I handle the fact that the
measurements were not taken at a regular interval?
Can I ask two questions:
(1) Is the data autocorrelated (or "Seasonal") over time?
If not then this problem is a lot simpler.
(2) Can you expand on the following statement?
"possible spatial correlation (probability of finding an item is higher
after finding one since they can come from the same ship"
[[alternative HTML version deleted]]
------------------------------
Message: 15
Date: Tue, 30 Apr 2019 22:29:24 +0000
From: David L Carlson <dcarl...@tamu.edu>
To: Matthew <mccorm...@molbio.mgh.harvard.edu>, "r-help@r-project.org"
<r-help@r-project.org>
Subject: Re: [R] Fwd: Re: transpose and split dataframe
Message-ID: <1d59b3c0584a40c1b322b0efd5de7...@tamu.edu>
Content-Type: text/plain; charset="utf-8"
If you read the data frame with read.csv() or one of the other read()
functions, use the asis=TRUE argument to prevent conversion to factors. If
not do the conversion first:
# Convert factors to characters
DataMatrix <- sapply(TF2list, as.character)
# Split the vector of hits
DataList <- sapply(DataMatrix[, 2], strsplit, split=",")
# Use the values in Regulator to name the parts of the list
names(DataList) <- DataMatrix[,"Regulator"]
# Now create a data frame
# How long is the longest list of hits?
mx <- max(sapply(DataList, length))
# Now add NAs to vectors shorter than mx
DataList2 <- lapply(DataList, function(x) c(x, rep(NA, mx-length(x))))
# Finally convert back to a data frame
TF2list2 <- do.call(data.frame, DataList2)
Try this on a portion of the list, say 25 lines and print each object to
see what is happening.
----------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77843-4352
-----Original Message-----
From: R-help <r-help-boun...@r-project.org> On Behalf Of Matthew
Sent: Tuesday, April 30, 2019 4:31 PM
To: r-help@r-project.org
Subject: [R] Fwd: Re: transpose and split dataframe
Thanks for your reply. I was trying to simplify it a little, but must
have got it wrong. Here is the real dataframe, TF2list:
str(TF2list)
'data.frame': 152 obs. of 2 variables:
$ Regulator: Factor w/ 87 levels "AT1G02065","AT1G13960",..: 17 6 6 54
54 82 82 82 82 82 ...
$ hits : Factor w/ 97 levels
"AT1G05675,AT3G12910,AT1G22810,AT1G14540,AT1G21120,AT1G07160,AT5G22520,AT1G56250,AT2G31345,AT5G22530,AT4G11170,A"|
__truncated__,..: 65 57 90 57 87 57 56 91 31 17 ...
And the first few lines resulting from dput(head(TF2list)):
dput(head(TF2list))
structure(list(Regulator = structure(c(17L, 6L, 6L, 54L, 54L,