Re: [R] How to decrease size of points?

2020-10-01 Thread Avi Gross via R-help
It is better to understand the requirements before suggesting a way to do it.

My GUESS is that the questioner wants circles of different sizes based on a
factor, but the natural choices start too large for their needs/taste. So they
want a base size of 0.8 as either the minimum or the maximum size. I offer a
verbal solution but end with what may be a simpler solution using other
functionality.

So I am proposing a solution but only if the above is what you want. If it is 
something else, state it clearly.

So say you want the sizes to go from 0.8 to 1.2. The approach offered can be 
tweaked to do the following in English.

Calculate the number of unique levels of the factor in the data. Make sure the 
factor is (re)ordered to have those N levels and in the right order.

Create a vector of N values but they should not all be 0.8. They should have 
the first one be 0.8, and the second would have added to that something like  
(1.2 - 0.8)/N and the next has double that added and so on. With the proper 
calculations, you get a smoothly increasing value for a size for the circles in 
a vector of the right size. Many other methods can be used like multiplying the 
preceding size by 1.1 and you can play rounding games or anything else. In the 
end, you have a_vector looking a bit like c(0.8, 0.84, 0.88, 0.92, ... 1.16, 
1.20) or whatever.
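In code, that whole recipe collapses to one call to seq(). A sketch, with a made-up factor standing in for the real Stage column:

```r
# Made-up factor standing in for the OP's Stage column (assumption).
Stage <- factor(c("I", "II", "III", "IV", "V"))

# N evenly spaced sizes running from 0.8 up to 1.2, one per level.
N <- nlevels(Stage)
a_vector <- seq(0.8, 1.2, length.out = N)
a_vector
#> [1] 0.8 0.9 1.0 1.1 1.2
```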

Now note the word "manual" in scale_size_manual(values = a_vector)

It means you are supplying the values manually rather than allowing ggplot to
choose them automatically. You still map aes(size = variable) so ggplot knows
which level each point has, but the actual sizes now come from your vector.

Just give it the vector of increasing sizes you wanted and it should use them
in that order, as long as the factor you use is in the order you want. If it is
a categorical variable with a required order, there are ways to get what you
want, either by making it ordered or by re-ordering the factor levels so the
hidden index values 1, 2, 3, ... correspond.

I have not tried this and am not supplying actual code, just the concept. But
this approach feels too long and cumbersome given how common a need like this
is likely to be.

A much better solution would be a way to specify a baseline minimum and perhaps 
a maximum and let the underlying ggplot print method figure things out. If the 
manual pages are correct, it may make sense to use other "scale" functions that 
allow you to specify a minimum and maximum size:


https://www.rdocumentation.org/packages/ggplot2/versions/1.0.1/topics/scale_size

scale_size_continuous(..., range = c(1, 6))
scale_size(..., range = c(1, 6))
scale_size_discrete(..., range = c(1, 6))

I see examples that suggest it works the way I want:

(p <- qplot(mpg, cyl, data=mtcars, size=cyl))
...
p + scale_size(range = c(0, 10))
p + scale_size(range = c(1, 2))

In the above you both ask ggplot to adjust the size to match the
variable/column called cyl and also provide a range for those values to be
distributed between.
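Applied back to the thread's Stage example, that idea might look like the sketch below. It is untested against the OP's data: the data frame here is invented, and mapping size to the factor's numeric codes is one way to sidestep the "Continuous value supplied to discrete scale" error seen earlier in the thread.

```r
library(ggplot2)

# Invented stand-in for the OP's data (x, y continuous; Stage a factor).
df1 <- data.frame(x = runif(30), y = runif(30),
                  Stage = factor(sample(c("A", "B", "C"), 30, replace = TRUE)))

# Map size to the factor's integer codes and let scale_size()
# squeeze the resulting sizes into the 0.8-1.2 range.
p1 <- ggplot(df1, aes(x, y)) +
  geom_point(aes(size = as.numeric(Stage)), alpha = 1/3) +
  scale_size(range = c(0.8, 1.2), name = "Stage")
```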

Sounds like a plan to try?

-Original Message-
From: R-help  On Behalf Of Medic
Sent: Wednesday, September 30, 2020 3:01 PM
To: Rui Barradas ; r-help@r-project.org
Subject: Re: [R] How to decrease size of points?

№1 Medic:
The code works as I want, but the points (circles) on the plot are too big. How 
to decrease them? Where to insert (for instance) size = 0.8 for points 
(circles) on plot?

p1 <- p + geom_point(aes(size = Stage), alpha = 1/3) + xlab ("X") +
ylab("Y") + geom_smooth()

Stage is factor, x and y - continuous
===
№2 Rui Barradas:
add the scale_size
p1 + scale_size_manual(values = 0.8)
===
№3 Medic:
Thanks Rui, but I got:
Error: Insufficient values in manual scale. 12 needed but only 1 provided.
(or Error: Continuous value supplied to discrete scale)
===
№4 Rui Barradas:
Try
nsize <- length(unique(df1$Stage))
before the plot and then
p1 + scale_size_manual(values = rep(0.8, nsize))
===
№5 Medic:
Rui, your example is very good!
Now your code works, but not as I want.

Why did I use:
geom_point(aes(size = Stage)...?
In order to receive points of DIFFERENT size!

And what does your code do?
It assigns the same fixed size to ALL points.

I don't need this.
I sincerely thank you and closing the topic!

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] how to get a numeric vector?

2020-10-04 Thread Avi Gross via R-help
Always hard to tell if THIS is a homework project. As with most things in R,
if you cannot find at least a dozen ways to do it, it is not worth doing.

The question (way below) was how to take two vectors of length two and build
a longer result by using the ":" operator to generate a range between the
first elements of the two vectors, then between the second elements, and
return the combined results as a single vector.

This can be done with a simple loop over the length of the two vectors, either
in-line or wrapped in a small function.

Something like this:

a <- c(1, 4)
b <- c(5, 8)
results <- c()

for (index in seq_along(a)) {
  results <- c(results, a[index]:b[index])
}

The above generalizes to any size vectors of the same length.

There are probably lots of ways to do this using functional programming
(after loading the tidyverse or just purrr), but here is one:

unlist( map2(a, b, `:`) )

Or more explicitly in an extended length-4 vector example:

library(purrr)
alpha <- c(1, 4, 10, 20)
beta  <- c(5, 8, 19, 27)
unlist(map2(.x = alpha,
            .y = beta,
            .f = `:`))

Result:

 [1]  1  2  3  4  5  4  5  6  7  8 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[25] 24 25 26 27

Be warned that the `:` operator happily works with descending or negative
numbers as well as numbers with decimal points. And you do NOT want to call the
above uses of `:` if you have any NA, NaN, or Inf in your original vectors.
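Base R also has a vectorised helper for exactly this pattern; since R 4.0.0, sequence() accepts a from argument:

```r
a <- c(1, 4)
b <- c(5, 8)

# sequence() builds each run a[i]:b[i] (of length b[i] - a[i] + 1)
# and concatenates the results.  Requires R >= 4.0.0 for `from`.
sequence(b - a + 1, from = a)
#> [1] 1 2 3 4 5 4 5 6 7 8
```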






-Original Message-
From: R-help  On Behalf Of vod vos via R-help
Sent: Sunday, October 4, 2020 6:47 PM
To: r-help 
Subject: [R] how to get a numeric vector?

Hi,

a <- c(1, 4)
b <- c(5, 8)

a:b

[1] 1 2 3 4 5
Warning messages:
1: In a:b : numerical expression has 2 elements: only the first used
2: In a:b : numerical expression has 2 elements: only the first used

how to get:

c(1:5, 4:8)

Thanks.




Re: [R] R-help Digest, Vol 212, Issue 4

2020-10-06 Thread Avi Gross via R-help
Nice alternative for some cases, but I do not get the desired result as one
long vector. I would change the last line to this:

unlist(as.vector(sapply(1:length(a),
                        FUN = function(x, a, b) a[x]:b[x],
                        a = a,
                        b = b)))

The indenting works better using a constant-width font, LOL.

With all these twists, including using the base R methods like uppercase
Map, the relative speed of these methods comes to mind. No doubt someone
will suggest rewriting this as yet another C function for speed.
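For completeness, the base-R Map variant alluded to above is a one-liner:

```r
a <- c(1, 4)
b <- c(5, 8)

# Map() is the base-R analogue of purrr::map2(); unlist() flattens
# the list of per-pair ranges into one vector.
unlist(Map(`:`, a, b))
#> [1] 1 2 3 4 5 4 5 6 7 8
```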

-Original Message-
From: R-help  On Behalf Of Izmirlian, Grant
(NIH/NCI) [E] via R-help
Sent: Monday, October 5, 2020 1:06 PM
To: 'r-help@r-project.org' 
Subject: Re: [R] R-help Digest, Vol 212, Issue 4

Hi -- there are lots of replies --I have not read them all, if someone else
suggested this, sorry for duplication. This is similar to the suggestion
using mapply, but not specific to matrices. In fact it's a kludge that
applies to many settings. You 'sapply' over the index 1:2, and pass a, b as
arguments:


a <- c(1,4)
b <- c(5,8)

sapply(1:2, FUN=function(x, a, b)a[x]:b[x], a=a,b=b)

-Original Message-
From: r-help-requ...@r-project.org 
Sent: Monday, October 05, 2020 6:04 AM
To: r-help@r-project.org
Subject: R-help Digest, Vol 212, Issue 4

Send R-help mailing list submissions to
r-help@r-project.org

To subscribe or unsubscribe via the World Wide Web, visit
https://stat.ethz.ch/mailman/listinfo/r-help
or, via email, send a message with subject or body 'help' to
r-help-requ...@r-project.org

You can reach the person managing the list at
r-help-ow...@r-project.org

When replying, please edit your Subject line so it is more specific than
"Re: Contents of R-help digest..."




Re: [R] counting duplicate items that occur in multiple groups

2020-11-17 Thread Avi Gross via R-help
Many problems can often be solved with some thought by using the right tools, 
such as the ones from the tidyverse.

Without giving a specific answer, you might want to think about using the 
group_by() functionality in a pipeline to lump together all rows that share the 
same value in several columns. Then, in something like a mutate() or 
summarize(), you can use special functions like n() that return how many rows 
exist within each grouping. There are many more such verbs and features that 
let you build things up, often removing the grouping along the way and perhaps 
adding some other form of grouping, including the new rowwise(), which lets you 
work across columns one row at a time.

I think the point is to think of steps that lead to a result that can be used 
in the next step and so on. 

And, for some problems, you can think outside the pipelines and create 
multiple intermediate data.frames with parts of what you will need, then 
combine them with joins or whatever it takes to efficiently get a result, even 
by brute force. Sometimes (as when making graphs) you might want to convert 
data between the forms often called long versus wide. 

Yes, plenty can be done in base R or using other packages. But a good set of 
tools might be part of what you need to investigate.

Of course, others can chime in suggesting that there are negatives to dplyr and 
other aspects of the tidyverse and they would be right too. 
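As a concrete (hedged) sketch of that group_by() idea, using the small example data Bill Dunlap posts later in this thread:

```r
library(dplyr)

Data1 <- data.frame(Vendor  = c("V1", "V2", "V3", "V4"),
                    Account = c("A1", "A2", "A2", "A2"))

# Within each Account, count the distinct Vendors; report 0 when the
# account is not actually shared (matching the expected output below).
result <- Data1 %>%
  group_by(Account) %>%
  mutate(Num_Vendors_Sharing_Bank_Acct =
           ifelse(n_distinct(Vendor) > 1, n_distinct(Vendor), 0L)) %>%
  ungroup()
result
```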


-Original Message-
From: R-help  On Behalf Of Tom Woolman
Sent: Tuesday, November 17, 2020 6:30 PM
To: Bill Dunlap 
Cc: r-help@r-project.org
Subject: Re: [R] counting duplicate items that occur in multiple groups

Hi Bill. Sorry to be so obtuse with the example data, I was trying (too hard) 
not to share any actual values so I just created randomized values for my 
example; of course I should have specified that the random values would not 
provide the expected problem pattern. I should have just used simple dummy 
codes as Bill Dunlap did.

So per Bill's example data for Data1, the expected (hoped for) output should be:

  Vendor Account Num_Vendors_Sharing_Bank_Acct
1 V1  A1  0
2 V2  A2  3
3 V3  A2  3
4 V4  A2  3


Where the new calculated variable is Num_Vendors_Sharing_Bank_Acct.  
The value is 3 for V2, V3 and V4 because they each share bank account A2.


Likewise, in the Data2 frame, the same logic applies:

  Vendor Account Num_Vendors_Sharing_Bank_Acct
1 V1  A1 0
2 V2  A2 3
3 V3  A2 3
4 V1  A2 3
5 V4  A3 0
6 V2  A4 0






Thanks!


Quoting Bill Dunlap :

> What should the result be for
>   Data1 <- data.frame(Vendor=c("V1","V2","V3","V4"),
> Account=c("A1","A2","A2","A2"))
> ?
>
> Must each vendor have only one account?  If not, what should the 
> result be for
>Data2 <- data.frame(Vendor=c("V1","V2","V3","V1","V4","V2"),
> Account=c("A1","A2","A2","A2","A3","A4"))
> ?
>
> -Bill
>
> On Tue, Nov 17, 2020 at 1:20 PM Tom Woolman 
> wrote:
>
>> Hi everyone.  I have a dataframe that is a collection of Vendor IDs 
>> plus a bank account number for each vendor. I'm trying to find a way 
>> to count the number of duplicate bank accounts that occur in more 
>> than one unique Vendor_ID, and then assign the count value for each 
>> row in the dataframe in a new variable.
>>
>> I can do a count of bank accounts that occur within the same vendor 
>> using dplyr and group_by and count, but I can't figure out a way to 
>> count duplicates among multiple Vendor_IDs.
>>
>>
>> Dataframe example code:
>>
>>
>> #Create a sample data frame:
>>
>> set.seed(1)
>>
>> Data <- data.frame(Vendor_ID = sample(1:1), Bank_Account_ID =
>> sample(1:1))
>>
>>
>>
>>
>> Thanks in advance for any help.
>>




Re: [R] Printing upon calling a function

2020-11-30 Thread Avi Gross via R-help
TOPIC: Why some returned values do not automatically print.

Again, not seeing the internals, my guess is that the function returned not the 
expected value but "invisible(expected)", which just marks it as not to be 
automatically printed.

So if you want it printed, ask for it explicitly as in:

print(me.probit(obj))

What you did was copy the object; the copy does not keep the invisibility, so 
when it is invoked it gets the default print action.
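The whole effect fits in a few-line demonstration:

```r
f <- function() invisible("hello")

f()          # prints nothing: the value is marked invisible
print(f())   # explicit print overrides the marking
#> [1] "hello"
v <- f(); v  # the copy in v is ordinary, so it auto-prints
#> [1] "hello"
```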


-Original Message-
From: R-help  On Behalf Of Steven Yen
Sent: Monday, November 30, 2020 4:42 AM
To: Jim Lemon ; r-help mailing list 
Subject: Re: [R] Printing upon calling a function

Thanks. I know, my point was on why I get something printed by simply doing 
line 1 below and at other occasions had to do line 2.

me.probit(obj)

v<-me.probit(obj); v

On 2020/11/30 05:33 PM, Jim Lemon wrote:
> Hi Steven,
> You seem to be assigning the result of me.oprobit(obj) to v instead of 
> printing it. By appending ";v" to that command line, you implicitly 
> call "print".
>
> Jim
>
> On Mon, Nov 30, 2020 at 7:15 PM Steven Yen  wrote:
>> I hope I can get away without presenting a replicable set of codes 
>> because doing so would impose burdens.
>>
>> I call a function which return a data frame, with the final line
>>
>> return(out)
>>
>> In one case the data frame gets printed (similar to a regression 
>> printout), with simply a call
>>
>> me.probit(obj)
>>
>> In another case with a similar function, I could not get the results 
>> printed and the only way to print is to do the following:
>>
>> v<-me.oprobit(obj); v
>>
>> This is a puzzle, and I hope to find some clues. Thanks to all.
>>
>> My function looks like the following:
>>
>> me.oprobit0 <- function(obj,mean=FALSE,vb.method,jindex=NA,
>> resampling=FALSE,ndraws=100,mc.method=1,times100=TRUE,
>>   Stata.mu=FALSE,testing=FALSE,digits=3){
>> ...
>> return(out) # out is a data frame
>> }
>>






Re: [R] Printing upon calling a function

2020-11-30 Thread Avi Gross via R-help
Steven,

You need to mention what you actually did to get proper advice. Your problem is 
at the source.

Simply put, the R interpreter behaves somewhat differently when the program is 
typed in directly (or slightly indirectly, as in RStudio) than when you ask it 
to open another file as a source, perhaps recursively.

If you read the manual page, "source()" has lots of options.

https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/source

The default is to change some of the behavior with the assumption there is no 
user typing things in and waiting for responses. One of the options will change 
the behavior to make statements auto-print.  Look to see if

source(..., echo=TRUE)

gets what you want. Obviously the "..." above is replaced with your additional 
code. Of course, an explicit print statement works too. You may also want to 
look at the print.eval option.
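A quick way to see the difference, using a throwaway script in a temporary file:

```r
# Write a one-line script to a temp file and source it both ways.
tmp <- tempfile(fileext = ".R")
writeLines("1 + 1", tmp)

source(tmp)              # silent: top-level results are not printed
source(tmp, echo = TRUE) # echoes the statement and prints its value
```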

What next? Will you now tell us you forgot that you used the 
sink(file="hide.txt") function in the code and wonder why it does not print to 
your terminal? 


-Original Message-
From: R-help  On Behalf Of Steven Yen
Sent: Monday, November 30, 2020 5:55 AM
To: Stefan Evert 
Cc: R-help Mailing List 
Subject: Re: [R] Printing upon calling a function

No. I wrote the function so I am sure no "invisible" command was used. 
Strangely enough, when compiling the function into part of a package, results 
were NOT printed. Yet if I call the function during the run, preceding the call 
with a line that attaches the source code:

source("A:/.../R/oprobit.R")

it did print. I do not understand.

On 2020/11/30 06:41 PM, Stefan Evert wrote:
>> On 30 Nov 2020, at 10:41, Steven Yen  wrote:
>>
>> Thanks. I know, my point was on why I get something printed by simply doing 
>> line 1 below and at other occasions had to do line 2.
>>
>> me.probit(obj)
> That means the return value of me.probit() has been marked as 
> invisible, so it won't auto-print.  You have to use an explicit print
>
>   print(me.probit(obj))
>
> or use your work-around to convince R that you actually meant to print the 
> output.
>
> If you dig through the full code of me.probit(), you'll probably find the 
> function invisible() called somewhere.
>
> Best,
> Stefan






Re: [R] Passing variable name

2020-12-28 Thread Avi Gross via R-help
There are endless ways to do what you want, Seyit. If you wish to remain in 
base R, using the names function on the left-hand side changes the names as in:

names(something) <- c("new", "names")

And in general, you may want to learn an alternate set of methods from the 
tidyverse that work well with ggplot, such as select(), which lets you choose 
which columns of a data.frame (or tibble) to keep while optionally renaming 
them, along with rename() and mutate(), and ways to combine them to do lots of 
work. They also let you avoid using within() the way you are doing.

It is amusing you want names like V1. Indeed some R functions default to such 
names.
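For the record, the piece the later attempt below is missing is that paste('MyData$', V1) only builds a character string; the usual idiom for a column whose name is stored in a variable is [[ indexing. A base-R sketch with toy data in the shape of the thread's MyData:

```r
# Toy data standing in for the thread's MyData (assumption).
MyData <- data.frame(Gender = c("F", "F", "M", "M", "M", "F"),
                     Hand   = c("R", "L", "R", "L", "R", "L"))

V1 <- "Gender"
V2 <- "Hand"

# MyData[[V1]] extracts the column whose name is stored in V1,
# unlike paste('MyData$', V1), which only makes a character string.
MyT <- table(MyData[[V1]], MyData[[V2]])
MyChi <- chisq.test(MyT)
# In ggplot2 (>= 3.0), aes(.data[[V1]], ...) plays the same role.
```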

-Original Message-
From: R-help  On Behalf Of Seyit Ali KAYIS
Sent: Monday, December 28, 2020 1:32 PM
To: 'Bert Gunter' ; 'R-help' ; 
erdogancev...@gmail.com; drjimle...@gmail.com
Subject: Re: [R] Passing variable name

Hi Bert, 

 

Thanks a lot for informing me regarding the html format of my email. 

 

I also would like to thank to Erdogan CEVHER and Jim LEMON for their kind 
reply/suggestions. Yes I am aware of names function in R which is not the one I 
am looking for in here. Let me try to explain in another way.

 

The below part includes data generation, making cross-tab, Chi-Squared test and 
bar plot through ggplot. 

 

###

MyData<-data.frame("Gender" = c("F", "F", "F", "F", "M", "M", "M", "M", "M",
 "M", "F", "F"),

   "Hand" = c("R",   "R", "L", "L", "R", "R", "L", "L", "R",
 "R", "L", "L"), 

   "Gr" = c(1,  2,   1,   2,   1,   2,   1,   2,   1,   2, 
1,   2) )



MyData <- within(MyData, {

  Gender  <- factor(Gender)

  Hand <- factor(Hand)

  Gr   <- factor(Gr)

}

)



str(MyData)



library(ggplot2)

  

# Part 1   #



MyT <- table(MyData$Gender, MyData$Hand)

print(MyT)



MyChi<- chisq.test(MyT)

print(MyChi)



dMyT <- data.frame(as.table(as.matrix(table(MyData$Gender, MyData$Hand, useNA = "ifany"))))



name2<- c("Gender", "Hand", "Frequency")

names(dMyT) <- name2



ggplot(data = na.omit(dMyT), aes(fill=Hand, y=Frequency, x=Gender)) +

geom_bar(position="dodge", stat="identity")



###



Let's say I have hundreds of variables (e.g. SNP data). By using the above 
codes I can perform what I need. However, I need to copy/paste the variable 
name(s) for making the table, the Chi-Square test, and the ggplot. This 
increases the chance of incorrectly copying/pasting variable name(s). What I 
can do is define the variable name(s) earlier and pass those names to the 
table, Chi-Square test, and ggplot parts. I believe there is a way to do it. I 
tried the "paste" function (as below), but it did not work either.



Any comment/help is deeply appreciated.



Kind Regards



Seyit Ali



##

V1 <- "Gender"

V2 <- "Hand"



MyT2 <- table(paste('MyData$',V1), paste('MyData$',V2) )

print(MyT)



MyChi<- chisq.test(MyT)

print(MyChi)



dMyT <- data.frame(as.table(as.matrix(table(paste('MyData$',V1), paste('MyData$',V2), useNA = "ifany"))))

name2<- c(V1, V2, "Frequency")

names(dMyT) <- name2



ggplot(data = na.omit(dMyT), aes(fill=V2, y=Frequency, x=V1)) +

 geom_bar(position="dodge", stat="identity")



#









From: Bert Gunter [mailto:bgunter.4...@gmail.com] 
Sent: Monday, 28 December 2020 12:08 AM
To: seyitali.ka...@ibu.edu.tr
Cc: R-help 
Subject: Re: [R] Passing variable name



This is a *plain text* list. As you can see from the included text that I 
received,  the HTML version that you sent was somewhat mangled by the server. I 
do not know whether or not enough got through for you to get a helpful reply, 
but if not, re-send *to the list, not me* in *plain text*.  




Bert Gunter

"The trouble with having an open mind is that people keep coming along and 
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )





On Sun, Dec 27, 2020 at 12:25 PM Seyit Ali KAYIS <seyitali.ka...@ibu.edu.tr> wrote:

Dear R users,


I have a data frame as below. In part 1, I have created a table for Gender and 
Hand, performed Chi-Square test and made graph using ggplot.


I want to replace the original variable names (Gender and Hand) with V1 and V2 
and to be able to perform those things again as in #part 2. Is there a way to 
be able to replace the original names?


Any help is deeply appreciated


Kind Regards


Seyit Ali 


#


MyData<-data.frame("Gender" = c("F", "F","F","F",
"M",  "M",  "M",  "M",  "M",  "M",  "F",
"F"),

   "Hand" = c("R",   "R","L","L",   
 "R",

Re: [R] union of two sets are smaller than one set?

2021-01-31 Thread Avi Gross via R-help
Martin,

You did not say your two starting objects were already sets. You said they
were vectors of strings. It may well be that your strings included
duplicates. For example, If I read in lots of text with a blank line between
paragraphs, I would have lots of seemingly empty and identical parts. Just
converting that into a set would shrink it.

You have not said how you created or processed your initial two vectors. It
is also possible parts were sort of DELETED, as in removing the string
pointed to by some entry but leaving a null pointer of sorts, which would
leave the length of the vector longer than the useful contents.

Your strings seem to be what may be filenames. Are they unique, especially
if they are files in different folders/directories?

There are many ways to check, but using your method, try this:

length(base::union(s1, s1))
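Two more direct ways to run the same check, shown on a toy vector (not the real data):

```r
s1 <- c("a.dk", "b.dk", "a.dk", "", "")  # toy stand-in for the real vector

sum(duplicated(s1))   # how many entries are repeats of earlier ones
#> [1] 2
length(unique(s1))    # the size s1 would have as a true set
#> [1] 3
```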

-Original Message-
From: R-help  On Behalf Of Martin Møller
Skarbiniks Pedersen
Sent: Sunday, January 31, 2021 3:57 PM
To: R mailing list 
Subject: [R] union of two sets are smaller than one set?

This is really puzzling me and when I try to make a small example everything
works like expected.

The problem:

I got these two large vectors of strings.

> str(s1)
 chr [1:766608] "0.dk" ...
> str(s2)
 chr [1:59387] "043.dk" "0606.dk" "0618.dk" "0888.dk" "0iq.dk" "0it.dk" ...

And I need to create the union-set of s1 and s2.
I expect the size of the union-set to be between 766608 and 766608+59387.
However it is 681193 which is less that number of elements in s1!

> length(base::union(s1, s2))
[1] 681193

Any hints?

Regards
Martin





Re: [R] Help with regular expressions.

2021-02-08 Thread Avi Gross via R-help
There are many ways, Rolf. You need to look into the syntax of regular
expressions. It depends on how sure you are that the formats are exactly as
needed. Escaping the period with one or more backslashes is one way. Using
string functions is another.

Suggestion: see if you can make a regular expression that is greedy and will
match everything up to a period, then a period, then the rest, and keep the
first and third parts while replacing the middle with a minus sign. Or match
five things: everything up to a single period, the period, everything between,
the second period, and the rest, and keep the needed parts as above.

Periods and dashes must be used carefully though. A period means match one
of almost anything so a good way to catch it is [.] which matches a single
character of only a period. Put parens around that: "([.])" and you have a
replaceable item. In your case, you may want the parens around everything
else before and after, perhaps ([^.]*[.][^.]*) then [.] then ([^.]*) as one
long string, and replace it with \1-\2 or some similar notation.

There are many other variations on this theme, and some are simpler if the
exact format is consistent, such as 'a' being a single character or the string
being a fixed length. If you are sure the period in "a.b.c" is always the
fourth character, no RE is needed: use string methods. Even if not, you can use
string methods to search for a period from the end backwards, or search forward
to find the first period and then the second one starting just past it.
Then replace. Fairly straightforward and very possibly much faster.
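Concretely, the five-part match described above works out to a sketch like:

```r
x <- "a.b.c"

# Group 1: everything up to and including the first period plus the
# next run of non-periods; then the second period; group 2: the rest.
sub("^([^.]*[.][^.]*)[.]([^.]*)$", "\\1-\\2", x)
#> [1] "a.b-c"
```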
-Original Message-
From: R-help  On Behalf Of Rolf Turner
Sent: Monday, February 8, 2021 9:29 PM
To: r-help@r-project.org
Subject: [R] Help with regular expressions.


I want to deal with strings of the form "a.b.c" and to change (using
sub() or whatever is appropriate) the second "." to a "-", i.e. to change
"a.b.c" to "a.b-c".  I want to leave the first "." as-is.

I guess I could do a gsub(), changing all "."s to "-"s, and then do a sub()
changing the first "-" back to a ".".  But this seems very kludgy.  There
must be a sexier way.  Mustn't there?  Is there regular expression syntax
for picking out the second occurence of a particular string?

cheers,

Rolf Turner

--
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276




Re: [R] Dimensioning lists.

2021-02-14 Thread Avi Gross via R-help
Rolf,

Try:

xxx[[2,3]]

The double bracket returns an item, not a list containing the item.

> xxx[2,3]
[[1]]
[[1]]$a
[1] "m"

[[1]]$b
[1] 95


> xxx[[2,3]]
$a
[1] "m"

$b
[1] 95
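The difference is easy to see with Rolf's own toy example:

```r
xxx <- vector("list", 9)
set.seed(42)
for (i in 1:9) xxx[[i]] <- list(a = sample(letters, 1), b = sample(1:100, 1))
dim(xxx) <- c(3, 3)

length(xxx[2, 3])    # single bracket: a length-1 list wrapping the element
#> [1] 1
names(xxx[[2, 3]])   # double bracket: the element itself
#> [1] "a" "b"
```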

-Original Message-
From: R-help  On Behalf Of Rolf Turner
Sent: Sunday, February 14, 2021 10:35 PM
To: r-help@r-project.org
Subject: [R] Dimensioning lists.


I have a setting in which it would be convenient to treat a list as an
array, i.e. to address its entries via a pair of indices.

A toy example:

xxx <- vector("list",9)
set.seed(42)
for(i in 1:9) xxx[[i]] <- list(a=sample(letters,1),b=sample(1:100,1))

I would like to be able to treat "xxx" as a 3 x 3 matrix.

I tried

   dim(xxx) <- c(3,3)

When I do, e.g.

xxx[2,3]

I get:

> [[1]]
> [[1]]$a
> [1] "n"
> 
> [[1]]$b
> [1] 20

That is I get a list of length 1, whose (sole) entry is the desired object.
I would *like* to get just the desired object, *not* wrapped in a list,
i.e.:

> $a
> [1] "n"
> 
> $b
> [1] 20
> 

(which is what I get by typing xxx[2,3][[1]]).

Is there any way to prevent the entries of xxx from being wrapped up in
lists of length 1?

Thanks for any enlightenment.

cheers,

Rolf Turner

--
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276




Re: [R] Read

2021-02-22 Thread Avi Gross via R-help
This discussion is a bit weird so can we step back.

Someone wants help on how to read in a file that apparently was not written
following one of several consistent sets of rules.

If it was fixed width, R has functions that can read that.

If it was separated by commas, tabs, single spaces, arbitrary whitespace,
with or without a header line, we have functions that can read that if
properly called.

ALL the above normally assume that all the resulting columns are the same
length. If any are meant to be shorter, you still leave the separators in
place and put some NA or similar into the result. And, the functions we
normally talk about do NOT read in and produce multiple vectors but
something like a data.frame.

So the choice is either to make sure the darn data is in a consistent
format, or try a different plan. Fair enough?

Some are suggesting parsing it yourself line by line. Certainly that can be
done. But unless you know some schema to help you disambiguate, what do you
do if you reach a row that is too short yet still has data for two columns?
Which of the columns do you assign it to? If you had a clear rule, ...

And what if you have different data types? R does not handle that within a
single vector or row of a data.frame, albeit it can if you make it a list
column.

If this data is a one-time thing, perhaps it should be copied into something
like EXCEL by a human and edited so every column is filled as you wish and
THEN saved as something like a CSV file and then it can happily be imported
the usual way, including NA values as needed. 

If the person really wants 4 independent vectors of different lengths to
read in, there are plenty of ways to do that and no need to lump them in
this odd format.
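[A hedged sketch of the line-by-line approach, using Val's sample rows from later in the thread; the disambiguation rule — first token is x1, the last digit of every other value names its column — is Jim Holtman's suggested heuristic, not a confirmed property of the data.]

```r
# Val's sample rows (from the thread), with ragged columns
lns <- c("1 B12", "2 C23", "322 B32 D34", "4 D44", "51 D53", "60 D62")
vals <- strsplit(lns, " +")

# assumed rule: first token is always x1; for the rest,
# the last digit of the value names its column
m <- matrix(NA_character_, nrow = length(vals), ncol = 4,
            dimnames = list(NULL, paste0("x", 1:4)))
for (i in seq_along(vals)) {
  m[i, 1] <- vals[[i]][1]
  for (v in vals[[i]][-1]) {
    col <- as.integer(substring(v, nchar(v)))  # last character as a digit
    if (!is.na(col) && col %in% 2:4) m[i, col] <- v
  }
}
as.data.frame(m)
```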



-Original Message-
From: R-help  On Behalf Of jim holtman
Sent: Monday, February 22, 2021 9:01 PM
To: Jeff Newmiller 
Cc: r-help@R-project.org (r-help@r-project.org) 
Subject: Re: [R] Read

It looks like we can look at the last digit of the data and that would be
the column number; is that correct?

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.




On Mon, Feb 22, 2021 at 5:34 PM Jeff Newmiller 
wrote:
>
> This gets it into a data frame. If you know which columns should be
numeric you can convert them.
>
> s <-
> "x1  x2  x3 x4
> 1 B22
> 2 C33
> 322 B22  D34
> 4 D44
> 51 D53
> 60 D62
> "
>
> tc <- textConnection( s )
> lns <- readLines(tc)
> close(tc)
> if ( "" == lns[ length( lns ) ] )
>   lns <- lns[ -length( lns ) ]
>
> L <- strsplit( lns, " +" )
> m <- do.call( rbind, lapply( L[-1], function(v)
>   if ( length(v) < length( L[[1]] ) )
>     c( v, rep( NA, length( L[[1]] ) - length(v) ) ) else v ) )
> colnames( m ) <- L[[1]]
> result <- as.data.frame( m, stringsAsFactors = FALSE )
> result
>
> On February 22, 2021 4:42:57 PM PST, Val  wrote:
> >That is my problem. The spacing between columns is not consistent.  
> >It
> >  may be  single space  or multiple spaces (two or three).
> >
> >On Mon, Feb 22, 2021 at 6:14 PM Bill Dunlap 
> >
> >wrote:
> >>
> >> You said the column values were separated by space characters.
> >> Copying the text from gmail shows that some column names and column 
> >> values are separated by single spaces (e.g., between x1 and x2) and 
> >> some by multiple spaces (e.g., between x3 and x4.  Did the mail 
> >> mess up the spacing or is there some other way to tell where the 
> >> omitted values are?
> >>
> >> -Bill
> >>
> >> On Mon, Feb 22, 2021 at 2:54 PM Val  wrote:
> >> >
> >> > I Tried that one and it did not work. Please see the error message
> >> > Error in read.table(text = "x1  x2  x3 x4\n1 B12 \n2   C23
> >> > \n322 B32  D34 \n4D44 \n51 D53\n60 D62
> >",
> >> > :
> >> >   more columns than column names
> >> >
> >> > On Mon, Feb 22, 2021 at 5:39 PM Bill Dunlap
> > wrote:
> >> > >
> >> > > Since the columns in the file are separated by a space 
> >> > > character,
> >" ",
> >> > > add the read.table argument sep=" ".
> >> > >
> >> > > -Bill
> >> > >
> >> > > On Mon, Feb 22, 2021 at 2:21 PM Val  wrote:
> >> > > >
> >> > > > Hi all, I am trying to read a messy data  but facing
> >difficulty.  The
> >> > > > data has several columns separated by blank space(s).  Each
> >column
> >> > > > value may have different lengths across the rows.   The first
> >> > > > row(header) has four columns. However, each row may not have
> >the four
> >> > > > column values.  For instance, the first data row has only the
> >first
> >> > > > two column values. The fourth data row has the first and last
> >column
> >> > > > values, the second and the third column values are missing 
> >> > > > for
> >this
> >> > > > row..  How do I read this data set correctly? Here is my 
> >> > > > sample
> >data
> >> > > > set, output and desired output.   To make it clear to each data
> >poin

Re: [R] Help Required for R Markdown function.

2021-02-28 Thread Avi Gross via R-help
I am sure you can get more done with a caret than a stick. I need a stick for 
another problem, though.

A serious question. I somehow upset my R/RSTUDIO setup while trying to see why 
a markdown only allowed me to save an HTML version, not PDF and DOC as it used 
to. It now fails on any such document with an error that indicates it is not set 
to find a CRAN mirror:

It seems to be upset by a simple call to get the tidyverse loaded and may 
succeed in one sense but not continue:

" Installing package into 'C:/Users/avid2016/Documents/R/win-library/4.0'
(as 'lib' is unspecified)
trying URL 
'https://cran.rstudio.com/bin/windows/contrib/4.0/tidyverse_1.3.0.zip'
Content type 'application/zip' length 439972 bytes (429 KB)
downloaded 429 KB

package ‘tidyverse’ successfully unpacked and MD5 sums checked
..."

The R Markdown lower console window says:

"Error in contrib.url(repos, "source") : trying to use CRAN without setting a 
mirror calls  ...
withVisible -> eval -> eval install.packages -> contrib.url Execution halted"

I have done some work that failed. I am running on windows with the latest 
versions of both R and RSTUDIO after removing all old versions and 
re-installing both. I have seen hints I need to set some definitions in a 
.Rprofile or so and tried but it continues to fail. So I now can knit nothing!

Anyone have a pointer on problems like this? My next attempt would be to 
reinstall the programs on another hard disk entirely in case the problem is in 
my folder structure of ~/R and below, but I do not want to toss years of work 
because of one errant configuration file or the lack thereof.

Thanks in advance for any advice. It may be something trivial such as knitr not 
knowing where to place a library so it puts it into a temp area?

Avi

-Original Message-
From: R-help  On Behalf Of John Kane
Sent: Saturday, February 27, 2021 3:07 PM
To: Kishor raut 
Cc: R. Help Mailing List 
Subject: Re: [R] Help Required for R Markdown function.

The "confusionMatrix" function appears to be from the 'caret' package.
Have you loaded 'caret' with the library(caret) command?

On Sat, 27 Feb 2021 at 14:20, Kishor raut  wrote:

> Respected Sir,
>
> I, Mr Kishor, tried to get help online but could not find the solution, so
> I am writing this email.
>
> Step 1: While writing in R Markdown, all code executed very well until
> the function confusionMatrix was used.
>
> Step 2: When the confusionMatrix command was inserted, the following
> error appeared on screen:
>
>
> processing file: RMarkdown.Rmd
>   [knitr progress output trimmed: "ordinary text without R code"
>   alternating with labels unnamed-chunk-1 through unnamed-chunk-7]
> Quitting from lines 98-104 (RMarkdown.Rmd) Error in 
> confusionMatrix(p1, train$CTestresult) :
>   could not find function "confusionMatrix"
> Calls:  ... handle -> withCallingHandlers -> withVisible -> 
> eval
> -> eval
>
> Execution halted
>
> Step4: As codes were written these
>
>
> Using the function "sample", the data is split into samples of the
> specified size from the stored data:
>
> ```{r}
> set.seed(1234)
> ind <- sample(2, nrow(FEVER1), replace = TRUE, prob = c(0.7, 0.3))
> train <- FEVER1[ind == 1, ]
> test <- FEVER1[ind == 2, ]
> ```
>
> Application of Random Forest
>
> ```{r}
> library(randomForest)
> set.seed(123)
> rfmodel<-randomForest(CTestresult~.,data = train,prox=TRUE)# Random
> Forest Model
> plot(rfmodel,main="RandomForest Model")
> print(rfmodel,train)# Printing Outcome of 'model for
> Training Data
>
> ```
>
> Prediction using Random Forest Mod

Re: [R] Help Required for R Markdown function.

2021-03-01 Thread Avi Gross via R-help
Calum,

 

Thanks for your thoughts. As mentioned, I fixed the problem. I actually was 
using require() and also tried library() but the problem was elsewhere as my R 
installation and startup seemed to no longer have variables set properly to 
find packages remotely or locally.

 

And yes, I made a new empty markdown but as it asks for nothing, it has no 
problems 😉

 

 

From: CALUM POLWART  
Sent: Monday, March 1, 2021 2:31 AM
To: Avi Gross 
Cc: 'R. Help Mailing List' 
Subject: Re: [R] Help Required for R Markdown function.

 

Sounds like you have an install.packages("tidyverse") line in your Rmd rather 
than library (tidyverse)

 

Does this happen on a Rmd example created by doing File/New/R Markdown

 

With no changes?

 

 

 


Re: [R] How to plot dates

2021-03-16 Thread Avi Gross via R-help
It sounds to me like you want to take your data and extract one column for
JUST the date and another column for just some measure of the time, such as
the number of seconds since midnight or hours in a decimal format where
12:45 PM might be 12.75.
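[A small illustration of the decimal-time idea, on a made-up value, base R only:]

```r
dt <- as.POSIXct("2021-03-12 12:45:00", tz = "UTC")

# hours since midnight as a decimal: 12:45 becomes 12.75
dec_hours <- as.numeric(format(dt, "%H")) +
             as.numeric(format(dt, "%M")) / 60
dec_hours  # 12.75
```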

You now can graph date along the X axis and time along the Y (or vice versa)
and show the result in one of many ways as points or horizontal line
segments or whatever works for you.

If you always had 4 time measures for each day, or did some work, you might
also have a column for which time that is for a day, a number ranging from
one to 4 and this column could be used to set the color or other attribute
if that was useful.

So, to review. Your one data item need not be kept as the only column and if
you make more columns, you might have something to graph.

Here is an example that worked for me doing roughly what I mentioned but
note my names changed. It makes two plots.

library (ggplot2)
myDat <- read.table(text =
  "datetimeraw
2021-03-12 05:16:46
2021-03-12 09:17:02
2021-03-12 13:31:43
2021-03-12 22:00:32
2021-03-13 09:21:43
2021-03-13 13:51:12
2021-03-13 18:03:13
2021-03-13 22:20:28
2021-03-14 08:59:03
2021-03-14 13:15:56
2021-03-14 17:25:23
2021-03-14 21:36:26",
sep = ",", header = TRUE)
head(myDat)
myDat$datetime <- as.POSIXct(myDat$datetimeraw, tz = "", format ="%Y-%m-%d
%H:%M:%OS")
myDat$date <- factor(format(myDat$datetime, "%Y-%m-%d"))
myDat$time <- format(myDat$datetime, "%H:%M")
myDat$seq <- factor(rep(1:4, 3))

# just dots
ggplot(data=myDat,aes(x=date, y=time)) + geom_point(aes(color=seq))

# Also text
ggplot(data=myDat,aes(x=date, y=time, label=time)) + 
  geom_point(aes(color=seq)) + 
   geom_text(aes(color=seq))



-Original Message-
From: R-help  On Behalf Of Gregory Coats via
R-help
Sent: Tuesday, March 16, 2021 6:32 PM
To: John Fox 
Cc: r-help mailing list 
Subject: Re: [R] How to plot dates

Thank you. Let me redefine the situation.
Each time an event starts, I record the date and time.
Each day there are 4 new events. Time is the only variable.
I would like to graphically show how the time for events 1, 2, 3, and 4 for
the current day compare to the times for events 1, 2, 3, and 4 for the
previous day. How would I plot / display those times differences?
Greg Coats

library (ggplot2)
myDat <- read.table(text =
"datetime
2021-03-12 05:16:46
2021-03-12 09:17:02
2021-03-12 13:31:43
2021-03-12 22:00:32
2021-03-13 09:21:43
2021-03-13 13:51:12
2021-03-13 18:03:13
2021-03-13 22:20:28
2021-03-14 08:59:03
2021-03-14 13:15:56
2021-03-14 17:25:23
2021-03-14 21:36:26",
sep = ",", header = TRUE)
head(myDat)
myDat$datetime <- as.POSIXct(myDat$datetime, tz = "", format ="%Y-%m-%d
%H:%M:%OS")

> On Mar 16, 2021, at 3:34 PM, John Fox  wrote:
> 
> Dear Greg,
> 
> Coordinate plots typically have a horizontal (x) and vertical (y) 
> axis. The command
> 
>   ggplot(myDat, aes(x=datetime, y = datetime)) + geom_point()
> 
> works, but I doubt that it produces what you want.
> 
> You have only one variable in your data set -- datetime -- so it's not
obvious what you want to do. If you can't clearly describe the structure of
the plot you intend to draw, it's doubtful that I or anyone else can help
you.
> 
> Best,
> John
> 
> John Fox, Professor Emeritus
> McMaster University
> Hamilton, Ontario, Canada
> web: https://socialsciences.mcmaster.ca/jfox/ 
> 


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] How to plot dates

2021-03-16 Thread Avi Gross via R-help
Not sure what you mean by a horizontal line, Greg.

I change one of my plots to add a path between corresponding values of the
sequence variable but obviously those lines are mostly not horizontal. Look
at geom_path and similar geoms like geom_line to connect endpoints grouped
whatever way you specify.

ggplot(data=myDat,aes(x=date, y=time, label=time)) + 
  geom_point(aes(color=seq)) + 
  geom_path(aes(group=seq)) +
  geom_text(aes(color=seq))

As this is a help group, I think I am done.

From: Gregory Coats <gregco...@me.com> 
Sent: Tuesday, March 16, 2021 8:26 PM
To: Avi Gross <avigr...@verizon.net>
Cc: r-help@r-project.org
Subject: Re: [R] How to plot dates

Thank you very much.
In addition to what your did, for event 1, I would like to draw a horizontal
line connecting from day 1 to day 2 to day 3 to day 4.
Then, for event 2, I would like to draw a horizontal line connecting from
day 1 to day 2 to day 3 to day 4.
Similarly for events 3, and 4. Is that convenient to do?
Greg Coats

On Mar 16, 2021, at 8:01 PM, Avi Gross via R-help
<r-help@r-project.org> wrote:

Here is an example that worked for me doing roughly what I mentioned but
note my names changed. It makes two plots.

library (ggplot2)
myDat <- read.table(text =
 "datetimeraw
2021-03-12 05:16:46
2021-03-12 09:17:02
2021-03-12 13:31:43
2021-03-12 22:00:32
2021-03-13 09:21:43
2021-03-13 13:51:12
2021-03-13 18:03:13
2021-03-13 22:20:28
2021-03-14 08:59:03
2021-03-14 13:15:56
2021-03-14 17:25:23
2021-03-14 21:36:26",
   sep = ",", header = TRUE)
head(myDat)
myDat$datetime <- as.POSIXct(myDat$datetimeraw, tz = "", format ="%Y-%m-%d
%H:%M:%OS")
myDat$date <- factor(format(myDat$datetime, "%Y-%m-%d"))
myDat$time <- format(myDat$datetime, "%H:%M")
myDat$seq <- factor(rep(1:4, 3))

# just dots
ggplot(data=myDat,aes(x=date, y=time)) + geom_point(aes(color=seq))

# Also text
ggplot(data=myDat,aes(x=date, y=time, label=time)) + 
 geom_point(aes(color=seq)) + 
  geom_text(aes(color=seq))



__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Including a ggplot call with a conditional geom in a function

2021-03-24 Thread Avi Gross via R-help
This may not be the right place to ask about ggplot, which is part of a
package, but are you aware of how ggplot works additively?

You can say something like:

p <- ggplot(...) ... + ...

Then later say:

p <- p + geom_...()

And so on.

So if you set all the layers you want first into a variable like p, then in
an if statement you selectively add in one or another layer and finally add
in all remaining layers before printing it, would that simply meet your
need?

Realistically, ggplot creates a data structure and the PLUS of other layers
updates or expands that structure but nothing happens till you print it and
it evaluates the data structure.
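[A minimal sketch of that additive pattern, with made-up data; the zero-line condition mirrors the original question:]

```r
library(ggplot2)

# hypothetical data: y values of mixed sign
df <- data.frame(x = 1:8,
                 y = c(0.34, 0.56, 0.97, 0.33, -0.23, -0.36, -0.11, 0.17))

p <- ggplot(df, aes(x, y)) + geom_point()

# add the horizontal zero line only when y crosses zero
if (any(df$y > 0) && any(df$y < 0)) {
  p <- p + geom_hline(yintercept = 0)
}

print(p)  # nothing is drawn until the object is printed
```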

-Original Message-
From: R-help  On Behalf Of p...@philipsmith.ca
Sent: Wednesday, March 24, 2021 10:24 PM
To: r-help@r-project.org
Subject: [R] Including a ggplot call with a conditional geom in a function

How can I write an R function that contains a call to ggplot within it, with
one of the ggplot geom statements being conditional? In my reprex, I want
the plot to contain a horizontal zero line if the y values are both positive
and negative, and to exclude the horizontal line if all of the y values are
of the same sign. I tried a simple if statement, but it does not work.
Suggestions appreciated. Philip

library(rlang)
library(tidyverse)

a <- c(1:8)
b <- c(23,34,45,43,32,45,68,78)
c <- c(0.34,0.56,0.97,0.33,-0.23,-0.36,-0.11,0.17)
df <- data.frame(a,b,c)

posNeg <- function(x) {
   ifelse(sum(x>0)>0 & sum(x<0)>0, TRUE, FALSE)
}

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] How to average minutes per hour per month in the form of '# hours #minutes'

2021-03-26 Thread Avi Gross via R-help
There are rather straightforward ways to manipulate your data step by step to 
make harder things possible, or you can use creative ways harder for people to 
understand.

So adding columns to your data that take existing times/dates and record them 
with names like Q1Y2021 can give you abilities but as noted they will NOT line 
up with weeks as in 1 to 52.

You can calculate the sum of hours per week, if you had the ability to group by 
week, and place that in a column that repeats that number for each day recorded 
for that week. You can then take the same data and group by quarter and take 
some kind of average of that column but it probably will be WRONG if you did 
the above as it will take the average of whatever rows it encounters and that 
may include partial weeks or other anomalies like when you only recorded three 
days for that week.

So consider other plans. What if you kept track of the number of weeks per 
month as in 28 days is 4 weeks and 31 days is 4.43 or so weeks. You could 
simply calculate the sum of hours for that month and divide by the number of 
weeks by that measure in that month. Would that number satisfy them?

And, again, rather than trying to SORT Month names, consider adding a column 
with a numerical version. Sure, you can play with factors so the months are 
recorded in the order you want and some things like ggplot will then honor that 
order.

If and when you become more expert, much of what you want might be done other 
ways without making columns for real. But it may make sense to start simple.

Here is an example of a simple change to Months Abbreviations to be made into a 
factor in order:

df$mo <- factor(df$mo,levels=month.abb)

Similar ideas involve how you convert hours and minutes to just minutes for 
averaging by adding calculated columns and you can convert the results back to 
whatever format you need later.

Just FYI, many database programs might let you do much of this internally. 
R, using the tools you are already using, is arguably much more flexible.
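[A hedged sketch of the weeks-per-month suggestion; the column names `datum` and `dauer` follow the later post in this thread, but the toy data and the `days_in_month` helper are invented for illustration:]

```r
library(dplyr)

# toy consultation log: one row per consultation, duration in minutes
consult <- data.frame(
  datum = as.Date("2021-01-01") + c(0, 3, 10, 17, 40, 45),
  dauer = c(90, 120, 60, 45, 150, 30)
)

# days in the month that starts at first_of_month
days_in_month <- function(first_of_month) {
  nxt <- seq(first_of_month, by = "1 month", length.out = 2)[2]
  as.numeric(nxt - first_of_month)
}

consult %>%
  group_by(month = format(datum, "%Y-%m")) %>%
  summarise(minutes = sum(dauer), .groups = "drop") %>%
  rowwise() %>%
  mutate(weeks = days_in_month(as.Date(paste0(month, "-01"))) / 7,
         hours_per_week = minutes / 60 / weeks) %>%
  ungroup()
```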

-Original Message-
From: R-help  On Behalf Of Dr Eberhard W Lisse
Sent: Friday, March 26, 2021 3:22 AM
To: r-help@r-project.org
Subject: Re: [R] How to average minutes per hour per month in the form of '# 
hours #minutes'

Jeff,

thank you. However, if I knew how to do this, I would probably not have asked 
:-)-O

I think I have been reasonably comprehensive in describing my issue, but let me 
do it now with the real life problem:

My malpractice insurance gives me a discount if I consult up to 22 hours per 
week in a 3 months period.

I add every patient, date and minutes whenever I see her into a MySQL database. 
 I want to file the report of my hours worked with them for the first 3 month 
period (November to January and not properly quarterly unfortunately :-)-0), 
and while I can generate this with LyX/LateX and knitR producing a 
(super)tabular table containing the full list, and tables for time per week and 
time per month I really can't figure out is how to average the hours worked per 
week for each month (even if weeks don't align with months properly :-)-O)

While I am at it how would I get this to sort properly (year, month) if I used 
the proper names of the months, ie '%Y %B' or '%B %Y'?

   CONSMINUTES %>%
 select(datum, dauer)  %>%
 group_by(month = format(datum, '%Y %m'),
   week = format(datum, '%V'))  %>%
 summarise_if(is.numeric, sum) %>%
 mutate(hm=sprintf("%d Hour%s %d Minutes", dauer %/% 60,
   ifelse((dauer %/% 60) == 1, " ", "s"), dauer %% 60)) %>% 
 select(-dauer)


Any help, or just pointers to where I can read this up, are highly appreciated.

greetings, el


On 2021-03-25 22:37 , Jeff Newmiller wrote:
 > This is a very unclear question.  Weeks don't line up with months..
 > so you need to clarify how you would do this or at least give an
 > explicit example of input data and result data.
 >
 > On March 25, 2021 11:34:15 AM PDT, Dr Eberhard W Lisse
 > wrote:
 >> Thanks, that is helpful.
 >>
 >> But, how do I group it to produce hours worked per week per month?
 >>
 >> el
 >>
 >>
 >> On 2021-03-25 19:03 , Greg Snow wrote:
 >>> Here is one approach:
 >>>
 >>> tmp <- data.frame(min=seq(0,150, by=15))
 >>>
 >>> tmp %>%
 >>> mutate(hm=sprintf("%2d Hour%s %2d Minutes",
 >>>   min %/% 60, ifelse((min %/% 60) == 1, " ", "s"),
 >>>   min %% 60))
 >>>
 >>> You could replace `sprintf` with `str_glue` (and update the syntax
 >>> as well) if you really need tidyverse, but you would also lose some
 >>> formatting capability.
 >>>
 >>> I don't know of tidyverse versions of `%/%` or `%%`.  If you need
 >>> the numeric values instead of a string then just remove the
 >>> `sprintf` and use mutate directly with `min %/% 60` and `min %% 60`.
 >>>
 >>> This of course assumes all of your data is in minutes (by the time
 >>> you pipe to this code) and that all hours have 60 minutes (I don't
 >>

Re: [R] What is an alternative to expand.grid if create a long vector?

2021-04-19 Thread Avi Gross via R-help
Just some thoughts I am considering about the issue of how to make giant 
objects in memory without making them giant or all in memory.

As stupid as this sounds, when things get really big, it can mean not only 
processing your data in smaller amounts but using other techniques than asking 
expand.grid to create all possible combinations in advance.

Some languages like python allow generators that yield one item at a time and 
are called until exhausted, which sounds more like your usage. A single 
function remains resident in memory and each time it is called it uses the 
resident values in a calculation and returns the next. That approach may not 
work well with the way expand.grid works.

So a less efficient way would be to write your own deeply nested loop that 
generates one set of ten or so variables each time through the deepest nested 
loop that you can use one at a time. Alternatively, you can use such a loop to 
write a line at a time in something like a .CSV format and later read N lines 
at a time from the file or even have multiple programs work in parallel by 
taking their own allocations after ignoring the lines not meant for them, or 
some other method.
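[A base-R sketch of visiting every combination without materializing the full grid; the three small vectors and the per-row computation are stand-ins for the much larger real problem:]

```r
vals <- list(a = 1:3, b = 1:3, c = 1:3)  # stand-ins for much larger vectors

lens <- lengths(vals)
total <- prod(lens)  # size of the virtual grid (27 here, not 100^10)
best <- -Inf
for (i in 0:(total - 1)) {
  # decode the counter i into one index per variable (mixed-radix digits)
  idx <- integer(length(lens))
  rest <- i
  for (j in seq_along(lens)) {
    idx[j] <- rest %% lens[j] + 1
    rest <- rest %/% lens[j]
  }
  combo <- mapply(`[`, vals, idx)  # one row of the virtual expand.grid
  best <- max(best, sum(combo))    # any per-row computation goes here
}
best  # 9 for this toy example
```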

Deeply nested loops in R tend to be slow, as I have found out, which is indeed 
why I switched to using pmap() on a data.frame made using expand.grid first. 
But if your needs are exorbitant and you have limited memory, that may not be 
an option.

Can you squeeze some memory out of your design? Your data seems highly 
repetitive and if you really want to store something like this in a column:
c(seq(0.001, 1, length.out = 100))

The size of that, for comparison, is:

object.size(seq(0.001, 1, length.out = 100))
848 bytes

So it is 8 bytes per number plus some overhead.

Then consider storing something like that another way. First, the c() wrapper 
around the above is redundant, albeit harmless. Why not store this:
1L:100L

object.size(1L:100L)
448 bytes

So, four bytes per number plus some overhead.

That stores integers between 1 and 100 and in your case that means that later 
you can divide by a thousand or so to get the number you want each time but not 
store a full double-precision number.

And if you use factors, it may take less space. I note some of your other 
values pick different starting and ending points but in all cases you ask for 
100 equally-spaced values to be calculated by seq() which is fine but you could 
simply record a factor with umpteen specific values as either doubles or 
integers and if expand.grid honors that, it would use less space in any final 
output.  My experiments (not shown here) suggest you can easily cut sizes in 
half and perhaps more with judicious usage.
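[A quick way to check such savings yourself; exact byte counts vary by R version, so the comparison, not the numbers, is the point:]

```r
# the same 100 grid points stored as doubles vs. as small integers
g_double <- expand.grid(x = seq(0.001, 1, length.out = 100),
                        y = seq(0.001, 1, length.out = 100))
g_int    <- expand.grid(x = 1:100, y = 1:100)  # divide by 1000 when used

object.size(g_double)
object.size(g_int)  # roughly half the size of the double version
```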

Perhaps finding or writing a more efficient loop in a C or C++ function would 
allow a way to loop through all possibilities more efficiently and provide a 
function for it to call on each iteration. Depending on your need, that can do 
a calculation using local variables and perhaps add a line to an output file, 
or add another set of values to a vector or other data structure that gets 
returned at the end of processing.

One possibility to consider is using an on-line resource, perhaps paying a fee, 
that will run your R program for you in an environment with more allowed 
resources like memory:

 https://rstudio.cloud/

Some of the professional options allow 8 GB of memory and perhaps 4 CPU. You 
can, of course, configure your own machine to have more memory or perhaps 
allocate lots more swap space and allow your process to abuse it. 

There are many possible solutions but also consider if the sizes and amounts 
you are working on are realistic. I worked on a project a while ago where I 
generated a huge amount of instances with 500 iterations per instance and was 
asked to bump that up to 10,000 per instance (20 times as much) just to show 
the results were similar and that 500 had been enough. It ran for DAYS and 
luckily the rest of the project went back to more manageable numbers.

So, back to your scenario, I wonder if the regularity of your data would allow 
interesting games to be played. Imagine smaller combinations of say 10 levels 
each and for each row in the resulting data.frame, expand that out again so the 
number 2,3,4 (using just three for illustration) becomes (2:29, 3:39, 4:49) and 
is given to expand.grid to make a smaller local one-use expansion table to use. 
Your original giant problem is converted to making a modest table that for each 
row expands to a second modest table that is used and immediately discarded and 
replaced by a similar table. So for ten variables, instead of making 100^10 
variations all at once, you might make 10^10 variations and iterate on rows of 
that and make another 10^10 size table and do your processing on each row of 
that and then remove that table and replace it till done. In theory, you can 
use that in additional stages and cut memory use sharply albeit perhaps 
increasing CPU usage substantially.
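The two-stage idea above can be sketched in a few lines. This is a hypothetical toy (4 variables with 10 levels each, split into an outer pair and an inner pair) rather than the original 10-variable problem:

```r
# Toy sketch of the two-stage expansion: 4 variables with 10 levels each.
# The full grid would be 10^4 rows; here we never hold more than 100 at once.
outer_grid <- expand.grid(a = 1:10, b = 1:10)   # small outer table (100 rows)

total <- 0
for (i in seq_len(nrow(outer_grid))) {
  # small inner table, rebuilt and discarded on every iteration
  inner_grid <- expand.grid(c = 1:10, d = 1:10)
  # do the real per-combination work here; counting rows is a stand-in
  total <- total + nrow(inner_grid)
}
total   # 10000 combinations visited, in chunks of 100
```

The inner table is garbage-collected between iterations, which is what trades memory for CPU time.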

Re: [R] Finding strings in a dataset

2021-05-15 Thread Avi Gross via R-help
Tuhin,

What do you mean by a 2-D dataset? You say some columns contain strings so
it does not sound like you are using a matrix, as then ALL columns would be
of the same type.

So are you using a data.frame or tibble or something you made on your own?

Can you address one column at a time and would that be of type vector? Some
methods work fairly easily on those and some also on lists.

Once you have that vector, there are quite a few ways to find what you want.
Is it fixed text like looking for an exact full match so it would be
something like "theta" to be matched in full, or would you want to match
"the" and both "theta" and "lathe" would match? Or are you matching a
pattern that is more complex like looking for all text that has two vowels
in a row in it?

Once you figure out what you have and what you want, how do you want to
identify what you are looking for? Will there be one match or possibly many
or even all? Many methods will return a TRUE/FALSE vector of the same length
or the integer offset of a match such as telling you it is the fifth item.

R has collections of string functions including in packages like
stringr/stringi that deal well with many things you might need. For matching
patterns, there is a family of functions using "grep" and so on.

Good luck.

-Original Message-
From: R-help  On Behalf Of Tuhin Chakraborty
Sent: Saturday, May 15, 2021 1:08 PM
To: r-help@r-project.org
Subject: [R] Finding strings in a dataset

Hi,
How can I find the location of string data in my 2D dataset? spec(Dataset)
will reveal the columns that contain the strings. But can I know where
exactly the string values are in the column?

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] most stable way to output text layout

2021-06-12 Thread Avi Gross via R-help
Just FYI, Jeremie, you can do what you want fairly easily if you look at the
options available to print() and sprintf().

 

You can ask NA conversion to be done here directly at print time:

 

print(mat, na.print="")

  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]   [,9]
[,10]

[1,]   "a""n_missing:" "0"


 [2,]   "b""n_unique:"  "10"


 [3,]   "c""freq:"


 [4,]   "d" "a"  "b""c"
"d"  

 [5,]   "e" "1"  "1""1"
"1"  

 [6,]   "f"


 [7,]   "g"


 [8,]   "h"  "best match: [1]:" "foo"


 [9,]   "i"


[10,]   "j"


 

 

But if you want the columns to have constant width, try this:

 

mat <- structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, "a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, "n_missing:", "n_unique:", "freq:", NA, NA, NA, NA, NA, NA, NA, "0",
"10", NA, "a", "1", NA, NA, NA, NA, NA, NA, NA, NA, "b", "1", NA, NA, "best
match: [1]:", NA, NA, NA, NA, NA, "c", "1", NA, NA, "foo", NA, NA, NA, NA,
NA, "d", "1", NA, NA, NA, NA, NA), .Dim = c(10L, 10L))

mat[is.na(mat)] <-""

col_width <- max(unlist(lapply(mat, nchar))) + 1

mat <- sprintf("%-*s", col_width, mat)

print(mat, quote=FALSE)

 

I noted your maximum column need was 16 but set it up to calculate that
dynamically and add one. Then sprintf() is asked to make all columns that width,
and finally print() puts that matrix out without quotes like this:

 

  [1]


  [7]


 [13]


 [19] a b
c d

 [25] e f g h
i j

 [31]


 [37]


 [43]


 [49] n_missing:n_unique:
freq:  

 [55]


 [61] 0 10  a
1  

 [67]


 [73]   b 1
best match: [1]: 

 [79]
c

 [85] 1 foo


 [91]   d
1  

 [97]


 

I made it left-justified; removing the "-" changes that.
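One caveat with the sprintf() approach: it returns a plain character vector, so the matrix shape is lost (which is why the output above prints as one long indexed vector). A small sketch, with a made-up matrix, that restores the dimensions so print() keeps the row/column layout:

```r
# Pad every cell to a common width but keep the matrix shape.
mat <- matrix(c("a", "n_missing:", "0", "b", "freq:", "10"), nrow = 2)
w <- max(nchar(mat)) + 1                # widest cell plus one space
padded <- sprintf("%-*s", w, mat)       # sprintf drops the dim attribute
dim(padded) <- dim(mat)                 # restore it so print() shows a matrix
print(padded, quote = FALSE)
```

With the dim restored, print() lays the cells out in fixed-width columns instead of a numbered vector.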

 

You can also, of course, convert the matrix to a data.frame or tibble, play
various games along these lines, and then use the ways to print a data.frame
that get you what you need.

 

 

 

 

-Original Message-
From: R-help  On Behalf Of Jeremie Juste
Sent: Saturday, June 12, 2021 12:25 PM
To: r-help@r-project.org
Subject: [R] most stable way to output text layout

 

Hello,

 

I'm trying to print a razor thin front-end using just text matrices and the
command prompt.

 

I admit that it is a bit crazy, it seems to do the job and is very quick to
implement...  Except that I don't know of to fix the layout.

 

I'm just seeking to map column names to a standard domain in an interactive
way.

 

For instance, at one iteration, the following matrix is produced:

 

mat <- structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, "a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, "n_missing:", "n_unique:", "freq:", NA, NA, NA, NA, NA, NA, NA, "0",
"10", NA, "a", "1", NA, NA, NA, NA, NA, NA, NA, NA, "b", "1", NA, NA, "best
match: [1]:", NA, NA, NA, NA, NA, "c", "1", NA, NA, "foo", NA, NA, NA, NA,
NA, "d", "1", NA, NA, NA, NA, NA), .Dim = c(10L, 10L))

 

 

which I represent in the console using the following command

 

 

apply(

  mat,1,

  function(x) {

x[is.na(x)] <-""

cat(x,"\n")

  })

 

Do you have any suggestion for how can I have better control on the print
layout of the  matrix so that I can fix the width of each cell?

 

Best regards,

--

Jeremie Juste

 

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[[alternative HTML version deleted]]



Re: [R] covert categoral data into number

2021-06-13 Thread Avi Gross via R-help
JL,

There are many ways to do what you want. If you need to do it by yourself
using standard R, there are ways but if you are allowed to use packages,
like the forcats package in the tidyverse, it can be fairly simple. Here for
example is a way to convert a factor with the four levels you mention but
requires them to be character strings, so I converted it:

> library(forcats)
> orig <- factor(as.character(c(1:4, 4:1)))
> orig
[1] 1 2 3 4 4 3 2 1
Levels: 1 2 3 4
> changed <-  fct_recode(orig, bottom = "1", middle = "2", high = "3", top =
"4")
> changed
[1] bottom middle high   toptophigh   middle bottom
Levels: bottom middle high top

The fct_recode line can also be used with explicit strings, needed if you
have things like embedded spaces:

changed <-  fct_recode(orig, 
   "bottom" = "1", 
   "middle" = "2", 
   "high" = "3", 
   "top" = "4")

To make the changes yourself may require studying other things like how to
make substitutions  using regular expressions.
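Actually, no regular expressions are needed in base R for this case: factor() with its levels and labels arguments does the same recoding in one call. A minimal sketch:

```r
# Base-R equivalent of the fct_recode() example: map codes straight to labels.
x <- c(1:4, 4:1)
changed <- factor(x, levels = 1:4,
                  labels = c("bottom", "middle", "high", "top"))
changed
# [1] bottom middle high   top    top    high   middle bottom
# Levels: bottom middle high top
```

This also keeps the level order you specify, which matters later for plotting or modeling.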

-Original Message-
From: R-help  On Behalf Of Jxay Ljj
Sent: Saturday, June 12, 2021 11:25 AM
To: r-help@r-project.org
Subject: [R] covert categoral data into number

Hi

I would like to convert numbers into different categorical levels . For
example,

In one of column of a dataframe, there are numbers: 1,2,3, 4. How can I
cover them into "bottom", "middle", "high" , "top" in R codes?

Thanks,
JL

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] How to spot/stop making the same mistake

2021-06-23 Thread Avi Gross via R-help
This is unfortunately a bad habit many of us got from earlier languages like
the C group of languages where 0 is FALSE and 1 (and anything non-zero) is
TRUE. A language like Python is arguably even worse in that all kinds of
things can be TRUE or FALSE in odd ways, like a non-empty string or even a
random object which has declared how to decide if it qualifies as true.

R here has an issue with having so many ways to index a vector so that use
of integers has another meaning and only indexing by Booleans has the
meaning you want. So, you just need to adjust your mindset or perhaps write
a little silly function like ensure_boolean() that checks if the vector it
has is ALL just full of zero or 1 with probably no NA and otherwise returns
the vector unchanged. If it is all 0/1, it converts it and returns the
Boolean equivalent. Then instead of supplying the vector directly, most of
the time, you can substitute ensure_boolean(vec) 
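A minimal sketch of such an ensure_boolean() helper (the name and the exact rules are just illustrative):

```r
# Hypothetical helper: convert a clean 0/1 numeric vector to logical,
# and leave anything else untouched.
ensure_boolean <- function(vec) {
  if (!anyNA(vec) && is.numeric(vec) && all(vec %in% c(0, 1)))
    as.logical(vec)
  else
    vec
}

a   <- c(1, 2, 3, 4, 5)
t01 <- c(1, 1, 1, 0, 0)

a[ensure_boolean(t01)]   # 1 2 3, not the first element repeated three times
```

Using the helper at every indexing site makes the 0/1-versus-logical mistake visible in one place instead of scattered through a long script.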


-Original Message-
From: R-help  On Behalf Of Jeff Newmiller
Sent: Wednesday, June 23, 2021 11:18 AM
To: r-help@r-project.org; Phillips Rogfield ;
r-help@r-project.org
Subject: Re: [R] How to spot/stop making the same mistake

I practically never construct vectors like your `t` so it isn't a problem.
And since I make a habit of verifying the types of all vectors I am using in
expressions, if it did come up I would notice.

On June 23, 2021 8:06:05 AM PDT, Phillips Rogfield 
wrote:
>I make the same mistake all over again.
>
>In particular, suppose we have:
>
>a = c(1,2,3,4,5)
>
>and a variable that equals 1 for the elements I want to select:
>
>t = c(1,1,1,0,0)
>
>To select the first 3 elements.
>
>The problem is that
>
>a[t]
>
>would repeat the first element 3 times .
>
>I have to either convert `t` to boolean:
>
>a[t==1]
>
>Or use `which`
>
>a[which(t==1)]
>
>How can I "spot" this error?
>
>It often happens in long scripts.
>
>Do I have to check the type each time?
>
>Do you have any suggestions?
>
>__
>R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

--
Sent from my phone. Please excuse my brevity.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] How to spot/stop making the same mistake

2021-06-23 Thread Avi Gross via R-help
Just a caution. There IS an operator of `!!` in the tidyverse called "bang 
bang" that does a kind of substitution and you can look up the help page for it 
as:

?`!!`

I just tried it on an example and it definitely will in some cases do this 
other evaluation.

I doubt this will clash, but of course parentheses can force the normal 
evaluation as in !(!(a))

And there is a !!! symbol there too to make a big bang, theoretically.

But not for novices 😉

-Original Message-
From: R-help  On Behalf Of Jeff Newmiller
Sent: Wednesday, June 23, 2021 2:10 PM
To: r-help@r-project.org; Phillips Rogfield ; Bert 
Gunter ; Eric Berger 
Cc: r-help@r-project.org
Subject: Re: [R] How to spot/stop making the same mistake

For the record, `!!` is not an operator so it does not "operate" on anything. 
The right ! does per the help page (?`!`) interpret non-zero values as TRUE and 
invert that logic, yielding a logical result even if the input is not logical. 
The left ! inverts that again, yielding a logical vector without the inversion.

On June 23, 2021 10:39:07 AM PDT, Phillips Rogfield  
wrote:
>Dear all,
>
>thank for for your suggestion.
>
>Yes I come from languages where 1 means TRUE and 0 means FALSE. In 
>particular from C/C++ and Python.
>
>Evidently this is not the case for R.
>
>In my mind I kind took for granted that that was the case (1=TRUE, 
>0=FALSE).
>
>Knowing this is not the case for R makes things simpler.
>
>Mine was just an example, sometimes I load datasets taken from outside 
>and variables are coded with 1/0 (for example, a treatment variable may
>
>be coded that way).
>
>I also did not know the !!() syntax!
>
>Thank you for your help and best regards.
>
>On 23/06/2021 17:55, Bert Gunter wrote:
>> Just as a way to save a bit of typing, instead of
>>
>> > as.logical(0:4)
>> [1] FALSE  TRUE  TRUE  TRUE  TRUE
>>
>> > !!(0:4)
>> [1] FALSE  TRUE  TRUE  TRUE  TRUE
>>
>> DO NOTE that the parentheses in the second expression should never be
>
>> omitted, a possible reason to prefer the as.logical() construction.
>> Also note that !!  "acts [only] on raw, logical and number-like 
>> vectors," whereas as.logical() is more general. e.g. (from ?logical):
>>
>> > charvec <- c("FALSE", "F", "False", "false","fAlse", "0",
>> +  "TRUE",  "T", "True",  "true", "tRue",  "1")
>> > as.logical(charvec)
>>  [1] FALSE FALSE FALSE FALSE    NA    NA  TRUE  TRUE  TRUE  TRUE    NA    NA
>> > !!charvec
>> Error in !charvec : invalid argument type
>>
>>
>> Cheers,
>> Bert
>>
>> Bert Gunter
>>
>> "The trouble with having an open mind is that people keep coming
>along
>> and sticking things into it."
>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>
>>
>> On Wed, Jun 23, 2021 at 8:31 AM Eric Berger > > wrote:
>>
>> In my code, instead of 't', I name a vector of indices with a
>> meaningful
>> name, such as idxV, to make it obvious.
>>
>> Alternatively, a minor change in your style would be to replace
>your
>> definition of t by
>>
>> t <- as.logical(c(1,1,1,0,0))
>>
>> HTH,
>> Eric
>>
>>
>> On Wed, Jun 23, 2021 at 6:11 PM Phillips Rogfield
>> mailto:thebudge...@gmail.com>>
>> wrote:
>>
>> > I make the same mistake all over again.
>> >
>> > In particular, suppose we have:
>> >
>> > a = c(1,2,3,4,5)
>> >
>> > and a variable that equals 1 for the elements I want to select:
>> >
>> > t = c(1,1,1,0,0)
>> >
>> > To select the first 3 elements.
>> >
>> > The problem is that
>> >
>> > a[t]
>> >
>> > would repeat the first element 3 times .
>> >
>> > I have to either convert `t` to boolean:
>> >
>> > a[t==1]
>> >
>> > Or use `which`
>> >
>> > a[which(t==1)]
>> >
>> > How can I "spot" this error?
>> >
>> > It often happens in long scripts.
>> >
>> > Do I have to check the type each time?
>> >
>> > Do you have any suggestions?
>> >
>> > __
>> > R-help@r-project.org  mailing list
>> -- To UNSUBSCRIBE and more, see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> 
>> > PLEASE do read the posting guide
>> > http://www.R-project.org/posting-guide.html
>> 
>> > and provide commented, minimal, self-contained, reproducible
>code.
>> >
>>
>> [[alternative HTML version deleted]]
>>
>> __
>> R-help@r-project.org  mailing list
>--
>> To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> 
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> 

Re: [R] Constructing stacked bar plot

2021-06-27 Thread Avi Gross via R-help
Why should that work in a dplyr function?

medal_data <- medal_counts_ctry %>% filter(medal_counts_ctry$.rows > 100)

Generally in dplyr you do not use the dollar sign notation. And is there a
column starting with a period called ".rows" ??

Without seeing what your data looks like, and assuming you have a column at
that point called rows, I might try:

medal_data <- 
  medal_counts_ctry %>% 
  filter(rows > 100)
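Without seeing the real data, here is a base-R sketch (with a made-up medal_counts) of the same top-N filtering idea; the dplyr equivalent would use group_by()/summarize() plus a `%in%` filter or slice_max():

```r
# Made-up data: keep only the rows whose region is among the top 2 by total
# count. The same idea scales to "top 50 countries".
medal_counts <- data.frame(
  region = c("USA", "USA", "Russia", "Italy"),
  Count  = c(10, 5, 8, 2)
)

totals      <- aggregate(Count ~ region, medal_counts, sum)   # total per region
top_regions <- head(totals$region[order(-totals$Count)], 2)   # top-N names
medal_data  <- medal_counts[medal_counts$region %in% top_regions, ]

nrow(medal_data)   # 3 (two USA rows plus the Russia row)
```

The key point is to rank by the per-region totals first, then filter the full table by region name, rather than filtering on a grouping attribute like .rows.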


-Original Message-
From: R-help  On Behalf Of Jeff Reichman
Sent: Sunday, June 27, 2021 12:36 PM
To: 'Bert Gunter' 
Cc: 'R-help' 
Subject: Re: [R] Constructing stacked bar plot

This line

 

medal_data <- medal_counts_ctry %>% filter(medal_counts_ctry$.rows > 100)

 

From: Bert Gunter 
Sent: Sunday, June 27, 2021 11:32 AM
To: reichm...@sbcglobal.net
Cc: R-help 
Subject: Re: [R] Constructing stacked bar plot

 

As has already been pointed out to you (several times, I believe) -- **HTML
code is stripped on this *plain text* list**.

Hence, "bolded, red code" is meaningless!




Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

 

 

On Sun, Jun 27, 2021 at 9:10 AM Jeff Reichman mailto:reichm...@sbcglobal.net> > wrote:

R-help Forum

I am attempting to create a stacked bar chart but I have too many categories.
The following code works and I end up plotting all 134 countries, but I really
only need (say) the top 50 or so.

I am trying to figure out how to filter down to the countries with the largest
total medal counts before plotting. The bolded red code is the point where I am
thinking I would do this. I've tried several different methods but to no avail.
Any suggestions?


# Load data file matching NOCs with map regions (countries)
noc <- read_csv("~/NGA_Files/JuneMakeoverMonday/noc_regions.csv",
  col_types = cols(
    NOC = col_character(),
    region = col_character()
  ))

# Add regions to data and remove missing points
data_regions <- data %>%
  left_join(noc, by="NOC") %>%
  filter(!is.na(region))

# Subset to variables of interest
medals <- data_regions %>%
  select(region, Medal)

# count number of medals awarded to each Team
medal_counts_ctry <- medals %>%
  filter(!is.na(Medal)) %>%
  group_by(region, Medal) %>%
  summarize(Count=length(Medal))

#head(medal_counts_ctry)

# order Team by total medal count
levs_medal <- medal_counts_ctry %>%
  group_by(region) %>%
  summarize(Total=sum(Count)) %>%
  arrange(desc(Total))

medal_counts_ctry$region <- factor(medal_counts_ctry$region,
levels=levs_medal$region)

medal_data <- medal_counts_ctry %>% filter(medal_counts_ctry$.rows > 100)

# plot
ggplot(medal_data, aes(x=region, y=Count, fill=Medal)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values=c("darkorange3","darkgoldenrod1","cornsilk3")) +
  ggtitle("Historical medal counts from Country Teams") +
  theme(plot.title = element_text(hjust = 0.5))


> str(medal_counts_ctry)
grouped_df [323 x 3] (S3: grouped_df/tbl_df/tbl/data.frame)  $ region:
Factor w/ 134 levels "USA","Russia",..: 101 70 70 70 29 29 29 73
73 73 ...
 $ Medal : Factor w/ 3 levels "Bronze","Gold",..: 1 1 2 3 1 2 3 1 2 3 ...
 $ Count : int [1:323] 2 8 5 4 91 91 92 9 2 5 ...
 - attr(*, "groups")= tibble [134 x 2] (S3: tbl_df/tbl/data.frame)
  ..$ region: Factor w/ 134 levels "USA","Russia",.: 1 2 3 4 5 6 7 8 9 10 ..
  ..$ .rows : list [1:134]
  .. ..$ : int [1:3] 307 308 309
  .. ..$ : int [1:3] 235 236 237
  .. ..$ : int [1:3] 102 103 104
  .. ..$ : int [1:3] 296 297 298
  .. ..$ : int [1:3] 95 96 97
  .. ..$ : int [1:3] 138 139 140
  .. ..$ : int [1:3] 263 264 265
  .. ..$ : int [1:3] 46 47 48
  .. ..$ : int [1:3] 11 12 13
  .. ..$ : int [1:3] 117 118 119
  .. ..$ : int [1:3] 194 195 196
  .. ..$ : int [1:3] 208 209 210
  .. ..$ : int [1:3] 52 53 54
  .. ..$ : int [1:3] 147 148 149
  .. ..$ : int [1:3] 92 93 94
  .. ..$ : int [1:3] 266 267 268
  .. ..$ : int [1:3] 232 233 234
  .. ..$ : int [1:3] 69 70 71
  .. ..$ : int [1:3] 253 254 255 ..

Jeff Reichman

[[alternative HTML version deleted]]

__
R-help@r-project.org   mailing list -- To
UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] concatenating columns in data.frame

2021-07-01 Thread Avi Gross via R-help
Micha,

Others have provided ways in standard R so I will contribute a somewhat odd 
solution using the dplyr and related packages in the tidyverse including a 
sample data.frame/tibble I made. It requires newer versions of R and other  
packages as it uses some fairly esoteric features including "the big bang" and 
the new ":=" operator and more.

You can use your own data with whatever columns you need, of course.

The goal: given umpteen columns in the data, add an additional column to an 
existing tibble that is the result of concatenating the row-wise contents of a 
dynamically supplied vector of quoted column names. First we need something to 
work with, so here is a sample:

#--start
# load required packages, or a bunch at once!
library(tidyverse)

# Pick how many rows you want. For a demo, 3 is plenty
N <- 3

# Make a sample tibble with N rows and the following 4 columns
mydf <- tibble(alpha = 1:N,
               beta = letters[1:N],
               gamma = N:1,
               delta = month.abb[1:N])

# show the original tibble
print(mydf)
#--end

In flat text mode, here is the output:

> print(mydf)
# A tibble: 3 x 4
  alpha beta  gamma delta
  <int> <chr> <int> <chr>
1     1 a         3 Jan
2     2 b         2 Feb
3     3 c         1 Mar

Now I want to make a function that is used instead of the mutate verb. I made a 
weird one-liner that is a tad hard to explain so first let me mention the 
requirements.

It will take a first argument that is a tibble and in a pipeline this would be 
passed invisibly.
The second required argument is a vector or list containing the names of the 
columns as strings. A column can be re-used multiple times.
The third optional argument is what to name the new column with a default if 
omitted.
The fourth optional argument allows you to choose a different separator than "" 
if you wish.

The function should be usable in a pipeline on both sides so it should also 
return the input tibble with an extra column to the output.

Here is the function:

my_mutate <- function(df, columns, colnew="concatenated", sep=""){
  df %>%
    mutate("{colnew}" := paste(!!!rlang::syms(columns), sep = sep))
}

Yes, the above can be done inline as a long one-liner:

my_mutate <- function(df, columns, colnew="concatenated", sep="")
  mutate(df, "{colnew}" := paste(!!!rlang::syms(columns), sep = sep))

Here are examples of it running:


> choices <- c("beta", "delta", "alpha", "delta")
> mydf %>% my_mutate(choices, "me2")
# A tibble: 3 x 5
  alpha beta  gamma delta me2
  <int> <chr> <int> <chr> <chr>
1     1 a         3 Jan   aJan1Jan
2     2 b         2 Feb   bFeb2Feb
3     3 c         1 Mar   cMar3Mar

> mydf %>% my_mutate(choices, "me2", ":")
# A tibble: 3 x 5
  alpha beta  gamma delta me2
  <int> <chr> <int> <chr> <chr>
1     1 a         3 Jan   a:Jan:1:Jan
2     2 b         2 Feb   b:Feb:2:Feb
3     3 c         1 Mar   c:Mar:3:Mar

> mydf %>% my_mutate(c("beta", "beta", "gamma", "gamma", "delta", "alpha"))
# A tibble: 3 x 5
  alpha beta  gamma delta concatenated
  <int> <chr> <int> <chr> <chr>
1     1 a         3 Jan   aa33Jan1
2     2 b         2 Feb   bb22Feb2
3     3 c         1 Mar   cc11Mar3

> mydf %>% my_mutate(list("beta", "beta", "gamma", "gamma", "delta", "alpha"))
# A tibble: 3 x 5
  alpha beta  gamma delta concatenated
  <int> <chr> <int> <chr> <chr>
1     1 a         3 Jan   aa33Jan1
2     2 b         2 Feb   bb22Feb2
3     3 c         1 Mar   cc11Mar3

> mydf %>% my_mutate(columns=list("alpha", "beta", "gamma", "delta", "gamma", "beta", "alpha"),
+                    sep="/*/",
+                    colnew="NewRandomNAME")
# A tibble: 3 x 5
  alpha beta  gamma delta NewRandomNAME
  <int> <chr> <int> <chr> <chr>
1     1 a         3 Jan   1/*/a/*/3/*/Jan/*/3/*/a/*/1
2     2 b         2 Feb   2/*/b/*/2/*/Feb/*/2/*/b/*/2
3     3 c         1 Mar   3/*/c/*/1/*/Mar/*/1/*/c/*/3

Does this meet your normal need? Just to show it works in a pipeline, here is a 
variant:

mydf %>%
  tail(2) %>%
  my_mutate(c("beta", "beta"), "betabeta") %>%
  print() %>%
  my_mutate(list("alpha", "betabeta", "gamma"),
"buildson", 
"&")

The above only keeps the last two lines of the tibble, makes a double copy of 
"beta" under a new name, prints the intermediate result, continues to make 
another concatenation using the variable created earlier then prints the result:

Here is the run:

> mydf %>%
+   tail(2) %>%
+   my_mutate(c("beta", "beta"), "betabeta") %>%
+   print() %>%
+   my_mutate(list("alpha", "betabeta", "gamma"),
+             "buildson",
+             "&")
# A tibble: 2 x 5
  alpha beta  gamma delta betabeta
  <int> <chr> <int> <chr> <chr>
1     2 b         2 Feb   bb
2     3 c         1 Mar   cc
# A tibble: 2 x 6
  alpha beta  gamma delta betabeta buildson
  <int> <chr> <int> <chr> <chr>    <chr>
1     2 b         2 Feb   bb       2&bb&2
2     3 c         1 Mar   cc       3&cc&1

As to how the darn functi

Re: [R] List / Matrix to Data Frame

2021-07-01 Thread Avi Gross via R-help
Bill,

A matrix can only contain one kind of data. I ran your code after modifying
it to be proper and took a transpose to get it right-side up:

t(wanted)
date netIncomegrossProfit  
2020-09-30 "2020-09-30" "5741100.00" "10495600.00"
2019-09-30 "2019-09-30" "5525600.00" "9839200.00" 
2018-09-30 "2018-09-30" "5953100.00" "10183900.00"
2017-09-30 "2017-09-30" "4835100.00" "8818600.00"

That looks better, I think.

So I did this:

wanted <- t(wanted)

It has rownames and colnames and you can make it a data frame easily enough:

mydf <- data.frame(wanted)

But the columns are all character strings, so CONVERT them as you wish:

> mydf
date  netIncome grossProfit
2020-09-30 2020-09-30 5741100.00 10495600.00
2019-09-30 2019-09-30 5525600.00  9839200.00
2018-09-30 2018-09-30 5953100.00 10183900.00

Your numbers are quite large and may or may not be meant to be integers:

> mydf$netIncome <- as.numeric(mydf$netIncome)
> mydf$grossProfit <- as.numeric(mydf$grossProfit)
> head(mydf)
date  netIncome grossProfit
2020-09-30 2020-09-30 5.7411e+10 1.04956e+11
2019-09-30 2019-09-30 5.5256e+10 9.83920e+10
2018-09-30 2018-09-30 5.9531e+10 1.01839e+11
2017-09-30 2017-09-30 4.8351e+10 8.81860e+10
2016-09-30 2016-09-30 4.5687e+10 8.42630e+10
2015-09-30 2015-09-30 5.3394e+10 9.36260e+10

The first entries may have something wrong as they become NA when I make
them integers.

The date column is the same as the rownames and is not in a normal vector
format. It shows as a list and you may want to convert it to one of several
formats R supports for dates or a more normal character string. 

So here is how I made it a character string:

> mydf <- as.data.frame(wanted)
> mydf$date <- as.character(mydf$date)
> mydf$netIncome <- as.numeric(mydf$netIncome)
> mydf$grossProfit <- as.numeric(mydf$grossProfit)
> head(mydf)
date  netIncome grossProfit
2020-09-30 2020-09-30 5.7411e+10 1.04956e+11
2019-09-30 2019-09-30 5.5256e+10 9.83920e+10
2018-09-30 2018-09-30 5.9531e+10 1.01839e+11
2017-09-30 2017-09-30 4.8351e+10 8.81860e+10
2016-09-30 2016-09-30 4.5687e+10 8.42630e+10
2015-09-30 2015-09-30 5.3394e+10 9.36260e+10

If you want a DATE, it can now be converted again using one of many methods.
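For instance, a minimal sketch of that conversion (ISO strings like these are as.Date()'s default format):

```r
# Convert the character date column to R's Date class.
mydf <- data.frame(date = c("2020-09-30", "2019-09-30"))
mydf$date <- as.Date(mydf$date)   # default format is "%Y-%m-%d"
class(mydf$date)                  # "Date"
```

Once the column is class Date, axis formatting and date arithmetic work directly, e.g. in plots.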

Just FYI, numbers that big and rounded might work just as well measured in
millions as in what I did to grossProfit:

> mydf$grossProfit <- as.integer(mydf$grossProfit/1e6)
> mydf$netIncome <- as.integer(mydf$netIncome/1e6)
> head(mydf)
date netIncome grossProfit
2020-09-30 2020-09-30 57411  104956
2019-09-30 2019-09-30 55256   98392
2018-09-30 2018-09-30 59531  101839
2017-09-30 2017-09-30 48351   88186
2016-09-30 2016-09-30 45687   84263
2015-09-30 2015-09-30 53394   93626

Some of the numbers are negative though.

If rownames are not needed:

> rownames(mydf) <- NULL
> head(mydf)
date netIncome grossProfit
1 2020-09-30 57411  104956
2 2019-09-30 55256   98392
3 2018-09-30 59531  101839
4 2017-09-30 48351   88186
5 2016-09-30 45687   84263
6 2015-09-30 53394   93626

It may be easier to work with this; but again, convert the dates to real dates
if you need them as such, for example in graphing.

Hope that helps. 





-Original Message-
From: R-help  On Behalf Of Bill Dunlap
Sent: Thursday, July 1, 2021 9:01 PM
To: Sparks, John 
Cc: r-help@r-project.org
Subject: Re: [R] List / Matrix to Data Frame

Does this do what you want?

> df <- data.frame(check.names=FALSE,
    lapply(c(Date="date", netIncome="netIncome", `Gross Profit`="grossProfit"),
           function(nm) vapply(ISY, "[[", nm, FUN.VALUE=NA_character_)))
> str(df)
'data.frame':   36 obs. of  3 variables:
 $ Date: chr  "2020-09-30" "2019-09-30" "2018-09-30" "2017-09-30"
...
 $ netIncome   : chr  "5741100.00" "5525600.00" "5953100.00"
"4835100.00" ...
 $ Gross Profit: chr  "10495600.00" "9839200.00" "10183900.00"
"8818600.00" ...
> df$Date <- as.Date(df$Date)
> df$netIncome <- as.numeric(df$netIncome)
> df$`Gross Profit` <- as.numeric(df$`Gross Profit`)
> str(df)
'data.frame':   36 obs. of  3 variables:
 $ Date: Date, format: "2020-09-30" "2019-09-30" "2018-09-30"
"2017-09-30" ...
 $ netIncome   : num  5.74e+10 5.53e+10 5.95e+10 4.84e+10 4.57e+10 ...
 $ Gross Profit: num  1.05e+11 9.84e+10 1.02e+11 8.82e+10 8.43e+10 ...
> with(df, plot(Date, netIncome))

On Thu, Jul 1, 2021 at 5:35 PM Sparks, John  wrote:

> Hi R-Helpers,
>
> I am taking it upon myself to delve into the world of lists for R.  In 
> no small part because I appear to have discovered a source of data for 
> an exceptionally good price but that delivers much of that data in json
format.
>
> So over the last day or so I managed to fight the list processing 
> tools to a draw and get a list that has only selected elements 
> (actually it ends up in matrix form).  But when I try to convert that 
> to a data frame I can't get it to a f

Re: [R] concatenating columns in data.frame

2021-07-02 Thread Avi Gross via R-help
I know what you mean Jeff. Yes I am very familiar with base R techniques. What 
I had hoped for was to do two things that some of the other methods mentioned 
do that ended up bringing two data.frames together as part of the solution.

Much of what I used is now standard R. I was looking at the accessory functions 
now commonly used in dplyr that let you dynamically select which columns to 
work with, such as starts_with(). Sadly, they seem to work at the top level but 
not easily within a call to something like paste(...), where they are not 
evaluated in the way I want.

But the odd method I tried can also be used in standard R with a bit of work. 
You can create a function without using dplyr that takes your df and uses it to 
concatenate and end with something like:

df$new_col <- do_something(df, selected_cols)

That too adds a column without the need to merge larger structures explicitly.
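A minimal base-R sketch of such a do_something() (the name concat_cols and the toy data are illustrative):

```r
# Paste the selected columns together row-wise, in the given order.
concat_cols <- function(df, cols, sep = "") {
  # df[cols] is a list of columns; do.call spreads them as paste() arguments
  do.call(paste, c(df[cols], sep = sep))
}

df <- data.frame(alpha = 1:3, beta = c("a", "b", "c"))
df$new_col <- concat_cols(df, c("beta", "alpha"), sep = "-")
df$new_col   # "a-1" "b-2" "c-3"
```

Columns can be repeated or reordered in `cols`, matching what the dplyr version above allows.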

But your other point is a tad religious in a sense. I happen to prefer learning 
a core language first then looking at enhancement opportunities. But at some 
point, if teaching someone new who wants to focus on getting a job done simply 
but not necessarily repeatedly or in some ideal way, it is best to do things in 
a way that their mind flows better.

Many things in the tidyverse are redundant with base R or just "fix" 
inconsistencies like making sure the first argument is always the same. But 
many add substantially to doing things in a more step-by-step manner.

I do not worship the base language as it first came out or even as it has 
evolved. I do like to know what choices I have and pick and choose among them 
as needed. Of course a forum like this is more about base R than otherwise and 
I acknowledge that. Still, the ":=" operator is now base R. There is a new 
pipeline operator "|>" in base R. Some ideas, good or otherwise, do get in 
eventually.

I started doing graphs using base R as in the plot() command. It was adequate 
but I wanted better. So I learned about Lattice and various packages and 
eventually ggplot. I can now do things I barely imagined before and am still 
learning that there is much more I can do with packages underneath much of the 
magic and also additional packages layered above it, in some sense. So I do not 
approach that with an either-or mentality either.

Note I am not really talking about just R. I have similar issues with other 
languages I program in such as Python. None of them were created fully-formed 
and many had to add huge amounts to adapt to additional wants and needs. Base R 
for me is often inadequate. But so what?

The task asked about in this thread, in isolation, may indeed not be done any 
better using packages. However, if it is part of a larger set of tasks that 
can be pipelined, it may well be, and I personally was wondering if there was 
a way in dplyr. There probably is a much better way than the one I assembled 
if I only knew about it, and if not, they may add this kind of indirection in 
a future
release if deemed worthy. I have gone back to programs I wrote years ago with 
humongous amounts of code and, using what I know now, reduced them 
drastically: I can tell a function to select, say, all my column names that 
end in .orig and apply a set of functions to them, with output going to the 
base name followed by .mean and .sd and so on. All that can often be done in 
one or two lines of code where previously I had to write 18 near-repetitions 
of each part, and then another and another. That used a limited form of 
dynamism.

Be that as it may, I think the requester has enough info and we can move on.

-Original Message-
From: Jeff Newmiller  
Sent: Friday, July 2, 2021 1:03 AM
To: Avi Gross ; Avi Gross via R-help 
; R-help@r-project.org
Subject: Re: [R] concatenating columns in data.frame

I use parts of the tidyverse frequently, but this post is the best argument I 
can imagine for learning base R techniques.

On July 1, 2021 8:41:06 PM PDT, Avi Gross via R-help  
wrote:
>Micha,
>
>Others have provided ways in standard R so I will contribute a somewhat 
>odd solution using the dplyr and related packages in the tidyverse 
>including a sample data.frame/tibble I made. It requires newer versions 
>of R and other  packages as it uses some fairly esoteric features 
>including "the big bang" and the new ":=" operator and more.
>
>You can use your own data with whatever columns you need, of course.
>
>The goal is to have umpteen columns in the data that you want to add an 
>additional columns to an existing tibble that is the result of 
>concatenating the rowwise contents of a dynamically supplied vector of 
>column names in quotes. First we need something to work with so here is 
>a sample:
>
>#--start
># load required packages, or a bunch at once!
>library(tidyverse)
>
># Pick how many rows you want. For a 

Re: [R] add a variable a data frame to sequentially count unique rows

2021-07-02 Thread Avi Gross via R-help
Ding,

Just to get you to stop asking, here is a solution I hope works.

In English, if you are asking that ONE instance of a duplicate be marked in
a new column with TRUE or 1 while all remaining ones are marked as FALSE or
0 or whatever, that is easy enough. The method is to use the helper
function row_number() inside a grouped mutate(): only the first item in each
group has a row number of 1, and all others are higher.

If I load your test variable (see below) I can add another column I called
count2 fairly easily with this:

test %>% 
  group_by(group1,group2) %>% 
  mutate(count2 = ifelse(row_number()==1, TRUE, FALSE)) %>%
  ungroup()

The output matches your first count variable made by hand:

> test %>% 
+   group_by(group1, group2) %>% 
+   mutate(count2 = ifelse(row_number() == 1, TRUE, FALSE)) %>%
+   ungroup()
# A tibble: 9 x 4
  group1 group2 count count2
  <chr>  <chr>  <dbl> <lgl> 
1 g1     k1         1 TRUE  
2 g1     a2         1 TRUE  
3 g1     a2         2 FALSE 
4 g2     c5         1 TRUE  
5 g2     n6         2 TRUE  
6 g2     n6         2 FALSE 
7 g2     n6         2 FALSE 
8 g2     m10        3 TRUE  
9 g2     m10        3 FALSE 

Now, if you actually want a count of first, second, and third, it is even
easier:

test %>% 
  group_by(group1,group2) %>% 
  mutate(counter = row_number()) %>%
  ungroup()

Unfortunately for you, my version of the output suggests you made a mistake on
the last row with g2/n6:

> test %>% 
+   group_by(group1, group2) %>% 
+   mutate(counter = row_number()) %>%
+   ungroup()
# A tibble: 9 x 4
  group1 group2 count counter
  <chr>  <chr>  <dbl>   <int>
1 g1     k1         1       1
2 g1     a2         1       1
3 g1     a2         2       2
4 g2     c5         1       1
5 g2     n6         2       1
6 g2     n6         2       2
7 g2     n6         2       3
8 g2     m10        3       1
9 g2     m10        3       2

There are of course other ways to do such things but the above seems simple
enough.
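For completeness, the same results can be sketched in base R without any
packages (using the test data defined in the quoted message below):

```r
test <- data.frame(group1 = c("g1","g1","g1","g2","g2","g2","g2","g2","g2"),
                   group2 = c("k1","a2","a2","c5","n6","n6","n6","m10","m10"))

# within-group running counter, analogous to row_number()
test$counter <- ave(seq_len(nrow(test)), test$group1, test$group2,
                    FUN = seq_along)
test$counter
# [1] 1 1 2 1 1 2 3 1 2

# TRUE for the first occurrence of each group1/group2 pair
test$first <- !duplicated(test[c("group1", "group2")])
```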

-Original Message-
From: R-help  On Behalf Of Yuan Chun Ding
Sent: Friday, July 2, 2021 6:27 PM
To: r-help@r-project.org
Subject: [R] add a variable a data frame to sequentially count unique rows

Hi R users,

In this test file,
test <- data.frame(group1 = c("g1", "g1", "g1", "g2", "g2", "g2", "g2", "g2", "g2"),
                   group2 = c("k1", "a2", "a2", "c5", "n6", "n6", "n6", "m10", "m10"),
                   count  = c(1, 1, 2, 1, 2, 2, 2, 3, 3));

I have group1 and group2 variables and want to add the count variable to
sequentially count unique rows defined by group1 and group2.

I hoped to use the following functions from library(tidyverse), but none
worked well:

test %>% group_by(group1, group2) %>% mutate(count = row_number())
test %>% group_by(group1, group2) %>% mutate(count = 1:n())
test %>% group_by(group1, group2) %>% mutate(count = seq_len(n()))
test %>% group_by(group1, group2) %>% mutate(count = seq_along(group1, group2))

Can you help me to make the third column in the test data frame?

Thank you,

Ding

--

-SECURITY/CONFIDENTIALITY WARNING-  

This message and any attachments are intended solely for the individual or
entity to which they are addressed. This communication may contain
information that is privileged, confidential, or exempt from disclosure
under applicable law (e.g., personal health information, research data,
financial information). Because this e-mail has been sent without
encryption, individuals other than the intended recipient may be able to
view the information, forward it to others or tamper with the information
without the knowledge or consent of the sender. If you are not the intended
recipient, or the employee or person responsible for delivering the message
to the intended recipient, any dissemination, distribution or copying of the
communication is strictly prohibited. If you received the communication in
error, please notify the sender immediately by replying to this message and
deleting the message and any accompanying files from your system. If, due to
the security risks, you do not wish to receive further communications via
e-mail, please reply to this message and inform the sender that you do not
wish to receive further e-mail from the sender. (LCP301)

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] Plotting confidence intervals with ggplot, in multiple facets.

2021-07-18 Thread Avi Gross via R-help
Rolf,

Your example shows two plots with one above the other.

If that is what you want, then a solution like the one Jeff provided, using
facet_grid() to separate the data based on the parameter value, works well. It
also scales up, to a point, if you add additional sets of data for gamma and
delta.

An alternative to consider if your ggplot wizardry has not kicked in, or if
your need is to connect more diverse plots into a sort of collage, is to
make multiple plots and save them as in:

P1 <- ggplot(...) ...
P2 <- ggplot(...) ...

Each of these two or more plots can operate on whatever data you supply it.
In your case, you would filter the rows that have the values you want.

Then you can use one of many packages out there, such as cowplot or gridExtra,
to consolidate the parts as in:

plot_grid(P1, P2)

or

grid.arrange (P1, P2)

These other functions vary in functionality, but many allow you to adjust how
many rows or columns you want, or the relative sizes of the subplots, and so
on. Some allow you to send output to things like a PDF where parts spill over
onto additional pages.
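A small sketch of the collage approach (assumes cowplot and gridExtra are
installed; the data and aesthetics are placeholders):

```r
library(ggplot2)

p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(factor(cyl))) + geom_bar()

# cowplot: control layout and relative panel sizes
cowplot::plot_grid(p1, p2, ncol = 1, rel_heights = c(2, 1))

# gridExtra: same idea, different interface
gridExtra::grid.arrange(p1, p2, nrow = 2)
```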

I note one thing Jeff did not replicate: the red reference line at zero. If
you want that in your output, use geom_hline as shown below:

ggplot(dta,aes(x=Ndat,y=estimate, ymin=lower,ymax=upper))+
  geom_point() +
  geom_errorbar(width=30) +
  geom_hline(yintercept = 0, color="red") +
  facet_grid(param~1)

Personally, I sometimes adjust the limits to make a graph truly start at zero,
but in your case the error bars may extend below zero, so marking where zero
falls graphically can be useful.

-Original Message-
From: R-help  On Behalf Of Rolf Turner
Sent: Sunday, July 18, 2021 2:17 AM
To: r-help@r-project.org
Subject: [R] Plotting confidence intervals with ggplot, in multiple facets.



I have need of creating a plot displaying confidence intervals (for the mean
bias in parameter estimates) with one panel or facet for each of the two
parameters in question.

I can do this in base R graphics, but the result is not as aesthetically
pleasing as I would like.  I have attached an example graphic in the file
"eg.pdf".

I would like to try using ggplot2, but cannot get my head around the syntax.
(Life is a struggle when one is old and senile!)  I have been shown in the
past how to produce a single-facet plot of such confidence intervals,
basically using the geom_errorbar() function, but I cannot see how to
produce multiple facets, depending on a "param" factor.  I have thrashed
around a bit but after succeeding in only confusing myself, I thought I
would save wear and tear on my brain by asking this list.  I'm sure the
answer is pretty simple, but I'm just too stupid to see it.

Can anyone give me a recipe for creating, with ggplot(), a graphic like unto
that shown in "eg.pdf", but prettier?  I have attached the data that were
used to create "eg.pdf" in the form of a data frame, in a file called
"egData.txt".  This file was produced by dput() so read it in using
dget("egData.txt").

With eternal gratitude.

cheers,

Rolf Turner

--
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Plotting confidence intervals with ggplot, in multiple facets.

2021-07-19 Thread Avi Gross via R-help
Rolf,

 

Your questions probably should go to a group focused on the ggplot package, not 
a general R group where many do not use it.

 

A little judicious searching, like "R ggplot use greek letters in text", gets 
you pointers that show how to do much more than Greek letters: many aspects of 
ggplot support complex mathematical-style expressions, and you can use other 
functions like paste() to put more complex expressions together.
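For instance, the same plotmath expressions base R accepts also work in
ggplot labels (a sketch with placeholder data):

```r
library(ggplot2)

ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  ylab(expression(paste("bias in ", beta))) +  # Greek letter in axis label
  ggtitle(expression(alpha %+-% beta))         # richer plotmath also works
```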

 

Similarly, for your comments about wanting different scales to show in 
multiple plots, facet_grid() or perhaps facet_wrap() might answer your 
question, as the manual page explains:

 


scales: Are scales shared across all facets (the default, "fixed"), or do they 
vary across rows ("free_x"), columns ("free_y"), or both rows and columns 
("free")?

So alter the line you use at the end to include a comma after the existing 
arguments, followed by "scales = ..." as needed.

 

But yes, truly independent graphs placed in a grid, after loading the packages 
needed, gives even more options.

 

There are some very decent books and tutorials on many aspects of ggplot 
including some that are free and on-line. Here is an earlier edition of one:

 

https://ggplot2-book.org/

 

Note especially this section on faceting:

 

https://ggplot2-book.org/facet.html

 

I will now go silent on ggplot-related questions 😉

 

 

-Original Message-
From: R-help  On Behalf Of Rolf Turner
Sent: Monday, July 19, 2021 7:24 PM
To: r-help@r-project.org
Subject: Re: [R] Plotting confidence intervals with ggplot, in multiple facets.

 

 

 

Thanks to Jeff Newmiller, Rui Barradas and Avi Gross for their extremely 
helpful replies.  I have got both Jeff's and Rui's code to run.  I am currently 
experimenting with Avi's suggestion of producing multiple plots and then 
putting them together using plot_grid() or grid.arrange().  This idea seems to 
me to be most promising in terms of a desideratum that the y-axis scales/limits 
should be different on the two facets.  Also the y-axis labels.

 

And speaking of y-axis labels:  is it possible in ggplot() to get mathematical 
notation in axis labels, titles and possibly other annotation?  (In the manner 
of plotmath() in base R graphics.) Specifically I'd like to get the Greek 
letters alpha and beta in the y-axis labels.  In base R graphics I'd do 
something like ylab=expression(paste("bias in ",beta)) .  Is there an 
appropriate analogue in ggplot()?  (I think that I may have asked this question 
before, some time back, but have forgotten the answer.)

 

cheers,

 

Rolf

 

P.S.  The following is kind of apropos of nothing, but it might serve as a 
useful warning to others of a Trap for Young Players.  I nearly went mad 
(madder?) for a very long time when trying to get Rui's code to run.

I kept getting errors of the form:

 

> Error in source("scr.Rui") : scr.Rui:6:2: unexpected input
> 5: ggplot(eg, aes(Ndat, estimate)) +
> 6:   
>      ^

 

Took me an unconscionably long while to figure out what was going on.

I could not see why Jeff's code ran without problem, while Rui's (which was 
very similar) fell over. Turns out the second character in the offending line 
is a non-printing character with code 160, just beyond the 128-character ASCII 
set (it can be produced using "\u00A0"). Apparently this is a "non-breaking 
space", whatever that means. It does NOT get treated as white space in the 
usual way, and triggers the foregoing error.

 

Presumably this invisible character got introduced, into the code that Rui 
emailed, by one of the (many!) infuriating idiosyncrasies of Windoze.  Yet 
another reason, among the many millions of such, not to use Windoze.

 

R.

 

--

Honorary Research Fellow

Department of Statistics

University of Auckland

Phone: +64-9-373-7599 ext. 88276

 

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[[alternative HTML version deleted]]



Re: [R] Sanity check in loading large dataframe

2021-08-05 Thread Avi Gross via R-help
Luigi,

Duncan answered part of your question. My feedback is to consider looking at
your data using other tools besides str(). 

There are ways in base R to get lists of row or column names or count them
or ask what types they are and so forth.

Printing an entire large object is hard but printing many subsets can give
you a handle on it.

You may also want to use packages in the tidyverse such as dplyr and work
with tibbles as a mild variation on a data.frame.

I am not sure what you are hoping to do with str() besides getting the
number of rows and columns but consider:

dim(df)
nrow(df)
ncol(df)

To get names: 
names(df)
colnames(df)
rownames(df)

To get many kinds of info about columns in your data.frame, various
functional methods like this can be used:
sapply(df, typeof)

The above will tell you for each column if it is an integer or double or
other things.

To do more interesting things there are packages. The psych package, for
example, lets you get some metrics about each column:
psych::describe(df)

And you can use various methods of subsetting to limit what you are looking
at and only show or print a manageable amount.

You seem to be asking about sanity checking in your subject line and that
depends on what you want to check. Clearly that can include making sure
various columns of data are valid in being of the expected data type or not
having any NA values or even removing outliers and so on. Tools are there
for much of that including the few I mention. Your data may seem huge but I
have worked on much larger ones. One suggestion is to consider trimming some
of that data before working on it IF some is not needed. Both base R and the
tidyverse have lots to offer to do such things.
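Putting a few of those checks together (a sketch, with a toy data.frame
standing in for your data):

```r
df <- data.frame(record_id = 1:5, v1 = c(1, NA, 3, NA, 5))

dim(df)              # rows and columns at a glance
sapply(df, typeof)   # storage type of every column
colSums(is.na(df))   # how many NAs each column holds

# keep only fully observed rows, one common sanity step
df_clean <- df[complete.cases(df), ]
nrow(df_clean)
# [1] 3
```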

-Original Message-
From: R-help  On Behalf Of Luigi Marongiu
Sent: Thursday, August 5, 2021 9:16 AM
To: r-help 
Subject: [R] Sanity check in loading large dataframe

Hello,
I am using a large spreadsheet (over 600 variables).
I tried `str` to check the dimensions of the spreadsheet and I got ```
> (str(df))
'data.frame': 302 obs. of  626 variables:
 $ record_id : int  1 1 1 1 1 1 1 1 1 1 ...

$ v1_medicamento___aceta: int  1 NA NA NA NA NA NA NA NA NA ...
  [list output truncated]
NULL
```
I understand that `[list output truncated]` means that there are more
variables than str() displays by default. Thus I increased the output limit
with:
```

> (str(df, list.len=1000))
'data.frame': 302 obs. of  626 variables:
 $ record_id : int  1 1 1 1 1 1 1 1 1 1 ...
...
NULL
```

Does `NULL` mean that some of the variables are not closed (perhaps a
missing comma somewhere)? Is there a way to check the sanity of the data and
ensure that no separator is out of place?
Thank you



--
Best regards,
Luigi

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] Calculation of Age heaping

2021-08-08 Thread Avi Gross via R-help
It is not too clear to me what you want to do and why that package is the way 
to do it. Is the package a required part of your assignment? If so, maybe 
someone else can help you find how to properly install it on your machine, 
assuming you have permissions to replace the other package it seems to require. 
You may need to create your own environment. If you are open to other ways, see 
below.

Are you trying to do something as simple as counting how many people in your 
data are in various buckets such as each age truncated or rounded to an integer 
from 0 to 99? If so, you might miss some of my cousins alive at 100 or that 
died at 103 and 105 recently 😉

Or do you want ages in groups of 10 or so meaning the first of two digits is 0 
through 9?

Many such things can be done quite easily without the package if you wish.

As far as I can tell, your code reads in a data.frame from your local file 
with any number of columns that you do not specify. If it is one column, the 
solution becomes much easier. You then, for some reason, feel the need to 
convert it to a matrix, and then compute your Whipple index in several ways.

Here is an outline of ways you can do this yourself.

First, combine all your data into one or more vectors. You already have that in 
your data.frame but if all columns are numeric, you can of course do something 
with a matrix.

Then make sure you remove anything objectionable, such as negative numbers or 
numbers too large or NA or whatever your logic requires.

If you have a variable ready with N entries to hold the buckets, such as 
length(0:100), or for even buckets of 5, perhaps length(0:99)/5, you 
initialize that to all zeroes.

Now take your data, and perhaps transform it into a copy where every age is 
truncated to an integer or divided by 5 first or whatever you need so it 
contains a pure integer like 6 or 12. What I mean is if your buckets are 5 
wide, and you want 5:9 to map into one bucket, your transform might be 
as.integer(original/5.0) or one of many variants like that.

You can now simply use one of many methods in R to loop through the resulting 
values: assuming you have a zeroed vector called counter and the current value 
being looked at is N, you simply increment counter[N], or counter[N + 1] if 
your buckets start at zero, or whatever your logic requires.

Alternatively, R has many built-in methods (or ones in other packages) like 
cut() that might do something similar with less work.
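The outline above can be sketched in a few lines (the ages here are made up;
the buckets are 5 years wide):

```r
ages <- c(3, 7, 12, 14, 23, 23, 24, 61, 88)
ages <- ages[!is.na(ages) & ages >= 0 & ages < 100]  # remove bad values first

bucket  <- as.integer(ages / 5)   # 0-4 -> 0, 5-9 -> 1, and so on
counter <- integer(20)            # one slot per 5-year bucket, ages 0-99
for (b in bucket) counter[b + 1] <- counter[b + 1] + 1

# cut() does the same bucketing in one call
table(cut(ages, breaks = seq(0, 100, by = 5), right = FALSE))
```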

And just for the heck of it, I tried your download instructions. Unlike your 
three choices, I was offered 13 choices and as I had no clue what YOU were 
supposed to download, I aborted.

 1: All
 2: CRAN packages only
 3: None
 4: colorspace (2.0-1 -> 2.0-2) [CRAN]
 5: isoband    (0.2.4 -> 0.2.5) [CRAN]
 6: utf8       (1.2.1 -> 1.2.2) [CRAN]
 7: cli        (3.0.0 -> 3.0.1) [CRAN]
 8: ggplot2    (3.3.3 -> 3.3.5) [CRAN]
 9: pillar     (1.6.1 -> 1.6.2) [CRAN]
10: tibble     (3.1.2 -> 3.1.3) [CRAN]
11: dplyr      (1.0.6 -> 1.0.7) [CRAN]
12: Rcpp       (1.0.6 -> 1.0.7) [CRAN]
13: curl       (4.3.1 -> 4.3.2) [CRAN]
14: cpp11      (0.2.7 -> 0.3.1) [CRAN]

In your case, if you selected All, what exactly did you expect?


-Original Message-
From: R-help  On Behalf Of Md. Moyazzem Hossain
Sent: Sunday, August 8, 2021 5:25 PM
To: r-help@r-project.org
Subject: [R] Calculation of Age heaping

Dear R-expert,

I hope that you are doing well.

I am interested to calculate the age heaping for each digit (0,1,...,9) based 
on my data set. However, when I run the R code, I got the following errors. 
Please help me in this regard.

##
library(remotes)
install_github("timriffe/DemoTools")

###
Downloading GitHub repo timriffe/DemoTools@HEAD
These packages have more recent versions available.
It is recommended to update all of them.
Which would you like to update?

 1: All
 2: CRAN packages only
 3: None

Enter one or more numbers, or an empty line to skip updates: 1

*After installing some packages, I got the following error message*

package ‘backports’ successfully unpacked and MD5 sums checked
Error: Failed to install 'DemoTools' from GitHub:
  (converted from warning) cannot remove prior installation of package 
‘backports’

I am attaching the R-code and data file along with this email.

Please help me in this regard.

Thanks in advance.
--
Best Regards,
Md. Moyazzem Hossain
Associate Professor
Department of Statistics
Jahangirnagar University
Savar, Dhaka-1342
Bangladesh
Website: http://www.juniv.edu/teachers/hossainmm
Research: Google Scholar; ResearchGate; ORCID iD

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide ht

Re: [R] ggplot: add percentage for each element in legend and remove tick mark

2021-08-13 Thread Avi Gross via R-help
Kai,

It is easier to want to help someone if they generally know what they are doing 
and are stuck on something. Less so when they do not know enough to explain to 
us what they want, show what they did, and so on.

I modified the data you showed and hopefully it can be recreated this way:

library(tidyverse)

df <- tribble(
  ~ethnicity, ~individuals,
  "Caucasian", 36062,
  "Ashkenazi Jewish", 4309,
  "Multiple", 3193,
  "Hispanic", 2113,
  "Asian. not specified", 1538,
  "Chinese", 1031,
  "African", 643,
  "Unknown", 510,
  "Filipino", 222,
  "Japanese", 129,
  "Native American", 116,
  "Indian", 111,
  "Pacific Islander", 23)

If it was not clear, assuming you already had your data in some variable with a 
name, like my df, you could do this:

> dput(df)
structure(list(
  ethnicity = c(
"Caucasian",
"Ashkenazi Jewish",
"Multiple",
"Hispanic",
"Asian. not specified",
"Chinese",
"African",
"Unknown",
"Filipino",
"Japanese",
"Native American",
"Indian",
"Pacific Islander"
  ),
  individuals = c(36062, 4309, 3193, 2113,
  1538, 1031, 643, 510, 222, 129, 116, 111, 23)
), row.names = c(NA,
 -13L), class = c("tbl_df", "tbl", "data.frame"))   

The above structure can be used to recreate the data somewhat portably 
including a cut and paste like this:

Restoring <- the.above.put.here

The question you ask may better be answered by CHANGING what is in df before 
calling ggplot.
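The fraction, ymax, and ymin columns that appear in the eth tibble can be
derived from the raw counts before plotting (a dplyr sketch, with df as
defined above):

```r
eth <- df %>%
  mutate(fraction = individuals / sum(individuals),  # share of the whole
         ymax = cumsum(fraction),                    # running upper bound
         ymin = lag(ymax, default = 0))              # previous row's ymax
```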

Be that as it may, after lots of work on your badly formatted code as shown in 
plain text, I have this:

> eth
# A tibble: 13 x 5
   ethnicity            individuals fraction  ymax  ymin
   <chr>                      <dbl>    <dbl> <dbl> <dbl>
 1 Caucasian                  36062  0.721   0.721 0
 2 Ashkenazi Jewish            4309  0.0862  0.807 0.721
 3 Multiple                    3193  0.0639  0.871 0.807
 4 Hispanic                    2113  0.0423  0.914 0.871
 5 Asian. not specified        1538  0.0308  0.944 0.914
 6 Chinese                     1031  0.0206  0.965 0.944
 7 African                      643  0.0129  0.978 0.965
 8 Unknown                      510  0.0102  0.988 0.978
 9 Filipino                     222  0.00444 0.992 0.988
10 Japanese                     129  0.00258 0.995 0.992
11 Native American              116  0.00232 0.997 0.995
12 Indian                       111  0.00222 1.00  0.997
13 Pacific Islander              23  0.00046 1     1.00

I used your ggplot code, reformatted so people can read and run it as:

ggplot(eth, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=ethnicity)) +
  geom_rect() +
  coord_polar(theta="y")  +
  xlim(c(2, 4))

It shows a donut plot that I am not sure I can easily share here. You want to 
change the legend by adding more text. Sure, there are tons of ways to do 
that, BUT I am not sure what you actually want.

ONE WAY to do what you want is to make a new column like this:

> eth$label <- paste(eth$ethnicity, " ", eth$fraction*100, "%", sep="")
> eth
# A tibble: 13 x 6
   ethnicity            individuals fraction  ymax  ymin label
   <chr>                      <dbl>    <dbl> <dbl> <dbl> <chr>
 1 Caucasian                  36062  0.721   0.721 0     Caucasian 72.124%
 2 Ashkenazi Jewish            4309  0.0862  0.807 0.721 Ashkenazi Jewish 8.618%
 3 Multiple                    3193  0.0639  0.871 0.807 Multiple 6.386%
 4 Hispanic                    2113  0.0423  0.914 0.871 Hispanic 4.226%
 5 Asian. not specified        1538  0.0308  0.944 0.914 Asian. not specified 3.076%
 6 Chinese                     1031  0.0206  0.965 0.944 Chinese 2.062%
 7 African                      643  0.0129  0.978 0.965 African 1.286%
 8 Unknown                      510  0.0102  0.988 0.978 Unknown 1.02%
 9 Filipino                     222  0.00444 0.992 0.988 Filipino 0.444%
10 Japanese                     129  0.00258 0.995 0.992 Japanese 0.258%
11 Native American              116  0.00232 0.997 0.995 Native American 0.232%
12 Indian                       111  0.00222 1.00  0.997 Indian 0.222%
13 Pacific Islander              23  0.00046 1     1.00  Pacific Islander 0.046%

Now, once you make the labels look exactly the way you want, you need to ask 
ggplot to substitute your labels and make sure they line up right. It may be 
tricky and may require building the factors properly. You may also want to 
round the percentages so they all have the same precision. You can also use 
scale_fill_discrete to change other things, like replacing "ethnicity" with 
another phrase, and so on.

Here is the additional part of ggplot that makes the change:

ggplot(eth, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=ethnicity)) +
  geom_rect() +
  coord_polar(theta="y")  +
  xlim(c(2, 4)) +
  scale_fill_discrete( labels = eth$label)

Removing the tick mark text can be done by setting the right elements of a 
theme as in the following:

ggplot(eth, aes(ymax=ymax, ymin=ymin, xmax=4, x
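One common way to suppress the axis text and tick marks in this kind of donut
chart (a sketch of the general approach, not necessarily the exact theme
settings intended above):

```r
ggplot(eth, aes(ymax = ymax, ymin = ymin, xmax = 4, xmin = 3, fill = ethnicity)) +
  geom_rect() +
  coord_polar(theta = "y") +
  xlim(c(2, 4)) +
  scale_fill_discrete(labels = eth$label) +
  theme_void()   # removes axis text, ticks, and background in one step
```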

Re: [R] Help needed with ggplot2

2021-08-21 Thread Avi Gross via R-help
The code supplied is not proper for several reasons, including not being split 
across multiple lines properly and using variables that are not defined.

"percentage" is a field in data.frame "email" not in "graph_text" and of course 
you need to load libraries properly to use the functions.

I rewrote and fixed a few errors to look like this:

library(tidyverse)

graph_text <- structure(list(percentage = c(57.14, 29.76, 69.32, 28.41, 57.89, 
34.21, 58.59, 33.33, 48.42, 42.11, 59.77, 29.89, 72.13, 18.03, 53.33, 33.33, 
55.1, 40.82, 46.55, 37.93),
 year = c(2020L, 2020L, 2019L, 2019L, 2018L, 2018L, 
2017L, 2017L, 2016L, 2016L, 2015L, 2015L, 2014L, 2014L, 2013L, 2013L, 2012L, 
2012L, 2011L, 2011L), 
 gender = c("male", "female", "male", "female", 
"male", "female", "male", "female", "male", "female", "male", "female", "male", 
"female", "male", "female", "male", "female", "male", "female")), 
class = "data.frame", 
row.names = c(NA, -20L))

ymax <- max(graph_text$percentage)

ggplot(data = graph_text, 
   aes(x=year, 
   y=percentage, 
   color = gender, 
   fill=gender)) +  
  geom_bar(position = 'dodge', 
   stat='identity') +  
  theme_classic() +  
  geom_text(aes(label = percentage), 
size = 4, 
position = position_dodge(width = 1.1), 
vjust=-0.2) +   
  scale_y_continuous(limits=c(0, 1.4*ymax))


And interestingly, it showed the years as 2010.0, 2012.5 and every 2.5 years 
thereafter, like the first version you showed. 

What you are asking for is straightforward enough if you do some simple 
searches on how to set the x axis up. You want integers shown as if they were 
years, presumably starting near the minimum and continuing toward the maximum. 
Do you want every year or just every N years?

One low-tech solution is to change year from an integer to a factor of integers 
or characters like this:

graph_text$year <- as.factor(graph_text$year)

The labels now look reasonable.
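An alternative sketch that keeps year numeric and instead forces whole-year
breaks on the axis (assumes graph_text$year has not been converted to a
factor):

```r
ggplot(graph_text, aes(x = year, y = percentage, fill = gender)) +
  geom_col(position = "dodge") +
  scale_x_continuous(breaks = seq(min(graph_text$year),
                                  max(graph_text$year), by = 1))
```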

I won't solve your other issues, but there are documented ways. Bold text, for 
example, can be set in many places. In this case, note the addition of the 
following to the geom_text() call from above:

fontface = "bold"

as in:

  geom_text(aes(label = percentage), 
size = 4, 
position = position_dodge(width = 1.1), 
vjust=-0.2,
fontface="bold") +

-Original Message-
From: R-help  On Behalf Of bharat rawlley via 
R-help
Sent: Saturday, August 21, 2021 6:24 PM
To: Bert Gunter 
Cc: R-help Mailing List 
Subject: Re: [R] Help needed with ggplot2

 Thank you, I have tried to do a better job here - 



Data - 
email <- structure(list(percentage = c(57.14, 29.76, 69.32, 28.41, 57.89,   
 34.21, 58.59, 33.33, 48.42, 42.11, 59.77, 
29.89, 72.13, 18.03,53.33, 33.33, 55.1, 
40.82, 46.55, 37.93), year = c(2020L, 2020L,
   2019L, 2019L, 2018L, 
2018L, 2017L, 2017L, 2016L, 2016L, 2015L,   
2015L, 2014L, 2014L, 2013L, 
2013L, 2012L, 2012L, 2011L, 2011L   ), 
gender = c("male", "female", "male", "female", "male", "female",
  "male", "female", "male", "female", 
"male", "female", "male",  
"female", "male", "female", "male", "female", "male", "female"  
 )), class = "data.frame", row.names = 
 c(NA, -20L))



Code - 
ymax <- max(graph_text$percentage)ggplot(aes(x=year, y=percentage, color = 
gender, fill=gender, data = graph_text)+  geom_bar(position = 'dodge', 
stat='identity')+  theme_classic()+  geom_text(aes(label = percentage), size = 
4, position = position_dodge(width = 1.1), vjust=-0.2) +   
scale_y_continuous(limits=c(0, 1.4*ymax))



Session info - 
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252
[3] LC_MONETARY=English_India.1252 LC_NUMERIC=C
[5] LC_TIME=English_India.1252

attached base packages:
[1] stats  graphics  grDevices  utils  datasets  methods  base

loaded via a namespace (and not attached):
 [1] fansi_0.5.0      assertthat_0.2.1 dplyr_1.0.6      crayon_1.4.1     utf8_1.2.1
 [6] grid_4.1.0       R6_2.5.0         DBI_1.1.1        lifecycle_1.0.0  gtable_0.3.0
[11] magrittr_2.0.1   scales_1.1.1     ggplot2_3.3.3    pillar_1.6.1     rlang_0.4.11
[16] generics_0.1.0   vctrs_0.3.8      ellipsis_0.3.2   tools_4.1.0      glue_1.4.2
[21] purrr_0.3.4      munsell_0.5.0    compiler_4.1.0

Re: [R] ggplot error of "`data` must be a data frame, or other object coercible by `fortify()`, not an S3 object with class rxlsx"

2021-08-26 Thread Avi Gross via R-help
Kai,

The answer is fairly easy to find if you examine your variable "eth", as that 
is the only place you supply it as an argument, as in 
"ggplot(data = eth, ...) ..."

As the message states, it expects that argument to be a data frame or something 
it can change into a data.frame. What you gave it probably is an object meant 
to represent an EXCEL file or something. You may need to extract a data.frame 
(or tibble or ...) from it before passing that to ggplot.
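A tiny sketch of that check, with made-up data standing in for the spreadsheet (the practical first step is to see what class() reports before blaming ggplot):

```r
# Made-up stand-in for the spreadsheet contents; the point is only the check.
eth <- data.frame(ethnicity = c("A", "B"), ymin = c(0, 0.6), ymax = c(0.6, 1))
class(eth)                   # "data.frame" -- something ggplot() accepts
inherits(eth, "data.frame")  # TRUE; if FALSE, extract or coerce a data.frame first
```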

Avi

-Original Message-
From: R-help  On Behalf Of Kai Yang via R-help
Sent: Thursday, August 26, 2021 11:53 AM
To: R-help Mailing List 
Subject: [R] ggplot error of "`data` must be a data frame, or other object 
coercible by `fortify()`, not an S3 object with class rxlsx"

Hello List,
I got an error message when I submit the code below:

ggplot(eth, aes(ymax = ymax, ymin = ymin, xmax = 4, xmin = 3, fill = ethnicity)) +
  geom_rect() +
  coord_polar(theta = "y") +
  xlim(c(2, 4))

Error: `data` must be a data frame, or other object coercible by `fortify()`, 
not an S3 object with class rxlsx


I checked the syntax but cannot find any error in my code. Can you help me 
find where the problem is?

Thanks

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ggplot error of "`data` must be a data frame, or other object coercible by `fortify()`, not an S3 object with class rxlsx"

2021-08-26 Thread Avi Gross via R-help
This illustrates many things but in particular, why there is a difference 
between saying you tried:

 

class(eth)

 

And saying the function you (think you) called is documented to return a 
data.frame.

 

Just typing something asking for the class would rapidly have shown it was not 
a data.frame, and also what it actually was. True, having multiple packages 
loaded in some order overlay each other's functions is a bit subtle for some, 
and I am glad quite a few people here noticed it. 

 

It may indeed make sense to more fully specify package::function notation in 
anything you let others use as they may indeed load more packages …

 

From: John C Frain  
Sent: Thursday, August 26, 2021 3:17 PM
To: Kai Yang 
Cc: r-help@r-project.org; Avi Gross 
Subject: Re: [R] ggplot error of "`data` must be a data frame, or other object 
coercible by `fortify()`, not an S3 object with class rxlsx"

 

officer redefines the read_xlsx command.  You should have got a message to that 
effect when you loaded the officer package.  You can use the version from the 
readxl package with

 

readxl::read_xlsx()  command.
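The masking John describes is easy to reproduce with any function name; nothing below is specific to officer or readxl, it is just how R's search path works:

```r
# A locally defined mean() masks base::mean(), just as a later-attached
# package's read_xlsx() masks an earlier one's; :: always picks explicitly.
mean <- function(x) "masked!"     # shadows base::mean on the search path
masked_result <- mean(1:4)        # "masked!" -- the mask wins
true_result <- base::mean(1:4)    # 2.5 -- the namespace prefix bypasses it
rm(mean)                          # removing the mask restores normal lookup
```

Running conflicts(detail = TRUE) lists every such clash among attached packages.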




John C Frain

3 Aranleigh Park

Rathfarnham
Dublin 14
Ireland
www.tcd.ie/Economics/staff/frainj/home.html 
<http://www.tcd.ie/Economics/staff/frainj/home.html> 

https://jcfrain.wordpress.com/

https://jcfraincv19.wordpress.com/


fra...@tcd.ie
fra...@gmail.com

 

 

On Thu, 26 Aug 2021 at 20:04, Kai Yang via R-help <r-help@r-project.org> wrote:

 Hi all,
I found something, but I don't know why it happens.
When I submitted the following code, eth is a data frame; I can see 14 obs.
of 2 variables:
library(readxl)
library(ggplot2)
eth <- read_xlsx("c:/temp/eth.xlsx")


but when I load more packages (see below), eth is "List of 1":
library(readxl)
library(ggplot2)
library(dplyr)
library(magrittr)
library(knitr)
library(xtable)
library(flextable)
library(officer)
eth <- read_xlsx("c:/temp/eth.xlsx")

But I need those packages in the future. Is there a way to fix the problem?
Thanks,
Kai

On Thursday, August 26, 2021, 11:37:53 AM PDT, Kai Yang via R-help <r-help@r-project.org> wrote:

  Hi All,
1. eth is a data frame (not sure, based on the error message?) that I load 
from an Excel file. Here is the code: eth <- read_xlsx("c:/temp/eth.xlsx")
2. I try to use the code to convert eth into eth2, but I got error message:
> eth2 <- data.frame(eth)
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = 
stringsAsFactors) : 
  cannot coerce class ‘"rxlsx"’ to a data.frame

So it seems data.frame() cannot do this conversion? Do you know which 
statement/function can?


thank you for your help.

On Thursday, August 26, 2021, 09:33:51 AM PDT, Avi Gross via R-help 
<r-help@r-project.org> wrote:

 Kai,

The answer is fairly easy to find if you examine your variable "eth", as that 
is the only place you supply it as an argument, as in 
"ggplot(data = eth, ...) ..."

As the message states, it expects that argument to be a data frame or something 
it can change into a data.frame. What you gave it probably is an object meant 
to represent an EXCEL file or something. You may need to extract a data.frame 
(or tibble or ...) from it before passing that to ggplot.

Avi

-Original Message-
From: R-help <r-help-boun...@r-project.org> On Behalf Of Kai Yang via R-help
Sent: Thursday, August 26, 2021 11:53 AM
To: R-help Mailing List <r-help@r-project.org>
Subject: [R] ggplot error of "`data` must be a data frame, or other object 
coercible by `fortify()`, not an S3 object with class rxlsx"

Hello List,
I got an error message when I submit the code below:

ggplot(eth, aes(ymax = ymax, ymin = ymin, xmax = 4, xmin = 3, fill = ethnicity)) +
  geom_rect() +
  coord_polar(theta = "y") +
  xlim(c(2, 4))

Error: `data` must be a data frame, or other object coercible by `fortify()`, 
not an S3 object with class rxlsx


I checked the syntax but cannot find any error in my code. Can you help me 
find where the problem is?

Thanks

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Calculate daily means from 5-minute interval data

2021-08-30 Thread Avi Gross via R-help
Am I seeing an odd aspect to this discussion?

There are many ways to solve problems and some may be favored by some more
than others.

All require some examination of the data so it can be massaged into shape
for the processes that follow.

If you insist on using the matrix method to arrange that each row or column
has the data you want, then, yes, you need to guarantee all your data is
present and in the right order. If some may be missing, you may want to
write a program that generates all possible dates in order and interpolates
them back (or into a copy more likely) so all the missing items are
represented and show up as an NA or whatever you want. You may also want to
check all dates are in order with no duplicates and anything else that makes
sense and then you are free to ask the vector to be seen as a matrix with N
columns or rows.
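A minimal sketch of that fill-in step, with invented timestamps and column names (not the poster's data):

```r
# Build the complete 5-minute grid, then merge so that gaps become explicit
# NA rows instead of silently missing observations.
obs <- data.frame(
  stamp = as.POSIXct(c("2021-08-30 00:00", "2021-08-30 00:10"), tz = "UTC"),
  value = c(1.2, 3.4))                        # the 00:05 reading is absent
grid <- data.frame(stamp = seq(min(obs$stamp), max(obs$stamp), by = "5 min"))
full <- merge(grid, obs, all.x = TRUE)        # NA appears for the gap
```

After this, is.na(full$value) marks the outages, and the completed vector can be reshaped with matrix() if that is still the preferred route.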

For many, the solution is much cleaner to use constructs that may be more
resistant to imperfections or allow them to be treated better. I would
probably use tidyverse functionality these days but can easily understand
people preferring base R or other packages. I have done similar analyses of
real data gathered from streams of various chemicals and levels taken at
various times and depths including times no measures happened and times
there were more than one measure. It is thus much more robust to use methods
like group_by and then apply other such verbs already being done grouped and
especially when the next steps involved making plots with ggplot. It was
rather trivial for example, to replace multiple measures by the average of
the measures. And many of my plots are faceted by variables which is not
trivial to do in base R.
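The group-then-average idiom mentioned above, sketched in base R with invented data (dplyr's group_by(day) followed by summarise(mean(value)) is the direct analogue):

```r
# Daily means from interval data: derive the day, then aggregate within it.
x <- data.frame(
  stamp = as.POSIXct(c("2021-08-30 00:00", "2021-08-30 12:00",
                       "2021-08-31 06:00"), tz = "UTC"),
  value = c(2, 4, 10))
x$day <- as.Date(x$stamp)                              # grouping key: calendar day
daily <- aggregate(value ~ day, data = x, FUN = mean)  # one row per day
```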

I suggest not falling in love with the first way you think of and try to
bend everything to fit. Yes, some methods may be quite a bit more efficient
but rarely do I run into problems even with quite large collections of data
like a quarter million rows with dozens of columns, including odd columns
like the output of some analysis.

And note the current set of data may be extended with more over time or you
may get other data collected that would not necessarily work well with a
hard-coded method but might easily adjust to a new method. 

-Original Message-
From: R-help  On Behalf Of Rich Shepard
Sent: Monday, August 30, 2021 7:34 PM
To: R Project Help 
Subject: Re: [R] Calculate daily means from 5-minute interval data

On Tue, 31 Aug 2021, Richard O'Keefe wrote:

> I made up fake data in order to avoid showing untested code. It's not 
> part of the process I was recommending. I expect data recorded every N 
> minutes to use NA when something is missing, not to simply not be 
> recorded. Well and good, all that means is that reshaping the data is 
> not a trivial call to matrix(). It does not mean that any additional 
> package is needed or appropriate and it does not affect the rest of the
process.

Richard,

The instruments in the gauge pipe don't know to write NA when they're not
measuring. :-) The outage period varies greatly by location, constituent
measured, and other unknown factors.

> You will want the POSIXct class, see ?DateTimeClasses. Do you know 
> whether the time stamps are in universal time or in local time?

The data values are not timestamps. There's one column for date, a second
column for time, and a third column for time zone (P in the case of the west
coast).
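Those three columns can be combined into a single POSIXct stamp. A sketch, where the column names and the reading of "P" as US Pacific are assumptions, not taken from the actual data:

```r
# Paste the separate date and time columns and parse once into POSIXct.
x <- data.frame(date = c("2021-08-30", "2021-08-30"),
                time = c("00:00", "00:05"),
                zone = c("P", "P"))            # "P" assumed to mean US Pacific
x$stamp <- as.POSIXct(paste(x$date, x$time),
                      format = "%Y-%m-%d %H:%M",
                      tz = "America/Los_Angeles")
```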

> Above all, it doesn't affect the point that you probably should not be 
> doing any of this.

? (Doesn't require an explanation.)

Rich

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] conditional replacement of elements of matrix with another matrix column

2021-09-01 Thread Avi Gross via R-help
Seems trivial enough Elizabeth, either using a matrix or data.frame.

R is vectorized mostly so A[,1] notation selects a column all at once. Your
condition is thus:

A[,1] == B[,1]

After using your sample data to initialize an A and a B, I get this:

> A[,1] == B[,1]
[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

That Boolean vector can be used to index either of your matrices or any that
have the same number of rows:

Here is one solution using the vectorized ifelse() function:

using <- A[,1] == B[,1]

C <- A

C[, 2 ] <- ifelse(using, B[, 2], A[, 2])

I show the results below and you can tell us if that matches your need on
this sample data:

> A
     [,1] [,2]
[1,]   12   NA
[2,]   12   NA
[3,]   12   NA
[4,]   13   NA
[5,]   13   NA
[6,]   13   NA
[7,]   14   NA
[8,]   14   NA
[9,]   14   NA
> B
     [,1] [,2]
[1,]   11    6
[2,]   11    7
[3,]   11    8
[4,]   13    9
[5,]   13   10
[6,]   13   11
[7,]   14   12
[8,]   14   13
[9,]   14   14
> C
     [,1] [,2]
[1,]   12   NA
[2,]   12   NA
[3,]   12   NA
[4,]   13    9
[5,]   13   10
[6,]   13   11
[7,]   14   12
[8,]   14   13
[9,]   14   14

Of course, the above can be done in fewer steps or many other ways.




-Original Message-
From: R-help  On Behalf Of Eliza Botto
Sent: Wednesday, September 1, 2021 5:00 PM
To: r-help@r-project.org
Subject: [R] conditional replacement of elements of matrix with another
matrix column

deaR useRs,

I have the matrix "A" and matrix "B" and I want the matrix "C". Is there a
way of doing it?

> dput(A)

structure(c(12, 12, 12, 13, 13, 13, 14, 14, 14, NA, NA, NA, NA, NA, NA, NA,
NA, NA), .Dim = c(9L, 2L))

> dput(B)

structure(c(11, 11, 11, 13, 13, 13, 14, 14, 14, 6, 7, 8, 9, 10, 11, 12, 13,
14), .Dim = c(9L, 2L))

> dput(C)

structure(c(12, 12, 12, 13, 13, 13, 14, 14, 14, NA, NA, NA, 9, 10, 11, 12,
13, 14), .Dim = c(9L, 2L))

Precisely, I want to replace the elements of 2nd column of A with those of B
provided the elements of 1st column match. Is there a single line loop or
code for that?


Thanks in advance,

Eliza Botto

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] conditional replacement of elements of matrix with another matrix column

2021-09-01 Thread Avi Gross via R-help
Why would you ask your question without mentioning that the two matrices may
be of unequal length, when your abbreviated example was not like that?

 

You have two CASES here. In one A is longer and in one B is longer. When
they are the same, it does not matter.

 

So in your scenario, consider looking at length(A) and length(B) and
adjusting whatever method you use carefully. You now might need to use 1:N
notation to limit what you are doing so you do not access values out of
bounds.

 

Not going to do it for you. I see others have also supplied variants and …

 

From: Eliza Botto  
Sent: Wednesday, September 1, 2021 6:00 PM
To: r-help@r-project.org; Mohammad Tanvir Ahamed ; Avi
Gross ; Richard M. Heiberger 
Subject: Re: [R] conditional replacement of elements of matrix with another
matrix column

 

I thank you all. But the code doesn't work on my different dataset where A
and B have different column lengths. For example,

 

> dput(A) 

structure(c(17897, 17897, 17897, 17897, 17897, 17897, 17897, 

17897, 17897, 17897, 17897, 17897, 17897, 17897, 17897, 17897, 

SNIP

 

 

Can you please guide me on how to implement the given code on this dataset?

I thank you in advance

  _  

From: Mohammad Tanvir Ahamed 
Sent: Wednesday 1 September 2021 21:48
To: r-help@r-project.org ; Eliza Botto

Subject: Re: [R] conditional replacement of elements of matrix with another
matrix column 

 

C1 <- A
C1[,2][which(B[,1]%in%A[,1])] <- B[,2][which(B[,1]%in%A[,1])]


Regards.
Tanvir Ahamed 






On Wednesday, 1 September 2021, 11:00:16 pm GMT+2, Eliza Botto
 wrote: 





deaR useRs,

I have the matrix "A" and matrix "B" and I want the matrix "C". Is there a
way of doing it?

> dput(A)

structure(c(12, 12, 12, 13, 13, 13, 14, 14, 14, NA, NA, NA, NA,
NA, NA, NA, NA, NA), .Dim = c(9L, 2L))

> dput(B)

structure(c(11, 11, 11, 13, 13, 13, 14, 14, 14, 6, 7, 8, 9, 10,
11, 12, 13, 14), .Dim = c(9L, 2L))

> dput(C)

structure(c(12, 12, 12, 13, 13, 13, 14, 14, 14, NA, NA, NA, 9,
10, 11, 12, 13, 14), .Dim = c(9L, 2L))

Precisely, I want to replace the elements of 2nd column of A with those of B
provided the elements of 1st column match. Is there a single line loop or
code for that?


Thanks in advance,

Eliza Botto

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] conditional replacement of elements of matrix with another matrix column

2021-09-01 Thread Avi Gross via R-help
Just for the hell of it, I looked at the huge amount of data to see the
lengths:

 

> nrow(A)

[1] 8760

> nrow(B)

[1] 734

> sum(is.na(A[, 2]))

[1] 8760

> sum(is.na(B[, 2]))

[1] 0

 

So it seems your first huge matrix has 8,760 rows where the second entry is
always NA.

 

B seems to have 733 unique values out of 734 entries for what I call a key,
and 192 different values mapped to by those keys.

 

> length(unique(B[,1]))

[1] 733

> length(unique(B[,2]))

[1] 192

 

I now conclude the question was badly phrased, as often happens when English
is not the main language used, or the person asking may have provided an
incomplete request, perhaps based on their misunderstanding.

 

First, matrix A has NOTHING anywhere in the second column other than an NA
placeholder. It has umpteen copies of the same number followed by umpteen of
the next and so on. And specifically exactly 24 copies of each!

 

> table(A[,1])

17897 17898 17899 17900 17901 17902 17903 17904 17905 17906 17907 17908 17909 17910 17911 17912 
   24    24    24    24    24    24    24    24    24    24    24    24    24    24    24    24 

<>

18249 18250 18251 18252 18253 18254 18255 18256 18257 18258 18259 18260 18261 
   24    24    24    24    24    24    24    24    24    24    24    24    24 
 

I have no interest in why any of that is but the problem now strikes me as
different. It is not about what to do when A and B have the same value in
column one at all, especially as they are not at all similar. It is about
table lookup, I think.

 

As such, the request is to do something so that you replace the NA in table
A (probably no need to make a C, albeit that works too) by using column2 in
B for whichever one table A in column one matches, using the corresponding
column two.

 

Such a request can be handled quite a few ways BEFORE or after. I mean
instead of making 24 copies in A, you could just make 24 copies of B, and if
needed sort them.  But more generally, there are many R function in base R
that do all kinds of joins such as merge() or in the dplyr/tidyverse package
albeit some of these may be done on data.frames rather than matrices, albeit
they can easily be converted.
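A merge() sketch of that lookup idea, with tiny invented key/value tables (the column names and values here are mine, not from the posted data):

```r
# Treat B as a lookup table: a left join keeps every row of A and fills NA
# wherever A's key has no match in B -- no hand-rolled copying required.
A <- data.frame(key = c(12, 12, 13, 14))
B <- data.frame(key = c(13, 14), val = c(9, 12))
C <- merge(A, B, by = "key", all.x = TRUE)    # left join on the key column
```

With the originals as numeric matrices, as.data.frame() on each first gives the same result.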

 

And of course many alternatives, some painful, involve iterating over one
matrix while searching the other for a match, or setting up B as a
searchable object that simulates a hash or dictionary in other languages,
such as a named structure.

 

For example, make a named vector containing column two with the names of
column 1:

B_vec <- setNames(B[, 2], B[, 1])

You can now look up items in B_vec using the character representation:

 

Here is the first few lines of B:

 

> head(B)

      [,1] [,2]

[1,] 13634    3

[2,] 13635   32

[3,] 13637   88

[4,] 13638  126

[5,] 13639    8

[6,] 13640    2

 

Searching for 13635 works fine:

 

> B_vec[as.character(13635)]
13635 
   32 

> B_vec[as.character(13636)]
 <NA> 
   NA 

> B_vec[as.character(13637)]
13637 
   88

 

But since 13636 is not in the vector, it fails.

 

So converting A (or a copy called C) becomes fairly simple IFF the sets of
numbers in A and B are properly set up.

 

A[,2] <- B_vec[as.character(A[,1])]

 

But are they?

 

> range(A[,1])

[1] 17897 18261

> range(B[,1])

[1] 13634 18148

 

But I think I have wasted enough of my time and of everyone who read this
far on a problem that was not explained and may well still not be what I am
guessing. As noted, probably easiest to solve using a merge.

 

 

 

 

From: Eliza Botto  
Sent: Wednesday, September 1, 2021 6:00 PM
To: r-help@r-project.org; Mohammad Tanvir Ahamed ; Avi
Gross ; Richard M. Heiberger 
Subject: Re: [R] conditional replacement of elements of matrix with another
matrix column

 

I thank you all. But the code doesn't work on my different dataset where A
and B have different column lengths. For example,

 

> dput(A) 

structure(c(17897, 17897, 17897, 17897, 17897, 17897, 17897, 

17897, 17897, 17897, 17897, 17897, 17897, 17897, 17897, 17897, 

<>

NA), .Dim = c(8760L, 2L))

 

 

> dput(B) 

structure(c(13634, 13635, 13637, 13638, 13639, 13640, 13641, 

13642, 13643, 13645, 13646, 13647, 13648, 13649, 13650, 13651, 

<>

214, 156, 240, 29, 2, 374, 36, 4, 18, 419, 2, 5, 3, 277, 340, 

1, 216, 93, 1, 4, 2, 3, 42, 78, 190, 40, 808, 80, 266, 66, 42

), .Dim = c(734L, 2L))

 

Can you please guide me on how to implement the given code on this dataset?

I thank you in advance

  _  

From: Mohammad Tanvir Ahamed 
Sent: Wednesday 1 September 2021 21:48
To: r-help@r-project.org ; Eliza Botto

Subject: Re: [R] conditional replacement of elements of matrix with another
matrix column 

 

C1 <- A
C1[,2][which(B[,1]%in%A[,1])] <- B[,2][which(B[,1]%in%A[,1])]


Regards.
Tanvir Ahamed 






On Wednesday, 1 September 2021, 11:00:16 pm GMT+2, Eliza Botto
 wrote: 





deaR useRs,

I have the matrix "A" and matrix "B" and I want the matrix "C". Is there a
way of doing it?

> dput(A)

structure(c(12, 12, 12, 13, 13, 13, 14, 14, 14, NA, NA, NA, NA,
NA, NA, NA, NA, N

Re: [R] Show only header of str() function

2021-09-02 Thread Avi Gross via R-help
Luigi,

If you are sure you are looking at something like a data.frame, and all you
want to know is how many rows and columns it has, then str() is perhaps too
detailed a tool.

The functions nrow() and ncol() tell you what you want and you can get both
together with dim(). You can, of course, print out whatever message you want
using the numbers supplied by throwing together some function like this:

sstr <- function(x) {
  cat(nrow(x), "obs. of ", ncol(x), " variables\n")
}

Calling that instead of str may meet your needs.  Of course, unlike str, it
will not work on arbitrary data structures.

Note the output of str() goes straight to the screen, similar to what cat()
does. Capturing the output to, say, chop out just the first line is therefore
not a simple option. 


-Original Message-
From: R-help  On Behalf Of Luigi Marongiu
Sent: Thursday, September 2, 2021 7:02 AM
To: r-help 
Subject: [R] Show only header of str() function

Hello, is it possible to show only the header (that is: `'data.frame':
x obs. of  y variables:` part) of the str function?
Thank you

--
Best regards,
Luigi

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Show only header of str() function

2021-09-02 Thread Avi Gross via R-help
Thanks for the interesting method, Rui. So that is a way to redirect output not 
to a sink file but to an in-memory variable via a textConnection.

Of course, one has to wonder why the makers of str thought it would be too 
inefficient to have an option that returns the output in a form that can be 
captured directly, not just to the screen. 

I have in the past done odd things such as using sink() to capture the output 
of a program that wrote another program dynamically in a loop. The saved file 
could then be used with source(). So a similar technique can capture the output 
from str() or cat() or whatever normally only writes to the screen and then the 
file can be read in to get the first line or whatever you need. I have had to 
play games to get the right output from some statistical programs too as it was 
assumed the user would read it, and sometimes had to cherry pick what I needed 
directly from withing the underlying object.

I suspect one reason R has so many packages including the tidyverse I like to 
use, is because the original R was designed in another time and in many places 
is not very consistent. I wonder how hard it would be to change some programs 
to simply accept an additional argument like sink() has where you can say 
split=TRUE and get a copy of what is being diverted to also come to the screen. 
I find cat() to be a very useful way to put together more complicated output 
than, say, print(), but since it does not allow capture of the text into 
variables, I end up having to use other methods such as the glue() function or 
something like print(sprintf("Hello %s, I have %d left.\n", "Brian", 5))

But you work with what you have. Your solution works albeit having read the 
function definition, is quite a bit of overkill when I read the code as it does 
things not needed. But as noted, if efficiency matters and you are only looking 
at data.frame style objects, there are cheaper solutions.


-Original Message-
From: R-help  On Behalf Of Rui Barradas
Sent: Thursday, September 2, 2021 7:31 AM
To: Luigi Marongiu ; r-help 
Subject: Re: [R] Show only header of str() function

Hello,

Not perfect but works for data.frames:


header_str <- function(x){
   capture.output(str(x))[[1]]
}
header_str(iris)
header_str(AirPassengers)
header_str(1:10)


Hope this helps,

Rui Barradas

Às 12:02 de 02/09/21, Luigi Marongiu escreveu:
> Hello, is it possible to show only the header (that is: `'data.frame':
> x obs. of  y variables:` part) of the str function?
> Thank you
>

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Splitting a data column randomly into 3 groups

2021-09-02 Thread Avi Gross via R-help
What is stopping you Abou?

Some of us here start wondering if we have better things to do than homework 
for others. Help is supposed to be after they try and encounter issues that we 
may help with.

So think about your problem. You supplied data in a file that is NOT in CSV 
format but is in Tab separated format.

You need to get it in to your program and store it in something. It looks like 
you have 204 items so 1/3 of those would be exactly 68.

So if your data is in an object like a vector or data.frame, you want to choose 
random number between 1 and 204. How do you do that? You need 1/3 of the length 
of the object items, in your case 68.

Now extract the items with  those indices into say A1. Extract all the rest 
into a temporary item.

Make another 68 random indices, with no overlap, and copy those items into A2 
and the ones that do not have those into A3 and you are sort of done, other 
than some cleanup or whatever.
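The whole recipe can be collapsed with sample() and split(); this is a hedged sketch using a stand-in vector rather than the attached data:

```r
# Split 204 values into three random, non-overlapping groups of 68 each.
set.seed(42)                                        # only for a reproducible draw
x <- rnorm(204)                                     # stand-in for the "Data" column
labels <- sample(rep(1:3, length.out = length(x)))  # shuffled group labels
groups <- split(x, labels)                          # groups[["1"]] .. groups[["3"]]
```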

There are many ways to do the above and I am sure packages too.

But since you have made no visible effort, I personally am not going to pick 
anything in particular.

Had you shown some text and code along the lines of the above and just wanted 
to know how to copy just the ones that were not selected, we could easily ...


-Original Message-
From: R-help  On Behalf Of AbouEl-Makarim 
Aboueissa
Sent: Thursday, September 2, 2021 9:30 PM
To: R mailing list 
Subject: [R] Splitting a data column randomly into 3 groups

Dear All:

How to split a column data *randomly* into three groups. Please see the 
attached data. I need to split column #2 titled "Data"

with many thanks
abou
__


*AbouEl-Makarim Aboueissa, PhD*

*Professor, Statistics and Data Science* *Graduate Coordinator*

*Department of Mathematics and Statistics* *University of Southern Maine*

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Splitting a data column randomly into 3 groups

2021-09-02 Thread Avi Gross via R-help
Abou,

 

I am not trying to be negative. Assuming you are a professor of Statistics, 
your request seems odd, as what you are asking about is very routine in 
statistical work: you build a model using just part of your data and reserve 
the rest to check whether you have trained an algorithm too closely to the 
original data.

 

A simple online search before asking questions here is appreciated. I did a 
quick search for something like “R split data into three parts” and see several 
applicable answers.

 

There are people on this forum who actually get paid to do nontrivial tasks and 
do not mind helping in spots, but feel somewhat used if expected to write a 
serious amount of code and perhaps then be asked to redo it with more bells and 
whistles added. A recent badly phrased request comes to mind where several of 
us provided an answer only to find out it was for a different scenario, …

 

So let me continue with a serious answer. May we assume you KNOW how to read 
the data into something like a data.frame? If so, and if you see no need or 
value in doing this the hard way, then your question could have been to ask 
whether there is an R built-in function, or perhaps a package, already set up 
to solve it quickly. Again, a simple online search can do wonders.  Here, for 
example, is a package called caret, and this page discusses splitting data 
multiple ways:

 

https://topepo.github.io/caret/data-splitting.html

 

There are other such pages suggesting how to do it using base R.

 

Here is one that gives an example on how to make  three unequal partitions:

 

inds <- partition(iris$Sepal.Length, p = c(train = 0.6, valid = 0.2, test = 0.2))

 

 

There is more to do below but in the above, you would use whatever names you 
want instead of train/valid/test and set all three to 0.33 and so on.

 

I repeat, that what you want to do strikes some of us as a fairly routine thing 
to do and lots of people have written how they have done it and you can pick 
and choose, or redo it on your own. If what you have is a homework assignment, 
the appropriate thing is to have you learn to use some technique yourself and 
perhaps get minor help when it fails. But if you will be doing this regularly, 
use of some packages is highly valuable.

 

Good Luck.

 

 

 

 

 

From: AbouEl-Makarim Aboueissa  
Sent: Thursday, September 2, 2021 9:51 PM
To: Avi Gross 
Cc: R mailing list 
Subject: Re: [R] Splitting a data column randomly into 3 groups

 

Sorry, please forget about it. I believe I was very serious when I posted 
my question.

 

with thanks

abou


__

AbouEl-Makarim Aboueissa, PhD

 

Professor, Statistics and Data Science

Graduate Coordinator

Department of Mathematics and Statistics

University of Southern Maine

 

 

 

On Thu, Sep 2, 2021 at 9:42 PM Avi Gross via R-help <r-help@r-project.org> wrote:

What is stopping you Abou?

Some of us here start wondering if we have better things to do than homework 
for others. Help is supposed to be after they try and encounter issues that we 
may help with.

So think about your problem. You supplied data in a file that is NOT in CSV 
format but is in Tab separated format.

You need to get it into your program and store it in something. It looks like 
you have 204 items, so 1/3 of those would be exactly 68.

So if your data is in an object like a vector or data.frame, you want to choose 
random numbers between 1 and 204. How do you do that? You need 1/3 of the length 
of the object, in your case 68.

Now extract the items with those indices into, say, A1. Extract all the rest 
into a temporary object.

Make another 68 random indices from what remains, with no overlap, and copy 
those items into A2; the items not selected go into A3, and you are mostly 
done, other than some cleanup or whatever.

There are many ways to do the above and I am sure packages too.

But since you have made no visible effort, I personally am not going to pick 
anything in particular.

Had you shown some text and code along the lines of the above and just wanted 
to know how to copy just the ones that were not selected, we could easily ...
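In code, the steps described above might look like this sketch, with a made-up
numeric vector standing in for the 204-item column:

```r
x <- rnorm(204)                 # stand-in for the real data column
n <- length(x)
k <- n %/% 3                    # 68 when n is 204

i1   <- sample(n, k)            # 68 random positions
A1   <- x[i1]                   # first group
rest <- x[-i1]                  # temporary object holding the remainder

i2 <- sample(length(rest), k)   # 68 more; overlap with A1 is impossible
A2 <- rest[i2]
A3 <- rest[-i2]                 # everything left over
```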


-Original Message-
From: R-help <r-help-boun...@r-project.org> On Behalf Of AbouEl-Makarim Aboueissa
Sent: Thursday, September 2, 2021 9:30 PM
To: R mailing list <r-help@r-project.org>
Subject: [R] Splitting a data column randomly into 3 groups

Dear All:

How to split a column data *randomly* into three groups. Please see the 
attached data. I need to split column #2 titled "Data"

with many thanks
abou
__


*AbouEl-Makarim Aboueissa, PhD*

*Professor, Statistics and Data Science* *Graduate Coordinator*

*Department of Mathematics and Statistics* *University of Southern Maine*

__
R-help@r-project.org mailing list -- To 
UNSUBSCRIBE and m

Re: [R] . Re: Splitting a data column randomly into 3 groups

2021-09-04 Thread Avi Gross via R-help
Thomas,

There are many approaches tried over the years to do partitioning along the
lines you mentioned and others. R already has many built-in or in packages
including some that are quite optimized. So anyone doing serious work can
often avoid doing this the hard way and build on earlier work.

Now, obviously, some people learning may take on such challenges or have
them assigned as homework. And if you want to tweak the efficiency, then,
knowing the conditions needed by sample() are met, you can directly call
sample.int() and so on.

But fundamentally, a large subset of all these kinds of sampling can often
be done by just playing with indices. It does not matter whether your data
is in the form of a list or other kind of vector or a data.frame or matrix.
Anything you can subset with integers will do.

So an algorithm could determine how many subsets of the indices you want and
calculate how many you want in each bucket and it can be done fairly simply.
One approach might be to scramble the indices in some form, and that can be
a vector of them or something more like an unordered set. You then take the
first number of them as needed for the first partition then the next ones
for the additional partitions. Finally, you apply the selected ones to
subset the original data into multiple smaller data collections.
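A sketch of that scramble-once approach, written as a small helper (the
function name is mine, not from any package):

```r
# partition_by_index: scramble indices once, then slice them into chunks
partition_by_index <- function(data, sizes) {
  stopifnot(sum(sizes) <= length(data))
  idx    <- sample(length(data))           # one scrambled index vector
  ends   <- cumsum(sizes)
  starts <- c(1, head(ends, -1) + 1)
  # each partition subsets the original data via its slice of indices
  Map(function(s, e) data[idx[s:e]], starts, ends)
}

parts <- partition_by_index(letters, c(10, 10, 6))
```

Anything subsettable with integers — vectors, rows of a data.frame — works the
same way.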

Obviously you can instead work in stages, if you prefer. Your algorithm
seems to be along those lines. Start with your full data and pull out what
you want for the first partition. Then with what is left, repeat for the
second partition and so on, till what is left is used for the final
partition. Arguably this may in some ways be more work, especially for
larger amounts of data.

I do note many statistical processes, such as bootstrapping, may include
allowing various kinds of overlap in which the same items are allowed to be
used repeatedly, sometimes even within a single random sample. In those
cases, the algorithm has to include replacement and that is a somewhat
different discussion.

What I am finding here is that many problems posed are not explained in the
way they turn out to be needed in the end. So answering a question before we
know what it is can be a tad premature and waste time all around. But in
reality, some people new to R or computing in general, may be stuck
precisely on understanding the question they are trying to solve or may not
realize their use of language (especially when English is not one of their
stronger languages) can be a problem as their listeners/readers assume they
mean something else.

Your suggestion of how to do some things is reasonable, and you note a
question about larger amounts of data. Many languages, for example Python,
often have higher-level abilities arguably better designed for some
problems. R was built with vectorization in mind and most things are
ultimately vectors in a sense. For some purposes, it would be nice to have
an implementation of primitives along the lines of sets and bags and
hashes/dictionaries and so on. People have added some things along these
lines but if you look at the implementation of your use of setdiff(), it
just calls unique on two arguments coerced into vector format!

So consider what happens if you instead start in a language where you can
use a native construct called a set where sets are implemented efficiently.
To make N groupings that are distinct, you might start by adding all the
indices, or even complete entities, into the set. You can then ask to get a
random element from the set (perhaps also with deletion) until you have
reached your N items. You can then ask for the next group of N' and then N''
till you have what you need and perhaps the set is empty. An implementation
that uses some form of hashing to store and access set items can make
finding things take the same amount of time no matter the size. There is no
endless copying of parts of data. There may be no need for a setdiff() step
or, if one is needed, a more efficient way to do it.

And, of course, if your data can have things like redundant elements or
rows, some kind of bag primitive may be useful and you can do your
scrambling and partitioning without using obvious indices.
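R has no native set or bag primitive, but a hashed environment can play that
role; this is only a sketch of the idea under that assumption, not a
recommendation:

```r
# A crude "set" of indices backed by a hashed environment
pool <- new.env(hash = TRUE)
for (i in 1:10) assign(as.character(i), TRUE, envir = pool)

# draw n random elements, deleting each from the set as it is drawn
draw <- function(env, n) {
  keys <- sample(ls(env), n)
  rm(list = keys, envir = env)
  as.integer(keys)
}

g1 <- draw(pool, 4)
g2 <- draw(pool, 3)
g3 <- draw(pool, 3)   # the pool is now empty
```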

R has been used extensively for a long time and most people use what is
already there. Some create new object types or packages, of course.

I have sometimes taken a hybrid approach and one interesting one is to use a
hybrid environment. I have, for example, done things like the above by
writing a program that has a package that allows both an R and a Python
interpreter to work together on data they sort of share and at times
interconvert. In an example where you have lots of statistical routines you
trust in R but want some of the preparation and arrangement or further
analysis, to be done with functionality you have in Python, you can sort of
combine the best of both worlds into a single program. Heck, the same
environment may also be creating a docu

Re: [R] Splitting a data column randomly into 3 groups

2021-09-04 Thread Avi Gross via R-help
Abou,

I believe I addressed this issue in a private message the other day.

As a general rule, dividing can leave a remainder. If
M = length(whatever)/3

then M is not necessarily an integer. It can be a number ending in .333... or 
.666... rather than .0.
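A quick sketch of what those numbers look like for your second sample:

```r
n <- 112
n / 3     # 37.33333... -- not an integer
n %/% 3   # 37 -- integer division truncates
n %% 3    # 1  -- one item is left over
```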

Now R may silently truncate something like 100/3, which you seem to use, and 
make it behave as if you typed 33. The same goes for 2*M. In your code, you used 
integer division, and that is a truncation too!

  m1 <- n1 %/% 3
  s1 <- sample(1:n1, n1)
  group1.IDs <- sample1.IDs[s1[1:m1]]
  group2.IDs <- sample1.IDs[s1[(m1+1):(2*m1)]]
  group3.IDs <- sample1.IDs[s1[(m1*2+1):(3*m1)]]

A proper solution accounts for any leftover items. One method is to leave all 
extra items till the end and have:

MAX <- length(sample1.IDs)   # i.e., the full length of the sample
group3.IDs <- sample1.IDs[s1[(m1*2+1):MAX]]


The last group then might have one or two extra items. Another method is to do 
a second sweep and move each leftover item into whichever group you wish for 
some balance.

Or, as discussed, there are packages available that let you specify percentages 
you want and handle these edge cases too.
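Putting the first fix together, a sketch of a small function (mine, not from a
package) that sends any remainder to the last group:

```r
split3 <- function(ids) {
  n <- length(ids)
  m <- n %/% 3
  s <- sample(n)
  list(g1 = ids[s[1:m]],
       g2 = ids[s[(m + 1):(2 * m)]],
       g3 = ids[s[(2 * m + 1):n]])   # last group absorbs the remainder
}

g <- split3(seq_len(112))
lengths(g)   # 37 37 38
```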

-Original Message-
From: R-help  On Behalf Of AbouEl-Makarim 
Aboueissa
Sent: Saturday, September 4, 2021 5:13 PM
To: Thomas Subia 
Cc: R mailing list 
Subject: Re: [R] Splitting a data column randomly into 3 groups

Dear Thomas:


Thank you very much for your input in this matter.


The core part of this R code(s) (please see below) was written by *Richard 
O'Keefe*. I had three examples with different sample sizes.



*First sample of size n1 = 204* divided randomly into three groups of sizes 68. 
*No problems with this one*.



*The second sample of size n2 = 112* divided randomly into three groups of 
sizes 37, 37, and 38. BUT this R code generated three groups of equal sizes 
(37, 37, and 37). *How to fix the code to make sure that the output will be 
three groups of sizes 37, 37, and 38*.



*The third sample of size n3 = 284* divided randomly into three groups of sizes 
94, 95, and 95. BUT this R code generated three groups of equal sizes (94, 94, 
and 94). *Again*, *how to fix the code to make sure that the output will be 
three groups of sizes 94, 95, and 95*.


With many thanks

abou


###     #


N1 <- 485
population1.IDs <- seq(1, N1, by = 1)
 population1.IDs

n1 <- 204            # in this case the size of each group
                     # of the three groups = 68
sample1.IDs <- sample(population1.IDs, n1)
sample1.IDs

  n1 <- length(sample1.IDs)

  m1 <- n1 %/% 3
  s1 <- sample(1:n1, n1)
  group1.IDs <- sample1.IDs[s1[1:m1]]
  group2.IDs <- sample1.IDs[s1[(m1+1):(2*m1)]]
  group3.IDs <- sample1.IDs[s1[(m1*2+1):(3*m1)]]

groups.IDs <-cbind(group1.IDs,group2.IDs,group3.IDs)

groups.IDs


### --


N2 <- 266
population2.IDs <- seq(1, N2, by = 1)
 population2.IDs

n2 <- 112            # in this case the sizes of the three
                     # groups are (37, 37, and 38)
                     # BUT this code generates three groups
                     # of equal sizes (37, 37, and 37)
sample2.IDs <- sample(population2.IDs, n2)
sample2.IDs

  n2 <- length(sample2.IDs)

  m2 <- n2 %/% 3
  s2 <- sample(1:n2, n2)
  group1.IDs <- sample2.IDs[s2[1:m2]]
  group2.IDs <- sample2.IDs[s2[(m2+1):(2*m2)]]
  group3.IDs <- sample2.IDs[s2[(m2*2+1):(3*m2)]]

groups.IDs <-cbind(group1.IDs,group2.IDs,group3.IDs)

groups.IDs


### --



N3 <- 674
population3.IDs <- seq(1, N3, by = 1)
 population3.IDs

n3 <- 284            # in this case the sizes of the three
                     # groups are (94, 95, and 95)
                     # BUT this code generates three groups
                     # of equal sizes (94, 94, and 94)
sample2.IDs <- sample(population2.IDs, n2)
sample3.IDs <- sample(population3.IDs, n3)
sample3.IDs

  n3 <- length(sample2.IDs)

  m3 <- n3 %/% 3
  s3 <- sample(1:n3, n3)
  group1.IDs <- sample3.IDs[s3[1:m3]]
  group2.IDs <- sample3.IDs[s3[(m3+1):(2*m3)]]
  group3.IDs <- sample3.IDs[s3[(m3*2+1):(3*m3)]]

groups.IDs <-cbind(group1.IDs,group2.IDs,group3.IDs)

groups.IDs

__


*AbouEl-Makarim Aboueissa, PhD*

*Professor, Statistics and Data Science* *Graduate Coordinator*

*Department of Mathematics and Statistics* *University of Southern Maine*



On Sat, Sep 4, 2021 at 11:54 AM Thomas Subia  wrote:

> Abou,
>
>
>
> I’ve been following your question on how to split a data column 
> randomly into 3 groups using R.
>
>
>
> My method may not be amenable for a large set of data but it surely 
> worth considering since it makes sense intuitively.
>
>
>
> mydata <- LETTERS[1:11]
>
> > mydata
>
> [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K"
>
>
>
> # Let’s choose a random sample of size 4 from mydata
>
> > random_grp1
>
> [1] "J" "H" "D" "A"
>
>
>
> Now my next random selection of data is defined by
>
> data_wo_random <- se

Re: [R] how to find "first" or "last" record after sort in R

2021-09-09 Thread Avi Gross via R-help
I am sure there are many good ways to do the task including taking the 
data.frame out into a list of data.frames and making the change to each by 
taking the nth row that matches nrow(it) and changing it and then recombining.

What follows are several attempts leading up to one at the end I find is 
probably the best choice.

I did the following sample demo using the dplyr package in the tidyverse but 
want to explain. My data was three small groups of 1 then 2 then 3. The second 
column in each had the same number as the group and it was unique for that 
group. If the last item can be a duplicate of another item, this method changes 
too much:

library(dplyr)

mydf <-
  tribble(
~grouper, ~val,
1, 1,
2, 1,
2, 2,
3, 1,
3, 2,
3, 3,
  )

mydf %>%
  group_by(grouper) %>%
  mutate(val2 = last(val), val = ifelse(val == val2, 0, val))

The result is this:

> mydf %>% group_by(grouper) %>% mutate(val2 = last(val),
+   val = ifelse(val == val2, 0, val))
# A tibble: 6 x 3
# Groups:   grouper [3]
  grouper   val  val2
    <dbl> <dbl> <dbl>
1       1     0     1
2       2     1     2
3       2     0     2
4       3     1     3
5       3     2     3
6       3     0     3

Now obviously this introduced an extra temporary column called val2, which is 
easily removed by many methods, like piping to select(-val2) ...

But that is not needed as a shorter and more direct method is this:

mydf %>% 
  group_by(grouper) %>% 
  mutate(val = ifelse(val==last(val), 
  0, 
  val))

But some more research shows the helper functions that make this trivial.

Recall you wanted the last row in each group altered, I think to have an NA in 
a column. I used 0 above but can use NA just as easily, or any constant. The 
functions are:

n() gives the number of rows in the group.
row_number() gives the number of the current row, within its group, as the 
functionality is applied. The condition being offered is row_number() == n(), 
so this version surgically changes just the last rows no matter what other 
rows contain.

mydf %>% 
  group_by(grouper) %>% 
  mutate(val = ifelse(row_number() == n(), 
  0, 
  val))

If you have no interest in using a package like this, someone else will likely 
point you to a way. I suspect using something like split() to make a list of 
data.frames, then applying some function to each smaller data.frame to get 
the result, then recombining, would work.
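For completeness, a base-R sketch along those split()/recombine lines, using
the same toy data:

```r
mydf <- data.frame(grouper = c(1, 2, 2, 3, 3, 3),
                   val     = c(1, 1, 2, 1, 2, 3))

pieces <- split(mydf, mydf$grouper)                # one data.frame per group
pieces <- lapply(pieces, function(d) { d$val[nrow(d)] <- 0; d })
result <- do.call(rbind, pieces)                   # recombine the pieces
result$val   # 0 1 0 1 2 0
```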


-Original Message-
From: R-help  On Behalf Of Kai Yang via R-help
Sent: Thursday, September 9, 2021 3:00 PM
To: R-help Mailing List 
Subject: [R] how to find "first" or "last" record after sort in R

Hello List,
Please look at the sample data frame below:

ID  date1       date2       date3
1   2015-10-08  2015-12-17  2015-07-23
2   2016-01-16  NA          2015-10-08
3   2016-08-01  NA          2017-01-10
3   2017-01-10  NA          2016-01-16
4   2016-01-19  2016-02-24  2016-08-01
5   2016-03-01  2016-03-10  2016-01-19

This data frame was sorted by ID and date1. I need to set the column date3 as 
missing for the "last" record for each ID. In the sample data set, IDs 1, 2, 4 
and 5 have one row only, so they can be considered as both first and last 
records; their date3 can be set as missing. But ID 3 has 2 rows. Since I sorted 
the data by ID and date1, the row with ID=3 and date1=2017-01-10 should be the 
last record. I need to set date3=NA for this row only.

the question is, how can I identify the "last" record and set it as NA in date3 
column.
Thank you,
Kai
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] how to find "first" or "last" record after sort in R

2021-09-10 Thread Avi Gross via R-help
Excellent function to use, Terry.

 

I note that when I used it on a vector (in this case the first column of a 
data.frame), it accepted last=TRUE as well as fromLast=TRUE, which I did not see 
documented. Used on a data.frame, that trick fails, as 
duplicated.data.frame only passes along the fromLast keyword value. 😉

 

When given a problem, we sometimes use a hammer when existing functions are 
already there to help.
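As a small demonstration of Terry's duplicated() idiom on data shaped like
Kai's (the values here are made up for illustration):

```r
mydata <- data.frame(
  ID    = c(1, 2, 3, 3, 4, 5),
  date3 = c("2015-07-23", "2015-10-08", "2017-01-10",
            "2016-01-16", "2016-08-01", "2016-01-19"))

last <- !duplicated(mydata$ID, fromLast = TRUE)  # TRUE at each ID's final row
mydata$date3[last] <- NA
mydata$date3   # only row 3, the non-last row for ID 3, keeps its date
```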

 

-Original Message-
From: R-help  On Behalf Of Therneau, Terry M., 
Ph.D. via R-help
Sent: Friday, September 10, 2021 8:14 AM
To: yangkai9...@yahoo.com; R-help 
Subject: Re: [R] how to find "first" or "last" record after sort in R

 

I prefer the duplicated() function, since the final code will be clear to a 
future reader. 

  (Particularly when I am that future reader).

 

last <- !duplicated(mydata$ID, fromLast=TRUE)  # point to the last ID for each subject
mydata$data3[last] <- NA

 

Terry T.

 

(I read the list once a day in digest form, so am always a late reply.)

 

On 9/10/21 5:00 AM, r-help-requ...@r-project.org wrote:

> Hello List,

> Please look at the sample data frame below:

> 

> ID  date1       date2       date3
> 1   2015-10-08  2015-12-17  2015-07-23
> 2   2016-01-16  NA          2015-10-08
> 3   2016-08-01  NA          2017-01-10
> 3   2017-01-10  NA          2016-01-16
> 4   2016-01-19  2016-02-24  2016-08-01
> 5   2016-03-01  2016-03-10  2016-01-19
>
> This data frame was sorted by ID and date1. I need to set the column date3
> as missing for the "last" record for each ID. In the sample data set, IDs
> 1, 2, 4 and 5 have one row only, so they can be considered as both first
> and last records; their date3 can be set as missing. But ID 3 has 2 rows.
> Since I sorted the data by ID and date1, the row with ID=3 and
> date1=2017-01-10 should be the last record. I need to set date3=NA for
> this row only.

> 

> the question is, how can I identify the "last" record and set it as NA in 
> date3 column.

> Thank you,

> Kai

> [[alternative HTML version deleted]]

> 

 



[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] tidyverse: grouped summaries (with summerize)

2021-09-13 Thread Avi Gross via R-help
Rich,

Did I miss something? The summarise() command is telling you that you had not 
explicitly specified how to group the output, so it made a guess. The canonical 
way is:

... %>% group_by(year, month, day, hour) %>% summarise(...)


You decide which fields to group by, sometimes including others so they are in 
the output. 

Avi

-Original Message-
From: R-help  On Behalf Of Rich Shepard
Sent: Monday, September 13, 2021 4:53 PM
To: r-help@r-project.org
Subject: [R] tidyverse: grouped summaries (with summerize)

I changed the data files so the date-times are in five separate columns:
year, month, day, hour, and minute; for example:

year,month,day,hour,min,cfs
2016,03,03,12,00,149000
2016,03,03,12,10,15
2016,03,03,12,20,151000
2016,03,03,12,30,156000
2016,03,03,12,40,154000
2016,03,03,12,50,15
2016,03,03,13,00,153000
2016,03,03,13,10,156000
2016,03,03,13,20,154000

The script is based on the example (on page 59 of 'R for Data Science'):
library('tidyverse')
disc <- read.csv('../data/water/disc.dat', header = TRUE, sep = ',',
                 stringsAsFactors = FALSE)
disc$year  <- as.integer(disc$year)
disc$month <- as.integer(disc$month)
disc$day   <- as.integer(disc$day)
disc$hour  <- as.integer(disc$hour)
disc$min   <- as.integer(disc$min)
disc$cfs   <- as.double(disc$cfs, length = 6)

# use dplyr to filter() by year, month, day; summarize() to get monthly
# means, sds
disc_by_month <- group_by(disc, year, month)
summarize(disc_by_month, vol = mean(cfs, na.rm = TRUE))

but my syntax is off because the results are:
> source('disc.R')
`summarise()` has grouped output by 'year'. You can override using the 
`.groups` argument.
Warning messages:
1: In eval(ei, envir) : NAs introduced by coercion
2: In eval(ei, envir) : NAs introduced by coercion
> ls()
[1] "disc"  "disc_by_month"
> disc_by_month
# A tibble: 590,940 × 6
# Groups:   year, month [66]
    year month   day  hour   min    cfs
   <int> <int> <int> <int> <int>  <int>
 1  2016     3     3    12     0 149000
 2  2016     3     3    12    10     15
 3  2016     3     3    12    20 151000
 4  2016     3     3    12    30 156000
 5  2016     3     3    12    40 154000
 6  2016     3     3    12    50     15
 7  2016     3     3    13     0 153000
 8  2016     3     3    13    10 156000
 9  2016     3     3    13    20 154000
10  2016     3     3    13    30 155000
# … with 590,930 more rows

I have the same results if I use as.numeric rather than as.integer and 
as.double. What am I doing incorrectly?

TIA,

Rich

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] tidyverse: grouped summaries (with summerize)

2021-09-13 Thread Avi Gross via R-help
As Eric has pointed out, perhaps Rich is not thinking pipelined. Summarize() 
takes a first argument as:
summarise(.data=whatever, ...)

But in a pipeline, you OMIT the first argument and let the pipeline supply an 
argument silently.

What I think summarize saw was something like:

summarize(. , disc_by_month, vol = mean(cfs, na.rm = TRUE))

There is now a superfluous SECOND argument in a place it expected not a 
data.frame type of variable but the name of a column in the hidden 
data.frame-like object it was passed. You do not have a column called 
disc_by_month and presumably some weird logic made it suggest it was replacing 
that by the first column or something.

I hope this makes sense. You do not cobble a pipeline together from parts 
without carefully making sure all first arguments otherwise used are NOT used.

And, just FYI, the subject line should not use a word that some see as the 
opposite companion of "winterize" ...

-Original Message-
From: R-help  On Behalf Of Rich Shepard
Sent: Monday, September 13, 2021 5:51 PM
To: r-help@r-project.org
Subject: Re: [R] tidyverse: grouped summaries (with summerize)

On Mon, 13 Sep 2021, Rich Shepard wrote:

> That's what I thought I did. I'll rewrite the script and work toward 
> the output I need.

Still not the correct syntax. Command is now:
disc_by_month %>%
 group_by(year, month) %>%
 summarize(disc_by_month, vol = mean(cfs, na.rm = TRUE))

and results are:
> source('disc.R')
`summarise()` has grouped output by 'year', 'month'. You can override using the 
`.groups` argument.

> disc_by_month
# A tibble: 590,940 × 6
# Groups:   year, month [66]
    year month   day  hour   min    cfs
   <int> <int> <int> <int> <int>  <int>
 1  2016     3     3    12     0 149000
 2  2016     3     3    12    10     15
 3  2016     3     3    12    20 151000
 4  2016     3     3    12    30 156000
 5  2016     3     3    12    40 154000
 6  2016     3     3    12    50     15
 7  2016     3     3    13     0 153000
 8  2016     3     3    13    10 156000
 9  2016     3     3    13    20 154000
10  2016     3     3    13    30 155000
# … with 590,930 more rows

The grouping is still not right. I expected to see a mean value for each month 
of each year in the data set, not for each minute.

Rich

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] tidyverse: grouped summaries (with summArize)

2021-09-13 Thread Avi Gross via R-help
I think we wandered away into a package rather than base R, but the request 
seems easy enough.

Just FYI, Rich, as you seem not to have incorporated the advice we gave yet 
about the first argument, your use of group_by() is a tad odd.

disc %>%
 group_by(hour) %>%
 group_by(day) %>%
 group_by(year, month) %>%
 summarize(disc_by_month, vol = mean(cfs, na.rm = TRUE))

Not sure why you use disc once and disc_by_month a second, superfluous, time,
but if you read the manual page for group_by(),
https://dplyr.tidyverse.org/reference/group_by.html, you may note it tends to
be called ONCE, with multiple arguments that specify which columns in
the data.frame to group by, in sequence.

disc %>%
 group_by(hour, day, year, month) %>%
 summarize(vol = mean(cfs, na.rm = TRUE))

Not sure most people would group that way, as the above groups by hour first. 
Many might reverse that sequence.

-Original Message-
From: R-help  On Behalf Of Rich Shepard
Sent: Monday, September 13, 2021 6:32 PM
To: R mailing list 
Subject: Re: [R] tidyverse: grouped summaries (with summerize)

On Tue, 14 Sep 2021, Eric Berger wrote:

> This code is not correct:
> disc_by_month %>%
> group_by(year, month) %>%
> summarize(disc_by_month, vol = mean(cfs, na.rm = TRUE)) It should 
> be:
> disc %>% group_by(year,month) %>% summarize(vol=mean(cfs,na.rm=TRUE)

Eric/Avi:

That makes no difference:
> disc_by_month
# A tibble: 590,940 × 6
# Groups:   year, month [66]
    year month   day  hour   min    cfs
   <int> <int> <int> <int> <int>  <int>
 1  2016     3     3    12     0 149000
 2  2016     3     3    12    10     15
 3  2016     3     3    12    20 151000
 4  2016     3     3    12    30 156000
 5  2016     3     3    12    40 154000
 6  2016     3     3    12    50     15
 7  2016     3     3    13     0 153000
 8  2016     3     3    13    10 156000
 9  2016     3     3    13    20 154000
10  2016     3     3    13    30 155000
# … with 590,930 more rows

I wondered if I need to group first by hour, then day, then year-month.
This, too, produces the same output:

disc %>%
 group_by(hour) %>%
 group_by(day) %>%
 group_by(year, month) %>%
 summarize(disc_by_month, vol = mean(cfs, na.rm = TRUE))

And disc shows the read dataframe.

I don't understand why the columns are not grouping.

Thanks,

Rich

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] tidyverse: grouped summaries (with summarize) [RESOLVED]

2021-09-13 Thread Avi Gross via R-help
Just FYI, Rich, the pipeline idiom does allow, but not require, the method you 
used:

Yours was
  RESULT <-
DATAFRAME %>%
FN1(args) %>%
...
FNn(args)

But equally valid are forms that assign the result at the end:

DATAFRAME %>%
FN1(args) %>%
...
FNn(args) -> RESULT

Or that supply the first argument to just the first function:

FN1(DATAFRAME, args) %>%
...
FNn(args) -> RESULT

And if you read some tutorials, there are many other things you can do,
including variants on the pipe symbol, ways to put the piped variable into
a different position (not the first) of the call that follows, and lots more.
Some people spend most of their programming time relatively purely in the
tidyverse functions without looking much at base R.

I am not saying that is a good thing.
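A side-by-side sketch of the three equivalent forms, using a toy data.frame
(this assumes dplyr is loaded):

```r
library(dplyr)

df <- data.frame(g = c(1, 1, 2), x = c(1, 3, 5))

# assignment up front
r1 <- df %>% group_by(g) %>% summarize(m = mean(x))

# assignment at the end
df %>% group_by(g) %>% summarize(m = mean(x)) -> r2

# data supplied to the first function only
r3 <- group_by(df, g) %>% summarize(m = mean(x))
```

All three produce the same result; which to use is purely a matter of style.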


-Original Message-
From: R-help  On Behalf Of Rich Shepard
Sent: Monday, September 13, 2021 7:04 PM
To: r-help@r-project.org
Subject: Re: [R] tidyverse: grouped summaries (with summarize) [RESOLVED]

On Mon, 13 Sep 2021, Avi Gross via R-help wrote:

> As Eric has pointed out, perhaps Rich is not thinking pipelined. Summarize() 
> takes a first argument as:
>   summarise(.data=whatever, ...)
>
> But in a pipeline, you OMIT the first argument and let the pipeline supply an 
> argument silently.

Avi,

Thank you. I read your message carefully and re-read the example on the bottom 
of page 60 and top of page 61. Then changed the command to:
disc_by_month = disc %>%
 group_by(year, month) %>%
 summarize(vol = mean(cfs, na.rm = TRUE))

And, the script now returns what I need:
> disc_by_month
# A tibble: 66 × 3
# Groups:   year [7]
    year month     vol
   <int> <int>   <dbl>
 1  2016     3 221840.
 2  2016     4 288589.
 3  2016     5 255164.
 4  2016     6 205371.
 5  2016     7 167252.
 6  2016     8 140465.
 7  2016     9  97779.
 8  2016    10 135482.
 9  2016    11 166808.
10  2016    12 165787.

I missed the beginning of the command where the resulting dataframe needs to be 
named first.

This clarifies my understanding and I appreciate your and Eric's help.

Regards,

Rich

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] Need fresh eyes to see what I'm missing

2021-09-14 Thread Avi Gross via R-help
Rich,

I reproduced your problem after re-arranging the code the mailer mangled. I 
tried variations like not using pipes or changing what it is grouped by, and 
they all show your results on the abbreviated data with the error:

`summarise()` has grouped output by 'year'. You can override using the 
`.groups` argument.

I think I fixed summarise(), but it makes me wonder if there is an 
inconsistency introduced along the way, as what you used is supposed to work 
and has worked for me in the past.

I note the man page for summarise() mentions that the .groups="..." is 
experimental and a tad confusing:

I changed your code to this by telling it to keep the grouping in the output 
the same:

vel_by_month = vel %>%
  group_by(year, month) %>%
  summarise(flow = mean(fps, na.rm = TRUE), .groups="keep")

The change from your code is the addition at the very end of the .groups="keep" 
argument.

Since I used your limited data, this is all I get:

> vel_by_month
# A tibble: 1 x 3
# Groups:   year, month [1]
   year month  flow
  <int> <int> <dbl>
1  2016     3  1.77

For now, all I did was shut summarise() up.

Not having the rest of your data, the question is where your NA and NaN are 
introduced. If the change I made above does not resolve it, then as others 
suggested, you begin by looking at your data more carefully perhaps starting 
with the .CSV file and then the data structures in R, along the lines of what 
you were shown. I find the table() function useful for categorical data with 
limited choices as it would spit out the anomaly as happening once.
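For example, table() would have flagged the stray year immediately (made-up
values mimicking the symptom):

```r
year <- c(0, rep(2016, 6), rep(2017, 3))
table(year, useNA = "ifany")   # the lone 0 stands out against 2016/2017
```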

I see your point about needing fresh eyes. My eyes do not see what you did 
wrong; I am just following clues you may be ignoring.


-Original Message-
From: R-help  On Behalf Of Rich Shepard
Sent: Tuesday, September 14, 2021 11:21 AM
To: r-help@r-project.org
Subject: [R] Need fresh eyes to see what I'm missing

The data file begins this way:
year,month,day,hour,min,fps
2016,03,03,12,00,1.74
2016,03,03,12,10,1.75
2016,03,03,12,20,1.76
2016,03,03,12,30,1.81
2016,03,03,12,40,1.79
2016,03,03,12,50,1.75
2016,03,03,13,00,1.78
2016,03,03,13,10,1.81

The script to process it:
library('tidyverse')
vel <- read.csv('../data/water/vel.dat', header = TRUE, sep = ',',
                stringsAsFactors = FALSE)
vel$year  <- as.integer(vel$year)
vel$month <- as.integer(vel$month)
vel$day   <- as.integer(vel$day)
vel$hour  <- as.integer(vel$hour)
vel$min   <- as.integer(vel$min)
vel$fps   <- as.double(vel$fps, length = 6)

# use dplyr to filter() by year, month, day; summarize() to get monthly
# means
vel_by_month = vel %>%
 group_by(year, month) %>%
 summarize(flow = mean(fps, na.rm = TRUE))

R's display after running the script:
> source('vel.R')
`summarise()` has grouped output by 'year'. You can override using the 
`.groups` argument.
Warning messages:
1: In eval(ei, envir) : NAs introduced by coercion
2: In eval(ei, envir) : NAs introduced by coercion
3: In eval(ei, envir) : NAs introduced by coercion

The dataframe created by the read.csv() command:
> head(vel)
  year month day hour min  fps
1 2016     3   3   12   0 1.74
2 2016     3   3   12  10 1.75
3 2016     3   3   12  20 1.76
4 2016     3   3   12  30 1.81
5 2016     3   3   12  40 1.79
6 2016     3   3   12  50 1.75

and the resulting grouping:
> vel_by_month
# A tibble: 67 × 3
# Groups:   year [8]
    year month   flow
   <int> <int>  <dbl>
 1     0    NA NaN   
 2  2016     3   2.40
 3  2016     4   3.00
 4  2016     5   2.86
 5  2016     6   2.51
 6  2016     7   2.18
 7  2016     8   1.89
 8  2016     9   1.38
 9  2016    10   1.73
10  2016    11   2.01
# … with 57 more rows

I cannot find why line 1 is there. Other data sets don't produce this result.

TIA,

Rich

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] Need fresh eyes to see what I'm missing

2021-09-14 Thread Avi Gross via R-help
Rich,

I have to wonder about how your data was placed in the CSV file based on
what you report.

functions like read.table() (which is called by read.csv()) ultimately make
guesses about what number of columns to expect and what the contents are
likely to be. They may just examine the first N entries and make the most
compatible choice. The fact that it shows this:

'data.frame':   565675 obs. of  6 variables:
  $ year : chr  "2016" "2016" "2016" "2016" ...
  $ month: int  3 3 3 3 3 3 3 3 3 3 ...
  $ day  : int  3 3 3 3 3 3 3 3 3 3 ...
  $ hour : chr  "12" "12" "12" "12" ...
  $ min  : int  0 10 20 30 40 50 0 10 20 30 ...
  $ fps  : chr  "1.74" "1.75" "1.76" "1.81" ...

is odd. It suggests that somewhere early in the data an entry was not a plain
integer like 2016 but something quoted like "2016", or a word like `missing`
not in quotes.

Something similar seems to have happened with hour and fps but not the rest.

Nonetheless, you did convert back to what you wanted BUT if a single
anomalous entry remains then as.integer("missing") would return an NA and
as.double("missing") also an NA. So it is wise to check for any unexpected
numbers. If the source cannot be changed, then the R program can filter out
such cases from your data.frame in various ways.
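As a minimal sketch of that filtering, with made-up values standing in for the real file:

```r
# Stand-in data: one fps entry that failed to parse and became NA
vel <- data.frame(month = c(3L, 3L, 4L),
                  fps   = suppressWarnings(as.double(c("1.74", "Eqp", "1.76"))))
vel_clean <- vel[!is.na(vel$fps), ]   # drop rows where fps is NA
nrow(vel_clean)                       # 2
```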

Your way of reading the CSV in was this:

vel <- read.csv('../data/water/vel.dat', header = TRUE, sep = ',',
stringsAsFactors = FALSE)

The default is the options you added for header=TRUE and sep="," so that is
harmless. The default now is not to read in strings as Factors. But what you
did not include may be something you can look at given your data may be a
bit off. 

Without the underlying file, we cannot trivially diagnose what may be wrong
in it. Do you get any error messages when reading in the file? You can
specify additional arguments to read.csv() about what, if any, quoting
characters are used, what sequences should be recognized as an NA,
suggestions of what type each column should be assumed to be, what to do
with blank lines, what a comment looks like, and so on.
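A hedged sketch of those arguments, run on an inline stand-in for the file (the token "missing" and the column types here are assumptions for illustration, not taken from the real data):

```r
csv_text <- "year,month,day,hour,min,fps
2016,03,03,12,00,1.74
2016,03,03,12,10,missing"
vel <- read.csv(text = csv_text,
                na.strings   = c("NA", "", "missing"),      # tokens to read as NA
                colClasses   = c(rep("integer", 5), "numeric"),
                comment.char = "#")
str(vel)   # every column now has the declared type; the bad cell is NA
```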

One thing I sometimes have had to do is open the original CSV file in EXCEL
and examine it in various ways or even change it and save it again. That is
beyond the scope of this mailing list so if needed, ask me in private. You
have been working on this kind of stuff, but I assume often using other
tools outside R and dplyr.






-Original Message-
From: R-help  On Behalf Of Rich Shepard
Sent: Tuesday, September 14, 2021 11:49 AM
To: R mailing list 
Subject: Re: [R] Need fresh eyes to see what I'm missing

On Tue, 14 Sep 2021, Bert Gunter wrote:

> Remove all your as.integer() and as.double() coercions. They are 
> unnecessary (unless you are preparing input for C code; also, all R 
> non-integers are double precision) and may be the source of your problems.

Bert,

When I remove coercions the script produces warnings like this:
1: In mean.default(fps, na.rm = TRUE) :
   argument is not numeric or logical: returning NA

and str(vel) displays this:
'data.frame':   565675 obs. of  6 variables:
  $ year : chr  "2016" "2016" "2016" "2016" ...
  $ month: int  3 3 3 3 3 3 3 3 3 3 ...
  $ day  : int  3 3 3 3 3 3 3 3 3 3 ...
  $ hour : chr  "12" "12" "12" "12" ...
  $ min  : int  0 10 20 30 40 50 0 10 20 30 ...
  $ fps  : chr  "1.74" "1.75" "1.76" "1.81" ...

so month, day, and min are recognized as integers but year, hour, and fps
are seen as characters. I don't understand why.

Regards,

Rich

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] Need fresh eyes to see what I'm missing

2021-09-14 Thread Avi Gross via R-help
Rich,

You have helped us understand, and at this point suppose we are now sure about
the way missing info is supplied. What you show is not the same as the CSV
sample earlier, but let us assume "Eqp" is the one and only way they signaled
bad data.

One choice is to fix the original data before reading into R. Chances are
placing exactly NA in those places, perhaps using a global substitute of
some sort, might do it.

But as Bert noted, R is a very powerful environment and you can use it.

One argument you can use with read.csv() is to tell it "Eqp" is to be
treated as an NA. The substitution may then be made as it is read in AND you
might then notice it is properly read in as a column of doubles.
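For example, a sketch on inline stand-in data (the 1.83 value is invented, not from Rich's file):

```r
csv_text <- "stamp,fps
2020-11-24 11:00PST,1.83
2020-11-24 11:05PST,Eqp"
vel <- read.csv(text = csv_text, na.strings = c("NA", "Eqp"))
class(vel$fps)   # "numeric" -- the Eqp row became NA instead of forcing character
```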

Suppose you read in this data and make sure the column involved is read as
character strings, instead. You can use any number of tools in base R or
dplyr to replace Eqp with NA such as in a pipeline ... %>%
mutate(fps=ifelse(fps=="Eqp", NA, fps)) %>% ...

The above is one of many ways and of course afterward, you may want to
reconvert the character column back to floating point. I note dplyr can do
both in the same function as it applies them in order:

mutate(fps=ifelse(fps=="Eqp", NA, fps), fps=as.double(fps))

The point is that in many cases, the data must be carefully examined and
cleaned and set up. In some cases, it may also be useful to treat some as
factors as in the hours and minutes. If you continue on your road and hit
ggplot() to make graphs, factors may be useful in various kinds of fine
tuning.

-Original Message-
From: R-help  On Behalf Of Rich Shepard
Sent: Tuesday, September 14, 2021 1:59 PM
To: r-help@r-project.org
Subject: Re: [R] Need fresh eyes to see what I'm missing

On Tue, 14 Sep 2021, Bert Gunter wrote:

> **Don't do this.*** You will make errors. Use fit-for-purpose tools.
> That's what R is for. Also, be careful **how** you "download", as that 
> already may bake in problems.

Bert,

Haven't had downloading errors saving displayed files.

The problem with the velocities data is shown here:
2020-11-24 11:00PST Eqp 
2020-11-24 11:05PST Eqp 
2020-11-24 11:10PST Eqp 
2020-11-24 11:15PST Eqp 
2020-11-24 11:20PST Eqp 
2020-11-24 11:25PST Eqp 
2020-11-24 11:30PST Eqp 
2020-11-24 11:35PST Eqp 
2020-11-24 11:40PST Eqp 
2020-11-24 11:45PST Eqp 
2020-11-24 11:50PST Eqp 
2021-01-08 16:26PST Eqp

Equipment failure during the period shown.

What's the best way to replace these lines? Just remove them or change them
to NA?

Regards,

Rich

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] How to remove all rows that have a numeric in the first (or any) column

2021-09-14 Thread Avi Gross via R-help
Calling something a data.frame does not make it a data.frame.

The abbreviated object shown below is a list of singletons. If it is a column 
in a larger object that is a data.frame, then it is a list column which is 
valid but can be ticklish to handle within base R but less so in the tidyverse.

For example, if I try to make a data.frame the normal way, the list gets made 
into multiple columns and copied to each row. Not what was expected. I think 
some tidyverse functionality does better.

Like this:

library(tidyverse)
temp=list("Hello", 1, 1.1, "bye")

Now making a data.frame has an odd result:

> mydf=data.frame(alpha=1:4, beta=temp)
> mydf
  alpha beta..Hello. beta.1 beta.1.1 beta..bye.
1     1        Hello      1      1.1        bye
2     2        Hello      1      1.1        bye
3     3        Hello      1      1.1        bye
4     4        Hello      1      1.1        bye

But a tibble handles it:

> mydf=tibble(alpha=1:4, beta=temp)
> mydf
# A tibble: 4 x 2
  alpha beta     
  <int> <list>   
1     1 <chr [1]>
2     2 <dbl [1]>
3     3 <dbl [1]>
4     4 <chr [1]>

So if the data does look like this, with a list column, it can be handled, but 
access can be tricky: subsetting a list with [] returns a list, while [[]] 
returns the element itself.
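A quick illustration of that [] versus [[]] distinction:

```r
temp <- list("Hello", 1, 1.1, "bye")
temp[2]                # a list of length 1 wrapping the number
temp[[2]]              # the number itself
is.numeric(temp[2])    # FALSE -- a list is never numeric
is.numeric(temp[[2]])  # TRUE
```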

I found a somewhat odd solution like this:

mydf %>%
   filter(!map_lgl(beta, is.numeric)) -> mydf2
# A tibble: 2 x 2
  alpha beta     
  <int> <list>   
1     1 <chr [1]>
2     4 <chr [1]>

When I saved that result into mydf2, I got this.

Original:
  
  > str(mydf)
tibble [4 x 2] (S3: tbl_df/tbl/data.frame)
$ alpha: int [1:4] 1 2 3 4
$ beta :List of 4
..$ : chr "Hello"
..$ : num 1
..$ : num 1.1
..$ : chr "bye"

Output when any row with a numeric is removed:

> str(mydf2)
tibble [2 x 2] (S3: tbl_df/tbl/data.frame)
$ alpha: int [1:2] 1 4
$ beta :List of 2
..$ : chr "Hello"
..$ : chr "bye"

So if you try variations on your code motivated by what I show, good luck. I am 
sure there are many better ways but I repeat, it can be tricky.

-Original Message-
From: R-help  On Behalf Of Jeff Newmiller
Sent: Tuesday, September 14, 2021 11:54 PM
To: Gregg Powell 
Cc: Gregg Powell via R-help 
Subject: Re: [R] How to remove all rows that have a numeric in the first (or 
any) column

You cannot apply vectorized operators to list columns... you have to use a map 
function like sapply or purrr::map_lgl to obtain a logical vector by running 
the function once for each list element:

sapply( VPN_Sheet1$HVA, is.numeric )

On September 14, 2021 8:38:35 PM PDT, Gregg Powell  
wrote:
>Here is the output:
>
>> str(VPN_Sheet1$HVA)
>List of 2174
> $ : chr "Email: f...@fff.com"
> $ : num 1
> $ : chr "Eloisa Libas"
> $ : chr "Percival Esquejo"
> $ : chr "Louchelle Singh"
> $ : num 2
> $ : chr "Charisse Anne Tabarno, RN"
> $ : chr "Sol Amor Mucoy"
> $ : chr "Josan Moira Paler"
> $ : num 3
> $ : chr "Anna Katrina V. Alberto"
> $ : chr "Nenita Velarde"
> $ : chr "Eunice Arrances"
> $ : num 4
> $ : chr "Catherine Henson"
> $ : chr "Maria Carla Daya"
> $ : chr "Renee Ireine Alit"
> $ : num 5
> $ : chr "Marol Joseph Domingo - PS"
> $ : chr "Kissy Andrea Arriesgado"
> $ : chr "Pia B Baluyut, RN"
> $ : num 6
> $ : chr "Gladys Joy Tan"
> $ : chr "Frances Zarzua"
> $ : chr "Fairy Jane Nery"
> $ : num 7
> $ : chr "Gladys Tijam, RMT"
> $ : chr "Sarah Jane Aramburo"
> $ : chr "Eve Mendoza"
> $ : num 8
> $ : chr "Gloria Padolino"
> $ : chr "Joyce Pearl Javier"
> $ : chr "Ayza Padilla"
> $ : num 9
> $ : chr "Walfredson Calderon"
> $ : chr "Stephanie Anne Militante"
> $ : chr "Rennua Oquilan"
> $ : num 10
> $ : chr "Neil John Nery"
> $ : chr "Maria Reyna Reyes"
> $ : chr "Rowella Villegas"
> $ : num 11
> $ : chr "Katelyn Mendiola"
> $ : chr "Maria Riza Mariano"
> $ : chr "Marie Vallianne Carantes"
> $ : num 12
>
>‐‐‐ Original Message ‐‐‐
>
>On Tuesday, September 14th, 2021 at 8:32 PM, Jeff Newmiller 
> wrote:
>
>> An atomic column of data by design has exactly one mode, so if any 
>> values are non-numeric then the entire column will be non-numeric. 
>> What does
>> 
>
>> str(VPN_Sheet1$HVA)
>> 
>
>> tell you? It is likely either a factor or character data.
>> 
>
>> On September 14, 2021 7:01:53 PM PDT, Gregg Powell via R-help 
>> r-help@r-project.org wrote:
>> 
>
>> > > Stuck on this problem - How does one remove all rows in a dataframe that 
>> > > have a numeric in the first (or any) column?
>> > 
>
>> > > Seems straight forward - but I'm having trouble.
>> > 
>
>> > I've attempted to used:
>> > 
>
>> > VPN_Sheet1 <- VPN_Sheet1[!is.numeric(VPN_Sheet1$HVA),]
>> > 
>
>> > and
>> > 
>
>> > VPN_Sheet1 <- VPN_Sheet1[!is.integer(VPN_Sheet1$HVA),]
>> > 
>
>> > Neither work - Neither throw an error.
>> > 
>
>> > class(VPN_Sheet1$HVA) returns:
>> > 
>
>> > [1] "list"
>> > 
>
>> > So, the HVA column returns a list.
>> > 
>
>> > > Data looks like the attached screen grab -
>> > 
>
>> > > The ONLY rows I need to delete are the rows where there is a numeric in 
>> > > the HVA column.
>> > 
>
>> > > There are some 5000+ rows in the actual data.
>> > 
>
>> > > Would be grateful for a solution 

Re: [R] How to remove all rows that have a numeric in the first (or any) column

2021-09-14 Thread Avi Gross via R-help
You are correct, Gregg, I am aware of that trick of asking something to not be 
evaluated in certain ways.

 

And you can indeed use base R to play with contents of beta as defined above.  
Here is a sort of incremental demo:

 

> sapply(mydf$beta, is.numeric)
[1] FALSE  TRUE  TRUE FALSE
> !sapply(mydf$beta, is.numeric)
[1]  TRUE FALSE FALSE  TRUE
> keeping <- !sapply(mydf$beta, is.numeric)
> mydf[keeping, ]
# A tibble: 2 x 2
  alpha beta     
  <int> <list>   
1     1 <chr [1]>
2     4 <chr [1]>
> str(mydf[keeping, ])
tibble [2 x 2] (S3: tbl_df/tbl/data.frame)
 $ alpha: int [1:2] 1 4
 $ beta :List of 2
  ..$ : chr "Hello"
  ..$ : chr "bye"

 

Now for the bad news. The original request was for ANY column. But presumably 
one way to do it, neither efficient nor the best, would be to loop over the 
names of all the columns and, starting with the original data.frame, whittle 
away at it column by column, adjusting which column you search each time, until 
what is left has nothing numeric anywhere.
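One base-R way to sketch that any-column version without an explicit loop (my own construction for this toy case, not code from the thread):

```r
mydf <- data.frame(alpha = I(list("first", 2, 3.3, "Last")),
                   beta  = I(list(1, "second", 3.3, "Lasting")))
# TRUE wherever any column's element in that row is numeric
numeric_anywhere <- Reduce(`|`, lapply(mydf, function(col) sapply(col, is.numeric)))
mydf[!numeric_anywhere, ]   # only the "Last"/"Lasting" row survives
```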

 

Now if I was using dplyr, I wonder if there is a nice way to use rowwise() to 
evaluate across a row.

 

Using your technique I made the following data.frame:

 

mydf <- data.frame(alpha=I(list("first", 2, 3.3, "Last")), 
                   beta=I(list(1, "second", 3.3, "Lasting")))

> mydf
  alpha    beta
1 first       1
2     2  second
3   3.3     3.3
4  Last Lasting

 

Do we agree only the fourth row should be kept as the others have one or two 
numeric values?

 

Here is some code I cobbled together that seems to work:

 

 

rowwise(mydf) %>% 
  mutate(alphazoid=!is.numeric(unlist(alpha)), 
         betazoid=!is.numeric(unlist(beta))) %>%
  filter(alphazoid & betazoid) -> result

str(result)
print(result)
result[[1,1]]
result[[1,2]]

as.data.frame(result)

 

The results are shown below that only the fourth row was kept:

 

> rowwise(mydf) %>% 
+   mutate(alphazoid=!is.numeric(unlist(alpha)), 
+          betazoid=!is.numeric(unlist(beta))) %>%
+   filter(alphazoid & betazoid) -> result
> str(result)  
rowwise_df [1 x 4] (S3: rowwise_df/tbl_df/tbl/data.frame)
 $ alpha    :List of 1
  ..$ : chr "Last"
  ..- attr(*, "class")= chr "AsIs"
 $ beta     :List of 1
  ..$ : chr "Lasting"
  ..- attr(*, "class")= chr "AsIs"
 $ alphazoid: logi TRUE
 $ betazoid : logi TRUE
 - attr(*, "groups")= tibble [1 x 1] (S3: tbl_df/tbl/data.frame)
  ..$ .rows: list [1:1] 
  .. ..$ : int 1
> print(result)
# A tibble: 1 x 4
# Rowwise: 
  alpha     beta      alphazoid betazoid
  <list>    <list>    <lgl>     <lgl>   
1 <chr [1]> <chr [1]> TRUE      TRUE
> result[[1,1]]
[[1]]
[1] "Last"

> result[[1,2]]
[[1]]
[1] "Lasting"

> as.data.frame(result)
  alpha    beta alphazoid betazoid
1  Last Lasting      TRUE     TRUE

 

Of course, the temporary columns for alphazoid and betazoid can trivially be 
removed.
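For instance, one way to do that removal (shown on a small stand-in for result):

```r
result <- data.frame(alpha = "Last", beta = "Lasting",
                     alphazoid = TRUE, betazoid = TRUE)
result <- subset(result, select = -c(alphazoid, betazoid))  # drop helper columns
names(result)   # "alpha" "beta"
```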

 

 

 

 

From: Andrew Simmons  
Sent: Wednesday, September 15, 2021 12:44 AM
To: Avi Gross 
Cc: Gregg Powell via R-help 
Subject: Re: [R] How to remove all rows that have a numeric in the first (or 
any) column

 

I'd like to point out that base R can handle a list as a data frame column, 
it's just that you have to make the list of class "AsIs". So in your example

 

temp <- list("Hello", 1, 1.1, "bye")

 

data.frame(alpha = 1:4, beta = I(temp)) 

 

means that column "beta" will still be a list.

 

 

On Wed, Sep 15, 2021, 00:40 Avi Gross via R-help <r-help@r-project.org> wrote:

Calling something a data.frame does not make it a data.frame.

The abbreviated object shown below is a list of singletons. If it is a column 
in a larger object that is a data.frame, then it is a list column which is 
valid but can be ticklish to handle within base R but less so in the tidyverse.

For example, if I try to make a data.frame the normal way, the list gets made 
into multiple columns and copied to each row. Not what was expected. I think 
some tidyverse functionality does better.

Like this:

library(tidyverse)
temp=list("Hello", 1, 1.1, "bye")

Now making a data.frame has an odd result:

> mydf=data.frame(alpha=1:4, beta=temp)
> mydf
  alpha beta..Hello. beta.1 beta.1.1 beta..bye.
1     1        Hello      1      1.1        bye
2     2        Hello      1      1.1        bye
3     3        Hello      1      1.1        bye
4     4        Hello      1      1.1        bye

But a tibble handles it:

> mydf=tibble(alpha=1:4, beta=temp)
> mydf
# A tibble: 4 x 2
  alpha beta     
  <int> <list>   
1     1 <chr [1]>
2     2 <dbl [1]>
3     3 <dbl [1]>
4     4 <chr [1]>

So if the data does look like this, with a list column, but access can be 
tricky as subsetting a list with [] returns a list and you need [[]].

I found a somehwh

Re: [R] How to remove all rows that have a numeric in the first (or any) column

2021-09-14 Thread Avi Gross via R-help
My apologies. My reply was to Andrew, not Gregg.

Enough damage for one night. Here is hoping we finally understood a question 
that could have been better phrased. list columns are not normally considered 
common data structures but quite possibly will be more as time goes on and the 
tools to handle them become better or at least better understood.


-Original Message-
From: R-help  On Behalf Of Avi Gross via R-help
Sent: Wednesday, September 15, 2021 1:23 AM
To: R-help@r-project.org
Subject: Re: [R] How to remove all rows that have a numeric in the first (or 
any) column


Re: [R] how to remove factors from whole dataframe?

2021-09-19 Thread Avi Gross via R-help
Glad we have solutions BUT I note that the more abstract question is how to 
convert any columns that are factors to their base type and that may well NOT 
be character. They can be integers or doubles or complex or Boolean and maybe 
even raw. 

So undoing factorization may require using something like typeof() to get the 
base type and then depending on what final type you have, you may have to do 
things like as.integer(as.character(the_factor)) to get it as an integer and 
for a logical, as.logical(factor(c(TRUE, TRUE, FALSE, TRUE, FALSE))) and so on.
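A small demonstration of why the as.character() step matters:

```r
f <- factor(c(10L, 20L, 30L))
as.integer(f)                 # 1 2 3 -- the internal level codes, not the data
as.integer(as.character(f))   # 10 20 30 -- the values the labels represent
```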

This seems like a fairly basic need, so I wonder if anyone has already done it. 
I can see a fairly straightforward way to build a string and use eval, and I 
suspect others might use something else like do.call(), and yet others multiple 
if statements or a case_when.
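A hedged sketch of such a helper (my own invention, not an established function; as noted elsewhere in the thread, it only works when the factor labels parse cleanly as the target type):

```r
# Restore a factor to a chosen base type via its character labels
unfactor <- function(f, as_type = as.character) as_type(as.character(f))

f <- factor(c("3.5", "2.5", "3.5"))
unfactor(f, as.double)   # 3.5 2.5 3.5
```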




-Original Message-
From: R-help  On Behalf Of Luigi Marongiu
Sent: Sunday, September 19, 2021 4:43 PM
To: Rui Barradas 
Cc: r-help 
Subject: Re: [R] how to remove factors from whole dataframe?

Awesome, thanks!

On Sun, Sep 19, 2021 at 4:22 PM Rui Barradas  wrote:
>
> Hello,
>
> Using Jim's lapply(., is.factor) but simplified, you could do
>
>
> df1 <- df
> i <- sapply(df1, is.factor)
> df1[i] <- lapply(df1[i], as.character)
>
>
> a one-liner modifying df, not df1 is
>
>
> df[sapply(df, is.factor)] <- lapply(df[sapply(df, is.factor)], 
> as.character)
>
>
> Hope this helps,
>
> Rui Barradas
>
> Às 11:03 de 19/09/21, Luigi Marongiu escreveu:
> > Thank you Jim, but I obtain:
> > ```
> >> str(df)
> > 'data.frame': 5 obs. of  3 variables:
> >   $ region : Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
> >   $ sales  : num  13 16 22 27 34
> >   $ country: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
> >> df1<-df[,!unlist(lapply(df,is.factor))]
> >> str(df1)
> >   num [1:5] 13 16 22 27 34
> >> df1
> > [1] 13 16 22 27 34
> > ```
> > I was expecting
> > ```
> > str(df)
> > 'data.frame': 5 obs. of  3 variables:
> >   $ region : char "A","B","C","D",..: 1 2 3 4 5
> >   $ sales  : num  13 16 22 27 34
> >   $ country: char "a","b","c","d",..: 1 2 3 4 5 ```
> >
> > On Sun, Sep 19, 2021 at 11:37 AM Jim Lemon  wrote:
> >>
> >> Hi Luigi,
> >> It's easy:
> >>
> >> df1<-df[,!unlist(lapply(df,is.factor))]
> >>
> >> _except_ when there is only one column left, as in your example. In 
> >> that case, you will have to coerce the resulting vector back into a 
> >> one column data frame.
> >>
> >> Jim
> >>
> >> On Sun, Sep 19, 2021 at 6:18 PM Luigi Marongiu  
> >> wrote:
> >>>
> >>> Hello,
> >>> I woul dlike to remove factors from all the columns of a dataframe.
> >>> I can do it n a column at the time with ```
> >>>
> >>> df <- data.frame(region=factor(c('A', 'B', 'C', 'D', 'E')),
> >>>   sales = c(13, 16, 22, 27, 34), 
> >>> country=factor(c('a', 'b', 'c', 'd', 'e')))
> >>>
> >>> new_df$region <- droplevels(new_df$region) ```
> >>>
> >>> What is the syntax to remove all factors at once (from all columns)?
> >>> For this does not work:
> >>> ```
>  str(df)
> >>> 'data.frame': 5 obs. of  3 variables:
> >>>   $ region : Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
> >>>   $ sales  : num  13 16 22 27 34
> >>>   $ country: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
>  df = droplevels(df)
>  str(df)
> >>> 'data.frame': 5 obs. of  3 variables:
> >>>   $ region : Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
> >>>   $ sales  : num  13 16 22 27 34
> >>>   $ country: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5 ``` 
> >>> Thank you
> >>>
> >>> __
> >>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide 
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >
> >
> >



--
Best regards,
Luigi



Re: [R] how to remove factors from whole dataframe?

2021-09-19 Thread Avi Gross via R-help
nality more fully. So, yes, that might be 
something that could be done but this is just an academic exercise for me.

-Original Message-
From: Bert Gunter  
Sent: Sunday, September 19, 2021 7:19 PM
To: Avi Gross 
Cc: Luigi Marongiu ; Rui Barradas 
; r-help 
Subject: Re: [R] how to remove factors from whole dataframe?

You do not understand factors. There is no "base type" that can be recovered.

> f <- factor(c(5.1, 6.2), labels = c("whoa","baby"))
> f
[1] whoa baby
Levels: whoa baby

> unclass(f)
[1] 1 2
attr(,"levels")
[1] "whoa" "baby"

> typeof(f)
[1] "integer"


Bert Gunter

"The trouble with having an open mind is that people keep coming along and 
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Sun, Sep 19, 2021 at 2:15 PM Avi Gross via R-help  
wrote:
>
> Glad we have solutions BUT I note that the more abstract question is how to 
> convert any columns that are factors to their base type and that may well NOT 
> be character. They can be integers or doubles or complex or Boolean and maybe 
> even raw.
>
> So undoing factorization may require using something like typeof() to get the 
> base type and then depending on what final type you have, you may have to do 
> things like as.integer(as.character(the_factor)) to get it as an integer and 
> for a logical, as.logical(factor(c(TRUE, TRUE, FALSE, TRUE, FALSE))) and so 
> on.
>
> This seems like a fairly basic need so I wonder if anyone has already 
> done it. I can see a fairly straightforward way to build a string and 
> use eval and I suspect others might use something else like do.call() 
> and yet others use multiple if statements or a case_when or something
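> One minimal sketch of such a helper, using type.convert() to guess each
> column's natural base type (the function name unfactor is made up for
> illustration, not an existing function):

```r
# Sketch: convert every factor column back to a plain vector, letting
# type.convert() decide whether it is character, integer, double, logical, ...
unfactor <- function(df) {
  i <- sapply(df, is.factor)
  df[i] <- lapply(df[i], function(f) type.convert(as.character(f), as.is = TRUE))
  df
}
```

On the example data frame from this thread, unfactor(df) would leave sales
numeric and turn region and country into plain character vectors.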
>
>
>
>
> -Original Message-
> From: R-help  On Behalf Of Luigi 
> Marongiu
> Sent: Sunday, September 19, 2021 4:43 PM
> To: Rui Barradas 
> Cc: r-help 
> Subject: Re: [R] how to remove factors from whole dataframe?
>
> Awesome, thanks!
>
> On Sun, Sep 19, 2021 at 4:22 PM Rui Barradas  wrote:
> >
> > Hello,
> >
> > Using Jim's lapply(., is.factor) but simplified, you could do
> >
> >
> > df1 <- df
> > i <- sapply(df1, is.factor)
> > df1[i] <- lapply(df1[i], as.character)
> >
> >
> > a one-liner modifying df, not df1 is
> >
> >
> > df[sapply(df, is.factor)] <- lapply(df[sapply(df, is.factor)],
> > as.character)
> >
> >
> > Hope this helps,
> >
> > Rui Barradas
> >
> > Às 11:03 de 19/09/21, Luigi Marongiu escreveu:
> > > Thank you Jim, but I obtain:
> > > ```
> > >> str(df)
> > > 'data.frame': 5 obs. of  3 variables:
> > >   $ region : Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
> > >   $ sales  : num  13 16 22 27 34
> > >   $ country: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
> > >> df1<-df[,!unlist(lapply(df,is.factor))]
> > >> str(df1)
> > >   num [1:5] 13 16 22 27 34
> > >> df1
> > > [1] 13 16 22 27 34
> > > ```
> > > I was expecting
> > > ```
> > > str(df)
> > > 'data.frame': 5 obs. of  3 variables:
> > >   $ region : char "A","B","C","D",..: 1 2 3 4 5
> > >   $ sales  : num  13 16 22 27 34
> > >   $ country: char "a","b","c","d",..: 1 2 3 4 5 ```
> > >
> > > On Sun, Sep 19, 2021 at 11:37 AM Jim Lemon  wrote:
> > >>
> > >> Hi Luigi,
> > >> It's easy:
> > >>
> > >> df1<-df[,!unlist(lapply(df,is.factor))]
> > >>
> > >> _except_ when there is only one column left, as in your example. 
> > >> In that case, you will have to coerce the resulting vector back 
> > >> into a one column data frame.
> > >>
> > >> Jim
> > >>
> > >> On Sun, Sep 19, 2021 at 6:18 PM Luigi Marongiu 
> > >>  wrote:
> > >>>
> > >>> Hello,
> > >>> I woul dlike to remove factors from all the columns of a dataframe.
> > >>> I can do it n a column at the time with ```
> > >>>
> > >>> df <- data.frame(region=factor(c('A', 'B', 'C', 'D', 'E')),
> > >>>   sales = c(13, 16, 22, 27, 34), 
> > >>> country=factor(c('a', 'b', 'c', 'd'

Re: [R] How to use ifelse without invoking warnings

2021-10-08 Thread Avi Gross via R-help
Ravi,

I have no idea what motivated the people who made ifelse(), but there is no
reason they should have felt obliged to program it to meet your precise need. As others
have noted, it probably was built to handle simple cases well and it expects
to return a result that is the same length as the input. If some of the
processing returns an NA or a NaN then that is what it probably should
return. 

What is the alternative? Return a shorter result? Replace it with a zero?
Fail utterly and abort the program?

YOU as the programmer should make such decisions for a non-routine case.

You can create functions with names like wrapperIf() and wrapperElse() and
do your ifelse like this:

result <- ifelse(condition, wrapperIf(args), wrapperElse(args))

Why the wrappers? If your logic is to replace NaN with 0 or NA or 666 or
Inf, then the code for it would invoke your functionality and if tested to
be a NaN it would replace it as you wish. Yes, it would slow things down a
bit but leave the ifelse() routine fairly simple.

If your goal is to remove those entries, you can do it after by manipulating
"result" above such as not keeping any item that matches 666, or even
without the wrappers, something like:

result <- result[!is.nan(result)]

But, of course, warnings are only suppressed if done right. Clearly you can
very selectively suppress warnings in the wrapper functions above without
also suppressing some other more valid warnings. But if the warning is
coming from ifelse() itself then not ever having it see a NaN would suppress
that.
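A sketch of what such a wrapper might look like for the pbeta() case discussed
in this thread (safe_pbeta and its replace argument are illustrative names,
not an existing function):

```r
# Sketch: evaluate pbeta() while suppressing its warnings, then map any
# NaN results to a caller-chosen replacement value.
safe_pbeta <- function(q, a, b, replace = NA) {
  res <- suppressWarnings(pbeta(q, a, b, lower.tail = FALSE))
  ifelse(is.nan(res), replace, res)
}
```

The point of the wrapper is that suppressWarnings() is confined to this one
call, so other, more legitimate warnings elsewhere still surface.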

Do note that the implementation of ifelse() is currently a function, not
some internal call. You can copy that and make your own slightly modified
version if you wish. 

(no R) Avi

-Original Message-
From: R-help  On Behalf Of Ravi Varadhan via
R-help
Sent: Friday, October 8, 2021 8:22 AM
To: John Fox
Cc: R-Help 
Subject: Re: [R] How to use ifelse without invoking warnings

Thank you to Bert, Sarah, and John. I did consider suppressing warnings, but
I felt that there must be a more principled approach.  While John's solution
is what I would prefer, I cannot help but wonder why `ifelse' was not
constructed to avoid this behavior.

Thanks & Best regards,
Ravi

From: John Fox 
Sent: Thursday, October 7, 2021 2:00 PM
To: Ravi Varadhan 
Cc: R-Help 
Subject: Re: [R] How to use ifelse without invoking warnings


  External Email - Use Caution



Dear Ravi,

It's already been suggested that you could disable warnings, but that's
risky in case there's a warning that you didn't anticipate. Here's a
different approach:

 > kk <- k[k >= -1 & k <= n]
 > ans <- numeric(length(k))
 > ans[k > n] <- 1
 > ans[k >= -1 & k <= n] <- pbeta(p, kk + 1, n - kk, lower.tail=FALSE)
 > ans
[1] 0.0 0.006821826 0.254991551 1.0

BTW, I don't think that you mentioned that p = 0.3, but that seems apparent
from the output you showed.

I hope this helps,
  John

--
John Fox, Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
web: https://socialsciences.mcmaster.ca/jfox/

On 2021-10-07 12:29 p.m., Ravi Varadhan via R-help wrote:
> Hi,
> I would like to execute the following vectorized calculation:
>
>ans <- ifelse (k >= -1 & k <= n, pbeta(p, k+1, n-k, lower.tail = 
> FALSE), ifelse (k < -1, 0, 1) )
>
> For example:
>
>
>> k <- c(-1.2,-0.5, 1.5, 10.4)
>> n <- 10
>> ans <- ifelse (k >= -1 & k <= n, pbeta(p,k+1,n-k,lower.tail=FALSE), 
>> ifelse (k < -1, 0, 1) )
> Warning message:
> In pbeta(p, k + 1, n - k, lower.tail = FALSE) : NaNs produced
>> print(ans)
> [1] 0.0 0.006821826 0.254991551 1.0
>
> The answer is correct.  However, I would like to eliminate the annoying
warnings.  Is there a better way to do this?
>
> Thank you,
> Ravi
>
>

[R] assumptions about how things are done

2021-10-09 Thread Avi Gross via R-help
This is supposed to be a forum for help so general and philosophical
discussions belong elsewhere, or nowhere.

 

Having said that, I want to make a brief point. Both new and experienced
people make implicit assumptions about the code they use. Often nobody looks
at how the sausage is made. The recent discussion of ifelse() made me take a
look and I was not thrilled.

 

My naïve view was that ifelse() was implemented as a sort of loop construct.
I mean if I have a vector of length N and perhaps a few other vectors of the
same length, I might say:

 

result <- ifelse(condition-on-vector-A, result-if-true-using-vectors,
result-if-false-using-vectors)

 

So say I want to take a vector of integers from 1 to N and make an output a
second vector where you have either a prime number or NA. If I have a
function called is.prime() that checks a single number and returns
TRUE/FALSE, it might look like this:

 

primed <- ifelse(is.prime(A), A, NA)

 

So A[1] will be mapped to 1 and A[2] to 2 and A[3] to 3, but A[4] being
composite becomes NA and so on.
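For concreteness, one way such a helper and its use might look (a sketch; the
author's is.prime() is hypothetical, so this is just one possible version):

```r
# A simple (inefficient) primality test, vectorized over its input with
# vapply(); good enough for small examples like 1:100.
is.prime <- function(x) {
  vapply(x, function(n) {
    if (n < 2) return(FALSE)
    if (n < 4) return(TRUE)          # 2 and 3 are prime
    all(n %% 2:floor(sqrt(n)) != 0)  # no divisor up to sqrt(n)
  }, logical(1))
}

A <- 1:10
primed <- ifelse(is.prime(A), A, NA)
primed   # NA 2 3 NA 5 NA 7 NA NA NA
```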

 

If you wrote the above using loops, it would be to range from index 1 to N
and apply the above. There are many complications as R allows vectors to be
longer or to be repeated as needed.

 

What I found ifelse() as implemented to do, is sort of like this:

 

Make a vector of the right length for the results, initially empty.

 

Make a vector evaluating the condition so it is effectively a Boolean
result.

Calculate which indices are TRUE. Secondarily, calculate another set of
indices that are false.

 

Calculate ALL the THEN conditions and ditto all the ELSE conditions.

 

Now copy into the result all the THEN values indexed by the TRUE above and
then all the ELSE values indicated by the FALSE above.

 

In plain English, make a result from two other results based on picking
either one from menu A or one from menu B.

 

That is not a bad algorithm and in a vectorized language like R, maybe even
quite effective and efficient. It does lots of extra work as by definition
it throws at least half away.
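In code, the algorithm described above is roughly this (a sketch, not the
actual base-R source, and ignoring attributes and some recycling corner
cases):

```r
my_ifelse <- function(test, yes, no) {
  ans <- rep(NA, length(test))                 # result of the right length
  yes <- rep(yes, length.out = length(test))   # evaluate ALL the THEN values
  no  <- rep(no,  length.out = length(test))   # ...and ALL the ELSE values
  ok  <- !is.na(test)
  ans[ok & test]  <- yes[ok & test]            # copy in THEN where TRUE
  ans[ok & !test] <- no[ok & !test]            # copy in ELSE where FALSE
  ans
}
```

Both yes and no are computed in full, which is exactly the "throws at least
half away" cost mentioned above.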

 

I suspect the implementation could be made much faster by making some of it
done internally using a language like C.

 

But now that I know what this implementation did, I might have some qualms
at using it in some situations. The original complaint led to other
observations and needs and perhaps blindly using a supplied function like
ifelse() may not be a decent solution for some needs.

 

I note how I had to reorient my work elsewhere using a group of packages
called the tidyverse when they added a function to allow rowwise
manipulation of the data as compared to an ifelse-like method using all
columns at once. There is room for many approaches and if a function may not
be doing quite what you want, something else may better meet your needs OR
you may want to see if you can copy the existing function and modify it for
your own personal needs.

 

In the case we mentioned, the goal was to avoid printing selected warnings.
Since the function is readable, it can easily be modified in a copy to find
what is causing the warnings and either rewrite a bit to avoid them or start
over with perhaps your own function that tests before doing things and
avoids tripping the condition (generating a NaN) entirely.

 

Like many languages, R is a bit too rich. You can piggyback on the work of
others, but with some caution, as they did not necessarily have you in mind
with what they created.

 

 




Re: [R] assumptions about how things are done

2021-10-11 Thread Avi Gross via R-help





--

Message: 4
Date: Sun, 10 Oct 2021 08:34:52 +1100
From: Jim Lemon 
To: Avi Gross 
Cc: r-help mailing list 
Subject: Re: [R] assumptions about how things are done
Message-ID:

Content-Type: text/plain; charset="utf-8"

Hi Avi,
Definitely a learning moment. I may consider writing an ifElse() for
my own use and sharing it if anyone wants it.

Jim

On Sun, Oct 10, 2021 at 6:36 AM Avi Gross via R-help
 wrote:
>
> This is supposed to be a forum for help so general and philosophical
> discussions belong elsewhere, or nowhere.
>
>
>
> Having said that, I want to make a brief point. Both new and experienced
> people make implicit assumptions about the code they use. Often nobody 
looks
> at how the sausage is made. The recent discussion of ifelse() made me 
take a
> look and I was not thrilled.
>
>
>
> My NAÏVE view was that ifelse() was implemented as a sort of loop 
construct.
> I mean if I have a vector of length N and perhaps a few other vectors of 
the
> same length, I might say:
>
>
>
> result <- ifelse(condition-on-vector-A, result-if-true-using-vectors,
> result-if-false-using-vectors)
>
>
>
> So say I want to take a vector of integers from 1 to N and make an output 
a
> second vector where you have either a prime number or NA. If I have a
> function called is.prime() that checks a single number and returns
> TRUE/FALSE, it might look like this:
>
>
>
> primed <- ifelse(is.prime(A), A, NA)
>
>
>
> So A[1] will be mapped to 1 and A[2] to 2 and A[3] to 3, but A[4] being
> composite becomes NA and so on.
>
>
>
> If you wrote the above using loops, it would be to range from index 1 to N
> and apply the above. There are many complications as R allows vectors to 
be
> longer or to be repeated as needed.
>
>
>
> What I found ifelse() as implemented to do, is sort of like this:
>
>
>
> Make a vector of the right length for the results, initially empty.
>
>
>
> Make a vector evaluating the condition so it is effectively a Boolean
> result.
>
> Calculate which indices are TRUE. Secondarily, calculate another set of
> indices that are false.
>
>
>
> Calculate ALL the THEN conditions and ditto all the ELSE conditions.
>
>
>
> No

Re: [R] Does intersect preserve order?

2021-10-17 Thread Avi Gross via R-help
intersect() is a generic function so the question is which one does someone
want to know if it remains in the same order?

But a deeper question is what ORDER? 

intersect(A, B)
intersect(B, A)

Note the results have to be the same but not the order unless they start
sorted the same way.
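A quick illustration of the point, using the current base implementation
(whose order, as discussed in this thread, is not guaranteed):

```r
A <- c(3, 1, 2)
B <- c(2, 3)
intersect(A, B)   # 3 2  -- follows the order of the first argument, A
intersect(B, A)   # 2 3  -- same elements, but in the order of B
```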

-Original Message-
From: R-help  On Behalf Of Duncan Murdoch
Sent: Sunday, October 17, 2021 5:49 AM
To: petr smirnov ; r-help@r-project.org
Subject: Re: [R] Does intersect preserve order?

On 15/10/2021 4:31 p.m., petr smirnov wrote:
> Hi,
> 
> Is base::intersect guaranteed to return items in the order they 
> (first) appear in the first argument? I couldn't find any mention of 
> this in the help file for set operations.

No, that's just what the current implementation does.

It's conceivable that swapping x and y could let it be faster in some
circumstances.  Or maybe there's a completely different implementation
that's better for some data types.  In either of those cases the order could
change.

Generally speaking, the functions that treat vectors as sets make no
assumptions and no guarantees about order, because sets are unordered.

If you need the current behaviour to be guaranteed, probably the easiest way
is to copy the function:  it's very simple.

Duncan Murdoch

> 
> If so, could this be documented on the help page?
> 
> Thanks,
> Petr



Re: [R] cleanup/replacing a value on condition of another value

2021-10-25 Thread Avi Gross via R-help
I wonder why it is not as simple as:

Call mutate on the data and have a condition that looks like:

data %>% mutate(cases = ifelse(multiple_cond, NA, cases)) -> output
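Spelled out with the poster's conditions (column names taken from the tibble
quoted below; a sketch, assuming the date column is of class Date):

```r
library(dplyr)

output <- data %>%
  mutate(cases = ifelse(country == "Namibia" &
                          date == as.Date("2021-10-23") &
                          type == "confirmed" &
                          cases == 357,
                        NA, cases))
```

Only rows matching all four conditions are set to NA, so once the upstream
dataset is corrected the condition simply stops matching.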

-Original Message-
From: R-help  On Behalf Of Dr Eberhard W Lisse
Sent: Monday, October 25, 2021 11:49 AM
To: r-help@r-project.org
Subject: Re: [R] cleanup/replacing a value on condition of another value

Rui,

that works for me too, but is not what I need to do.

I want to make the 'cases' value for this particular country AND this 
particular date AND this particular type AND this particular value (ie ALL 
conditions must be fulfilled) become NA so that the tibble would change from

[...]
2 Namibia 2021-10-24 death 4
3 Namibia 2021-10-23 confirmed   357
4 Namibia 2021-10-23 death 1
[...]

to

[...]
2 Namibia 2021-10-24 death 4
3 Namibia 2021-10-23 confirmed    NA
4 Namibia 2021-10-23 death 1
[...]

as long as they don't fix the dataset, and if/when they do it goes to the 
expected 23 value :-)-O

greetings, el

On 2021-10-25 17:26 , Rui Barradas wrote:
 > Hello,
 >
 > The following works with me.
 >
 >
 > library(coronavirus)
 > library(dplyr)
 >
 > data(coronavirus, package = "coronavirus")
 > #update_dataset(silence = FALSE)
 >
 > coronavirus %>%
 >select(country, date, type, cases) %>%
 >filter(
 >  country == 'Namibia',
 >  date == '2021-10-23',
 >  cases == 357
 >)
 >
 >
 >
 > Can you post the pipe code you are running?
 >
 > Hope this helps,
 >
 > Rui Barradas
 >
 > Às 12:25 de 25/10/21, Dr Eberhard W Lisse escreveu:
 >> Hi,
 >>
 >> I have data from JHU via the 'coronavirus' package which has a value for
 >> the confirmed cases for 2021-10-23 which differs drastically (357) from
 >> what is reported in country (23).
 >>
 >>  # A tibble: 962 × 4
 >>country date   type  cases
 >> 
 >>  1 Namibia 2021-10-24 confirmed23
 >>  2 Namibia 2021-10-24 death 4
 >>  3 Namibia 2021-10-23 confirmed   357
 >>  4 Namibia 2021-10-23 death 1
 >>  5 Namibia 2021-10-22 confirmed30
 >>  6 Namibia 2021-10-22 death 1
 >>  # … with 956 more rows
 >>
 >> I am using a '%>%' pipeline and am struggling to mutate 'cases' to NA
 >> using something like
 >>
 >>  country == 'Namibia' & date == '2021-10-23' & cases == 357
 >>
 >> so that if or when the data-set is corrected I don't have to change the
 >> code (immediately), even after some googling.
 >>
 >> I can do
 >>
 >>  cases == 357
 >>
 >> only, but that could find other cases as well, which is obviously not
 >> the thing to do
 >>
 >> Any suggestions?
 >>
 >> greetings, el
 >>
 >



Re: [R] Need help in R

2021-10-26 Thread Avi Gross via R-help
There can be people doing homework for a course and as noted, the normal
expectation is to use the resources provided including classroom instruction
(or the often ZOOM or recordings) as well as textbooks.

Forums like this are not a substitute and some nice people will sometimes
volunteer not to do homework but help someone a bit such as asking them to
think about the problem, what data structures or loops might be needed OR
help them understand an error message or after seeing their attempts, point
out a subtle flaw.

There are some who are teaching themselves and are not being graded but are
trying little project to learn. But again, it is best they learn for
themselves and not be handed an answer.

Some mailing lists have rules and perhaps might answer someone more if it is
WORK question like how to fine tune a graph after they have most of it
working.

So, I am looking at the follow-up below with a jaundiced eye as I just saw
something similar enough being asked on a Python board. 

The first question strikes me as odd because it contains a completely
un-necessary looking part. 

> > ### Question 1
> > Create a variable containing a sequence of numbers from 1 to 100:
> >
> > Iterate over the variables and print those numbers which are prime.

You do not really need to create a sequence and then again loop over the
same sequence. The following code is shown:

n= seq(1,100)

for (j in n:100) {

Well, the loop could have been written using n directly or using 1:100, but
instead uses nonsense. Writing n:100 requires n to be a single integer, not a
whole sequence or vector; R only uses its first element and warns about the
rest.

The rest gets worse and worse with oddities like using just letters of the
alphabet without meanings as variables, using integers like "f = 1" rather
than Booleans for such flags, and using the "=" operator rather than the
more accepted "<-" operator. The loop uses variable j then ignores it and
keeps manipulating i. 

Someone wanting help might want to let people know what the algorithm is
supposed to do. 

I won't try to guess and certainly won't supply a valid solution here!

-Original Message-
From: R-help  On Behalf Of Anas Jamshed
Sent: Tuesday, October 26, 2021 4:39 PM
To: Rolf Turner 
Cc: R-help Mailing List 
Subject: Re: [R] Need help in R

It's not homework. Basically I want an easy solution.
I am trying this for the first problem:

n= seq(1,100)

for (j in n:100) {
  f = 1
  i = 2
  n = j
  while (i <= n / 2) {
if (n %% i == 0) {
  f = 0
  break
}
i = i + 1
  }
  if (f == 1) {
print(paste("Number is prime :", n))
  }
}

On Wed, Oct 27, 2021 at 1:35 AM Rolf Turner  wrote:

>
> On Wed, 27 Oct 2021 01:09:50 +0500
> Anas Jamshed  wrote:
>
> > I need help to these questions
> >
> > ### Question 1
> > Create a variable containing a sequence of numbers from 1 to 100:
> >
> > Iterate over the variables and print those numbers which are prime.
> >
> >
> > ### Question 2
> > Create a matrix of size 3x3 called mat_1:
> >
> >  Iterate over all the values one by one and print the element as 
> > well as the position in the matrix (row, col)
>
> You really should do your own homework.
>
> cheers,
>
> Rolf Turner
>
> --
> Honorary Research Fellow
> Department of Statistics
> University of Auckland
> Phone: +64-9-373-7599 ext. 88276
>
>



Re: [R] how to include if conditions in dplyr filter function

2021-10-26 Thread Avi Gross via R-help
The error below was fairly clear. The R 'if" statement is not vectorized and
takes a single logical argument. It is not normally used in a pipeline
unless at that point the data has been reduced to a vector of length 1.

I do not want to look at your code further without the data behind it, but I
suggest there is a vectorized ifelse() function that might fit your needs.
Since you call filter() so many times in there, maybe this should not be one
long pipeline.
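One vectorized reading of the intent, using dplyr's case_when() inside a
single filter() instead of a bare if (column names and thresholds taken from
the poster's code; this is a sketch of the approach, not a tested fix):

```r
library(dplyr)

# case_when() evaluates its conditions row by row, in order, which mirrors
# the if / else if / else structure the poster was reaching for.
outlier_tcga_MAD3 <- outlier_tcga %>%
  filter(n_two > 0) %>%
  mutate(freqMAD3_gain2ratio = N_MAD3_gain2 / n_two) %>%
  filter(case_when(
    N_MAD3 < 9         ~ freqMAD3_gain >= 1,
    N_MAD3 > n_two * 2 ~ freqMAD3_gain >= 0.8 & freqMAD3_gain2ratio >= 0.33,
    TRUE               ~ freqMAD3_gain2 >= 0.3
  )) %>%
  arrange(desc(N_MAD3))
```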



-Original Message-
From: R-help  On Behalf Of Yuan Chun Ding via
R-help
Sent: Tuesday, October 26, 2021 6:39 PM
To: r-help@r-project.org
Subject: [R] how to include if conditions in dplyr filter function

Hi R users,

I thought the follow R code should work, but I got error, Can you fix my
code?

Thank you,

Ding


outlier_tcga_MAD3 <- outlier_tcga %>% filter(n_two >0) %>% 
  mutate(freqMAD3_gain2ratio = N_MAD3_gain2/n_two )%>%
  if (N_MAD3 < 9) {filter(freqMAD3_gain >=1)} else if (N_MAD3 > n_two*2 )
  {filter (freqMAD3_gain >= 0.8 & freqMAD3_gain2ratio >=
0.33)} else 
  {filter(freqMAD3_gain2 >=0.3 )} %>%
  arrange(desc(N_MAD3))


Error in if (.) N_MAD3 < 9 else { : 
  argument is not interpretable as logical In addition: Warning message:
In if (.) N_MAD3 < 9 else { :
  the condition has length > 1 and only the first element will be used




Re: [R] R vs Numpy

2021-10-28 Thread Avi Gross via R-help
I am not sure your overall question fits into this forum but a brief
internet search can find plenty of info.

But in brief, R is a language in which much of what numpy does was built in
from the start, and many things are vectorized. Much of what the Python
pandas library does is also part of native R. There are additional packages
(Python calls them modules) freely available that greatly extend those
capabilities and I doubt there is very much you can do in numpy that cannot
also often easily be done in R.

Realistically, there are several reasons the numpy module is so commonly
used in python. They left something like vectors out of the language. Yes,
they have dictionaries and lists and sets and all kinds of objects. So numpy
was made mostly in C to provide numeric processing of things that are more
like vectors efficiently. In R, everything is a vector as in a simple
variable is just a vector of length one!

I program in both and in other languages as many do. Reasons to choose one
or another vary. Python can do many things easily and with complexity and is
a rather full-blown and complex language with real object-oriented
capabilities and also functional programming. It is interpreted but also has
a way to save partially compiled code. R is pretty much all interpreted,
albeit many things are written in C or C++ or other compiled languages and
stuffed into libraries.

One main reason to choose is programming style but there are TONS of
differences that can bite you such as R sometimes deferring evaluation of
code which can be an advantage or the opposite. But a huge reason I think
that people choose one or the other is the availability of packages that do
much of what they want. Some, for example, love a set of packages they call
the tidyverse and do much of their work largely within it rather than base
R. Many love the graphics package called ggplot.

But over time, I see more and more functionality available within the Python
community that rivals or perhaps exceeds what R offers, such as the machine
learning tools.

I have an interesting solution I sometimes use as you can run programs in R
using a package that allows the same data to be accessed back and forth
between an attached R interpreter and a Python interpreter. So if you want
to use python features like dictionaries and list comprehensions to massage
the data then have R do additional things and perhaps make graphs, you can
get some of both worlds.
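The package I have in mind is reticulate; here is a minimal sketch, assuming reticulate and a working Python installation are available:

```r
library(reticulate)

# Build a dict with a list comprehension on the Python side ...
py_run_string("squares = {str(n): n * n for n in range(1, 4)}")

# ... and read it back in R, where it arrives as a named list
py$squares          # named list: $`1` = 1, $`2` = 4, $`3` = 9

# Data can go the other way as well: an R vector becomes a Python list
py$v <- c(2, 4, 6)
py_run_string("total = sum(v)")
py$total            # 12
```

The `py` object is reticulate's bridge to the attached Python interpreter; assignments and reads through it convert the data structures automatically.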

As noted, a detailed answer is well beyond the scope of this list. R has
packages that probably let you add things, and it has arguably too many
object-oriented subsystems, most of them not complete.

Good Luck,

Avi

-Original Message-
From: R-help  On Behalf Of Catherine Walt
Sent: Thursday, October 28, 2021 2:57 AM
To: r-help@r-project.org
Subject: [R] R vs Numpy

Hello members,

I am familiar with python's Numpy.
Now I am looking into R language.
What is the main difference between these two languages? including
advantages or disadvantages.

Thanks.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] customize the step value

2021-10-29 Thread Avi Gross via R-help
As others have replied, the customary way is to use the seq() function, which
takes additional arguments besides from= and to=, such as by= to specify the
step size, and two others that are sometimes handy: length.out= and along.with=.

In your case seq(from=1.5, to=3.5, by=0.5) works as well as the shorter
positional version of seq(1.5, 3.5, 0.5)
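For completeness, a small sketch of the other two arguments mentioned above producing the same sequence:

```r
# length.out= asks for a given number of evenly spaced values
seq(from = 1.5, to = 3.5, length.out = 5)   # 1.5 2.0 2.5 3.0 3.5

# along.with= takes the length from another object
x <- c("a", "b", "c", "d", "e")
seq(from = 1.5, to = 3.5, along.with = x)   # 1.5 2.0 2.5 3.0 3.5
```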

But as others have noted, certain calculations with floating point
arithmetic in pretty much any language can be imprecise in the final bits. I
doubt it matters for you but there are ways to do a comparison that allows
for a little leeway and still tests equal. 
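For instance, one common sketch of such a leeway-based comparison in base R:

```r
# Exact comparison can fail for values built up by floating point arithmetic
0.1 + 0.2 == 0.3                    # FALSE in most languages, including R

# all.equal() compares with a small numeric tolerance
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE

# Or test against an explicit tolerance yourself
abs((0.1 + 0.2) - 0.3) < 1e-9       # TRUE
```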

The suggestion others have made is a good choice too, especially when you
know exactly what you need in advance and can adjust:

> seq(1.5, 3.5, 0.5)
[1] 1.5 2.0 2.5 3.0 3.5
> seq(3, 7) / 2
[1] 1.5 2.0 2.5 3.0 3.5
> 0.5*(3:7)
[1] 1.5 2.0 2.5 3.0 3.5


If you do this often and for larger vectors and efficiency matters, consider
using seq.int() in the latter cases as it is much faster when working on
just integers. 

-Original Message-
From: R-help  On Behalf Of Catherine Walt
Sent: Friday, October 29, 2021 3:06 AM
To: R mailing list 
Subject: [R] customize the step value

dear members,

Sorry I am newbie on R.
as we saw below:

> 1.5:3.5
[1] 1.5 2.5 3.5

How can I make the step to 0.5?
I want the result:

1.5 2.0 2.5 3.0 3.5

Thanks.
Cathy

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Probably off topic but I hope amusing

2021-10-29 Thread Avi Gross via R-help
Bert,

R is used all over the place, sometimes not visibly.

A search shows the NY times using it in 2011, 2009, ...:

https://www.nytimes.com/2009/01/07/technology/business-computing/07program.html

https://blog.revolutionanalytics.com/2011/03/how-the-new-york-times-uses-r-for-data-visualization.html

There also seem to be several packages for interfacing with the NY Times,
albeit that does not mean much about their usage.

However, the error message using the phrase "NaN" is not a guarantee, as
there are other languages that use the concept, albeit they may not
capitalize it the same way. And in an error message, any programmer can be
setting up the text. According to this reference, Rust and ECMAScript also
call it a NaN:

https://en.wikipedia.org/wiki/NaN

I am a tad confused that it lists a form of "NaN%" without specifying whether
any language specifically uses it, and your example ended with a percent sign.
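For what it is worth, R itself produces the same token, so a sketch of how such a display could arise:

```r
0 / 0                     # NaN: 0/0 is undefined in IEEE floating point
is.nan(0 / 0)             # TRUE
paste0(0 / 0 * 100, "%")  # "NaN%", much like the display described in the thread
```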


-Original Message-
From: R-help  On Behalf Of Bert Gunter
Sent: Friday, October 29, 2021 11:36 AM
To: R-help 
Subject: [R] Probably off topic but I hope amusing

There was a little discussion today (yet again) about floating point
arithmetic. Perhaps related to this, I subscribe to the online NYTimes,
which flashes U.S. stock index prices at the top of its home page. Today,
instead of the Nasdaq price being flashed, there was this:

undefined-NaN%

I wonder if this means that R is being used as a backend for this or whether
this way of displaying what I think is 0/0 in FP is common.

Anyway, what do you think most readers reaction to this was?!

Best to all,
Bert

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] by group

2021-11-01 Thread Avi Gross via R-help
This is a fairly simple request and well covered by introductory reading
material.

A decent example was given and I see Andrew provided a base R reply that
should be sufficient. But I do not think he realized you wanted something
different so his answer is not in the format you wanted:

> tapply(dat$wt, dat$Year, mean)  # mean by Year
2001 2002 2003 
13.5 14.8 13.5 
> tapply(dat$wt, dat$Sex, mean)   # mean by Sex
   F    M 
12.4 15.4 
> tapply(dat$wt, list(dat$Year, dat$Sex), mean)  # mean by Year and Sex

I personally often prefer the tidyverse approach, which optionally includes
pipes and allows a data frame to be grouped any way you want and then
followed by commands. It is easier to output your result this way by grouping
BOTH by Year and Sex at once and getting multiple lines of output. Note the
code below requires running install.packages("tidyverse") once.

library(tidyverse)
dat <- read.table(
  text = "Year Sex wt
2001 M 15
2001 M 14
2001 M 16
2001 F 12
2001 F 11
2001 F 13
2002 M 14
2002 M 18
2002 M 17
2002 F 11
2002 F 15
2002 F 14
2003 M 18
2003 M 13
2003 M 14
2003 F 15
2003 F 10
2003 F 11  ",
  header = TRUE
)

dat %>%
  group_by(Year, Sex) %>%
  summarize( M = mean(wt, na.rm=TRUE))

The output of the above is the rows below:

> dat %>%
  +   group_by(Year, Sex) %>%
  +   summarize( M = mean(wt, na.rm=TRUE))
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
# A tibble: 6 x 3
# Groups:   Year [3]
   Year Sex       M
  <int> <chr> <dbl>
1  2001 F      12  
2  2001 M      15  
3  2002 F      13.3
4  2002 M      16.3
5  2003 F      12  
6  2003 M      15  

Note Male and Female have their own rows. It is not that hard to switch it
to your format by rearranging the intermediate data set with pivot_wider()
in the pipeline asking to make multiple new columns from variable Sex and
populating them from the created variable M. The new complete pipeline is
now:

dat %>%
  group_by(Year, Sex) %>%
  summarize( M = mean(wt, na.rm=TRUE)) %>%
  pivot_wider(names_from = Sex, values_from = M)

The output as a tibble is:

   Year     F     M
  <int> <dbl> <dbl>
1  2001  12    15  
2  2002  13.3  16.3
3  2003  12    15  

Or as a data.frame which seems to add zeroes:

> dat %>%
  +   group_by(Year, Sex) %>%
  +   summarize( M = mean(wt, na.rm=TRUE)) %>%
  +   pivot_wider(names_from = Sex, values_from = M) %>%
  +   as.data.frame
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
  Year    F    M
1 2001 12.0 15.0
2 2002 13.3 16.3
3 2003 12.0 15.0

Your desired output shows two decimal places (13.33 and 16.33), but if you
are happy with a single digit after the decimal point, ask for it to be
rounded:

> dat %>%
  +   group_by(Year, Sex) %>%
  +   summarize( M = mean(wt, na.rm=TRUE)) %>%
  +   pivot_wider(names_from = Sex, values_from = M) %>%
  +   as.data.frame %>%
  +   round(1)
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
  Year    F    M
1 2001 12.0 15.0
2 2002 13.3 16.3
3 2003 12.0 15.0

And, yes, any of the above can be done in various ways using plain old base
R, especially in the recent versions that have added a somewhat different way
to do pipelines (the native |> pipe).
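For example, a base R sketch of the same wide table using aggregate() and reshape() (the intermediate variable names are mine, purely illustrative):

```r
# Long format: one row of mean wt per Year/Sex combination
agg <- aggregate(wt ~ Year + Sex, data = dat, FUN = mean)

# Spread Sex into columns; reshape() names them wt.F and wt.M
wide <- reshape(agg, idvar = "Year", timevar = "Sex", direction = "wide")

# Tidy the column names and round the means to one decimal
names(wide) <- sub("^wt\\.", "", names(wide))
wide[-1] <- round(wide[-1], 1)
wide
```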





-Original Message-
From: R-help  On Behalf Of Val
Sent: Monday, November 1, 2021 5:08 PM
To: r-help@R-project.org (r-help@r-project.org) 
Subject: [R] by group

Hi All,

How can I generate mean by group. The sample data looks like as follow,
dat<-read.table(text="Year Sex wt
2001 M 15
2001 M 14
2001 M 16
2001 F 12
2001 F 11
2001 F 13
2002 M 14
2002 M 18
2002 M 17
2002 F 11
2002 F 15
2002 F 14
2003 M 18
2003 M 13
2003 M 14
2003 F 15
2003 F 10
2003 F 11  ",header=TRUE)

The desired  output  is,
         M       F
2001    15      12
2002    16.33   13.33
2003    15      12

Thank you,

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] by group

2021-11-01 Thread Avi Gross via R-help
Jim,

Your code gives the output in quite a different format and as an object of
class "by" that is not easily convertible to a data.frame. So, yes, it is an
answer that produces the right numbers but not in the places or data
structures I think they (or if it is HW ...) wanted.

Trivial standard cases are often handled by a single step but more complex
ones often suggest a multi-part approach.

Of course Val gets to decide what approach works best for them within
whatever constraints we here are not made aware of. If this is a class
assignment, it likely would be using only tools discussed in the class. So I
would not suggest using a dplyr/tidyverse approach if that is not covered or
even part of a class. If this is a project in the real world, it becomes a
matter of programming taste and convenience and so on.

Maybe Val can share more about the situation so we can see what is helpful
and what is not. Realistically, I can think of way too many ways to get the
required output.

-Original Message-
From: R-help  On Behalf Of Jim Lemon
Sent: Monday, November 1, 2021 6:25 PM
To: Val ; r-help mailing list 
Subject: Re: [R] by group

Hi Val,
I think you answered your own question:

by(dat$wt,dat[,c("Sex","Year")],mean)

Jim

On Tue, Nov 2, 2021 at 8:09 AM Val  wrote:
>
> Hi All,
>
> How can I generate mean by group. The sample data looks like as 
> follow, dat<-read.table(text="Year Sex wt
> 2001 M 15
> 2001 M 14
> 2001 M 16
> 2001 F 12
> 2001 F 11
> 2001 F 13
> 2002 M 14
> 2002 M 18
> 2002 M 17
> 2002 F 11
> 2002 F 15
> 2002 F 14
> 2003 M 18
> 2003 M 13
> 2003 M 14
> 2003 F 15
> 2003 F 10
> 2003 F 11  ",header=TRUE)
>
> The desired  output  is,
> >          M       F
> > 2001    15      12
> > 2002    16.33   13.33
> > 2003    15      12
>
> Thank you,
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] by group

2021-11-01 Thread Avi Gross via R-help
Understood, Val. So you need to save the output in something like a data.frame, 
which can then be saved as a CSV file or whatever else makes sense to be read 
in by a later program. As noted, by() does not produce the output in a usable way.

But you mentioned efficiency, and that is another whole ball of wax. For small 
amounts of data it may not matter much. And some processes may look slower but 
turn out to be more efficient if compiled as C/C++ or ...

Sometimes it might be more efficient to change the format of your data before 
the analysis, albeit if the output is much smaller, it may be best to do it later.

Good luck.

-Original Message-
From: Val  
Sent: Monday, November 1, 2021 7:44 PM
To: Avi Gross 
Cc: r-help mailing list 
Subject: Re: [R] by group

Thank you all!
I can assure you that this is not  HW. This is a sample of my large data set 
and I want a simple  and efficient approach to get the
desired  output   in that particular format.  That file will be saved
and used  as an input file for another external process.

val







On Mon, Nov 1, 2021 at 6:08 PM Avi Gross via R-help  
wrote:
>
> Jim,
>
> Your code gives the output in quite a different format and as an 
> object of class "by" that is not easily convertible to a data.frame. 
> So, yes, it is an answer that produces the right numbers but not in 
> the places or data structures I think they (or if it is HW ...) wanted.
>
> Trivial standard cases are often handled by a single step but more 
> complex ones often suggest a multi-part approach.
>
> Of course Val gets to decide what approach works best for them within 
> whatever constraints we here are not made aware of. If this is a class 
> assignment, it likely would be using only tools discussed in the 
> class. So I would not suggest using a dplyr/tidyverse approach if that 
> is not covered or even part of a class. If this is a project in the 
> real world, it becomes a matter of programming taste and convenience and so 
> on.
>
> Maybe Val can share more about the situation so we can see what is 
> helpful and what is not. Realistically, I can think of way too many 
> ways to get the required output.
>
> -Original Message-
> From: R-help  On Behalf Of Jim Lemon
> Sent: Monday, November 1, 2021 6:25 PM
> To: Val ; r-help mailing list 
> 
> Subject: Re: [R] by group
>
> Hi Val,
> I think you answered your own question:
>
> by(dat$wt,dat[,c("Sex","Year")],mean)
>
> Jim
>
> On Tue, Nov 2, 2021 at 8:09 AM Val  wrote:
> >
> > Hi All,
> >
> > How can I generate mean by group. The sample data looks like as 
> > follow, dat<-read.table(text="Year Sex wt
> > 2001 M 15
> > 2001 M 14
> > 2001 M 16
> > 2001 F 12
> > 2001 F 11
> > 2001 F 13
> > 2002 M 14
> > 2002 M 18
> > 2002 M 17
> > 2002 F 11
> > 2002 F 15
> > 2002 F 14
> > 2003 M 18
> > 2003 M 13
> > 2003 M 14
> > 2003 F 15
> > 2003 F 10
> > 2003 F 11  ",header=TRUE)
> >
> > The desired  output  is,
> >          M       F
> > 2001    15      12
> > 2002    16.33   13.33
> > 2003    15      12
> >
> > Thank you,
> >
> > __
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] by group

2021-11-01 Thread Avi Gross via R-help
I sent Val a longer reply but for anyone else here, please note Val was copying 
my OUTPUT with continuation lines starting with plus so of course that has 
issues! 

The naked code to use was this:

dat %>%
  group_by(Year, Sex) %>%
  summarize( M = mean(wt, na.rm=TRUE)) %>%
  pivot_wider(names_from = Sex, values_from = M) %>%
  as.data.frame %>%
  round(1)


-Original Message-
From: Val  
Sent: Monday, November 1, 2021 8:15 PM
To: Avi Gross 
Cc: r-help mailing list 
Subject: Re: [R] by group

Thank you Avi,

One question, I am getting this  error from this script

> dat %>%
+   +   group_by(Year, Sex) %>%
+   +   summarize( M = mean(wt, na.rm=TRUE)) %>%
+   +   pivot_wider(names_from = Sex, values_from = M) %>%
+   +   as.data.frame %>%
+   +   round(1)
Error in group_by(Year, Sex) : object 'Year' not found
Why am I getting this?



On Mon, Nov 1, 2021 at 7:07 PM Avi Gross via R-help  
wrote:
>
> Understood Val. So you need to save the output in something like a data.frame 
> which can then be saved as a CSV file or whatever else makes sense to be read 
> in by a later program. As note by() does not produce the output in a usable 
> way.
>
> But you mentioned efficient, and that is another whole ball of wax. For small 
> amounts of data it may not matter much. And some processes may look slower 
> but turn out be more efficient if compiled as C/C++ or ...
>
> Sometimes it might be more efficient to change the format of your data before 
> the analysis, albeit if the output is much smaller, maybe best later.
>
> Good luck.
>
> -Original Message-
> From: Val 
> Sent: Monday, November 1, 2021 7:44 PM
> To: Avi Gross 
> Cc: r-help mailing list 
> Subject: Re: [R] by group
>
> Thank you all!
> I can assure you that this is not  HW. This is a sample of my large data set 
> and I want a simple  and efficient approach to get the
> desired  output   in that particular format.  That file will be saved
> and used  as an input file for another external process.
>
> val
>
>
>
>
>
>
>
> On Mon, Nov 1, 2021 at 6:08 PM Avi Gross via R-help  
> wrote:
> >
> > Jim,
> >
> > Your code gives the output in quite a different format and as an 
> > object of class "by" that is not easily convertible to a data.frame.
> > So, yes, it is an answer that produces the right numbers but not in 
> > the places or data structures I think they (or if it is HW ...) wanted.
> >
> > Trivial standard cases are often handled by a single step but more 
> > complex ones often suggest a multi-part approach.
> >
> > Of course Val gets to decide what approach works best for them 
> > within whatever constraints we here are not made aware of. If this 
> > is a class assignment, it likely would be using only tools discussed 
> > in the class. So I would not suggest using a dplyr/tidyverse 
> > approach if that is not covered or even part of a class. If this is 
> > a project in the real world, it becomes a matter of programming taste and 
> > convenience and so on.
> >
> > Maybe Val can share more about the situation so we can see what is 
> > helpful and what is not. Realistically, I can think of way too many 
> > ways to get the required output.
> >
> > -Original Message-
> > From: R-help  On Behalf Of Jim Lemon
> > Sent: Monday, November 1, 2021 6:25 PM
> > To: Val ; r-help mailing list 
> > 
> > Subject: Re: [R] by group
> >
> > Hi Val,
> > I think you answered your own question:
> >
> > by(dat$wt,dat[,c("Sex","Year")],mean)
> >
> > Jim
> >
> > On Tue, Nov 2, 2021 at 8:09 AM Val  wrote:
> > >
> > > Hi All,
> > >
> > > How can I generate mean by group. The sample data looks like as 
> > > follow, dat<-read.table(text="Year Sex wt
> > > 2001 M 15
> > > 2001 M 14
> > > 2001 M 16
> > > 2001 F 12
> > > 2001 F 11
> > > 2001 F 13
> > > 2002 M 14
> > > 2002 M 18
> > > 2002 M 17
> > > 2002 F 11
> > > 2002 F 15
> > > 2002 F 14
> > > 2003 M 18
> > > 2003 M 13
> > > 2003 M 14
> > > 2003 F 15
> > > 2003 F 10
> > > 2003 F 11  ",header=TRUE)
> > >
> > > The desired  output  is,
> > >          M       F
> > > 2001    15      12
> > > 2002    16.33   13.33
> > > 2003    15      12
> > >
> > > Thank you,
> > >
> > > __
> > > R-help@r-project.org mailing list -

Re: [R] by group

2021-11-01 Thread Avi Gross via R-help
That works, Bert. I do note the result, d, is a matrix and works because 
everything within it is numeric. I have no problem with that for this problem.

For a more general problem with a data.frame type of object with columns of any 
type, the proper output might better be a data.frame. Of course, in this case, 
the matrix can easily be coerced back into a data.frame:

 

d <- as.data.frame(with(dat, tapply(wt, list(Year, Sex), mean)))

d

 

        F    M

2001 12.0 15.0

2002 13.3 16.3

2003 12.0 15.0

 

str(d)

 

'data.frame':  3 obs. of  2 variables:

 $ F: num  12 13.3 12
 $ M: num  15 16.3 15

 

 

 

From: Bert Gunter  
Sent: Monday, November 1, 2021 8:50 PM
To: Val 
Cc: Avi Gross ; r-help mailing list 
Subject: Re: [R] by group

 

"A decent example was given and I see Andrew provided a base R reply that
should be sufficient. But I do not think he realized you wanted something
different so his answer is not in the format you wanted:"

 

Yes, but it is trivial to modify Andrew's suggestion to get almost exactly what 
Val has requested (one merely needs to read ?tapply carefully) without having 
to resort to tidyverse gymnastics:

 

> d <- with(dat, tapply(wt, list(Year, Sex), mean))
> d
        F    M
2001 12.0 15.0
2002 13.3 16.3
2003 12.0 15.0

 

This is a matrix, not a data.frame. write() or write.table() can be used to 
write it to a file in whatever format is desired (e.g., with or without 
row/column names, etc.).

 

Cheers,

Bert

 

 

Bert Gunter

"The trouble with having an open mind is that people keep coming along and 
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

 

 

On Mon, Nov 1, 2021 at 4:44 PM Val <valkr...@gmail.com> wrote:

Thank you all!
I can assure you that this is not  HW. This is a sample of my large
data set and I want a simple  and efficient approach to get the
desired  output   in that particular format.  That file will be saved
and used  as an input file for another external process.

val







On Mon, Nov 1, 2021 at 6:08 PM Avi Gross via R-help
<r-help@r-project.org> wrote:
>
> Jim,
>
> Your code gives the output in quite a different format and as an object of
> class "by" that is not easily convertible to a data.frame. So, yes, it is an
> answer that produces the right numbers but not in the places or data
> structures I think they (or if it is HW ...) wanted.
>
> Trivial standard cases are often handled by a single step but more complex
> ones often suggest a multi-part approach.
>
> Of course Val gets to decide what approach works best for them within
> whatever constraints we here are not made aware of. If this is a class
> assignment, it likely would be using only tools discussed in the class. So I
> would not suggest using a dplyr/tidyverse approach if that is not covered or
> even part of a class. If this is a project in the real world, it becomes a
> matter of programming taste and convenience and so on.
>
> Maybe Val can share more about the situation so we can see what is helpful
> and what is not. Realistically, I can think of way too many ways to get the
> required output.
>
> -Original Message-
> From: R-help <r-help-boun...@r-project.org> On Behalf Of Jim Lemon
> Sent: Monday, November 1, 2021 6:25 PM
> To: Val <valkr...@gmail.com>; r-help mailing list <r-help@r-project.org>
> Subject: Re: [R] by group
>
> Hi Val,
> I think you answered your own question:
>
> by(dat$wt,dat[,c("Sex","Year")],mean)
>
> Jim
>
> On Tue, Nov 2, 2021 at 8:09 AM Val <valkr...@gmail.com> wrote:
> >
> > Hi All,
> >
> > How can I generate mean by group. The sample data looks like as
> > follow, dat<-read.table(text="Year Sex wt
> > 2001 M 15
> > 2001 M 14
> > 2001 M 16
> > 2001 F 12
> > 2001 F 11
> > 2001 F 13
> > 2002 M 14
> > 2002 M 18
> > 2002 M 17
> > 2002 F 11
> > 2002 F 15
> > 2002 F 14
> > 2003 M 18
> > 2003 M 13
> > 2003 M 14
> > 2003 F 15
> > 2003 F 10
> > 2003 F 11  ",header=TRUE)
> >
> > The desired  output  is,
> >          M       F
> > 2001    15      12
> > 2002    16.33   13.33
> > 2003    15      12
> >
> > Thank you,
> >
> > __
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-gu

Re: [R] Is there a hash data structure for R

2021-11-02 Thread Avi Gross via R-help
I have several things I considered about this topic.

It is, in general, not possible to do some things in one language or another
even if you find a bridge. Python lets you place all kinds of things into a
dictionary including many static objects like tuples or even other
dictionaries. What is allowed for keys is quite broad. If you use an R
environment or list, there are restrictions on what names are allowed that
are not necessarily the same. 

Now on to key uniqueness. Yes, R allows multiple named entries to share the
same name. But when I made such a structure, the FIRST instance hides any
later ones when accessing something like list.name$A or setting it. If you
remove an existing entry by something like setting the above to NULL, though,
the second instance of that name now becomes the first. So anyone wanting to
fully remove all instances might need to loop until sure all are gone. Not
sure about environments but they may behave better.
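A short sketch of the list behavior described above:

```r
x <- list(a = 1, b = 2, a = 3)   # a duplicate name "a" is allowed
x$a            # 1: the FIRST "a" hides the later one
x$a <- NULL    # removes only that first instance
x$a            # 3: the second "a" has now become the first
```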

Third, multiple parties have built R packages to support hashing including
some no longer available:

https://cran.r-project.org/web/packages/hash/hash.pdf
https://www.rdocumentation.org/packages/Dict/versions/0.10.0

So  if one of those works, why reinvent it?

In any case, if you roll your own, as has been shown by others, you may have
to provide getter and setter functionality and so on to make sure of the
things you want for compatibility, and be careful, as any other programs that
can play with your data may go around you.
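As a sketch of what such roll-your-own getters and setters might look like on top of an environment (the hash_* names are mine, purely illustrative):

```r
# A tiny hash table built on an environment; keys are coerced to character
hash_new  <- function() new.env(parent = emptyenv())
hash_set  <- function(h, key, value) assign(as.character(key), value, envir = h)
hash_get  <- function(h, key, default = NULL) {
  key <- as.character(key)
  if (exists(key, envir = h, inherits = FALSE)) get(key, envir = h) else default
}
hash_del  <- function(h, key) rm(list = as.character(key), envir = h)
hash_keys <- function(h) ls(h, all.names = TRUE)

h <- hash_new()
hash_set(h, "apple", 5)
hash_get(h, "apple")              # 5
hash_get(h, "pear", default = 0)  # 0: missing key falls back to the default
```

Using parent = emptyenv() avoids the lookup-in-parent-environment surprise that Andrew mentions below.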

Finally, someone mentioned how creating a data.frame with duplicate names
for columns is not a problem as it can automagically CHANGE them to be
unique. That is a HUGE problem for using that as a dictionary as the new
name will not be known to the system so all kinds of things will fail.

And there are also packages for many features like sets as well as functions
to manipulate these things.

-Original Message-
From: R-help  On Behalf Of Bill Dunlap
Sent: Tuesday, November 2, 2021 1:26 PM
To: Andrew Simmons 
Cc: R Help 
Subject: Re: [R] Is there a hash data structure for R

Note that an environment carries a hash table with it, while a named list
does not.  I think that looking up an entry in a list causes a hash table to
be created and thrown away.  Here are some timings involving setting and
getting various numbers of entries in environments and lists.  The times are
roughly linear in n for environments and quadratic for lists.

> vapply(1e3 * 2 ^ (0:6), f, L=new.env(parent=emptyenv()),
FUN.VALUE=NA_real_)
[1] 0.00 0.00 0.00 0.02 0.03 0.06 0.15
> vapply(1e3 * 2 ^ (0:6), f, L=list(), FUN.VALUE=NA_real_)
[1]  0.01  0.03  0.15  0.53  2.66 13.66 56.05
> f
function(n, L, V = sprintf("V%07d", sample(n, replace=TRUE))) {
system.time(for(v in V)L[[v]]<-c(L[[v]],v))["elapsed"] }

Note that environments do not allow an element named "" (the empty string).

Elements named NA_character_ are treated differently in environments and
lists, neither of which is great.  You may want your hash table functions to
deal with oddball names explicitly.

-Bill

On Tue, Nov 2, 2021 at 8:52 AM Andrew Simmons  wrote:

> If you're thinking about using environments, I would suggest you 
> initialize them like
>
>
> x <- new.env(parent = emptyenv())
>
>
> Since environments have parent environments, it means that requesting 
> a value from that environment can actually return the value stored in 
> a parent environment (this isn't an issue for [[ or $, this is 
> exclusively an issue with assign, get, and exists) Or, if you've 
> already got your values stored in a list that you want to turn into an 
> environment:
>
>
> x <- list2env(listOfValues, parent = emptyenv())
>
>
> Hope this helps!
>
>
> On Tue, Nov 2, 2021, 06:49 Yonghua Peng  wrote:
>
> > But for data.frame the colnames can be duplicated. Am I right?
> >
> > Regards.
> >
> > On Tue, Nov 2, 2021 at 6:29 PM Jan van der Laan 
> wrote:
> >
> > >
> > > True, but in a lot of cases where a python user might use a dict 
> > > an R user will probably use a list; or when we are talking about 
> > > arrays of dicts in python, the R solution will probably be a 
> > > data.frame (with
> each
> > > dict field in a separate column).
> > >
> > > Jan
> > >
> > >
> > >
> > >
> > > On 02-11-2021 11:18, Eric Berger wrote:
> > > > One choice is
> > > > new.env(hash=TRUE)
> > > > in the base package
> > > >
> > > >
> > > >
> > > > On Tue, Nov 2, 2021 at 11:48 AM Yonghua Peng  wrote:
> > > >
> > > >> I know this is a newbie question. But how do I implement the 
> > > >> hash
> > > structure
> > > >> which is available in other languages (in python it's dict)?
> > > >>
> > > >> I know there is the list, but list's names can be duplicated here.
> > > >>
> > > >>> x <- list(x=1:5,y=month.name,x=3:7)
> > > >>
> > > >>> x
> > > >>
> > > >> $x
> > > >>
> > > >> [1] 1 2 3 4 5
> > > >>
> > > >>
> > > >> $y
> > > >>
> > > >>   [1] "January"   "February"  "March" "April" "May"
> >  "June"
> > > >>
> > > >>   [7] "July"   

Re: [R] Is there a hash data structure for R

2021-11-03 Thread Avi Gross via R-help
Jack, I was agreeing with you and pointing out that although changing names
of columns to be unique has a positive side, it makes the result hard to use
for anything that needs to look like a set or a bag and, of course, a
dictionary/hash. All of the above want to put things in using some identifier
and expect to get back the same thing.

R actually has other places names can be changed or dynamically created
often with defaults. This can be convenient or annoying as you need to look
at the resulting names to be able to use them. A feature that is nice in
some programs and parts of the tidyverse packages is the ability to specify
things like suffixes to be used.

I use lots of computer languages and keep running into people who expect
another language to just support what the first does. If that were true, we
might not create so many. If your other language supports indefinite sized
integers, good for you. Many others do not and perhaps may not easily do so
even if you try to create your own emulation if some other code gets your
version and does not work on it.

So assuming you use an available package or roll your own and can now make
some kind of hash data structure. As you pointed out, hashing may not be
required if your implementation is already fast and hashing can use lots of
memory. What are the allowed keys in your implementation? Will an integer of
1 be distinct from a floating point of 1.0? Can you hash objects of the
half-dozen or so kinds R seems to have? Whatever your answers are to many
questions like these, you may not get quite the same answers in Python or
PERL or ...
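
For what it's worth, base R environments can serve as a hash table, with the
caveat that keys must be character strings (so a numeric key has to be
converted, and 1 and 1.0 collapse to the same key "1"). A minimal sketch:

```r
# An environment used as a hash table; keys are always character strings.
h <- new.env(hash = TRUE)
assign("alpha", 42, envir = h)
assign(as.character(1), "one", envir = h)   # numeric key coerced to "1"

get("alpha", envir = h)      # 42
exists("beta", envir = h)    # FALSE
ls(h)                        # list the current keys
```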

R is not really meant to be a general-purpose language although it can be.
It is not really full-blown object-oriented, but in many ways it can be
using newer grafts. So if you need or want the things R is designed for and
in particular there are good packages available for your needs, then use R.
If your needs include things not easily part of R, and you have a language
that works for you, why switch?

I will note that there is an intermediate path. I often run programs in
RStudio that include code for both R and Python. The data structures used
sort of convert back and forth as needed so you can begin in R and read in
data and make some changes and then hand it over to Python for more, perhaps
multiple times, then generate a graph within R or whatever.

So if you want a dictionary, you can sort of keep it on the Python side and
use Python commands to create it and add to it or access contents. The
results may be handed over to the R side as needed but not as a dict but
instead to a pairlist or named list/vector or whatever and when needed, you
can have python take your results. The same applies to lots of other things
Python does that R may not do quite the same or at all. I mean generators
and all kinds of object-oriented programs including multiple inheritance and
so much more. You can use each language for what it is good at or that work
with the way you think and with some overhead get the best of both worlds.

Of course, this comes with costs and any programs you send out to be used by
others would require both languages to be installed properly and ...
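
As a sketch of that intermediate path, assuming the reticulate package is
installed with a working Python, you can keep a real dict on the Python side
and use it from R:

```r
# Sketch using reticulate (assumed installed with a working Python):
# build a Python dict and hand its contents back to R when needed.
library(reticulate)
d <- py_dict(keys = list("a", "b"), values = list(1L, 2L))
d["a"]        # look up a key from the R side
py_to_r(d)    # convert the dict contents back to a named R list
```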

-Original Message-
From: R-help  On Behalf Of Jan van der Laan
Sent: Wednesday, November 3, 2021 5:47 AM
To: r-help@r-project.org
Subject: Re: [R] Is there a hash data structure for R



On 03-11-2021 00:42, Avi Gross via R-help wrote:

> 
> Finally, someone mentioned how creating a data.frame with duplicate 
> names for columns is not a problem as it can automagically CHANGE them 
> to be unique. That is a HUGE problem for using that as a dictionary as 
> the new name will not be known to the system so all kinds of things will
fail.

I think you are referring to my remark which was:

> However, the data.frame construction method will detect this and
> generate unique names (which also might not be what you want):

I didn't say this means that duplicate names aren't a problem; I just
mentioned that the behaviour is different. Personally, I would actually
prefer the behaviour of list (keep the duplicated name) with a warning.

Most of the responses seem to assume that the OP actually wants a hash
table. Yes, he did ask for that and for a hash table an environment (with
some work) would be a good option. But in many cases, where other languages
would use a hash-table-like object (such as a dict) in R you would use other
types of objects. Furthermore, for many operations where you might use hash
tables to implement the operation, R has already built in options, for
example %in%, match, duplicated. These are also vectorised; so two vectors:
one with keys and one with values might actually be faster than an
environment in some use cases.
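
For example, two parallel vectors can act as a small key/value store using
those vectorised built-ins:

```r
# Two parallel vectors acting as a key/value store, looked up with match().
keys   <- c("apple", "banana", "cherry")
values <- c(10, 20, 30)

values[match(c("cherry", "apple"), keys)]   # 30 10
c("banana", "durian") %in% keys             # TRUE FALSE
```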

Best,
Jan


> 
> And there are also packages for many features like sets as well as 
> functions to manipulate these things.
> 
> -Original Message-
> From: R-help  On Behalf Of B

Re: [R] Fwd: Merging multiple csv files to new file

2021-11-03 Thread Avi Gross via R-help
Gabrielle,

Why would you expect that to work?

rbind() binds rows of internal R data structures that are some variety of 
data.frame with exactly the same columns in the same order into a larger object 
of that type.

You are not providing rbind() with the names of variables holding the info but 
file names of Comma Separated Values.

If you have many files with different numbers of columns of data with some 
overlap, you need to decide on quite a few things first. If a file has say 4 
columns out of a possible 20 unique columns across the files, do you want to 
add 16 columns to the contents of the file, after reading it in, and re-arrange 
it into a specific order by column? What will you fill in the new columns with? 
NA is a popular choice but you need to decide.

You then need to repeat the same thing with all the other files and read in 6 
columns then add 14 filled as you wish and rearrange the columns to the same 
order.

When done, you have an assortment of variables of class data.frame (or other 
similar ones) and you can use rbind() on those variables to get a result.
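
A sketch of that padding step, with made-up column names, might look like:

```r
# Hypothetical example: pad each data.frame out to the union of all columns,
# filling the missing ones with NA, then stack the results with rbind().
df1 <- data.frame(id = 1:3, steps = c(100, 200, 300))
df2 <- data.frame(id = 4:5, calories = c(50, 60))

all_cols <- union(names(df1), names(df2))
pad <- function(df, cols) {
  df[setdiff(cols, names(df))] <- NA   # add the columns this file lacks
  df[cols]                             # put columns in a common order
}
combined <- rbind(pad(df1, all_cols), pad(df2, all_cols))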

But it may not be what you want. You may actually want more of a database merge 
type of operation combining columns from each into the same userID field or 
whatever. rbind() is not the function to do that with and I won't go on to give 
a long tutorial. 

My main point is what you are doing is at the wrong level. You need to read all 
the files into variables before doing additional calculations in R.

-Original Message-
From: R-help  On Behalf Of gabrielle aban 
steinberg
Sent: Tuesday, November 2, 2021 6:31 PM
To: r-help@r-project.org
Subject: [R] Fwd: Merging multiple csv files to new file

Hello, I would like to merge 18 csv files into a master data csv file, but each 
file has a different number of columns (mostly found in one or more of the 
other csv files) and different number of rows.

I have tried something like the following in R Studio (cloud):

all_data_fit_files <- rbind("dailyActivity_merged.csv", 
"dailyCalories_merged.csv", "dailyIntensities_merged.csv", 
"dailySteps_merged.csv", "heartrate_seconds_merged.csv", 
"hourlyCalories_merged.csv", "hourlyIntensities_merged.csv", 
"hourlySteps_merged.csv", "minuteCaloriesNarrow_merged.csv",
"minuteCaloriesWide_merged.csv", "minuteIntensitiesNarrow_merged.csv",
"minuteIntensitiesWide_merged.csv", "minuteMETsNarrow_merged.csv", 
"minuteSleep_merged.csv", "minuteStepsNarrow_merged.csv", 
“minuteStepsWide_merged.csv", "sleepDay_merged.csv", 
"minuteStepsWide_merged.csv", "sleepDay_merged.csv",
"weightLogInfo_merged.csv")



But I am getting the following error:

Error: unexpected input in "rlySteps_merged.csv", 
"minuteCaloriesNarrow_merged.csv", "minuteCaloriesWide_merged.csv", 
"minuteIntensitiesNarrow_merged.csv",
"minuteIntensitiesWide_merged.csv", "minuteMETsNarrow_merged.csv"


(Maybe the R Studio free trial/usage is underpowered for my project?)

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] Fwd: Merging multiple csv files to new file

2021-11-03 Thread Avi Gross via R-help
I am not clear why Python came up on this forum. Yes, you can do all sorts of 
stuff in Python (or other languages) in ways similar or not to doing them in R.

The topic here was reading in data from multiple CSV files and I saw no mention 
about whether some columns are supposed to be of type character or other types.

As noted, if the (CSV) file is properly formatted and whatever function you use 
to read them in does not guess right, you can use versions of functions that 
let you specify what type to expect OR change it after you read it in.

One poster seems to be confused and perhaps thinks what is being read from is 
some other kind of EXCEL data file. There are other functions that can read 
from those but I believe a CSV has no real issues unless it is formatted wrong 
or in ways that require you to ask for comments to be ignored or skip a few 
lines and so on.

Nonetheless, Gabrielle needs to spell out a bit more about her project; all we 
can suggest for now is that she read in each file sequentially (or in a loop) into 
multiple R variables. Beyond that, it is not clear what she wants to do in 
combining them and I am not so sure an rbind() makes much sense.

So what she needs perhaps is to look at functions like read.csv() and 
write.csv() and consider what transformations to make in the data read in and 
then how to recombine them.
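
A minimal sketch of that sequential reading (file names taken from her
message; adjust the list and paths as needed):

```r
# Read each CSV into a list of data.frames; forcing character columns avoids
# wrong type guesses, and individual columns can be converted afterwards.
files <- c("dailyActivity_merged.csv", "dailyCalories_merged.csv")
dfs <- lapply(files, read.csv, colClasses = "character")
names(dfs) <- files
```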

Of course, if I have completely misunderstood what she wants, never mind!

-Original Message-
From: R-help  On Behalf Of Jeff Newmiller
Sent: Wednesday, November 3, 2021 1:22 PM
To: r-help@r-project.org; Robert Knight ; gabrielle 
aban steinberg 
Cc: r-help 
Subject: Re: [R] Fwd: Merging multiple csv files to new file

Data type in a CSV is always character until inferred otherwise... it is not 
necessary nor even easier to manipulate files with Python if you are planning 
to use R to manipulate the data further with R. Just use the 
colClasses="character" argument for read.csv.

On November 3, 2021 9:47:03 AM PDT, Robert Knight  
wrote:
>It might be easier to settle on the desired final csv layout and use 
>Python to copy the rows via line reads.  Python doesn't care about the 
>data type in a given "cell", numeric or char, whereas the type errors R 
>would encounter would make the task very difficult.
>
>On Wed, Nov 3, 2021, 10:36 AM gabrielle aban steinberg < 
>gabrielleabansteinb...@gmail.com> wrote:
>
>> Hello, I would like to merge 18 csv files into a master data csv 
>> file, but each file has a different number of columns (mostly found 
>> in one or more of the other cvs files) and different number of rows.
>>
>> I have tried something like the following in R Studio (cloud):
>>
>> all_data_fit_files <- rbind("dailyActivity_merged.csv", 
>> "dailyCalories_merged.csv", "dailyIntensities_merged.csv", 
>> "dailySteps_merged.csv", "heartrate_seconds_merged.csv", 
>> "hourlyCalories_merged.csv", "hourlyIntensities_merged.csv", 
>> "hourlySteps_merged.csv", "minuteCaloriesNarrow_merged.csv",
>> "minuteCaloriesWide_merged.csv", 
>> "minuteIntensitiesNarrow_merged.csv",
>> "minuteIntensitiesWide_merged.csv", "minuteMETsNarrow_merged.csv", 
>> "minuteSleep_merged.csv", "minuteStepsNarrow_merged.csv", 
>> “minuteStepsWide_merged.csv", "sleepDay_merged.csv", 
>> "minuteStepsWide_merged.csv", "sleepDay_merged.csv",
>> "weightLogInfo_merged.csv")
>>
>>
>>
>> But I am getting the following error:
>>
>> Error: unexpected input in "rlySteps_merged.csv", 
>> "minuteCaloriesNarrow_merged.csv", "minuteCaloriesWide_merged.csv", 
>> "minuteIntensitiesNarrow_merged.csv",
>> "minuteIntensitiesWide_merged.csv", "minuteMETsNarrow_merged.csv"
>>
>>
>> (Maybe the R Studio free trial/usage is underpowered for my project?)
>>
>> [[alternative HTML version deleted]]
>>
>> __
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>   [[alternative HTML version deleted]]
>
>__
>R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide 
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

--
Sent from my phone. Please excuse my brevity.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ggplot2: multiple box plots, different tibbles/dataframes

2021-11-10 Thread Avi Gross via R-help
Rich,

I think many here may not quite have enough info to help you.

But the subject of multiple plots has come up. There are a slew of ways,
especially in the ggplot paradigm, to make multiple smaller plots into a
larger display showing them in some number of rows and columns, or other
ways. Some methods use facet_wrap() or facet_grid() type functionality that
let you plot multiple subdivisions of the data independently. These though
generally have to be in some way related.

Yet others let you make many independent graphs and save them and later
recombine them in packages like cowplot. 

So, although it may also be possible to do whatever it is you want within a
single plot, it may also make sense to do it as loosely described above. 

-Original Message-
From: R-help  On Behalf Of Rich Shepard
Sent: Wednesday, November 10, 2021 4:53 PM
To: R-help 
Subject: Re: [R] ggplot2: multiple box plots, different tibbles/dataframes

On Wed, 10 Nov 2021, Bert Gunter wrote:

> As always, online search (on "ggplot2 help") seemed to bring up useful 
> resources. I suggest you look here (suggested tutorials and resources 
> are farther down the page): https://ggplot2.tidyverse.org/

Bert,

My web search was for multiple boxplots and I didn't see what I wanted. I'll
continue reading the ggplot2 third edition and figure it out.

Regards,

Rich

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] ggplot2: multiple box plots, different tibbles/dataframes

2021-11-11 Thread Avi Gross via R-help


Rich,

Boxplots like many other things in ggplot can be grouped in various ways. I
often do something like this:

Say I have a data.frame with columns called PLACE and MEASURE, among others. The
one I call PLACE would be a factor containing the locations you are
measuring at. I mean it would be character strings of your N places but the
factors would be made in the order you want the results in. The MEASURE
variable in each row would contain one of the many measures at that
location. You probably would have other columns like DATE.

To display multiple boxplots subdivided by place is as easy as using the
phrase in an aes() clause like:

ggplot(your_data, aes(..., color=PLACE)) + geom_boxplot()

There are other variants or using group= but it works fine usually for up to
six places as colors often get recycled.

My impression is that you want to sort of force ggplot to keep forgetting
earlier data and commence anew on new data and other attributes. Yes, there
is a way to ask ggplot to not inherit things from step to step and take new
instructions BUT you may not be aware of how different the ggplot paradigm
is compared to other graphics engines.

Some of the others plot repeatedly as you go along, and new layers sort of
overlay old layers. Ggplot makes no plots whatsoever. It creates a huge
complex object and populates it as it goes along. Some changes in setting
simply overwrite earlier ones that are then gone. The plotting can be done
at any later time (or automatically when it finishes) using one of the
methods known to print(). So some of what you want may not work well as it
normally does not want to store multiple data sets.

In one sense, I would say the ggplot way is to focus on getting your data
into the right shape. In your case, have you considered reading in your
multiple data items into df1 through df4 or whatever and making changes so
each has a new column called something like PLACE that is the same for all
items in df1 and another for df2 and so on?

When you have done that and made all the dfN have the same names and numbers
for columns, you can combine them into one df_combined using something like
rbind().

You can then change the column in df_combined called PLACE to be a factor of
itself in the order you want based on the compass.

What you have then can be given to ggplot as described above. Note in some
places ggplot sees a factor in the order it is sequenced, i.e. it may see it
as containing a 1 and 2 and so on. So the easiest way to make it do some
things is to set the factor levels before you call it. Somewhat more advanced
users can reorder in midstream, e.g. with y=reorder(something) inside ggplot.
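
For instance (the PLACE names and cfs values here are invented):

```r
library(ggplot2)

# Hypothetical data: PLACE set as a factor in south-to-north order,
# so the grouped boxplots come out in that order.
df_combined <- data.frame(
  PLACE = rep(c("North", "South"), each = 4),
  cfs   = c(5200, 5100, 5300, 5000, 11900, 11800, 11700, 11850)
)
df_combined$PLACE <- factor(df_combined$PLACE, levels = c("South", "North"))

ggplot(df_combined, aes(y = cfs, color = PLACE)) + geom_boxplot()
```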



-Original Message-
From: R-help  On Behalf Of Rich Shepard
Sent: Thursday, November 11, 2021 8:50 AM
To: r-help@r-project.org
Subject: Re: [R] ggplot2: multiple box plots, different tibbles/dataframes

On Wed, 10 Nov 2021, Avi Gross via R-help wrote:

> I think many here may not quite have enough info to help you.

Avi,

Actually, you've reflected my thinking.

> But the subject of multiple plots has come up. There are a slew of 
> ways, especially in the ggplot paradigm, to make multiple smaller 
> plots into a larger display showing them in some number of rows and 
> columns, or other ways. Some methods use facet_wrap() or facet_grid() 
> type functionality that let you plot multiple subdivisions of the data 
> independently. These though generally have to be in some way related.

My experience with facets (which I believe are like lattice's conditioned
trellis plots) is that each plot is in a separate frame in a row, column, or
matrix. That won't communicate what I want viewers to see as well as would
having all in a single frame.

My data represent hydrologic and geochemical conditions at four locations
along the mainstem of a river. While the period of record for each
monitoring gauge is different, I want to illustrate how highly variable
conditions are at each location. The major factor of interest is discharge,
the volume of water passing a river cross section at the gauge location in
cubic feet per second. I have created boxplots for each site representing
the distribution of discharge for the entire data set and I'd like to place
each of the four horizontal boxplots stacked vertically with the
southern-most at the bottom and the northern-most at the top (the river
flows north).

> Yet others let you make many independent graphs and save them and 
> later recombine them in packages like cowplot.

I discovered cowplot yesterday but haven't yet read the PDF or vignette.

> So, although it may also be possible to do whatever it is you want 
> within a single plot, it may also make sense to do it as loosely described
above.

While I certainly may be wrong, I believe that seeing all four boxplots in
the same frame makes the differences in distribution most clear.

Thanks,

Rich

__
R-help@r-proj

Re: [R] ggplot2: multiple box plots, different tibbles/dataframes

2021-11-11 Thread Avi Gross via R-help
Rich,

I think we have suggested something several times that you ignore as you are
focused on your way of thinking.

If you read the last part of the letter I wrote in public, I suggest
combining your multiple dataframes into one if they are compatible and
including a new column called something like PLACE. The existence of that
variable lets you tell ggplot you want MULTIPLE plots placed in a grid based
on that variable OR in other words, asking ggplot to subdivide your data
into four parts and make one plot for each and then place them into a
matrix. 

I will not spell out the many variations you can make but some variations
have you add another layer to the ggplot that may use formula notation to
specify what combination of variables to do the subdivisions by or
specifying you want then in rows (meaning vertically stacked) based on a
variable and so on.

Look up facet_grid() and facet_wrap() as various ways to do this. Note you
may want to examine some options such as setting the scales to be the same
or free.


-Original Message-
From: R-help  On Behalf Of Rich Shepard
Sent: Thursday, November 11, 2021 12:22 PM
To: r-help@r-project.org
Subject: Re: [R] ggplot2: multiple box plots, different tibbles/dataframes

On Thu, 11 Nov 2021, Avi Gross wrote:

> Boxplots like many other things in ggplot can be grouped in various ways.
> I often do something like this:

Avi,

I've designed and used multiple boxplots in many projects. They might show
geochemical concentrations at two locations or in two (or three) separate
time periods. All data in a single dataframe.

> To display multiple boxplots subdivided by place is as easy as using 
> the phrase in an aes() clause like:
>
>   ggplot(your_data, aes(..., color=PLACE)) + geom_boxplot()

What I need to plot are multiple 'your_data' sets. I'll be testing this:
# Draw a ggplot2 plot based on two data frames
ggp <- ggplot(NULL, aes(x, y)) +
   geom_point(data = data1, col = "red") +
   geom_line(data = data2, col = "blue")
ggp # Draw plot

today, but using four boxplots.

Regards,

Rich

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] ggplot2: multiple box plots, different tibbles/dataframes

2021-11-11 Thread Avi Gross via R-help
As I replied to Rich privately for another message, I suggest that you may
well be able to fit what you need in memory, if careful. But my main point
is that when you have so much data, you do not need all of it to make a
representative graph. A boxplot made using 100,000 data points may well have
too many outliers to display resulting in a bushy tail and not be all that
much more accurate than one made using 10,000 randomly chosen data points
from it.

So the idea would be to read in df1 into memory, trimming away any columns
not needed, then use something like sample() to make a smaller version and
rm() the original and repeat by reading in the second and third and so on.

Now add a PLACE column to each of df1 through dfN and then rbind() them
together and again throw away anything no longer needed.

Finally, you can use factors as already discussed including as a way to use
less data as a factor is just an integer vector attached to a sort of
dictionary containing one copy of the text aspect of your data. 

Then call ggplot and ...

The results may vary depending on the size chosen and it may be wise to use
set.seed() to some value so it does the same thing each time you run it.
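
That read-trim-sample-discard loop might be sketched like this (the file and
column names are hypothetical):

```r
# Hypothetical sketch: read one site's file, keep only needed columns,
# downsample reproducibly, then drop the full data from memory.
set.seed(42)
df1 <- read.csv("site1.csv")                       # hypothetical file name
keep <- sample(nrow(df1), min(nrow(df1), 10000))   # at most 10,000 rows
df1_small <- df1[keep, c("site_nbr", "cfs")]
rm(df1)
```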

Your plan of making separate boxplots can also use as much memory
or more if you keep everything in memory as you go along.

And, BTW, for people using truly big data, there are approaches that get
them huge amounts of memory either within their own machines, or using web
services.

-Original Message-
From: R-help  On Behalf Of Rich Shepard
Sent: Thursday, November 11, 2021 12:56 PM
To: R-help 
Subject: Re: [R] ggplot2: multiple box plots, different tibbles/dataframes

On Thu, 11 Nov 2021, Bert Gunter wrote:

> You can always create a graphics layout  and then plot different 
> ggplot objects in the separate regions of the layout. See ?grid.layout 
> (since ggplots are grobs)  and ?plot.ggplot  . This also **may** be 
> useful by showing examples using grid.arrange()
>
> https://cran.r-project.org/web/packages/egg/vignettes/Ecosystem.html
>
> Still, I suspect that Jeff Newmiller may be right about needing to 
> structure your data more appropriately for what you wish to do.

Bert,

For this plot I could create a new data set with only site_nbr, year and cfs
columns; it would be 3,016,005 rows long.

Or, I could create separate boxplots and arrange them in a row. That might
be the easiest.

Thanks,

Rich

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] ggplot2: multiple box plots, different tibbles/dataframes

2021-11-11 Thread Avi Gross via R-help
Rich,

This is not a place designed for using packages but since this discussion
persists, I will supply you with SAMPLE code thrown together in just a few
minutes to illustrate the IDEAS, but yours would obviously be tweaked to your
needs. I made a very small amount of data to illustrate several approaches
and neglected worrying about the X dimension. And, you may well want to use
other variants such as facet_grid() instead if it does more like what you
want.

I then threw in cowplot as an example of doing it another way. This is
more useful for combining heterogeneous graphs. Many of the things I show
just as a silly example have alternates, and there are multiple packages that
do similar (and also different) things than cowplot does if you want to tune
the output with other niceties.

If you copy the code below (installing any needed packages first), it
should run on your machine if you do it in chunks so you can see the graphs
one at a time.

#START of code

# Load libraries needed, using install.packages() first if needed.
library(tidyverse)

# Make sample data AS IF you have already read in from file and converted.
df1 <- data.frame(site_nbr=1, DATE=1:5,
cfs=c(11900,11800,11900,11700,11800))
df2 <- data.frame(site_nbr=2, DATE=3:7,
cfs=c(12900,12600,12900,12700,12300))

# Combine all your data into one df.
df <- rbind(df1, df2)
# rm (df1, df2)

# Make a factor in the order you want.
df$site_nbr <- factor(x=df$site_nbr, levels=c(2,1))

# ready for a ggplot segmented by site_nbr.
ggplot(data=df, aes(x=NULL, y=cfs)) +
  geom_boxplot(aes(group=site_nbr))

# Or use color instead for a more specific grouping.
ggplot(data=df, aes(x=NULL, y=cfs)) +
  geom_boxplot(aes(color=site_nbr))

# Or make multiple lattice-like plots, default may be horizontal.
ggplot(data=df, aes(x=NULL, y=cfs)) +
  geom_boxplot() +
  facet_wrap(~site_nbr)

# Or make multiple lattice-like plots, specifying you want vertical.
ggplot(data=df, aes(x=NULL, y=cfs)) +
  geom_boxplot() +
  facet_wrap(~site_nbr, nrow=2)

# The above has the same scale used, so if you want, change them.
ggplot(data=df, aes(x=NULL, y=cfs)) +
  geom_boxplot() +
  facet_wrap(~site_nbr, nrow=2, ncol=1, scales="free")

# ALTERNATE METHOD of making multiple plots and combining them later.
require("cowplot")

# read in data to simulate what is shown below:
df1 <- data.frame(site_nbr=1, DATE=1:5,
cfs=c(11900,11800,11900,11700,11800))
df2 <- data.frame(site_nbr=2, DATE=3:7,
cfs=c(12900,12600,12900,12700,12300))

# Create and save two ggplots, or more in your case:

p1 <- ggplot(data=df1, aes(x=NULL, y=cfs)) +
  geom_boxplot(color="red", fill="yellow")


p2 <- ggplot(data=df2, aes(x=NULL, y=cfs)) +
  geom_boxplot(color="green", fill="pink")

# combine the two or more vertically using plot_grid() from cowplot
plot_grid(p1, p2, ncol=1)

#END OF CODE

-Original Message-
From: R-help  On Behalf Of Rich Shepard
Sent: Thursday, November 11, 2021 1:25 PM
To: r-help@r-project.org
Subject: Re: [R] ggplot2: multiple box plots, different tibbles/dataframes

On Thu, 11 Nov 2021, Avi Gross via R-help wrote:

> Say I have a data.frame with columns called PLACE and MEASURE others. 
> The one I call PLACE would be a factor containing the locations you 
> are measuring at. I mean it would be character strings of your N 
> places but the factors would be made in the order you want the results 
> in. The MEASURE variable in each row would contain one of the many 
> measures at that location. You probably would have other columns like
DATE.

Avi/Jeff/Burt,

Here are the head and tail of one data file:
site_nbr,year,mon,day,hr,min,tz,cfs
14174000,1986,10,01,00,30,PDT,11900
14174000,1986,10,01,01,00,PDT,11900
14174000,1986,10,01,01,30,PDT,11900
14174000,1986,10,01,02,00,PDT,11800
14174000,1986,10,01,02,30,PDT,11800
14174000,1986,10,01,03,00,PDT,11800
14174000,1986,10,01,03,30,PDT,11800
14174000,1986,10,01,04,00,PDT,11800
14174000,1986,10,01,04,30,PDT,11800
...
14174000,2021,09,30,23,12,PDT,5070
14174000,2021,09,30,23,17,PDT,5070
14174000,2021,09,30,23,22,PDT,5050
14174000,2021,09,30,23,27,PDT,5050
14174000,2021,09,30,23,32,PDT,5050
14174000,2021,09,30,23,37,PDT,5050
14174000,2021,09,30,23,42,PDT,5050
14174000,2021,09,30,23,47,PDT,5050
14174000,2021,09,30,23,52,PDT,5050
14174000,2021,09,30,23,57,PDT,5050

(Water years begin October 1st and end September 30th.)

The other three locations have the same format.

The boxplots for each PLACE (site_nbr) should summarize all MEASURE (cfs)
values for all recorded data (DATE).

The R tibbles have a datetime column which could be the DATE.

If I assemble all 4 sites into a single tibble I suppose it would have three
columns PLACE (the grouping factor), DATE (on the x axis), and MEASURE (cfs
on the y axis) and each boxplot would be grouped so the command would be:

disc_plot <- ggplot(df, aes(x = group, y = cfs)) +
geom_boxp

Re: [R] the opposite of pluck() in purrr

2021-11-18 Thread Avi Gross via R-help
As noted, this is not the place to ask about purrr but the answer you may
want is perhaps straight R.

If you have a list called weekdays and you know you do not want to take the
fifth, then indexing with -5 removes it:

> weekdays <- list("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
> weekdays[-5]
[[1]]
[1] "Sun"

[[2]]
[1] "Mon"

[[3]]
[1] "Tue"

[[4]]
[1] "Wed"

[[5]]
[1] "Fri"

[[6]]
[1] "Sat"

In general, any function you use to get one or more indices can be used sort
of like this:

> odder <- seq(from=1, to=length(weekdays), by=2)
> weekdays[-odder]
[[1]]
[1] "Mon"

[[2]]
[1] "Wed"

[[3]]
[1] "Fri"

So you do not really need to search for purrr functionality, such as how to
take all but some elements of a list in a pipeline; use the odd-looking
function `[` to access the subset:

> weekdays %>% `[`(-5)
[[1]]
[1] "Sun"

[[2]]
[1] "Mon"

[[3]]
[1] "Tue"

[[4]]
[1] "Wed"

[[5]]
[1] "Fri"

[[6]]
[1] "Sat"

> weekdays %>% `[`(odder)
[[1]]
[1] "Sun"

[[2]]
[1] "Tue"

[[3]]
[1] "Thu"

[[4]]
[1] "Sat"
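
If a named verb reads better in a pipeline than a bare `[`, the magrittr
package (part of the tidyverse toolchain, assumed installed) aliases `[` as
extract():

```r
# magrittr aliases `[` as extract(), which some find more readable in pipes.
library(magrittr)
weekdays <- list("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
weekdays %>% extract(-5)   # everything except the fifth element
```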


-Original Message-
From: R-help  On Behalf Of Christopher W. Ryan
via R-help
Sent: Thursday, November 18, 2021 4:40 PM
To: R-help@r-project.org
Subject: [R] the opposite of pluck() in purrr

I've just learned about pluck() and chuck() in the purrr package. Very cool!
As I understand it, they both will return one element of a list, either by
name or by [[]] index, or even "first"  or "last"

I was hoping to find a way to return all *but* one specified element of a
list. Speaking loosely, pluck(-1) or pluck(!1) or !pluck(1), but none of
those of course work. Thinking of English language, I had hopes for
chuck(1) as in "chuck element 1 away, leaving the rest"  but that's not how
it works.

Any tidyverse-centric ways to return all except one specified element of a
list?

Thanks.

--Chris Ryan

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] Date read correctly from CSV, then reformatted incorrectly by R

2021-11-20 Thread Avi Gross via R-help
This seems to be a topic that comes up periodically. The various ways in R
and other packages for reading in data often come with methods that simply
guess wrong, or encounter one or more data items in a column that really do
not fit, so fields may by default fall back to a lowest common denominator
such as character or floating point.

There are ways that some such programs can be given a hint of what you
expect or even be supplied with a way to coerce them into what you want
while being read in. But realistically, often a more practical  method might
be to take the data.frame variety you read in and before using it for other
purposes, check it for validity and make any needed changes. Simplistic ones
might be to see how many columns were read in to see if it matches
expectations or generate an error. Or you may trim columns (or rows) that
are not wanted.

In that vein, are there existing functions available that will accept what
types you want one or more columns to be in and that validate if the current
type is something else and then convert if needed? I mean we have functions
like as.integer(df$x ) or more flexibly as(df$x, "integer") and you may
simply build on a set of those and create others to suit any special needs.

Of course a good method carefully checks the results before over-writing as
sometimes the result may not be the same length (as shown below) or may
violate some other ideas or rules:

> as(c(NULL, NA, 3, 3.1, "3.1", list(1,2,"a")), "character")
[1] "NA"  "3"   "3.1" "3.1" "1"   "2"   "a"  

So if you have dates in some format, or sometimes an unknown format, there
are ways, including some others have shown, to make them into some other
date format or even make multiple columns that together embody the format.

What people sometimes do is assume software is perfect and should do
anything they want. It is the other way around and the programmer or data
creator has some responsibilities to use the right software on the right
data and that may also mean sanity checks along the way to  see if the data
is what you expect or alter it to be what you need.
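
One possible shape for such a check-then-convert helper (the function name
and the validity rules here are invented for illustration):

```r
# Hypothetical helper: coerce a column with the supplied converter, but
# refuse to overwrite the original if the result changes length or gains NAs.
coerce_checked <- function(df, col, convert = as.numeric) {
  old <- df[[col]]
  new <- suppressWarnings(convert(old))
  if (length(new) != length(old))
    stop("conversion changed the number of elements in ", col)
  if (sum(is.na(new)) > sum(is.na(old)))
    stop("conversion introduced NAs in ", col, "; inspect the data first")
  df[[col]] <- new
  df
}

df <- data.frame(x = c("1", "2", "3.5"), stringsAsFactors = FALSE)
df <- coerce_checked(df, "x")   # fine: x becomes numeric

df2 <- data.frame(d = c("28/10/2016", "not a date"), stringsAsFactors = FALSE)
try(coerce_checked(df2, "d", function(x) as.Date(x, format = "%d/%m/%Y")))
# errors: the second value parses to NA, so the helper refuses to overwrite
```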


-Original Message-
From: R-help  On Behalf Of Philip Monk
Sent: Saturday, November 20, 2021 3:28 PM
To: Jeff Newmiller 
Cc: R-help Mailing List 
Subject: Re: [R] Date read correctly from CSV, then reformatted incorrectly
by R

Thanks, Jeff.

I follow what you're doing below, but know I need to read up on Date /
POSIXct.  Helpful direction!  :)

On Sat, 20 Nov 2021 at 18:41, Jeff Newmiller 
wrote:
>
> Beat me to it! But it is also worth noting that once converted to Date or
POSIXct, timestamps should be treated as data without regard to how that
data is displayed. When you choose to output that data you will have options
as to the display format associated with the function you are using for
output.
>
> My take:
>
> dta <- read.table( text=
> "Buffer 28/10/2016 19/11/2016 31/12/2016 16/01/2017 05/03/2017
> 100 2.437110889 -8.69674895 3.239299816 2.443183304 2.346743827
> 200 2.524329899 -7.688862068 3.386811734 2.680347706 2.253885237
> 300 2.100784256 -8.059855835 3.143786507 2.615152896 2.015645973
> 400 1.985608385 -10.6707206 2.894572791 2.591925038 2.057913137
> 500 1.824982163 -9.122519736 2.560350727 2.372226799 1.995863839
> ", header=TRUE, check.names=FALSE, as.is=TRUE)
>
> dta
>
> library(dplyr)
> library(tidyr)
>
> dt_fmt <- "%d/%m/%Y"
>
> dta_long <- (   dta
> %>% pivot_longer( cols = -Buffer
> , names_to = "dt_chr"
> , values_to = "LST"
> )
> %>% mutate( dt_date = as.Date( dt_chr, format = dt_fmt )
>   , dt_POSIXct = as.POSIXct( dt_chr, format = dt_fmt,
tz = "Etc/GMT+8" )
>   )
> )
>
> dta_long
>
> On November 20, 2021 10:01:56 AM PST, Andrew Simmons 
wrote:
> >The as.Date function for a character class argument will try reading 
> >in two formats (%Y-%m-%d and %Y/%m/%d).
> >
> >
> >This does not look like the format you have provided, which is why it 
> >doesn't work. Try something like:
> >
> >
> >x <- c("28/10/2016", "19/11/2016", "31/12/2016", "16/01/2016", 
> >"05/03/2017") as.Date(x, format = "%d/%m/%Y")
> >
> >
> >which produces this output:
> >
> >
> >> x <- c("28/10/2016", "19/11/2016", "31/12/2016", "16/01/2016",
> >"05/03/2017")
> >> as.Date(x, format = "%d/%m/%Y")
> >[1] "2016-10-28" "2016-11-19" "2016-12-31" "2016-01-16" "2017-03-05"
> >>
> >
> >
> >much better than before! I hope this helps
> >
> >On Sat, Nov 20, 2021 at 12:49 PM Philip Monk  wrote:
> >
> >> Thanks Eric & Jeff.
> >>
> >> I'll certainly read up on lubridate, and the posting guide (again) 
> >> (this should be in plain text).
> >>
> >> CSV extract below...
> >>
> >> Philip
> >>
> >> Buffer 28/10/2016 19/11/2016 31/12/2016 16/01/2017 05/03/2017
> >> 100 2.437110889 -8.69674895 3.239299816

[R] Large data and space use

2021-11-27 Thread Avi Gross via R-help
Several recent questions and answers have made me look at some code and I
realized that some functions may not be great to use when you are dealing
with very large amounts of data that may already be getting close to limits
of your memory. Does the function you call to do one thing to your object
perhaps overdo it and make multiple copies and not delete them as soon as
they are not needed?

 

An example was a recent post suggesting a nice set of tools you can use to
convert your data.frame so the columns are integers or dates no matter how
they were read in from a CSV file or created.

 

What I noticed is that often copies of a sort were made by trying to change
the original say to one date format or another and then deciding which, if
any to keep. Sometimes multiple transformations are tried and this may be
done repeatedly with intermediates left lying around. Yes, the memory will
all be implicitly returned when the function completes. But often these
functions invoke yet other functions which work on their copies. You can end
up with your original data temporarily using multiple times as much actual
memory.

 

R does have features so some things are "shared" unless one copy or another
changes. But in the cases I am looking at, changes are the whole idea.

 

What I wonder is whether such functions should clearly call an rm() or the
equivalent as soon as possible when something is no longer needed.

 

The various kinds of pipelines are another case in point as they involve all
kinds of hidden temporary variables that eventually need to be cleaned up.
When are they removed? I have seen pipelines with 10 or more steps as
perhaps data is read in, has rows removed or columns removed or re-ordered
and grouping applied and merged with others and reports generated. The
intermediates are often of similar sizes with the data and if large, can add
up. If writing the code linearly using temp1 and temp2 type of variables to
hold the output of one stage and the input of the next stage, I would be
tempted to add a rm(temp1) as soon as it was finished being used, or just
reuse the same name of temp1 so the previous contents are no longer being
pointed to and can be taken by the garbage collector at some time.
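
As a small illustration of the name-reuse and rm() patterns described above
(object names and sizes are arbitrary):

```r
big <- data.frame(x = runif(1e6))

# (1) Reuse the same name, so the old copy becomes garbage-collectable:
big <- transform(big, y = x * 2)

# (2) Drop intermediates explicitly as soon as they are finished:
tmp    <- big[big$x > 0.5, ]
result <- sum(tmp$y)
rm(tmp)           # do not wait for the enclosing function to return
invisible(gc())   # optionally prompt the collector right away
```

Base R's tracemem() can also be used to watch when a copy of an object is
actually made.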

 

So I wonder if some functions should have a note in their manual pages
specifying what may happen to the volume of data as they run. An example
would be if I had a function that took a matrix and simply squared it using
matrix multiplication. There are various ways to do this and one of them
simply makes a copy and invokes the built-in way in R that multiplies two
matrices. It then returns the result. So you end up storing basically three
times the size  of the matrix right before you return it. Other methods
might do the actual multiplication in loops operating on subsections of the
matrix and if done carefully, never keep more than say 2.1 times as much
data around. 

 

Or is this not important often enough? All I know, is data may be getting
larger much faster than memory in our machines gets larger.

 

 


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Large data and space use

2021-11-28 Thread Avi Gross via R-help
Richard,

 

I currently have no problem with running out of memory. I was referring to 
people who have said they use LARGE structures and I am pointing out how they 
can temporarily get way larger even when not expected. Functions that 
temporarily will balloon up might come with notifications. And, yes, some 
transformations may well be doable outside R or in chunks. What gets me is how 
often users have no idea what happens when they invoke a package.

 

I am not against transformations and needed duplications. I am more interested 
in whether some existing code might be evaluated and updated in somewhat 
harmless ways as in removing stuff as soon as it is definitely not needed. Of 
course there are tradeoffs. I have seen times only one column of a data.frame 
was needed and the entire data.frame was copied and then returned. That is OK 
but clearly it might be more economical to ask just for a single column to be 
changed in place. People often use a sledgehammer when a thumbtack will do.

 

But as noted, R has features that often delay things so a full copy is not made 
and thus less memory is ever used. But people seem to think that since all 
“local” memory is generally returned when the function ends, so why bother 
micromanaging it as it runs.

 

Arguably, some R packages may make changes in what is kept and for how long. 
Standard R lets you specify what rows and what columns of a data.frame to keep 
in a single argument as in df[rows, columns] while something like dplyr offers 
multiple smaller steps in a grammar of sorts so you do something like a 
select() followed (often in a pipeline) by a filter() or done in the opposite 
order. Each additional change is sometimes done by programmers in minimal steps 
so that a more efficient implementation is harder to do as each one does just 
one thing well. That may also be a plus, especially if pipelined objects are 
released in progress and not all at the end of the pipeline.

 

From: Richard O'Keefe  
Sent: Sunday, November 28, 2021 3:54 AM
To: Avi Gross 
Cc: R-help Mailing List 
Subject: Re: [R] Large data and space use

 

If you have enough data that running out of memory is a serious problem,

then a language like R or Python or Octave or Matlab that offers you NO

control over storage may not be the best choice.  You might need to

consider Julia or even Rust.

 

However, if you have enough data that running out of memory is a serious

problem, your problems may be worse than you think.  In 2021, Linux is

*still* having OOM Killer problems.

https://haydenjames.io/how-to-diagnose-oom-errors-on-linux-systems/

Your process hogging memory may cause some other process to be killed.

Even if that doesn't happen, your process may be simply thrown off the

machine without being warned.

 

It may be one of the biggest problems around in statistical computing:

how to make it straightforward to carve up a problem so that it can be

run on many machines.  R has the 'Rmpi' and 'snow' packages, amongst others.

https://CRAN.R-project.org/view=HighPerformanceComputing

 

Another approach is to select and transform data outside R.  If you have

data in some kind of data base then doing select and transform in the

data base may be a good approach.

 

 

On Sun, 28 Nov 2021 at 06:57, Avi Gross via R-help <r-help@r-project.org> wrote:

Several recent questions and answers have made me look at some code and I
realized that some functions may not be great to use when you are dealing
with very large amounts of data that may already be getting close to limits
of your memory. Does the function you call to do one thing to your object
perhaps overdo it and make multiple copies and not delete them as soon as
they are not needed?



An example was a recent post suggesting a nice set of tools you can use to
convert your data.frame so the columns are integers or dates no matter how
they were read in from a CSV file or created.



What I noticed is that often copies of a sort were made by trying to change
the original say to one date format or another and then deciding which, if
any to keep. Sometimes multiple transformations are tried and this may be
done repeatedly with intermediates left lying around. Yes, the memory will
all be implicitly returned when the function completes. But often these
functions invoke yet other functions which work on their copies. You can end
up with your original data temporarily using multiple times as much actual
memory.



R does have features so some things are "shared" unless one copy or another
changes. But in the cases I am looking at, changes are the whole idea.



What I wonder is whether such functions should clearly call an rm() or the
equivalent as soon as possible when something is no longer needed.



The various kinds of pipelines are another case in point as they involve all
kinds of hidden temporary variables that eventually need to be cleaned up.
When are they 

Re: [R] Question about Rfast colMins and colMaxs

2021-11-30 Thread Avi Gross via R-help
Stephen,

Although what is in the STANDARD R distribution can vary several ways, in
general, if you need to add a line like:

library(something)
or
require(something)

and your code does not work unless you have done that, then you can imagine
it is not sort of built in to R as it starts.

Having said that, tons of exceptions may exist that cause R to load in
things on your machine for everyone or just for you without you having to
notice.

I think this forum lately has been deluged with questions about all kinds of
add-on packages and in particular, lots of the ones in the tidyverse.
Clearly the purpose here is not that broad.

But since I use some packages like the tidyverse extensively, and I am far
from alone, I wonder if someday the powers that be realize it is a losing
battle to exclude at least some of it. It would be so nice not having to
include a long list of packages for some programs or somehow arrange that
people using something you shared had installed and loaded them. But there
are too many packages out there of varying quality and usefulness and
popularity with more every day being added. Worse, many are somewhat
incompatible such as having functions with the same names that hide earlier
ones loaded.

Base R does come with functions like colSums and colMeans and similar row
functions. But as mentioned, a data.frame is a list of vectors and R
supports certain functional programming constructs over lists using things
like:

lapply(df, min)
sapply(df, min)

And actually quite a few ways depending on what info you want back and
whether you insist it be returned as a list or vector or other things . You
can even supply additional arguments that might be needed such as if you
want to ignore any NA values,

lapply(df, min, na.rm=TRUE)

The package you looked at is trying to be fast and uses what looks like
compiled external code but so does lapply.

If this is too bothersome for you, consider making a one-liner function like
this:

mycolMins <- function(df, ...) lapply(df, min, ...)

Once defined, you can use that just fine and not think about it again and I
note this answer (like others) is offering you something in base R that
works fine on data.frames and the like.

You can extend this to many similar ideas, like one that calculates the min
unless you over-ride it with max or mean or sd or a bizarre function like
`[`, so a call to:

mycolCalc(df, `[`, 3)

will return exactly the third item from each column (that is, row 3)!
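
For instance, one possible definition of such a mycolCalc() helper (a sketch;
the name appears above but the body here is mine):

```r
# Apply an arbitrary function to every column, defaulting to min
mycolCalc <- function(df, FUN = min, ...) lapply(df, FUN, ...)

df <- data.frame(a = c(3, 1, 2), b = c(9, 7, 8))
mycolCalc(df)          # per-column minima: a = 1, b = 7
mycolCalc(df, max)     # per-column maxima: a = 3, b = 9
mycolCalc(df, `[`, 3)  # third element of each column, i.e. row 3
```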

I find it to be very common for someone these days to do a quick search for
a way to do something in a language like R and not really look to see if it
is a standard way or something special. Matrices in R are not quite the same
as some other objects like a data.frame or tibble and a package written to
be used on one may (or may not) happen to work with another. Some packages
are carefully written to try to detect what kind of object it gets and when
possible convert it to another. The "apply" function is meant for matrices
but if it sees something else it looks at the dimensionality and tries to
coerce it with as.matrix or as.array first. As others have noted, this means
a data.frame containing non-numeric parts may fail or should have any other
columns hidden/removed as in this df that has some non-numeric fields:

> df
i   s   f b i2
1 1   hello 1.2  TRUE  5
2 2   there 2.3 FALSE  4
3 3 goodbye 3.4  TRUE  3

So a bit more complex one-liner removes any non-numeric columns like this:

> mycolMins(df[, sapply(df, is.numeric)])
$i
[1] 1

$f
[1] 1.2

$i2
[1] 3

Clearly converting that to a matrix while whole would result in everything
being converted to character and a minimum may be elusive.

-Original Message-
From: R-help  On Behalf Of Stephen H. Dawson,
DSL via R-help
Sent: Tuesday, November 30, 2021 5:37 PM
To: Bert Gunter 
Cc: r-help@r-project.org
Subject: Re: [R] Question about Rfast colMins and colMaxs

Oh, you are segmenting standard R from the rest of R.

Well, that part did not come across to me in your original reply. I am not
clear on a standard versus non-standard list. I will look into this aspect
and see what I can learn going forward.


Thanks,
*Stephen Dawson, DSL*
/Executive Strategy Consultant/
Business & Technology
+1 (865) 804-3454
http://www.shdawson.com 


On 11/30/21 5:26 PM, Bert Gunter wrote:
> ... but Rfast is *not* a "standard" package, as the rest of the PG 
> excerpt says. So contact the maintainer and ask him/her what they 
> think the best practice should be for their package. As has been 
> pointed out already, it appears to differ from the usual "read it in 
> as a data frame" procedure.
>
> Bert
>
> On Tue, Nov 30, 2021 at 2:11 PM Stephen H. Dawson, DSL 
>  wrote:
>> Right, R Studio is not R.
>>
>> However, the Rfast package is part of R.
>>
>> https://cran.r-project.org/web/packages/Rfast/index.html
>>
>> So, rephrasing my question...
>> What is the best practice to bring a csv file into R so it can be 
>> accessed by colM

Re: [R] Find tibble row with maximum recorded value

2021-12-04 Thread Avi Gross via R-help
The right answer obviously depends on the REQUIREMENTS and they may not have 
been fully stated.

This is a bit like finding the mode of a set of numbers. The most frequent 
value may not be as representative of the data as the mean or even the median 
for some purposes, as well as other measures of central tendency.

What does Rich want? Choices can include getting the entire row where the value 
happens first, or maybe even the last or ALL of them as a group.

Be that as it may, if you return all of them, as a mini-data.frame, it is easy 
enough to then choose to keep them all, or just the first by subsetting  it to 
get row 1, or even to get the last row by applying nrow() to the result.

I can think of many ways you want an answer including beyond this, some kind of 
blending of the other columns you get back to make some kind of composite row 
from all the rows that matched.

Without clear requirements, there often is no right answer.
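
For concreteness, here are the choices above in base R, with a toy data frame
standing in for the tibble in question:

```r
df <- data.frame(site = 1:6, cfs = c(1, 5, 3, 5, 2, 5))

df[which.max(df$cfs), ]                # first row attaining the maximum
df[df$cfs == max(df$cfs), ]            # all rows attaining the maximum
tail(df[df$cfs == max(df$cfs), ], 1)   # last such row
```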

-Original Message-
From: R-help  On Behalf Of Rui Barradas
Sent: Saturday, December 4, 2021 2:34 AM
To: Bert Gunter 
Cc: R-help 
Subject: Re: [R] Find tibble row with maximum recorded value

Hello,

You're right, I carelessly coded this.

which.max returns the index to the first maximum of a vector, while the 
comparison of a vector with its max() returns an index to all vector elements.

Às 23:27 de 03/12/21, Bert Gunter escreveu:
> Perhaps you meant to point this out, but the cfs[which.max(cfs)] and 
> cfs == ... are not the same:
> 
>> x <- rep(1:2,3)
>> x
> [1] 1 2 1 2 1 2
>> x[which.max(x)]
> [1] 2
>> x[x==max(x)]
> [1] 2 2 2
> 
> So maybe your point is: which does the OP want (in case there are 
> repeated maxes)? I suspect the == forms, but ...?
> 
> Bert Gunter
> 
> On Fri, Dec 3, 2021 at 2:56 PM Rui Barradas  wrote:
>>
>> Hello,
>>
>> Inline.
>>
>> Às 22:08 de 03/12/21, Rich Shepard escreveu:
>>> On Fri, 3 Dec 2021, Rich Shepard wrote:
>>>
 I find solutions when the data_frame is grouped, but none when it's not.
>>>
>>> Thanks, Bert. ?which.max confirmed that's all I need to find the 
>>> maximum value.
>>>
>>> Now I need to read more than ?filter to learn why I'm not getting 
>>> the relevant row with:
 which.max(pdx_disc$cfs)
>>> [1] 8054
>>
>> This is the *index* for which cfs is the first maximum, not the 
>> maximum value itself.
>>
>>>
 filter(pdx_disc, cfs == 8054)
>>
>> Therefore, you probably want any of

Should be "one of", not "any of"

Rui Barradas

>>
>>
>> filter(pdx_disc, cfs == cfs[8054])
>>
>> filter(pdx_disc, cfs == cfs[which.max(cfs)])
>>
>> filter(pdx_disc, cfs == max(cfs))# I find this one better, simpler
>>
>>
>> Hope this helps,
>>
>> Rui Barradas
>>
>>
>>> # A tibble: 0 × 9
>>> # … with 9 variables: site_nbr , year , mon , day ,
>>> #   hr , min , tz , cfs , sampdt 
>>>
>>> Regards,
>>>
>>> Rich
>>>
>>> __
>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> __
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] checkpointing

2021-12-13 Thread Avi Gross via R-help
Jeff,

I am wondering what it even means to do what you say. In a compiled
language, I can imagine wrapping up an executable along with some kind of
run-time image (which may actually contain the parts of the executable that
includes what has not run yet) and revive it elsewhere.

But even there, how would it work if say the executable kept opening more
files for reading or appending and you move it to place those files did not
exists or had different contents or other such scenarios? What happens to
open pipes with another process attached, for an OS that supports pipes?
When you restart, the other processes re not there even if you supply an
image of a pipe and I am sure others can imagine much more.

R is interpreted. You could say the main interpreter may be like an
executable and there may be multiple threads active at the time you stop the
process and bundle it to be restarted later. But R has many fairly dynamic
features including some the interpreter has not even looked at yet. Besides
files it may want to open, there are any number of statements like
library(filename) it may encounter and of course other files it may
source(code) . In general, the info on what may be needed later is not in
any serious way bundled with the file and many things may be hard to predict
even with a look ahead as often arguments to functions are not evaluated
till some indefinite later time or even never. 

I am trying to imagine how you stop and restore say an R program running
connected to something like RSTUDIO which is also connected to a Python
program with data and instructions flowing back and forth.

It does not strike me as easy to make a reliable method to do this, albeit
as noted, there are operating systems that do allow you to suspend arbitrary
processes and restart them locally perhaps only before the system reboots.

But I can think of exceptions, including some I see others have thought of.
An example might be an R program that reads in lots of data, then makes
objects like data.frames and then pauses in some kind of nested loop that
will process the data while having the current indices saved in variables.
It could literally ask to be frozen so it starts up from there when asked
to. R can be set to intercept some signals and perhaps voluntarily save all
the variables as they are (including the data it may be operating on and
what it is making from it (as in what search items it has already found) as
well as the needed index info) and exit gracefully. If the application is
restarted, it might note the file with saved info and read in all the data
and continue from there. The above is not a serious proposal and has lots of
things that can go wrong, but I can imagine it as an app that sets itself up
doing heavy lifting once and later every time you want to do a search, it
loads the data and gets from you something to search for and does it quickly
and resuspends till needed. But this example is not exactly what you asked
for.

I have actually done weird things like the above including things that
simply start up again after a reboot as if nothing happened. 

What is a more interesting question for me is what R features might make
sense that help construct a program that is in some sense re-startable if
used right. I can imagine a package that lets you set things like a "level"
for debugging so that your code when started at some point says:

# initialize.
# load any left-in-file data if it exists.

if (level < 2) {
  do stuff
  level <- 2
  }

if (level < 3) {
  do more stuff
  level <- 3
  }

...

Something like the above might wrap parts in something like a "try()" that
intercepts some interrupt condition and saves the needed status info.

What I wonder is if long-running processes that can be up for months say in
a web-server, may already have ways to save all kinds of status info so when
they start up again after a normal reboot, are able to continue almost as if
nothing had happened.
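
As a toy illustration of that voluntary save-and-resume idea (the file name
and the layout of the saved state are invented for the example):

```r
ckpt  <- "state.rds"
state <- if (file.exists(ckpt)) readRDS(ckpt) else list(i = 1L, total = 0)

for (i in seq(state$i, 100)) {
  state$total <- state$total + i            # the "real work"
  state$i     <- i + 1L
  if (i %% 10 == 0) saveRDS(state, ckpt)    # checkpoint every 10 steps
}
unlink(ckpt)   # finished: discard the checkpoint file
state$total    # 5050 when run from scratch
```

If the process is killed part-way through, rerunning the script picks up from
the last multiple of ten rather than from the beginning.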


-Original Message-
From: R-help  On Behalf Of Jeff Newmiller
Sent: Monday, December 13, 2021 11:54 AM
To: Andy Jacobson ; Andy Jacobson via R-help
; r-help@r-project.org
Subject: Re: [R] checkpointing

This sounds like an OS feature, not an R feature... certainly not a portable
R feature.

On December 13, 2021 8:37:30 AM PST, Andy Jacobson via R-help
 wrote:
>Has anyone ever considered what it would take to implement checkpointing in
R, so that long-running processes could be interrupted and resumed later,
from a different process or even a different machine?
>
>Thanks,
>
>Andy
>

--
Sent from my phone. Please excuse my brevity.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Changing time intervals in data set

2021-12-15 Thread Avi Gross via R-help
I think Rich has shared aspects of the data before and may have forgotten we
want something here and now.

Besides a small sample of what the relevant columns look like and a
suggestion of what he wants some new column to look like, we probably need
more to understand what he wants.

The issue could be a bit like people who want to group their data by quarter,
for example, or by some other aspect such as when someone started and ended
one topic and switched to another. No way we can guess what he actually
wants.

What Rich writes may be perfectly clear to him but not others. It does sound
like there are periods people sit there and record measurements in seemingly
multiple (?contiguous) records with each recording the time at intervals
such as every five minutes, and/or 10 or 30. So a wild guess might be to
cluster them together, starting a new cluster wherever there is a GAP, i.e.
where the next record is far enough in time from the previous one. In
essence, the condition for a break seems to be that:

 time-of-current-record - time-of-previous-record > threshold

Where threshold may simply be thirty minutes, assuming that all the records
are also in the same series as in locations of measurement and do not
intertwine.

I assume, as usual, there are umpteen ways to deal with such sliding window
problems but am loathe to suggest any ideas till Rich has more clearly
defined the issue, perhaps by including a small amount of data in a format
trivial to copy/paste into our R implementation to play with and verify that
the solution seems to work.

But very loosely speaking, a simple sliding window of one might work. In
base R, you can use some form of loop, obviously, starting with column 2,
that perhaps uses a comparison from row N to row N-1 and sets some new
column value to something like 1 until it encounters a big enough gap when
it starts setting it to2 and so on. A later pass on the new data could use
grouping by that column, IF all of what I assume makes sense. 

And, of course, the tidyverse has perhaps easier to use functionality such
as their non-base functions of lag() and lead() used within something like
mutate()

https://dplyr.tidyverse.org/reference/lead-lag.html

But again, you need clearer requirements. You asked how to find when DATES
change. That is not the same as my guess as the date changes at midnight
local time so measures seconds apart would change. If you want to know when
clusters of non-overlapping measures change, that is another issue.

And what exactly do you want to do after determining when things change?
Depending on what you want, you may need a different way to solve the
initial problem. I mentioned the idea of grouping by another variable you
create as one such possibility. But many other solutions would not make a
grouping variable on every row, but insert some kind of cut mark in just the
first row or add a special row between groups and anything else your
imagination supplies.

Clearly, you do not want us to solve the entire problem you are working on,
but more context may get you answers to the specific thing you are working
on. And, note that adding a new time column may not be required as they can
be created on the fly too in some places, given the other columns. But it
does help to have it in place, at least for a while, if you want to provide
answers such as how many measures were made in what total amount of time
(first to last.)
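
A base-R sketch of the gap-based grouping guessed at above (the timestamps
and the threshold are invented):

```r
# A new group starts wherever the time difference to the previous
# record exceeds the threshold.
times <- as.POSIXct("2021-01-01 00:00:00", tz = "UTC") +
  60 * c(0, 5, 10, 50, 55, 120, 125)   # minutes from the start
threshold <- 30 * 60                    # 30 minutes, in seconds

new_group <- c(FALSE, diff(as.numeric(times)) > threshold)
group     <- cumsum(new_group) + 1
group   # 1 1 1 2 2 3 3
```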



-Original Message-
From: R-help  On Behalf Of jim holtman
Sent: Wednesday, December 15, 2021 1:05 PM
To: Rich Shepard 
Cc: R mailing list 
Subject: Re: [R] Changing time intervals in data set

At least show a sample of the data and then what you would like as output.

Thanks

Jim Holtman
*Data Munger Guru*


*What is the problem that you are trying to solve?Tell me what you want to
do, not how you want to do it.*


On Wed, Dec 15, 2021 at 6:40 AM Rich Shepard 
wrote:

> A 33-year set of river discharge data at one gauge location has 
> recording intervals of 5, 10, and 30 minutes over the period of record.
>
> The data.frame/tibble has columns for year, month, day, hour, minute, 
> and datetime.
>
> Would difftime() allow me to find the dates when the changes occurred?
>
> TIA,
>
> Rich
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Sum every n (4) observations by group

2021-12-19 Thread Avi Gross via R-help
Milu,

Your data seems to be very consistent in that each value of ID has eight
rows. You seem to want to just sum every four so that fits:

   ID Date   Value
1   A 4140 0.000207232
2   A 4141 0.000240141
3   A 4142 0.000271414
4   A 4143 0.000258384
5   A 4144 0.000243640
6   A 4145 0.000271480
7   A 4146 0.000280585
8   A 4147 0.000289691
9   B 4140 0.000298797
10  B 4141 0.000307903
11  B 4142 0.000317008
12  B 4143 0.000326114
13  B 4144 0.000335220
14  B 4145 0.000344326
15  B 4146 0.000353431
16  B 4147 0.000362537
17  C 4140 0.000371643
18  C 4141 0.000380749
19  C 4142 0.000389854
20  C 4143 0.000398960
21  C 4144 0.000408066
22  C 4145 0.000417172
23  C 4146 0.000426277
24  C 4147 0.000435383

There are many ways to do what you want, some more general than others, but
one trivial way is to add a column of 24 numbers running from 1 to 6, each
repeated four times, assuming mydf holds the above.

Here is an example of such a vector:

rep(1:(nrow(mydf)/4), each=4)
 [1] 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6

So you can add a column like:

> mydf$fours <- rep(1:(nrow(mydf)/4), each=4)
> mydf
   ID Date   Value fours
1   A 4140 0.000207232 1
2   A 4141 0.000240141 1
3   A 4142 0.000271414 1
4   A 4143 0.000258384 1
5   A 4144 0.000243640 2
6   A 4145 0.000271480 2
7   A 4146 0.000280585 2
8   A 4147 0.000289691 2
9   B 4140 0.000298797 3
10  B 4141 0.000307903 3
11  B 4142 0.000317008 3
12  B 4143 0.000326114 3
13  B 4144 0.000335220 4
14  B 4145 0.000344326 4
15  B 4146 0.000353431 4
16  B 4147 0.000362537 4
17  C 4140 0.000371643 5
18  C 4141 0.000380749 5
19  C 4142 0.000389854 5
20  C 4143 0.000398960 5
21  C 4144 0.000408066 6
22  C 4145 0.000417172 6
23  C 4146 0.000426277 6
24  C 4147 0.000435383 6

You now use grouping any way you want to apply a function and in this case
you want a sum. I like to use the tidyverse functions so will show that as
in:

mydf %>%
  group_by(ID, fours) %>%
  summarize(sums=sum(Value), n=n())

I threw in the extra n column in case your data sometimes does not have exactly
four rows at the end of one group or the beginning of the next. Here is the
output:

# A tibble: 6 x 4
# Groups:   ID [3]
  ID    fours     sums     n
  <chr> <int>    <dbl> <int>
1 A         1 0.000977     4
2 A         2 0.00109      4
3 B         3 0.00125      4
4 B         4 0.00140      4
5 C         5 0.00154      4
6 C         6 0.00169      4

Of course there are all kinds of ways to do this in standard R, including
trivial ones like looping over indices starting at 1 and taking four at a
time and getting the Value data for mydf$Value[N] + mydf$Value[N+1] ...
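For completeness, the same grouped sums can be sketched in base R with aggregate(); mydf below is the 24-row data.frame from the original post:

```r
# Build the data from the original post, add the grouping column, and sum
# every four rows within each ID using aggregate() instead of dplyr:
mydf <- structure(list(
  ID = rep(c("A", "B", "C"), each = 8),
  Date = rep(4140:4147, times = 3),
  Value = c(0.000207232, 0.000240141, 0.000271414, 0.000258384,
            0.00024364, 0.00027148, 0.000280585, 0.000289691,
            0.000298797, 0.000307903, 0.000317008, 0.000326114,
            0.00033522, 0.000344326, 0.000353431, 0.000362537,
            0.000371643, 0.000380749, 0.000389854, 0.00039896,
            0.000408066, 0.000417172, 0.000426277, 0.000435383)),
  class = "data.frame", row.names = c(NA, -24L))

mydf$fours <- rep(1:(nrow(mydf) / 4), each = 4)   # 1 1 1 1 2 2 2 2 ... 6 6 6 6
res <- aggregate(Value ~ ID + fours, data = mydf, FUN = sum)
res   # six rows, one sum per (ID, fours) group
```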



-Original Message-
From: R-help  On Behalf Of Miluji Sb
Sent: Sunday, December 19, 2021 1:32 PM
To: r-help mailing list 
Subject: [R] Sum every n (4) observations by group

Dear all,

I have a dataset (below) by ID and time sequence. I would like to sum every
four observations by ID.

I am confused how to combine the two conditions. Any help will be highly
appreciated. Thank you!

Best.

Milu

## Dataset
structure(list(ID = c("A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "C", "C"), Date =
c(4140L, 4141L, 4142L, 4143L, 4144L, 4145L, 4146L, 4147L, 4140L, 4141L,
4142L, 4143L, 4144L, 4145L, 4146L, 4147L, 4140L, 4141L, 4142L, 4143L, 4144L,
4145L, 4146L, 4147L ), Value = c(0.000207232, 0.000240141, 0.000271414,
0.000258384, 0.00024364, 0.00027148, 0.000280585, 0.000289691, 0.000298797,
0.000307903, 0.000317008, 0.000326114, 0.00033522, 0.000344326, 0.000353431,
0.000362537, 0.000371643, 0.000380749, 0.000389854, 0.00039896, 0.000408066,
0.000417172, 0.000426277, 0.000435383 )), class = "data.frame", row.names =
c(NA, -24L))


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] Adding SORT to UNIQUE

2021-12-20 Thread Avi Gross via R-help
Stephen,

You can sort using sort() either before or after doing a unique(). Unique
removes all duplicates in any order, so sorting before may be wasteful. In
your data shown below, do this:

sort(unique(Data[1]))
sort(unique(Data[2]))
sort(unique(Data[3]))
sort(unique(Data[4]))

Even simpler is to define a function like this:

unisort <- function(vec) sort(unique(vec))

and use it like this:

unisort(Data[1])


And since you seem to want all of the first four columns of Data you may
want to do them all at once using something like:

lapply(Data[1:4], unisort)

As you note, the sort ordering depends on the data and perhaps on options
you specify.
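A minimal self-contained illustration of the above, with a made-up data frame (note that lapply() passes each column to unisort as a plain vector):

```r
# sort-the-unique-values helper applied to every column of a data.frame
unisort <- function(vec) sort(unique(vec))

Data <- data.frame(a = c(3, 1, 3, 2), b = c("x", "x", "b", "a"))
lapply(Data, unisort)
# $a: 1 2 3     $b: "a" "b" "x"
```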

-Original Message-
From: R-help  On Behalf Of Stephen H. Dawson,
DSL via R-help
Sent: Monday, December 20, 2021 11:59 AM
To: Stephen H. Dawson, DSL via R-help 
Subject: [R] Adding SORT to UNIQUE

Hi,


Running a simple syntax set to review entries in dataframe columns. Here is
the working code.

Data <- read.csv("./input/Source.csv", header=T)
describe(Data)
summary(Data)
unique(Data[1])
unique(Data[2])
unique(Data[3])
unique(Data[4])

I would like to sort the unique entries. The data in the various columns
are not all defined as numbers; some are text. I realize 1 and 10 will not sort
properly, as the column is not defined as a number, but I want to see what I
have in the columns viewed as sorted.

QUESTION
What is the best process to sort unique output, please?


Thanks.
--
*Stephen Dawson, DSL*
/Executive Strategy Consultant/
Business & Technology
+1 (865) 804-3454
http://www.shdawson.com 

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] Adding SORT to UNIQUE

2021-12-20 Thread Avi Gross via R-help
Stephen,

 

Sorry about that. I tried modifying what you had and realized that the use of
[ ] returns a data.frame; you need [[ ]] to return a vector.

 

Try this:

 

sort(unique(Data[[1]]))
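A tiny made-up example showing why the double brackets matter here:

```r
# [ ] keeps the data.frame wrapper; [[ ]] extracts the underlying vector,
# which is what sort() expects.
df <- data.frame(x = c(2, 1, 2))
class(df[1])           # "data.frame" -- sort() errors on this
class(df[[1]])         # "numeric"
sort(unique(df[[1]]))  # 1 2
```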

 

 

From: Stephen H. Dawson, DSL  
Sent: Monday, December 20, 2021 12:32 PM
To: Avi Gross ; r-help@r-project.org
Subject: Re: [R] Adding SORT to UNIQUE

 

Thanks for the reply.

sort(unique(Data[1]))
Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = 
decreasing)) : 
  undefined columns selected

The recommended syntax did not work, as listed above.



Stephen Dawson, DSL
Executive Strategy Consultant
Business & Technology
+1 (865) 804-3454
http://www.shdawson.com



On 12/20/21 12:15 PM, Avi Gross via R-help wrote:

Stephen,
 
You can sort using sort() either before or after doing a unique(). Unique
removes all duplicates in any order, so sorting before may be wasteful. In
your data shown below, do this:
 
sort(unique(Data[1]))
sort(unique(Data[2]))
sort(unique(Data[3]))
sort(unique(Data[4]))
 
Even simpler is to define a function like this:
 
unisort <- function(vec) sort(unique(vec))
 
and use it like this:
 
unisort(Data[1])
 
 
And since you seem to want all of the first four columns of Data you may
want to do them all at once using something like:
 
lapply(Data[1:4], unisort)
 
As you note, the sort ordering depends on the data and perhaps on options
you specify.
 
-Original Message-
From: R-help  On Behalf Of Stephen H. Dawson,
DSL via R-help
Sent: Monday, December 20, 2021 11:59 AM
To: Stephen H. Dawson, DSL via R-help 
Subject: [R] Adding SORT to UNIQUE
 
Hi,
 
 
Running a simple syntax set to review entries in dataframe columns. Here is
the working code.
 
Data <- read.csv("./input/Source.csv", header=T)
describe(Data)
summary(Data)
unique(Data[1])
unique(Data[2])
unique(Data[3])
unique(Data[4])
 
I would like to sort the unique entries. The data in the various columns
are not all defined as numbers; some are text. I realize 1 and 10 will not sort
properly, as the column is not defined as a number, but I want to see what I
have in the columns viewed as sorted.
 
QUESTION
What is the best process to sort unique output, please?
 
 
Thanks.
--
*Stephen Dawson, DSL*
/Executive Strategy Consultant/
Business & Technology
+1 (865) 804-3454
http://www.shdawson.com
 
__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.







__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Adding SORT to UNIQUE

2021-12-20 Thread Avi Gross via R-help
Duncan and Martin and ...,

There are multiple groups where people discuss R and this seems to be a help 
group. The topic keeps coming up as to whether you should teach anything other 
than base R and I claim it depends. 

Many packages are indeed written mostly using base R or using other packages 
that ultimately use only base R constructions. So, obviously, those packages 
can be gotten around by re-writing your own complicated code to do the things 
you want. Some have parts written using external libraries such as parts 
written in C and that makes it harder.

I think the debate is not one-sided, and the whole purpose of creating and 
sharing libraries is to make life easier for those who see tools that already 
exist and perhaps have been tested by others and deemed worthy. For people 
with serious programming needs, I suggest that often 80% of your code can be 
made simpler, more powerful, AND easier to read than writing many times as 
much fairly complex code in the base language.

BUT, if someone is teaching a language like R one step at a time and a student 
asks a question, it is very wrong to tell them of a package to do all the work 
when clearly they need practice using a small subset of base functionality. If 
they are asked to implement a sort algorithm using say a recursive merge sort, 
then the assignment involves using functions that call themselves and so on.

So I was thinking of how I might have dealt with finding the unique members of 
a vector (called "temp" below) containing character text such as "2.2", "99.6", 
"10.1", and "1.1", where every value is potentially numeric and you want them 
sorted as if they were numeric but then returned as character data. My first 
attempt at it was this:

as.character(sort(as.numeric(unique(temp))))

Then I considered: what if there is no guarantee that everything in temp can be 
coerced by as.numeric()? How do I handle that? Extending temp with a 
non-numeric entry does not fail outright; the entry is coerced to NA with a 
warning:

  > as.numeric(c(temp, "blab")) -> temp2
  Warning message:
  NAs introduced by coercion

So to handle that I might change the one-liner into multiple lines and check 
for such coercion failures in one of many ways, such as testing for newly 
introduced NAs. But then what is the requirement if coercion fails? One 
possibility might be to remove such entries and another is to add them back 
after sorting the rest, but add them back where? If there are many, they all 
need to be sorted. Or should they simply remain NA, which is tolerated?
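One hedged sketch of the "remove such entries" choice: coerce quietly with suppressWarnings() and rely on sort() dropping NA by default.

```r
# Sort potentially-numeric character data numerically, silently discarding
# anything that cannot be coerced ("blab" here becomes NA and is dropped):
temp <- c("2.2", "99.6", "10.1", "1.1", "blab")
num  <- suppressWarnings(as.numeric(unique(temp)))
as.character(sort(num))
# "1.1" "2.2" "10.1" "99.6"
```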

Get the idea?

It can take non-trivial work. Now if the package was made carefully, perhaps 
with options letting you specify among such choices as above, and it is 
trusted, why not use that if your needs may be complex?

One final note. In some places the functionality of unique() is done by a 
method that first sorts the data and then simply removes duplicates of the same 
thing that are adjacent. In those cases, the data would come out sorted or in 
reverse sort order. But that is not what the built-in unique() does as it 
basically keeps a list of what it has encountered and each new item is kept 
only if it is not already in the list. This produces output in the same order 
it entered. Think of how factors normally are done in R. The first item found 
is stored usually as a 1 then the next unique item as a 2 and so on but 
duplicates all share the same integer value and thus the levels only need be 
stored once. If you want them stored sorted, there are fairly easy ways to do 
that, but there are packages like forcats that do such things often more 
consistently and more easily. In a sense, you can do a unique by simply making 
your data into a factor and asking for levels(whatever), and optionally sorting 
them. Effectively, you are using it as a kind of set, and there are ways to use 
sets in R as well.
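The factor route described above can be shown in two lines (the levels of a default factor already come back sorted):

```r
# unique() preserves first-appearance order; factor levels are sorted.
x <- c("b", "a", "b", "c")
unique(x)          # "b" "a" "c"
levels(factor(x))  # "a" "b" "c"
```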

-Original Message-
From: R-help  On Behalf Of Duncan Murdoch
Sent: Monday, December 20, 2021 12:51 PM
To: Martin Maechler ; Rui Barradas 

Cc: Stephen H. Dawson, DSL via R-help 
Subject: Re: [R] Adding SORT to UNIQUE

On 20/12/2021 12:32 p.m., Martin Maechler wrote:
>> Rui Barradas
>>  on Mon, 20 Dec 2021 17:05:33 + writes:
> 
>  > Hello,
>  > Package stringr has functions str_sort and str_order, both with an
>  > argument 'numeric' that will sort the numbers correctly.
>  > Maybe that's what you are looking for, see the example below.
> 
> 
>  > x <- sample(sprintf("ab%d", 1:20)) # shuffle the vector
>  > stringr::str_sort(x, numeric = TRUE)   # sort considering the numbers
> 
> Again:
> There's really no need to use non-base R here (and in almost all such 
> questions about string handling!) as Avi Gross' answer shows.

That gives a different sort order:

  stringr::str_sort(x, numeric = TRUE)

gives

  [1] "ab1"  "ab2"  "ab3"  "ab4"  "ab5"  "ab6"  "ab7"  "ab8"  "ab9" 
"ab10" "ab11" "ab12" "ab13" "ab14" "ab15" "ab16" "ab17"
[18] "ab18" "ab19" "ab20"

(with the num

Re: [R] Adding SORT to UNIQUE

2021-12-21 Thread Avi Gross via R-help
Stephen,

Languages have their own philosophies and are often focused initially on doing 
specific things well. Later, they tend to accumulate additional functionality 
both in the base language and extensions.

I am wondering if you have explained your need precisely enough to get the 
answers you want. 

SQL and Python have their own ways and both have advantages but also huge 
deficiencies relative to just base R. 

But there are rules you live with, and if you choose, say, a data.frame to 
store things in, the columns must all be the same length. The unique members of 
the columns of one data.frame are likely not to be the same in number, so 
storing them in a data.frame does not work. They can be stored quite a few 
other ways, such as a list of lists.

And what is your definition of ease? I can program in Python and SQL and way 
over a hundred other languages and I know I need to adapt my thinking to the 
flow of the language and not the other way around. Base R was not designed to 
be like either SQL or Python. But it can be extended quite a few ways to do 
just about anything.

What you ran into for example is the fact that some functionality is more 
selective in what it works on. A data.frame with one column is logically the 
same as a matrix with one column and as a vector but in reality, they are not 
the same thing. Yes, they can be converted into each other fairly trivially. 
Sort() seems to care what you feed it. If you did not worry about efficiency, 
you could have a version of sort that accepts a wide variety of inputs, 
converts any it can to some possibly common internal form, then converts the 
output back into the form it was received in, or uses a command-line option to 
specify the output format. It is not hard in R to make such a function as it 
has the primitives needed to examine an arbitrary object and see what 
dimensions it has for some number of types and so on, and has utilities to do 
the conversion.
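A hedged sketch of the tolerant sort wrapper described above; sort_any is a made-up name, and only one-column data.frames and matrices are handled here:

```r
# Accept a vector, one-column data.frame, or one-column matrix; reduce the
# input to a vector, then delegate to sort().
sort_any <- function(x, ...) {
  if (is.data.frame(x)) {
    stopifnot(ncol(x) == 1)
    x <- x[[1]]          # extract the single column as a vector
  } else if (is.matrix(x)) {
    stopifnot(ncol(x) == 1)
    x <- x[, 1]          # drop the matrix dimension
  }
  sort(x, ...)
}

sort_any(data.frame(v = c(3, 1, 2)))  # 1 2 3
```

A fuller version might also convert the output back to the input's shape, as the text suggests; this sketch stops at returning a sorted vector.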

If you want a language that has calculated every possible combination of ways 
to combine functions and already made tens of thousands available, good luck. 
What languages (including Python and R) expect is for you to compose such 
combinations yourself in one of many ways. The annoying discussions here 
between purists and those wanting to use pre-made packages aside, your question 
can be handled in many of the ways we already discussed. They include making 
your own (often very small) function that implements consolidating the many 
steps into one logical step. It can mean using pipelines like the new "|>" 
operator recently added to base R or the older versions often used in the 
tidyverse packages like "%>%".

You want to take a data.frame, select one column at a time, and ask for it to 
be made into unique values, then ordered and shown. So you want a VECTOR, and 
your initial use of the "[" operator does not take the underlying list 
structure of a data.frame apart the way you might have thought; it returns a 
narrow data.frame. So you MAY need to either extract the column using "[[" or 
use various routines R supplies like unlist() or as.vector().

Here is a pipeline using this as my data:

mydf <- data.frame(ints=c(5,4,3,3,4,5), chars=c("z","i","t","s","t","i"))

Note that the number of unique items differs between the columns, as does the
data type:

  > mydf
    ints chars
  1    5     z
  2    4     i
  3    3     t
  4    3     s
  5    4     t
  6    5     i

To handle the columns one at a time can be done using a pipeline like:

  > mydf[2] |> unlist() |> unique() |> sort()
  [1] "i" "s" "t" "z"
  > mydf[1] |> unlist() |> unique() |> sort()
  [1] 3 4 5

The above takes a two-column data.frame and restricts it into a one-column 
data.frame and then passes the new temporary variable/object into the command 
line of the unlist() function which returns an object (again temporary) which 
is a  vector (in one case numeric and in the other character) and then that 
result is passed into the command line of unique() which returns a shorter 
vector in the same order and then you pass it on to sort() which reorders it. 

Note the first steps can be shortened if using the "[[" notation or by using 
the named way of asking for a column:

  > mydf[[1]] |> unique() |> sort()
  [1] 3 4 5
  > mydf$ints |> unique() |> sort()
  [1] 3 4 5

But pipelines are simply syntactic sugar mostly so you also can just nest 
function calls as in sort(unique(unlist(mydf[1]))) or do what I showed earlier 
of creating a function that does the work invisibly and call that.

Python often does its own version of pipelines by adding a dot at the end and 
calling a method, and if needed another dot calling a method on the resulting 
object, and so on. But that is arguably more limiting in some ways and more 
powerful in others. Different paradigms. In R, you do not write 
object.method1(args).method2(args).method3(args), so a pipeline operator is 
used to do something related.

Now if your need was to do your operation on an entire data.frame at once, then 
somet

Re: [R] Adding SORT to UNIQUE

2021-12-21 Thread Avi Gross via R-help
Duncan,

Let's not go there discussing the trouble with tibbles when the topic asked how 
to do things in more native R.

The reality is that tibbles, when used in the tidyverse, often use somewhat 
different ways to select the columns you want, including some quite 
sophisticated ones like:

select(mydf, wed:fri, ends_with(".xyz"), everything())

So it is often not really used to select columns by number, though you can do 
that too. What you are talking about is using [ ] notation, which is often not 
needed because you use verbs like filter() and select() instead.

I find it often way more intuitive to solve things the dplyr way but I agree 
you sometimes want to convert tibbles back to data.frames before using base R 
techniques on them.

-Original Message-
From: R-help  On Behalf Of Duncan Murdoch
Sent: Tuesday, December 21, 2021 12:03 PM
To: Jeff Newmiller ; r-help@r-project.org; 
serv...@shdawson.com; Rui Barradas 
Subject: Re: [R] Adding SORT to UNIQUE

On 21/12/2021 11:59 a.m., Jeff Newmiller wrote:
> Intuitive, perhaps, but noticably slower. And it doesn't work on tibbles by 
> design. Data frames are lists of columns.

That's just one of the design flaws in tibbles, but not the worst one.

Duncan Murdoch

> 
> On December 21, 2021 8:38:35 AM PST, Duncan Murdoch 
>  wrote:
>> On 21/12/2021 11:31 a.m., Duncan Murdoch wrote:
>>> On 21/12/2021 11:20 a.m., Stephen H. Dawson, DSL wrote:
 Thanks for the reply.

 sort(unique(Data[1]))
 Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing =
 decreasing)) :
   undefined columns selected
>>>
>>> That's the wrong syntax:  Data[1] is not "column one of Data".  Use 
>>> Data[[1]] for that, so
>>>
>>>  sort(unique(Data[[1]]))
>>
>> Actually, I'd probably recommend
>>
>>sort(unique(Data[, 1]))
>>
>> instead.  This treats Data as a matrix rather than as a list.
>> Dataframes are lists that look like matrices, but to me the matrix 
>> aspect is usually more intuitive.
>>
>> Duncan Murdoch
>>
>>>
>>> I think Rui already pointed out the typo in the quoted text below...
>>>
>>> Duncan Murdoch
>>>

 The recommended syntax did not work, as listed above.

 What I want is the sort of distinct column output. Again, the 
 column may be text or numbers. This is a huge analysis effort with 
 data coming at me from many different sources.


 *Stephen Dawson, DSL*
 /Executive Strategy Consultant/
 Business & Technology
 +1 (865) 804-3454
 http://www.shdawson.com 


 On 12/21/21 11:07 AM, Duncan Murdoch wrote:
> On 21/12/2021 10:16 a.m., Stephen H. Dawson, DSL via R-help wrote:
>> Thanks everyone for the replies.
>>
>> It is clear one either needs to write a function or put the 
>> unique entries into another dataframe.
>>
>> It seems odd R cannot sort a list of unique column entries with ease.
>> Python and SQL can do it with ease.
>
> I've seen several responses that looked pretty simple.  It's hard 
> to beat sort(unique(x)), though there's a fair bit of confusion 
> about what you actually want.  Maybe you should post an example of 
> the code you'd use in Python?
>
> Duncan Murdoch
>
>>
>> QUESTION
>> Is there a simpler means than other than the unique function to 
>> capture distinct column entries, then sort that list?
>>
>>
>> *Stephen Dawson, DSL*
>> /Executive Strategy Consultant/
>> Business & Technology
>> +1 (865) 804-3454
>> http://www.shdawson.com 
>>
>>
>> On 12/20/21 5:53 PM, Rui Barradas wrote:
>>> Hello,
>>>
>>> Inline.
>>>
>>> Às 21:18 de 20/12/21, Stephen H. Dawson, DSL via R-help escreveu:
 Thanks.

 sort(unique(Data[[1]]))

 This syntax provides row numbers, not column values.
>>>
>>> This is not right.
>>> The syntax Data[1] extracts a sub-data.frame, the syntax 
>>> Data[[1]] extracts the column vector.
>>>
>>> As for my previous answer, it was not addressing the question, I 
>>> misinterpreted it as being a question on how to sort by numeric 
>>> order when the data is not numeric. Here is a, hopefully, complete 
>>> answer.
>>> Still with package stringr.
>>>
>>>
>>> cols_to_sort <- 1:4
>>>
>>> Data2 <- lapply(Data[cols_to_sort], \(x){
>>>   stringr::str_sort(unique(x), numeric = TRUE)
>>> })
>>>
>>>
>>> Or using Avi's suggestion of writing a function to do all the 
>>> work and simplify the lapply loop later,
>>>
>>>
>>> unisort2 <- function(vec, ...) stringr::str_sort(unique(vec), 
>>> ...)
>>> Data2 <- lapply(Data[cols_to_sort], unisort2, numeric = TRUE)
>>>
>>>
>>> Hope this helps,
>>>
>>> Rui Barradas
>>>
>>>

 *Stephen Dawson, DSL*
 /Executive Strategy

Re: [R] Creating NA equivalent

2021-12-21 Thread Avi Gross via R-help
I wonder if the package Adrian Dușa created might be helpful or point you along 
the way.

It was eventually named "declared" 

https://cran.r-project.org/web/packages/declared/index.html

With a vignette here:

https://cran.r-project.org/web/packages/declared/vignettes/declared.pdf

I do not know if it would easily satisfy your needs but it may be a step along 
the way. A package called Haven was part of the motivation and Adrian wanted a 
way to import data from external sources that had more than one category of NA 
that sounds a bit like what you want. His functions should allow the creation 
of such data within R, as well. I am including him in this email if you want to 
contact him or he has something to say.
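A very small sketch of the LDL/UDL idea from the quoted question below. This is only an illustration, not the approach taken by the declared package or the linked code: carry the censoring information in a class attached to an NA value.

```r
# A class-tagged NA: it still propagates like NA, but can be tested with
# a dedicated predicate. (Note: c() strips the class, so a real
# implementation needs more machinery than this.)
LDL <- structure(NA_real_, class = c("ldl", "numeric"))
is.LDL <- function(x) inherits(x, "ldl")

is.LDL(LDL)   # TRUE
is.na(LDL)    # TRUE -- still behaves like a missing value
```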


-Original Message-
From: R-help  On Behalf Of Duncan Murdoch
Sent: Tuesday, December 21, 2021 5:26 AM
To: Marc Girondot ; r-help@r-project.org
Subject: Re: [R] Creating NA equivalent

On 20/12/2021 11:41 p.m., Marc Girondot via R-help wrote:
> Dear members,
> 
> I work with dosage data and some values are below the detection limit. I 
> would like to create new "numbers" like LDL (to represent "lower than the 
> detection limit") and UDL ("upper than the detection limit") that behave like 
> NA, with the possibility to test them using for example is.LDL() or 
> is.UDL().
> 
> Note that NA is not the same than LDL or UDL: NA represent missing data.
> Here the data is available as LDL or UDL.
> 
> NA is built in R language very deep... any option to create new 
> version of NA-equivalent ?
> 

There was a discussion of this back in May.  Here's a link to one approach that 
I suggested:

   https://stat.ethz.ch/pipermail/r-devel/2021-May/080776.html

Read the followup messages, I made at least one suggested improvement. 
I don't know if anyone has packaged this, but there's a later version of the 
code here:

   https://stackoverflow.com/a/69179441/2554330

Duncan Murdoch

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


