[R] NAs produced by integer overflow, but only some time ...

2018-05-08 Thread Stefan Th. Gries
I have problem with integer overflow that I cannot understand.

I have a character vector curr.lemmas with the following properties:

length(curr.lemmas) # 61224
length(unique(curr.lemmas)) # 2652

That vector is the input to the following function:

yules.k1 <- function(input) {
   m1 <- length(input); temp <- table(table(input))
   m2 <- sum("*"(temp, as.numeric(names(temp))^2))
   return(1*(m2-m1) / (m1*m1))
}

When I run this, I get the following output:

[1] NA
Warning message:
In m1 * m1 : NAs produced by integer overflow

But when I change the function to this one by just replacing m1*m1 by m1^2 ...

yules.k2 <- function(input) {
   m1 <- length(input); temp <- table(table(input))
   m2 <- sum("*"(temp, as.numeric(names(temp))^2))
   return(1*(m2-m1) / (m1^2))
}

yules.k2(curr.lemmas) # -> 157.261

I am using RStudio 1.1.447 and here's my sessionInfo
##
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
LC_MONETARY=en_US.UTF-8
 [6] LC_MESSAGES=en_US.UTF-8LC_PAPER=en_US.UTF-8   LC_NAME=C
   LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

loaded via a namespace (and not attached):
 [1] compiler_3.4.4  backports_1.1.2 magrittr_1.5rprojroot_1.3-2
htmltools_0.3.6 tools_3.4.4 yaml_2.1.19 Rcpp_0.12.16
stringi_1.2.2
[10] rmarkdown_1.9   knitr_1.20  stringr_1.3.0   digest_0.6.15
evaluate_0.10.1
##

What is even more puzzling is that one time I ran R in the console of
Geany and this happened:

> m1
[1] 61224
> 61224*61224
[1] 3748378176
> 61224^2
[1] 3748378176
> m1*m1
[1] NA
Warning message:
In m1 * m1 : NAs produced by integer overflow
> m1^2
[1] 3748378176

That is, the multiplication worked with the numbers but not the
numeric vectors; the above is literally copied from the console. Why
is that happening?

Any help would be much appreciated!
STG
--
Stefan Th. Gries
--
Univ. of California, Santa Barbara
http://tinyurl.com/stgries

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] NAs produced by integer overflow, but only some time ...

2018-05-09 Thread Stefan Th. Gries
Before responding to Jeff's posting, let me reiterate my question: Why
does a function using m1*m1 produce an integer overflow, but m1^2 does
not?

As for Jeff's 'response':

> a) Numeric values may be either integers (signed 32 bit) or double precision 
> (53 bit mantissa).
> b) Double precision constants are numeric with no decoration (e.g. 61224). 
> Integer constants have an L (e.g. 61224L).
> c) 61224*61224 > 2^31-1 so that answer cannot fit into an integer.
> d) Exponentiation is a floating point operation so the result of 61224L^2L is 
> a floating point answer that CAN fit into the 53bit mantissa of a double 
> precision value, so no overflow occurs.
Yes, that's all great and I knew that from
.

> e) Defining a function like yules.k1 and never showing how you called it does 
> not constitute a reproducible example. To avoid such gaffes you can use the 
> reprex package to confirm that the errors shown in your question are in fact 
> reproducible.
Responding to a post and never seeing that the provided code does
actually show how I call the function does not constitute a useful
answer. To avoid such gaffes you can use your reading skills to
confirm that the perceived lack of a function call is in fact such a
lack. In addition, typing m1 <- 61224 makes the multiplication example
that I shows in the bottom part of the posting reproducible ...

> f) On this mailing list, the fact that you are using RStudio is at best 
> irrelevant, and at worst off-topic. If you don't see problems running your 
> reproducible example from R in the terminal then the question probably 
> belongs in the RStudio support forum. This is another reason to use the 
> reprex package to check your reproducibility (this works even if you invoke 
> it from RStudio).
I did provide the information for the sake of comprehensiveness and I
did mention that the problem also showed up in the console; the whole
second part of the post was on that.

> g) Calling table on the result of table must be one of the more bizarre 
> calculation sequences I have ever seen in R. I hope you are getting the 
> answers you are expecting when you do use double precision numeric values. 
> Also, using the prefix form of multiplication is unnecessarily obscure, and 
> your use of the return function at the end of your function is redundant.
On this mailing list, your assessment of calculation sequences and
their comparison to others you have seen is at best irrelevant and at
worst off-topic since it doesn't answer the question. I didn't ask
(you or anyone) to grade my code and there are reasons why "*" and
return where used there as they are) but to answer the question why
m1*m1 returned an error and m1^2 does not.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] NAs produced by integer overflow, but only some time ...

2018-05-09 Thread Stefan Th. Gries
> You are right that various arithmetic operators map a pair of integer 
> arguments to various type: the power and division operators map them to 
> double precision while the the addition, multiplication, and subtraction 
> operators map them to integer results (giving NA's if the result cannot fit 
> into 32 bits).
Ah, ok, _that_ explains it, thanks a lot, I did not know that, which
is why it never occurred to me to check str(m1)!

> As for table(table(x)) being an unnatural construct, I use it all the time 
> instead of anyDuplicated to see the pattern of duplications.
Thanks for this, too.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] how to count occurrences of string?

2009-09-21 Thread Stefan Th. Gries
This should do what you want:

set.seed(1)
# cerating a vector with comments, in which d's will be counted
comments<-character(25)
for (i in 1:25) {
   comments[i] <- paste(sample(letters[1:10], 10, replace=T), collapse="")
}

# creating a vector to cross -tabulate
age <-sample(c("old", "young"), 25, replace=T)

# find d's in comments, determine their frequencies, and crosstab
table(sapply(gregexpr("d", comments), length), age)

HTH,
STG

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] break up a string into strings with a fixed length

2009-10-02 Thread Stefan Th. Gries
This should do what you want:

x<-"abcdefghijkl"
strsplit(x, "(?<=...)", perl=T)

HTH,
STG
--
Stefan Th. Gries
---
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Hierarchical multi-level model with lmer: why are the highest-level random adjustments 0?

2013-07-07 Thread Stefan Th. Gries
Hi all

I have a hopefully not too stupid question about multi-level /
mixed-effects modeling. I was trying to test a strategy from Crawley's
2013 R Book on a data set with the following structure:

- dependent variable: CONSTRUCTION (a factor with 2 levels)
- independent fixed effect: LENGTH (an integer in the interval [1, 61])
- random effects with the following hierarchical structure: MODE >
REGISTER > SUBREGISTER > FILE. Specifically:

MODE: S
  REGISTER: monolog
SUBREGISTER: scripted
SUBREGISTER: unscripted
  REGISTER: dialog
SUBREGISTER: private
SUBREGISTER: public
  REGISTER: mix
SUBREGISTER: broadcast
MODE: W
  REGISTER: printed
SUBREGISTER: academic
SUBREGISTER: creative
SUBREGISTER: instructional
SUBREGISTER: nonacademic
SUBREGISTER: persuasive
SUBREGISTER: reportage
  REGISTER: nonprinted
SUBREGISTER: letters
SUBREGISTER: nonprofessional

with various levels of FILE in each level of SUBREGISTER. Here's the
head of the relevant data frame (best viewed with a non-proportional
font):

  CASE MODE   REGISTER SUBREGISTERFILE CONSTRUCTION LENGTH
11Wprinted   reportage W2C-002 V_P_DO   11
22Wprinted nonacademic W2B-035 V_P_DO8
33W nonprinted nonprofessional W1A-014 V_P_DO8
44Wprinted   reportage W2C-005 V_P_DO6
55S dialog private S1A-073 V_DO_P2
66S dialog private S1A-073 V_DO_P2

And here's the unique-types distribution of FILE in the design:
tapply(FILE, list(SUBREGISTER, REGISTER, MODE), function (qwe)
length(unique(qwe)))

, , S
dialog mix monolog nonprinted printed
academic .   .   .  .   .
broadcast.  20   .  .   .
creative .   .   .  .   .
instructional.   .   .  .   .
letters  .   .   .  .   .
nonacademic  .   .   .  .   .
nonprofessional  .   .   .  .   .
persuasive   .   .   .  .   .
private 96   .   .  .   .
public  77   .   .  .   .
reportage.   .   .  .   .
scripted .   .  25  .   .
unscripted   .   .  66  .   .

, , W
dialog mix monolog nonprinted printed
academic .   .   .  .  26
broadcast.   .   .  .   .
creative .   .   .  .  20
instructional.   .   .  .  19
letters  .   .   . 28   .
nonacademic  .   .   .  .  37
nonprofessional  .   .   . 17   .
persuasive   .   .   .  .   9
private  .   .   .  .   .
public   .   .   .  .   .
reportage.   .   .  .  20
scripted .   .   .  .   .
unscripted   .   .   .  .   .

# I would usually have done this (using lme4)
model.1.tog <- lmer(CONSTRUCTION ~ LENGTH +
(1|MODE/REGISTER/SUBREGISTER), family=binomial)

# but Crawley (2013:692ff.) suggests this:
LEVEL2 <- MODE:REGISTER
LEVEL3 <- MODE:REGISTER:SUBREGISTER
model.1.sep <- lmer(CONSTRUCTION ~ LENGTH + (1|MODE) + (1|LEVEL2) +
(1|LEVEL3), family=binomial)

The results are the same for fixed and random effects, ok, but what I
don't understand in this case is why the random adjustments to
intercepts at the highest level of hierarchical organization (MODE)
are 0:

ranef(model.1.sep)
$LEVEL3
 (Intercept)
S:dialog:private  -0.9482442
S:dialog:public0.1216021
[...]

$LEVEL2
 (Intercept)
S:dialog  -0.4746389
[...]

$MODE
  (Intercept)
S   0
W   0

I am guessing that's got something to do with the fact that what the
random adjustments to intercepts at the level of MODE would do is
already taken care of by the random adjustments to intercepts on the
lower levels of the hierarchical organization of the data - but when I
run the code from Crawley on his data (a linear mixed-effects model,
not a generalized linear mixed-effects model as mine), the highest
hierarchical level of organization does not return 0 as random
adjustment. What am I missing?

Thanks for any input you may have!
STG

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] predict.lrm vs. predict.glm (with newdata)

2010-12-25 Thread Stefan Th. Gries
Hi all

I have run into a case where I don't understand why predict.lrm and
predict.glm don't yield the same results. My data look like this:

set.seed(1)
library(Design); ilogit <- function(x) { 1/(1+exp(-x)) }

ORDER <- factor(sample(c("mc-sc", "sc-mc"), 403, TRUE))
CONJ <- factor(sample(c("als", "bevor", "nachdem", "weil"), 403, TRUE))
LENGTH_DIFF <- sample(-32:25, 403, TRUE)

I fit two models:
model.01.lrm <- lrm(ORDER ~ CONJ*LENGTH_DIFF)
model.01.glm <- glm(ORDER ~ CONJ*LENGTH_DIFF, family=binomial)

Then I wanted to plot for each level of CONJ the predicted
probabilities of ORDER="sc-mc" across a large range of LENGTH_DIFF
values. Thus I use this (yes, I know about type="response" etc.):

# range of LENGTH_DIFF for CONJ="als"
x1 <- factor(rep("als", 201)); x2 <- seq(-100, 100, 1)
y.als.lrm <- ilogit(predict(model.01.lrm, newdata=list(CONJ=x1,
LENGTH_DIFF=x2)))
y.als.glm <- ilogit(predict(model.01.glm, newdata=list(CONJ=x1,
LENGTH_DIFF=x2)))

This works perfectly and the two vectors y.als.lrm and y.als.glm are
the same. But then:

# range of LENGTH_DIFF for CONJ="bevor"
x1 <- factor(rep("bevor", 201)); x2 <- seq(-100, 100, 1)
y.bevor.lrm <- ilogit(predict(model.01.lrm, newdata=list(CONJ=x1,
LENGTH_DIFF=x2)))
y.bevor.glm <- ilogit(predict(model.01.glm, newdata=list(CONJ=x1,
LENGTH_DIFF=x2)))

Here, y.bevor.lrm is the same as y.als.lrm, but not the same as y.bevor.glm

# range of LENGTH_DIFF for CONJ="nachdem"
x1 <- factor(rep("nachdem", 201)); x2 <- seq(-100, 100, 1)
y.nachdem.lrm <- ilogit(predict(model.01.lrm, newdata=list(CONJ=x1,
LENGTH_DIFF=x2)))
y.nachdem.glm <- ilogit(predict(model.01.glm, newdata=list(CONJ=x1,
LENGTH_DIFF=x2)))

Here, y.nachdem.lrm is the same as y.als.lrm, but not the same as y.nachdem.glm

# range of LENGTH_DIFF for CONJ="weil"
x1 <- factor(rep("weil", 201)); x2 <- seq(-100, 100, 1)
y.weil.lrm <- ilogit(predict(model.01.lrm, newdata=list(CONJ=x1,
LENGTH_DIFF=x2)))
y.weil.glm <- ilogit(predict(model.01.glm, newdata=list(CONJ=x1,
LENGTH_DIFF=x2)))

Here, y.weil.lrm is the same as y.als.lrm, but not the same as y.weil.glm

As a result, in this plot, the left panel is useless, but the right
has the four different curves:

par(mfrow=c(1,2))
plot(x2, y.als.lrm, pch="a", xlim=c(-100, 100), ylim=c(0, 1),
main="with predict.lrm", xlab="Main cl. length - subord. cl. length
(in words)", ylab="Predicted probability of 'sc-mc'");  grid()
   points(x2, y.bevor.lrm, pch="b")
   points(x2, y.nachdem.lrm, pch="n")
   points(x2, y.weil.lrm, pch="w")
plot(x2, y.als.glm, pch="a", xlim=c(-100, 100), ylim=c(0, 1),
main="with predict.glm", xlab="Main cl. length - subord. cl. length
(in words)", ylab="Predicted probability of 'sc-mc'");  grid()
   points(x2, y.bevor.glm, pch="b")
   points(x2, y.nachdem.glm, pch="n")
   points(x2, y.weil.glm, pch="w")
par(mfrow=c(1,1))

What am I doing wrong with predict.lrm?

Thanks in advance for any input you may have,
STG
--
Stefan Th. Gries
---
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] predict.lrm vs. predict.glm (with newdata)

2010-12-30 Thread Stefan Th. Gries
Hi all

Trying again with this question. (If it's really that stupid a
question, I'd be happy to just receive a link or one or two words to
Google that I haven't thought of before):

I have run into a case where I don't understand why predict.lrm and
predict.glm don't yield the same results.

# My data look like this:
set.seed(1)
library(Design); ilogit <- function(x) { 1/(1+exp(-x)) }
ORDER <- factor(sample(c("mc-sc", "sc-mc"), 403, TRUE))
CONJ <- factor(sample(c("als", "bevor", "nachdem", "weil"), 403, TRUE))
LENGTH_DIFF <- sample(-32:25, 403, TRUE)

# I fit two models:
model.01.lrm <- lrm(ORDER ~ CONJ*LENGTH_DIFF)
model.01.glm <- glm(ORDER ~ CONJ*LENGTH_DIFF, family=binomial)

# Then I wanted to plot for each level of CONJ the predicted
probabilities of ORDER="sc-mc" across a large range of LENGTH_DIFF
values. Thus I use this (yes, I know about type="response" etc.):
# range of LENGTH_DIFF for CONJ="als"
x1 <- factor(rep("als", 201)); x2 <- seq(-100, 100, 1)
y.als.lrm <- ilogit(predict(model.01.lrm, newdata=list(CONJ=x1,
LENGTH_DIFF=x2)))
y.als.glm <- ilogit(predict(model.01.glm, newdata=list(CONJ=x1,
LENGTH_DIFF=x2)))

# This works perfectly and the two vectors y.als.lrm and y.als.glm are
the same. But then:
# range of LENGTH_DIFF for CONJ="bevor"
x1 <- factor(rep("bevor", 201)); x2 <- seq(-100, 100, 1)
y.bevor.lrm <- ilogit(predict(model.01.lrm, newdata=list(CONJ=x1,
LENGTH_DIFF=x2)))
y.bevor.glm <- ilogit(predict(model.01.glm, newdata=list(CONJ=x1,
LENGTH_DIFF=x2)))

# Here, y.bevor.lrm is the same as y.als.lrm, but *not* the same as y.bevor.glm
# range of LENGTH_DIFF for CONJ="nachdem"
x1 <- factor(rep("nachdem", 201)); x2 <- seq(-100, 100, 1)
y.nachdem.lrm <- ilogit(predict(model.01.lrm, newdata=list(CONJ=x1,
LENGTH_DIFF=x2)))
y.nachdem.glm <- ilogit(predict(model.01.glm, newdata=list(CONJ=x1,
LENGTH_DIFF=x2)))

# Here, y.nachdem.lrm is the same as y.als.lrm, but *not* the same as
y.nachdem.glm
# range of LENGTH_DIFF for CONJ="weil"
x1 <- factor(rep("weil", 201)); x2 <- seq(-100, 100, 1)
y.weil.lrm <- ilogit(predict(model.01.lrm, newdata=list(CONJ=x1,
LENGTH_DIFF=x2)))
y.weil.glm <- ilogit(predict(model.01.glm, newdata=list(CONJ=x1,
LENGTH_DIFF=x2)))

# Here, y.weil.lrm is the same as y.als.lrm, but *not* the same as y.weil.glm
#As a result, in this plot, the left panel is useless, but the right
has the four different curves:
par(mfrow=c(1,2))
plot(x2, y.als.lrm, pch="a", xlim=c(-100, 100), ylim=c(0, 1),
main="with predict.lrm", xlab="Main cl. length - subord. cl. length
(in words)", ylab="Predicted probability of 'sc-mc'");  grid()
  points(x2, y.bevor.lrm, pch="b")
  points(x2, y.nachdem.lrm, pch="n")
  points(x2, y.weil.lrm, pch="w")
plot(x2, y.als.glm, pch="a", xlim=c(-100, 100), ylim=c(0, 1),
main="with predict.glm", xlab="Main cl. length - subord. cl. length
(in words)", ylab="Predicted probability of 'sc-mc'");  grid()
  points(x2, y.bevor.glm, pch="b")
  points(x2, y.nachdem.glm, pch="n")
  points(x2, y.weil.glm, pch="w")
par(mfrow=c(1,1))

What am I doing wrong with predict.lrm?

Thanks in advance for any input you may have,
STG
--
Stefan Th. Gries
---
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Developing a web crawler

2011-03-03 Thread Stefan Th. Gries
Hi

The book whose companion website is here
<http://www.linguistics.ucsb.edu/faculty/stgries/research/qclwr/qclwr.html>
deals with many of the things you need for a web crawler, and
assignment "other 5" on that site
(<http://www.linguistics.ucsb.edu/faculty/stgries/research/qclwr/other_5.pdf>)
is a web crawler.

Best,
STG
--
Stefan Th. Gries
---
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] pvclust crashing R on Ubuntu 10.10

2011-03-04 Thread Stefan Th. Gries
Hi all

I am writing to you with a question regarding the pvclust package. And
yes, before the usual people produce their usual
contact-the-package-maintainers line, ye, I tried that but the emails
one can find on the web either bounce or are not responded to. Also,
yes, this error has already been reported as a bug but been shot down
as not reproducible
(<http://bugs.r-project.org/bugzilla3/show_bug.cgi?id=14430>). Thus,
here's now the version that I can reproduce. When I run this:

###
library(pvclust); data(iris)
x <- pvclust(iris[,1:4], method.dist="canberra", method.hclust="ward",
nboot=100)
plot(x)
###

I get this:

###
address 0x7fffa22b5000, cause 'memory not mapped'
###

Here is some info that might help:

###
Ubuntu 10.10
Linux kernel: 2.6.35-27 generic

> R.version
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 2
minor 12.2
year 2011
month 02
day 25
svn rev 54585
language R
version.string R version 2.12.2 (2011-02-25)
###

Paradoxically enough, R for Windows installed on this machine (running
with Wine) causes no problems, but on Linux ... Any ideas what that
could be?

Thanks a lot,
STG
--
Stefan Th. Gries
---
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] lty=NULL crashing R for x11(type="cairo")

2011-03-08 Thread Stefan Th. Gries
Yes, and plotting a pvclust object also works on Windows.
STG
--
Stefan Th. Gries
---
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
---



On Tue, Mar 8, 2011 at 04:06, Prof Brian Ripley  wrote:
> On Sat, 5 Mar 2011, Martin Maechler wrote:
>
>>>>>>> "IZ" == Ista Zahn 
>>>>>>>    on Sat, 5 Mar 2011 14:07:04 + writes:
>>
>>   IZ> I confirm this bug exists and is 100% replicable on R
>>   IZ> version 2.12.2 (2011-02-25) Platform: i686-pc-linux-gnu
>>   IZ> (32-bit)
>>
>> WHoa... debugging 
>> ===> it *is* a bug in R after all :
>>
>>  > plot(1); axis(1, lty=NULL)
>>
>>  *** caught segfault ***
>>  address 0x7fff423ab000, cause 'memory not mapped'
>>
>> and yes, the bug is device dependent:
>
> But it is user error.  Since when has lty=NULL been a valid value?
> Nowhere in the documentation for graphics devices does it say what to do
> with lty=NA_integer_, which is what do_axis maps NULL to.
>
>> E.g., it nicely works for postscript() or pdf()
>>
>>> postscript(); plot(1); axis(1, lty=NULL) ; dev.off()
>>
>> null device
>>         1
>
> It writes a solid line: that's not 'nicely' in my book.  It's chance that it
> works on some devices.
>
>> and it's ok for type = "Xlib", but not for the default
>> type = "cairo":
>>
>>> x11(type="Xlib")
>>> plot(1); axis(1, lty=NULL)
>>> x11(type="cairo")
>>> plot(1); axis(1, lty=NULL)
>>
>> *** caught segfault ***
>> address 0x7fffd875f000, cause 'memory not mapped'
>> /u/maechler/bin/R_arg: line 137: 14914 Segmentation fault      $exe $@
>>
>> Process R-devel exited abnormally with code 139 at Sat Mar  5 22:53:35
>> 2011
>>
>> and similarly for
>>
>>> png(type="Xlib") # fine
>>> png()            # not fine
>>
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> --
> Brian D. Ripley,                  rip...@stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] results of fligner.test

2008-05-28 Thread Stefan Th. Gries
Hi all

I have a question regarding the Fligner-Killeen test. I am using

- a PC with Windows XP (Build 20600.xpsp080413-2111 (Service Pack 3);
- the following R version:
> sessionInfo()
R version 2.7.0 (2008-04-22)
i386-pc-mingw32


I have a vector LENGTH and a factor RELATION that are distributed like this:

> table(LENGTH, RELATION)
  RELATION
LENGTH object subject
1   9  51
2  25  18
3  36  23
4  23  14
5  17  13
6  12   7
7  10   5
8   3   7
9   4   3
10  3   5
11  4   2
12  2   0
13  0   2
16  2   1
22  0   1
49  1   0
53  1   0

I wanted to run a Fligner test to see whether the lengths for the two
relations differ in terms of their variance. This is what I found,
though:

#
> fligner.test(LENGTH[RELATION=="subject"], LENGTH[RELATION=="object"])
Fligner-Killeen test of homogeneity of variances
data:  LENGTH[RELATION == "subject"] and LENGTH[RELATION == "object"]
Fligner-Killeen:med chi-squared = 18.3552, df = 14, p-value = 0.1911

> fligner.test(LENGTH[RELATION=="object"], LENGTH[RELATION=="subject"])
Fligner-Killeen test of homogeneity of variances
data:  LENGTH[RELATION == "object"] and LENGTH[RELATION == "subject"]
Fligner-Killeen:med chi-squared = 16.8838, df = 13, p-value = 0.2047

> fligner.test(LENGTH~RELATION)
Fligner-Killeen test of homogeneity of variances
data:  LENGTH by RELATION
Fligner-Killeen:med chi-squared = 0.626, df = 1, p-value = 0.4288
#

The order of the vectors etc. changes the results??? Needless to say
this does not happen with var.test or ... Is this normal; if so, which
one is the one to report?
Thanks,
STG

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Unicode characters (R 2.7.0 on Windows XP SP3 and Hardy Heron)

2008-05-30 Thread Stefan Th. Gries
Hi all

Four questions regarding Unicode.

Three Windows questions. I am using

- a PC with Windows XP (Build 20600.xpsp080413-2111 (Service Pack 3);
- the following R version:

> R.version
platform   i386-pc-mingw32
arch   i386
os mingw32
system i386, mingw32
status
major  2
minor  7.0
year   2008
month  04
day22
svn rev45424
language   R
version.string R version 2.7.0 (2008-04-22)

- the following locale:
> Sys.getlocale(category = "LC_ALL")
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
States.1252;LC_MONETARY=English_United
States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"



# I loaded the file
# 
# into R, and this works fine.
x<-scan(choose.files(), what="char", sep="\n", quote="",
comment.char="", encoding="UTF-8")

# My problems are the following:
# 1 strsplit

# This does not work:
words.1<-unlist(strsplit(corpus.file, "[-!;:\'\"\\?\\. ]+", perl=T))

# - words.1[173] should be "фирме", as in corpus.file[6]
# but it is "фирме"
# - words.1[208] should be "Торговли", as in corpus.file[13]
# but it is "Торговли"
# - words.1[214] should be "клиентов", as in corpus.file[14]
# but it is "Торговли"



# 2 entering Unicode characters into R: I want to search for,
# say, "для". So I try to define it as follows,
# but this doesn't work:
(x123<-"\u0434\u043b\u044F")

# I can define each individual character
(x1<-"\u0434"); (x2<-"\u043b"); (x3<-"\u044F")

# and each pair of character
(x12<-"\u0434\u043b")
(x13<-"\u0434\u044F")
(x23<-"\u043b\u044F")

# but not all three ... the last one gets skipped.
# why's that and how do I do it?


# 3 defining Unicode character ranges: in each of the following,
# the last bracket does not get included (even if it gets defined
# as a Unicode character, too):
russ.char.yes<-"[\u0401\u0410-\u044F\u0451]" # all Russian Cyrillics
russ.char.no<-"[^\u0401\u0410-\u044F\u0451]" # other characters
russ.char.capit<-"[\u0410-\u042F\u0451]" # capital Russian Cyrillics
russ.char.small<-"[\u0430-\u044F\u0401]" # small Russian Cyrillics

# I can do that all on Linux, but this arises in a context where
# many other character processing issues are explained for Mac,
# Linux, *and* Windows, and I'd hate to have to say "this one
# thing, you can't do on Windows"


One Linux question. I am using Ubuntu Hardy Heron:

> sessionInfo()
R version 2.7.0 (2008-04-22)
i486-pc-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

# strange(?) behavior of word boundary characters:
# I understand why these work ...
grep("\\bмолод", "а молодость", perl=F, value=T) # OK
# [1] "а молодость"
gsub("\\bмолод", ">XX<", "а молодость", perl=F) # OK
# [1] "а >XX<ость"

# but why does "\\b" not work with perl=T?
grep("\\bмолод", "а молодость", perl=T, value=T) # FAIL
# character(0)
gsub("\\bмолод", ">XX<", "а молодость", perl=T) # FAIL
# [1] "а молодость"



Any pointers would be much appreciated and acknowledged ...
STG
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] fligner.test

2008-06-20 Thread Stefan Th. Gries
Hi all

I have a question regarding the Fligner-Killeen test. I am using

- a PC with Windows XP (Build 20600.xpsp080413-2111 (Service Pack 3);
- the following R version:
> sessionInfo()
R version 2.7.0 (2008-04-22)
i386-pc-mingw32


I have a vector LENGTH and a factor RELATION that are distributed like this:

> table(LENGTH, RELATION)
 RELATION
LENGTH object subject
   1   9  51
   2  25  18
   3  36  23
   4  23  14
   5  17  13
   6  12   7
   7  10   5
   8   3   7
   9   4   3
   10  3   5
   11  4   2
   12  2   0
   13  0   2
   16  2   1
   22  0   1
   49  1   0
   53  1   0

I wanted to run a Fligner test to see whether the lengths for the two
relations differ in terms of their variance. This is what I found,
though:

#
> fligner.test(LENGTH[RELATION=="subject"], LENGTH[RELATION=="object"])
   Fligner-Killeen test of homogeneity of variances
data:  LENGTH[RELATION == "subject"] and LENGTH[RELATION == "object"]
Fligner-Killeen:med chi-squared = 18.3552, df = 14, p-value = 0.1911

> fligner.test(LENGTH[RELATION=="object"], LENGTH[RELATION=="subject"])
   Fligner-Killeen test of homogeneity of variances
data:  LENGTH[RELATION == "object"] and LENGTH[RELATION == "subject"]
Fligner-Killeen:med chi-squared = 16.8838, df = 13, p-value = 0.2047

> fligner.test(LENGTH~RELATION)
   Fligner-Killeen test of homogeneity of variances
data:  LENGTH by RELATION
Fligner-Killeen:med chi-squared = 0.626, df = 1, p-value = 0.4288
#

The order of the vectors etc. changes the results??? Needless to say
this does not happen with var.test or ... Is this normal; if so, which
one is the one to report?

Thx,
STG

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] text vector clustering

2009-01-23 Thread Stefan Th. Gries
Hans-Joerg Bibiko's function Levenshtein would help; cf. below for an
example (very clumsy with two loops, but you can tweak that with apply
stuff).
HTH,
STG


levenshtein <- function(string1, string2, case=TRUE, map=NULL) {

# levenshtein algorithm in R
#
# Author  : Hans-Joerg Bibiko
# Date: 29/06/2006
# Contact : bib...@eva.mpg.de

# string1, string2 := strings to compare
# case = TRUE := case sensitivity; case = FALSE := case insensitivity
# map := character vector of c(regexp1, replacement1, regexp2,
replacement2, ...)
#   example:
#  map <- c("[aeiou]","V","[^aeiou]","C") := replaces all vowels
with V and all others with C
#  levenshtein("Bank","Bond", map=map)   =>  0


if(!is.null(map)) {
m <- matrix(map, ncol=2, byrow=TRUE)
s <- c(ifelse(case, string1, tolower(string1)), ifelse(case,
string2, tolower(string2)))
for(i in 1:dim(m)[1]) s <- gsub(m[i,1], m[i,2], s)
string1 <- s[1]
string2 <- s[2]
}

if(ifelse(case, string1, tolower(string1)) == ifelse(case, string2,
tolower(string2))) return(0)

s1 <- strsplit(paste(" ", ifelse(case, string1, tolower(string1)),
sep=""), NULL)[[1]]
s2 <- strsplit(paste(" ", ifelse(case, string2, tolower(string2)),
sep=""), NULL)[[1]]

l1 <- length(s1)
l2 <- length(s2)

d <- matrix(nrow = l1, ncol = l2)

for(i in 1:l1) d[i,1] <- i-1
for(i in 1:l2) d[1,i] <- i-1
for(i in 2:l1) for(j in 2:l2) d[i,j] <- min((d[i-1,j]+1) ,
(d[i,j-1]+1) , (d[i-1,j-1]+ifelse(s1[i] == s2[j], 0, 1)))

d[l1,l2]
} # end of function Hans-Joerg Bibiko's levenshtein

# generate names
set.seed(1)
all.names<-character(10)
for (i in 1:10) {
   all.names[i]<-paste(sample(letters, sample(4:10, 1), replace=T), collapse="")
}
all.names

# generate matrix
sims<-matrix(0, nrow=10, ncol=10)
attr(sims, "dimnames")<-list(all.names, all.names)

# fill matrix (clumsy)
for (j in 1:9) {
   for (k in (j+1):10) {
  sims[j,k]<-sims[k,j]<-levenshtein(all.names[j], all.names[k])
   }
}
plot(hclust(as.dist(sims)))

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] text vector clustering

2009-01-23 Thread Stefan Th. Gries
On Fri, Jan 23, 2009 at 08:28, Stefan Th. Gries  wrote:
> Hans-Joerg Bibiko's function Levenshtein would help; cf. below for an
> example (very clumsy with two loops, but you can tweak that with apply
> stuff).

Like this maybe (sorry, should've thought about that earlier):

[...]
x<-rep(all.names, length(all.names))
y<-rep(all.names, each=length(all.names))
sims<-matrix(mapply(levenshtein, x, y), ncol=10)
[...]

STG

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Pairwise intersect comparisons of list elements

2008-11-20 Thread Stefan Th. Gries
Hi all

I have two lists that have the same number of numeric vectors such as:
q<-list(1, 2:3, 4:6, 7:10)
w<-list(0, 1:2, 3:7, 8:10)

What I want to do is create a vector desired.result that looks like
this but I am thinking there must be some kind of non-loop /
"\\wapply" way to so ...

desired.result<-numeric()
for (i in 1:4) desired.result[i]<-length(intersect(q[[i]], w[[i]]))
desired.result

Any ideas? Thx,
STG

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Regex: workaround for variable length negative lookbehind

2008-11-30 Thread Stefan Th. Gries
Hi all

I have the following regular expression problem: I want to find
complete elements of a vector that end in a repeated character but
where the repetition doesn't make up the whole word. That is, for the
vector vec:

vec<-c("", "baaa", "bbaa", "bbba", "baamm", "aa")

I would like to get
"baaa"
"bbaa"
"baamm"

>From tools where negative lookbehind can involve variable lengths, one
would think this would work:

grep("(?https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Intercept in lm and in library(car): Anova

2007-09-14 Thread Stefan Th. Gries
Hi

I have two questions regarding the meaning of intercept outputs of lm.

Question 1: In data set 1 (a fully-balanced design), the line with
(Intercept) contains the overall mean, and the estimates contain the
differences from the overall mean (matching those from model.tables).
But in data set 2, the line with the intercept does not correspond to
the overall mean and the estimates don't correspond to the differences
outputted by model.tables. What does the output contain here?

#
rm(list=ls(all=TRUE))
options(contrasts=c("contr.sum", "contr.poly"))

# data set 1
Y1<-c(43, 23, 88, 45, 2, 68, 39, 41, 55, 64, 91, 9, 90, 37, 88, 41)
M1<-factor(c("k", "g", "k", "g", "k", "k", "g", "g", "g", "g", "k",
"k", "g", "g", "k", "k"))
N1<-factor(c("k", "g", "g", "k", "k", "g", "k", "g", "k", "g", "g",
"k", "g", "k", "g", "k"))

# linear model 1
model1<-lm(Y1~M1*N1); summary(model1)
model.tables(aov(Y1~M1*N1), "means")
model.tables(aov(Y1~M1*N1))

# data set 2
Y2<-c(34, 16, 46, 5, 2, 78, 31, 39, 25, 64, 45, 92, 65, 91, 60, 12,
33, 40, 72, 61, 49, 59)
M2<-factor(c(rep("a", 10), rep("b", 12)))
N2<-factor(c(rep("d", 4), rep("e", 6), rep("d", 8), rep("e", 4)))

# linear model 2
model2<-lm(Y2~M2*N2); summary(model2)
model.tables(aov(Y2~M2*N2), "means")
model.tables(aov(Y2~M2*N2))
#


Question 2: what does the line with (Intercept) mean that the
following lines produce?

#
library(car)
Anova(model, type=c("III"))
#

Any help would be much appreciated. Thx,
STG

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.