Re: [R] iterators : checkFunc with ireadLines

Laurent Rhelp Tue, 19 May 2020 16:19:32 -0700

OK, thank you for the advice. I will take some time to look at these packages in detail.



On 19/05/2020 at 05:44, Jeff Newmiller wrote:

Laurent... Bill is suggesting building your own indexed database... but this 
has been done before, so re-inventing the wheel seems inefficient and risky. It 
is actually impossible to create such a beast without reading the entire file 
into memory at least temporarily anyway, so you are better off looking at ways 
to process the entire file efficiently.

For example, you could load the data into a sqlite database in a couple of 
lines of code and use SQL directly or use the sqldf data frame interface, or 
use dplyr to query the database.
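
As a rough sketch of the SQLite route, the file can be appended to a database table chunk by chunk and then queried with SQL. This assumes the DBI and RSQLite packages are installed and that the file is tab-separated; the table name "readings", the temporary file, the shortened rows, and the chunk size are made up for illustration.

```r
library(DBI)

# Small stand-in for the real file (tab-separated, shortened rows).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Time\t0\t0.000999\t0.001999",
  "N023\t-0.031323\t-0.035026\t-0.029759",
  "N053\t-0.014083\t-0.004741\t0.001443",
  "N123\t-0.019008\t-0.013494\t-0.01318",
  "N163\t-0.054023\t-0.049345\t-0.037158",
  "N193\t-0.022171\t-0.022384\t-0.022338"
), path)

# In-memory database for the example; a file path would be used for real data.
db <- dbConnect(RSQLite::SQLite(), ":memory:")

# Append the file to the (hypothetical) table "readings" a few lines at a time,
# so only one chunk is held in memory at once.
con <- file(path, "r")
repeat {
  lines <- readLines(con, n = 2)
  if (length(lines) == 0) break
  chunk <- read.table(text = lines, sep = "\t", header = FALSE,
                      stringsAsFactors = FALSE)
  dbWriteTable(db, "readings", chunk, append = TRUE)
}
close(con)

# Query only the sensors of interest.
hits <- dbGetQuery(db, "SELECT * FROM readings WHERE V1 IN ('N053', 'N163')")
dbDisconnect(db)
```

Once the table exists, later sessions can query it without re-reading the text file.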

Or you could look at read_csv_chunked from readr package.
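
A minimal sketch of the chunked readr approach, assuming a tab-separated file (hence read_delim_chunked rather than read_csv_chunked) and that readr is installed. The temporary file, the shortened rows, the chunk size, and the helper name keep_sensors are illustrative only.

```r
library(readr)

# Small stand-in for the real file (tab-separated, shortened rows).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Time\t0\t0.000999\t0.001999",
  "N023\t-0.031323\t-0.035026\t-0.029759",
  "N053\t-0.014083\t-0.004741\t0.001443",
  "N123\t-0.019008\t-0.013494\t-0.01318",
  "N163\t-0.054023\t-0.049345\t-0.037158",
  "N193\t-0.022171\t-0.022384\t-0.022338"
), path)

sensors <- c("N053", "N163")

# Hypothetical callback: called once per chunk, keeps only the wanted rows.
keep_sensors <- function(chunk, pos) chunk[chunk$X1 %in% sensors, ]

# DataFrameCallback row-binds whatever the callback returns for each chunk.
selected <- read_delim_chunked(
  path,
  callback = DataFrameCallback$new(keep_sensors),
  delim = "\t",
  chunk_size = 2,
  col_names = FALSE,
  col_types = cols(X1 = col_character(), .default = col_double())
)
```

Only one chunk plus the accumulated matches is held in memory, so chunk_size can be tuned to the available RAM.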

On May 18, 2020 11:37:46 AM PDT, William Michels via R-help 
<r-help@r-project.org> wrote:

Hi Laurent,

Thank you for explaining your size limitations. Below is an example
using the read.fwf() function to grab the first column of your input
file (in 2000 row chunks). This column is converted to an index, and
the index is used to create an iterator useful for skipping lines when
reading input with scan(). (You could try processing your large file
in successive 2000 line chunks, or whatever number of lines fits into
memory). Maybe not as elegant as the approach you were going for, but
read.fwf() should be pretty efficient:

library(iterators)

sensors <- c("N053", "N163")
read.fwf("test2.txt", widths=c(4), as.is=TRUE, flush=TRUE, n=2000, skip=0)
    V1
1 Time
2 N023
3 N053
4 N123
5 N163
6 N193

first_col <- read.fwf("test2.txt", widths=c(4), as.is=TRUE, flush=TRUE, n=2000, skip=0)
which(first_col$V1 %in% sensors)
[1] 3 5

index1 <- which(first_col$V1 %in% sensors)
iter_index1 <- iter(1:2000, checkFunc= function(n) {n %in% index1})
unlist(scan(file="test2.txt", what=list("","","","","","","","","",""),
            flush=TRUE, multi.line=FALSE, skip=nextElem(iter_index1)-1,
            nlines=1, quiet=TRUE))
[1] "N053"      "-0.014083" "-0.004741" "0.001443"  "-0.010152"
"-0.012996" "-0.005337" "-0.008738" "-0.015094" "-0.012104"

unlist(scan(file="test2.txt", what=list("","","","","","","","","",""),
            flush=TRUE, multi.line=FALSE, skip=nextElem(iter_index1)-1,
            nlines=1, quiet=TRUE))
[1] "N163"      "-0.054023" "-0.049345" "-0.037158" "-0.04112"
"-0.044612" "-0.036953" "-0.036061" "-0.044516" "-0.046436"

(Note for this email and the previous one, I've deleted the first
"hash" character from each line of your test file for clarity).

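The repeated scan() calls can be folded into a loop that runs until the index iterator is exhausted. The following is only a sketch under assumptions: the temporary file and its shortened, space-separated rows are made up, and each scan() call still re-reads the file from the top in order to skip lines, so this suits a modest number of selected lines.

```r
library(iterators)

# Small stand-in for test2.txt (whitespace-separated, shortened rows).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Time 0.000000 0.000999",
  "N023 -0.031323 -0.035026",
  "N053 -0.014083 -0.004741",
  "N123 -0.019008 -0.013494",
  "N163 -0.054023 -0.049345",
  "N193 -0.022171 -0.022384"
), path)

sensors <- c("N053", "N163")

# Build the line index from the first four characters of each line.
first_col <- read.fwf(path, widths = 4, as.is = TRUE)
index1 <- which(first_col$V1 %in% sensors)

# Iterator over line numbers that yields only the indexed ones.
it <- iter(seq_len(nrow(first_col)), checkFunc = function(n) n %in% index1)

rows <- list()
repeat {
  # nextElem() raises an error once the iterator is exhausted.
  i <- tryCatch(nextElem(it), error = function(e) NULL)
  if (is.null(i)) break
  rows[[length(rows) + 1]] <- scan(path, what = "", skip = i - 1,
                                   nlines = 1, quiet = TRUE)
}
```

After the loop, rows holds one character vector per selected sensor line.
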
HTH, Bill.

W. Michels, Ph.D.





On Mon, May 18, 2020 at 3:35 AM Laurent Rhelp <laurentrh...@free.fr> wrote:

Dear William,
   Thank you for your answer.
My file is very large, so I cannot read it into memory (I cannot use
read.table). I therefore want to load only the lines I need to process.

With readLines, as I did, it works, but I would like to use an iterator
and a foreach loop in order to understand this way of working, because I
thought it was a better way to write clean code.


On 18/05/2020 at 04:54, William Michels wrote:

Apologies, Laurent, for this two-part answer. I misunderstood your
post where you stated you wanted to "filter(ing) some
selected lines according to the line name... ." I thought that meant
you had a separate index (like a series of primes) that you wanted to
use to only read-in selected line numbers from a file (test file below
with numbers 1:1000 each on a separate line):

library(gmp)
library(iterators)
iprime <- iter(1:100, checkFunc = function(n) isprime(n))
scan(file="one_thou_lines.txt", skip=nextElem(iprime)-1, nlines=1)

Read 1 item
[1] 2

scan(file="one_thou_lines.txt", skip=nextElem(iprime)-1, nlines=1)

Read 1 item
[1] 3

scan(file="one_thou_lines.txt", skip=nextElem(iprime)-1, nlines=1)

Read 1 item
[1] 5

scan(file="one_thou_lines.txt", skip=nextElem(iprime)-1, nlines=1)

Read 1 item
[1] 7

However, what it really seems that you want to do is read each line of
a (possibly enormous) file, test each line "string-wise" to keep or
discard, and if you're keeping it, append the line to a list. I can
certainly see the advantage of this strategy for reading in very, very
large files, but it's not clear to me how the "ireadLines" function (in
the "iterators" package) will help you, since it only seems to step
through the lines sequentially, with no way to filter them.
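
One possible way to get checkFunc-like behaviour on a file is to wrap ireadLines() in a small custom iterator that keeps calling nextElem() until a line passes a test. This is a sketch, not part of the iterators package: the names ifilter_lines and keep are hypothetical, and the temporary tab-separated file with shortened rows is made up for illustration.

```r
library(iterators)

# Hypothetical helper: an iterator over the lines of `path` that silently
# skips every line for which keep(line) is FALSE.
ifilter_lines <- function(path, keep) {
  lines <- ireadLines(path, n = 1)
  next_el <- function() {
    repeat {
      line <- nextElem(lines)  # raises "StopIteration" at end of file
      if (keep(line)) return(line)
    }
  }
  structure(list(nextElem = next_el),
            class = c("ifilter_lines", "abstractiter", "iter"))
}

# S3 method so that nextElem() works on the wrapper.
nextElem.ifilter_lines <- function(obj, ...) obj$nextElem()

# Small stand-in for the real file (tab-separated, shortened rows).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Time\t0\t0.000999",
  "N023\t-0.031323\t-0.035026",
  "N053\t-0.014083\t-0.004741",
  "N123\t-0.019008\t-0.013494",
  "N163\t-0.054023\t-0.049345",
  "N193\t-0.022171\t-0.022384"
), path)

sensors <- c("N053", "N163")
keep <- function(line) strsplit(line, "\t", fixed = TRUE)[[1]][1] %in% sensors

it <- ifilter_lines(path, keep)
first <- nextElem(it)   # the N053 line
second <- nextElem(it)  # the N163 line
```

Only one line is held in memory at a time. Because the object carries the "iter" class, it should also be usable as the iteration source of a foreach loop, although that is not exercised here.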

Anyway, below is an absolutely standard read-in of your data using
read.table(). Hopefully some of the code I've posted has been useful
to you.

sensors <- c("N053", "N163")
read.table("test2.txt")

    V1        V2        V3        V4        V5        V6        V7        V8        V9       V10
1 Time  0.000000  0.000999  0.001999  0.002998  0.003998  0.004997  0.005997  0.006996  0.007996
2 N023 -0.031323 -0.035026 -0.029759 -0.024886 -0.024464 -0.026816 -0.033690 -0.041067 -0.038747
3 N053 -0.014083 -0.004741  0.001443 -0.010152 -0.012996 -0.005337 -0.008738 -0.015094 -0.012104
4 N123 -0.019008 -0.013494 -0.013180 -0.029208 -0.032748 -0.020243 -0.015089 -0.014439 -0.011681
5 N163 -0.054023 -0.049345 -0.037158 -0.041120 -0.044612 -0.036953 -0.036061 -0.044516 -0.046436
6 N193 -0.022171 -0.022384 -0.022338 -0.023304 -0.022569 -0.021827 -0.021996 -0.021755 -0.021846

Laurent_data <- read.table("test2.txt")
Laurent_data[Laurent_data$V1 %in% sensors, ]

    V1        V2        V3        V4        V5        V6        V7        V8        V9       V10
3 N053 -0.014083 -0.004741  0.001443 -0.010152 -0.012996 -0.005337 -0.008738 -0.015094 -0.012104
5 N163 -0.054023 -0.049345 -0.037158 -0.041120 -0.044612 -0.036953 -0.036061 -0.044516 -0.046436

Best, Bill.

W. Michels, Ph.D.


On Sun, May 17, 2020 at 5:43 PM Laurent Rhelp <laurentrh...@free.fr> wrote:

Dear R-Help List,

I would like to use an iterator to read a file, keeping only selected
lines according to the line name, so that I can then use it in a foreach
loop. I wanted to use the checkFunc argument, as in the following example
found on the internet, which selects only prime numbers:

iprime <- iter(1:100, checkFunc = function(n) isprime(n))

(https://datawookie.netlify.app/blog/2013/11/iterators-in-r/)

but the checkFunc argument does not seem to be available with the
function ireadLines (package iterators). So I wrote the code below to
solve my problem, but I am sure that I am missing something about using
iterators with files.

Since I found nothing on the web about ireadLines and the checkFunc
argument, could somebody help me understand how to use an iterator (and
a foreach loop) on files while keeping only selected lines?

Thank you very much
Laurent

Here is my current code:

##        mock file to read: test.txt
##
# Time    0    0.000999    0.001999    0.002998    0.003998    0.004997    0.005997    0.006996    0.007996
# N023    -0.031323    -0.035026    -0.029759    -0.024886    -0.024464    -0.026816    -0.03369    -0.041067    -0.038747
# N053    -0.014083    -0.004741    0.001443    -0.010152    -0.012996    -0.005337    -0.008738    -0.015094    -0.012104
# N123    -0.019008    -0.013494    -0.01318    -0.029208    -0.032748    -0.020243    -0.015089    -0.014439    -0.011681
# N163    -0.054023    -0.049345    -0.037158    -0.04112    -0.044612    -0.036953    -0.036061    -0.044516    -0.046436
# N193    -0.022171    -0.022384    -0.022338    -0.023304    -0.022569    -0.021827    -0.021996    -0.021755    -0.021846


# sensors to keep
sensors <- c("N053", "N163")

library(iterators)
library(rlist)

## the function get_Lines_iter to select and process the line
## (it must be defined before it is called)
get_Lines_iter <- function(iter, sensors, sep = '\t', quiet = FALSE) {
  repeat {
    ## read the next record in the iterator
    r <- try(nextElem(iter), silent = TRUE)
    if (inherits(r, "try-error")) {
      stop("The iterator is empty")
    }
    ## split the read line according to the separator
    fields <- scan(text = r, what = "character", sep = sep, quiet = quiet)
    ## test if we have to keep the line
    if (fields[1] %in% sensors) {
      ## data processing for the selected line
      ## (for the example, transformation into a data frame)
      n <- length(fields)
      x <- data.frame(as.numeric(fields[2:n]))
      names(x) <- fields[1]
      ## We return the values
      print(paste0("sensor ", fields[1], " ok"))
      return(x)
    } else {
      print(paste0("Sensor ", fields[1], " not selected"))
    }
  } # end repeat loop
}

file_name <- "test.txt"
con_obj <- file(file_name, "r")
ifile <- ireadLines(con_obj, n = 1)

## I do not do a loop for the example
res <- list()

r <- get_Lines_iter(ifile, sensors)
res <- list.append(res, r)
res
r <- get_Lines_iter(ifile, sensors)
res <- list.append(res, r)
res
## a third call would stop with "The iterator is empty",
## because only two lines of the mock file match
do.call("cbind", res)
close(con_obj)








______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


