There is also apparently a package called disk.frame that you might consider.
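
For what it is worth, the line filtering discussed in the thread below can also be sketched in base R alone, by reading a bounded number of lines at a time from an open connection and keeping only the lines whose first field is a wanted sensor name. This is only a minimal sketch under assumptions: the function name read_selected_lines() and its chunk_size argument are made up for illustration, the file is assumed to be tab-separated with the sensor name in the first field, and the small temporary file stands in for the real data.

```r
# Sketch only: read_selected_lines() and chunk_size are hypothetical names.
# At most chunk_size lines are held in memory at any one time.
read_selected_lines <- function(path, sensors, sep = "\t", chunk_size = 2000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- list()
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break                 # end of file reached
    parts <- strsplit(lines, sep, fixed = TRUE)
    # First field of every line in this chunk (NA for an empty line)
    first <- vapply(parts, function(p) p[1], character(1))
    for (p in parts[first %in% sensors]) {
      # A repeated sensor name would overwrite the earlier line
      kept[[p[1]]] <- as.numeric(p[-1])
    }
  }
  as.data.frame(kept)
}

# Small stand-in file (assumed layout: name, then tab-separated values)
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Time\t0\t0.000999\t0.001999",
  "N023\t-0.031323\t-0.035026\t-0.029759",
  "N053\t-0.014083\t-0.004741\t0.001443",
  "N163\t-0.054023\t-0.049345\t-0.037158"
), tmp)

res <- read_selected_lines(tmp, c("N053", "N163"), chunk_size = 2L)
print(res)
```

The connection is opened once and read forward, so each call to readLines() continues where the previous one stopped, and the file is never scanned again from the top.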

On May 19, 2020 12:07:38 AM PDT, Laurent Rhelp <laurentrh...@free.fr> wrote:
> OK, thank you for the advice. I will take some time to look at these
> packages in detail.
>
> On 19/05/2020 at 05:44, Jeff Newmiller wrote:
>> Laurent... Bill is suggesting building your own indexed database...
>> but this has been done before, so re-inventing the wheel seems
>> inefficient and risky. It is actually impossible to create such a beast
>> without reading the entire file into memory at least temporarily
>> anyway, so you are better off looking at ways to process the entire
>> file efficiently.
>>
>> For example, you could load the data into a SQLite database in a
>> couple of lines of code and use SQL directly, or use the sqldf data
>> frame interface, or use dplyr to query the database.
>>
>> Or you could look at read_csv_chunked from the readr package.
>>
>> On May 18, 2020 11:37:46 AM PDT, William Michels via R-help
>> <r-help@r-project.org> wrote:
>>> Hi Laurent,
>>>
>>> Thank you for explaining your size limitations. Below is an example
>>> using the read.fwf() function to grab the first column of your input
>>> file (in 2000 row chunks). This column is converted to an index, and
>>> the index is used to create an iterator useful for skipping lines when
>>> reading input with scan(). (You could try processing your large file
>>> in successive 2000 line chunks, or whatever number of lines fits into
>>> memory).
>>> Maybe not as elegant as the approach you were going for, but
>>> read.fwf() should be pretty efficient:
>>>
>>> > library(iterators)
>>> > sensors <- c("N053", "N163")
>>> > read.fwf("test2.txt", widths=c(4), as.is=TRUE, flush=TRUE, n=2000, skip=0)
>>>     V1
>>> 1 Time
>>> 2 N023
>>> 3 N053
>>> 4 N123
>>> 5 N163
>>> 6 N193
>>> > first_col <- read.fwf("test2.txt", widths=c(4), as.is=TRUE,
>>> +     flush=TRUE, n=2000, skip=0)
>>> > which(first_col$V1 %in% sensors)
>>> [1] 3 5
>>> > index1 <- which(first_col$V1 %in% sensors)
>>> > iter_index1 <- iter(1:2000, checkFunc= function(n) {n %in% index1})
>>> > unlist(scan(file="test2.txt",
>>> +     what=list("","","","","","","","","",""), flush=TRUE, multi.line=FALSE,
>>> +     skip=nextElem(iter_index1)-1, nlines=1, quiet=TRUE))
>>>  [1] "N053"      "-0.014083" "-0.004741" "0.001443"  "-0.010152"
>>>  [6] "-0.012996" "-0.005337" "-0.008738" "-0.015094" "-0.012104"
>>> > unlist(scan(file="test2.txt",
>>> +     what=list("","","","","","","","","",""), flush=TRUE, multi.line=FALSE,
>>> +     skip=nextElem(iter_index1)-1, nlines=1, quiet=TRUE))
>>>  [1] "N163"      "-0.054023" "-0.049345" "-0.037158" "-0.04112"
>>>  [6] "-0.044612" "-0.036953" "-0.036061" "-0.044516" "-0.046436"
>>>
>>> (Note for this email and the previous one, I've deleted the first
>>> "hash" character from each line of your test file for clarity).
>>>
>>> HTH, Bill.
>>>
>>> W. Michels, Ph.D.
>>>
>>> On Mon, May 18, 2020 at 3:35 AM Laurent Rhelp <laurentrh...@free.fr> wrote:
>>>> Dear William,
>>>> Thank you for your answer.
>>>> My file is very large, so I cannot read it into memory (I cannot use
>>>> read.table). So I want to put in memory only the lines I need to process.
>>>> With readLines, as I did, it works, but I would like to use an iterator
>>>> and a foreach loop to understand this way of doing it, because I thought
>>>> it was a better way to write nice code.
>>>>
>>>> On 18/05/2020 at 04:54, William Michels wrote:
>>>>> Apologies, Laurent, for this two-part answer.
>>>>> I misunderstood your post where you stated you wanted to
>>>>> "filter(ing) some selected lines according to the line name... ."
>>>>> I thought that meant you had a separate index (like a series of
>>>>> primes) that you wanted to use to only read-in selected line numbers
>>>>> from a file (test file below with numbers 1:1000 each on a separate
>>>>> line):
>>>>>
>>>>> > library(gmp)
>>>>> > library(iterators)
>>>>> > iprime <- iter(1:100, checkFunc = function(n) isprime(n))
>>>>> > scan(file="one_thou_lines.txt", skip=nextElem(iprime)-1, nlines=1)
>>>>> Read 1 item
>>>>> [1] 2
>>>>> > scan(file="one_thou_lines.txt", skip=nextElem(iprime)-1, nlines=1)
>>>>> Read 1 item
>>>>> [1] 3
>>>>> > scan(file="one_thou_lines.txt", skip=nextElem(iprime)-1, nlines=1)
>>>>> Read 1 item
>>>>> [1] 5
>>>>> > scan(file="one_thou_lines.txt", skip=nextElem(iprime)-1, nlines=1)
>>>>> Read 1 item
>>>>> [1] 7
>>>>>
>>>>> However, what it really seems that you want to do is read each line of
>>>>> a (possibly enormous) file, test each line "string-wise" to keep or
>>>>> discard, and if you're keeping it, append the line to a list. I can
>>>>> certainly see the advantage of this strategy for reading in very, very
>>>>> large files, but it's not clear to me how the "ireadLines" function
>>>>> (in the "iterators" package) will help you, since it doesn't seem to
>>>>> generate anything but a sequential index.
>>>>>
>>>>> Anyway, below is an absolutely standard read-in of your data using
>>>>> read.table(). Hopefully some of the code I've posted has been useful
>>>>> to you.
>>>>>
>>>>> > sensors <- c("N053", "N163")
>>>>> > read.table("test2.txt")
>>>>>     V1        V2        V3        V4        V5        V6        V7        V8        V9       V10
>>>>> 1 Time  0.000000  0.000999  0.001999  0.002998  0.003998  0.004997  0.005997  0.006996  0.007996
>>>>> 2 N023 -0.031323 -0.035026 -0.029759 -0.024886 -0.024464 -0.026816 -0.033690 -0.041067 -0.038747
>>>>> 3 N053 -0.014083 -0.004741  0.001443 -0.010152 -0.012996 -0.005337 -0.008738 -0.015094 -0.012104
>>>>> 4 N123 -0.019008 -0.013494 -0.013180 -0.029208 -0.032748 -0.020243 -0.015089 -0.014439 -0.011681
>>>>> 5 N163 -0.054023 -0.049345 -0.037158 -0.041120 -0.044612 -0.036953 -0.036061 -0.044516 -0.046436
>>>>> 6 N193 -0.022171 -0.022384 -0.022338 -0.023304 -0.022569 -0.021827 -0.021996 -0.021755 -0.021846
>>>>> > Laurent_data <- read.table("test2.txt")
>>>>> > Laurent_data[Laurent_data$V1 %in% sensors, ]
>>>>>     V1        V2        V3        V4        V5        V6        V7        V8        V9       V10
>>>>> 3 N053 -0.014083 -0.004741  0.001443 -0.010152 -0.012996 -0.005337 -0.008738 -0.015094 -0.012104
>>>>> 5 N163 -0.054023 -0.049345 -0.037158 -0.041120 -0.044612 -0.036953 -0.036061 -0.044516 -0.046436
>>>>>
>>>>> Best, Bill.
>>>>>
>>>>> W. Michels, Ph.D.
>>>>>
>>>>> On Sun, May 17, 2020 at 5:43 PM Laurent Rhelp <laurentrh...@free.fr> wrote:
>>>>>> Dear R-Help List,
>>>>>>
>>>>>> I would like to use an iterator to read a file, keeping only some
>>>>>> selected lines according to the line name, in order to use a
>>>>>> foreach loop afterwards. I wanted to use the checkFunc argument, as
>>>>>> in the following example found on the internet, which selects only
>>>>>> prime numbers:
>>>>>>
>>>>>>     iprime <- iter(1:100, checkFunc = function(n) isprime(n))
>>>>>>
>>>>>> (https://datawookie.netlify.app/blog/2013/11/iterators-in-r/)
>>>>>>
>>>>>> but the checkFunc argument does not seem to be available with the
>>>>>> function ireadLines (package iterators).
>>>>>> So I wrote the code below to solve my problem, but I am sure that I
>>>>>> am missing something about how to use iterators with files.
>>>>>> Since I found nothing on the web about ireadLines and the checkFunc
>>>>>> argument, could somebody help me understand how to use an
>>>>>> iterator (and a foreach loop) on files, keeping only selected lines?
>>>>>>
>>>>>> Thank you very much
>>>>>> Laurent
>>>>>>
>>>>>> Presently here is my code:
>>>>>>
>>>>>> ## mock file to read: test.txt
>>>>>> ##
>>>>>> # Time  0          0.000999   0.001999   0.002998   0.003998   0.004997   0.005997   0.006996   0.007996
>>>>>> # N023  -0.031323  -0.035026  -0.029759  -0.024886  -0.024464  -0.026816  -0.03369   -0.041067  -0.038747
>>>>>> # N053  -0.014083  -0.004741  0.001443   -0.010152  -0.012996  -0.005337  -0.008738  -0.015094  -0.012104
>>>>>> # N123  -0.019008  -0.013494  -0.01318   -0.029208  -0.032748  -0.020243  -0.015089  -0.014439  -0.011681
>>>>>> # N163  -0.054023  -0.049345  -0.037158  -0.04112   -0.044612  -0.036953  -0.036061  -0.044516  -0.046436
>>>>>> # N193  -0.022171  -0.022384  -0.022338  -0.023304  -0.022569  -0.021827  -0.021996  -0.021755  -0.021846
>>>>>>
>>>>>> # sensors to keep
>>>>>> sensors <- c("N053", "N163")
>>>>>>
>>>>>> library(iterators)
>>>>>> library(rlist)
>>>>>>
>>>>>> ## the function get_Lines_iter to select and process the line
>>>>>> ## (defined before it is called so the script runs top to bottom)
>>>>>> get_Lines_iter <- function(iter, sensors, sep = '\t', quiet = FALSE) {
>>>>>>   ## read the next record in the iterator
>>>>>>   r <- try(nextElem(iter))
>>>>>>   while (TRUE) {
>>>>>>     if (inherits(r, "try-error")) {
>>>>>>       stop("The iterator is empty")
>>>>>>     } else {
>>>>>>       ## split the read line according to the separator
>>>>>>       r_txt <- textConnection(r)
>>>>>>       fields <- scan(file = r_txt, what = "character", sep = sep,
>>>>>>                      quiet = quiet)
>>>>>>       close(r_txt)
>>>>>>       ## test if we have to keep the line
>>>>>>       if (fields[1] %in% sensors) {
>>>>>>         ## data processing for the selected line (for the example,
>>>>>>         ## transformation into a data frame)
>>>>>>         n <- length(fields)
>>>>>>         x <- data.frame(as.numeric(fields[2:n]))
>>>>>>         names(x) <- fields[1]
>>>>>>         ## We return the values
>>>>>>         print(paste0("sensor ", fields[1], " ok"))
>>>>>>         return(x)
>>>>>>       } else {
>>>>>>         print(paste0("Sensor ", fields[1], " not selected"))
>>>>>>         r <- try(nextElem(iter))
>>>>>>       }
>>>>>>     }
>>>>>>   } # end while loop
>>>>>> }
>>>>>>
>>>>>> file_name <- "test.txt"
>>>>>>
>>>>>> con_obj <- file(file_name, "r")
>>>>>> ifile <- ireadLines(con_obj, n = 1)
>>>>>>
>>>>>> ## I do not do a loop for the example
>>>>>>
>>>>>> res <- list()
>>>>>>
>>>>>> r <- get_Lines_iter(ifile, sensors)
>>>>>> res <- list.append(res, r)
>>>>>> res
>>>>>> r <- get_Lines_iter(ifile, sensors)
>>>>>> res <- list.append(res, r)
>>>>>> res
>>>>>> r <- get_Lines_iter(ifile, sensors)
>>>>>> do.call("cbind", res)
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.

--
Sent from my phone. Please excuse my brevity.