At 23:06 on 25/07/2023, Bob Green wrote:
Hello,

I am seeking advice on how I can download the 833 files from this site: http://home.brisnet.org.au/~bgreen/Data/

I want to be able to download them to perform a textual analysis.

The 833 files are in a directory with two subfolders; if they were on my computer I could read them with readtext. Using readtext on the URL I get the error:

 > x = readtext("http://home.brisnet.org.au/~bgreen/Data/*")
Error in download_remote(file, ignore_missing, cache, verbosity) :
  Remote URL does not end in known extension. Please download the file manually.

 > x = readtext("http://home.brisnet.org.au/~bgreen/Data/Dir/()")
Error in download_remote(file, ignore_missing, cache, verbosity) :
  Remote URL does not end in known extension. Please download the file manually.

Any suggestions are appreciated.

Bob

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Hello,

The following code downloads all the files from the posted link.



suppressPackageStartupMessages({
  library(rvest)
})

# destination directory, change this at will
dest_dir <- "~/Temp"

# first get the two subfolders from the Data webpage
link <- "http://home.brisnet.org.au/~bgreen/Data/"
page <- read_html(link)
page %>%
  html_elements("a") %>%
  html_text() %>%
  grep("/$", ., value = TRUE) -> sub_folder

# create relevant disk sub-directories, if
# they do not exist yet
for(subf in sub_folder) {
  d <- file.path(dest_dir, subf)
  if(!dir.exists(d)) {
    success <- dir.create(d)
    msg <- paste("created directory", d, "-", success)
    message(msg)
  }
}

# prepare to download the files
dest_dir <- file.path(dest_dir, sub_folder)
source_url <- paste0(link, sub_folder)

success <- mapply(\(src, dest) {
  # read each Data subfolder
  # and get the file names therein
  # then lapply 'download.file' to each filename
  pg <- read_html(src)
  pg %>%
    html_elements("a") %>%
    html_text() %>%
    grep("\\.txt$", ., value = TRUE) %>%
    lapply(\(x) {
      s <- paste0(src, x)
      d <- file.path(dest, x)
      tryCatch(
        download.file(url = s, destfile = d),
        warning = function(w) w,
        error = function(e) e
      )
    })
}, source_url, dest_dir)

lengths(success)
# http://home.brisnet.org.au/~bgreen/Data/Hanson1/
#                                               84
# http://home.brisnet.org.au/~bgreen/Data/Hanson2/
#                                              749

# matches the question's number
sum(lengths(success))
# [1] 833
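
Once the files are on disk, the original readtext approach should work on the local copies. A minimal sketch, assuming the same "~/Temp" destination directory as above and relying on readtext's wildcard expansion for local paths:

```r
suppressPackageStartupMessages({
  library(readtext)
})

# read all downloaded .txt files from the two local subfolders;
# readtext expands wildcard patterns for local (non-remote) paths
x <- readtext("~/Temp/*/*.txt")

# one row per document; should be 833 if every download succeeded
nrow(x)
```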



Hope this helps,

Rui Barradas

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to