textreadr would be the obvious approach. When you say it is deprecated, do you mean it's not available on CRAN? Sometimes maintaining a package on CRAN is just a pain in the ass.
devtools::install_github("trinker/textreadr") should let you install it.

In theory, docx files are actually just zip files (you can unzip them), and you may find there is a specific file inside the zip that is readable with one of R's general text file readers.

Alternatively, read_docx from https://www.rdocumentation.org/packages/qdapTools may be worth a look.

What platform are you on? There are certainly options to convert the files to txt on the command line and work from there.

On Fri, 29 Dec 2023, 18:25 Roy Mendelssohn - NOAA Federal via R-help <r-help@r-project.org> wrote:

> Hi Andy:
>
> I don’t have an answer but I do have what I hope is some friendly advice.
> Generally the more information you can provide, the more likely you will
> get help that is useful. In your case you say that you tried several
> packages and they didn’t do what you wanted. Providing that code, as well
> as why they didn’t do what you wanted (be specific) would greatly
> facilitate things.
>
> Happy new year,
>
> -Roy
>
> > On Dec 29, 2023, at 10:14 AM, Andy <phaedr...@gmail.com> wrote:
> >
> > Hello
> >
> > I am trying to work through a problem, but feel like I've gone down a
> > rabbit hole. I'd very much appreciate any help.
> >
> > The task: I have several directories of multiple (some directories, up
> > to 2,500+) *.docx files (newspaper articles downloaded from Lexis+) that
> > I want to iterate through to append to a spreadsheet only those articles
> > that satisfy a condition (i.e., a specific keyword is present for >= 50%
> > coverage of the subject matter). Lexis+ has a very specific structure and
> > keywords are given in the row "Subject".
> >
> > I'd like to be able to accomplish the following:
> >
> > (1) Append the title, the month, the author, the number of words, and
> > page number(s) to a spreadsheet
> >
> > (2) Read each article and extract keywords (in the docs, these are
> > listed in 'Subject' section as a list of keywords with a percentage
> > showing the extent to which the keyword features in the article (e.g.,
> > FAST FASHION (72%)) and to append the keyword and the % coverage to the
> > same row in the spreadsheet. However, I want to ensure that the keyword
> > coverage meets the threshold of >= 50%; if not, then pass onto the next
> > article in the directory. Rinse and repeat for the entire directory.
> >
> > So far, I've tried working through some Stack Overflow-based solutions,
> > but most seem to use the textreadr package, which is now deprecated;
> > others use either the officer or the officedown packages. However, these
> > packages don't appear to do what I want the program to do, at least not
> > in any of the examples I have found, nor in the vignettes and relevant
> > package manuals I've looked at.
> >
> > The first point is, is what I am intending to do even possible using R?
> > If it is, then where do I start with this? If these docx files were
> > converted to UTF-8 plain text, would that make the task easier?
> >
> > I am not a confident coder, and am really only just getting my head
> > around R so appreciate a steep learning curve ahead, but of course, I
> > don't know what I don't know, so any pointers in the right direction
> > would be a big help.
> >
> > Many thanks in anticipation
> >
> > Andy
> >
> > ______________________________________________
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
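
To make the "docx is just a zip" idea from the reply concrete, here is a rough base-R sketch. It is untested against real Lexis+ downloads, and several things in it are assumptions rather than facts about those files: the function names read_docx_text and parse_subject are made up for illustration, the keyword list is assumed to sit on one line beginning with "Subject:", and the entries are assumed to be separated by semicolons. XML entities such as &amp; are not decoded.

```r
# Sketch only: helper names and the "Subject:" line layout are assumptions.

# Return the paragraphs of a .docx as a character vector, using base R only.
# A .docx is a zip archive; the document body is stored in word/document.xml.
read_docx_text <- function(path) {
  con <- unz(path, "word/document.xml")
  on.exit(close(con))
  xml <- paste(readLines(con, warn = FALSE, encoding = "UTF-8"), collapse = "")
  # Split at paragraph ends, then strip whatever tags remain.
  paras <- strsplit(xml, "</w:p>", fixed = TRUE)[[1]]
  paras <- trimws(gsub("<[^>]+>", "", paras))
  paras[nzchar(paras)]
}

# Turn a line such as "Subject: FAST FASHION (72%); RETAILERS (45%)" into a
# data frame of keywords and coverage, keeping rows at or above the threshold.
parse_subject <- function(subject_line, threshold = 50) {
  body <- sub("^\\s*Subject:?\\s*", "", subject_line,
              ignore.case = TRUE, perl = TRUE)
  # Each entry looks like "KEYWORD (NN%)"; entries assumed split by ";".
  hits <- regmatches(body,
                     gregexpr("[^;()]+\\(\\d+%\\)", body, perl = TRUE))[[1]]
  keyword <- trimws(sub("\\s*\\(\\d+%\\)$", "", hits, perl = TRUE))
  coverage <- as.numeric(sub(".*\\((\\d+)%\\)$", "\\1", hits, perl = TRUE))
  out <- data.frame(keyword = keyword, coverage = coverage,
                    stringsAsFactors = FALSE)
  out[out$coverage >= threshold, , drop = FALSE]
}
```

For one file you could then try something like paras <- read_docx_text("article.docx") followed by parse_subject(grep("^Subject", paras, value = TRUE)[1]), and wrap that in a loop over list.files(dir, pattern = "\\.docx$", full.names = TRUE), skipping any article where the result has zero rows. If the Subject keywords turn out to live in a table cell rather than on a "Subject:" line, the grep step would need adjusting.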