Here is another approach, just for fun:

library(tidyverse)
library(tokenizers)

anyall <- function(x,    # a character vector
                   terms # a list of character vectors
                   ) {
  # TRUE if every word of at least one term is found in x
  any(map_lgl(terms, function(term) {
    all(term %in% x)
  }))
}

mutate(th, flag = map_lgl(tokenize_tweets(text),
                          anyall,
                          terms = tokenize_words(st$terms)))

Best,
Ista

On Tue, Oct 16, 2018 at 5:39 PM Nathan Parsons <nathan.f.pars...@gmail.com> wrote:
>
> Thanks all for your patience. Here’s a second go that is perhaps more
> explicative of what it is I am trying to accomplish (and hopefully in plain
> text form)...
>
> I’m using the following packages: tidyverse, purrr, tidytext
>
> I have a number of tweets in the following form:
>
> th <- structure(list(
>   status_id = c("x1047841705729306624", "x1046966595610927105",
>     "x1047094786610552832", "x1046988542818308097",
>     "x1046934493553221632", "x1047227442899775488"),
>   created_at = c("2018-10-04T13:31:45Z", "2018-10-02T03:34:22Z",
>     "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z",
>     "2018-10-02T01:26:49Z", "2018-10-02T20:50:53Z"),
>   text = c(
>     "Technique is everything with olympic lifts ! @ Body By John https://t.co/UsfR6DafZt",
>     "@Subtronics just went back and rewatched ur FBlice with ur CDJs and let me tell you man. You are the fucking messiah",
>     "@ic4rus1 Opportunistic means short-game. As in getting drunk now vs. not being hung over tomorrow vs. not fucking up your life ten years later.",
>     "I tend to think about my dreams before I sleep.",
>     "@MichaelAvenatti @SenatorCollins So, if your client was in her 20s, attending parties with teenagers, doesn't that make her at the least immature as hell, or at the worst, a pedophile and a person contributing to the delinquency of minors?",
>     "i wish i could take credit for this"),
>   lat = c(43.6835853, 40.284123, 37.7706565, 40.431389, 31.1688935, 33.9376735),
>   lng = c(-70.3284118, -83.078589, -122.4359785, -79.9806895, -100.0768885, -118.130426),
>   county_name = c("Cumberland County", "Delaware County", "San Francisco County",
>     "Allegheny County", "Concho County", "Los Angeles County"),
>   fips = c(23005L, 39041L, 6075L, 42003L, 48095L, 6037L),
>   state_name = c("Maine", "Ohio", "California", "Pennsylvania", "Texas", "California"),
>   state_abb = c("ME", "OH", "CA", "PA", "TX", "CA"),
>   urban_level = c("Medium Metro", "Large Fringe Metro", "Large Central Metro",
>     "Large Central Metro", "NonCore (Nonmetro)", "Large Central Metro"),
>   urban_code = c(3L, 2L, 1L, 1L, 6L, 1L),
>   population = c(277308L, 184029L, 830781L, 1160433L, 4160L, 9509611L)),
>   class = c("data.table", "data.frame"), row.names = c(NA, -6L),
>   .internal.selfref = )
>
> I also have a number of search terms in the following form:
>
> st <- structure(list(terms = c("me abused depressed", "me hurt depressed",
>   "feel hopeless depressed", "feel alone depressed", "i feel helpless",
>   "i feel worthless")), row.names = c(NA, -6L), class = c("tbl_df",
>   "tbl", "data.frame"))
>
> I am trying to isolate the tweets that contain all of the words in each of
> the search terms, i.e. “me”, “abused”, and “depressed” from the first example
> search term, but they do not have to be in order or even next to one
> another.
>
> I am familiar with the dplyr suite of tools and have been attempting to
> generate some sort of ‘filter()’ to do this.
> I am not very familiar with
> purrr, but there may be a solution using the map function? I have also
> explored the tidytext ‘unnest_tokens’ function, which transforms the ‘th’
> data in the following way:
>
> > tidytext::unnest_tokens(th, word, text, token = "tweets") -> tt
> > head(tt)
>               status_id           created_at      lat       lng
> 1: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> 2: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> 3: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> 4: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> 5: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
> 6: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
>          county_name  fips state_name state_abb  urban_level urban_code
> 1: Cumberland County 23005      Maine        ME Medium Metro          3
> 2: Cumberland County 23005      Maine        ME Medium Metro          3
> 3: Cumberland County 23005      Maine        ME Medium Metro          3
> 4: Cumberland County 23005      Maine        ME Medium Metro          3
> 5: Cumberland County 23005      Maine        ME Medium Metro          3
> 6: Cumberland County 23005      Maine        ME Medium Metro          3
>    population       word
> 1:     277308  technique
> 2:     277308         is
> 3:     277308 everything
> 4:     277308       with
> 5:     277308    olympic
> 6:     277308      lifts
>
> but once I have unnested the tokens, I am unable to recombine them back
> into tweets.
>
> Ideally the end result would append a new column to the ‘th’ data that
> would flag a tweet that contained all of the search words for any of the
> search terms; so the work flow would look like
>
> 1) look for all search words for one search term in a tweet
> 2) if all of the search words in the search term are found, create a flag
>    (mutate(flag = 1) or some such)
> 3) do this for all of the tweets
> 4) move on to the next search term and repeat
>
> Again, my thanks for your patience.
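
The four-step workflow quoted above can also be sketched in base R, with no
packages loaded. This is only an illustration under stated assumptions: the
helper names split_words() and flag_tweets() are hypothetical, the tweets are
toy strings rather than the real data, and the tokenization is a crude regex
split, not a tweet-aware tokenizer.

```r
# Hypothetical helper: lowercase each string and split it into words.
# Assumption: anything other than a-z, 0-9, or an apostrophe separates words.
split_words <- function(x) {
  lapply(strsplit(tolower(x), "[^a-z0-9']+"), function(w) w[nzchar(w)])
}

# Hypothetical helper: one logical flag per tweet.
flag_tweets <- function(tweets, terms) {
  tweet_words <- split_words(tweets)
  term_words  <- split_words(terms)
  vapply(tweet_words, function(tw) {
    # Steps 1, 2 and 4: a tweet is flagged if every word of at least
    # one search term occurs in it, in any order.
    any(vapply(term_words, function(term) all(term %in% tw), logical(1)))
  }, logical(1))  # step 3: repeated for every tweet
}

# Toy data for illustration only
tweets <- c("I feel so alone and depressed today",
            "Technique is everything with olympic lifts",
            "depressed? me? i was hurt, that is all")
terms  <- c("me hurt depressed", "feel alone depressed")

flags <- flag_tweets(tweets, terms)
```

Under the same assumptions, something like th$flag <- flag_tweets(th$text,
st$terms) would append the flag column to the tweet data, though hashtags,
mentions and URLs would be split differently than by a dedicated tweet
tokenizer.
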
>
> --
>
> Nate Parsons
> Pronouns: He, Him, His
> Graduate Teaching Assistant
> Department of Sociology
> Portland State University
> Portland, Oregon
>
> 503-725-9025
> 503-725-3957 FAX
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.