I'm not familiar with DataFrames, but a filter against a list is basically a join, so you can use the two ways of doing a join in Spark:

- both the list and the dataset are huge => use the DataFrame join method
- the list is small and the dataset is huge => broadcast the list (using the SparkContext broadcast method) so it is available on each executor, and you should then be able to use the RDD filter function: filter(x => broadcastList.value.contains(x))
- both the list and the dataset are small => Spark ???

A quick sketch of the first two cases is below.
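A minimal sketch of the two cases in Scala, spark-shell style, against the 1.4/1.5-era API. The input paths, the "token" column, and the sample token values are made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("list-filter"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical inputs: a large DataFrame with a "token" column and a
    // small list of tokens to keep.
    val df = sqlContext.read.json("events.json")
    val tokens = Seq("Set", "Get", "Put")

    // Case 1: both sides are huge => express the filter as a join.
    val listDf = tokens.toDF("token")
    val joined = df.join(listDf, "token")

    // Case 2: the list is small and the dataset is huge => broadcast the
    // list once per executor and filter with it (note the .value needed to
    // read a broadcast variable).
    val broadcastList = sc.broadcast(tokens.toSet)
    val lines = sc.textFile("events.txt")
    val kept = lines.filter(x => broadcastList.value.contains(x))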
2015-09-10 11:11 GMT+08:00 Ted Yu <yuzhih...@gmail.com>:

> Take a look at the following methods:
>
>   /**
>    * Filters rows using the given condition.
>    * {{{
>    *   // The following are equivalent:
>    *   peopleDf.filter($"age" > 15)
>    *   peopleDf.where($"age" > 15)
>    * }}}
>    * @group dfops
>    * @since 1.3.0
>    */
>   def filter(condition: Column): DataFrame = Filter(condition.expr, logicalPlan)
>
>   /**
>    * Filters rows using the given SQL expression.
>    * {{{
>    *   peopleDf.filter("age > 15")
>    * }}}
>    * @group dfops
>    * @since 1.3.0
>    */
>   def filter(conditionExpr: String): DataFrame = {
>
> Cheers
>
> On Wed, Sep 9, 2015 at 8:04 PM, prachicsa <prachi...@gmail.com> wrote:
>
>> I want to apply a filter based on a list of values in Spark. This is how
>> I get the list:
>>
>> DataFrame df = sqlContext.read().json("../sample.json");
>>
>> df.groupBy("token").count().show();
>>
>> Row[] Tokens = df.select("token").collect();
>> for (int i = 0; i < Tokens.length; i++) {
>>     System.out.println(Tokens[i].get(0)); // Need to apply a filter for Tokens[i].get(0)
>> }
>>
>> The RDD on which I want to apply the filter is this:
>>
>> JavaRDD<String> file = context.textFile(args[0]);
>>
>> I figured out a way to filter in Java:
>>
>> private static final Function<String, Boolean> Filter =
>>     new Function<String, Boolean>() {
>>         @Override
>>         public Boolean call(String s) {
>>             return s.contains("Set");
>>         }
>>     };
>>
>> How do I go about it?

--
Alexis GILLAIN
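For the specific case in the question quoted above (filter a text RDD by the token values collected from a DataFrame), the small-list branch looks roughly like this in Scala, reusing the df and sc names from the earlier sketch. The Java version would broadcast the set the same way and use a Function<String, Boolean> like the one the poster already wrote, testing membership in the set instead of the hard-coded "Set":

    // Collect the (assumed small) token list on the driver and broadcast it.
    val tokens = df.select("token").collect().map(_.getString(0)).toSet
    val bTokens = sc.broadcast(tokens)

    // Keep only the lines that contain at least one of the tokens.
    val file = sc.textFile("input.txt")  // args[0] in the quoted code
    val filtered = file.filter(line => bTokens.value.exists(t => line.contains(t)))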