I'm not very familiar with DataFrames, but a filter on a list is basically a
join, so you can use the two ways to make a join in Spark:

both the list and the dataset are huge => use the DataFrame join method
the list is small and the dataset is huge => broadcast the list (using the
context's broadcast method) so it's available on each partition, and you
should be able to use the filter function: filter( x =>
broadcastList.value.contains(x))
both the list and the dataset are small => do you really need Spark?
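For the second case (small list, huge dataset), here is a minimal Java sketch of the broadcast-and-filter idea. The class name, app name, and sample data are made up for illustration, and it assumes the Spark 1.x Java API (JavaSparkContext, anonymous Function classes), matching the code in the question below:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastFilterSketch {
    public static void main(String[] args) {
        JavaSparkContext context = new JavaSparkContext("local", "broadcast-filter");

        // Hypothetical small list of tokens to keep.
        Set<String> tokens = new HashSet<String>(Arrays.asList("Set", "Get"));

        // Broadcast the set once so every executor gets a read-only copy
        // instead of shipping it with every task.
        final Broadcast<Set<String>> broadcastTokens = context.broadcast(tokens);

        // Stand-in for context.textFile(args[0]) from the question.
        JavaRDD<String> file = context.parallelize(
                Arrays.asList("Set a", "Get b", "Del c"));

        // Keep only the lines containing one of the broadcast tokens.
        JavaRDD<String> filtered = file.filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String line) {
                for (String t : broadcastTokens.value()) {
                    if (line.contains(t)) {
                        return true;
                    }
                }
                return false;
            }
        });

        System.out.println(filtered.collect()); // [Set a, Get b]
        context.stop();
    }
}
```

Note that you read the broadcast variable through broadcastTokens.value() inside the closure; the Broadcast wrapper itself is not the collection.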

2015-09-10 11:11 GMT+08:00 Ted Yu <yuzhih...@gmail.com>:

> Take a look at the following methods:
>
>   /**
>    * Filters rows using the given condition.
>    * {{{
>    *   // The following are equivalent:
>    *   peopleDf.filter($"age" > 15)
>    *   peopleDf.where($"age" > 15)
>    * }}}
>    * @group dfops
>    * @since 1.3.0
>    */
>   def filter(condition: Column): DataFrame = Filter(condition.expr,
> logicalPlan)
>
>   /**
>    * Filters rows using the given SQL expression.
>    * {{{
>    *   peopleDf.filter("age > 15")
>    * }}}
>    * @group dfops
>    * @since 1.3.0
>    */
>   def filter(conditionExpr: String): DataFrame = {
>
> Cheers
>
> On Wed, Sep 9, 2015 at 8:04 PM, prachicsa <prachi...@gmail.com> wrote:
>
>>
>>
>> I want to apply a filter based on a list of values in Spark. This is how
>> I get the list:
>>
>> DataFrame df = sqlContext.read().json("../sample.json");
>>
>> df.groupBy("token").count().show();
>>
>> Row[] tokens = df.select("token").collect();
>> for (int i = 0; i < tokens.length; i++) {
>>     System.out.println(tokens[i].get(0)); // need to apply a filter for tokens[i].get(0)
>> }
>>
>> The RDD on which I want to apply the filter is this:
>>
>> JavaRDD<String> file = context.textFile(args[0]);
>>
>> I figured out a way to write the filter in Java:
>>
>> private static final Function<String, Boolean> Filter =
>>     new Function<String, Boolean>() {
>>         @Override
>>         public Boolean call(String s) {
>>             return s.contains("Set");
>>         }
>>     };
>>
>> How do I go about it?
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Filtering-an-rdd-depending-upon-a-list-of-values-in-Spark-tp24631.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


-- 
Alexis GILLAIN
