Hi

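You cannot pass a JavaPairRDD to FPGrowth.run(), but you can pass a JavaRDD
of lists, e.g. JavaRDD<List<Double>>. So in your case, to make it work with
the least change, call fpg.run(transactions.values()) and declare the result
as FPGrowthModel<Double>, since the items are now the Double sensor values.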
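
A minimal sketch of that change, assuming Spark 1.5-era MLlib (org.apache.spark.mllib.fpm) and Java 8. The class name FpGrowthOnValues, the helper mine(), the local[2] master and the sample readings are made up for illustration. One caveat worth checking: as far as I know, MLlib's FPGrowth rejects a transaction that contains the same item twice, so the sample keeps the two sensor values different within every row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.fpm.FPGrowth;
import org.apache.spark.mllib.fpm.FPGrowthModel;

import scala.Tuple2;

class FpGrowthOnValues {

    // Drops the timestamp keys and mines frequent itemsets over the sensor values.
    static FPGrowthModel<Double> mine(JavaPairRDD<Long, List<Double>> transactions,
                                      double minSupport) {
        // values() yields JavaRDD<List<Double>>; List is Iterable, which run() requires.
        JavaRDD<List<Double>> baskets = transactions.values();
        return new FPGrowth()
                .setMinSupport(minSupport)
                .setNumPartitions(10)
                .run(baskets);
    }

    public static void main(String[] args) {
        // local[2] is an assumption so the sketch runs without a cluster.
        SparkConf conf = new SparkConf().setAppName("FP-growth Example").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        try {
            // Hypothetical readings: (timestamp, [sensor1value, sensor2value]).
            List<Tuple2<Long, List<Double>>> data = new ArrayList<>();
            data.add(new Tuple2<>(1L, Arrays.asList(20.5, 1.0)));
            data.add(new Tuple2<>(2L, Arrays.asList(20.5, 1.0)));
            data.add(new Tuple2<>(3L, Arrays.asList(21.0, 2.0)));

            FPGrowthModel<Double> model = mine(sc.parallelizePairs(data), 0.2);

            // The item type is now Double, not Tuple2<Long, List<Double>>.
            for (FPGrowth.FreqItemset<Double> itemset
                    : model.freqItemsets().toJavaRDD().collect()) {
                System.out.println(itemset.javaItems() + ", " + itemset.freq());
            }
        } finally {
            sc.stop();
        }
    }
}
```

The timestamps are simply discarded here; if they matter for your analysis, this shortcut is probably not the right conversion.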

Each MLlib implementation typically has its own input data structure, and
you have to convert your data to it before invoking the algorithm. For
example, if you were doing regression on transactions, you would convert it
to an RDD of LabeledPoint using transactions.map(). If you wanted
clustering, you would convert it to an RDD of Vector.

Taking a step back, and without knowing what you want to accomplish: what
your FP-growth snippet will tell you is which sensor values occur together
most frequently. That may or may not be what you are looking for.
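
To make "occur together most frequently" concrete, here is a small plain-Java illustration that needs no Spark. It is a brute-force count of (sensor1value, sensor2value) pairs, not the FP-growth algorithm itself, and the class name, method name and sample readings are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CoOccurrence {

    // Counts in how many transactions each (sensor1value, sensor2value) pair appears
    // and keeps the pairs whose share of all transactions is at least minSupport.
    static Map<String, Integer> frequentPairs(List<List<Double>> transactions,
                                              double minSupport) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<Double> t : transactions) {
            counts.merge(t.get(0) + " & " + t.get(1), 1, Integer::sum);
        }
        double minCount = Math.ceil(minSupport * transactions.size());
        counts.values().removeIf(c -> c < minCount);
        return counts;
    }

    public static void main(String[] args) {
        List<List<Double>> transactions = new ArrayList<>();
        transactions.add(Arrays.asList(20.5, 1.0));
        transactions.add(Arrays.asList(20.5, 1.0));
        transactions.add(Arrays.asList(21.0, 2.0));
        transactions.add(Arrays.asList(20.5, 3.0));
        // prints {20.5 & 1.0=2}
        System.out.println(frequentPairs(transactions, 0.5));
    }
}
```

Note that the values are matched exactly, so continuous readings that rarely repeat would yield few or no frequent itemsets unless they are first grouped into ranges.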

Regards
Sab
On 30-Oct-2015 3:00 am, "Fernando Paladini" <[email protected]> wrote:

> Hello guys!
>
> First of all, if you want a more readable version of this question, take
> a look at my StackOverflow question
> <http://stackoverflow.com/questions/33422560/how-to-run-fpgrowth-algorithm-with-a-javapairrdd-object>
> (I've asked the same question there).
>
> I want to test Spark machine learning algorithms and I have some questions
> on how to run these algorithms with non-native data types. I'm going to run
> FPGrowth algorithm over the input because I want to get the most frequent
> itemsets for this input.
>
> *My data is laid out as follows:*
>
> [timestamp, sensor1value, sensor2value] # id: 0
> [timestamp, sensor1value, sensor2value] # id: 1
> [timestamp, sensor1value, sensor2value] # id: 2
> [timestamp, sensor1value, sensor2value] # id: 3
> ...
>
> As I need to use Java (because Python doesn't have a lot of machine
> learning algorithms from Spark), this data structure isn't very easy to
> handle / create.
>
> *To achieve this data structure in Java I can visualize two approaches:*
>
>    1. Use existing Java classes and data types to structure the input (I
>    think some problems can occur in Spark depending on how complex my
>    data is).
>    2. Create my own class (don't know if it works with Spark algorithms)
>
> 1. Existing Java classes and data types
>
> In order to do that I've created a *List<Tuple2<Long, List<Double>>>*, so
> I can keep my data structured and can also create an RDD:
>
> List<Tuple2<Long, List<Double>>> algorithm_data =
>     new ArrayList<Tuple2<Long, List<Double>>>();
> populate(algorithm_data);
> JavaPairRDD<Long, List<Double>> transactions =
>     sc.parallelizePairs(algorithm_data);
>
> I'm not comfortable with JavaPairRDD because the FPGrowth algorithm does not
> seem to accept this data structure, as I will show later in this post.
>
> 2. Create my own class
>
> I could also create a new class to store the input properly:
>
> public class PointValue {
>
>     private long timestamp;
>     private double sensorMeasure1;
>     private double sensorMeasure2;
>
>     // Constructor, getters and setters omitted...
> }
>
> However, I don't know if I can do that and still use it with Spark
> algorithms without any problems (in other words, running Spark algorithms
> without headaches). I'll focus on the first approach, but if you see that
> the second one is easier to achieve, please tell me.
>
> The solution (based on approach #1):
>
> // Initializing Spark
> SparkConf conf = new SparkConf().setAppName("FP-growth Example");
> JavaSparkContext sc = new JavaSparkContext(conf);
>
> // Getting data for ML algorithm
> List<Tuple2<Long, List<Double>>> algorithm_data =
>     new ArrayList<Tuple2<Long, List<Double>>>();
> populate(algorithm_data);
> JavaPairRDD<Long, List<Double>> transactions =
>     sc.parallelizePairs(algorithm_data);
>
> // Running FPGrowth
> FPGrowth fpg = new FPGrowth().setMinSupport(0.2).setNumPartitions(10);
> FPGrowthModel<Tuple2<Long, List<Double>>> model = fpg.run(transactions);
>
> // Printing everything
> for (FPGrowth.FreqItemset<Tuple2<Long, List<Double>>> itemset :
>         model.freqItemsets().toJavaRDD().collect()) {
>     System.out.println("[" + itemset.javaItems() + "], " + itemset.freq());
> }
>
> But then I got:
>
> *The method run(JavaRDD<Basket>) in the type FPGrowth is not applicable for 
> the arguments (JavaPairRDD<Long,List<Double>>)*
>
> *What can I do in order to solve my problem (run FPGrowth over
> JavaPairRDD)?*
>
> I'm available to give you more information, just tell me exactly what you
> need.
> Thank you!
> Fernando Paladini
>
