Hi, I’m quite new to Spark and MapReduce, but I need to get all distinct values with their respective counts from a transactional file. Let’s assume the following file format:

0 1 2 3 4 5 6 7
1 3 4 5 8 9
9 10 11 12 13 14 15 16 17 18
1 4 7 11 12 13 19 20
3 4 7 11 15 20 21 22 23
1 2 5 9 11 12 16

Given this, I would like to get back a collection of (String, Integer) pairs, for example a Map<String, Integer>, where the String is the item identifier and the Integer is the count of that item identifier in the file.

The following is what I came up with to map the values, but I can’t figure out how to do the counting :(

    import java.util.ArrayList;
    import com.google.common.collect.Lists;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;

    // create an RDD holding one ArrayList of item strings per input line
    JavaRDD<ArrayList<String>> transactions = sc.textFile(dataPath).map(
        new Function<String, ArrayList<String>>() {
            private static final long serialVersionUID = 1L;

            @Override
            public ArrayList<String> call(String s) {
                // split the line on single spaces into its item identifiers
                return Lists.newArrayList(s.split(" "));
            }
        }
    );

Any ideas?

Thanks!
Patrick
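
One possible way to think about the counting is the usual word-count pattern: flatten every line into individual items, pair each item with 1, then sum the 1s per item. The sketch below is plain Java with no Spark dependency, so it only illustrates the logic on a single machine. The class name ItemCounts and the method count are made up for this example, and the sample lines assume the file has one transaction per line. On an RDD the corresponding steps would be flatMap, mapToPair and reduceByKey.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Spark: counts items locally with Java streams.
class ItemCounts {

    // Returns item identifier -> number of occurrences across all lines.
    static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
                // like flatMap on an RDD: one stream of items instead of one list per line
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                // skip the empty token produced by a blank line
                .filter(item -> !item.isEmpty())
                // like mapToPair + reduceByKey: emit (item, 1) and sum the values per item
                .collect(Collectors.toMap(item -> item, item -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        // Sample data from the question; the line breaks are an assumption.
        List<String> lines = Arrays.asList(
                "0 1 2 3 4 5 6 7",
                "1 3 4 5 8 9",
                "9 10 11 12 13 14 15 16 17 18",
                "1 4 7 11 12 13 19 20",
                "3 4 7 11 15 20 21 22 23",
                "1 2 5 9 11 12 16");

        // TreeMap orders keys as strings, so "10" sorts before "2".
        count(lines).forEach((item, n) -> System.out.println(item + " -> " + n));
    }
}
```

If the number of distinct items is small enough to fit in driver memory, calling countByValue() on a flattened JavaRDD<String> may be a shorter route in Spark itself, since it returns a Map<String, Long> directly. For a large item set, keeping the result distributed as a JavaPairRDD via reduceByKey is probably the safer choice.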