How can I access data on RDDs?

2015-10-05 Thread jatinganhotra
Consider the following 2 scenarios: *Scenario #1* val pagecounts = sc.textFile("data/pagecounts") pagecounts.checkpoint pagecounts.count *Scenario #2* val pagecounts = sc.textFile("data/pagecounts") pagecounts.count The total time shown in the Spark shell Application UI was different for both sce

Checkpointing RDD calls the job twice?

2015-10-17 Thread jatinganhotra
Hi, I noticed that when you checkpoint a given RDD, the action is performed twice: I can see 2 jobs being executed in the Spark UI. Example: val logFile = "/data/pagecounts" sc.setCheckpointDir("/checkpoints") val logData = sc.textFile(logFile, 2) val as = logData.filter(line => lin

How to debug Spark source using IntelliJ/ Eclipse

2015-12-05 Thread jatinganhotra
Hi, I am trying to understand Spark's internal code and want to debug the Spark source in order to add a new feature. I have tried the steps outlined on the Spark Wiki page IDE setup, but they do