Hi,

I am trying to load a 931 MB file into an RDD, then create a DataFrame and store the data in a Parquet file. The Parquet save call hangs. I have set the timeout to 1800, but the system still fails to respond. I can't spot any errors in my code. Can someone help me? Thanks in advance.
Environment

1. OS X 10.10.5 with 8G RAM
2. JDK 1.8.0_60

Code

    final SQLContext sqlContext = new SQLContext(jsc);

    //convert user viewing history to ratings (hash user_id to int)
    JavaRDD<Rating> ratingJavaRDD = createMappedRatingsRDD(jsc);

    //for testing with 2d_full.txt data
    //JavaRDD<Rating> ratingJavaRDD = createMappedRatingRDDFromFile(jsc);

    JavaRDD<Row> ratingRowsRDD = ratingJavaRDD.map(new GenericRowFromRating());
    ratingRowsRDD.cache();

    //This line saves the files correctly
    ratingJavaRDD.saveAsTextFile("file:///Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd");

    final DataFrame ratingDF = sqlContext.createDataFrame(ratingRowsRDD, getStructTypeForRating());
    ratingDF.registerTempTable("rating_db");
    ratingDF.show();
    ratingDF.cache();

    //this line hangs
    ratingDF.write().format("parquet").save("file:///Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings.parquet");

The text output is written correctly:

    wks-195:rec-spark-java-poc r.viswanadha$ ls -lah /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-0000*
    -rw-r--r-- 1 r.viswanadha staff 785K Oct 22 18:55 /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00000
    -rw-r--r-- 1 r.viswanadha staff 790K Oct 22 18:55 /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00001
    -rw-r--r-- 1 r.viswanadha staff 786K Oct 22 18:55 /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00002
    -rw-r--r-- 1 r.viswanadha staff 796K Oct 22 18:55 /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00003
    -rw-r--r-- 1 r.viswanadha staff 791K Oct 22 18:55 /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00004

    wks-195:rec-spark-java-poc r.viswanadha$ ls -lah /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings.parquet/_temporary/0/

The only thing that is saved is the temporary part file:

    wks-195:rec-spark-java-poc r.viswanadha$ ls -lah /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings.parquet/_temporary/0/task_201510221857_0007_m_000000/
    total 336
    drwxr-xr-x 4 r.viswanadha staff 136B Oct 22 18:57 .
    drwxr-xr-x 4 r.viswanadha staff 136B Oct 22 18:57 ..
    -rw-r--r-- 1 r.viswanadha staff 1.3K Oct 22 18:57 .part-r-00000-65562f67-357c-4645-8075-13b733a71ee5.gz.parquet.crc
    -rw-r--r-- 1 r.viswanadha staff 163K Oct 22 18:57 part-r-00000-65562f67-357c-4645-8075-13b733a71ee5.gz.parquet

Active Stages (1)

    Stage Id   Description   Submitted   Duration   Tasks: Succeeded/Total   Input   Output   Shuffle Read   Shuffle Write
    7   save at Recommender.java:549   2015/10/22 18:57:15   17 min   1/5   9.4 MB

Best Regards,
Ram