Hi,
I am trying to load a 931 MB file into an RDD, create a DataFrame from it, and
store the data in a Parquet file. The Parquet save call hangs. I have set the
timeout to 1800, but the job still does not respond and never finishes. I
can't spot any errors in my code. Can someone help me? Thanks in advance.
Environment
1. OS X 10.10.5 with 8 GB RAM
2. JDK 1.8.0_60
Code
final SQLContext sqlContext = new SQLContext(jsc);
//convert user viewing history to ratings (hash user_id to int)
JavaRDD<Rating> ratingJavaRDD = createMappedRatingsRDD(jsc);
//for testing with 2d_full.txt data
//JavaRDD<Rating> ratingJavaRDD = createMappedRatingRDDFromFile(jsc);
JavaRDD<Row> ratingRowsRDD = ratingJavaRDD.map(new GenericRowFromRating());
ratingRowsRDD.cache();
//This line saves the files correctly
ratingJavaRDD.saveAsTextFile("file:///Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd");
final DataFrame ratingDF = sqlContext.createDataFrame(ratingRowsRDD,
getStructTypeForRating());
ratingDF.registerTempTable("rating_db");
ratingDF.show();
ratingDF.cache();
//this line hangs
ratingDF.write().format("parquet").save("file:///Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings.parquet");
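For context on the "hash user_id to int" comment in the code above, here is a minimal sketch of one way such a mapping could be done. It is not the actual code inside createMappedRatingsRDD. The class UserIdHasher and the method hashUserId are hypothetical names used only for illustration, and the sketch assumes that user IDs are strings and that occasional collisions between different IDs are acceptable.

```java
// Hypothetical helper, for illustration only: maps a string user ID to a
// non-negative int, as needed when a rating type requires an int user field.
final class UserIdHasher {

    private UserIdHasher() {
        // static utility, not meant to be instantiated
    }

    static int hashUserId(String userId) {
        // String.hashCode() is specified by the Java API, so the same ID always
        // yields the same value. Masking with Integer.MAX_VALUE clears the sign
        // bit; Math.abs is avoided because Math.abs(Integer.MIN_VALUE) is
        // still negative.
        return userId.hashCode() & Integer.MAX_VALUE;
    }
}
```

Different user IDs can map to the same int with this approach, so it only fits cases where a small number of merged users is tolerable.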
wks-195:rec-spark-java-poc r.viswanadha$ ls -lah /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-0000*
-rw-r--r-- 1 r.viswanadha staff 785K Oct 22 18:55 /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00000
-rw-r--r-- 1 r.viswanadha staff 790K Oct 22 18:55 /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00001
-rw-r--r-- 1 r.viswanadha staff 786K Oct 22 18:55 /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00002
-rw-r--r-- 1 r.viswanadha staff 796K Oct 22 18:55 /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00003
-rw-r--r-- 1 r.viswanadha staff 791K Oct 22 18:55 /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00004
wks-195:rec-spark-java-poc r.viswanadha$ ls -lah /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings.parquet/_temporary/0/
The only thing that is saved is the temporary part file:
wks-195:rec-spark-java-poc r.viswanadha$ ls -lah /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings.parquet/_temporary/0/task_201510221857_0007_m_000000/
total 336
drwxr-xr-x 4 r.viswanadha staff 136B Oct 22 18:57 .
drwxr-xr-x 4 r.viswanadha staff 136B Oct 22 18:57 ..
-rw-r--r-- 1 r.viswanadha staff 1.3K Oct 22 18:57 .part-r-00000-65562f67-357c-4645-8075-13b733a71ee5.gz.parquet.crc
-rw-r--r-- 1 r.viswanadha staff 163K Oct 22 18:57 part-r-00000-65562f67-357c-4645-8075-13b733a71ee5.gz.parquet
Active Stages (1) in the Spark UI
Stage Id: 7
Description: save at Recommender.java:549
Submitted: 2015/10/22 18:57:15
Duration: 17 min
Tasks (Succeeded/Total): 1/5
Input: 9.4 MB
Best Regards,
Ram