Hi,
I am trying to load a 931 MB file into an RDD, create a DataFrame from it, and
store the data in a Parquet file. The Parquet save call hangs. I have set the
timeout to 1800, but the system still stops responding. I can’t spot any
errors in my code. Can someone help me? Thanks in advance.

Environment

  1.  OS X 10.10.5 with 8 GB RAM
  2.  JDK 1.8.0_60

Code


final SQLContext sqlContext = new SQLContext(jsc);
// Convert user viewing history to ratings (hash user_id to int)
JavaRDD<Rating> ratingJavaRDD = createMappedRatingsRDD(jsc);
// For testing with the 2d_full.txt data
// JavaRDD<Rating> ratingJavaRDD = createMappedRatingRDDFromFile(jsc);
JavaRDD<Row> ratingRowsRDD = ratingJavaRDD.map(new GenericRowFromRating());
ratingRowsRDD.cache();

// This line saves the files correctly
ratingJavaRDD.saveAsTextFile("file:///Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd");

final DataFrame ratingDF = sqlContext.createDataFrame(ratingRowsRDD, getStructTypeForRating());
ratingDF.registerTempTable("rating_db");
ratingDF.show();
ratingDF.cache();

// This line hangs
ratingDF.write().format("parquet").save("file:///Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings.parquet");


wks-195:rec-spark-java-poc r.viswanadha$ ls -lah /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-0000*
-rw-r--r--  1 r.viswanadha  staff   785K Oct 22 18:55 /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00000
-rw-r--r--  1 r.viswanadha  staff   790K Oct 22 18:55 /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00001
-rw-r--r--  1 r.viswanadha  staff   786K Oct 22 18:55 /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00002
-rw-r--r--  1 r.viswanadha  staff   796K Oct 22 18:55 /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00003
-rw-r--r--  1 r.viswanadha  staff   791K Oct 22 18:55 /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings_rdd/part-00004

wks-195:rec-spark-java-poc r.viswanadha$ ls -lah /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings.parquet/_temporary/0/

The only thing that is saved is the temporary part file:

wks-195:rec-spark-java-poc r.viswanadha$ ls -lah /Users/r.viswanadha/Documents/workspace/rec-spark-java-poc/output/ratings.parquet/_temporary/0/task_201510221857_0007_m_000000/
total 336
drwxr-xr-x  4 r.viswanadha  staff   136B Oct 22 18:57 .
drwxr-xr-x  4 r.viswanadha  staff   136B Oct 22 18:57 ..
-rw-r--r--  1 r.viswanadha  staff   1.3K Oct 22 18:57 .part-r-00000-65562f67-357c-4645-8075-13b733a71ee5.gz.parquet.crc
-rw-r--r--  1 r.viswanadha  staff   163K Oct 22 18:57 part-r-00000-65562f67-357c-4645-8075-13b733a71ee5.gz.parquet


Active Stages (1)

Stage Id:                 7
Description:              save at Recommender.java:549
Submitted:                2015/10/22 18:57:15
Duration:                 17 min
Tasks (Succeeded/Total):  1/5
Input:                    9.4 MB

Best Regards,
Ram
