Read a text file (1 record contains 4 lines) into an RDD

2014-10-25 Thread Parthus
Hi, it might be a naive question, but I still hope that somebody can help me handle it. I have a text file in which every 4 lines represent one record. Since the SparkContext.textFile() API treats each line as a record, it does not fit my case. I know that SparkContext.hadoopFile or newAPIHadoopFile...
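
A minimal sketch of one way to do this without a custom InputFormat: read the file line by line, tag each line with its global index via zipWithIndex, and regroup every 4 consecutive lines into one record. The input path is hypothetical, and this assumes record boundaries always fall on multiples of 4 lines.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits on older Spark versions

object FourLineRecords {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("four-line-records"))

    val lines = sc.textFile("hdfs:///path/to/input.txt")      // hypothetical path
    val records = lines
      .zipWithIndex()                                          // (line, global line index)
      .map { case (line, idx) => (idx / 4, (idx % 4, line)) }  // key = record number
      .groupByKey()
      .map { case (_, parts) =>
        parts.toSeq.sortBy(_._1).map(_._2).mkString("\n")      // reassemble lines in order
      }

    records.take(2).foreach(println)
    sc.stop()
  }
}

The Hadoop InputFormat route mentioned in the message avoids the shuffle that groupByKey introduces, so it scales better; the version above is just the shortest way to get correct 4-line records.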

How to write an RDD into one existing local file?

2014-10-17 Thread Parthus
Hi, I have a Spark MapReduce task which requires me to write the final RDD to an existing local file (appending to that file). I tried two ways, but neither works well: 1. Use the saveAsTextFile() API. Spark 1.1.0 claims that this API can write to a local path, but I never got it to work. Moreover, the result...
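
One minimal sketch, assuming the final RDD is small enough to bring back to the driver: collect it and append to the local file with a plain java.io writer opened in append mode. The function name is illustrative, not a standard Spark API.

import java.io.{BufferedWriter, FileWriter}
import org.apache.spark.rdd.RDD

// Append every element of the RDD as one line to an existing local file.
// Assumes the collected result fits in driver memory.
def appendToLocalFile(rdd: RDD[String], path: String): Unit = {
  val writer = new BufferedWriter(new FileWriter(path, true))  // true = append mode
  try {
    rdd.collect().foreach { line =>
      writer.write(line)
      writer.newLine()
    }
  } finally {
    writer.close()
  }
}

saveAsTextFile() always creates a new directory of part files on whatever filesystem the path points to, so appending to a single existing file has to happen on the driver side as above.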

How to transform large local files into Parquet format and write them into HDFS?

2014-08-14 Thread Parthus
Hi there, I have several large files (500 GB per file) to transform into Parquet format and write to HDFS. The problems I encountered can be described as follows: 1) At first, I tried to load all the records in a file and then used "sc.parallelize(data)" to generate an RDD, and finally used "saveAsNew...
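
A minimal sketch of the streaming alternative to sc.parallelize, assuming Spark 1.1-era APIs (SQLContext, SchemaRDD, saveAsParquetFile) and a hypothetical tab-separated "id<TAB>value" record layout: read the file with textFile so it is processed partition by partition instead of being loaded into driver memory first.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record layout: one "id<TAB>value" pair per line.
case class Record(id: Long, value: String)

object ToParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("to-parquet"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit RDD[Record] -> SchemaRDD (Spark 1.x)

    // textFile streams the input partition by partition, so the 500 GB file
    // never has to fit in driver memory the way sc.parallelize(data) requires.
    val records = sc.textFile("file:///data/huge_input.txt")   // hypothetical path
      .map(_.split("\t"))
      .map(parts => Record(parts(0).toLong, parts(1)))

    records.saveAsParquetFile("hdfs:///warehouse/huge_input.parquet")
    sc.stop()
  }
}

A file:// path only works if every worker can see the same local file; copying the raw input to HDFS first and reading it from there sidesteps that problem.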

Create a new object from a given ClassTag

2014-08-04 Thread Parthus
Hi there, I was wondering if somebody could tell me how to create an object from a given ClassTag so as to make the function below work. The only thing left to do is to write one line that creates an object of class T. I tried new T but it does not work. Would it be possible to give me one Scala line to f...
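
A minimal sketch, assuming T has a public no-argument constructor (reflection cannot instantiate it otherwise): go through the ClassTag's runtimeClass and cast the result back to T.

import scala.reflect.ClassTag

// Create an instance of T using the runtime class carried by its ClassTag.
// Requires T to have a public no-arg constructor.
def createInstance[T](implicit tag: ClassTag[T]): T =
  tag.runtimeClass.newInstance().asInstanceOf[T]

// Usage:
class Foo { override def toString = "Foo()" }
val foo = createInstance[Foo]   // Foo()

The single line asked for in the message is the tag.runtimeClass.newInstance().asInstanceOf[T] expression; the rest is scaffolding around it.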

What if there are large, read-only variables shared by all map functions?

2014-07-22 Thread Parthus
Hi there, I was wondering if anybody could help me find an efficient way to write a MapReduce program like this: 1) Each map function needs to access some huge files, which are around 6 GB; 2) These files are READ-ONLY. They are essentially a huge look-up table that will not change during...
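
A broadcast variable is the usual answer here. Below is a minimal sketch, assuming the look-up table can be loaded on the driver and that executors have enough memory to hold one copy each (a 6 GB broadcast needs correspondingly large executors). The paths and the loadLookupTable helper are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-lookup"))

    // Hypothetical loader: builds the read-only look-up table on the driver.
    val lookup: Map[String, Int] = loadLookupTable("/data/lookup.bin")

    // broadcast() ships the table to each executor once and caches it there,
    // instead of re-serializing it into every task closure.
    val lookupBc = sc.broadcast(lookup)

    val result = sc.textFile("hdfs:///path/to/input")
      .map(key => lookupBc.value.getOrElse(key, -1))

    result.saveAsTextFile("hdfs:///path/to/output")
    sc.stop()
  }

  // Placeholder; the real table in the message is around 6 GB.
  def loadLookupTable(path: String): Map[String, Int] = Map.empty
}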

Convert from RDD[Object] to RDD[Array[Object]]

2014-07-12 Thread Parthus
Hi there, I have a bunch of data in an RDD, which I previously processed one element at a time. For example, there was an RDD denoted by "data: RDD[Object]" and I processed it using "data.map(...)". However, I got a new requirement to process the data in a batched way. This means that I need to convert...
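
A minimal sketch of the conversion, assuming fixed-size batches are acceptable: group elements inside each partition with mapPartitions and Iterator.grouped, so no shuffle is needed. The batch size and the processBatch call in the usage note are hypothetical.

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Turn an RDD of single elements into an RDD of fixed-size batches,
// grouping only within each partition (the last batch per partition may be smaller).
def toBatches[T: ClassTag](data: RDD[T], batchSize: Int): RDD[Array[T]] =
  data.mapPartitions(_.grouped(batchSize).map(_.toArray))

// Usage (hypothetical):
// val batched: RDD[Array[MyObject]] = toBatches(data, 128)
// batched.map(batch => processBatch(batch))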