I think the new API sc.binaryRecords [1] (added in 1.2) can help in this case.
[1]
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords
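binaryRecords is available on both the Python and Scala SparkContext, and it assumes every record has the same fixed length. For the layout described below (8 + 16 + 4 + 1 = 29 bytes per record) a minimal Scala sketch could look like this; the path and record length are only placeholders for illustration:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("fixed-length-records"))

// Assumed: every record is exactly 29 bytes (8 + 16 + 4 + 1), per the header
// described in the original message. Adjust recordLength to the real layout.
val recordLength = 29
val records = sc.binaryRecords("hdfs:///data/binary/*.bin", recordLength)  // RDD[Array[Byte]]

// Each element is one raw record; the fields can be decoded downstream.
println(records.count())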
Davies
On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz wrote:
> Hi,
>
> I want to process some files, there're a
If you are only concerned about partitions being too big, you can specify the
number of partitions as an additional parameter when loading files from HDFS.
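For example (a sketch; the paths and partition count are made up, and sc is the existing SparkContext, e.g. from spark-shell):

// Ask for at least 500 partitions up front so no single partition gets too large.
val lines = sc.textFile("hdfs:///data/big/input", minPartitions = 500)

// Loaders without that parameter (such as binaryRecords) can be repartitioned after loading.
val records = sc.binaryRecords("hdfs:///data/big/records.bin", 29).repartition(500)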
On Fri, Jan 30, 2015 at 9:47 AM, Sven Krasser wrote:
> You can also use your InputFormat/RecordReader in Spark, e.g. using
> newAPIHadoopFile. See her
You can also use your InputFormat/RecordReader in Spark, e.g. using
newAPIHadoopFile. See here:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
.
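A sketch of that approach in Scala, using Hadoop's FixedLengthInputFormat as a stand-in for whatever custom InputFormat/RecordReader you already have (the path and the 29-byte record length are assumptions):

import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.hadoop.mapreduce.lib.input.FixedLengthInputFormat

// sc is the existing SparkContext (e.g. from spark-shell).
val conf = new org.apache.hadoop.conf.Configuration(sc.hadoopConfiguration)
FixedLengthInputFormat.setRecordLength(conf, 29)  // assumed fixed record length

val raw = sc.newAPIHadoopFile(
  "hdfs:///data/binary/*.bin",
  classOf[FixedLengthInputFormat],
  classOf[LongWritable],    // byte offset of the record in the file
  classOf[BytesWritable],   // the record bytes
  conf)

// Hadoop reuses the Writable objects, so copy the bytes out before caching or shuffling.
val records = raw.map { case (_, value) => value.copyBytes() }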
-Sven
On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz wrote:
> Hi,
>
> I want to process some files, there're
Hi,
I want to process some files; they're kind of big, dozens of
gigabytes each. I get each one as an array of bytes, and there is a
structure inside them.
I have a header which describes the structure. It could be like:
Number(8 bytes) Char(16 bytes) Number(4 bytes) Char(1 byte), ..
This
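Once a record arrives as an Array[Byte], decoding the fields is plain JVM byte handling. A sketch for the layout above (the field names, charset, and byte order are guesses, not part of the original message):

import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Hypothetical field names for the layout Number(8) Char(16) Number(4) Char(1).
case class Record(id: Long, name: String, value: Int, flag: Char)

def decode(bytes: Array[Byte]): Record = {
  val buf = ByteBuffer.wrap(bytes)      // big-endian by default; adjust if the files differ
  val id = buf.getLong()                // 8-byte number
  val nameBytes = new Array[Byte](16)
  buf.get(nameBytes)                    // 16-byte char field
  val name = new String(nameBytes, StandardCharsets.US_ASCII).trim
  val value = buf.getInt()              // 4-byte number
  val flag = buf.get().toChar           // 1-byte char
  Record(id, name, value, flag)
}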