Hi Ayoub,
You could try using a SQL statement to set the compression codec:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqc = SQLContext(sc)
sqc.sql("SET spark.sql.parquet.compression.codec=gzip")
You get an on-screen notification while running the Spark job when you set
the compression codec like this. I haven't compared
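For what it's worth, here is a sketch of how the setting would be used for a
write; 'srdd' is an assumed SchemaRDD and the output path is a placeholder,
not from the original message:

sqc.sql("SET spark.sql.parquet.compression.codec=gzip")
srdd.saveAsParquetFile('/tmp/out.parquet')  # placeholder path; written with the codec set above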
I am trying to run a Spark application with

-Dspark.executor.memory=30g -Dspark.kryoserializer.buffer.max.mb=2000
-Dspark.akka.frameSize=1

and the job fails because one or more of the Akka frames is larger than
1 MB (12000 ish). When I change -Dspark.akka.frameSize=1 to 12000, I
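If it helps, the same properties can be set through SparkConf instead of -D
system properties; a rough sketch (the 128 MB value is only an illustration,
not from the original message):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.memory", "30g")
        .set("spark.kryoserializer.buffer.max.mb", "2000")
        .set("spark.akka.frameSize", "128"))  # in MB; must exceed the largest frame
sc = SparkContext(conf=conf)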
Hi Guys,
I'm running a Spark cluster on EC2 in AWS with Spark 1.1.0.
I am trying to convert an RDD of tuples of the form
(u'string', int, {(int, int): int, (int, int): int})
to a SchemaRDD using the schema:

fields = [StructField('field1', StringType(), True),
          StructField('field2', IntegerType(), True),
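One possible shape for the full schema, only a sketch: it assumes the
(int, int) keys are flattened into strings such as '1,2' first, since string
keys map cleanly onto MapType ('sqc' is an assumed SQLContext and 'rdd' holds
the tuples):

from pyspark.sql import StructType, StructField, StringType, IntegerType, MapType

schema = StructType([StructField('field1', StringType(), True),
                     StructField('field2', IntegerType(), True),
                     StructField('field3', MapType(StringType(), IntegerType()), True)])

def flattenKeys(rec):  # hypothetical helper, not from the original post
    s, i, d = rec
    return (s, i, dict(('%d,%d' % k, v) for k, v in d.items()))

srdd = sqc.applySchema(rdd.map(flattenKeys), schema)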
As a temporary fix, it works when I convert field six to a list manually. That
is:
def generateRecords(line):
    # input: the row stored in the Parquet file
    # output: a Python dictionary with all the key/value pairs
    field1 = line.field1
    summary = {}
    summary['
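A fuller sketch of that list-based workaround, with assumed field names: the
idea is to flatten {(i, j): v} into [(i, j, v), ...] so the column fits an
ArrayType of StructType instead of a tuple-keyed map:

from pyspark.sql import StructType, StructField, ArrayType, StringType, IntegerType

entryType = StructType([StructField('i', IntegerType(), True),
                        StructField('j', IntegerType(), True),
                        StructField('value', IntegerType(), True)])
schema = StructType([StructField('field1', StringType(), True),
                     StructField('field2', IntegerType(), True),
                     StructField('field3', ArrayType(entryType), True)])

def dictToList(rec):  # hypothetical helper
    field1, field2, field3 = rec
    return (field1, field2, [(k[0], k[1], v) for k, v in field3.items()])

srdd = sqc.applySchema(rdd.map(dictToList), schema)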
Hi Guys,
I used applySchema to store a set of nested dictionaries and lists in a
Parquet file.
http://apache-spark-user-list.1001560.n3.nabble.com/Using-sparkSQL-to-convert-a-collection-of-python-dictionary-of-dictionaries-to-schma-RDD-td20228.html#a20461
It was successful and I could successf
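Reading the file back is the mirror image; a minimal sketch, with a
placeholder path and 'sqc' as an assumed SQLContext:

loaded = sqc.parquetFile('/tmp/nested.parquet')  # placeholder path
loaded.registerTempTable('records')
sqc.sql('SELECT field1 FROM records').take(5)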
It worked, man. Thanks a lot :)
Hi Davies,
Thanks for the reply.
The problem is I have empty dictionaries in my field3 as well. It gives me
an error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/spark/python/pyspark/sql.py", line 1042, in inferSchema
    schema = _infer_schema(first)
  File "/root
Hi Guys,
I am trying to use Spark SQL to convert an RDD to a SchemaRDD so that I can
save it in Parquet format.
A record in my RDD has the following format:

RDD1
{
    field1: 5,
    field2: 'string',
    field3: {'a': 1, 'c': 2}
}
I am using field3 to represent a "sparse vector" and it can have keys
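When every dict in the column is non-empty, inference alone may be enough; a
sketch using Row, with assumed names and a placeholder path:

from pyspark.sql import SQLContext, Row

sqc = SQLContext(sc)
rdd = sc.parallelize([Row(field1=5, field2=u'string', field3={'a': 1, 'c': 2})])
srdd = sqc.inferSchema(rdd)
srdd.saveAsParquetFile('/tmp/sparse.parquet')  # placeholder path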
Hi,
I am trying to save an RDD to an S3 bucket using the
RDD.saveAsSequenceFile(self, path, CompressionCodec) function in Python. I
need to save the RDD in GZIP. Can anyone tell me how to pass the gzip codec
class as a parameter to the function?
I tried
RDD.saveAsSequenceFile('{0}{1}'.format(out
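For reference, the codec is normally passed as a fully qualified Hadoop class
name; a sketch with a placeholder bucket and path:

rdd.saveAsSequenceFile(
    's3n://my-bucket/output',  # placeholder path
    compressionCodecClass='org.apache.hadoop.io.compress.GzipCodec')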