Re: Parquet compression codecs not applied

2015-02-04 Thread sahanbull
Hi Ayoub, You could try using the SQL interface to set the compression type:

    sc = SparkContext()
    sqc = SQLContext(sc)
    sqc.sql("SET spark.sql.parquet.compression.codec=gzip")

You get a notification on screen while running the Spark job when you set the compression codec like this. I haven't compared
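A minimal end-to-end sketch of this approach for PySpark 1.x (the sample data and the output path 'out.parquet' are illustrative, not from the thread):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext()
    sqc = SQLContext(sc)

    # Ask Spark SQL to gzip-compress subsequent Parquet writes
    sqc.sql("SET spark.sql.parquet.compression.codec=gzip")

    # Any SchemaRDD saved after this point should come out gzip-compressed
    srdd = sqc.inferSchema(sc.parallelize([{'a': 1}, {'a': 2}]))
    srdd.saveAsParquetFile('out.parquet')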

Problem with changing the akka.frameSize parameter

2015-02-04 Thread sahanbull
I am trying to run a Spark application with -Dspark.executor.memory=30g -Dspark.kryoserializer.buffer.max.mb=2000 -Dspark.akka.frameSize=1 and the job fails because one or more of the Akka frames are larger than 1MB (12000 ish). When I change -Dspark.akka.frameSize=1 to 12000,1
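For reference, the same options can be set through SparkConf rather than -D system properties; a short sketch (the frame size value of 200 here is illustrative, not from the thread — spark.akka.frameSize is specified in MB in Spark 1.x):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.executor.memory", "30g")
            .set("spark.kryoserializer.buffer.max.mb", "2000")
            # spark.akka.frameSize is in MB; 200 is an illustrative value
            .set("spark.akka.frameSize", "200"))
    sc = SparkContext(conf=conf)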

Error when Applying schema to a dictionary with a Tuple as key

2014-12-16 Thread sahanbull
Hi Guys, I'm running a Spark cluster on AWS EC2 with Spark 1.1.0. I am trying to convert an RDD of tuples (u'string', int, {(int, int): int, (int, int): int}) to a schema RDD using the schema: fields = [StructField('field1',StringType(),True), StructField('field2',Intege
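Spark SQL maps need a single, simple key type, so the tuple keys are the likely sticking point. One hedged workaround sketch (the field names and the "a_b" string encoding of the tuple keys are assumptions, not from the thread): flatten each tuple key to a string before applying a MapType(StringType(), IntegerType()) schema:

    from pyspark import SparkContext
    from pyspark.sql import (SQLContext, StructType, StructField,
                             StringType, IntegerType, MapType)

    sc = SparkContext()
    sqc = SQLContext(sc)

    rdd = sc.parallelize([
        (u'string', 5, {(1, 2): 3, (4, 5): 6}),
    ])

    def flatten_keys(record):
        # Turn a tuple key like (1, 2) into the string "1_2" so the map
        # fits MapType(StringType(), IntegerType())
        s, n, d = record
        return (s, n, dict(("%d_%d" % k, v) for k, v in d.items()))

    schema = StructType([
        StructField('field1', StringType(), True),
        StructField('field2', IntegerType(), True),
        StructField('field3', MapType(StringType(), IntegerType()), True),
    ])
    srdd = sqc.applySchema(rdd.map(flatten_keys), schema)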

Re: Error when mapping a schema RDD when converting lists

2014-12-08 Thread sahanbull
As a temporary fix, it works when I convert field six to a list manually. That is:

    def generateRecords(line):
        # input : the row stored in parquet file
        # output : a python dictionary with all the key value pairs
        field1 = line.field1
        summary = {}
        summary['
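A guess at the fuller shape of that workaround (the field names, in particular field6 standing in for "field six", are assumptions, and srdd is assumed to be the SchemaRDD read back from Parquet):

    def generateRecords(line):
        # input : a row read back from the parquet file
        # output: a python dictionary with all the key-value pairs
        summary = {}
        summary['field1'] = line.field1
        # temporary fix: coerce the array-typed column back to a plain list
        summary['field6'] = list(line.field6)
        return summary

    records = srdd.map(generateRecords)  # srdd: SchemaRDD read from parquet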

Error when mapping a schema RDD when converting lists

2014-12-08 Thread sahanbull
Hi Guys, I used applySchema to store a set of nested dictionaries and lists in a parquet file. http://apache-spark-user-list.1001560.n3.nabble.com/Using-sparkSQL-to-convert-a-collection-of-python-dictionary-of-dictionaries-to-schma-RDD-td20228.html#a20461 It was successful and I could successf
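The write/read round trip being described, as a short sketch (the two-field schema, sample data, and path are illustrative stand-ins for the nested data in the linked thread):

    from pyspark import SparkContext
    from pyspark.sql import (SQLContext, StructType, StructField,
                             IntegerType, ArrayType)

    sc = SparkContext()
    sqc = SQLContext(sc)

    schema = StructType([
        StructField('field1', IntegerType(), True),
        StructField('field6', ArrayType(IntegerType()), True),
    ])
    rdd = sc.parallelize([(1, [2, 3]), (4, [5])])

    srdd = sqc.applySchema(rdd, schema)
    srdd.saveAsParquetFile('nested.parquet')  # illustrative path

    # Reading it back yields a SchemaRDD of Row objects
    loaded = sqc.parquetFile('nested.parquet')
    loaded.map(lambda row: row.field1).collect()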

Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schema RDD

2014-12-05 Thread sahanbull
It worked man.. Thanks a lot :) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-sparkSQL-to-convert-a-collection-of-python-dictionary-of-dictionaries-to-schma-RDD-tp20228p20461.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schema RDD

2014-12-04 Thread sahanbull
Hi Davies, Thanks for the reply. The problem is I have empty dictionaries in my field3 as well. It gives me an error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/root/spark/python/pyspark/sql.py", line 1042, in inferSchema
        schema = _infer_schema(first)
      File "/root
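inferSchema deduces field3's value types from the first record it inspects, so an empty dict leaves it nothing to infer from. A hedged workaround sketch (field names and types are assumptions based on this thread): declare the schema explicitly and use applySchema, which tolerates empty maps:

    from pyspark import SparkContext
    from pyspark.sql import (SQLContext, StructType, StructField,
                             IntegerType, StringType, MapType)

    sc = SparkContext()
    sqc = SQLContext(sc)

    # Note the empty dict in the second record, which defeats inferSchema
    rdd = sc.parallelize([
        {'field1': 5, 'field2': 'a', 'field3': {'a': 1}},
        {'field1': 6, 'field2': 'b', 'field3': {}},
    ])

    schema = StructType([
        StructField('field1', IntegerType(), True),
        StructField('field2', StringType(), True),
        StructField('field3', MapType(StringType(), IntegerType()), True),
    ])
    # applySchema expects rows as tuples in schema order
    rows = rdd.map(lambda d: (d['field1'], d['field2'], d['field3']))
    srdd = sqc.applySchema(rows, schema)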

Using sparkSQL to convert a collection of python dictionary of dictionaries to schema RDD

2014-12-03 Thread sahanbull
Hi Guys, I am trying to use SparkSQL to convert an RDD to a SchemaRDD so that I can save it in parquet format. A record in my RDD has the following format:

    RDD1 { field1: 5, field2: 'string', field3: {'a': 1, 'c': 2} }

I am using field3 to represent a "sparse vector" and it can have keys
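In Spark 1.1 this can be done with inferSchema directly on an RDD of dicts, provided the record it samples has a non-empty field3 (sample data and output path are illustrative):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext()
    sqc = SQLContext(sc)

    rdd = sc.parallelize([
        {'field1': 5, 'field2': 'string', 'field3': {'a': 1, 'c': 2}},
    ])

    # inferSchema deduces column types from the first record, so field3
    # must be a non-empty dict there
    srdd = sqc.inferSchema(rdd)
    srdd.saveAsParquetFile('records.parquet')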

Using a compression codec in saveAsSequenceFile in Pyspark (Python API)

2014-11-13 Thread sahanbull
Hi, I am trying to save an RDD to an S3 bucket using the RDD.saveAsSequenceFile(self, path, compressionCodecClass) function in Python. I need to save the RDD in gzip. Can anyone tell me how to send the gzip codec class as a parameter into the function? I tried RDD.saveAsSequenceFile('{0}{1}'.format(out
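In PySpark the codec is passed as the fully qualified Hadoop codec class name via the compressionCodecClass argument; a minimal sketch (the bucket path and sample data are illustrative):

    from pyspark import SparkContext

    sc = SparkContext()
    rdd = sc.parallelize([(1, 'a'), (2, 'b')])  # (key, value) pairs

    # Pass the codec as the fully qualified Hadoop class name string
    rdd.saveAsSequenceFile(
        's3n://my-bucket/output',  # illustrative path
        compressionCodecClass='org.apache.hadoop.io.compress.GzipCodec')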