Re: Saving Parquet files to S3

2016-06-10 Thread Bijay Kumar Pathak
Hi Ankur, I also tried setting a property to write Parquet files of 256 MB. I am using PySpark; below is how I set the property, but it's not working for me. How did you set it? spark_context._jsc.hadoopConfiguration().setInt("dfs.blocksize", 268435456) spark_context._jsc.hadoopConfi…
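A minimal sketch of the approach described above, assuming the truncated second line was setting the Parquet row-group size alongside the HDFS block size (that second property, and the app name, are assumptions, not from the original mail):

    from pyspark import SparkContext

    sc = SparkContext(appName="parquet-block-size")
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.setInt("dfs.blocksize", 268435456)       # 256 MB HDFS block size
    hadoop_conf.setInt("parquet.block.size", 268435456)  # 256 MB Parquet row group (assumed)

Note that when writing directly to S3 rather than HDFS, dfs.blocksize does not control the output object size; the row-group size and the number of partitions in the DataFrame are what matter.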

Re: Error joining dataframes

2016-05-17 Thread Bijay Kumar Pathak
Hi, Try this one: df_join = df1.join(df2, 'Id', "fullouter") Thanks, Bijay On Tue, May 17, 2016 at 9:39 AM, ram kumar wrote: > Hi, > I tried to join two dataframes > df_join = df1.join(df2, ((df1("Id") === df2("Id")), "fullouter") > df_join.registerTempTable("join_test") > When…
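For context, a short PySpark sketch contrasting the two styles (DataFrame names are the ones in the thread, everything else is illustrative): joining on the shared column name, as suggested above, versus joining on an explicit condition, where the Scala-style === in the quoted mail becomes == in Python:

    # Join on the shared column name -- a single 'Id' column in the result.
    df_join = df1.join(df2, 'Id', 'fullouter')

    # Join on an explicit condition -- both Id columns are kept.
    df_join_expr = df1.join(df2, df1.Id == df2.Id, 'fullouter')

    df_join.registerTempTable("join_test")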

Disable parquet metadata summary in …

2016-05-05 Thread Bijay Kumar Pathak
Hi, How can we disable writing _common_metadata while saving a DataFrame in Parquet format in PySpark? I tried to set the property using the command below, but it didn't help. sparkContext._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false") Thanks, Bijay
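A minimal sketch of the same attempt, assuming the property has to be set on the SparkContext's Hadoop configuration before the write is issued (the output path is a placeholder):

    sc._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false")
    df.write.mode("overwrite").parquet("s3://bucket/output")  # placeholder path

If the setting takes effect, Parquet skips writing the _metadata and _common_metadata summary files in the output directory.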

Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Bijay Kumar Pathak
> …running on a 64-bit JVM with less than 32G heap, you might want to enable -XX:+UseCompressedOops [1]. And if your dataframe is somehow generating more than 2^31-1 arrays, you might have to rethink your options. > [1] https://spark.apache.org/docs/latest/tuning.html
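A sketch of how the suggested flag could be passed to the executors from PySpark; the driver-side flag generally has to go on the spark-submit command line instead, since the driver JVM is already running by the time application code sets the conf:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops")
    sc = SparkContext(conf=conf)

    # Driver side, at submit time:
    #   spark-submit --conf "spark.driver.extraJavaOptions=-XX:+UseCompressedOops" job.py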

SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Bijay Kumar Pathak
Hi, I am reading a Parquet file of around 50+ GB which has 4013 partitions and 240 columns. Below is my configuration: driver: 20 GB memory with 4 cores; executors: 45 executors with 15 GB memory and 4 cores each. I tried to read the data using both the DataFrame read and the Hive context to read the data us…
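A sketch of the two read paths mentioned above (paths and table names are placeholders, not from the original mail):

    from pyspark.sql import SQLContext, HiveContext

    sql_ctx = SQLContext(sc)
    hive_ctx = HiveContext(sc)

    df_direct = sql_ctx.read.parquet("s3://bucket/large-table")  # direct DataFrame/Parquet read
    df_hive = hive_ctx.sql("SELECT * FROM large_table")          # read through the Hive metastore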

Re: Performance with Insert overwrite into Hive Table.

2016-05-04 Thread Bijay Kumar Pathak
Thanks Ted. This looks like the issue, since I am running it on EMR and the Hive version is 1.0.0. Thanks, Bijay On Wed, May 4, 2016 at 10:29 AM, Ted Yu wrote: > Looks like you were hitting HIVE-11940 > On Wed, May 4, 2016 at 10:02 AM, Bijay Kumar Pathak wrote: >…

Performance with Insert overwrite into Hive Table.

2016-05-04 Thread Bijay Kumar Pathak
Hello, I am writing a DataFrame of around 60+ GB into a partitioned Hive table using hiveContext in Parquet format. The Spark insert overwrite job completes in a reasonable amount of time, around 20 minutes, but the job is taking a huge amount of time, more than 2 hours, to copy data from .hive-staging…
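A sketch of the write path described above (table and partition column names are placeholders); the slow phase reported is the final copy from the .hive-staging directory into the partition locations, not the Spark stages themselves:

    from pyspark.sql import HiveContext

    hive_ctx = HiveContext(sc)
    hive_ctx.setConf("hive.exec.dynamic.partition", "true")
    hive_ctx.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

    df.registerTempTable("staging_df")
    hive_ctx.sql("INSERT OVERWRITE TABLE target_table PARTITION (dt) "
                 "SELECT * FROM staging_df")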

Re: Dataframe saves for a large set but throws OOM for a small dataset

2016-04-30 Thread Bijay Kumar Pathak
Hi, I was facing the same issue on Spark 1.6. My data size was around 100 GB and I was writing into a partitioned Hive table. I was able to solve this issue by starting from 6 GB of memory and going up to 15 GB of memory per executor with an overhead of 2 GB, and by partitioning the DataFrame before doing t…
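A sketch of that fix, with the memory figures quoted in the mail passed at submit time and a repartition before the write; the partition count, column name, and output path are placeholders:

    # spark-submit --executor-memory 15g \
    #     --conf spark.yarn.executor.memoryOverhead=2048 job.py

    df = df.repartition(400)  # bound the amount of data each write task handles
    df.write.mode("overwrite").partitionBy("dt").parquet("s3://bucket/output")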

Re: Spark SQL insert overwrite table not showing all the partition.

2016-04-22 Thread Bijay Kumar Pathak
> …or partition - unless IF NOT EXISTS is provided for a partition (as of Hive 0.9.0 <https://issues.apache.org/jira/browse/HIVE-2612>). > Thanks. > Zhan Zhang > On Apr 21, 2016, at 3:20 PM, Bijay Kumar Pathak wrote: > Hi, > I…

Spark SQL insert overwrite table not showing all the partition.

2016-04-21 Thread Bijay Kumar Pathak
Hi, I have a job which writes to a Hive table with dynamic partitions. Inside the job I am writing into the table twice, but I am only seeing the partition from the last write, although I can see in the Spark UI that it is processing data for both partitions. Below is the query I am using to write…
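A sketch of the pattern being described, two dynamic-partition INSERT OVERWRITE statements in the same job; table, column, and DataFrame names are placeholders, not the query from the original mail:

    from pyspark.sql import HiveContext

    hive_ctx = HiveContext(sc)
    hive_ctx.setConf("hive.exec.dynamic.partition", "true")
    hive_ctx.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

    df_first.registerTempTable("src_first")
    df_second.registerTempTable("src_second")

    hive_ctx.sql("INSERT OVERWRITE TABLE target PARTITION (dt) SELECT * FROM src_first")
    hive_ctx.sql("INSERT OVERWRITE TABLE target PARTITION (dt) SELECT * FROM src_second")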

Reading conf file in Pyspark in cluster mode

2016-04-16 Thread Bijay Kumar Pathak
Hello, I have Spark jobs packaged in a zip and deployed using cluster mode on AWS EMR. The job has to read a conf file packaged in the zip under the resources directory. I can read the conf file in client mode but not in cluster mode. How do I read the conf file packaged in the zip while dep…
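One common approach (not taken from this thread): since the zip shipped with --py-files ends up on the Python path on both the driver and the executors, the packaged resource can be read through pkgutil instead of a local filesystem path. This assumes the resources directory is an importable package (contains an __init__.py); the package and file names below are placeholders:

    import pkgutil

    raw = pkgutil.get_data("resources", "app.conf")  # bytes, or None if missing
    settings = raw.decode("utf-8") if raw is not None else ""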

Re: Connection closed Exception.

2016-04-11 Thread Bijay Kumar Pathak
…the memory allocated for this job. > Sent from Outlook for iPhone <https://aka.ms/wp8k5y> > On Sun, Apr 10, 2016 at 9:12 PM -0700, "Bijay Kumar Pathak" <bkpat...@mtu.edu> wrote: >> Hi, >> I am running Spark 1.6 on EMR. I hav…

Connection closed Exception.

2016-04-10 Thread Bijay Kumar Pathak
Hi, I am running Spark 1.6 on EMR. I have a workflow which does the following things: 1. Read the 2 flat files, create the data frames and join them. 2. Read a particular partition from the Hive table and join the dataframe from step 1 with it. 3. Finally, insert overwrite into the Hive table whi…
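A sketch of that three-step workflow; file paths, table and column names are placeholders, and the flat-file format (spark-csv package) is an assumption, not from the original mail:

    from pyspark.sql import HiveContext

    hive_ctx = HiveContext(sc)

    # 1. Read the two flat files and join them.
    df_a = hive_ctx.read.format("com.databricks.spark.csv").load("s3://bucket/file_a")
    df_b = hive_ctx.read.format("com.databricks.spark.csv").load("s3://bucket/file_b")
    joined = df_a.join(df_b, "id")

    # 2. Read one partition from the Hive table and join it with the result of step 1.
    existing = hive_ctx.sql("SELECT * FROM target WHERE dt = '2016-04-10'")
    merged = joined.join(existing, "id", "left_outer")

    # 3. Insert overwrite the merged result back into the Hive table.
    merged.registerTempTable("merged_src")
    hive_ctx.sql("INSERT OVERWRITE TABLE target PARTITION (dt) SELECT * FROM merged_src")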