Hi Ankur,
I also tried setting a property to write Parquet files of 256 MB. I am
using PySpark; below is how I set the property, but it's not working for me.
How did you set the property?
spark_context._jsc.hadoopConfiguration().setInt("dfs.blocksize", 268435456)
spark_context._jsc.hadoopConfi
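For what it's worth, here is a slightly fuller sketch of the knobs usually tried for this (the property names are assumptions based on the Hadoop/Parquet docs, not something confirmed in this thread; 'df' and the output path are placeholders):

# Sketch only: set the HDFS block size and the Parquet row-group size to 256 MB
# before writing. Whether these actually change the output file size depends on
# the Spark/Parquet version in use.
hadoop_conf = spark_context._jsc.hadoopConfiguration()
hadoop_conf.setInt("dfs.blocksize", 268435456)        # 256 MB HDFS blocks
hadoop_conf.setInt("parquet.block.size", 268435456)   # 256 MB Parquet row groups

# Alternatively, control file sizes indirectly by controlling the number of
# output files; 200 and the path are made-up values.
df.repartition(200).write.parquet("s3://bucket/output/path")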
Hi,
Try this one:
df_join = df1.join(df2, 'Id', "fullouter")
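For example, a self-contained version of that call (the sample data and column names are made up) looks like this:

# Minimal full outer join on a shared 'Id' column; Spark 1.6-era API.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="join-example")
sqlContext = SQLContext(sc)

df1 = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["Id", "val1"])
df2 = sqlContext.createDataFrame([(2, "x"), (3, "y")], ["Id", "val2"])

# Joining on the column name (rather than df1("Id") === df2("Id"), which is
# Scala syntax) also avoids a duplicate 'Id' column in the result.
df_join = df1.join(df2, "Id", "fullouter")
df_join.registerTempTable("join_test")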
Thanks,
Bijay
On Tue, May 17, 2016 at 9:39 AM, ram kumar wrote:
> Hi,
>
> I tried to join two dataframe
>
> df_join = df1.join(df2, ((df1("Id") === df2("Id")), "fullouter")
>
> df_join.registerTempTable("join_test")
>
>
> When
Hi,
How can we disable writing _common_metadata while saving a DataFrame in
Parquet format in PySpark? I tried to set the property using the command
below, but it didn't help.
sparkContext._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false")
Thanks,
Bijay
> ...running on a 64-bit JVM with less than 32G heap, you might want
> to enable -XX:+UseCompressedOops[1]. And if your dataframe is somehow
> generating more than 2^31-1 number of arrays, you might have to rethink
> your options.
>
> [1] https://spark.apache.org/docs/latest/tuning.html
>
>
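A rough sketch of how that flag is typically passed to the JVMs Spark launches (the extraJavaOptions keys are the standard Spark settings; note the driver option normally has to be set before the driver JVM starts, e.g. on spark-submit, so this is illustrative only):

from pyspark import SparkConf, SparkContext

# Enable compressed oops on executors (and, illustratively, the driver).
conf = (SparkConf()
        .set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops")
        .set("spark.driver.extraJavaOptions", "-XX:+UseCompressedOops"))
sc = SparkContext(conf=conf)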
Hi,
I am reading a Parquet file of around 50+ GB which has 4013 partitions with
240 columns. Below is my configuration:
driver: 20G memory with 4 cores
executors: 45 executors with 15G memory and 4 cores
I tried to read the data using both the DataFrame reader and HiveContext
to read the data us
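For context, the two read paths look roughly like this (the S3 path and table name are placeholders, not from the original message):

from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext

sc = SparkContext(appName="read-parquet")

# 1) Plain DataFrame reader over the Parquet files.
sqlContext = SQLContext(sc)
df = sqlContext.read.parquet("s3://bucket/path/to/parquet")

# 2) HiveContext, reading the same data through a Hive table definition.
hiveContext = HiveContext(sc)
df_hive = hiveContext.table("my_db.my_table")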
Thanks, Ted. This looks like the issue, since I am running it on EMR and the
Hive version is 1.0.0.
Thanks,
Bijay
On Wed, May 4, 2016 at 10:29 AM, Ted Yu wrote:
> Looks like you were hitting HIVE-11940
>
> On Wed, May 4, 2016 at 10:02 AM, Bijay Kumar Pathak
> wrote:
>
>
Hello,
I am writing a DataFrame of around 60+ GB into a partitioned Hive table using
hiveContext in Parquet format. The Spark insert overwrite job completes in
a reasonable amount of time, around 20 minutes.
But the job is taking a huge amount of time, more than 2 hours, to copy data
from the .hive-staging
Hi,
I was facing the same issue on Spark 1.6. My data size was around 100 GB
and I was writing into a partitioned Hive table.
I was able to solve this issue by starting from 6G of memory and going up
to 15 GB of memory per executor, with an overhead of 2 GB, and by partitioning
the DataFrame before doing t
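Roughly, the approach described above looks like this (table, column, and partition names are made up; the executor memory of 15 GB plus 2 GB overhead would be set on spark-submit, not in code):

# Repartition on the Hive partition column so each task writes fewer, larger
# files, then do a dynamic-partition insert overwrite through HiveContext.
hive_context.sql("SET hive.exec.dynamic.partition=true")
hive_context.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

df.repartition(200, "part_col").registerTempTable("staging_tmp")

# In a dynamic-partition insert the partition column goes last in the SELECT.
hive_context.sql("""
    INSERT OVERWRITE TABLE target_db.target_table PARTITION (part_col)
    SELECT col_a, col_b, part_col FROM staging_tmp
""")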
> ...or partition
>
> - unless IF NOT EXISTS is provided for a partition (as of Hive 0.9.0
>   <https://issues.apache.org/jira/browse/HIVE-2612>).
>
>
>
> Thanks.
>
> Zhan Zhang
>
> On Apr 21, 2016, at 3:20 PM, Bijay Kumar Pathak wrote:
>
> Hi,
>
> I
Hi,
I have a job which writes to a Hive table with dynamic partitions. Inside
the job, I am writing into the table twice, but I am only seeing the
partition from the last write, although I can see in the Spark UI that it is
processing data for both partitions.
Below is the query I am using to write
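Since the query itself is cut off above, here is a hypothetical reconstruction of the pattern being described, with made-up table and column names (two dynamic-partition insert overwrites into the same table from two different sources):

hive_context.sql("SET hive.exec.dynamic.partition=true")
hive_context.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

df_first.registerTempTable("src_first")
df_second.registerTempTable("src_second")

# First write: rows whose 'dt' values define the first set of partitions.
hive_context.sql("""
    INSERT OVERWRITE TABLE db.events PARTITION (dt)
    SELECT col1, col2, dt FROM src_first
""")

# Second write: a different set of 'dt' values, yet only this one shows up.
hive_context.sql("""
    INSERT OVERWRITE TABLE db.events PARTITION (dt)
    SELECT col1, col2, dt FROM src_second
""")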
Hello,
I have Spark jobs packaged in a zip and deployed using cluster mode in AWS
EMR. The job has to read a conf file packaged with the zip under the
resources directory. I can read the conf file in client mode but not in
cluster mode.
How do I read the conf file packaged in the zip while dep
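One approach that is often suggested for this (the package and file names below are assumptions, not from the original message) is to read the file through the package loader rather than the filesystem, since pkgutil can read resources out of a zip that is on the PYTHONPATH:

import pkgutil

# 'myapp.resources' must be an importable package inside the shipped zip
# (i.e. it needs an __init__.py); 'app.conf' is the bundled conf file.
raw = pkgutil.get_data("myapp.resources", "app.conf")   # returns bytes or None
if raw is not None:
    conf_text = raw.decode("utf-8")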
> ...the memory allocated for this job.
>
> On Sun, Apr 10, 2016 at 9:12 PM -0700, "Bijay Kumar Pathak" <bkpat...@mtu.edu> wrote:
>
>> Hi,
>>
>> I am running Spark 1.6 on EMR. I hav
Hi,
I am running Spark 1.6 on EMR. I have a workflow which does the following
things (a rough sketch follows after the list):
1. Read the 2 flat files, create the data frames, and join them.
2. Read a particular partition from the Hive table and join the
dataframe from 1 with it.
3. Finally, insert overwrite into the Hive table whi
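A loose sketch of those three steps (file paths, delimiter, table, and column names are all placeholders):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="emr-workflow")
hc = HiveContext(sc)

# 1. Read the two flat files (assumed comma-delimited) and join them.
def load_flat(path, columns):
    rows = sc.textFile(path).map(lambda line: line.split(","))
    return hc.createDataFrame(rows, columns)

df_a = load_flat("s3://bucket/file_a.csv", ["Id", "col_a"])
df_b = load_flat("s3://bucket/file_b.csv", ["Id", "col_b"])
joined = df_a.join(df_b, "Id")

# 2. Join with one partition of the Hive table.
part = hc.sql("SELECT * FROM db.big_table WHERE dt = '2016-04-10'")
result = joined.join(part, "Id")

# 3. Insert overwrite the result back into the Hive table.
result.registerTempTable("result_tmp")
hc.sql("INSERT OVERWRITE TABLE db.big_table PARTITION (dt='2016-04-10') "
       "SELECT col_a, col_b FROM result_tmp")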