... is to figure out why logical plan creation is so
slow for 700 columns.

> On Thu, Jun 30, 2016 at 1:58 PM, Darshan Singh wrote:
>
>> Is there a way I can use the same logical plan for a query. Everything will
>> be the same except the underlying file will be different.
Is there a way I can use the same logical plan for a query? Everything will be
the same except the underlying file will be different.
The issue is that my query has around 700 columns, and generating the logical plan
takes 20 seconds. It happens every 2 minutes, but every time the underlying
file is different.
I do not
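One way around re-generating the 700-column plan every time is to build the
column expressions once and re-apply them to each new file, so only the read of
the fresh file is analyzed again. A minimal sketch, not from the thread (column
names, file format and path are assumptions):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Build the wide projection once and reuse it for every file.
val projection: Seq[Column] = (1 to 700).map(i => col(s"c$i"))

def runOnFile(path: String): Long =
  sqlContext.read.parquet(path).select(projection: _*).count()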
0)
> sc = SparkContext(conf=conf)
>
> On Fri, Jun 24, 2016 at 6:46 AM, Darshan Singh wrote:
>
>> Hi,
>>
>> My default parallelism is 100. Now I join 2 dataframes with 20 partitions
>> each; the joined dataframe has 100 partitions. I want to know wh
Hi,
My default parallelism is 100. Now I join 2 dataframes with 20 partitions
each; the joined dataframe has 100 partitions. I want to know what is the way
to keep it at 20 (other than repartition and coalesce).
Also, when I join these 2 dataframes I am using 4 columns as the join
columns. The dataframes a
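For DataFrame joins the post-shuffle partition count is usually taken from
spark.sql.shuffle.partitions rather than spark.default.parallelism, so one
option is to lower that setting before the join. A sketch, with placeholder
column names:

sqlContext.setConf("spark.sql.shuffle.partitions", "20")
val joined = df1.join(df2,
  df1("c1") === df2("c1") && df1("c2") === df2("c2") &&
  df1("c3") === df2("c3") && df1("c4") === df2("c4"))
println(joined.rdd.partitions.length)   // should now report 20 rather than the default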
These are 2 parameters and their default values are 0.6 and 0.2, which is
around 80%. I am wondering where the remaining 0.2 (20%) goes. Is it for the
JVM's other memory requirements?
If yes, then what is spark.memory.fraction used for?
My understanding is that if we have 10GB of memory per executor then
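For reference, a sketch of the settings in question, assuming the 0.6/0.2
values refer to the legacy storage and shuffle fractions (defaults differ by
Spark version):

val conf = new org.apache.spark.SparkConf()
// Legacy (pre-1.6) memory management, the 0.6 / 0.2 values mentioned above:
conf.set("spark.storage.memoryFraction", "0.6")   // heap share for cached blocks
conf.set("spark.shuffle.memoryFraction", "0.2")   // heap share for shuffle buffers
// The remaining ~20% is left for user objects and other JVM overhead.
// Unified memory management (1.6+) replaces both with a single region:
conf.set("spark.memory.fraction", "0.75")         // execution + storage share of the usable heap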
Hi,
I am using a standalone Spark cluster and using a ZooKeeper cluster for
high availability. I sometimes get an error when I start the master. The
error is related to leader election in Curator and says that no method
(getProcess) was found, and the master doesn't get started.
Just wondering what could
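For context, the documented way to enable ZooKeeper-based recovery for
standalone masters is via the spark.deploy.* properties; a sketch with
placeholder hosts (this does not by itself explain the Curator error):

# spark-env.sh on each master node (host names are placeholders)
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"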
Hi,
I have a dataframe df1, and I partitioned it by col1, col2 and persisted it.
Then I created a new dataframe df2.
val df2 = df1.sortWithinPartitions("col1","col2","col3")
df1.persist()
df2.persist()
df1.count()
df2.count()
Now I expect that any group by statement using the "col1","col2","col3"
s
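A sketch of how that expectation could be checked (the aggregation itself is
hypothetical):

val agg = df2.groupBy("col1", "col2", "col3").count()
agg.explain()   // check whether an extra Exchange/Sort still shows up before the aggregation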
Hi,
I have an application which uses 3 parquet files, 2 of which are large and
the other one is small. These files are in HDFS and are partitioned by column
"col1". Now I create 3 data-frames, one for each parquet file, but I pass the
col1 value so that each reads only the relevant data. I always read from the
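A sketch of that read pattern, with placeholder paths and a placeholder col1
value (none of the names are from the thread):

import sqlContext.implicits._   // for the $"col" syntax

val v = "A"   // the col1 value being passed in
val big1  = sqlContext.read.parquet("/data/big1").where($"col1" === v)
val big2  = sqlContext.read.parquet("/data/big2").where($"col1" === v)
val small = sqlContext.read.parquet("/data/small").where($"col1" === v)
// With the files laid out as col1=... directories, these filters should prune
// the scan down to the matching partition.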
Hi,
I would like to know if there is any max limit on the union of data-frames. How
will the performance of, say, 1 data-frame union be in Spark when
all the data is in cache?
The other option is 1 partitions of a single dataframe.
Thanks
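A sketch of folding many data-frames into a single union (paths are
placeholders; unionAll is the 1.x method name):

import org.apache.spark.sql.DataFrame

val paths: Seq[String] = (1 to 10).map(i => s"/data/part$i")
val frames: Seq[DataFrame] = paths.map(p => sqlContext.read.parquet(p))
val combined = frames.reduce(_ unionAll _)   // each union adds one node to the plan
combined.cache()
combined.count()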
Thanks for the information. When I mention a map-side join, I mean that each
partition from 1 DF joins with the partition with the same key of DF 2 on the
worker node without shuffling the data. In other words, do as much work as
possible within the worker node before shuffling the data.
Thanks
Darshan Singh
On Wed
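One way to get that partition-by-partition behaviour with the RDD API is to
give both sides the same partitioner, so the subsequent join (or cogroup) does
not reshuffle. A sketch under assumptions (a single integer key column and 20
partitions, neither of which comes from the thread):

import org.apache.spark.HashPartitioner
import org.apache.spark.sql.Row

val part  = new HashPartitioner(20)
val left  = df1.rdd.map(r => (r.getInt(0), r)).partitionBy(part).cache()
val right = df2.rdd.map(r => (r.getInt(0), r)).partitionBy(part).cache()
// Both sides now share a partitioner, so the join runs partition-by-partition
// on the workers without another shuffle.
val joined = left.join(right)   // RDD[(Int, (Row, Row))]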
Thanks a lot for this. I was thinking of using cogrouped RDDs. We will try
to move to 1.6 as there are other issues as well in 1.5.2.
The same code is much faster in 1.6.1, but plan-wise I do not see much
difference. Why is it still partitioning and then sorting and then joining?
I expect it to sort with
I used 1.5.2. I have used the movies data to reproduce the issue. Below is the
physical plan. I am not sure why it is hash partitioning the data and then
sorting and then joining. I expect the data to be joined first and then sent
for further processing.
I sort of expect a common partitioner which will work
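For reference, a sketch of how such a plan can be inspected (the join itself is
hypothetical; explain and queryExecution exist on 1.5.x DataFrames):

val joined = part_movies.join(ratings, "movie")   // hypothetical second DataFrame and key
joined.explain(true)                              // parsed, analyzed, optimized and physical plans
println(joined.queryExecution.executedPlan)       // just the physical plan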
at 9:41 PM, Darshan Singh wrote:
> Thanks a lot. I will try this one as well.
>
> On Tue, Apr 5, 2016 at 9:28 PM, Michael Armbrust wrote:
>
>> The following should ensure partition pruning happens:
>>
>> df.write.partitionBy("country").save("/
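The quoted suggestion is cut off; a hedged reconstruction of the write-then-read
pattern (the path is a placeholder, not the one from the thread):

df.write.partitionBy("country").save("/tmp/movies_by_country")
val pruned = sqlContext.read.load("/tmp/movies_by_country").where("country = 'UK'")
pruned.explain()   // the scan should show only the country=UK partition being read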
").where("country = 'UK'")
>
> On Tue, Apr 5, 2016 at 1:13 PM, Darshan Singh wrote:
>
>> Thanks for the reply.
>>
>> Now I saved the part_movies as a parquet file.
>>
>> Then created a new dataframe from the saved parquet file and I did
Thanks. It is not my exact scenario but I have tried to reproduce it. I
have used 1.5.2.
I have a part-movies data-frame which has 20 partitions, 1 for each movie.
I created the following query:
val part_sql = sqlContext.sql("select * from part_movies where movie = 10")
part_sql.count()
I expect t
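A sketch of how that expectation can be checked (these calls exist on 1.5.x
DataFrames):

val part_sql = sqlContext.sql("select * from part_movies where movie = 10")
part_sql.explain()                        // shows whether the movie = 10 filter is pushed into the scan
println(part_sql.rdd.partitions.length)   // how many partitions the query actually produces
part_sql.count()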