Join selection

2019-03-04 Thread Akhilanand
Hello, I was going through the Spark strategies class and found that, by default, sort merge join is preferred over shuffled hash join. preferSortMergeJoin needs to be explicitly set to false to force a shuffled hash join. 1) Why is sort merge join preferred over hash join? 2) are th
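For reference, a minimal sketch of turning that preference off for one session, written as a spark-shell style snippet. The app name and local master are placeholders, not from the thread; the key is the SQL configuration behind preferSortMergeJoin. Even with it set to false, the planner may still choose another join when its other conditions for a shuffled hash join (for example the relative sizes of the two sides) are not met, so it is worth checking the physical plan with explain().

```scala
import org.apache.spark.sql.SparkSession

// Placeholder session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .appName("join-selection-sketch")
  .master("local[*]")
  .getOrCreate()

// Stop preferring sort merge join so shuffled hash join becomes eligible.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
```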

Re: error in spark sql

2019-03-04 Thread Shyam P
Something is wrong with the query. Please add a code snippet showing exactly what you are trying to do.
~Shyam
On Fri, Mar 1, 2019 at 1:07 PM yuvraj singh <19yuvrajsing...@gmail.com> wrote:
> Hi,
>
> I am running spark as a service; when we change some sql schema we are
> facing some problems.
>
> ERROR [http

Re: spark df.write.partitionBy run very slow

2019-03-04 Thread Shyam P
Hi JF,
Try executing this before df.write:
// count rows per partition_id
import org.apache.spark.sql.functions.spark_partition_id
df.groupBy(spark_partition_id()).count().show()
You will then see how the data has been partitioned inside df. Small trick we can apply here while partition

subscribe

2019-03-04 Thread Qian He

[SQL] 64-bit hash function, and seeding

2019-03-04 Thread Huon.Wilson
Hi, I’m working on something that requires deterministic randomness, i.e. a row gets the same “random” value no matter the order of the DataFrame. A seeded hash seems to be the perfect way to do this, but the existing hashes have various limitations: - hash: 32-bit output (only 4 billion possi
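To make the idea concrete, here is a minimal sketch in plain Scala (not Spark's built-in hash function). The helper name hash64 and the way the second seed is derived are assumptions for illustration only. It concatenates two 32-bit MurmurHash3 values into one Long, so it yields 64 bits of output but is not claimed to match the quality of a true 64-bit hash.

```scala
import scala.util.hashing.MurmurHash3

object SeededHash {
  // Hypothetical helper: builds a 64-bit value from two 32-bit MurmurHash3
  // hashes of the same key, computed with two seeds derived from `seed`.
  def hash64(key: String, seed: Int): Long = {
    val hi = MurmurHash3.stringHash(key, seed)
    val lo = MurmurHash3.stringHash(key, ~seed)
    // High 32 bits from `hi`, low 32 bits from `lo` (masked to avoid sign extension).
    (hi.toLong << 32) | (lo.toLong & 0xffffffffL)
  }
}
```

Because the value depends only on the key and the seed, a row receives the same "random" value regardless of where it appears in the data.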

Difference between One map vs multiple maps

2019-03-04 Thread Yeikel
Considering that I have a DataFrame df, I could run df.map(operation1).map(operation2) or run df.map(logic for both operations). In addition, I could also run df.map(operation3) where operation3 would be: return operation2(operation1()) Similarly, with UDFs, I could build a UDF that does tw
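The equivalence being asked about can be sketched with plain Scala collections standing in for the DataFrame. The bodies of operation1 and operation2 below are made-up placeholders; only the shape of the two approaches comes from the question. This shows that both forms produce the same result and says nothing about their relative cost in Spark.

```scala
object MapComposition {
  // Hypothetical stand-ins for the operations named in the question.
  val operation1: Int => Int = _ + 1
  val operation2: Int => Int = _ * 2

  // Single function equivalent to applying operation1, then operation2.
  val operation3: Int => Int = operation1 andThen operation2

  // Two chained maps.
  def chained(xs: List[Int]): List[Int] = xs.map(operation1).map(operation2)

  // One map with the composed function.
  def fused(xs: List[Int]): List[Int] = xs.map(operation3)
}
```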

Re: Spark SQL doesn't produce output while hive does

2019-03-04 Thread Patrik Medvedev
Hi,
Are you connecting in collocated mode or via JDBC? Could you give more details?
Cheers, Patrick
On Mon, Mar 4, 2019 at 08:18, mayangyang02 wrote:
> Hi,
>
> We have a sql. When we ran it with Hive, it produced the result normally.
> But when we ran it with Spark, it didn’t produce any output.
>
> W

Re: Spark SQL doesn't produce output while hive does

2019-03-04 Thread Chunpeng Wang
Check the data schema, especially for user_mobile.
On Mon, Mar 4, 2019 at 1:18 AM mayangyang02 wrote:
> Hi,
>
> We have a sql. When we ran it with Hive, it produced the result normally.
> But when we ran it with Spark, it didn’t produce any output.
>
> We found that what caused the problem is the

Connect to hive 3 from spark

2019-03-04 Thread Nicolas Paris
Hi all, Does anybody know if Spark is able to connect to the Hive metastore for Hive 3 (metastore v3)? I know Spark cannot deal with transactional tables; however, I wonder if it can at least read/write non-transactional tables from Hive 3. Thanks -- nicolas

Timeout between driver and application master (Thrift Server)

2019-03-04 Thread Jürgen Thomann
Hi, I'm using the Spark Thrift Server and after some time the driver and application master shut down because of timeouts. There is a firewall in between, and there appears to be no traffic between them. Is there a way to configure TCP keep alive for the connection or some other way to

spark df.write.partitionBy run very slow

2019-03-04 Thread JF Chen
I am trying to write the data in a dataset to HDFS via
df.write.partitionBy(column_a, column_b, column_c).parquet(output_path)
However, it takes several minutes to write only hundreds of MB of data to HDFS. From this article