Filtering DateType column with Timestamp

2018-07-20 Thread fmilano
Hi. I have a DateType column and I want to filter all the values greater than or equal to a certain Timestamp. This works: for example, df.col(columnName).geq(value) evaluates to a column of DateTypes greater than or equal to value. Except for one case: if the value of the Timestamp is initialized to "1
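
A minimal sketch of the comparison being described, assuming a DataFrame with a DateType column; the column name, data, and cutoff value are illustrative, not from the original message.

```
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("date-filter").getOrCreate()
import spark.implicits._

// Build a small DataFrame with a DateType column (names are illustrative).
val df = Seq("2018-01-15", "2018-07-01")
  .toDF("raw")
  .select($"raw".cast("date").as("eventDate"))

// Compare the DateType column against a Timestamp with geq, as in
// df.col(columnName).geq(value) from the message above.
val cutoff = Timestamp.valueOf("2018-06-01 00:00:00")
val filtered = df.filter(df.col("eventDate").geq(lit(cutoff)))

filtered.show()
```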

[SPARK-SQL] Reading JSON column as a DataFrame and keeping partitioning information

2018-07-20 Thread Daniel Mateus Pires
I've been trying to figure this one out for some time now. I have JSONs representing Products arriving (physically) partitioned by Brand, and I would like to create a DataFrame from the JSON while also keeping the partitioning information (Brand). ``` case class Product(brand: String, value: String) val
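
One way to keep the physical partitioning column is Spark's built-in partition discovery; a hedged sketch, assuming the JSON files sit under a layout like /data/products/brand=<Brand>/*.json (the paths are illustrative, not from the original message).

```
import org.apache.spark.sql.SparkSession

case class Product(brand: String, value: String)

val spark = SparkSession.builder().appName("json-partitions").getOrCreate()
import spark.implicits._

// Reading under the base directory lets partition discovery add the `brand`
// column from the brand=<value> directory names, even though the JSON files
// themselves only contain `value`.
val products = spark.read
  .option("basePath", "/data/products")
  .json("/data/products/brand=*")
  .as[Product]

products.show()
```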

Re: Re: spark sql data skew

2018-07-20 Thread Xiaomeng Wan
Try divide and conquer: create a column x from the first character of userid and group by company+x. If the groups are still too large, try the first two characters. On 17 July 2018 at 02:25, 崔苗 wrote: > 30G of user data; how to get the distinct user count after creating a composite > key based on company and userid? > >
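
A rough sketch of that divide-and-conquer idea, assuming a DataFrame with `company` and `userid` string columns (the source path and column names are illustrative). Because the prefix is derived from userid, the prefix groups are disjoint, so summing the partial distinct counts per company is exact.

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring, countDistinct, sum}

val spark = SparkSession.builder().appName("skew-distinct").getOrCreate()

val users = spark.read.parquet("/data/users")  // assumed source

// Stage 1: a prefix column from the first character of userid splits a single
// large company key across several smaller groups.
val withPrefix = users.withColumn("x", substring(col("userid"), 1, 1))

// Stage 2: count distinct userids per (company, x), then sum the partial
// counts per company.
val perCompany = withPrefix
  .groupBy("company", "x")
  .agg(countDistinct("userid").as("partial"))
  .groupBy("company")
  .agg(sum("partial").as("distinct_users"))

perCompany.show()
```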

Query on Spark Hive with kerberos Enabled on Kubernetes

2018-07-20 Thread Garlapati, Suryanarayana (Nokia - IN/Bangalore)
Hi All, I am trying to use the Spark 2.2.0 Kubernetes code (https://github.com/apache-spark-on-k8s/spark/tree/v2.2.0-kubernetes-0.5.0) to run Hive queries on a Kerberos-enabled cluster. Spark-submits fail for the Hive queries, but pass when I try to access HDFS. Is this a known limitation

Re: Parquet

2018-07-20 Thread Muthu Jayakumar
I generally write to Parquet when I want to repeat the operation of reading the data and perform different operations on it each time. This saves db time for me. Thanks Muthu On Thu, Jul 19, 2018, 18:34 amin mohebbi wrote: > We have two big tables, each with 5 billion rows, so my que
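
A minimal sketch of that pattern: pay for the expensive database read once, materialize it to Parquet, and have later jobs read the Parquet copy instead. The JDBC URL, table, paths, and the final aggregation column are illustrative assumptions, not from the original message.

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-cache").getOrCreate()

// One expensive read from the database ...
val fromDb = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/sales")
  .option("dbtable", "public.orders")
  .load()

// ... written once to Parquet ...
fromDb.write.mode("overwrite").parquet("/data/cache/orders")

// ... so every later operation reads the Parquet copy instead of hitting the db.
val orders = spark.read.parquet("/data/cache/orders")
orders.groupBy("status").count().show()
```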