Hi,
We have two Hive tables which are read in Spark and joined on a join key,
let's call it user_id.
Then we write this joined dataset to S3 and register it in Hive as a third
table for subsequent tasks to use.
One of the other columns in the joined dataset is called keychain_id.
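In code, the pipeline looks roughly like this (the table names, S3 path,
and join type below are placeholders, not our actual setup):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-and-register")
  .enableHiveSupport()
  .getOrCreate()

// Placeholder table names.
val left  = spark.table("db.table_a")
val right = spark.table("db.table_b")

// Inner join on the shared key; the result also carries keychain_id.
val joined = left.join(right, Seq("user_id"))

// Persist to S3 and register the result as a third Hive table.
joined.write
  .mode("overwrite")
  .option("path", "s3://bucket/warehouse/joined_table") // placeholder path
  .saveAsTable("db.joined_table")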
I have been using Athena/Presto to read the Parquet files in the data lake;
if you are already saving data to S3, I think this is the easiest option.
Then I use Redash or Metabase to build dashboards (they have different
limitations); both are very intuitive to use and easy to set up with Docker.
--
S
Hi Roland,
Thank you for your reply. That's quite helpful. I think I should try
InfluxDB then. But I am curious: in the case of Prometheus, would writing a
custom exporter be a good choice and serve the purpose efficiently? Grafana
is not something I want to drop.
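For what it's worth, I imagine a minimal custom exporter built on the
official Prometheus Java client (io.prometheus simpleclient and
simpleclient_httpserver) would be a sketch like this; the metric name,
port, and collection logic are all assumptions on my side:

import io.prometheus.client.Gauge
import io.prometheus.client.exporter.HTTPServer

object CustomExporter {
  def main(args: Array[String]): Unit = {
    // Hypothetical metric; report whatever the jobs should expose.
    val lag = Gauge.build()
      .name("pipeline_lag_seconds")
      .help("Seconds since the last successful pipeline run.")
      .register()

    // Serve /metrics on port 9400 for Prometheus to scrape.
    new HTTPServer(9400)

    while (true) {
      lag.set(computeLagSeconds()) // placeholder collection logic
      Thread.sleep(15000)
    }
  }

  // Placeholder; replace with a real measurement.
  def computeLagSeconds(): Double = 0.0
}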
Best,
Aniruddha
---
Exactly. My problem is that a big DataFrame is joined with many small
DataFrames, which I convert to Maps and then apply to the big DataFrame via
a UDF. (Broadcast joins didn't work with that many joins.)
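A minimal sketch of that Map-plus-UDF pattern, assuming a small two-column
lookup table (all names here are illustrative):

import org.apache.spark.sql.functions.{col, udf}

// spark: the active SparkSession; big: the large DataFrame.
// Collect the small DataFrame to the driver as a Map.
val lookup: Map[String, String] = spark.table("db.small_lookup")
  .collect()
  .map(r => r.getString(0) -> r.getString(1))
  .toMap

// Ship the Map to the executors once, instead of joining.
val bcast = spark.sparkContext.broadcast(lookup)
val lookupUdf = udf((k: String) => bcast.value.get(k))

val enriched = big.withColumn("small_value", lookupUdf(col("user_id")))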
On Feb 26, 2020, at 09:32, Jianneng Li <jianneng...@workday.com> wrote:
I could be wrong, but I'm guessing that it use
I have created a JIRA to track this request:
https://issues.apache.org/jira/browse/SPARK-30957
Enrico
On 08.02.20 at 16:56, Enrico Minack wrote:
Hi Devs,
I am forwarding this from the user mailing list. I agree that the <=>
version of join(Dataset[_], Seq[String]) would be useful.
Does a
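For anyone following along, the workaround today is to spell out the
null-safe condition by hand, since join with a Seq of column names only
does plain equality; a minimal sketch with illustrative data:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val left  = Seq((Some(1), "a"), (None, "b")).toDF("user_id", "l")
val right = Seq((Some(1), "x"), (None, "y")).toDF("user_id", "r")

// Plain equality would drop the null keys; <=> matches null to null.
val joined = left
  .join(right, left("user_id") <=> right("user_id"))
  .drop(right("user_id"))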