Hi,
We have two Hive tables which are read in Spark and joined on a join key,
let's call it user_id.
Then we write this joined dataset to S3 and register it in Hive as a third
table for subsequent tasks to use.
One of the other columns in the joined dataset is called keychain_id.
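In code, the pipeline looks roughly like this (the table names, S3 path,
and join type below are placeholders, not our actual setup):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-and-register")
  .enableHiveSupport()
  .getOrCreate()

// Placeholder table names.
val left  = spark.table("db.table_a")
val right = spark.table("db.table_b")

// Inner join on the shared key; the result also carries keychain_id.
val joined = left.join(right, Seq("user_id"))

// Persist to S3 and register the result as a third Hive table.
joined.write
  .mode("overwrite")
  .option("path", "s3://bucket/warehouse/joined_table") // placeholder path
  .saveAsTable("db.joined_table")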
I have been using Athena/Presto to read the Parquet files in the data lake;
if you are already saving data to S3, I think this is the easiest option.
Then I use Redash or Metabase to build dashboards (they have different
limitations); both are very intuitive to use and easy to set up with Docker.
--
S
Hi Roland,
Thank you for your reply. That's quite helpful. I think I should try
InfluxDB then. But I am curious: in the case of Prometheus, would writing a
custom exporter be a good choice and serve the purpose efficiently? Grafana
is not something I want to drop.
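For what it's worth, I imagine a minimal custom exporter built on the
official Prometheus Java client (io.prometheus simpleclient and
simpleclient_httpserver) would be a sketch like this; the metric name,
port, and collection logic are all assumptions on my side:

import io.prometheus.client.Gauge
import io.prometheus.client.exporter.HTTPServer

object CustomExporter {
  def main(args: Array[String]): Unit = {
    // Hypothetical metric; report whatever the jobs should expose.
    val lag = Gauge.build()
      .name("pipeline_lag_seconds")
      .help("Seconds since the last successful pipeline run.")
      .register()

    // Serve /metrics on port 9400 for Prometheus to scrape.
    new HTTPServer(9400)

    while (true) {
      lag.set(computeLagSeconds()) // placeholder collection logic
      Thread.sleep(15000)
    }
  }

  // Placeholder; replace with a real measurement.
  def computeLagSeconds(): Double = 0.0
}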
Best,
Aniruddha
---
Exactly. My problem is that a big DataFrame is joined with many small
DataFrames, which I convert to Maps and then apply to the big DataFrame via
a UDF. (Broadcast joins didn't work with that many joins.)
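A minimal sketch of that Map-plus-UDF pattern, assuming a small two-column
lookup table (all names here are illustrative):

import org.apache.spark.sql.functions.{col, udf}

// spark: the active SparkSession; big: the large DataFrame.
// Collect the small DataFrame to the driver as a Map.
val lookup: Map[String, String] = spark.table("db.small_lookup")
  .collect()
  .map(r => r.getString(0) -> r.getString(1))
  .toMap

// Ship the Map to the executors once, instead of joining.
val bcast = spark.sparkContext.broadcast(lookup)
val lookupUdf = udf((k: String) => bcast.value.get(k))

val enriched = big.withColumn("small_value", lookupUdf(col("user_id")))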
On Feb 26, 2020, at 09:32, Jianneng Li <jianneng...@workday.com> wrote:
I could be wrong, but I'm guessing that it use
I have created a JIRA to track this request:
https://issues.apache.org/jira/browse/SPARK-30957
Enrico
On 08.02.20 at 16:56, Enrico Minack wrote:
Hi Devs,
I am forwarding this from the user mailing list. I agree that the <=>
version of join(Dataset[_], Seq[String]) would be useful.
Does a
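For anyone following along, the workaround today is to spell out the
null-safe condition by hand, since join with a Seq of column names only
does plain equality; a minimal sketch with illustrative data:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val left  = Seq((Some(1), "a"), (None, "b")).toDF("user_id", "l")
val right = Seq((Some(1), "x"), (None, "y")).toDF("user_id", "r")

// Plain equality would drop the null keys; <=> matches null to null.
val joined = left
  .join(right, left("user_id") <=> right("user_id"))
  .drop(right("user_id"))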