Re: Standard practices for building dashboards for spark processed data

Roland Johann Tue, 25 Feb 2020 23:37:52 -0800

Hi Ani,

Prometheus is not well suited for ingesting explicit timeseries data. It is
built for technical monitoring. If you want to monitor your Spark jobs
with Prometheus, you can publish their metrics so that Prometheus can
scrape them. What you are probably looking for is a timeseries database
that you can push metrics to.
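
To make "publish the metrics so Prometheus can scrape them" a bit more
concrete, here is a minimal sketch using only the Python standard library.
The metric names, the METRICS dict, the port, and the helper names
(render_metrics, MetricsHandler, serve) are all made up for illustration;
in practice you would more likely use an existing Prometheus client
library or a metrics sink rather than hand-rolling this.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical job-level gauges; a real job would update these as it runs.
METRICS = {
    "spark_job_rows_processed": 1200.0,
    "spark_job_last_batch_seconds": 3.5,
}


def render_metrics(metrics):
    """Render a dict of gauges in the Prometheus text exposition format."""
    lines = []
    for name in sorted(metrics):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {metrics[name]}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current metrics on GET /metrics (assumed scrape path)."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() inside the job process and pointing a Prometheus scrape
target at that port would be enough for job-level monitoring. Note that
this only exposes the current value of each metric, which is why it fits
monitoring and not your processed timeseries data.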


You should look for an alternative to Grafana only if you find that
Grafana is not well suited to your visualization use case.

As said earlier, at a quick glance it sounds as if you should look for an
alternative to Prometheus.

For timeseries you can look at TimescaleDB or InfluxDB. Other databases,
such as ordinary SQL databases or Cassandra, lack up/downsampling
capabilities, which can lead to large query responses and the need for the
client to post-process the data.
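
To illustrate the kind of post-processing the client ends up doing when
the database cannot downsample, here is a small sketch in plain Python.
The function name downsample_mean and the (unix_timestamp, value) tuple
layout are assumptions for the example only; a timeseries database would
do this aggregation on the server and return just the reduced result.

```python
from collections import defaultdict


def downsample_mean(points, bucket_seconds):
    """Average (unix_timestamp, value) points into fixed-width time buckets.

    Returns a list of (bucket_start, mean_value) sorted by bucket start.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        bucket_start = ts - (ts % bucket_seconds)
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Four raw points at 30 s spacing reduced to one point per minute.
raw = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0)]
per_minute = downsample_mean(raw, 60)
```

With a database that lacks this feature, every raw point has to travel to
the client before a step like this can run, which is where the large query
responses come from.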

Kind regards,

Aniruddha P Tekade <ateka...@binghamton.edu> wrote on Wed, 26 Feb 2020
at 02:23:

> Hello,
>
> I am trying to build a data pipeline that uses Spark structured streaming
> with the Delta Lake project and runs on Kubernetes. Because of this, I get
> my output files only in parquet format. Since I have been asked to use
> Prometheus and Grafana to build the dashboard for this pipeline, I run
> another small Spark job that converts the output into json so that I can
> insert it into Grafana. Although I can see that this step is redundant,
> given the importance of the Delta Lake project, I cannot write my data
> directly as json. Therefore I need some help/guidelines/opinions about
> moving forward from here.
>
> I would appreciate it if Spark users could suggest some practices to
> follow with respect to the following questions:
>
>    1. Since I can not have direct json output from spark structured
>    streams, is there any better way to convert parquet into json? Or should I
>    keep only parquet?
>    2. Will I need to write some custom exporter for prometheus so as to
>    make grafana read those time-series data?
>    3. Is there any better dashboard alternative than Grafana for this
>    requirement?
>    4. Since the pipeline is going to run on Kubernetes, I am trying to
>    avoid InfluxDB as the time-series database and go with Prometheus. Is
>    this approach correct?
>
> Thanks,
> Ani
>
-- 
Roland Johann
Software Developer/Data Engineer

phenetic GmbH
Lütticher Straße 10, 50674 Köln, Germany

Mobile: +49 172 365 26 46
Mail: roland.joh...@phenetic.io
Web: phenetic.io

Commercial register: Amtsgericht Köln (HRB 92595)
Managing directors: Roland Johann, Uwe Reimann
