I have a significant amount of data stored in Hadoop HDFS as Parquet files. I am using Spark Streaming to interactively receive queries from a web server, translating each received query into SQL to run over the data with Spark SQL.
In this process I need to run several SQL queries and then produce an aggregate result by merging or subtracting the results of the individual queries. Is there any way to optimize and speed this up, for example by running queries on already-loaded DataFrames rather than re-reading the whole dataset each time? Is there a better way to interactively query the Parquet-stored data and return results? Thank you! Narek Galstyan