2018-06-06 10:58 GMT+02:00 Konstantin Knizhnik <k.knizh...@postgrespro.ru>:
> On 05.06.2018 20:17, MauMau wrote:
>
>> From: Merlin Moncure
>>
>>> FWIW, distributed analytical queries is the right market to be in.
>>> This is the field in which I work, and this is where the action is at.
>>> I am very, very sure about this. My view is that many of the
>>> existing solutions to this problem (in particular Hadoop-class
>>> solutions) have major architectural downsides that make them
>>> inappropriate in use cases that postgres really shines at; direct
>>> hookups to low-latency applications, for example. postgres is
>>> fundamentally a more capable 'node' with its multiple man-millennia of
>>> engineering behind it. Unlimited vertical scaling (RAC etc.) is
>>> interesting too, but this is not the way the market is moving, as
>>> hardware advancements have reduced or eliminated the need for that in
>>> many spheres.
>>
>> I'm feeling the same. As Moore's Law ceases to hold, software
>> needs to make the most of the processor power. Hadoop and Spark are
>> written in Java and Scala. According to Google [1] (see Fig. 8), Java
>> is slower than C++ by 3.7x - 12.6x, and Scala is slower than C++ by
>> 2.5x - 3.6x.
>>
>> Won't PostgreSQL be able to cover the workloads of Hadoop and Spark
>> someday, when PostgreSQL supports scale-out, in-memory database,
>> multi-model capability, and in-database filesystem? That may be a
>> pipe dream, but why do people have to tolerate the separation of the
>> relational-based data warehouse and the Hadoop-based data lake?
>>
>> [1] Robert Hundt. "Loop Recognition in C++/Java/Go/Scala".
>> Proceedings of Scala Days 2011
>>
>> Regards
>> MauMau
>
> I cannot completely agree with this. I have done a lot of benchmarking of
> PostgreSQL, CitusDB, SparkSQL and native C/Scala code generated for TPC-H
> queries.
> The picture is not so obvious... All these systems provide different
> scalability and so show their best performance on different hardware
> configurations.
> Also, the Java JIT has made good progress since 2011. Calculation-intensive
> code (like matrix multiplication) implemented in Java is about 2 times
> slower than optimized C code.
> But DBMSes are rarely CPU-bound. Even if the whole database fits in memory
> (which is not such a common scenario for big data applications), the speed
> of a modern CPU is much higher than RAM access speed... Java applications
> are slower than C/C++ mostly because of garbage collection. This is why
> SparkSQL is moving to an off-heap approach, where objects are allocated
> outside the Java heap and so do not affect the Java GC. New versions of
> SparkSQL with off-heap memory and native code generation show very good
> performance. And high scalability has always been one of the major
> features of SparkSQL.
>
> So it is naive to expect that Postgres will be 4 times faster than
> SparkSQL on analytic queries just because it is written in C and SparkSQL
> in Scala.
> Postgres has made very good progress in support of OLAP in recent
> releases: it now supports parallel query execution, JIT, partitioning...
> But its scalability is still very limited compared with SparkSQL. I am
> not sure about GreenPlum with its sophisticated distributed query
> optimizer, but most other OLAP solutions for Postgres are not able to
> efficiently handle complex queries (with a lot of joins on
> non-partitioning keys).
>
> I do not want to say that it is not possible to implement a good analytic
> platform for OLAP on top of Postgres. But it is a very challenging task.
> And IMHO the choice of programming language is not so important. What is
> more important is the format in which data is stored. The best systems for
> data analytics (Vertica, HyPer, KDB, ...) use a vertical data model.
> SparkSQL also uses the Parquet file format, which provides efficient
> extraction and processing of data.
> With the abstract storage API, Postgres also gets a chance to implement
> efficient storage for OLAP data processing. But a huge amount of work has
> to be done here.
>

Unfortunately, storage is only one factor. For good performance, columnar
storage needs a different executor. Still, smart columnar storage can achieve
a very good compression ratio, so it can make sense on its own.

Regards

Pavel

> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company