+1 binding. would be interested to participate as a mentor
thanks On Wed, Nov 7, 2018, 12:53 PM Matt Sicker <boa...@gmail.com wrote: > +1 (binding) > > On Wed, 7 Nov 2018 at 08:03, Kevin A. McGrail <kmcgr...@apache.org> wrote: > > > +1 (binding) > > > > On 11/7/2018 2:46 AM, hxd wrote: > > > Hi, > > > Sorry for the previous mail with bad format. > > > I'd like to call a VOTE to accept IoTDB project, a database for > managing > > large amounts of time series data from IoT sensors in industrial > > applications, into the Apache Incubator. > > > The full proposal is available on the wiki: > > https://wiki.apache.org/incubator/IoTDBProposal > > > and it is also attached below for your convenience. > > > > > > Please cast your vote: > > > > > > [ ] +1, bring IoTDB into Incubator > > > [ ] +0, I don't care either way, > > > [ ] -1, do not bring IoTDB into Incubator, because... > > > > > > The vote will open at least for 72 hours. > > > > > > Thanks, > > > Xiangdong Huang. > > > > > > = IoTDB Proposal = > > > v0.1.1 > > > > > > > > > == Abstract == > > > IoTDB is a data store for managing large amounts of time series data > > such as timestamped data from IoT sensors in industrial applications. > > > > > > == Proposal == > > > IoTDB is a database for managing large amount of time series data with > > columnar storage, data encoding, pre-computation, and index techniques. > It > > has SQL-like interface to write millions of data points per second per > node > > and is optimized to get query results in few seconds over trillions of > data > > points. It can also be easily integrated with Apache Hadoop MapReduce and > > Apache Spark for analytics. > > > > > > == Background == > > > > > > A new class of data management system requirements is becoming > > increasingly important with the rise of the Internet of Things. There are > > some database systems and technologies aimed at time series data > > management. For example, Gorilla and InfluxDB which are mainly built for > > data centers and monitoring application metrics. Other systems, for > > example, OpenTSDB and KairosDB, are built on Apache HBase and Apache > > Cassandra, respectively. > > > > > > However, many applications for time series data management have more > > requirements especially in industrial applications as follows: > > > > > > * Supporting time series data which has high data frequency. For > > example, a turbine engine may generate 1000 points per second (i.e., > > 1000Hz), while each CPU only reports 1 data points per 5 seconds in a > data > > center monitoring application. > > > > > > * Supporting scanning data multi-resolutionally. For example, > > aggregation operation is important for time series data. > > > > > > * Supporting special queries for time series, such as pattern > matching, > > time series segmentation, time-frequency transformation and frequency > query. > > > > > > * Supporting a large number of monitoring targets (i.e. time series). > > An excavator may report more than 1000 time series, for example, > revolving > > speed of the motor-engine, the speed of the excavator, the accelerated > > speed, the temperature of the water tank and so on, while a CPU or an > > application monitor has much fewer time series. > > > > > > * Optimization for out-of-order data points. In the industrial sector, > > it is common that equipment sends data using the UDP protocol rather than > > the TCP protocol. Sometimes, the network connect is unstable and parts of > > the data will be buffered for later sending. > > > > > > * Supporting long-term storage. Historical data is precious for > > equipment manufacturers. Therefore, removing or unloading historical data > > is highly desired for most industrial applications. The database system > > must not only support fast retrieval of historical data, but also should > > guarantee that the historical data does not impact the processing speed > for > > “hot” or current data. > > > > > > * Supporting online transaction processing (OLTP) as well as complex > > analytics. It is obvious that supporting analyzing from the data files > > using Apache Spark/Apache Hadoop MapReduce directly is better than > > transforming data files to another file format for Big Data analytics. > > > > > > * Flexible deployment either on premise or in the cloud. IoTDB is as > > simple and can be deployed on a Raspberry Pi handling hundreds of time > > series. Meanwhile, the system can be also deployed in the cloud so that > it > > supports tens of millions ingestions per second, OLTP queries in > > milliseconds, and analytics using Apache Spark/Apache Hadoop MapReduce. > > > > > > * * (1) If users deploy IoTDB on a device, such as a Raspberry Pi, a > > wind turbine, or a meteorological station, the deployment of the chosen > > database is designed to be simple. A device may have hundreds of time > > series (but less than a thousand time series) and the database needs to > > handle them. > > > * * (2) When deploying IoTDB in a data center, the computational > > resources (i.e., the hardware configuration of servers) is not a problem > > when compared to a Raspberry Pi. In this deployment, IoTDB can use more > > computation resources, and has the ability to handle more time seires > > (e.g., millions of time series). > > > > > > Based on these requirements, we developed IoTDB, a new data store > system > > for managing time series data. > > > > > > IoTDB started as a Tsinghua University research project. IoTDB's > > developer community has also grown to include additional institutions, > for > > example, universities (e.g., Fudan University), research labs (e.g, > NEL-BDS > > lab), and corporations (e.g., K2Data, Tencent). Funding has been provided > > by various institutions including the National Natural Science Foundation > > of China, and industry sponsors, such as Lenovo and K2Data. > > > > > > == Rationale == > > > Because there is no existed open-sourced time series databases covering > > all the above requirements, we developed IoTDB. As the system matures, we > > are seeking a long-term home for the project. We believe the Apache > > Software Foundation would be an ideal fit. Also joining Apache will help > > coordinate and improve the development effort of the growing number of > > organizations which contribute to IoTDB improving the diversity of our > > community. > > > > > > IoTDB contains multiple modules, which are classified into categories: > > > > > > * '''TsFile Format''': TsFile is a new columnar file format. > > > * '''Adaptor for Analytics and Visualization''': Integrating TsFile > > with Apache Hadoop HDFS, Apache Hadoop MapReduce and Apache Spark. > Examples > > of integrating IoTDB with Apache Kafka, Apache Storm and Grafana are also > > provided. > > > * '''IoTDB Engine''': An engine which consists of SQL parser, query > > plan generator, memtable, authentication and authorization,write ahead > log > > (WAL), crash recovery, out-of-order data handler, and index for > aggregation > > and pattern matching. The engine stores system data in TsFile format. > > > * '''IoTDB JDBC''': An implementation of Java Database Connectivity > > (JDBC) for clients to connect to IoTDB using Java. > > > > > > === TsFile Format === > > > > > > TsFile format is a columnar store, which is similar with Apache Parquet > > and Apache CarbonData. It has the concepts of Chunk Group, Column Chunk, > > Page and Footer. Comparing with Apache Parquet and Apache CarbonData, it > is > > designed and optimized for time series: > > > > > > ==== Time Series Friendly Encoding ==== > > > IoTDB currently supports run length encoding (RLE), delta-of-delta > > encoding, and Facebook's Gorilla encoding. > > > > > > Lossy encoding methods (e.g., Piecewise Linear Approximation (PLA) and > > time-frequency transformation are works-in-progress. > > > > > > > > > ==== Chunk Group ==== > > > The data part of a TsFile consists of many Chunk Groups. Each Chunk > > Group stores the data of a device at a time interval. A Chunk Group is > > similar to the row group in Apache Parquet, while there are some > > constraints of the time dimension: For each device, the time intervals > of > > different Chunk Groups are not overlapped and the latter Chunk Group > always > > has a larger timestamp. > > > > > > Given a TsFile and a query with a time range filter, the query process > > can terminate scanning data once it reads data points whose timestamp > > reaches the time limit of the filter. We call the feature ''fast-return'' > > and it makes the time range query in a TsFile very efficient. > > > > > > > > > > > > ==== Different Column Chunk Format (Unnecessary the Repetition (R) and > > Definition (D) Fields) ==== > > > > > > While Apache Parquet and Apache CarbonData support complex data types, > > e.g., nested data and sparse columns, TsFile is exclusively designed for > > time series whose data model is \<device_id, series_id, timestamp, > value\>. > > > > > > In a `Chunk Group`, each time series is a `Column Chunk`. Even though > > these time series belong to the same device, the data points in different > > time series are not aligned in the time dimension originally. > > > > > > For example, if you have a device with 2 sensors on the same data > > collection frequencies, sensor 1 may collect data at time 1521622662000 > > while the other one collects data at time 1521622662001 (delta=1ms). > > Therefore, each Column Chunk has its timestamps and values, which is > quite > > different from Apache Parquet and Apache CarbonData. Because we store > the > > time column along with each value column instead of making different > chunks > > share the same time column for the sake of diverse data frequency for > > different time series, we do not store any null value on disk to align > > across time series. Besides, we do not need to attach `repetition` (R) > and > > `definition` (D) fields on each value. Therefore, the disk space is saved > > and the query latency is reduced (because we do not align data by > > calculating R and D fields). > > > > > > > > > ==== Domain Specific Information in Each Page ==== > > > Similar to Apache Parquet and Apache CarbonData, a `Column Chunk` > > consists of several `Pages`, and each `Page` has a `Page header`. The > `Page > > header` is a summary of the data in the page. > > > > > > Because TsFile is optimized for time series, the page header contains > > more domain specific information, such as the minimal and maximal value, > > the minimal and the maximal timestamp, the frequency and so on. TsFile > can > > even store the histogram of values in the page header. > > > > > > This header information helps IoTDB in speeding up queries by skipping > > unnecessary pages. > > > > > > > > > === Adaptor for Analytics === > > > The TsFile provides: > > > > > > * InputFormat/OutputFormat interfaces for Reading/Writing data. > > > * Deep integration with Apache Spark/Hadoop MapReduce including > > predicate push-down, column pruning, aggregation push down, etc. So users > > can use Apache Spark SQL/HiveQL to connect and query TsFiles. > > > > > > > > > === IoTDB Engine === > > > The IoTDB engine is a database engine, which uses TsFile as its storage > > file format. The IoTDB Engine supports SQL-like query plus many useful > > functions: > > > > > > * Tree-based time series schema > > > * Log-Structured Merge (LSM)-based storage > > > * Overflow file for out-of-order data > > > * Scalable index framework > > > * Special queries for time series > > > > > > ==== Tree-based Time Series Schema ==== > > > IoTDB manages all the time series definitions using a tree structure. A > > path from the root of the tree to a leaf node represents a time series. > > Therefore, the unique id of a time series is a path, e.g., > > `root.China.beijing.windFarm1.windTurbine1.speed`. > > > > > > This kind of schema can express `group by` naturally. For example, > > `root.China.beijing.windFarm1.*.speed` represents the speed of all the > wind > > turbines in wind farm 1 in Beijing, China. > > > > > > ==== Log-Structured Merge (LSM)-based Storage ==== > > > In a time series, the data points should be ordered by their > timestamps. > > In IoTDB, we use Log-Structured Merge (LSM) based mechanism. Therefore, a > > part of the data is stored in memory first and can be called as > `memtable`. > > At this time, if data points come out-of-order, we resort them in memory. > > When this part of data exceeds the configured memory limit, we flush it > on > > disk as a `Chunk Group` into an unclosed TsFile. Finally, a TsFile may > > contain several Chunk Groups, for reducing the number of small data > files, > > which is helpful to reduce the I/O load of the storage system and reduces > > the execution time of a file-merge in LSM. Notice that the data is > > time-ordered in one Chunk Group on disk, and this layout is helpful for > > fast filtering in one Chunk Group for a query. > > > > > > Rule 1: In a TsFile, the Chunk Groups of one device are ordered by > > timestamp (Rule 1), and it is helpful for fast filtering among Chunk > Groups > > for a query. > > > > > > Rule 2: When the size of the unclosed TsFile reaches the threshold > > defined in the configuration file, we close the file and generate a new > one > > to store new arriving data spanning the entire data set. Like many > systems > > which use LSM-based storage, we never modify a TsFile which has been > closed > > except for the file-merge process (Rule 2). > > > > > > Rule 3: To reduce the number of TsFiles involved in a query process, we > > guarantee that the data points in different TsFiles are not overlapping > on > > the time dimension after file mergence (Rule 3). > > > > > > ==== Overflow File for Out-of-order Data ==== > > > When a part of data is flushed on disk (and will form a `Chunk Group` > in > > a TsFile), the newly arriving data points whose timestamps are smaller > than > > the largest timestamp in the Tsfile are `out-of-order`. > > > > > > To store the out-of-order data, we organize all the troublesome > > `out-of-order` data point insertions into a special TsFile, named > > `UnSequenceTsFile`. In an UnSequenceTsFile, the Chunk Groups of one > device > > may be overlapping in the time dimension, which violates the Rule 1 and > > costs additional time compared to a normal TsFile for query filtering. > > > > > > There is another special operation: updating all the data points in a > > time range, e.g., `update all the speed values of device1 as 0 where the > > data time is in [1521622000000, 1521622662000]`. The operation is called > > when: (1) a sensor malfunctions and the database receives wrong data for > a > > period; (2) we may want to reset all the records. Many NoSQL time series > > databases do not support such an operation. To support the operation in > > IoTDB, we use a tree-based structure, Treap, to store this part of > > operations and store them as `Overflow` files. > > > > > > Therefore, there are 3 kinds of data files: TsFiles, UnSequenceTsFiles > > and Overflow files. TsFiles should store most of the data. The volume of > > UnSequenceTsFiles depends on the workload: if there are too many > > out-of-order and the time span of out-of-order is huge, the volume will > be > > large. Overflow files handle fewest data operations but will depend on > the > > use of the special operations. > > > > > > ==== LSM-tree ==== > > > Normally, LSM-based storage engines merge data files level by level so > > that it looks like a tree structure. In this way, data is well organized. > > The disadvantage is that data will be read and written several times. If > > the tree has 4 levels, each data point will be rewritten at least 4 > times. > > > > > > Currently, we do not merge all the TsFiles into one because (1) the > > number of TsFiles is kept lower than many LSM storage engines because a > > memtable is mapped to several Chunk Groups rather than a file; (2) > > different TsFiles are not overlapping with each other in the time > dimension > > (because of Rule 3). > > > > > > As mentioned before, TsFile supports ''fast-return'' to accelerate > > queries. However, UnSequenceTsFile and Overflow files do not allow this > > feature. The time spans of UnSequenceTsFile, Overflow file andTsFile may > be > > overlapped, which leads to more files involved in the query process. To > > accelerate these queries, there is a merging process to reorganize files > in > > the background. All the three kinds of files: TsFiles, UnSequenceTsFiles > > and Overflow files, are involved in the merging process. The merging > > process is implemented using multi-threading, while each thread is > > responsible for a series family. > > > After merging, only TsFiles are left. These files have non-overlapping > > time spans and support the ''fast-return'' feature. > > > > > > ==== Scalable Index Framework ==== > > > We allow users to implement indexes for faster queries. We currently > > support an index for pattern matching query (KV-Match index, ICDE 2019). > > Another index for fast aggregation (PISA index, CIKM 2016) is a > > work-in-progress. > > > > > > ==== Special Queries ==== > > > We currently support `group by time interval` aggregation queries and > > `Fill by` operations, which are similar to those of InfluxDB. Time series > > segmentation operations and frequency queries are work-in-progress. > > > > > > == Initial Goals == > > > The initial goals are to be open sourced and to integrate with the > > Apache development process. Furthermore, we plan for incremental > > development, and releases along with the Apache guidelines. > > > > > > == Current Status == > > > We have developed the system for more than 2 years. There are currently > > 13k lines of code, some of which are generated by Antlr3 and Thrift. > There > > are 230 issues which have been solved and more than 1500 commits. > > > > > > The system has been deployed in the staging environment of the State > > Grid Corporation of China to handle ~3 million time series (i.e, ~30,000 > > power generation assembly * ~100 sensors) and an equipment service > company > > in China managing ~2 million time series (i.e, ~20k devices * 100 > sensors). > > The insertion speed reaches ~2 million points/second/node, which is > faster > > than InfluxDB, OpenTSDB and Apache Cassandra in our environment. > > > > > > There are many new features in the works including those mentioned > > herein. We will add more analytics functions, improve the data file merge > > process, and finish the first released version of IoTDB. > > > > > > == Meritocracy == > > > The IoTDB project operates on meritocratic principles. Developers who > > submit more code with higher quality earn more merit. We have used > `Issues` > > and `Pull Requests` modules on Github for collecting users' suggestions > and > > patches. Users who submit issues, pull requests, documents and help the > > community management are welcomed and encouraged to become committers. > > > > > > == Community == > > > > > > The IoTDB project users communicate on Github ( > > https://github.com/thulab/tsfile) . Developers make the communication on > > a website which is similar with JIRA (Currently, only registered users > can > > apply to access the project for communication, url: > > https://tower.im/projects/36de8571a0ff4833ae9d7f1c5c400c22/). We have > > also introduced IoTDB at many technical conferences. Next, we will build > > the mailing list for more convenience, broader communication and archived > > discussions. > > > > > > If IoTDB is accepted for incubation at the Apache Software Foundation, > > the primary goal is to build a larger community. We believe that IoTDB > will > > become a key project for time series data management, and so, we will > rely > > on a large community of users and developers. > > > > > > TODO: IoTDB is currently on a private Github repository ( > > https://github.com/thulab/iotdb), while its subproject TsFile (a file > > format for storing time series data) is open sourced on Github ( > > https://github.com/thulab/tsfile). > > > > > > == Core Developers == > > > IoTDB was initially developed by 2 dozen of students and teachers at > > Tsinghua University. Now, more and more developers have joined coming > from > > other universities: Fudan University, Northwestern Polytechnical > University > > and Harbin Institute of Technology in China. Other developers come from > > business companies such as Lenovo and Microsoft. We will be working to > > bring more and more developers into the project making contributions to > > IoTDB. > > > > > > == Relationships with Other Apache Products == > > > IoTDB requires some Apache products (Apache Thrift, commons, > > collections, httpclient). > > > > > > IoTDB-Spark-connector and IoTDB-Hadoop-connector have been developed > for > > supporting analysing time series data by using Apache Spark and > MapReduce. > > > > > > Overall, IoTDB is designed as an open architecture, and it can be > > integrated with many other systems in the future. > > > > > > As mentioned before, in the IoTDB project, we designed a new columnar > > file format, called TsFile, which is similar to Apache Parquet. However, > > the new file format is optimized for time series data. > > > > > > > > > > > > == Known Risks == > > > > > > === Orphaned Products === > > > Given the current level of investment in IoTDB, the risk of the project > > being abandoned is minimal. Time series data is more and more important > and > > there are several constituents who are highly inspired to continue > > development. Tsinghua and NEL-BDS Lab relies on IoTDB as a platform for a > > large number of long-term research projects. We have deployed IoTDB in > some > > company's staging environments for future applications. > > > > > > === Inexperience with Open Source === > > > Students and researchers in Tsinghua University have been developing > and > > using open source software for a long time. It is wonderful to be guided > to > > join a formal open-source process for students. Some of our committers > > > have experiences contributing to open source, for example: > > > > > > * druid: > > > https://github.com/druid-io/druid/commit/f18cc5df97e5826c2dd8ffafba9fcb69d10a4d44 > > > * druid: > > > https://github.com/druid-io/druid/commit/aa7aee53ce524b7887b218333166941654788794 > > > * YCSB: https://github.com/brianfrankcooper/YCSB/pull/776 > > > > > > Additionally, several ASF veterans and industry veterans have agreed to > > mentor the project and are listed in this proposal. The project will rely > > on their guidance and collective wisdom to quickly transition the entire > > team of initial committers towards practicing the Apache Way. > > > > > > > > > === Reliance on Salaried Developers === > > > Most of current developers are students and researchers/professors in > > universities, and their researches focus on big data management and > > analytics. It is unlikely that they will change their research focus away > > from big data management. We will work to ensure that the ability for > the > > project to continuously be stewarded and to proceed forward independent > of > > salaried developers is continued. > > > > > > === An Excessive Fascination with the Apache Brand === > > > Most of the initial developers come from Tsinghua University with no > > intent to use the Apache brand for profit. We have no plans for making > use > > of Apache brand in press releases nor posting billboards advertising > > acceptance of IoTDB into Apache Incubator. > > > > > > > > > == Initial Source == > > > IoTDB's github address and some required dependencies: > > > > > > * The storage file format: https://github.com/thulab/tsfile > > > * Adaptor for Apache Hadoop MapReduce: > > https://github.com/thulab/tsfile-hadoop-connector > > > * Adaptor for Apache Spark: > > https://github.com/thulab/tsfile-spark-connector > > > * Adaptor for Grafana: https://github.com/thulab/iotdb-grafana > > > * The database engine: https://github.com/thulab/iotdb (private > > project up to now) > > > * The client driver: https://github.com/thulab/iotdb-jdbc > > > > > > > > > === External Dependencies === > > > To the best of our knowledge, all dependencies of IoTDB are distributed > > under Apache compatible licenses. Upon acceptance to the incubator, we > > would begin a thorough analysis of all transitive dependencies to verify > > this fact and introduce license checking into the build and release > process. > > > > > > == Documentation == > > > * Documentation for TsFile: https://github.com/thulab/tsfile/wiki > > > * Documentation for IoTDB and its JDBC: http://tsfile.org/document > > (Chinese only. An English version is in progress.) > > > > > > == Required Resources == > > > === Mailing Lists === > > > * priv...@iotdb.incubator.apache.org > > > * d...@iotdb.incubator.apache.org > > > * comm...@iotdb.incubator.apache.org > > > > > > === Git Repositories === > > > * https://git-wip-us.apache.org/repos/asf/incubator-iotdb.git > > > > > > === Issue Tracking === > > > * JIRA IoTDB (We currently use the issue management provided by > Github > > to track issues.) > > > > > > > > > == Initial Committers == > > > Tsinghua University, K2Data Company, Lenovo, Microsoft > > > > > > Jianmin Wang (jimwang at tsinghua dot edu dot cn ) > > > > > > Xiangdong Huang (sainthxd at gmail dot com) > > > > > > Jun Yuan (richard_yuan16 at 163 dot com) > > > > > > Chen Wang ( wang_chen at tsinghua dot edu dot cn) > > > > > > Jialin Qiao (qjl16 at mails dot tsinghua dot edu dot cn) > > > > > > Jinrui Zhang (jinrzhan at microsoft dot com) > > > > > > Rong Kang (kr11 at mails dot tsinghua dot edu dot cn) > > > > > > Tian Jiang(jiangtia18 at mails dot tsinghua dot edu dot cn) > > > > > > Shuo Zhang (zhangshuo at k2data dot com dot cn) > > > > > > Lei Rui (rl18 at mails dot tsinghua dot edu dot cn) > > > > > > Rui Liu (liur17 at mails dot tsinghua dot edu dot cn) > > > > > > Kun Liu (liukun16 at mails dot tsinghua dot edu dot cn) > > > > > > Gaofei Cao (cgf16 at mails dot tsinghua dot edu dot cn) > > > > > > Xinyi Zhao (xyzhao16 at mails dot tsinghua dot edu dot cn) > > > > > > Dongfang Mao (maodf17 at mails dot tsinghua dot edu dot cn) > > > > > > Tianan Li(lta18 at mails dot tsinghua dot edu dot cn) > > > > > > Yue Su (suy18 at mails dot tsinghua dot edu dot cn) > > > > > > Hui Dai (daihui_iot at lenovo dot com, yuct_iot at lenovo dot com ) > > > > > > == Sponsors == > > > === Champion === > > > Kevin A. McGrail (kmcgr...@apache.org) > > > > > > === Nominated Mentors === > > > Justin Mclean (justin at classsoftware dot com) > > > > > > Christofer Dutz (christofer.dutz at c-ware dot de) > > > > > > Willem Jiang (willem.jiang at gmail dot com) > > > > > > > > > > -- > > Kevin A. McGrail > > VP Fundraising, Apache Software Foundation > > Chair Emeritus Apache SpamAssassin Project > > https://www.linkedin.com/in/kmcgrail - 703.798.0171 > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org > > For additional commands, e-mail: general-h...@incubator.apache.org > > > > > > -- > Matt Sicker <boa...@gmail.com> >