danny0405 commented on code in PR #8093:
URL: https://github.com/apache/hudi/pull/8093#discussion_r1127367385
##########
website/docs/timeline.md:
##########
@@ -3,40 +3,384 @@ title: Timeline
toc: true
---
-## Timeline
-At its core, Hudi maintains a `timeline` of all actions performed on the table
at different `instants` of time that helps provide instantaneous views of the
table,
-while also efficiently supporting retrieval of data in the order of arrival. A
Hudi instant consists of the following components
+A Hudi table maintains all operations happened to the table in a single
timeline comprised of two parts, an active timeline and an archived timeline.
The active timeline stores all the recent instant, while the archive timeline
stores the older instants. An instant is a transaction where all respective
partitions within a base path have been successfully updated by either a writer
or a table service. Instants that get older in the active timeline are moved to
archived timeline at various times.
-* `Instant action` : Type of action performed on the table
-* `Instant time` : Instant time is typically a timestamp (e.g:
20190117010349), which monotonically increases in the order of action's begin
time.
-* `state` : current state of the instant
+An instant can alter one or many partitions:
-Hudi guarantees that the actions performed on the timeline are atomic &
timeline consistent based on the instant time.
+- If you have one batch ingestion, you’ll see that as one commit in the
active timeline. When you open that commit file, you’ll see a JSON object with
metadata about how one or more partitions were altered.
+
+- If you’re ingesting streaming data, you might see multiple commits in the
active timeline. In this case, when you open a commit file, you might see
metadata about how one or more partition files were altered.
-Key actions performed include
+We’ll go over some details and concepts about the active and archived timeline
below. All files in the timelines are immutable.
-* `COMMITS` - A commit denotes an **atomic write** of a batch of records into
a table.
-* `CLEANS` - Background activity that gets rid of older versions of files in
the table, that are no longer needed.
-* `DELTA_COMMIT` - A delta commit refers to an **atomic write** of a batch of
records into a MergeOnRead type table, where some/all of the data could be
just written to delta logs.
-* `COMPACTION` - Background activity to reconcile differential data structures
within Hudi e.g: moving updates from row based log files to columnar formats.
Internally, compaction manifests as a special commit on the timeline
-* `ROLLBACK` - Indicates that a commit/delta commit was unsuccessful & rolled
back, removing any partial files produced during such a write
-* `SAVEPOINT` - Marks certain file groups as "saved", such that cleaner will
not delete them. It helps restore the table to a point on the timeline, in case
of disaster/data recovery scenarios.
+**Note**: The user should never directly alter the timeline (i.e. manually
delete the commits).
-Any given instant can be
-in one of the following states
+## Active Timeline
-* `REQUESTED` - Denotes an action has been scheduled, but has not initiated
-* `INFLIGHT` - Denotes that the action is currently being performed
-* `COMPLETED` - Denotes completion of an action on the timeline
+The active timeline is a source of truth for all write operations: when an
action (described below) happens on a table, the timeline is responsible for
recording it. This guarantees a good table state, and Hudi can provide
read/write isolation based on the timeline. For example, when data is being
written to a Hudi table (i.e., requested, inflight), any data being written as
part of the transaction is not visible to a query engine until the write
transaction is completed. The query engine can still read older data, but the
data inflight won’t be exposed.
-<figure>
- <img className="docimage"
src={require("/assets/images/hudi_timeline.png").default}
alt="hudi_timeline.png" />
-</figure>
+The active timeline is under the `.hoodie` metadata folder. For example, when
you navigate to your Hudi project directory:
-Example above shows upserts happenings between 10:00 and 10:20 on a Hudi
table, roughly every 5 mins, leaving commit metadata on the Hudi timeline, along
-with other background cleaning/compactions. One key observation to make is
that the commit time indicates the `arrival time` of the data (10:20AM), while
the actual data
-organization reflects the actual time or `event time`, the data was intended
for (hourly buckets from 07:00). These are two key concepts when reasoning
about tradeoffs between latency and completeness of data.
+```sh
+cd $YOUR_HUDI_PROJECT_DIRECTORY && ls -a
+```
+
+You’ll see the `.hoodie` metadata folder:
+
+```sh
+ls -a
+. .. .hoodie americas asia
+```
+
+When you navigate inside the `.hoodie` folder, you’ll see a lot of files with
different suffixes and the archived timeline folder:
+
+```sh
+cd .hoodie && ls
+2023021018095339.commit
+20230210180953939.commit.requested
+20230210180953939.inflight
+archived
+```
+
+Before we go into what’s in the files or how the files are named, we’ll need
to cover some broader concepts:
+- actions
+- states
Review Comment:
`- states` -> `- states`