So Flink TMs read one line at a time from HDFS in parallel, keep filling records into memory, and keep passing them to the next operator? I just want to understand how the data gets into memory and how it is partitioned between the TMs. Is there documentation I can refer to on how the reading is done and how data is pushed from operator to operator, in both streaming and batch?
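To check my mental model, here is a toy sketch of what I imagine is happening (the class name and the HDFS path are made up):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ReadSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4); // e.g. four parallel source subtasks spread over the TMs

        // each source subtask gets assigned HDFS input splits and streams
        // records downstream while it reads, instead of loading the file first?
        DataSet<String> lines = env.readTextFile("hdfs:///path/to/input"); // placeholder path

        DataSet<Integer> lengths = lines.map(new MapFunction<String, Integer>() {
            @Override
            public Integer map(String line) {
                return line.length(); // record at a time, pipelined from the source
            }
        });

        lengths.print(); // triggers execution and prints the result
    }
}

I.e., each source subtask reads its assigned splits record by record, and the map sees records as they arrive, without the whole file ever being loaded first. Is that right?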
On Wed, 12 Sep 2018 at 4:28 PM, Fabian Hueske <fhue...@gmail.com> wrote:

> Actually, some parts of Flink's batch engine are similar to streaming as
> well. If the data does not need to be sorted or put into a hash table, it
> is pipelined (as in many relational database systems).
> For example, if a job joins two inputs with a HashJoin, only the build
> side is materialized in memory. If the build side fits in memory, the
> probe side is fully pipelined. If some parts of the build side need to be
> put on disk, the fraction of the probe side that would join with the
> spilled part is also written to disk. If the data needs to be sorted,
> Flink tries to do that in memory as well, but can spill to disk if
> necessary. A job that only applies a filter or a simple transformation is
> also fully pipelined.
>
> So whether the data is held in memory depends on the job and its
> execution plan.
>
> Best, Fabian
>
> 2018-09-12 2:34 GMT-04:00 vino yang <yanghua1...@gmail.com>:
>
>> Hi Taher,
>>
>> Stream processing and batch processing are very different. By its
>> nature, batch processing operates on bulk data, e.g. for in-memory
>> sorting, joins, and so on. In those cases it has to wait for the relevant
>> data to arrive before computing, but that does not mean the data is
>> concentrated on a single node; the computation is still distributed.
>> Flink applies dedicated optimizations to the execution plan of a batch
>> job. To store large data sets it uses a custom managed-memory mechanism
>> (you can make more memory available by enabling off-heap memory). Data is
>> kept in memory as far as possible and spilled to disk when it does not
>> fit.
>>
>> Regarding fault tolerance, the current checkpoint mechanism only applies
>> to stream processing. Batch jobs achieve fault tolerance by replaying the
>> complete data set. If a TaskManager fails, Flink kicks it out of the
>> cluster and the tasks running on it fail, but the consequences of a task
>> failure differ between stream and batch jobs: for stream processing it
>> triggers a restart of the entire job, while for batch processing it may
>> only trigger a partial restart.
>>
>> Thanks, vino.
>>
>> On Wed, Sep 12, 2018 at 1:50 AM, Taher Koitawala <
>> taher.koitaw...@gslab.com> wrote:
>>
>>> Furthermore, how does Flink deal with TaskManagers dying when it is
>>> using the DataSet API? Is checkpointing done on a DataSet too, or does
>>> the whole DataSet have to be re-read?
>>>
>>> Regards,
>>> Taher Koitawala
>>> GS Lab Pune
>>> +91 8407979163
>>>
>>> On Tue, Sep 11, 2018 at 11:18 PM, Taher Koitawala <
>>> taher.koitaw...@gslab.com> wrote:
>>>
>>>> Hi All,
>>>> Just like Spark, does Flink read a dataset and keep it in memory while
>>>> applying transformations? Or are all records read by Flink via async
>>>> parallel reads? Furthermore, how does Flink deal with
>>>>
>>>> Regards,
>>>> Taher Koitawala
>>>> GS Lab Pune
>>>> +91 8407979163
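PS: Trying to map Fabian's HashJoin explanation to code, would a job like this behave that way? (A toy sketch: the class name and inputs are made up, and I am assuming joinWithHuge is the right hint to make the first input the build side.)

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class HashJoinSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // toy inputs; in a real job these would be e.g. readCsvFile(...) sources
        DataSet<Tuple2<Integer, String>> small = env.fromElements(
                Tuple2.of(1, "a"), Tuple2.of(2, "b"));
        DataSet<Tuple2<Integer, String>> large = env.fromElements(
                Tuple2.of(1, "x"), Tuple2.of(2, "y"), Tuple2.of(3, "z"));

        // joinWithHuge hints that the second input is the big one, so the
        // hash table should be built from `small` (materialized, spilled if
        // needed) while `large` streams through as the pipelined probe side
        DataSet<Tuple2<Tuple2<Integer, String>, Tuple2<Integer, String>>> joined =
                small.joinWithHuge(large).where(0).equalTo(0);

        joined.print(); // triggers execution and prints the joined pairs
    }
}

If I understand correctly, only `small` would be materialized in memory here, and `large` would never be held in memory as a whole.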