becketqin commented on a change in pull request #13401:
URL: https://github.com/apache/flink/pull/13401#discussion_r491762173



##########
File path: flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/impl/ContinuousFileSplitEnumerator.java
##########
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.connector.file.src.impl;
+
+import org.apache.flink.api.connector.source.SourceEvent;
+import org.apache.flink.api.connector.source.SplitEnumerator;
+import org.apache.flink.api.connector.source.SplitEnumeratorContext;
+import org.apache.flink.connector.base.source.event.RequestSplitEvent;
+import org.apache.flink.connector.file.src.FileSourceSplit;
+import org.apache.flink.connector.file.src.PendingSplitsCheckpoint;
+import org.apache.flink.connector.file.src.assigners.FileSplitAssigner;
+import org.apache.flink.connector.file.src.enumerate.FileEnumerator;
+import org.apache.flink.core.fs.Path;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.Collection;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Optional;
+import java.util.stream.Collectors;
+
+import static org.apache.flink.util.Preconditions.checkArgument;
+import static org.apache.flink.util.Preconditions.checkNotNull;
+
+/**
+ * A continuously monitoring enumerator.
+ */
+public class ContinuousFileSplitEnumerator implements SplitEnumerator<FileSourceSplit, PendingSplitsCheckpoint> {
+
+       private static final Logger LOG = LoggerFactory.getLogger(ContinuousFileSplitEnumerator.class);
+
+       private final SplitEnumeratorContext<FileSourceSplit> context;
+
+       private final FileSplitAssigner splitAssigner;
+
+       private final FileEnumerator enumerator;
+
+       private final HashSet<Path> pathsAlreadyProcessed;

Review comment:
       Sounds good.

##########
File path: flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/reader/BulkFormat.java
##########
@@ -0,0 +1,158 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.connector.file.src.reader;
+
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.connector.file.src.util.MutableRecordAndPosition;
+import org.apache.flink.connector.file.src.util.RecordAndPosition;
+import org.apache.flink.core.fs.Path;
+
+import javax.annotation.Nullable;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.io.Serializable;
+
+/**
+ * The {@code BulkFormat} reads and decodes batches of records at a time. Examples of bulk formats
+ * are formats like ORC or Parquet.
+ *
+ * <p>The actual reading is done by the {@link BulkFormat.Reader}, which is created in the
+ * {@link BulkFormat#createReader(Configuration, Path, long, long)} or
+ * {@link BulkFormat#createReader(Configuration, Path, long, long, long)} methods.
+ * The outer class acts mainly as a configuration holder and factory for the reader.
+ *
+ * <h2>Checkpointing</h2>
+ *
+ * <p>The bulk reader returns an iterator structure per batch. The iterator produces records together
+ * with a position. That position is the point from where the reading can be resumed AFTER
+ * the record was emitted. So that position effectively points to the record AFTER the current record.
+ *
+ * <p>The simplest way to return this position information is to always assume a zero offset in the file
+ * and simply increment the record count. Note that in this case the first record would be returned with
+ * a record count of one, the second one with a record count of two, etc.
+ *
+ * <p>Formats that have the ability to efficiently seek to a record (or to every n-th record) by offset
+ * in the file can work with the position field to avoid having to read and discard many records on recovery.
+ *
+ * <h2>Serializable</h2>
+ *
+ * <p>Like many other API classes in Flink, the outer class is serializable to support sending instances
+ * to distributed workers for parallel execution. This is purely short-term serialization for RPC and
+ * no instance of this will be long-term persisted in a serialized form.
+ *
+ * <h2>Record Batching</h2>
+ *
+ * <p>Internally in the file source, the readers pass batches of records from the reading
+ * threads (that perform the typically blocking I/O operations) to the async mailbox threads that
+ * do the streaming and batch data processing. Passing records in batches (rather than one-at-a-time)
+ * greatly reduces the thread-to-thread handover overhead.
+ *
+ * <p>For the {@code BulkFormat}, one batch (as returned by {@link BulkFormat.Reader#readBatch()}) is
+ * handed over as one unit.
+ */
+public interface BulkFormat<T> extends Serializable {
+
+       /**
+        * Creates a new reader that reads from {@code filePath} starting at {@code offset} and reads
+        * until {@code length} bytes after the offset.
+        */
+       default BulkFormat.Reader<T> createReader(Configuration config, Path filePath, long offset, long length) throws IOException {
+               return createReader(config, filePath, offset, length, 0L);
+       }
+
+       /**
+        * Creates a new reader that reads from {@code filePath} starting at {@code offset} and reads
+        * until {@code length} bytes after the offset. A number of {@code recordsToSkip} records should be
+        * read and discarded after the offset. This is typically part of restoring a reader to a checkpointed
+        * position.
+        */
+       BulkFormat.Reader<T> createReader(Configuration config, Path filePath, long offset, long length, long recordsToSkip) throws IOException;
+
+       /**
+        * Gets the type produced by this format. This type will be the type produced by the file
+        * source as a whole.
+        */
+       TypeInformation<T> getProducedType();

Review comment:
       Personally I feel it is OK to use `ResultTypeQueryable` here. The class is in the `org.apache.flink.api.java.typeutils` package, which seems intended for general use.
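   For illustration, a minimal sketch of that option (hypothetical; the PR as written declares `getProducedType()` directly on `BulkFormat` instead):

```java
import org.apache.flink.api.java.typeutils.ResultTypeQueryable;

import java.io.Serializable;

// Hypothetical variant: the format inherits getProducedType() from
// ResultTypeQueryable instead of declaring the method itself.
public interface BulkFormat<T> extends Serializable, ResultTypeQueryable<T> {
    // ResultTypeQueryable<T> already declares
    //     TypeInformation<T> getProducedType();
    // so no separate declaration is needed here.
}
```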

##########
File path: flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/FileSourceSplit.java
##########
@@ -0,0 +1,190 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.connector.file.src;
+
+import org.apache.flink.annotation.PublicEvolving;
+import org.apache.flink.api.connector.source.SourceSplit;
+import org.apache.flink.core.fs.Path;
+
+import javax.annotation.Nullable;
+
+import java.io.Serializable;
+import java.util.Arrays;
+
+import static org.apache.flink.util.Preconditions.checkArgument;
+import static org.apache.flink.util.Preconditions.checkNotNull;
+
+/**
+ * A {@link SourceSplit} that represents a file, or a region of a file.
+ */
+@PublicEvolving
+public class FileSourceSplit implements SourceSplit, Serializable {
+
+       private static final long serialVersionUID = 1L;
+
+       private static final String[] NO_HOSTS = new String[0];
+
+       /** The unique ID of the split. Unique within the scope of this source. */
+       private final String id;
+
+       /** The path of the file referenced by this split. */
+       private final Path filePath;
+
+       /** The position of the first byte in the file to process. */
+       private final long offset;
+
+       /** The number of bytes in the file to process. */
+       private final long length;
+
+       /** The number of records to be skipped from the beginning of the split.
+        * This is for file formats that cannot pinpoint every exact record position via an offset,
+        * due to read buffers or bulk encoding or compression. */
+       private final long skippedRecordCount;
+
+       /** The names of the hosts storing this range of the file. Empty if no host information is available. */
+       private final String[] hostnames;
+
+       /** The splits are frequently serialized into checkpoints.
+        * Caching the byte representation makes repeated serialization cheap.
+        * This field is used by {@link FileSourceSplitSerializer}. */
+       @Nullable
+       transient byte[] serializedFormCache;

Review comment:
       I am still not sure how the cached `serializedFormCache` works.
   
   The logic in `FileSourceSplitSerializer` serializes the given `FileSourceSplit` instance and then sets the `serializedFormCache` on that `FileSourceSplit` instance for reuse later. However, the `FileSourceSplit` passed to the `FileSourceSplitSerializer` is always a new instance returned by `FileSourceSplitState#toFileSourceSplit()`. So it seems `FileSourceSplitSerializer#serialize()` is not able to leverage the `serializedFormCache`.
   
   Also, if I understand correctly,
   1. For the active split that is being read by the reader, the `serializedFormCache` doesn't gain much because the position keeps changing.
   2. For the splits that are waiting in the queue, the `serializedFormCache` helps avoid serializing them repeatedly, because they won't change between two checkpoints.
   
   In this case, when a split is polled out of the waiting queue, the cache should be invalidated. But I did not find that part of the logic.
   
   Am I missing something?
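   For illustration, a minimal sketch of the invalidation step described above (hypothetical `InvalidatingSplitQueue` helper; it assumes a class in the same `org.apache.flink.connector.file.src` package, where the package-private `serializedFormCache` field is visible):

```java
package org.apache.flink.connector.file.src;

import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical helper illustrating the missing step: once a split leaves
// the waiting queue its position starts changing, so the cached serialized
// form should no longer be reused.
final class InvalidatingSplitQueue {

    private final Queue<FileSourceSplit> splits = new ArrayDeque<>();

    void add(FileSourceSplit split) {
        splits.add(split);
    }

    FileSourceSplit poll() {
        final FileSourceSplit split = splits.poll();
        if (split != null) {
            // drop the cache; the next checkpoint must re-serialize this split
            split.serializedFormCache = null;
        }
        return split;
    }
}
```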
   
   

##########
File path: flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/impl/FileSourceSplitReader.java
##########
@@ -0,0 +1,110 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.connector.file.src.impl;
+
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.connector.base.source.reader.RecordsWithSplitIds;
+import org.apache.flink.connector.base.source.reader.splitreader.SplitReader;
+import org.apache.flink.connector.base.source.reader.splitreader.SplitsAddition;
+import org.apache.flink.connector.base.source.reader.splitreader.SplitsChange;
+import org.apache.flink.connector.file.src.FileSourceSplit;
+import org.apache.flink.connector.file.src.reader.BulkFormat;
+import org.apache.flink.connector.file.src.util.RecordAndPosition;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.annotation.Nullable;
+
+import java.io.IOException;
+import java.util.ArrayDeque;
+import java.util.Queue;
+
+/**
+ * The {@link SplitReader} implementation for the file source.
+ */
+final class FileSourceSplitReader<T> implements SplitReader<RecordAndPosition<T>, FileSourceSplit> {
+
+       private static final Logger LOG = LoggerFactory.getLogger(FileSourceSplitReader.class);
+
+       private final Configuration config;
+       private final BulkFormat<T> readerFactory;
+
+       private final Queue<FileSourceSplit> splits;
+
+       @Nullable
+       private BulkFormat.Reader<T> currentReader;
+       @Nullable
+       private String currentSplitId;
+
+       public FileSourceSplitReader(Configuration config, BulkFormat<T> readerFactory) {
+               this.config = config;
+               this.readerFactory = readerFactory;
+               this.splits = new ArrayDeque<>();
+       }
+
+       @Override
+       public RecordsWithSplitIds<RecordAndPosition<T>> fetch() throws IOException {
+               checkSplitOrStartNext();
+
+               final BulkFormat.RecordIterator<T> nextBatch = currentReader.readBatch();
+               return nextBatch == null ? finishSplit() : FileRecords.forRecords(currentSplitId, nextBatch);
+       }
+
+       @Override
+       public void handleSplitsChanges(final SplitsChange<FileSourceSplit> splitChange) {
+               if (!(splitChange instanceof SplitsAddition)) {
+                       throw new UnsupportedOperationException(String.format(
+                                       "The SplitChange type of %s is not supported.", splitChange.getClass()));
+               }
+
+               LOG.debug("Handling split change {}", splitChange);
+               splits.addAll(splitChange.splits());
+       }
+
+       @Override
+       public void wakeUp() {}

Review comment:
       You are right. I agree the current solution is a good default behavior. It is tricky because we don't know whether the `Reader#readBatch()` / `Reader#read()` methods can be woken up. Maybe we can add a default wakeup method to the `Reader` interface, so implementations that can be woken up would behave more elegantly.
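   For illustration, a minimal sketch of such a default method (hypothetical `WakeableReader` name; the `BulkFormat.Reader` interface in this PR has no wakeup hook):

```java
import org.apache.flink.connector.file.src.reader.BulkFormat;

import java.io.Closeable;
import java.io.IOException;

// Hypothetical extension of the reader contract with an optional wakeup hook.
interface WakeableReader<T> extends Closeable {

    // Reads the next batch, or returns null when the split is exhausted.
    BulkFormat.RecordIterator<T> readBatch() throws IOException;

    // No-op by default; implementations whose blocking I/O can be
    // interrupted may override this so that readBatch() returns early.
    default void wakeUp() {}
}
```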




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

