Volodymyr Burenin created HUDI-1532:
---------------------------------------

             Summary: Super slow magic sequence search within the log files on 
GCS 
                 Key: HUDI-1532
                 URL: https://issues.apache.org/jira/browse/HUDI-1532
             Project: Apache Hudi
          Issue Type: Bug
          Components: DeltaStreamer
            Reporter: Volodymyr Burenin


HudiDeltaStreamer freezes for a very long time(days) when scanning through the 
log file looking for the magic sequence. Java stacktrace points to this 
location:

```
"Executor task launch worker for task 183" #233 daemon prio=5 os_prio=0 
tid=0x00005629fb650000 nid=0xff runnable [0x00007f378a433000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
        at sun.security.ssl.InputRecord.read(InputRecord.java:503)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
        - locked <0x00000007ba998608> (a java.lang.Object)
        at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
        at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
        - locked <0x00000007ba9995c0> (a sun.security.ssl.AppInputStream)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        - locked <0x0000000779e6f360> (a java.io.BufferedInputStream)
        at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
        at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
        - locked <0x0000000779e69ba0> (a 
sun.net.www.protocol.https.DelegateHttpsURLConnection)
        at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
        - locked <0x0000000779e69ba0> (a 
sun.net.www.protocol.https.DelegateHttpsURLConnection)
        at 
java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
        at 
sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:347)
        at 
com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:36)
        at 
com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:144)
        at 
com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:79)
        at 
com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:995)
        at 
com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:549)
        at 
com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:482)
        at 
com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeMedia(AbstractGoogleClientRequest.java:510)
        at 
com.google.cloud.hadoop.repackaged.gcs.com.google.api.services.storage.Storage$Objects$Get.executeMedia(Storage.java:6981)
        at 
com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.openStream(GoogleCloudStorageReadChannel.java:967)
        at 
com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.openContentChannel(GoogleCloudStorageReadChannel.java:772)
        at 
com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.performLazySeek(GoogleCloudStorageReadChannel.java:763)
        at 
com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.read(GoogleCloudStorageReadChannel.java:365)
        at 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.read(GoogleHadoopFSInputStream.java:131)
        - locked <0x0000000616319fb8> (a 
com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at 
org.apache.hudi.common.table.log.HoodieLogFileReader.hasNextMagic(HoodieLogFileReader.java:339)
        at 
org.apache.hudi.common.table.log.HoodieLogFileReader.scanForNextAvailableBlockOffset(HoodieLogFileReader.java:280)
        at 
org.apache.hudi.common.table.log.HoodieLogFileReader.createCorruptBlock(HoodieLogFileReader.java:221)
        at 
org.apache.hudi.common.table.log.HoodieLogFileReader.readBlock(HoodieLogFileReader.java:147)
        at 
org.apache.hudi.common.table.log.HoodieLogFileReader.next(HoodieLogFileReader.java:347)
```

After deeper research I discovered it happens due to the unbuffered access to 
the GCS as well as due to the inefficient algorithm that searches for the magic 
sequence technically making request to GCS to read next 6 bytes, then offsets 
+1 and tries again. With 50-60ms latency and 5-6MB file sizes it may take 
forever.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to