I have come across a few similar issues while (mis)using the HiveCatalog.
My analysis was that the HiveCatalog owns the hive client connection pool
but shares it with the underlying TableOperations. Depending on the
application, the HiveCatalog can be closed (or its finalize method run
after it is GC'ed) while the underlying TableOperations are still in use.
The next time a TableOp tries to talk to hive, it finds it has already
been disconnected.

I have been experimenting with a more cautious way of tracking resource
ownership by removing the finalize method and using a ReferenceQueue, but
it looks like a rather disruptive change. On the other hand, it fixes a
lot of issues like this one and (marginally) improves GC performance. I
will report back here when I post a candidate PR and we can discuss its
value.
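The ReferenceQueue idea can be sketched roughly as follows. This is a
minimal illustration under assumptions, not the actual patch: PoolTracker,
track, and drain are hypothetical names, and the AutoCloseable stands in
for Iceberg's internal hive client pool.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: register each owner's client pool against a
// PhantomReference to the owner, so the pool can be closed after the owner
// becomes unreachable, without relying on finalize().
class PoolTracker {
  private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
  private final Map<Reference<?>, AutoCloseable> pools = new ConcurrentHashMap<>();

  // Track 'pool' as a resource owned by 'owner' (e.g. a catalog instance).
  void track(Object owner, AutoCloseable pool) {
    pools.put(new PhantomReference<>(owner, queue), pool);
  }

  // Call periodically (or from a daemon thread): close the pools whose
  // owners have been garbage collected and whose references were enqueued.
  void drain() throws Exception {
    Reference<?> ref;
    while ((ref = queue.poll()) != null) {
      AutoCloseable pool = pools.remove(ref);
      if (pool != null) {
        pool.close();
      }
    }
  }
}
```

In this scheme the HiveCatalog would play the role of the owner and its
client pool the tracked resource; the pool is closed only after the catalog
becomes unreachable, rather than at finalize time.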

Best

Ryan Murray

On Fri, Jan 8, 2021 at 7:50 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> It could be that there are two separate flaky test issues with not
> releasing connections in Flink and Spark. I don't think that the
> HiveCatalog code has been changed much recently, which would point toward
> problems elsewhere.
>
> I think one good reason to use HiveCatalog is to catch problems like
> these, where the catalog should be closed but isn't.
>
> On Fri, Jan 8, 2021 at 9:23 AM Steven Wu <stevenz...@gmail.com> wrote:
>
>> I use the try-with-resources pattern in the FLIP-27 dev branch. I saw this
>> problem in Flink tests with the master branch too (although less often).
>> With the FLIP-27 dev branch and an additional DeleteReadTests, it happened
>> almost 100% of the time.
>>
>> The Spark module (in the master branch) also has this flaky test problem.
>> As Ryan mentioned earlier, maybe some common code (or pattern) causes the
>> issue. It probably became flakier recently because more tests have been
>> added.
>>
>> Regardless, I still have the question of whether most unit tests should
>> use HiveCatalog. Why not use the cheaper HadoopCatalog except for tests
>> specifically targeting HiveCatalog? Could that help speed up the tests?
>>
>> On Fri, Jan 8, 2021 at 12:06 AM OpenInx <open...@gmail.com> wrote:
>>
>>> OK, there is already a try-with-resources block that closes the
>>> TableLoader in FlinkInputFormat [1], so we don't need the extra
>>> try-with-resources in PR 2051 (I will close that).
>>>
>>> On my machine, I could not reproduce your connection leak issue when
>>> running TestFlinkInputFormatReaderDeletes. Do you have any extra usage of
>>> the table loader that you forgot to close in your FLIP-27 dev branch?
>>>
>>> [1].
>>> https://github.com/apache/iceberg/blob/7645ceba65044184be192a7194a38729133b2e50/flink/src/main/java/org/apache/iceberg/flink/source/FlinkInputFormat.java#L77
>>>
>>> On Fri, Jan 8, 2021 at 3:36 PM OpenInx <open...@gmail.com> wrote:
>>>
>>>> > I was able to almost 100% reproduce the HiveMetaStoreClient aborted
>>>> connection problem locally with Flink tests after adding
>>>> another DeleteReadTests for the new FLIP-27 source impl in my dev branch
>>>>
>>>> I think I found why it fails so easily. TestFlinkInputFormatReaderDeletes
>>>> creates a new CatalogLoader [1] for loading the table inside the
>>>> FlinkInputFormat.
>>>>
>>>> TestHelpers.readRowData(inputFormat, rowType).forEach(rowData -> {
>>>>   RowDataWrapper wrapper = new RowDataWrapper(rowType, projected.asStruct());
>>>>   set.add(wrapper.wrap(rowData));
>>>> });
>>>>
>>>> When TestHelpers#readRowData runs, it opens a new catalog (which means
>>>> opening a new hive connection). But after the read processing finishes,
>>>> we never close the TableLoader, which leaks the catalog connection. I
>>>> opened a PR [2] to fix this issue; does it work in your branch?
>>>>
>>>> I think it's worth keeping those hive catalog unit tests so that we can
>>>> detect connection leak issues like this early.
>>>>
>>>> [1].
>>>> https://github.com/apache/iceberg/blob/4436c92928f4b3b90839a26bf6a656902733261f/flink/src/test/java/org/apache/iceberg/flink/source/TestFlinkInputFormatReaderDeletes.java#L114
>>>> [2]. https://github.com/apache/iceberg/pull/2051/files
>>>>
>>>> On Fri, Jan 8, 2021 at 5:48 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>
>>>>> Ryan/OpenInx, thanks a lot for the pointers.
>>>>>
>>>>> I was able to reproduce the HiveMetaStoreClient aborted-connection
>>>>> problem locally almost 100% of the time with Flink tests after adding
>>>>> another DeleteReadTests for the new FLIP-27 source impl in my dev
>>>>> branch. I don't see the problem anymore after switching the Flink
>>>>> DeleteReadTests from the HiveCatalog (which requires the expensive
>>>>> TestHiveMetastore) to the HadoopCatalog.
>>>>>
>>>>> There is still a base test class, FlinkTestBase, that uses the
>>>>> HiveCatalog. I am wondering whether there is value in using the more
>>>>> expensive HiveCatalog rather than the HadoopCatalog.
>>>>>
>>>>> On Wed, Jan 6, 2021 at 6:22 PM OpenInx <open...@gmail.com> wrote:
>>>>>
>>>>>> I encountered a similar issue when adding hive-site.xml support for the
>>>>>> Flink hive catalog. Here is the earlier discussion and solution:
>>>>>> https://github.com/apache/iceberg/pull/1586#discussion_r509453461
>>>>>>
>>>>>> It's a connection leak issue.
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 7, 2021 at 10:06 AM Ryan Blue <rb...@netflix.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> I've noticed this too. I haven't had a chance to track down what's
>>>>>>> causing it yet. I've seen it in Spark tests, so it looks like there
>>>>>>> may be a problem that affects both. Probably a connection leak in the
>>>>>>> common code.
>>>>>>>
>>>>>>> On Wed, Jan 6, 2021 at 3:44 PM Steven Wu <stevenz...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I have noticed some flakiness with Flink and Spark tests, both
>>>>>>>> locally and in CI checks. @zhangjun0x01 also reported the same
>>>>>>>> problem with iceberg-spark3-extensions. Below is a full stack trace
>>>>>>>> from a local run of the Flink tests.
>>>>>>>>
>>>>>>>> The flakiness might be a recent regression, as the tests were stable
>>>>>>>> for me until recently. Was there a recent Hive dependency change?
>>>>>>>> Does anyone have any ideas?
>>>>>>>>
>>>>>>>> org.apache.iceberg.flink.source.TestIcebergSourceReaderDeletes >
>>>>>>>> testMixedPositionAndEqualityDeletes[fileFormat=ORC] FAILED
>>>>>>>>
>>>>>>>>     java.lang.RuntimeException: Failed to get table info from
>>>>>>>> metastore default.test
>>>>>>>>
>>>>>>>>         at
>>>>>>>> org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:142)
>>>>>>>>
>>>>>>>>         at
>>>>>>>> org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:86)
>>>>>>>>
>>>>>>>>         at
>>>>>>>> org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:69)
>>>>>>>>
>>>>>>>>         at
>>>>>>>> org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:92)
>>>>>>>>
>>>>>>>>         at
>>>>>>>> org.apache.iceberg.flink.TableLoader$CatalogTableLoader.loadTable(TableLoader.java:113)
>>>>>>>>
>>>>>>>>         at
>>>>>>>> org.apache.iceberg.flink.source.TestIcebergSourceReaderDeletes.rowSet(TestIcebergSourceReaderDeletes.java:90)
>>>>>>>>
>>>>>>>>
>>>>>>>>         Caused by:
>>>>>>>>
>>>>>>>>         org.apache.thrift.transport.TTransportException:
>>>>>>>> java.net.SocketException: Broken pipe (Write failed)
>>>>>>>>
>>>>>>>>             at
>>>>>>>> org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:161)
>>>>>>>>
>>>>>>>>             at
>>>>>>>> org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:73)
>>>>>>>>
>>>>>>>>             at
>>>>>>>> org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:62)
>>>>>>>>
>>>>>>>>             at
>>>>>>>> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.send_get_table_req(ThriftHiveMetastore.java:1561)
>>>>>>>>
>>>>>>>>             at
>>>>>>>> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table_req(ThriftHiveMetastore.java:1553)
>>>>>>>>
>>>>>>>>             at
>>>>>>>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1350)
>>>>>>>>
>>>>>>>>             at
>>>>>>>> org.apache.iceberg.hive.HiveTableOperations.lambda$doRefresh$0(HiveTableOperations.java:130)
>>>>>>>>
>>>>>>>>             at
>>>>>>>> org.apache.iceberg.hive.ClientPool.run(ClientPool.java:65)
>>>>>>>>
>>>>>>>>             at
>>>>>>>> org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:130)
>>>>>>>>
>>>>>>>>             ... 5 more
>>>>>>>>
>>>>>>>>
>>>>>>>>             Caused by:
>>>>>>>>
>>>>>>>>             java.net.SocketException: Broken pipe (Write failed)
>>>>>>>>
>>>>>>>>                 at java.net.SocketOutputStream.socketWrite0(Native
>>>>>>>> Method)
>>>>>>>>
>>>>>>>>                 at
>>>>>>>> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
>>>>>>>>
>>>>>>>>                 at
>>>>>>>> java.net.SocketOutputStream.write(SocketOutputStream.java:155)
>>>>>>>>
>>>>>>>>                 at
>>>>>>>> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>>>>>>>>
>>>>>>>>                 at
>>>>>>>> java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>>>>>>>>
>>>>>>>>                 at
>>>>>>>> org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:159)
>>>>>>>>
>>>>>>>>                 ... 13 more
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
