RE: [C++] Arrow S3 filesystem init/finalize

Jerry Adair Sun, 01 Dec 2024 23:13:04 -0800

Hi Weston,

Thank you for the reply.

>IIRC, this is a limitation given to us by the AWS C++ SDK.  See [1].  The AWS 
>C++ SDK has static state and they do not manage it with static local 
>variables.  As a result, the initialization and finalization order is
>(IIRC) undefined (or at least not very well defined).

OK, a few questions.  When did this limitation begin in the AWS C++ SDK?  Which 
version of the Arrow library first saw this requirement to bookend the S3 
initialization and finalization?  I seem to recall reading something in a Jira 
that stated that all of this began in version 12.0.0.  Is that correct?  In 
earlier versions of the Arrow filesystem library, this initialization and 
finalization (specifically the finalization) simply didn't exist.  All that you 
needed was to initialize the S3 system.  In that context, our S3 support worked 
without issue.  Once the initialization/finalization business became "a thing", 
that broke our implementation altogether.  We upgraded from version 8.0.0 to 
version 16.0.0 to gain support for ADLS.  In so doing we broke backwards 
compatibility with respect to our S3 support, and that is never a good thing.  
This is why I am 
curious to know when this change took place within the Arrow library.  One 
strategy that we are kicking around is to downgrade to the newest version of 
the Arrow/Parquet libraries that does not have this initialization and 
finalization business, and then write a custom implementation to support ADLS.  
However, I don't know if that makes sense; it would depend on which version of 
the AWS C++ SDK was being used at the time of the given release.  If it was 
indeed version 12.0.0 when this came along and if that was the first version of 
the Arrow filesystem library in which this initialization and finalization 
business came to be, then perhaps we could revert to 11.0.0 and develop 
something else to provide ADLS support.  The hope being in that scenario that 
version 11.0.0 didn't use a version of the AWS C++ SDK that required this 
initialization/finalization business.  This is yet another reason I am asking 
these questions: we need a solution near term and it is up to me to figure 
something out.  Thus I am kicking around a number of ideas.  Difficult to make 
a decision without this background information.  So again, I appreciate any 
feedback that can be offered, thanks!

>I'm not familiar with embedded programming models.  Is there a main somewhere? 
>If so, can you pass the responsibility onto your caller (whoever has the
>main)?  Or does some kind of component-level initialization exist?

Well I might not be describing my situation correctly.  What I was/am trying to 
say is that our "overall application" if you will contains many, many separate 
components that represent products that we license to customers.  Depending on 
what the customer licenses, different "packages" are formed that comprise the 
product that they receive.  That is, our delivery to any given customer doesn't 
contain all of the components/products, far from it.  In the event that a 
customer purchases a license that *does* include the product in which my code 
resides, then what ultimately happens is that my module is dynamically loaded 
at runtime.  This is what I was trying to describe, and I was a bit hesitant to 
use the word "embedded" because that typically refers to an entirely different 
application context, one in which very small snippets of code are utilized to 
perform specific functions in a custom hardware product.  This isn't that.  No, 
in my world, I am a small, small component within a much, much larger entity.  
Thus the module in which my code lies may be loaded or it may not.  Even if it 
is loaded, it may not be invoked.  Even if it is, it will be unloaded at some 
point and the process terminated, thus the reason I stated that all resources 
would need to be freed.  To answer your questions, yes there is ultimately a 
main() somewhere, but there is no access to it from where I live, and it 
wouldn't make sense to put anything there, given that my component may not even 
exist within a given product, depending on which components the customer 
selected.  Thus I cannot pass the responsibility onto the caller, which is C 
code anyway, which means that it would need to either call into my C++ code or 
have an equivalent body of C++ code elsewhere that does the same thing, which 
I'm not certain even makes sense.  Yes, there is component-level 
initialization, which is where I placed the S3 init/finalize.  In the use cases 
that generated the error message that I listed, I'm not certain that the caller 
is freeing the resources prior to attempting to read/write from/to an S3 
server.  Thus I will verify that with them tomorrow.
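
For anyone reading along, here is a minimal sketch of what "C code calling into my C++ code" for component-level init could look like.  Everything in it is an assumption for illustration: the entry-point names (component_s3_init, component_s3_shutdown) are hypothetical, and the two stand-in functions merely take the place of the real arrow::fs::InitializeS3 / arrow::fs::FinalizeS3 calls so that the sketch compiles without Arrow.

```cpp
#include <cassert>

namespace {
// Stand-ins so the sketch builds on its own.  In real code these would
// wrap arrow::fs::InitializeS3(...) and arrow::fs::FinalizeS3() and
// translate the returned arrow::Status into a bool.
int g_init_calls = 0;
int g_finalize_calls = 0;
bool standin_initialize_s3() { ++g_init_calls; return true; }
bool standin_finalize_s3() { ++g_finalize_calls; return true; }
}  // namespace

// C-callable entry points (hypothetical names).  No C++ types and no
// exceptions cross the boundary; failure is reported as a nonzero int.
extern "C" int component_s3_init(void) {
  try {
    return standin_initialize_s3() ? 0 : 1;
  } catch (...) {
    return 2;  // never let an exception unwind into C code
  }
}

extern "C" int component_s3_shutdown(void) {
  try {
    return standin_finalize_s3() ? 0 : 1;
  } catch (...) {
    return 2;
  }
}
```

The C side would only need the two prototypes, e.g. int component_s3_init(void); in a header.  This does not by itself answer who should own the calls, it only shows that the language boundary is not the obstacle.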

>If not, then you can try and play games with static variables, but I think 
>that would violate "freeing all resources respectively".  However,
>Arrow-C++ itself has static state (e.g. CPU & I/O thread pools), so unless
>you are unloading the library, it's not clear that you will be freeing all 
>resources anyways.

Understood.  Yes, a complete unloading of the Arrow library seems sufficient to 
resolve the issue.  The trouble is that we (ergo I) must support multiple users, 
and the handles are not unique; instead, a counter is used to keep track of 
handle loading, similar to a shared smart pointer.  Thus the scenario in which 
a given user loads the handle, then unloads the handle, but the underlying 
resources aren't unloaded is a very distinct possibility.  And in such a 
scenario, the error message that I included in my original email occurs, which 
essentially states that multiple attempts were made to initialize the S3 
library, attempts that occurred after a finalization method invocation 
occurred.  But if you don't include a finalization method invocation, then the 
abort is thrown at library unloading, which results in a crash, which is 
unacceptable to our end customers.
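
To make the two failure modes concrete, here is a minimal sketch of the counting scheme described above.  It is self-contained and does not use Arrow: FakeS3 is a stand-in that only mimics the behaviour seen in the log (once finalized, initialization is refused), and S3Lifetime is a hypothetical guard, not anything the Arrow library provides.  The idea it illustrates is to initialize on first use, NOT finalize when the user count drops to zero, and finalize exactly once from whatever hook runs just before the module is unloaded.

```cpp
#include <cassert>
#include <mutex>

// Stand-in for the S3 subsystem: finalization is a one-way street.
class FakeS3 {
 public:
  bool Initialize() {
    if (finalized_) return false;  // mirrors "S3 subsystem is finalized"
    initialized_ = true;
    return true;
  }
  void Finalize() {
    initialized_ = false;
    finalized_ = true;
  }
  bool initialized() const { return initialized_; }

 private:
  bool initialized_ = false;
  bool finalized_ = false;
};

// Hypothetical guard shared by all users of the module.
class S3Lifetime {
 public:
  explicit S3Lifetime(FakeS3& s3) : s3_(s3) {}

  // Called when a user loads the handle; initializes on first use only.
  bool Acquire() {
    std::lock_guard<std::mutex> lock(mu_);
    if (shut_down_) return false;
    if (!s3_.initialized() && !s3_.Initialize()) return false;
    ++users_;
    return true;
  }

  // Called when a user unloads the handle; deliberately does not finalize.
  void Release() {
    std::lock_guard<std::mutex> lock(mu_);
    if (users_ > 0) --users_;
  }

  // Called once, from the module-unload hook; finalizes exactly once.
  void Shutdown() {
    std::lock_guard<std::mutex> lock(mu_);
    if (shut_down_) return;
    s3_.Finalize();
    shut_down_ = true;
  }

  int users() const {
    std::lock_guard<std::mutex> lock(mu_);
    return users_;
  }

 private:
  FakeS3& s3_;
  mutable std::mutex mu_;
  int users_ = 0;
  bool shut_down_ = false;
};
```

With this shape, load / unload / load by different users succeeds because nothing is finalized in between.  The open question for our environment is whether a reliable "just before unload" hook exists at all; the sketch assumes one does.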

And so this has become a messy situation for us, no doubt.  Again, just trying 
to find a solution ASAP, and wrote to the dev group with the hope that others 
who find themselves in the same situation might benefit from the discussion in 
the future as well.

Thanks, look forward to the feedback!

Jerry

-----Original Message-----
From: Weston Pace <weston.p...@gmail.com>
Sent: Monday, December 02, 2024 12:41 AM
To: dev@arrow.apache.org
Subject: Re: [C++] Arrow S3 filesystem init/finalize

> Admittedly, I would like to know why this is being done in this
> fashion, but that is tangential to my issue.

IIRC, this is a limitation given to us by the AWS C++ SDK.  See [1].  The AWS 
C++ SDK has static state and they do not manage it with static local variables.  
As a result, the initialization and finalization order is
(IIRC) undefined (or at least not very well defined).

> Now for my question: this is all fine and well in the context of
> developing your own stand-alone program and such.
> However, what happens when you live in an embedded world in which your
> code lies many layers below main() and
> you don't have access to main(), even if you wanted to follow the
> prescribed pattern?  I mean, we are expected to wind
> up and then down in an on-demand fashion, allocating and then freeing
> all resources respectively.  I pulled the init/finalize
> out to the outermost layer that I have any involvement with, yet I see
> the following error messages

I'm not familiar with embedded programming models.  Is there a main somewhere?  
If so, can you pass the responsibility onto your caller (whoever has the 
main)?  Or does some kind of component-level initialization exist?

If not, then you can try and play games with static variables, but I think that 
would violate "freeing all resources respectively".  However,
Arrow-C++ itself has static state (e.g. CPU & I/O thread pools), so unless
you are unloading the library, it's not clear that you will be freeing all 
resources anyways.
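
As a rough illustration of the "games with static variables" idea, here is a minimal sketch using a function-local static.  It is self-contained and does not call Arrow: the two stand-in functions take the place of the real initialize/finalize calls, and S3Holder and EnsureS3ForProcess are hypothetical names.  Whether finalizing this late (during static destruction or library unload) is actually safe with the real AWS SDK is not established here, so treat it as a shape to experiment with, not a recommendation.

```cpp
#include <cassert>

namespace {
// Stand-ins for the real initialize/finalize calls.
int g_init_calls = 0;
int g_finalize_calls = 0;
void standin_initialize() { ++g_init_calls; }
void standin_finalize() { ++g_finalize_calls; }

// RAII holder: the constructor initializes, the destructor finalizes.
struct S3Holder {
  S3Holder() { standin_initialize(); }
  ~S3Holder() { standin_finalize(); }
  S3Holder(const S3Holder&) = delete;
  S3Holder& operator=(const S3Holder&) = delete;
};
}  // namespace

// Function-local static: constructed on the first call (thread-safe since
// C++11) and destroyed during static destruction, i.e. at process exit or
// when the shared library that contains this function is unloaded.
void EnsureS3ForProcess() {
  static S3Holder holder;
  (void)holder;
}
```

Every code path that needs S3 would call EnsureS3ForProcess() first; repeated calls are cheap and only the first one initializes.  Nothing is released until the library goes away, which is the sense in which this would not be "freeing all resources respectively".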

[1]
https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/basic-use.html

On Sun, Dec 1, 2024 at 1:02 PM Jerry Adair <jerry.ad...@sas.com.invalid>
wrote:

> Hi,
>
> I have a question regarding the initialization/finalization of the S3
> filesystem within the Arrow filesystem library.  Apologies if this
> question has been raised in the past; I did perform a search but that
> search didn't turn up anything.  I did read the thread that discussed
> the issue of init/finalize, though nothing I found made it clear when
> the addition of the finalize method surfaced.  I thought I read
> mention that it occurred around version 12.0.0, but not certain.
> That's just a side note really, I am curious to know when it came
> about, because we had been using an old version of the libraries (8.0.0) and 
> it didn't exist within that version.
> But I digress.
>
> So my issue and the question I have surrounds this notion of timing.
> The aforementioned thread that I read made it clear that the
> init/finalize should take place at the beginning and the end of main():
>
>
> // Snipped for brevity reasons
> int main()
> {
>     // More snipping
>     arrow::Status initializeStatus = arrow::fs::InitializeS3( globalOptions );
>     ...
>     arrow::Status finalizeStatus = arrow::fs::FinalizeS3();
> } /* end of your main() entry point */
>
>
> The thread also made it clear that this bookended init/finalize should
> not occur within a class definition, most likely in the
> constructor/destructor respectively.
>
> So OK.  While I am not familiar with the reason that this structure
> became "a thing" within the Arrow filesystem library, it is indeed that way 
> now.
> Admittedly, I would like to know why this is being done in this
> fashion, but that is tangential to my issue.  Now for my question:
> this is all fine and well in the context of developing your own
> stand-alone program and such.  However, what happens when you live in
> an embedded world in which your code lies many layers below main() and
> you don't have access to main(), even if you wanted to follow the
> prescribed pattern?  I mean, we are expected to wind up and then down
> in an on-demand fashion, allocating and then freeing all resources
> respectively.  I pulled the init/finalize out to the outermost layer
> that I have any involvement with, yet I see the following error messages:
>
>
> 2024-11-26T04:55:10,917 DEBUG [00000007] () App.parquet - Could not
> create a AWS filesystem object
> 2024-11-26T04:55:10,917 DEBUG [00000007] () App.parquet -
> parquetFileReader): Exception exit, reason = Unable to create a file
> system object on AWS server: Invalid: S3 subsystem is finalized
>
>
> This occurs because the first spool-up/spool-down worked successfully,
> but then when we are called sometime thereafter, the finalize method
> has already done its thing, thus we can't initialize again.
> Obviously, I know why this is occurring, that is straightforward, I
> don't need an explanation for that.  The question is what can I do
> about this in my environment where no access to main() is available and we 
> must exist/not-exist on-demand?
> Surely I am not the only one in this development scenario who has been
> faced with this issue.  So what is the solution here?  Anyone else
> faced this?  Help?
>
> Thanks,
> Jerry
>
