Hi Weston,

Thank you for the reply.

> IIRC, this is a limitation given to us by the AWS C++ SDK. See [1]. The AWS C++ SDK has static state and they do not manage it with static local variables. As a result, the initialization and finalization order is (IIRC) undefined (or at least not very well defined).

OK, a few questions. When did this limitation begin in the AWS C++ SDK? Which version of the Arrow library first required bookending the S3 initialization and finalization? I seem to recall reading in a Jira that all of this began in version 12.0.0. Is that correct?

In earlier versions of the Arrow filesystem library, this initialization and finalization (specifically the finalization) simply didn't exist. All that you needed was to initialize the S3 system. In that context, our S3 support worked without issue. Once the initialization/finalization business became "a thing", that broke our implementation altogether. We upgraded from version 8.0.0 to version 16.0.0 to gain support for ADLS. In so doing we broke backward compatibility with respect to our S3 support, and that is never a good thing. This is why I am curious to know when this change took place within the Arrow library.

One strategy that we are kicking around is to downgrade to the newest version of the Arrow/Parquet libraries that does not have this initialization and finalization business, then write a custom implementation to support ADLS. However, I don't know if that makes sense; it would depend on which version of the AWS C++ SDK was being used at the time of the given release. If it was indeed version 12.0.0 when this came along, and if that was the first version of the Arrow filesystem library in which this initialization and finalization business came to be, then perhaps we could revert to 11.0.0 and develop something else to provide ADLS support. The hope in that scenario is that version 11.0.0 didn't use a version of the AWS C++ SDK that required this initialization/finalization business.
This is yet another reason I am asking these questions: we need a solution near term and it is up to me to figure something out. Thus I am kicking around a number of ideas. It is difficult to make a decision without this background information. So again, I appreciate any feedback that can be offered, thanks!

> I'm not familiar with embedded programming models. Is there a main somewhere? If so, can you pass the responsibility onto your caller (whoever has the main)? Or does some kind of component-level initialization exist?

Well, I might not be describing my situation correctly. What I was/am trying to say is that our "overall application", if you will, contains many, many separate components that represent products that we license to customers. Depending on what the customer licenses, different "packages" are formed that comprise the product that they receive. That is, our delivery to any given customer doesn't contain all of the components/products, far from it. In the event that a customer purchases a license that *does* include the product in which my code resides, what ultimately happens is that my module is dynamically loaded at runtime.

This is what I was trying to describe, and I was a bit hesitant to use the word "embedded" because that typically refers to an entirely different application context, one in which very small snippets of code are utilized to perform specific functions in a custom hardware product. This isn't that. No, in my world, I am a small, small component within a much, much larger entity. Thus the module in which my code lies may be loaded or it may not. Even if it is loaded, it may not be invoked. Even if it is, it will be unloaded at some point and the process terminated, which is why I stated that all resources would need to be freed.

To answer your questions: yes, there is ultimately a main() somewhere, but there is no access to it from where I live, and it wouldn't make sense to put anything there, given that my component may not even exist within a given product, depending on which components the customer selected. Thus I cannot pass the responsibility onto the caller, which is C code anyway. That means it would need to either call into my C++ code or have an equivalent body of C++ code elsewhere that does the same thing, which I'm not certain even makes sense.

Yes, there is component-level initialization, which is where I placed the S3 init/finalize. In the use cases that generated the error message that I listed, I'm not certain that the caller is freeing the resources prior to attempting to read/write from/to an S3 server. I will verify that with them tomorrow.

> If not, then you can try and play games with static variables, but I think that would violate "freeing all resources respectively". However, Arrow-C++ itself has static state (e.g. CPU & I/O thread pools), so unless you are unloading the library, it's not clear that you will be freeing all resources anyway.

Understood. Yes, a complete unloading of the Arrow library seems sufficient to resolve the issue. The trouble is that we (ergo I) must support multi-user, and the handles are not unique; instead, a counter is used to keep track of handle loading, similar to a shared smart pointer. Thus the scenario in which a given user loads the handle, then unloads the handle, but the underlying resources aren't unloaded is a very distinct possibility. In such a scenario, the error message that I included in my original email occurs, which essentially states that an attempt was made to initialize the S3 library after a finalization method invocation had already occurred.
But if you don't include a finalization method invocation, then an abort occurs at library unloading, which results in a crash, which is unacceptable to our end customers. And so this has become a messy situation for us, no doubt.

Again, I am just trying to find a solution ASAP, and wrote to the dev group with the hope that others who find themselves in the same situation might benefit from the discussion in the future as well.

Thanks, look forward to the feedback!
Jerry

-----Original Message-----
From: Weston Pace <weston.p...@gmail.com>
Sent: Monday, December 02, 2024 12:41 AM
To: dev@arrow.apache.org
Subject: Re: [C++] Arrow S3 filesystem init/finalize

> Admittedly, I would like to know why this is being done in this fashion, but that is tangential to my issue.

IIRC, this is a limitation given to us by the AWS C++ SDK. See [1]. The AWS C++ SDK has static state and they do not manage it with static local variables. As a result, the initialization and finalization order is (IIRC) undefined (or at least not very well defined).

> Now for my question: this is all fine and well in the context of developing your own stand-alone program and such.
> However, what happens when you live in an embedded world in which your code lies many layers below main() and you don't have access to main(), even if you wanted to follow the prescribed pattern? I mean, we are expected to wind up and then down in an on-demand fashion, allocating and then freeing all resources respectively. I pulled the init/finalize out to the outermost layer that I have any involvement with, yet I see the following error messages

I'm not familiar with embedded programming models. Is there a main somewhere? If so, can you pass the responsibility onto your caller (whoever has the main)? Or does some kind of component-level initialization exist?

If not, then you can try and play games with static variables, but I think that would violate "freeing all resources respectively".
However, Arrow-C++ itself has static state (e.g. CPU & I/O thread pools), so unless you are unloading the library, it's not clear that you will be freeing all resources anyway.

[1] https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/basic-use.html

On Sun, Dec 1, 2024 at 1:02 PM Jerry Adair <jerry.ad...@sas.com.invalid> wrote:

> Hi,
>
> I have a question regarding the initialization/finalization of the S3 filesystem within the Arrow filesystem library. Apologies if this question has been raised in the past; I did perform a search but that search didn't turn up anything. I did read the thread that discussed the issue of init/finalize, though nothing I found made it clear when the addition of the finalize method surfaced. I thought I read mention that it occurred around version 12.0.0, but I am not certain. That's just a side note really; I am curious to know when it came about, because we had been using an old version of the libraries (8.0.0) and it didn't exist within that version. But I digress.
>
> So my issue and the question I have surrounds this notion of timing. The aforementioned thread that I read made it clear that the init/finalize should take place at the beginning and the end of main():
>
> // Snipped for brevity reasons
> int main()
> {
>     // More snipping
>     arrow::Status initializeStatus = arrow::fs::InitializeS3( globalOptions );
>     ...
>     arrow::Status finalizeStatus = arrow::fs::FinalizeS3();
> } /* end of your main() entry point*/
>
> The thread also made it clear that this bookended init/finalize should not occur within a class definition, most likely in the constructor/destructor respectively.
>
> So OK.
> While I am not familiar with the reason that this structure became "a thing" within the Arrow filesystem library, it is indeed that way now. Admittedly, I would like to know why this is being done in this fashion, but that is tangential to my issue.
>
> Now for my question: this is all fine and well in the context of developing your own stand-alone program and such. However, what happens when you live in an embedded world in which your code lies many layers below main() and you don't have access to main(), even if you wanted to follow the prescribed pattern? I mean, we are expected to wind up and then down in an on-demand fashion, allocating and then freeing all resources respectively. I pulled the init/finalize out to the outermost layer that I have any involvement with, yet I see the following error messages:
>
> 2024-11-26T04:55:10,917 DEBUG [00000007] () App.parquet - Could not create a AWS filesystem object
> 2024-11-26T04:55:10,917 DEBUG [00000007] () App.parquet - parquetFileReader): Exception exit, reason = Unable to create a file system object on AWS server: Invalid: S3 subsystem is finalized
>
> This occurs because the first spool-up/spool-down worked successfully, but then when we are called sometime thereafter, the finalize method has already done its thing, thus we can't initialize again. Obviously, I know why this is occurring; that is straightforward, and I don't need an explanation for that. The question is what can I do about this in my environment, where no access to main() is available and we must exist/not-exist on-demand?
>
> Surely I am not the only one in this development scenario who has been faced with this issue. So what is the solution here? Anyone else faced this? Help?
>
> Thanks,
> Jerry
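
For readers who land on this thread with the same problem, here is a minimal sketch of one way the reference-counting scenario described above could be handled: initialize on first use, let per-user teardown only decrement a counter, and finalize exactly once just before the library is really unloaded. Everything in it is an assumption for illustration. OneShotSubsystem is a hypothetical stand-in that only mimics the "cannot initialize after finalize" behaviour reported in the log above; it is not the Arrow API. SubsystemGuard and its method names are likewise invented, and the sketch assumes the host offers some hook that runs once before the real unload. Whether that holds in a given product is something only the host application can answer.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical stand-in for a subsystem that refuses to initialize again
// once it has been finalized. In real code the two calls named in this
// thread (arrow::fs::InitializeS3 / arrow::fs::FinalizeS3) would sit here.
class OneShotSubsystem {
 public:
  // Returns an empty string on success, an error message otherwise.
  std::string Initialize() {
    if (finalized_) return "Invalid: subsystem is finalized";
    initialized_ = true;
    return "";
  }
  void Finalize() {
    initialized_ = false;
    finalized_ = true;
  }
  bool initialized() const { return initialized_; }
  bool finalized() const { return finalized_; }

 private:
  bool initialized_ = false;
  bool finalized_ = false;
};

// Reference-counted front end for component-level init/teardown.
class SubsystemGuard {
 public:
  explicit SubsystemGuard(OneShotSubsystem* subsystem)
      : subsystem_(subsystem) {}

  // Called on every component-level initialization.
  // Initializes the subsystem only if it is not already up.
  std::string Acquire() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!subsystem_->initialized()) {
      std::string err = subsystem_->Initialize();
      if (!err.empty()) return err;
    }
    ++users_;
    return "";
  }

  // Called on every component-level teardown.
  // Deliberately does NOT finalize, so a later Acquire() still works.
  void Release() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (users_ > 0) --users_;
  }

  // Called once, from the (assumed) hook that runs before the real unload.
  void Shutdown() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (subsystem_->initialized()) subsystem_->Finalize();
    users_ = 0;
  }

  int users() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return users_;
  }

 private:
  OneShotSubsystem* subsystem_;
  mutable std::mutex mutex_;
  int users_ = 0;
};

// Two load/unload cycles followed by a single shutdown.
// Returns true if the second cycle could still acquire the subsystem.
inline bool SecondCycleSucceeds() {
  OneShotSubsystem subsystem;
  SubsystemGuard guard(&subsystem);
  assert(guard.Acquire().empty());  // first user loads the handle
  guard.Release();                  // first user unloads; no finalize yet
  bool ok = guard.Acquire().empty();
  guard.Release();
  guard.Shutdown();                 // finalize once, at real unload
  return ok;
}
```

The design choice here is to decouple "a user is done" from "the subsystem is finalized", which avoids the re-initialization error while still finalizing before unload. It does not answer the version questions raised in the thread, and it only helps if a reliable last-unload hook exists.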