On Tue, Jul 7, 2020 at 10:53 AM Austin Cawley-Edwards <austin.caw...@gmail.com> wrote:
> Hey Xiaolong,
>
> Thanks for the suggestions. Just to make sure I understand, are you saying
> to run the download and decompression in the Job Manager before executing
> the job?
>
> I think another way to ensure the tar file is not downloaded more than
> once is a source w/ parallelism 1. The issue I can't get past is, after
> decompressing the tarball, how would I pass those OutputStreams for each
> entry through Flink?
>
> Best,
> Austin
>
>
> On Tue, Jul 7, 2020 at 5:56 AM Xiaolong Wang <xiaolong.w...@smartnews.com>
> wrote:
>
>> It seems to me that this cannot be done in Flink, because the code will
>> run across all task managers. That way, there would be multiple downloads
>> of your tar file, which is unnecessary.
>>
>> However, you can do it in your own code before initializing the Flink
>> runtime; that code will run only on the client side.
>>
>> On Tue, Jul 7, 2020 at 7:31 AM Austin Cawley-Edwards <
>> austin.caw...@gmail.com> wrote:
>>
>>> Hey all,
>>>
>>> I need to ingest a tar file containing ~1GB of data in around 10 CSVs.
>>> The data is fairly connected and needs some cleaning, which I'd like to
>>> do with the Batch Table API + SQL (which I have never used before). I've
>>> got a small prototype loading the uncompressed CSVs and applying the
>>> necessary SQL, which works well.
>>>
>>> I'm wondering about the task of downloading the tar file and unzipping
>>> it into the CSVs. Does this sound like something I can/should do in
>>> Flink, or should I set up another process to download, unzip, and store
>>> the files in a filesystem to then read with the Flink Batch job? My
>>> research is leading me towards doing it separately, but I'd like to do
>>> it all in the same job if there's a creative way.
>>>
>>> Thanks!
>>> Austin
>>>
>>
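For what it's worth, here is a minimal sketch of the "source with parallelism 1" idea: a single-parallelism SourceFunction that streams the tarball over HTTP, unpacks it with Apache Commons Compress, and emits each CSV line as a (fileName, line) record for downstream cleaning, rather than trying to pass OutputStreams around. The URL, the gzip compression, the record shape, and the use of the DataStream API (instead of the Batch Table API mentioned above) are all illustrative assumptions, not code from this thread:

// Sketch only: one task downloads and untars the archive, so the file is
// fetched exactly once; downstream operators see (fileName, csvLine) tuples.
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class TarCsvSource implements SourceFunction<Tuple2<String, String>> {

    private final String tarUrl;  // assumed, e.g. "https://example.com/data.tar.gz"
    private volatile boolean running = true;

    public TarCsvSource(String tarUrl) {
        this.tarUrl = tarUrl;
    }

    @Override
    public void run(SourceContext<Tuple2<String, String>> ctx) throws Exception {
        // Stream the archive directly; nothing is written to local disk.
        // Assumes a gzip-compressed tarball; drop GZIPInputStream for a plain .tar.
        try (TarArchiveInputStream tar = new TarArchiveInputStream(
                new GZIPInputStream(new URL(tarUrl).openStream()))) {
            TarArchiveEntry entry;
            while (running && (entry = tar.getNextTarEntry()) != null) {
                if (!entry.isFile() || !entry.getName().endsWith(".csv")) {
                    continue;
                }
                // TarArchiveInputStream signals EOF at the end of the current
                // entry, so reading stops at each entry boundary.
                BufferedReader reader =
                        new BufferedReader(new InputStreamReader(tar, StandardCharsets.UTF_8));
                String line;
                while (running && (line = reader.readLine()) != null) {
                    ctx.collect(Tuple2.of(entry.getName(), line));
                }
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(new TarCsvSource("https://example.com/data.tar.gz"))
                .setParallelism(1)   // ensures the tarball is downloaded only once
                .print();
        env.execute("tar-csv-ingest sketch");
    }
}

Whether something like this or Xiaolong's client-side download-and-unpack is the better fit probably depends on how self-contained the job needs to be; the sketch just illustrates that the decompression can live entirely inside one source task instead of being handed between operators as streams.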