On Tue, Jul 7, 2020 at 10:53 AM Austin Cawley-Edwards <austin.caw...@gmail.com> wrote:
> Hey Xiaolong,
>
> Thanks for the suggestions. Just to make sure I understand, are you saying
> to run the download and decompression in the Job Manager before executing
> the job?
>
> I think another way to ensure the tar file is not downloaded more than
> once is a source w/ parallelism 1. The issue I can't get past is, after
> decompressing the tarball, how would I pass those OutputStreams for each
> entry through Flink?
>
> Best,
> Austin
>
>
> On Tue, Jul 7, 2020 at 5:56 AM Xiaolong Wang <xiaolong.w...@smartnews.com>
> wrote:
>
>> It seems to me that this cannot be done in Flink, because the code will
>> run across all task managers. That way, there would be multiple downloads
>> of your tar file, which is unnecessary.
>>
>> However, you can do it in your own code before initializing the Flink
>> runtime; that code will run only on the client side.
>>
>> On Tue, Jul 7, 2020 at 7:31 AM Austin Cawley-Edwards <
>> austin.caw...@gmail.com> wrote:
>>
>>> Hey all,
>>>
>>> I need to ingest a tar file containing ~1GB of data in around 10 CSVs.
>>> The data is fairly connected and needs some cleaning, which I'd like to
>>> do with the Batch Table API + SQL (which I have never used before). I've
>>> got a small prototype loading the uncompressed CSVs and applying the
>>> necessary SQL, which works well.
>>>
>>> I'm wondering about the task of downloading the tar file and unzipping
>>> it into the CSVs. Does this sound like something I can/should do in
>>> Flink, or should I set up another process to download, unzip, and store
>>> the files in a filesystem to then read with the Flink Batch job? My
>>> research is leading me towards doing it separately, but I'd like to do
>>> it all in the same job if there's a creative way.
>>>
>>> Thanks!
>>> Austin
>>>
>>
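For what it's worth, here is a minimal sketch of the "source with parallelism 1" idea: a single-parallelism SourceFunction that streams the tarball over HTTP, unpacks it with Apache Commons Compress, and emits each CSV line as a (fileName, line) record for downstream cleaning, rather than trying to pass OutputStreams around. The URL, the gzip compression, the record shape, and the use of the DataStream API (instead of the Batch Table API mentioned above) are all illustrative assumptions, not code from this thread:

// Sketch only: one task downloads and untars the archive, so the file is
// fetched exactly once; downstream operators see (fileName, csvLine) tuples.
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class TarCsvSource implements SourceFunction<Tuple2<String, String>> {

    private final String tarUrl;  // assumed, e.g. "https://example.com/data.tar.gz"
    private volatile boolean running = true;

    public TarCsvSource(String tarUrl) {
        this.tarUrl = tarUrl;
    }

    @Override
    public void run(SourceContext<Tuple2<String, String>> ctx) throws Exception {
        // Stream the archive directly; nothing is written to local disk.
        // Assumes a gzip-compressed tarball; drop GZIPInputStream for a plain .tar.
        try (TarArchiveInputStream tar = new TarArchiveInputStream(
                new GZIPInputStream(new URL(tarUrl).openStream()))) {
            TarArchiveEntry entry;
            while (running && (entry = tar.getNextTarEntry()) != null) {
                if (!entry.isFile() || !entry.getName().endsWith(".csv")) {
                    continue;
                }
                // TarArchiveInputStream signals EOF at the end of the current
                // entry, so reading stops at each entry boundary.
                BufferedReader reader =
                        new BufferedReader(new InputStreamReader(tar, StandardCharsets.UTF_8));
                String line;
                while (running && (line = reader.readLine()) != null) {
                    ctx.collect(Tuple2.of(entry.getName(), line));
                }
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(new TarCsvSource("https://example.com/data.tar.gz"))
                .setParallelism(1)   // ensures the tarball is downloaded only once
                .print();
        env.execute("tar-csv-ingest sketch");
    }
}

Whether something like this or Xiaolong's client-side download-and-unpack is the better fit probably depends on how self-contained the job needs to be; the sketch just illustrates that the decompression can live entirely inside one source task instead of being handed between operators as streams.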