(resending)

On Wed, Jul 25, 2018, 8:34 AM Chamikara Jayalath <[email protected]>
wrote:

>
>
> On Wed, Jul 25, 2018 at 8:13 AM Kelsey RIDER <
> [email protected]> wrote:
>
>> In my use-case, I’m reading a single CSV file, parsing the records, and
>> then inserting them into a database.
>>
>>
>>
>> The whole bounded/unbounded thing is very unclear to me: when is a
>> PCollection which? How do I make it be one or the other?
>>
>
> I'd say, as a pipeline author, you shouldn't have to worry about this too
> much. The Beam model allows you to easily transition from one to the other.
> But your specific case is a batch pipeline that will use a bounded source.
>
>
>> More generally, how does the Pipeline work? Can multiple PTransforms be
>> running at once?
>>
>
> Yes. This is ultimately up to the runner. But many runners run PTransforms
> in parallel when possible and also shard PTransform input to parallelize
> execution using multiple workers.
>
>
>>
>> My code has a simple DoFn that takes the ReadableFile provided by FileIO,
>> opens it for streaming, and starts generating records (using apache
>> commons-csv). The next PTransform then takes those records and inserts them
>> into a DB. Will this be handled continuously, or will each PTransform block
>> until it has finished all of its processing?
>>
>
> This will be handled continuously. Sinks (at the end of each stage)
> control how much data is buffered before it is emitted to the next stage.
>
>
>> In other words, will JdbcIO start inserting data before the CSVParser has
>> reached the end of the file (assuming it’s a big file)?
>>
>
> Yes.
>
> Thanks,
> Cham
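Putting the pieces of this exchange together, a minimal end-to-end sketch might look like the following. This is a sketch only: the file pattern, JDBC driver/URL, table name, and column mapping are placeholders, not from this thread, and a real pipeline may need an explicit coder (or a mapping to a simpler serializable type) for the parsed records.

```java
import java.io.InputStreamReader;
import java.nio.channels.Channels;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class CsvToDb {

  // DoFn that streams one matched file through commons-csv, emitting
  // records as they are parsed -- the file is never buffered whole.
  static class ParseCsvFn extends DoFn<FileIO.ReadableFile, CSVRecord> {
    @ProcessElement
    public void process(ProcessContext c) throws Exception {
      FileIO.ReadableFile file = c.element();
      try (InputStreamReader reader =
          new InputStreamReader(Channels.newInputStream(file.open()))) {
        for (CSVRecord record :
            CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(reader)) {
          c.output(record);
        }
      }
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    p.apply(FileIO.match().filepattern("/path/to/input.csv"))
        .apply(FileIO.readMatches())
        .apply(ParDo.of(new ParseCsvFn()))
        // Records are written as they arrive; the runner controls batching.
        .apply(JdbcIO.<CSVRecord>write()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "org.postgresql.Driver", "jdbc:postgresql://localhost/mydb"))
            .withStatement("INSERT INTO my_table (col_a, col_b) VALUES (?, ?)")
            .withPreparedStatementSetter((record, stmt) -> {
              stmt.setString(1, record.get(0));
              stmt.setString(2, record.get(1));
            }));
    p.run().waitUntilFinish();
  }
}
```

Because parsing and writing are separate ParDo stages, the runner can overlap them: JdbcIO begins inserting as soon as the first bundle of records is emitted, without waiting for the parser to reach end of file.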
>
>
>>
>>
>> *From:* Chamikara Jayalath <[email protected]>
>> *Sent:* mardi 24 juillet 2018 18:34
>> *To:* [email protected]
>> *Subject:* Re: Large CSV files
>>
>>
>>
>> Are you trying to read a growing file? I don't think this scenario is
>> well supported. You can use FileIO.MatchAll.continuously() if you want
>> to read a growing list of files (where new files get added to a given
>> directory).
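For the growing-directory case, the continuous match mentioned above looks roughly like this fragment (assuming an existing `Pipeline p`; the pattern, poll interval, and termination condition are placeholders):

```java
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.Watch;
import org.joda.time.Duration;

p.apply(FileIO.match()
        .filepattern("/path/to/dir/*.csv")
        // Poll for new files every 30 seconds; stop watching once no new
        // file has appeared for an hour.
        .continuously(
            Duration.standardSeconds(30),
            Watch.Growth.afterTimeSinceNewOutput(Duration.standardHours(1))))
    .apply(FileIO.readMatches());
```

Note that `continuously()` turns the result into an unbounded PCollection, so the downstream pipeline runs in streaming mode.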
>>
>>
>>
>> If you are reading a large but fixed set of files, then what you need is
>> a bounded source, not an unbounded one. We do not have a pre-defined
>> source for reading CSV files with multi-line records (unless you can
>> identify a record delimiter and use TextIO with the withDelimiter()
>> option). So I'd suggest using FileIO.match() or FileIO.matchAll() and a
>> custom ParDo to read records.
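For the case where a record delimiter can be identified, the TextIO option mentioned above looks like this fragment (assuming an existing `Pipeline p`; the path and the choice of `"\r\n"` as delimiter are examples, valid only if embedded field newlines are bare `"\n"`):

```java
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.io.TextIO;

// Split records on "\r\n" instead of any newline, so a lone "\n"
// inside a quoted multi-line field does not end the record.
p.apply(TextIO.read()
        .from("/path/to/input.csv")
        .withDelimiter("\r\n".getBytes(StandardCharsets.UTF_8)));
```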
>>
>>
>>
>> Thanks,
>>
>> Cham
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Jul 23, 2018 at 11:28 PM Kai Jiang <[email protected]> wrote:
>>
>> I have the same situation. If the CSV is splittable, we could use SDF
>> (SplittableDoFn).
>>
>>
>>
>>
>> On Mon, Jul 23, 2018 at 1:38 PM Raghu Angadi <[email protected]> wrote:
>>
>> It might be simpler to discuss if you replicate the question here.
>>
>>
>>
>> Are your CSV files splittable? Even if they aren't, the Flink/Dataflow
>> runners would not load the entire file into memory. This is a streaming
>> application, right? MatchAll in FileIO.java is used in TextIO, AvroIO,
>> etc. to read files continuously in streaming applications. It is built
>> on SDF and allows reading smaller chunks of the file (as long as the
>> file is splittable).
>>
>>
>>
>> Raghu.
>>
>>
>>
>>
>>
>> On Mon, Jul 23, 2018 at 7:16 AM Andrew Pilloud <[email protected]>
>> wrote:
>>
>> Hi Kelsey,
>>
>>
>>
>> I posted a reply on Stack Overflow. It sounds like you might be using the
>> DirectRunner, which isn't meant to handle datasets that are too big to fit
>> into memory. If that is the case, have you tried the Flink local runner or
>> the Dataflow runner?
>>
>>
>>
>> Andrew
>>
>>
>>
>> On Mon, Jul 23, 2018 at 4:06 AM Kelsey RIDER <
>> [email protected]> wrote:
>>
>> Hello,
>>
>>
>>
>> SO question here:
>> https://stackoverflow.com/questions/51439189/how-to-read-large-csv-with-beam
>>
>> Anybody have any ideas? Am I missing something?
>>
>>
>>
>> Thanks
>>
>>
>>
