(resending) On Wed, Jul 25, 2018, 8:34 AM Chamikara Jayalath <[email protected]> wrote:
> On Wed, Jul 25, 2018 at 8:13 AM Kelsey RIDER <[email protected]> wrote:
>
>> In my use-case, I’m reading a single CSV file, parsing the records, and
>> then inserting them into a database.
>>
>> The whole bounded/unbounded thing is very unclear to me: when is a
>> PCollection which? How do I make it be one or the other?
>
> I'd say that, as a pipeline author, you shouldn't have to worry about this
> too much. The Beam model allows you to transition easily from one to the
> other. In any case, your specific case is a batch pipeline that will use a
> bounded source.
>
>> More generally, how does the Pipeline work? Can multiple PTransforms be
>> running at once?
>
> Yes. This is ultimately up to the runner, but many runners run PTransforms
> in parallel when possible, and also shard a PTransform's input to
> parallelize execution across multiple workers.
>
>> My code has a simple DoFn that takes the ReadableFile provided by FileIO,
>> opens it for streaming, and starts generating records (using Apache
>> commons-csv). The next PTransform then takes those records and inserts
>> them into a DB. Will this be handled continuously, or will each
>> PTransform block until it has finished all of its processing?
>
> This will be handled continuously. Sinks (at the end of each stage)
> control how much data will be buffered before emitting to the next stage.
>
>> In other words, will JdbcIO start inserting data before the CSVParser has
>> reached the end of the file (assuming it’s a big file)?
>
> Yes.
>
> Thanks,
> Cham
>
>> *From:* Chamikara Jayalath <[email protected]>
>> *Sent:* Tuesday, July 24, 2018 18:34
>> *To:* [email protected]
>> *Subject:* Re: Large CSV files
>>
>> Are you trying to read a growing file? I don't think this scenario is
>> well supported. You can use FileIO.MatchAll.continuously() if you want
>> to read a growing list of files (where new files get added to a given
>> directory).
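[The "handled continuously" behaviour discussed above can be mimicked outside Beam. The sketch below is a minimal, stdlib-only illustration (no Beam or commons-csv; the class and method names are made up for this example, not Beam APIs): each record is handed to a downstream consumer as soon as it is parsed, so a sink such as a DB writer can start work before the reader reaches the end of the input.]

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch only (not a Beam API): record-at-a-time emission,
// where the downstream consumer receives each record as soon as it is
// parsed, before the whole input has been read.
public class StreamingCsvSketch {

    // Reads one line at a time and hands each parsed record straight to
    // the sink; the whole input is never buffered in memory.
    static void parse(Reader input, Consumer<List<String>> sink) throws IOException {
        BufferedReader reader = new BufferedReader(input);
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            // Naive comma split; real CSV needs quote handling (commons-csv).
            sink.accept(Arrays.asList(line.split(",", -1)));
        }
    }

    public static void main(String[] args) throws IOException {
        String csv = "id,name\n1,alice\n2,bob\n";
        // Prints one record per line as it is parsed:
        // [id, name] / [1, alice] / [2, bob]
        parse(new StringReader(csv), record -> System.out.println(record));
    }
}
```

[In Beam terms, the `sink.accept(...)` call corresponds to `ProcessContext.output(...)` inside a DoFn: the runner is free to push each emitted element downstream while the DoFn is still reading the file.]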
>>
>> If you are reading a large but fixed set of files, then what you need is
>> a bounded source, not an unbounded source. We do not have a pre-defined
>> source for reading CSV files with multi-line records (unless you can
>> identify a record delimiter and use TextIO with the withDelimiter()
>> option). So I'd suggest using FileIO.match() or FileIO.matchAll() and
>> using a custom ParDo to read records.
>>
>> Thanks,
>>
>> Cham
>>
>> On Mon, Jul 23, 2018 at 11:28 PM Kai Jiang <[email protected]> wrote:
>>
>> I have the same situation. If CSV is splittable, we could use SDF.
>>
>> On Mon, Jul 23, 2018 at 1:38 PM Raghu Angadi <[email protected]> wrote:
>>
>> It might be simpler to discuss if you replicate the question here.
>>
>> Are your CSV files splittable? Otherwise Flink/Dataflow runners would
>> not load the entire file into memory. This is a streaming application,
>> right? MatchAll in FileIO.java is used in TextIO, AvroIO etc. to read
>> files continuously in streaming applications. It is built on SDF and
>> allows reading smaller chunks of the file (as long as the file is
>> splittable).
>>
>> Raghu.
>>
>> On Mon, Jul 23, 2018 at 7:16 AM Andrew Pilloud <[email protected]> wrote:
>>
>> Hi Kelsey,
>>
>> I posted a reply on Stack Overflow. It sounds like you might be using
>> the DirectRunner, which isn't meant to handle datasets that are too big
>> to fit into memory. If that is the case, have you tried the Flink local
>> runner or the Dataflow runner?
>>
>> Andrew
>>
>> On Mon, Jul 23, 2018 at 4:06 AM Kelsey RIDER <[email protected]> wrote:
>>
>> Hello,
>>
>> SO question here:
>> https://stackoverflow.com/questions/51439189/how-to-read-large-csv-with-beam
>>
>> Anybody have any ideas? Am I missing something?
>>
>> Thanks
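[A footnote on the multi-line-record problem raised earlier in the thread: a quoted CSV field may itself contain newlines, so naively splitting the file on '\n' breaks records apart, which is why Cham suggests either a record delimiter with TextIO's withDelimiter() or a custom ParDo. Below is a minimal, stdlib-only sketch of the quote-aware record-boundary logic such a ParDo would need; the class and method names are invented for this example, and a real implementation would hand each extracted record to a proper parser such as commons-csv.]

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only (not a Beam or commons-csv API): finds CSV record
// boundaries while respecting quoted fields, so that a newline inside "..."
// does not end the record.
public class MultiLineRecordSplitter {

    static List<String> splitRecords(String input) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                // Toggle quote state; a doubled quote ("") toggles twice,
                // which leaves the state unchanged, as intended.
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == '\n' && !inQuotes) {
                // A newline outside quotes ends the current record.
                if (current.length() > 0) {
                    records.add(current.toString());
                }
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString()); // trailing record with no newline
        }
        return records;
    }

    public static void main(String[] args) {
        // Record 1 spans two physical lines; record 2 is a plain line.
        String csv = "1,\"line one\nline two\"\n2,plain\n";
        splitRecords(csv).forEach(System.out::println);
    }
}
```

[The same state machine is what makes splittability hard: an SDF resuming mid-file cannot tell from the bytes alone whether it is inside or outside a quoted field, which is why a known record delimiter simplifies things so much.]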
