Right now, I would go with the extra field. The roadmap has pending features that improve scheduling for plans like yours (with many data sources), but they are not yet in the code.
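For illustration, here is a minimal sketch of the extra-field approach (untested; the pattern, the Tuple2<Long, String> layout, and the class name are placeholders, since your regex input format will look different). The idea is to parse the number from the file name once per split in open() and stamp it onto every record:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.flink.api.common.io.DelimitedInputFormat;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.core.fs.FileInputSplit;
    import org.apache.flink.core.fs.Path;

    /**
     * Sketch: emits every line as (fileNumber, line), where fileNumber is
     * parsed from the name of the file the current split belongs to.
     */
    public class NumberedLineInputFormat extends DelimitedInputFormat<Tuple2<Long, String>> {

        // Placeholder pattern: "first run of digits in the file name".
        private static final Pattern FILE_NUMBER = Pattern.compile("(\\d+)");

        private long fileNumber;

        public NumberedLineInputFormat(Path filePath) {
            setFilePath(filePath);
        }

        @Override
        public void open(FileInputSplit split) throws IOException {
            super.open(split);
            // Parse the number once per split, not once per record.
            Matcher m = FILE_NUMBER.matcher(split.getPath().getName());
            fileNumber = m.find() ? Long.parseLong(m.group(1)) : -1L;
        }

        @Override
        public Tuple2<Long, String> readRecord(Tuple2<Long, String> reuse,
                byte[] bytes, int offset, int numBytes) {
            reuse.f0 = fileNumber;
            reuse.f1 = new String(bytes, offset, numBytes, StandardCharsets.UTF_8);
            return reuse;
        }
    }

With that, every tuple already carries the grouping key: env.createInput(new NumberedLineInputFormat(new Path("/path/to/dir"))) gives you a DataSet<Tuple2<Long, String>> to groupBy(0) on, and the multiplication stays a plain MapFunction later in the workflow.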
On Fri, Jul 17, 2015 at 11:24 AM, chan fentes <chanfen...@gmail.com> wrote:

> I am testing my regex file input format, but because I have a workflow
> that depends on the filename (each filename contains a number that I
> need), I have to add another field to each of my tuples. What is the best
> way to avoid this additional field, which I only need for grouping and
> one multiplication (in a MapFunction) late in my workflow? An easy way
> would be to do the multiplication in the input format; however, I also
> need the value for grouping.
> If I were able to use many data sources (one for each file), I could
> avoid the additional field (no grouping per file required) and possibly
> decrease the runtime of the plan(s).
>
> Thanks in advance for your help.
>
> 2015-07-01 10:20 GMT+02:00 Stephan Ewen <se...@apache.org>:
>
>> How about also allowing varargs of multiple file names for the input
>> format?
>>
>> We'd then have the option of
>>
>> - File or directory
>> - List of files or directories
>> - Base directory + regex that matches contained file paths
>>
>> On Wed, Jul 1, 2015 at 10:13 AM, Flavio Pompermaier
>> <pomperma...@okkam.it> wrote:
>>
>>> +1 :)
>>>
>>> On Wed, Jul 1, 2015 at 10:08 AM, chan fentes <chanfen...@gmail.com>
>>> wrote:
>>>
>>>> Thank you all for your help and for pointing out different
>>>> possibilities. It would be nice to have an input format that takes a
>>>> directory and a regex pattern (for file names) to create one data
>>>> source instead of 1500. This would have helped me avoid the problem.
>>>> Maybe this can be included in one of the future releases. ;)
>>>>
>>>> 2015-06-30 19:02 GMT+02:00 Stephan Ewen <se...@apache.org>:
>>>>
>>>>> I agree with Aljoscha and Ufuk.
>>>>>
>>>>> As said, it will (currently) be hard for the system to handle 1500
>>>>> sources, but handling one parallel source with 1500 files will be
>>>>> very efficient. This is possible if all sources (files) deliver the
>>>>> same data type and can be unioned.
>>>>>
>>>>> If that is the case, you can
>>>>>
>>>>> - Specify the input as a directory.
>>>>>
>>>>> - If you cannot do that, because there is no common parent
>>>>> directory, you can "union" the files into one data source with a
>>>>> simple trick, as described here:
>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/open-multiple-file-from-list-of-uri-tp1804p1807.html
>>>>>
>>>>> On Tue, Jun 30, 2015 at 5:36 PM, Aljoscha Krettek
>>>>> <aljos...@apache.org> wrote:
>>>>>
>>>>>> Hi Chan,
>>>>>> Flink sources support giving a directory as the input path. If you
>>>>>> do this, the source will read each of the files in that directory.
>>>>>> The way you do it leads to a very big plan, because the plan is
>>>>>> replicated 1500 times; this could cause the OutOfMemoryError.
>>>>>>
>>>>>> Is there a specific reason why you create 1500 separate sources?
>>>>>>
>>>>>> Regards,
>>>>>> Aljoscha
>>>>>>
>>>>>> On Tue, 30 Jun 2015 at 17:17 chan fentes <chanfen...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> how many data sources can I use in one Flink plan? Is there any
>>>>>>> limit? I get a
>>>>>>> java.lang.OutOfMemoryError: unable to create new native thread
>>>>>>> when I have approx. 1500 files. What I basically do is the
>>>>>>> following:
>>>>>>> DataSource -> Map -> Map -> GroupBy -> GroupReduce per file
>>>>>>> and then
>>>>>>> Union -> GroupBy -> Sum in a tree-like reduction.
>>>>>>>
>>>>>>> I have checked the workflow. It runs on a cluster without any
>>>>>>> problem if I only use a few files. Does Flink use a thread per
>>>>>>> operator? It seems as if I am limited in the number of threads I
>>>>>>> can use. How can I avoid the exception mentioned above?
>>>>>>>
>>>>>>> Best regards
>>>>>>> Chan
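For completeness, the two workarounds suggested earlier in the thread (directory input, and the union trick from the linked post) look roughly like this. This is a sketch that assumes plain text files read via readTextFile; the paths are placeholders:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class OneSourceForManyFiles {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Option 1: point a single source at the common parent directory;
            // Flink reads all files inside it as splits of that one source.
            DataSet<String> fromDir = env.readTextFile("/data/input");

            // Option 2: no common parent directory? Union per-file sources
            // into one logical data set (the trick from the linked post).
            String[] paths = {"/data/a/1.txt", "/other/b/2.txt"};
            DataSet<String> unioned = env.readTextFile(paths[0]);
            for (int i = 1; i < paths.length; i++) {
                unioned = unioned.union(env.readTextFile(paths[i]));
            }

            unioned.print(); // continue with map/groupBy/reduce as usual
        }
    }

Either way, the job graph stays small instead of replicating the whole pipeline 1500 times.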