I am testing my regex file input format, but because my workflow depends on the filename (each filename contains a number that I need), I have to add another field to each of my tuples. What is the best way to avoid this additional field, which I only need for grouping and for one multiplication (in a MapFunction) late in my workflow? An easy way would be to do the multiplication in the input format; however, I also need the value for grouping. If I were able to use many data sources (one per file), I could avoid the additional field (no grouping per file required) and possibly decrease the runtime of the plan(s).
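For concreteness, the filename-derived key described above could be extracted with a small helper like the following. This is only a sketch: the thread never shows the actual filename format, so the regex and example names are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FilenameKey {
    // Hypothetical filename layout: a single number embedded somewhere in
    // the name, e.g. "measurement_42.csv" -> 42. The real format is not
    // shown in the thread, so this pattern is an assumption.
    private static final Pattern FILE_NUMBER = Pattern.compile("(\\d+)");

    public static int extractNumber(String fileName) {
        Matcher m = FILE_NUMBER.matcher(fileName);
        if (!m.find()) {
            throw new IllegalArgumentException("no number in filename: " + fileName);
        }
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        // A custom input format could call this once per file split and
        // attach the result as the extra tuple field discussed above.
        System.out.println(extractNumber("measurement_42.csv")); // prints 42
    }
}
```

A custom input format would run this once per split rather than once per record, so the cost of carrying the field is in the tuple width, not in repeated parsing.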
Thanks in advance for your help.

2015-07-01 10:20 GMT+02:00 Stephan Ewen <se...@apache.org>:

> How about allowing also a varArg of multiple file names for the input
> format?
>
> We'd then have the option of:
>
>   - a file or directory
>   - a list of files or directories
>   - a base directory + a regex that matches contained file paths
>
> On Wed, Jul 1, 2015 at 10:13 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>
>> +1 :)
>>
>> On Wed, Jul 1, 2015 at 10:08 AM, chan fentes <chanfen...@gmail.com> wrote:
>>
>>> Thank you all for your help and for pointing out different
>>> possibilities. It would be nice to have an input format that takes a
>>> directory and a regex pattern (for file names) to create one data
>>> source instead of 1500. This would have helped me avoid the problem.
>>> Maybe this can be included in one of the future releases. ;)
>>>
>>> 2015-06-30 19:02 GMT+02:00 Stephan Ewen <se...@apache.org>:
>>>
>>>> I agree with Aljoscha and Ufuk.
>>>>
>>>> As said, it will (currently) be hard for the system to handle 1500
>>>> sources, but handling a parallel source with 1500 files will be very
>>>> efficient. This is possible if all sources (files) deliver the same
>>>> data type and are unioned.
>>>>
>>>> If that is true, you can:
>>>>
>>>> - specify the input as a directory, or
>>>>
>>>> - if you cannot do that because there is no common parent directory,
>>>>   "union" the files into one data source with a simple trick, as
>>>>   described here:
>>>>   http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/open-multiple-file-from-list-of-uri-tp1804p1807.html
>>>>
>>>> On Tue, Jun 30, 2015 at 5:36 PM, Aljoscha Krettek <aljos...@apache.org> wrote:
>>>>
>>>>> Hi Chan,
>>>>> Flink sources support giving a directory as an input path. If you do
>>>>> this, it will read each of the files in that directory.
>>>>> The way you do it leads to a very big plan, because the plan will be
>>>>> replicated 1500 times; this could lead to the OutOfMemoryException.
>>>>>
>>>>> Is there a specific reason why you create 1500 separate sources?
>>>>>
>>>>> Regards,
>>>>> Aljoscha
>>>>>
>>>>> On Tue, 30 Jun 2015 at 17:17 chan fentes <chanfen...@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> how many data sources can I use in one Flink plan? Is there any
>>>>>> limit? I get a
>>>>>>   java.lang.OutOfMemoryException: unable to create native thread
>>>>>> when I have approx. 1500 files. What I basically do is the following:
>>>>>>   DataSource -> Map -> Map -> GroupBy -> GroupReduce per file
>>>>>> and then
>>>>>>   Union -> GroupBy -> Sum in a tree-like reduction.
>>>>>>
>>>>>> I have checked the workflow. It runs on a cluster without any
>>>>>> problem if I only use a few files. Does Flink use a thread per
>>>>>> operator? It seems as if I am limited in the number of threads I can
>>>>>> use. How can I avoid the exception mentioned above?
>>>>>>
>>>>>> Best regards
>>>>>> Chan
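To pin down the dataflow discussed in this thread (DataSource -> Map -> Map -> GroupBy -> GroupReduce per file, then Union -> GroupBy -> Sum), here is a plain-Java sketch of the same semantics, without Flink. The record layout is an assumption, and the weighting of each per-file sum by its file number stands in for the "one multiplication" mentioned at the top of the thread.

```java
import java.util.List;
import java.util.Map;

public class PerFileReduction {
    // Records grouped by the number parsed from each filename (assumed layout):
    // the per-file GroupReduce is a sum of that file's values, then
    // Union + GroupBy + Sum collapses everything into one total, applying
    // the per-file multiplication along the way.
    public static long run(Map<Integer, List<Long>> recordsPerFile) {
        long total = 0;
        for (Map.Entry<Integer, List<Long>> e : recordsPerFile.entrySet()) {
            long fileSum = 0;                  // GroupReduce per file
            for (long v : e.getValue()) {
                fileSum += v;
            }
            total += e.getKey() * fileSum;     // multiplication + global Sum
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, List<Long>> data = Map.of(
                2, List.of(1L, 2L, 3L),        // file "..._2": sum 6, weighted 12
                5, List.of(4L, 6L));           // file "..._5": sum 10, weighted 50
        System.out.println(run(data));         // prints 62
    }
}
```

Expressed this way, the per-file grouping only needs the file number as a key, which is why a single directory-backed source (carrying the filename-derived field per record) can replace 1500 separate sources without changing the result.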