I am testing my regex file input format, but because I have a workflow that
depends on the filename (each filename contains a number that I need), I
need to add another field to each of my tuples. What is the best way to
avoid this additional field, which I only need for grouping and one
multiplication (in a MapFunction) late in my workflow? An easy way would be
to do the multiplication in the input format; however, I also need the
value for grouping.
If I were able to use many data sources (one for each file), I could avoid
the additional field (no grouping per file required) and possibly decrease
the runtime of the plan(s).
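For illustration, here is a minimal plain-Java sketch of the idea (not the Flink API; the filename pattern, the tuple shape, and the "multiply by the key" step are assumptions standing in for the actual workflow): extract the number from each filename once, use it as the grouping key, and apply the one multiplication after grouping.

```java
import java.util.*;
import java.util.regex.*;

public class FilenameKeyExample {
    // Extract the numeric part of a filename like "data_42.csv" -> 42
    static int keyFromFilename(String filename) {
        Matcher m = Pattern.compile("(\\d+)").matcher(filename);
        if (!m.find()) throw new IllegalArgumentException("no number in " + filename);
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        // (filename, value) pairs standing in for the tuples
        List<Map.Entry<String, Double>> tuples = List.of(
                Map.entry("data_1.csv", 2.0),
                Map.entry("data_1.csv", 3.0),
                Map.entry("data_2.csv", 4.0));

        // Group (sum) by the filename-derived key, then do the one
        // multiplication per group instead of carrying the field per tuple
        Map<Integer, Double> grouped = new TreeMap<>();
        for (var t : tuples) {
            grouped.merge(keyFromFilename(t.getKey()), t.getValue(), Double::sum);
        }
        grouped.replaceAll((k, v) -> v * k);
        System.out.println(grouped); // {1=5.0, 2=8.0}
    }
}
```

In a real Flink job the key extraction would live in the InputFormat or a KeySelector; the point of the sketch is only that the number needs to be materialized once per record for grouping, regardless of where the multiplication happens.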

Thanks in advance for your help.

2015-07-01 10:20 GMT+02:00 Stephan Ewen <se...@apache.org>:

> How about also allowing a vararg of multiple file names for the input
> format?
>
> We'd then have the option of
>
>  - File or directory
>  - List of files or directories
>  - Base directory + regex that matches contained file paths
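Flink did not ship a "base directory + regex" input format at the time, so the following is only a stdlib sketch of the matching logic such an option might use (directory layout and pattern are invented): walk the base directory and keep the files whose relative path matches the regex, which would then be read as one unioned source.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.regex.Pattern;
import java.util.stream.*;

public class RegexFileMatcher {
    // List all regular files under baseDir whose relative path matches regex.
    static List<Path> matchingFiles(Path baseDir, String regex) throws IOException {
        Pattern p = Pattern.compile(regex);
        try (Stream<Path> walk = Files.walk(baseDir)) {
            return walk.filter(Files::isRegularFile)
                       .filter(f -> p.matcher(baseDir.relativize(f).toString()).matches())
                       .sorted()
                       .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("regex-demo");
        Files.createFile(dir.resolve("part-1.csv"));
        Files.createFile(dir.resolve("part-2.csv"));
        Files.createFile(dir.resolve("notes.txt"));

        // Only the two CSV parts match the pattern
        System.out.println(matchingFiles(dir, "part-\\d+\\.csv").size()); // 2
    }
}
```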
>
>
>
> On Wed, Jul 1, 2015 at 10:13 AM, Flavio Pompermaier <pomperma...@okkam.it>
> wrote:
>
>> +1 :)
>>
>> On Wed, Jul 1, 2015 at 10:08 AM, chan fentes <chanfen...@gmail.com>
>> wrote:
>>
>>> Thank you all for your help and for pointing out different possibilities.
>>> It would be nice to have an input format that takes a directory and a
>>> regex pattern (for file names) to create one data source instead of 1500.
>>> This would have helped me to avoid the problem. Maybe this can be included
>>> in one of the future releases. ;)
>>>
>>> 2015-06-30 19:02 GMT+02:00 Stephan Ewen <se...@apache.org>:
>>>
>>>> I agree with Aljoscha and Ufuk.
>>>>
>>>> As said, it will be hard for the system (currently) to handle 1500
>>>> sources, but handling a parallel source with 1500 files will be very
>>>> efficient.
>>>> This is possible if all sources (files) deliver the same data type and
>>>> can be unioned.
>>>>
>>>> If that is true, you can
>>>>
>>>>  - Specify the input as a directory.
>>>>
>>>>  - If you cannot do that, because there is no common parent directory,
>>>> you can "union" the files into one data source with a simple trick, as
>>>> described here:
>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/open-multiple-file-from-list-of-uri-tp1804p1807.html
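The trick in the linked thread is to read each file and union the resulting data sets in a loop. A plain-Java analogue is below (the file contents are made up; in Flink you would call `env.readTextFile(...)` per path and combine with `DataSet#union` instead of list concatenation):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class UnionFiles {
    // Read every file and concatenate ("union") the lines into one collection.
    static List<String> unionAll(List<Path> files) throws IOException {
        List<String> union = new ArrayList<>();
        for (Path f : files) {
            // Flink analogue: result = (result == null) ? src : result.union(src)
            union.addAll(Files.readAllLines(f));
        }
        return union;
    }

    public static void main(String[] args) throws IOException {
        Path a = Files.createTempFile("a", ".txt");
        Path b = Files.createTempFile("b", ".txt");
        Files.write(a, List.of("1", "2"));
        Files.write(b, List.of("3"));
        System.out.println(unionAll(List.of(a, b))); // [1, 2, 3]
    }
}
```

The key property is that the loop produces one logical source, so the plan size stays constant no matter how many files are unioned.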
>>>>
>>>>
>>>>
>>>> On Tue, Jun 30, 2015 at 5:36 PM, Aljoscha Krettek <aljos...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Chan,
>>>>> Flink sources support giving a directory as an input path in a source.
>>>>> If you do this, it will read each of the files in that directory. The
>>>>> way you do it leads to a very big plan, because the plan will be
>>>>> replicated 1500 times; this could lead to the OutOfMemoryError.
>>>>>
>>>>> Is there a specific reason why you create 1500 separate sources?
>>>>>
>>>>> Regards,
>>>>> Aljoscha
>>>>>
>>>>> On Tue, 30 Jun 2015 at 17:17 chan fentes <chanfen...@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> how many data sources can I use in one Flink plan? Is there any
>>>>>> limit? I get a
>>>>>> java.lang.OutOfMemoryError: unable to create new native thread
>>>>>> when having approx. 1500 files. What I basically do is the following:
>>>>>> DataSource -> Map -> Map -> GroupBy -> GroupReduce per file
>>>>>> and then
>>>>>> Union -> GroupBy -> Sum in a tree-like reduction.
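The plan shape above can be simulated in a few lines of plain Java (the data and the reduce function are invented; Flink's GroupBy/GroupReduce and Union operators are only imitated here): reduce each file to one value, then union all per-file results and aggregate them globally.

```java
import java.util.*;

public class PlanShapeSketch {
    // Per file: GroupReduce collapsed to a single sum; then Union -> Sum.
    static int unionSum(List<List<Integer>> files) {
        int total = 0; // one global group after the union
        for (List<Integer> file : files) {
            int perFile = file.stream().mapToInt(Integer::intValue).sum();
            total += perFile;
        }
        return total;
    }

    public static void main(String[] args) {
        // Three stand-in "files" with values already produced by the Map steps
        System.out.println(unionSum(List.of(List.of(1, 2), List.of(3, 4), List.of(5)))); // 15
    }
}
```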
>>>>>>
>>>>>> I have checked the workflow. It runs on a cluster without any
>>>>>> problem if I only use a few files. Does Flink use a thread per
>>>>>> operator? It seems as if I am limited in the number of threads I can
>>>>>> use. How can I avoid the exception mentioned above?
>>>>>>
>>>>>> Best regards
>>>>>> Chan
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>
