Hi Darshan,
This looks like a file system configuration issue to me.
Flink supports different file systems for S3 and there are also a few
tuning knobs.
Did you have a look at the docs for file system configuration [1]?
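For reference, connection limits and timeouts for the Hadoop S3A connector can be tuned via Hadoop configuration keys, which Flink's S3 file system picks up. A hedged sketch follows; the exact key names and defaults depend on your Hadoop and Flink versions, so verify them against the docs before relying on them:

```yaml
# Assumed Hadoop S3A tuning knobs (Hadoop configuration, e.g. core-site.xml,
# forwarded by Flink's S3 file system). Key names assume the S3A connector.
fs.s3a.connection.maximum: 200       # max concurrent HTTP connections to S3
fs.s3a.connection.timeout: 200000    # socket timeout in milliseconds
fs.s3a.attempts.maximum: 20          # retries on transient request failures
```

Raising the connection maximum and timeout is a common first step when reading very many small objects from S3 in parallel.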
Best, Fabian
[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.
Thanks for the details. I got it working. I have around one directory for
each month and I am running over 12-15 months of data, so I created a dataset
from each month and did a union.
However, when I run it I get an HTTP timeout issue. I am reading more than
120K files in total across all the months.
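To illustrate the per-month layout described above: the month directories can be generated programmatically before creating one source per directory and unioning the results. This is a hedged sketch in plain Java; the bucket name and path layout are made up, and in Flink you would pass each generated directory to something like env.readTextFile(dir) and fold the resulting DataSets together with union.

```java
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;

public class MonthPaths {
    // Hypothetical prefix/layout: one directory per month, named YYYY-MM.
    static List<String> monthDirs(YearMonth start, int months, String prefix) {
        List<String> dirs = new ArrayList<>();
        for (int i = 0; i < months; i++) {
            // YearMonth.toString() yields e.g. "2017-06"
            dirs.add(prefix + "/" + start.plusMonths(i));
        }
        return dirs;
    }

    public static void main(String[] args) {
        // 12 months starting June 2017, under a made-up S3 prefix
        for (String dir : monthDirs(YearMonth.of(2017, 6), 12, "s3://bucket/data")) {
            System.out.println(dir);
        }
    }
}
```

In Flink you would then loop over these directories, create one dataset per directory, and union them pairwise, exactly as described above.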
I am usin
Thanks all for your responses. I am now much clearer on this.
Thanks
On Tue, Aug 14, 2018 at 9:46 AM, Fabian Hueske wrote:
Hi,
Flink InputFormats generate their InputSplits sequentially on the
JobManager.
These splits are stored in the heap of the JM process and handed out to
SourceTasks when they request them lazily.
Split assignment is done by an InputSplitAssigner, which can be customized.
FileInputFormats typically
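To make the memory implication concrete, here is a toy model (not Flink's actual classes) of the lazy split assignment described above: all splits are materialized on the JobManager heap up front, and each source task pulls the next split when it is ready. With millions of input files, that means millions of split objects resident in the JM process.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy stand-in for Flink's split assignment; class and method names are
// illustrative only, not the real Flink API.
public class SplitAssigner {
    private final Queue<String> splits = new ArrayDeque<>();

    public SplitAssigner(Iterable<String> files) {
        // One split per file; the whole set lives on the "JobManager" heap.
        for (String f : files) {
            splits.add(f);
        }
    }

    // Called by a source task whenever it is ready for more work.
    // Returns null once all splits have been handed out.
    public synchronized String nextSplit() {
        return splits.poll();
    }
}
```

The pull-based design means slow and fast tasks naturally balance their load, but the full split list must fit in the JobManager's memory first.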
It causes more overhead (more processes, etc.), which might make it slower.
Furthermore, if the files are stored on HDFS, the bottleneck is the NameNode,
which will have to answer millions of requests.
The latter point will change in future Hadoop versions with
http://ozone.hadoop.apache.org/
Hi Darshan,
In a distributed setup there is no hard limit in theory, but real-world
factors do impose practical constraints, such as the size of your individual
files, the configured memory of your TaskManagers, and so on.
In addition, your "single" here is logical or physica
Hi Guys,
Is there a limit on the number of files a Flink DataSet can read? My question
is: will there be any sort of issues if I have, say, millions of files to read
to create a single dataset?
Thanks