Nice! TaskManagers need to announce where they listen for connections.
We do not yet block "localhost" as an acceptable address, to not prohibit local test setups. There are some routines that try to select an interface that can communicate with the outside world.

Is host3 running on the same machine as the JobManager? Or did you experience a long delay until TaskManager 3 was registered?

Thanks for helping us debug this,
Stephan

On Wed, Jun 24, 2015 at 11:58 PM, Aaron Jackson <ajack...@pobox.com> wrote:

> That was it. host3 was showing localhost - looked a little further and it
> was missing an entry in /etc/hosts.
>
> Thanks for looking into this.
>
> Aaron
>
> On Wed, Jun 24, 2015 at 2:13 PM, Stephan Ewen <se...@apache.org> wrote:
>
>> Aaron,
>>
>> Can you check how the TaskManagers register at the JobManager? When you
>> look at the 'TaskManagers' section in the JobManager's web interface (at
>> port 8081), what does it say as the TaskManager host names?
>>
>> Does it list "host1", "host2", "host3"...?
>>
>> Thanks,
>> Stephan
>>
>> On 24.06.2015 at 20:31, "Ufuk Celebi" <u...@apache.org> wrote:
>>
>>> On 24 Jun 2015, at 16:22, Aaron Jackson <ajack...@pobox.com> wrote:
>>>
>>> > Thanks. My setup is actually 3 task managers x 4 slots. I played
>>> > with the parallelism and found that at low values, the error did not
>>> > occur. I can only conclude that there is some form of data shuffling that
>>> > is occurring that is sensitive to the data source. Yes, seems a little odd
>>> > to me as well. OOC, did you load the file into HDFS or use it from a local
>>> > file system (e.g. file:///tmp/data.csv) - my results have shown that so
>>> > far, HDFS does not appear to be sensitive to this issue.
>>> >
>>> > I updated the example to include my configuration and slaves, but for
>>> > brevity, I'll include the configurable bits here:
>>> >
>>> > jobmanager.rpc.address: host01
>>> > jobmanager.rpc.port: 6123
>>> > jobmanager.heap.mb: 512
>>> > taskmanager.heap.mb: 2048
>>> > taskmanager.numberOfTaskSlots: 4
>>> > parallelization.degree.default: 1
>>> > jobmanager.web.port: 8081
>>> > webclient.port: 8080
>>> > taskmanager.network.numberOfBuffers: 8192
>>> > taskmanager.tmp.dirs: /datassd/flink/tmp
>>> >
>>> > And the slaves ...
>>> >
>>> > host01
>>> > host02
>>> > host03
>>> >
>>> > I did notice an extra empty line at the end of the slaves. And while
>>> > I highly doubt it makes ANY difference, I'm still going to re-run with it
>>> > removed.
>>> >
>>> > Thanks for looking into it.
>>>
>>> Thank you for being so helpful. I've tried it with the local filesystem.
>>>
>>> On 23 Jun 2015, at 07:11, Aaron Jackson <ajack...@pobox.com> wrote:
>>>
>>> > I have 12 task managers across 3 machines - so it's a small setup.
>>>
>>> Sorry for my misunderstanding. I've tried it with both 12 task managers
>>> and 3 as well now. What's odd is that the stack trace shows that it is
>>> trying to connect to "localhost" for the remote channel although localhost
>>> is not configured anywhere. Let me think about that. ;)
>>>
>>> – Ufuk
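
For anyone who runs into the same symptom (a TaskManager showing up as "localhost" because of a missing /etc/hosts entry), a quick check on each worker is to see what the machine's own host name resolves to. The sketch below is a standalone diagnostic using only java.net.InetAddress. It is not Flink's interface-selection routine, and the class and method names (HostnameCheck, resolvesToLoopback) are made up for illustration. It assumes the misconfiguration shows up either as a loopback address or as a host name that does not resolve at all, which may not hold on every system.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Standalone diagnostic sketch (hypothetical helper, not part of Flink).
public class HostnameCheck {

    // Returns true if the host name or IP literal resolves to a loopback
    // address (127.0.0.0/8 or ::1). Unresolvable names raise an exception.
    public static boolean resolvesToLoopback(String host) {
        try {
            return InetAddress.getByName(host).isLoopbackAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Cannot resolve host: " + host, e);
        }
    }

    public static void main(String[] args) {
        try {
            // Ask the JVM what this machine's host name resolves to.
            InetAddress local = InetAddress.getLocalHost();
            System.out.println("Host name:   " + local.getHostName());
            System.out.println("Resolves to: " + local.getHostAddress());
            if (local.isLoopbackAddress()) {
                System.out.println("Warning: host name resolves to a loopback address;"
                        + " other machines cannot reach this address.");
            }
        } catch (UnknownHostException e) {
            // Typically means the host name has no entry in /etc/hosts or DNS.
            System.out.println("Local host name does not resolve: " + e.getMessage());
        }
    }
}
```

Running this on host01, host02 and host03 and comparing the output with the host names listed in the JobManager's web interface should show which machine is announcing a loopback address.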