That makes perfect sense, thanks!

On 25.06.2015 at 21:39, "Aaron Jackson" <ajack...@pobox.com> wrote:
> So the JobManager was running on host1. This also explains why I didn't
> see the problem until I had asked for a sizeable degree of parallelism
> since it probably never assigned a task to host3.
>
> Thanks for your help
>
> On Thu, Jun 25, 2015 at 3:34 AM, Stephan Ewen <se...@apache.org> wrote:
>
>> Nice!
>>
>> TaskManagers need to announce where they listen for connections.
>>
>> We do not yet block "localhost" as an acceptable address, to not prohibit
>> local test setups.
>>
>> There are some routines that try to select an interface that can
>> communicate with the outside world.
>>
>> Is host3 running on the same machine as the JobManager? Or did you
>> experience a long delay until TaskManager 3 was registered?
>>
>> Thanks for helping us debug this,
>> Stephan
>>
>> On Wed, Jun 24, 2015 at 11:58 PM, Aaron Jackson <ajack...@pobox.com>
>> wrote:
>>
>>> That was it. host3 was showing localhost - looked a little further and
>>> it was missing an entry in /etc/hosts.
>>>
>>> Thanks for looking into this.
>>>
>>> Aaron
>>>
>>> On Wed, Jun 24, 2015 at 2:13 PM, Stephan Ewen <se...@apache.org> wrote:
>>>
>>>> Aaron,
>>>>
>>>> Can you check how the TaskManagers register at the JobManager? When you
>>>> look at the 'TaskManagers' section in the JobManager's web interface (at
>>>> port 8081), what does it say as the TaskManager host names?
>>>>
>>>> Does it list "host1", "host2", "host3"...?
>>>>
>>>> Thanks,
>>>> Stephan
>>>>
>>>> On 24.06.2015 at 20:31, "Ufuk Celebi" <u...@apache.org> wrote:
>>>>
>>>>> On 24 Jun 2015, at 16:22, Aaron Jackson <ajack...@pobox.com> wrote:
>>>>>
>>>>> > Thanks. My setup is actually 3 task managers x 4 slots. I played
>>>>> > with the parallelism and found that at low values, the error did
>>>>> > not occur. I can only conclude that there is some form of data
>>>>> > shuffling that is occurring that is sensitive to the data source.
>>>>> > Yes, seems a little odd to me as well. OOC, did you load the file
>>>>> > into HDFS or use it from a local file system (e.g.
>>>>> > file:///tmp/data.csv) - my results have shown that so far, HDFS
>>>>> > does not appear to be sensitive to this issue.
>>>>> >
>>>>> > I updated the example to include my configuration and slaves, but
>>>>> > for brevity, I'll include the configurable bits here:
>>>>> >
>>>>> > jobmanager.rpc.address: host01
>>>>> > jobmanager.rpc.port: 6123
>>>>> > jobmanager.heap.mb: 512
>>>>> > taskmanager.heap.mb: 2048
>>>>> > taskmanager.numberOfTaskSlots: 4
>>>>> > parallelization.degree.default: 1
>>>>> > jobmanager.web.port: 8081
>>>>> > webclient.port: 8080
>>>>> > taskmanager.network.numberOfBuffers: 8192
>>>>> > taskmanager.tmp.dirs: /datassd/flink/tmp
>>>>> >
>>>>> > And the slaves ...
>>>>> >
>>>>> > host01
>>>>> > host02
>>>>> > host03
>>>>> >
>>>>> > I did notice an extra empty line at the end of the slaves. And
>>>>> > while I highly doubt it makes ANY difference, I'm still going to
>>>>> > re-run with it removed.
>>>>> >
>>>>> > Thanks for looking into it.
>>>>>
>>>>> Thank you for being so helpful. I've tried it with the local
>>>>> filesystem.
>>>>>
>>>>> On 23 Jun 2015, at 07:11, Aaron Jackson <ajack...@pobox.com> wrote:
>>>>>
>>>>> > I have 12 task managers across 3 machines - so it's a small setup.
>>>>>
>>>>> Sorry for my misunderstanding. I've tried it with both 12 task
>>>>> managers and 3 as well now. What's odd is that the stack trace shows
>>>>> that it is trying to connect to "localhost" for the remote channel
>>>>> although localhost is not configured anywhere. Let me think about
>>>>> that. ;)
>>>>>
>>>>> – Ufuk
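
For anyone who runs into the same symptom: the cause found in this thread was that host3 registered as "localhost" because its /etc/hosts was missing an entry. A quick check along the following lines, run on each worker before starting the cluster, might catch that earlier. This is only a rough sketch in Python, not part of Flink. The function names are made up for illustration, and it assumes that the address a TaskManager announces follows the machine's own hostname resolution, which may not hold exactly given the interface-selection routines Stephan mentions.

```python
import ipaddress
import socket
from typing import Tuple


def is_loopback(address: str) -> bool:
    """Return True if the textual IP address is a loopback address."""
    return ipaddress.ip_address(address).is_loopback


def resolve_local_hostname() -> Tuple[str, str]:
    """Return this machine's hostname and the IPv4 address it resolves to."""
    hostname = socket.gethostname()
    return hostname, socket.gethostbyname(hostname)


def describe(hostname: str, address: str) -> str:
    """Build a one-line verdict for a hostname/address pair (hypothetical helper)."""
    if is_loopback(address):
        return (f"{hostname} resolves to loopback address {address}; "
                "other machines cannot connect to it under that address")
    return f"{hostname} resolves to {address}, which is not a local-only address"


if __name__ == "__main__":
    try:
        name, addr = resolve_local_hostname()
    except OSError as err:
        # Resolution can fail outright if the hostname is unknown to the resolver.
        print(f"could not resolve local hostname: {err}")
    else:
        print(describe(name, addr))
```

If the check reports a loopback address on a worker, the hostname mapping on that machine (for example its /etc/hosts entry, as in Aaron's case) is the first thing to look at.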