[jira] [Commented] (FLINK-1707) Add an Affinity Propagation Library Method

2016-08-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15421135#comment-15421135
 ] 

Josep Rubió commented on FLINK-1707:


Hi [~vkalavri],

I'm redoing the implementation and have some questions about it:

- I will remove the damping functionality so I don't have to keep the old sent 
values for each node; is that right?
- I'm thinking of implementing a method inside the AffinityPropagation class to 
transform a similarity matrix into a graph; is that right? Maybe this could be 
done in the Gelly library instead of the AP class.
- Each of the I vertices would have to hold the similarities to all the other 
points; is that feasible from a memory perspective?

Thanks!
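A minimal sketch of the matrix-to-graph conversion discussed above: turning an n x n similarity matrix into a list of (source, target, similarity) edge triples, the shape Gelly graph constructors consume. The class and the exact edge layout are assumptions for illustration, not the actual patch; a real version would build Gelly Edge&lt;Long, Double&gt; objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: convert a dense similarity matrix into directed,
// weighted edges. A stand-in for what an AffinityPropagation (or Gelly)
// utility method could look like; names are illustrative only.
public class SimilarityMatrixToGraph {

    // Simple value class standing in for Gelly's Edge<Long, Double>.
    public static final class Edge {
        public final long source;
        public final long target;
        public final double similarity;

        Edge(long source, long target, double similarity) {
            this.source = source;
            this.target = target;
            this.similarity = similarity;
        }
    }

    // One directed edge per ordered pair (i, j), i != j, carrying s(i, j).
    public static List<Edge> toEdges(double[][] similarity) {
        List<Edge> edges = new ArrayList<>();
        for (int i = 0; i < similarity.length; i++) {
            for (int j = 0; j < similarity[i].length; j++) {
                if (i != j) {
                    edges.add(new Edge(i, j, similarity[i][j]));
                }
            }
        }
        return edges;
    }
}
```

The memory question above then becomes concrete: a dense matrix yields n*(n-1) edges, so vertex state that also holds all similarities duplicates that footprint.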

> Add an Affinity Propagation Library Method
> --
>
> Key: FLINK-1707
> URL: https://issues.apache.org/jira/browse/FLINK-1707
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Reporter: Vasia Kalavri
>Assignee: Josep Rubió
>Priority: Minor
>  Labels: requires-design-doc
> Attachments: Binary_Affinity_Propagation_in_Flink_design_doc.pdf
>
>
> This issue proposes adding an implementation of the Affinity Propagation 
> algorithm as a Gelly library method and a corresponding example.
> The algorithm is described in paper [1] and a description of a vertex-centric 
> implementation can be found in [2].
> [1]: http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf
> [2]: http://event.cwi.nl/grades2014/00-ching-slides.pdf
> Design doc:
> https://docs.google.com/document/d/1QULalzPqMVICi8jRVs3S0n39pell2ZVc7RNemz_SGA4/edit?usp=sharing
> Example spreadsheet:
> https://docs.google.com/spreadsheets/d/1CurZCBP6dPb1IYQQIgUHVjQdyLxK0JDGZwlSXCzBcvA/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1707) Add an Affinity Propagation Library Method

2016-08-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15421317#comment-15421317
 ] 

Josep Rubió commented on FLINK-1707:


Hi [~vkalavri]

No worries :)
No, it is not: the damping factor is there to avoid oscillations in some cases. 
As for convergence, it actually makes the algorithm converge more slowly, but 
the authors recommend it. I will look into a solution where both options are 
available, e.g. a damping factor of 0 would not create the array with the old 
values.

Makes sense. I may need some help improving the current conversion from the 
input graph to the binary AP graph.

I will continue with it. Thanks!
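The trade-off discussed above can be sketched in a few lines (not the actual Flink code): with damping factor lambda, each new message is blended with the previously sent one, which is exactly why the old values must be stored; with lambda == 0 the update reduces to the raw value and no history is needed.

```java
// Classic Affinity Propagation damping, as in Frey & Dueck:
//   msg_t = lambda * msg_{t-1} + (1 - lambda) * raw_t
// lambda == 0 degenerates to msg_t = raw_t, so oldValue is never consulted.
public class Damping {

    public static double damp(double lambda, double oldValue, double newValue) {
        return lambda * oldValue + (1.0 - lambda) * newValue;
    }
}
```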



[jira] [Commented] (FLINK-1707) Add an Affinity Propagation Library Method

2016-08-17 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425389#comment-15425389
 ] 

Josep Rubió commented on FLINK-1707:


Hi [~vkalavri],

I changed the implementation; the E vertices now have these members:

private HashMap weights;
private HashMap oldValues;
private long exemplar;
private int convergenceFactor;
private int convergenceFactorCounter;

where before it was:

private HashMap weights;
private HashMap values;
private HashMap oldValues;
private long exemplar;

So:

* The values hash map is no longer needed.
* If a damping factor different from 0 is provided to the AP constructor, the 
oldValues HashMap is populated and the algorithm works the same way it did 
before.
* If a damping factor of 0 is provided to the constructor:
- The convergence condition changes: instead of requiring no modifications 
in the messages, it requires no local modifications in the vertices that 
decide to be exemplars. This must hold for a certain number of steps, defined 
in the constructor.
- No damping factor is applied to messages.
- The oldValues HashMap is not used.

Now I need to change the initialization of the graph; I will post the question 
to the dev list and see if someone can help me find the best way to do it. Once 
I finish, I will clean up and comment the code, update the document and submit 
a new pull request.

Thanks!
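The zero-damping convergence rule described above (exemplar choices unchanged for a given number of consecutive supersteps) could be tracked per vertex with a counter like the following. This is a hypothetical sketch; the field names mirror the members listed above, but the method and semantics are illustrative, not the actual patch.

```java
// Hypothetical per-vertex convergence tracking for the zero-damping case:
// the vertex is considered converged once its exemplar choice has stayed
// unchanged for convergenceFactor consecutive supersteps.
public class ConvergenceTracker {

    private final int convergenceFactor;       // required stable supersteps
    private int convergenceFactorCounter = 0;  // consecutive stable supersteps
    private long exemplar = -1;                // current exemplar choice (vertex id)

    public ConvergenceTracker(int convergenceFactor) {
        this.convergenceFactor = convergenceFactor;
    }

    // Called once per superstep with the vertex's current exemplar choice.
    // Returns true once the choice has been stable long enough.
    public boolean update(long newExemplar) {
        if (newExemplar == exemplar) {
            convergenceFactorCounter++;
        } else {
            exemplar = newExemplar;
            convergenceFactorCounter = 0;  // any change resets the streak
        }
        return convergenceFactorCounter >= convergenceFactor;
    }
}
```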





[jira] [Commented] (FLINK-3298) Streaming connector for ActiveMQ

2016-08-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430445#comment-15430445
 ] 

Jean-Baptiste Onofré commented on FLINK-3298:
-

By the way guys, why only ActiveMQ when we could do a generic JMS connector? 
FYI, that's what I did in Beam (JMSIO).

> Streaming connector for ActiveMQ
> 
>
> Key: FLINK-3298
> URL: https://issues.apache.org/jira/browse/FLINK-3298
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming Connectors
>Reporter: Mohit Sethi
>Assignee: Ivan Mushketyk
>Priority: Minor
>






[jira] [Commented] (FLINK-4326) Flink start-up scripts should optionally start services on the foreground

2016-08-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434446#comment-15434446
 ] 

Ismaël Mejía commented on FLINK-4326:
-

Just to make some progress here, I see three options:

1. I reopen my PR and we leave it as it was (it already works and achieves the 
intended goal, but has the weirdness of calling flink-daemon.sh 
start-foreground ...).
2. I reopen my PR but rename flink-daemon to flink-service to avoid mixing 
daemon+foreground (the downside is that it may break scripts for people using 
flink-daemon directly).
3. We create an additional flink-console script, at the cost of copy-pasting 
a small chunk of code at the beginning and maintaining both in the future (we 
could create some sort of common bash code library, but remember bash is not 
the nicest language for reuse).

I prefer option 1 because it reuses all the code and is already done and 
tested. I could do option 2 or 3 if most people agree on one of those, but I 
will still need help from you for the new review + some testing (in 
particular for case 3).

WDYT, or do you have other suggestions?


> Flink start-up scripts should optionally start services on the foreground
> -
>
> Key: FLINK-4326
> URL: https://issues.apache.org/jira/browse/FLINK-4326
> Project: Flink
>  Issue Type: Improvement
>  Components: Startup Shell Scripts
>Affects Versions: 1.0.3
>Reporter: Elias Levy
>
> This has previously been mentioned on the mailing list, but has not been 
> addressed.  Flink start-up scripts start the job and task managers in the 
> background.  This makes it difficult to integrate Flink with most process 
> supervisory tools and init systems, including Docker.  One can get around 
> this by hacking the scripts or manually starting the right classes via Java, 
> but that is a brittle solution.
> In addition to starting the daemons in the foreground, the start-up scripts 
> should use exec instead of running the commands, so as to avoid forks.  Many 
> supervisory tools assume the PID of the process to be monitored is that of 
> the process they first execute, and fork chains make it difficult for the 
> supervisor to figure out which process to monitor.  Specifically, 
> jobmanager.sh and taskmanager.sh should exec flink-daemon.sh, and 
> flink-daemon.sh should exec java.
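The exec point above can be demonstrated in a one-liner (a generic shell sketch, not the actual Flink scripts): without exec a wrapper script forks, so a supervisor ends up watching the wrapper's PID rather than the service's; with exec the child replaces the wrapper and keeps its PID. Both echoes below print the same PID.

```shell
# The outer bash prints its own PID, then exec-replaces itself with a new
# bash that prints its PID: the two numbers are identical, because exec
# does not fork. Remove `exec` and the second PID would differ.
bash -c 'echo "wrapper pid: $$"; exec bash -c "echo \"service pid: \$\$\""'
```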





[jira] [Commented] (FLINK-1707) Add an Affinity Propagation Library Method

2016-08-27 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15442548#comment-15442548
 ] 

Josep Rubió commented on FLINK-1707:


Hi [~vkalavri],

I've pushed a new version of AP with the following changes:

https://github.com/joseprupi/flink/blob/vertexcentric/flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/library/AffinityPropagation.java

- I've changed the model to the vertex-centric one, avoiding keeping the 
values in the vertices.

- I've added the option of no damping. When a damping factor of 0 is passed to 
the constructor, the convergence condition is no changes in the exemplars over 
a certain number of iterations; this number of iterations is the last 
parameter of the constructor. If a damping factor different from 0 is used, it 
keeps working as before and has to hold the old sent values in the vertex.

To do:

- I have not changed the initialization of the graph yet. I posted a question 
on the dev thread without much luck. Maybe I will implement an initialization 
from a similarity matrix for now and see how I can do it using Gelly 
functionality later.

- I will try to move the weight values to the edges instead of the vertices 
(I've created a new version of the design document, with the diagrams too). 
This way, vertices will only hold the old values in case the damping factor 
has to be used.




[jira] [Comment Edited] (FLINK-1707) Add an Affinity Propagation Library Method

2016-08-27 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15442548#comment-15442548
 ] 

Josep Rubió edited comment on FLINK-1707 at 8/28/16 1:47 AM:
-

Hi [~vkalavri]

I've pushed a new version of AP with following changes:

https://github.com/joseprupi/flink/blob/vertexcentric/flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/library/AffinityPropagation.java

- I've changed the model to the vertex centric avoiding having the values in 
the the vertices

- I've added the option of no damping. When a 0 factor of damping is passed to 
the constructor the convergence condition is no changes on the exemplars on a 
certain number of iterations, avoiding to have the old values in vertices. This 
number of iterations is the last parameter of the constructor. If a damping 
factor different to 0 is used it keeps working as before, having to hold the 
old sent values in the vertex. 

To do:

- I have not changed the initialization of the graph yet. I posted a question 
in dev thread with no much luck. Maybe I will implement an initialization with 
a similarity matrix for now and will see how I can do it using gelly 
functionality later

- I will try to change where are the weight values to be in the edges instead 
of vertices (I've created a new version of the design document with the 
diagrams too). This way vertices will only have the old values in case the 
damping factor has to be used.



was (Author: joseprupi):
Hi [~vkalavri]

I've pushed a new version of AP with following changes:

https://github.com/joseprupi/flink/blob/vertexcentric/flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/library/AffinityPropagation.java

- I've changed the model to the vertex centric avoiding having the values in 
the the vertices

- I've added the option of no damping. When a 0 factor of damping is passed to 
the constructor the convergence condition is no changes on the exemplars on a 
certain number of iterations. This number of iterations is the last parameter 
of the constructor. If a damping factor different to 0 is used it keep working 
as before, having to hold the old sent values in the vertex. 

To do:

- I have not changed the initialization of the graph yet. I posted a question 
in dev thread with no much luck. Maybe I will implement an initialization with 
a similarity matrix for now and will see how I can do it using gelly 
functionality later

- I will try to change where are the weight values to be in the edges instead 
of vertices (I've created a new version of the design document with the 
diagrams too). This way vertices will only have the old values in case the 
damping factor has to be used.


> Add an Affinity Propagation Library Method
> --
>
> Key: FLINK-1707
> URL: https://issues.apache.org/jira/browse/FLINK-1707
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Reporter: Vasia Kalavri
>Assignee: Josep Rubió
>Priority: Minor
>  Labels: requires-design-doc
> Attachments: Binary_Affinity_Propagation_in_Flink_design_doc.pdf
>
>
> This issue proposes adding the an implementation of the Affinity Propagation 
> algorithm as a Gelly library method and a corresponding example.
> The algorithm is described in paper [1] and a description of a vertex-centric 
> implementation can be found is [2].
> [1]: http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf
> [2]: http://event.cwi.nl/grades2014/00-ching-slides.pdf
> Design doc:
> https://docs.google.com/document/d/1QULalzPqMVICi8jRVs3S0n39pell2ZVc7RNemz_SGA4/edit?usp=sharing
> Example spreadsheet:
> https://docs.google.com/spreadsheets/d/1CurZCBP6dPb1IYQQIgUHVjQdyLxK0JDGZwlSXCzBcvA/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (FLINK-1707) Add an Affinity Propagation Library Method

2016-08-27 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15442548#comment-15442548
 ] 

Josep Rubió edited comment on FLINK-1707 at 8/28/16 1:48 AM:
-

Hi [~vkalavri],

I've pushed a new version of AP with following changes:

https://github.com/joseprupi/flink/blob/vertexcentric/flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/library/AffinityPropagation.java

- I've changed the model to the vertex-centric one, avoiding having the values 
in the vertices

- I've added the option of no damping. When a damping factor of 0 is passed to 
the constructor, the convergence condition becomes no changes to the exemplars 
for a certain number of iterations, which avoids having to keep the old values 
in the vertices. This number of iterations is the last parameter of the 
constructor. If a damping factor different from 0 is used, it keeps working as 
before, having to hold the old sent values in the vertex. 

To do:

- I have not changed the initialization of the graph yet. I posted a question 
in the dev thread without much luck. Maybe I will implement an initialization 
with a similarity matrix for now and see how I can do it using Gelly 
functionality later

- I will try to move the weight values to the edges instead of the vertices 
(I've created a new version of the design document with the diagrams too). 
This way vertices will only have the old values in case the damping factor 
has to be used.
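The zero-damping convergence rule described above (exemplars unchanged for a given number of supersteps) can be sketched like this (hypothetical names, not the actual Gelly code):

```java
import java.util.HashSet;
import java.util.Set;

public class ExemplarConvergence {
    private Set<Long> previous = new HashSet<>();
    private int stableCount = 0;
    private final int requiredStableIterations;

    public ExemplarConvergence(int requiredStableIterations) {
        this.requiredStableIterations = requiredStableIterations;
    }

    /**
     * Called once per superstep with the current set of exemplar vertex ids.
     * Returns true once the set has stayed identical for the required
     * number of consecutive supersteps.
     */
    public boolean update(Set<Long> currentExemplars) {
        if (currentExemplars.equals(previous)) {
            stableCount++;
        } else {
            stableCount = 0;
            previous = new HashSet<>(currentExemplars);
        }
        return stableCount >= requiredStableIterations;
    }
}
```

Note this checks only the exemplar decisions, not the message values, which is what removes the need for the old-values map when damping is 0.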

Thanks!!



> Add an Affinity Propagation Library Method
> --
>
> Key: FLINK-1707
> URL: https://issues.apache.org/jira/browse/FLINK-1707
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Reporter: Vasia Kalavri
>Assignee: Josep Rubió
>Priority: Minor
>  Labels: requires-design-doc
> Attachments: Binary_Affinity_Propagation_in_Flink_design_doc.pdf
>
>
> This issue proposes adding an implementation of the Affinity Propagation 
> algorithm as a Gelly library method and a corresponding example.
> The algorithm is described in paper [1] and a description of a vertex-centric 
> implementation can be found in [2].
> [1]: http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf
> [2]: http://event.cwi.nl/grades2014/00-ching-slides.pdf
> Design doc:
> https://docs.google.com/document/d/1QULalzPqMVICi8jRVs3S0n39pell2ZVc7RNemz_SGA4/edit?usp=sharing
> Example spreadsheet:
> https://docs.google.com/spreadsheets/d/1CurZCBP6dPb1IYQQIgUHVjQdyLxK0JDGZwlSXCzBcvA/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1707) Add an Affinity Propagation Library Method

2016-08-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15443415#comment-15443415
 ] 

Josep Rubió commented on FLINK-1707:


Hi [~vkalavri],

Yeps, I'll do that. I thought you said you did not want it that way :(

Thanks!

> Add an Affinity Propagation Library Method
> --
>
> Key: FLINK-1707
> URL: https://issues.apache.org/jira/browse/FLINK-1707
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Reporter: Vasia Kalavri
>Assignee: Josep Rubió
>Priority: Minor
>  Labels: requires-design-doc
> Attachments: Binary_Affinity_Propagation_in_Flink_design_doc.pdf
>
>
> This issue proposes adding an implementation of the Affinity Propagation 
> algorithm as a Gelly library method and a corresponding example.
> The algorithm is described in paper [1] and a description of a vertex-centric 
> implementation can be found in [2].
> [1]: http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf
> [2]: http://event.cwi.nl/grades2014/00-ching-slides.pdf
> Design doc:
> https://docs.google.com/document/d/1QULalzPqMVICi8jRVs3S0n39pell2ZVc7RNemz_SGA4/edit?usp=sharing
> Example spreadsheet:
> https://docs.google.com/spreadsheets/d/1CurZCBP6dPb1IYQQIgUHVjQdyLxK0JDGZwlSXCzBcvA/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-2662) CompilerException: "Bug: Plan generation for Unions picked a ship strategy between binary plan operators."

2016-09-05 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lorenz Bühmann updated FLINK-2662:
--
Attachment: FlinkBug.scala

A minimal example that leads to the reported exception. While keeping it as 
minimal as possible, the whole logic behind the program got lost, so please 
don't think about its meaning.

> CompilerException: "Bug: Plan generation for Unions picked a ship strategy 
> between binary plan operators."
> --
>
> Key: FLINK-2662
>     URL: https://issues.apache.org/jira/browse/FLINK-2662
> Project: Flink
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 0.9.1, 0.10.0
>Reporter: Gabor Gevay
>Priority: Critical
> Fix For: 1.0.0
>
> Attachments: FlinkBug.scala
>
>
> I have a Flink program which throws the exception in the jira title. Full 
> text:
> Exception in thread "main" org.apache.flink.optimizer.CompilerException: Bug: 
> Plan generation for Unions picked a ship strategy between binary plan 
> operators.
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.collect(BinaryUnionReplacer.java:113)
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.postVisit(BinaryUnionReplacer.java:72)
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.postVisit(BinaryUnionReplacer.java:41)
>   at 
> org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:170)
>   at 
> org.apache.flink.optimizer.plan.SingleInputPlanNode.accept(SingleInputPlanNode.java:199)
>   at 
> org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:163)
>   at 
> org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:163)
>   at 
> org.apache.flink.optimizer.plan.WorksetIterationPlanNode.acceptForStepFunction(WorksetIterationPlanNode.java:194)
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.preVisit(BinaryUnionReplacer.java:49)
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.preVisit(BinaryUnionReplacer.java:41)
>   at 
> org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:162)
>   at 
> org.apache.flink.optimizer.plan.SingleInputPlanNode.accept(SingleInputPlanNode.java:199)
>   at 
> org.apache.flink.optimizer.plan.SingleInputPlanNode.accept(SingleInputPlanNode.java:199)
>   at 
> org.apache.flink.optimizer.plan.OptimizedPlan.accept(OptimizedPlan.java:127)
>   at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:520)
>   at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:402)
>   at 
> org.apache.flink.client.LocalExecutor.getOptimizerPlanAsJSON(LocalExecutor.java:202)
>   at 
> org.apache.flink.api.java.LocalEnvironment.getExecutionPlan(LocalEnvironment.java:63)
>   at malom.Solver.main(Solver.java:66)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
> The execution plan:
> http://compalg.inf.elte.hu/~ggevay/flink/plan_3_4_0_0_without_verif.txt
> (I obtained this by commenting out the line that throws the exception)
> The code is here:
> https://github.com/ggevay/flink/tree/plan-generation-bug
> The class to run is "Solver". It needs a command line argument, which is a 
> directory where it would write output. (On first run, it generates some 
> lookuptables for a few minutes, which are then placed to /tmp/movegen)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (FLINK-2662) CompilerException: "Bug: Plan generation for Unions picked a ship strategy between binary plan operators."

2016-09-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15464487#comment-15464487
 ] 

Lorenz Bühmann edited comment on FLINK-2662 at 9/5/16 8:53 AM:
---

[~fhueske] I attached a file. It contains (hopefully) a minimal example that 
leads to the reported exception. While keeping it as minimal as possible, the 
whole logic behind the program got lost, so please don't think about its 
meaning.



> CompilerException: "Bug: Plan generation for Unions picked a ship strategy 
> between binary plan operators."
> --
>
> Key: FLINK-2662
> URL: https://issues.apache.org/jira/browse/FLINK-2662
> Project: Flink
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 0.9.1, 0.10.0
>Reporter: Gabor Gevay
>Priority: Critical
> Fix For: 1.0.0
>
> Attachments: FlinkBug.scala
>
>
> I have a Flink program which throws the exception in the jira title. Full 
> text:
> Exception in thread "main" org.apache.flink.optimizer.CompilerException: Bug: 
> Plan generation for Unions picked a ship strategy between binary plan 
> operators.
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.collect(BinaryUnionReplacer.java:113)
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.postVisit(BinaryUnionReplacer.java:72)
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.postVisit(BinaryUnionReplacer.java:41)
>   at 
> org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:170)
>   at 
> org.apache.flink.optimizer.plan.SingleInputPlanNode.accept(SingleInputPlanNode.java:199)
>   at 
> org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:163)
>   at 
> org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:163)
>   at 
> org.apache.flink.optimizer.plan.WorksetIterationPlanNode.acceptForStepFunction(WorksetIterationPlanNode.java:194)
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.preVisit(BinaryUnionReplacer.java:49)
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.preVisit(BinaryUnionReplacer.java:41)
>   at 
> org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:162)
>   at 
> org.apache.flink.optimizer.plan.SingleInputPlanNode.accept(SingleInputPlanNode.java:199)
>   at 
> org.apache.flink.optimizer.plan.SingleInputPlanNode.accept(SingleInputPlanNode.java:199)
>   at 
> org.apache.flink.optimizer.plan.OptimizedPlan.accept(OptimizedPlan.java:127)
>   at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:520)
>   at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:402)
>   at 
> org.apache.flink.client.LocalExecutor.getOptimizerPlanAsJSON(LocalExecutor.java:202)
>   at 
> org.apache.flink.api.java.LocalEnvironment.getExecutionPlan(LocalEnvironment.java:63)
>   at malom.Solver.main(Solver.java:66)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
> The execution plan:
> http://compalg.inf.elte.hu/~ggevay/flink/plan_3_4_0_0_without_verif.txt
> (I obtained this by commenting out the line that throws the exception)
> The code is here:
> https://github.com/ggevay/flink/tree/plan-generation-bug
> The class to run is "Solver". It needs a command line argument, which is a 
> directory where it would write output. (On first run, it generates some 
> lookuptables for a few minutes, which are then placed to /tmp/movegen)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (FLINK-2662) CompilerException: "Bug: Plan generation for Unions picked a ship strategy between binary plan operators."

2016-09-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15464487#comment-15464487
 ] 

Lorenz Bühmann edited comment on FLINK-2662 at 9/5/16 8:55 AM:
---

[~fhueske] I attached a file. It (hopefully) contains a minimal example that 
leads to the reported exception. While keeping it as minimal as possible, the 
whole logic behind the program got lost, so please don't think about its 
meaning. The Flink version used was 1.1.0, via Maven.


was (Author: lorenzb):
[~fhueske] I attached a file. It contains (hopefully) a minimal example that 
leads to the reported exception. While keeping it as minimal as possible, the 
whole logic behind the program got lost - so please don't think about it's 
meaning.

> CompilerException: "Bug: Plan generation for Unions picked a ship strategy 
> between binary plan operators."
> --
>
> Key: FLINK-2662
> URL: https://issues.apache.org/jira/browse/FLINK-2662
> Project: Flink
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 0.9.1, 0.10.0
>Reporter: Gabor Gevay
>Priority: Critical
> Fix For: 1.0.0
>
> Attachments: FlinkBug.scala
>
>
> I have a Flink program which throws the exception in the jira title. Full 
> text:
> Exception in thread "main" org.apache.flink.optimizer.CompilerException: Bug: 
> Plan generation for Unions picked a ship strategy between binary plan 
> operators.
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.collect(BinaryUnionReplacer.java:113)
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.postVisit(BinaryUnionReplacer.java:72)
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.postVisit(BinaryUnionReplacer.java:41)
>   at 
> org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:170)
>   at 
> org.apache.flink.optimizer.plan.SingleInputPlanNode.accept(SingleInputPlanNode.java:199)
>   at 
> org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:163)
>   at 
> org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:163)
>   at 
> org.apache.flink.optimizer.plan.WorksetIterationPlanNode.acceptForStepFunction(WorksetIterationPlanNode.java:194)
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.preVisit(BinaryUnionReplacer.java:49)
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.preVisit(BinaryUnionReplacer.java:41)
>   at 
> org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:162)
>   at 
> org.apache.flink.optimizer.plan.SingleInputPlanNode.accept(SingleInputPlanNode.java:199)
>   at 
> org.apache.flink.optimizer.plan.SingleInputPlanNode.accept(SingleInputPlanNode.java:199)
>   at 
> org.apache.flink.optimizer.plan.OptimizedPlan.accept(OptimizedPlan.java:127)
>   at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:520)
>   at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:402)
>   at 
> org.apache.flink.client.LocalExecutor.getOptimizerPlanAsJSON(LocalExecutor.java:202)
>   at 
> org.apache.flink.api.java.LocalEnvironment.getExecutionPlan(LocalEnvironment.java:63)
>   at malom.Solver.main(Solver.java:66)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
> The execution plan:
> http://compalg.inf.elte.hu/~ggevay/flink/plan_3_4_0_0_without_verif.txt
> (I obtained this by commenting out the line that throws the exception)
> The code is here:
> https://github.com/ggevay/flink/tree/plan-generation-bug
> The class to run is "Solver". It needs a command line argument, which is a 
> directory where it would write output. (On first run, it generates some 
> lookuptables for a few minutes, which are then placed to /tmp/movegen)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1707) Add an Affinity Propagation Library Method

2016-09-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15466242#comment-15466242
 ] 

Josep Rubió commented on FLINK-1707:


Hi [~vkalavri],

I have updated the code if you want to take a look at it. I have added a 
member function to create the AP graph from an array.

I'll try to update the document this week.

https://github.com/joseprupi/flink/blob/master/flink-libraries/flink-gelly/src/main/java/org/apache/flink/graph/library/AffinityPropagation.java

Thanks!!



> Add an Affinity Propagation Library Method
> --
>
> Key: FLINK-1707
> URL: https://issues.apache.org/jira/browse/FLINK-1707
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Reporter: Vasia Kalavri
>Assignee: Josep Rubió
>Priority: Minor
>  Labels: requires-design-doc
> Attachments: Binary_Affinity_Propagation_in_Flink_design_doc.pdf
>
>
> This issue proposes adding an implementation of the Affinity Propagation 
> algorithm as a Gelly library method and a corresponding example.
> The algorithm is described in paper [1] and a description of a vertex-centric 
> implementation can be found in [2].
> [1]: http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf
> [2]: http://event.cwi.nl/grades2014/00-ching-slides.pdf
> Design doc:
> https://docs.google.com/document/d/1QULalzPqMVICi8jRVs3S0n39pell2ZVc7RNemz_SGA4/edit?usp=sharing
> Example spreadsheet:
> https://docs.google.com/spreadsheets/d/1CurZCBP6dPb1IYQQIgUHVjQdyLxK0JDGZwlSXCzBcvA/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (FLINK-1707) Add an Affinity Propagation Library Method

2016-09-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425389#comment-15425389
 ] 

Josep Rubió edited comment on FLINK-1707 at 9/8/16 3:49 AM:


Hi [~vkalavri],

I changed the implementation, now: 

* The convergence condition changes to local modifications to the vertices 
that decide to be exemplars, instead of no modifications in messages. This 
should remain the same for a certain number of steps, defined in the 
constructor
* The values hash map is no longer needed
* If a damping factor different from 0 is provided to the AP constructor, the 
oldValues HashMap will be populated and messages will be damped
* If a damping factor of 0 is provided to the constructor, the oldValues 
HashMap will not be created

I will update the document

Thanks!


was (Author: joseprupi):
Hi [~vkalavri],

I changed the implementation, which now has these members for the E vertices:

private HashMap weights;
private HashMap oldValues;
private long exemplar;
private int convergenceFactor;
private int convergenceFactorCounter;

where before it was:

private HashMap weights;
private HashMap values;
private HashMap oldValues;
private long exemplar;

So:

* The values hash map is no longer needed
* If a damping factor different from 0 is provided to the AP constructor, the 
oldValues HashMap will be populated and the algorithm will work the same way 
as before. 
* If a damping factor of 0 is provided to the constructor:
-- The convergence condition changes to local modifications to the 
vertices that decide to be exemplars, instead of no modifications in messages. 
This should remain the same for a certain number of steps, defined in the 
constructor
-- There is no damping factor applied to messages
-- The oldValues HashMap is not used

Now I need to change the initialization of the graph. I will post the question 
to the dev group and see if someone can help me do it the best way. Once I 
finish, I will clean and comment the code, update the document and submit a 
new pull request.

Thanks!

> Add an Affinity Propagation Library Method
> --
>
> Key: FLINK-1707
> URL: https://issues.apache.org/jira/browse/FLINK-1707
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Reporter: Vasia Kalavri
>Assignee: Josep Rubió
>Priority: Minor
>  Labels: requires-design-doc
> Attachments: Binary_Affinity_Propagation_in_Flink_design_doc.pdf
>
>
> This issue proposes adding an implementation of the Affinity Propagation 
> algorithm as a Gelly library method and a corresponding example.
> The algorithm is described in paper [1] and a description of a vertex-centric 
> implementation can be found in [2].
> [1]: http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf
> [2]: http://event.cwi.nl/grades2014/00-ching-slides.pdf
> Design doc:
> https://docs.google.com/document/d/1QULalzPqMVICi8jRVs3S0n39pell2ZVc7RNemz_SGA4/edit?usp=sharing
> Example spreadsheet:
> https://docs.google.com/spreadsheets/d/1CurZCBP6dPb1IYQQIgUHVjQdyLxK0JDGZwlSXCzBcvA/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2662) CompilerException: "Bug: Plan generation for Unions picked a ship strategy between binary plan operators."

2016-09-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15473506#comment-15473506
 ] 

Lorenz Bühmann commented on FLINK-2662:
---

No problem, just let me know if you need something more [~fhueske]

> CompilerException: "Bug: Plan generation for Unions picked a ship strategy 
> between binary plan operators."
> --
>
> Key: FLINK-2662
> URL: https://issues.apache.org/jira/browse/FLINK-2662
> Project: Flink
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 0.9.1, 0.10.0
>Reporter: Gabor Gevay
>Priority: Critical
> Fix For: 1.0.0
>
> Attachments: FlinkBug.scala
>
>
> I have a Flink program which throws the exception in the jira title. Full 
> text:
> Exception in thread "main" org.apache.flink.optimizer.CompilerException: Bug: 
> Plan generation for Unions picked a ship strategy between binary plan 
> operators.
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.collect(BinaryUnionReplacer.java:113)
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.postVisit(BinaryUnionReplacer.java:72)
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.postVisit(BinaryUnionReplacer.java:41)
>   at 
> org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:170)
>   at 
> org.apache.flink.optimizer.plan.SingleInputPlanNode.accept(SingleInputPlanNode.java:199)
>   at 
> org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:163)
>   at 
> org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:163)
>   at 
> org.apache.flink.optimizer.plan.WorksetIterationPlanNode.acceptForStepFunction(WorksetIterationPlanNode.java:194)
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.preVisit(BinaryUnionReplacer.java:49)
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.preVisit(BinaryUnionReplacer.java:41)
>   at 
> org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:162)
>   at 
> org.apache.flink.optimizer.plan.SingleInputPlanNode.accept(SingleInputPlanNode.java:199)
>   at 
> org.apache.flink.optimizer.plan.SingleInputPlanNode.accept(SingleInputPlanNode.java:199)
>   at 
> org.apache.flink.optimizer.plan.OptimizedPlan.accept(OptimizedPlan.java:127)
>   at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:520)
>   at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:402)
>   at 
> org.apache.flink.client.LocalExecutor.getOptimizerPlanAsJSON(LocalExecutor.java:202)
>   at 
> org.apache.flink.api.java.LocalEnvironment.getExecutionPlan(LocalEnvironment.java:63)
>   at malom.Solver.main(Solver.java:66)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
> The execution plan:
> http://compalg.inf.elte.hu/~ggevay/flink/plan_3_4_0_0_without_verif.txt
> (I obtained this by commenting out the line that throws the exception)
> The code is here:
> https://github.com/ggevay/flink/tree/plan-generation-bug
> The class to run is "Solver". It needs a command line argument, which is a 
> directory where it would write output. (On first run, it generates some 
> lookuptables for a few minutes, which are then placed to /tmp/movegen)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4613) Extend ALS to handle implicit feedback datasets

2016-09-12 Thread JIRA
Gábor Hermann created FLINK-4613:


 Summary: Extend ALS to handle implicit feedback datasets
 Key: FLINK-4613
 URL: https://issues.apache.org/jira/browse/FLINK-4613
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Gábor Hermann
Assignee: Gábor Hermann


The Alternating Least Squares implementation should be extended to handle 
_implicit feedback_ datasets. These datasets do not contain explicit ratings 
by users; they are rather built by collecting user behavior (e.g. a user 
listened to artist X for Y minutes), and they require a slightly different 
optimization objective. See details by [Hu et 
al|http://dx.doi.org/10.1109/ICDM.2008.22].

We do not need to modify much in the original ALS algorithm. See [Spark ALS 
implementation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala],
 which could be a basis for this extension. Only the factor-updating part is 
modified, and most of the changes are in the local parts of the algorithm 
(i.e. UDFs). In fact, the only modification that is not local is precomputing 
the matrix product Y^T * Y and broadcasting it to all the nodes, which we can 
do with broadcast DataSets. 
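The non-local step mentioned above, precomputing Y^T * Y, is a small dense product: for an n x k factor matrix the result is only k x k, which is why broadcasting it to all nodes is cheap. A plain-Java sketch of the computation (illustrative only, not Spark's or Flink's API):

```java
public class GramMatrix {
    // Compute Y^T * Y for an n x k factor matrix Y.
    // Each row contributes the outer product row * row^T, so the
    // result accumulates to a k x k matrix regardless of n.
    public static double[][] compute(double[][] y) {
        int k = y[0].length;
        double[][] g = new double[k][k];
        for (double[] row : y) {
            for (int i = 0; i < k; i++) {
                for (int j = 0; j < k; j++) {
                    g[i][j] += row[i] * row[j];
                }
            }
        }
        return g;
    }
}
```

In the distributed setting, each partition could compute its partial sum over its rows and the partials would then be reduced and broadcast.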



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1707) Add an Affinity Propagation Library Method

2016-09-13 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15488555#comment-15488555
 ] 

Josep Rubió commented on FLINK-1707:


Hi [~vkalavri],

I'm not sure I understand what you mean. Will the matrix not need to be read 
anyway?

Would providing a DataSet of Tuple3 (id1, id2, similarity) and creating the 
vertices and edges through transformations work? The function should check 
that no duplicates exist and fill non-existing similarities with a value of 0.
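The Tuple3-based construction suggested here could be sketched as follows (plain Java showing only the deduplicate-and-fill logic, not the Gelly/DataSet API; all names are hypothetical):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class SimilarityEdges {

    /** One (source, target, similarity) triple. */
    public static class Edge {
        public final long src;
        public final long dst;
        public final double similarity;

        public Edge(long src, long dst, double similarity) {
            this.src = src;
            this.dst = dst;
            this.similarity = similarity;
        }
    }

    /**
     * Expand a sparse list of triples into the full edge set an AP graph
     * needs: reject duplicate pairs and fill absent pairs with similarity 0.
     */
    public static List<Edge> expand(List<Edge> input) {
        Map<String, Double> seen = new HashMap<>();
        TreeSet<Long> ids = new TreeSet<>();
        for (Edge e : input) {
            String key = e.src + "->" + e.dst;
            if (seen.putIfAbsent(key, e.similarity) != null) {
                throw new IllegalArgumentException("duplicate pair " + key);
            }
            ids.add(e.src);
            ids.add(e.dst);
        }
        List<Edge> result = new ArrayList<>();
        for (long a : ids) {
            for (long b : ids) {
                if (a != b) {
                    Double s = seen.get(a + "->" + b);
                    result.add(new Edge(a, b, s == null ? 0.0 : s));
                }
            }
        }
        return result;
    }
}
```

In Gelly the same idea would presumably be expressed with DataSet transformations rather than in-memory maps, so the data never has to be collected on one node.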

> Add an Affinity Propagation Library Method
> --
>
> Key: FLINK-1707
> URL: https://issues.apache.org/jira/browse/FLINK-1707
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Reporter: Vasia Kalavri
>Assignee: Josep Rubió
>Priority: Minor
>  Labels: requires-design-doc
> Attachments: Binary_Affinity_Propagation_in_Flink_design_doc.pdf
>
>
> This issue proposes adding an implementation of the Affinity Propagation 
> algorithm as a Gelly library method and a corresponding example.
> The algorithm is described in paper [1] and a description of a vertex-centric 
> implementation can be found in [2].
> [1]: http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf
> [2]: http://event.cwi.nl/grades2014/00-ching-slides.pdf
> Design doc:
> https://docs.google.com/document/d/1QULalzPqMVICi8jRVs3S0n39pell2ZVc7RNemz_SGA4/edit?usp=sharing
> Example spreadsheet:
> https://docs.google.com/spreadsheets/d/1CurZCBP6dPb1IYQQIgUHVjQdyLxK0JDGZwlSXCzBcvA/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4632) when yarn nodemanager lost, flink hung

2016-09-18 Thread JIRA
刘喆 created FLINK-4632:
-

 Summary: when yarn nodemanager lost,  flink hung
 Key: FLINK-4632
 URL: https://issues.apache.org/jira/browse/FLINK-4632
 Project: Flink
  Issue Type: Bug
  Components: Distributed Coordination, Streaming
Affects Versions: 1.1.2, 1.2.0
 Environment: cdh5.5.1  jdk1.7 flink1.1.2  1.2-snapshot   kafka0.8.2
Reporter: 刘喆


When running Flink streaming on YARN, using Kafka as the source, it runs well 
at start. But after a long run, for example 8 hours and 60,000,000+ messages 
processed, it hangs: no messages are consumed and one TaskManager is 
CANCELING. The exception shows:

org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: 
connection timeout
at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.exceptionCaught(PartitionRequestClientHandler.java:152)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at 
io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at 
io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at 
io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:79)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at 
io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:835)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:87)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:162)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: 连接超时 (connection timed out)
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at 
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at 
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
... 6 more


After applying https://issues.apache.org/jira/browse/FLINK-4625,
it shows:

java.lang.Exception: TaskManager was lost/killed: 
ResourceID{resourceId='container_1471620986643_744852_01_001400'} @ 
38.slave.adh (dataPort=45349)
at 
org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:162)
at 
org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
at 
org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:138)
at 
org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
at 
org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:224)
at 
org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1054)
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:458)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

[jira] [Commented] (FLINK-4632) when yarn nodemanager lost, flink hung

2016-09-19 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15505243#comment-15505243
 ] 

刘喆 commented on FLINK-4632:
---

Yes. The job's status is CANCELING, then it hangs. The web page is OK and the 
client process is OK, but the Kafka consumers are canceled. I ran 700 mappers: 
698 are canceled and 2 are still canceling.

> when yarn nodemanager lost,  flink hung
> ---
>
> Key: FLINK-4632
> URL: https://issues.apache.org/jira/browse/FLINK-4632
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination, Streaming
>Affects Versions: 1.2.0, 1.1.2
> Environment: cdh5.5.1  jdk1.7 flink1.1.2  1.2-snapshot   kafka0.8.2
>Reporter: 刘喆
>
> When running Flink streaming on YARN, using Kafka as the source, it runs well at 
> start. But after a long run, for example 8 hours, after dealing with 60,000,000+ 
> messages, it hung: no messages consumed, one TaskManager was CANCELING, and the 
> exception showed:
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: 
> connection timeout
>   at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.exceptionCaught(PartitionRequestClientHandler.java:152)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>   at 
> io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>   at 
> io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>   at 
> io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:79)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:835)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:87)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:162)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: 连接超时 (connection timed out)
>   at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>   at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>   at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
>   at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>   ... 6 more
> after applyhttps://issues.apache.org/jira/browse/FLINK-4625   
> it show:
> java.lang.Exception: TaskManager was lost/killed: 
> ResourceID{resourceId='container_1471620986643_744852_01_001400'} @ 
> 38.slave.adh (dataPort=45349)
>   at 
> org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:162)
>   at 
> org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)

[jira] [Comment Edited] (FLINK-4632) when yarn nodemanager lost, flink hung

2016-09-19 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15505243#comment-15505243
 ] 

刘喆 edited comment on FLINK-4632 at 9/20/16 1:26 AM:


Yes. The job's status is CANCELING, then it hangs. The web page is OK and the 
client process is OK, but the Kafka consumers are canceled. I ran 700 mappers: 
698 are canceled and 2 are still canceling.
When I logged on to the YARN NodeManager, it was still running. Some related log lines are below:

2016-09-11 22:21:17,061 INFO  
org.apache.flink.streaming.runtime.tasks.StreamTask   - State backend 
is set to heap memory (checkpoint to jobmanager)
2016-09-12 02:40:10,357 INFO  org.apache.flink.yarn.YarnTaskManagerRunner   
- RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2016-09-12 02:40:10,533 INFO  org.apache.flink.runtime.blob.BlobCache   
- Shutting down BlobCache
2016-09-12 02:40:10,821 INFO  
org.apache.flink.runtime.io.disk.iomanager.IOManager  - I/O manager 
removed spill file directory 
/data1/yarn/usercache/spa/appcache/application_1471620986643_563465/flink-io-ec6db7ad-e263-49fe-84a6-3dce983dc033
2016-09-12 02:40:10,895 INFO  
org.apache.flink.runtime.io.disk.iomanager.IOManager  - I/O manager 
removed spill file directory 
/data2/yarn/usercache/spa/appcache/application_1471620986643_563465/flink-io-4efbde55-1f64-44d2-87b8-85ed686c73c9

On the ResourceManager, some log lines are below:
[r...@4.master.adh hadooplogs]# grep _1471620986643_563465_01_90 
yarn-yarn-resourcemanager-4.master.adh.log.31
2016-09-12 02:40:07,988 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
container_1471620986643_563465_01_90 Container Transitioned from RUNNING to 
KILLED
2016-09-12 02:40:07,988 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: 
Completed container: container_1471620986643_563465_01_90 in state: KILLED 
event:KILL
2016-09-12 02:40:07,988 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=spa  
OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS  
APPID=application_1471620986643_563465  
CONTAINERID=container_1471620986643_563465_01_90
2016-09-12 02:40:07,988 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released 
container container_1471620986643_563465_01_90 of capacity  on host 105.slave.adh:39890, which currently has 2 containers, 
 used and  available, release 
resources=true
2016-09-12 02:40:07,988 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
Application attempt appattempt_1471620986643_563465_01 released container 
container_1471620986643_563465_01_90 on node: host: 105.slave.adh:39890 
#containers=2 available=45904 used=4096 with event: KILL


was (Author: liuzhe):
Yes. The job's status is canceling, then hung. Web page is ok,  client process 
is ok, but  kafka consumer canceled. I invoke 700 mapper,  698 canceled,  2 is 
canceling.


[jira] [Commented] (FLINK-4632) when yarn nodemanager lost, flink hung

2016-09-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15508441#comment-15508441
 ] 

刘喆 commented on FLINK-4632:
---

The container is killed for one of two reasons:
1. YARN preemption
2. The YARN NodeManager becomes unhealthy and kills all its Java children

I killed one TaskManager manually; sometimes this reproduces the hang, but 
sometimes the job restarts successfully.


[jira] [Created] (FLINK-4652) Don't pass credentials explicitly to AmazonClient - use credentials provider instead

2016-09-21 Thread JIRA
Kristian Frøhlich Hansen created FLINK-4652:
---

 Summary: Don't pass credentials explicitly to AmazonClient - use 
credentials provider instead
 Key: FLINK-4652
 URL: https://issues.apache.org/jira/browse/FLINK-4652
 Project: Flink
  Issue Type: Bug
  Components: Kinesis Connector
Affects Versions: 1.2.0
Reporter: Kristian Frøhlich Hansen
Priority: Minor
 Fix For: 1.2.0


By passing the credentials explicitly, we become responsible for checking and 
refreshing them before they expire. If no refresh is done, we will encounter 
AmazonServiceException: 'The security token included in the request is 
expired'. To get automatic credential refresh, pass the AWSCredentialsProvider 
directly to AmazonClient by removing the getCredentials() call.
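The difference can be sketched with simplified stand-ins for the real AWS SDK types (the classes below are hypothetical, not the actual `AmazonKinesisClient` or `AWSCredentialsProvider` API): a client built from a one-time `getCredentials()` snapshot keeps signing with a stale token, while a client holding the provider picks up refreshed credentials on every request.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ProviderVsSnapshot {
    // Simplified stand-in for AWSCredentials.
    static class Credentials {
        final String token;
        Credentials(String token) { this.token = token; }
    }

    // Simplified stand-in for AWSCredentialsProvider: may refresh on every call.
    interface CredentialsProvider {
        Credentials getCredentials();
    }

    // A provider that issues a new token on every lookup, modelling refresh.
    static class RefreshingProvider implements CredentialsProvider {
        private final AtomicInteger generation = new AtomicInteger();
        public Credentials getCredentials() {
            return new Credentials("token-" + generation.incrementAndGet());
        }
    }

    // Client built from a one-time snapshot: its token can never change,
    // so it eventually signs requests with an expired token.
    static class SnapshotClient {
        private final Credentials fixed;
        SnapshotClient(Credentials creds) { this.fixed = creds; }
        String sign() { return fixed.token; }
    }

    // Client holding the provider: asks for (possibly refreshed) credentials
    // on every request, which is what passing the provider to the client buys us.
    static class ProviderClient {
        private final CredentialsProvider provider;
        ProviderClient(CredentialsProvider provider) { this.provider = provider; }
        String sign() { return provider.getCredentials().token; }
    }

    public static void main(String[] args) {
        RefreshingProvider provider = new RefreshingProvider();
        SnapshotClient snapshot = new SnapshotClient(provider.getCredentials());
        ProviderClient live = new ProviderClient(provider);

        System.out.println(snapshot.sign().equals(snapshot.sign())); // true: stale token reused
        System.out.println(live.sign().equals(live.sign()));         // false: token refreshed each call
    }
}
```

This is only an illustration of the pattern; the actual fix is to hand the SDK's `AWSCredentialsProvider` to the client constructor unchanged.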



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-4663) Flink JDBCOutputFormat logs wrong WARN message

2016-09-22 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-4663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Márton Balassi updated FLINK-4663:
--
Assignee: Swapnil Chougule

> Flink JDBCOutputFormat logs wrong WARN message
> --
>
> Key: FLINK-4663
> URL: https://issues.apache.org/jira/browse/FLINK-4663
> Project: Flink
>  Issue Type: Bug
>  Components: Batch Connectors and Input/Output Formats
>Affects Versions: 1.1.1, 1.1.2
> Environment: Across Platform
>Reporter: Swapnil Chougule
>Assignee: Swapnil Chougule
> Fix For: 1.1.3
>
>
> Flink JDBCOutputFormat logs the wrong WARN message:
> "Column SQL types array doesn't match arity of passed Row! Check the passed 
> array..."
> even when there is no mismatch between the SQL types array and the arity of 
> the passed Row.
> (flink/flink-batch-connectors/flink-jdbc/src/main/java/org/apache/flink/api/java/io/jdbc/JDBCOutputFormat.java)
> It logs a lot of unnecessary warning messages (one per row passed) in log files.
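A minimal sketch of the guard that the WARN should sit behind (the field and method names here are hypothetical stand-ins, not the actual JDBCOutputFormat code): the warning should fire only when the configured SQL types array and the row's arity actually disagree, rather than once per row unconditionally.

```java
public class ArityCheck {
    // Hypothetical stand-in for JDBCOutputFormat's configured column types.
    static int[] typesArray = {java.sql.Types.INTEGER, java.sql.Types.VARCHAR};

    // Returns true when the WARN should be logged: only on a real mismatch
    // between the configured types and the number of fields in the row.
    static boolean shouldWarn(Object[] row) {
        return typesArray != null
                && typesArray.length > 0
                && typesArray.length != row.length;
    }

    public static void main(String[] args) {
        System.out.println(shouldWarn(new Object[]{1, "a"}));      // false: arity matches
        System.out.println(shouldWarn(new Object[]{1, "a", 2.0})); // true: 3 fields vs 2 types
    }
}
```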





[jira] [Commented] (FLINK-4632) when yarn nodemanager lost, flink hung

2016-09-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15516530#comment-15516530
 ] 

刘喆 commented on FLINK-4632:
---

It is (2).
I set up a new, idle Hadoop YARN cluster and used the source from GitHub 
(September 22), and the application ran well there, so the order of events must 
be: kill the JVM (YARN preemption or an unhealthy node), then cancel the application.
But when I run it on a busy YARN cluster, the problem comes up again.


[jira] [Commented] (FLINK-3026) Publish the flink docker container to the docker registry

2017-03-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15903272#comment-15903272
 ] 

Ismaël Mejía commented on FLINK-3026:
-

Great, taking a look right now. I will rebase my PR, adding the fix I see there. 
Can I push directly to the docker-flink repo?

> Publish the flink docker container to the docker registry
> -
>
> Key: FLINK-3026
> URL: https://issues.apache.org/jira/browse/FLINK-3026
> Project: Flink
>  Issue Type: Task
>  Components: Build System, Docker
>Reporter: Omer Katz
>Assignee: Patrick Lucas
>  Labels: Deployment, Docker
>
> There's a dockerfile that can be used to build a docker container already in 
> the repository. It'd be awesome to just be able to pull it instead of 
> building it ourselves.
> The dockerfile can be found at 
> https://github.com/apache/flink/tree/master/flink-contrib/docker-flink
> It also doesn't point to the latest version of Flink which I fixed in 
> https://github.com/apache/flink/pull/1366





[jira] [Commented] (FLINK-3026) Publish the flink docker container to the docker registry

2017-03-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15906513#comment-15906513
 ] 

Ismaël Mejía commented on FLINK-3026:
-

OK, we are progressing fast, that's great. I agree with you; I am now 100% 
convinced that those files should not live in the Apache repo, but I still want 
to guarantee that the official image does not differ radically from the 
reference one in the apache repo.

My concrete proposal is that we (the Docker image maintainers) guarantee that 
future changes flow from the apache/flink repo towards the docker-flink repo. 
The refactorings I made in the docker-flink repo go in this direction; they are 
mostly there to make this job easier (although currently they run ahead of 
apache/flink, since they include your unmerged changes and my non-Alpine version).

I agree with you that in the future these will be minor changes: looking at the 
history of the image, we have had roughly one change per release plus some 
small fixes in between, so changing the template should be rare.






[jira] [Commented] (FLINK-3026) Publish the flink docker container to the docker registry

2017-03-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15906514#comment-15906514
 ] 

Ismaël Mejía commented on FLINK-3026:
-

One thing we need to discuss is the inclusion of the start-foreground scripts 
(by the way, sorry if I removed that entrypoint from the docker-flink repo). I 
think we must not change any file of the release, because that can become a 
maintenance burden in the future; I prefer to make the smallest possible set of 
changes. Concretely, for the start-foreground case, I prefer to include this 
only once it is part of an official distribution, e.g. by backporting it to 
Flink 1.2.1, or by starting to support it once Flink 1.3 is released.






[jira] [Commented] (FLINK-3026) Publish the flink docker container to the docker registry

2017-03-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15906515#comment-15906515
 ] 

Ismaël Mejía commented on FLINK-3026:
-

I saw the images you built; it is probably a better idea to enable automated 
builds for the repo, so the images are rebuilt whenever we push changes to 
GitHub. Have you tried this?

I also have another proposal: it may be better to create a GitHub organization 
for this and move the repo there, ending up with something like 
docker-flink/docker-flink instead of your personal account or the dataArtisans 
one. That way we can share the maintenance, and I or another future maintainer 
can also trigger the builds from Docker Hub, given the permissions. Currently I 
can't trigger the docker-flink builds, because Docker Hub only allows builds 
from your own account and the organizations you belong to.







[jira] [Commented] (FLINK-6003) Add docker image based on the openjdk one (debian+alpine)

2017-03-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15940298#comment-15940298
 ] 

Ismaël Mejía commented on FLINK-6003:
-

Just for reference, the current work on this JIRA is on hold because, for the 
moment, we are prioritizing the official Docker image work ongoing in FLINK-3026:
https://github.com/docker-flink/docker-flink/

We will probably backport changes here once they are stable in the other repo.

> Add docker image based on the openjdk one (debian+alpine)
> -
>
> Key: FLINK-6003
> URL: https://issues.apache.org/jira/browse/FLINK-6003
> Project: Flink
>  Issue Type: Improvement
>  Components: Build System, Docker
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Minor
>
> The base docker image 'java' is deprecated since the end of the last year.
> This issue will also re-introduce a Dockerfile for the debian-based openjdk 
> one in addition to the current one based on alpine.
> Additional refactorings included:
> - Move default version to 1.2.0
> - Refactor Dockerfile for consistency reasons (between both versions)





[jira] [Created] (FLINK-6203) DataSet Transformations

2017-03-28 Thread JIRA
苏拓 created FLINK-6203:
-

 Summary: DataSet Transformations
 Key: FLINK-6203
 URL: https://issues.apache.org/jira/browse/FLINK-6203
 Project: Flink
  Issue Type: Bug
  Components: DataSet API
Affects Versions: 1.2.0
Reporter: 苏拓
Priority: Minor
 Fix For: 1.2.0


The example of GroupReduce on sorted groups cannot remove duplicate Strings 
from a DataSet: it is missing the statement "prev = t", which must run for 
every element. The corrected example:

val output = input.groupBy(0).sortGroup(1, Order.ASCENDING).reduceGroup {
  (in, out: Collector[(Int, String)]) =>
    var prev: (Int, String) = null
    for (t <- in) {
      if (prev == null || prev != t) {
        out.collect(t)
      }
      prev = t  // must execute for every element, not only when collected
    }
}
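The effect of advancing `prev` on every element can be checked with a small standalone sketch of the same dedup-on-sorted-group logic (plain Java, no Flink dependency; it assumes the group is already sorted, as sortGroup guarantees):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SortedGroupDedup {
    // Emits each element of a sorted group once, mirroring the corrected
    // reduceGroup body: collect when different from the previous element,
    // and ALWAYS advance prev afterwards.
    static List<String> dedup(List<String> sortedGroup) {
        List<String> out = new ArrayList<>();
        String prev = null;
        for (String t : sortedGroup) {
            if (prev == null || !prev.equals(t)) {
                out.add(t);
            }
            prev = t; // runs unconditionally; otherwise duplicates survive
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> group = Arrays.asList("a", "a", "b", "b", "b", "c");
        System.out.println(dedup(group)); // [a, b, c]
    }
}
```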






[jira] [Assigned] (FLINK-4900) Implement Docker image support

2016-11-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mischa Krüger reassigned FLINK-4900:


Assignee: Mischa Krüger  (was: Eron Wright )

> Implement Docker image support
> --
>
> Key: FLINK-4900
> URL: https://issues.apache.org/jira/browse/FLINK-4900
> Project: Flink
>  Issue Type: Sub-task
>  Components: Cluster Management
>Reporter: Eron Wright 
>Assignee: Mischa Krüger
>
> Support the use of a docker image, with both the unified containerizer and 
> the Docker containerizer.
> Use a configuration setting to explicitly configure which image and 
> containerizer to use.





[jira] [Updated] (FLINK-4900) Implement Docker image support

2016-11-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mischa Krüger updated FLINK-4900:
-
Labels: release  (was: )






[jira] [Updated] (FLINK-4900) Implement Docker image support

2016-11-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mischa Krüger updated FLINK-4900:
-
Labels: release review  (was: release)

> Implement Docker image support
> --
>
> Key: FLINK-4900
> URL: https://issues.apache.org/jira/browse/FLINK-4900
> Project: Flink
>  Issue Type: Sub-task
>  Components: Cluster Management
>Reporter: Eron Wright 
>Assignee: Mischa Krüger
>  Labels: release, review
>
> Support the use of a docker image, with both the unified containerizer and 
> the Docker containerizer.
> Use a configuration setting to explicitly configure which image and 
> containerizer to use.





[jira] [Updated] (FLINK-4900) Implement Docker image support

2016-11-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-4900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mischa Krüger updated FLINK-4900:
-
Labels: release  (was: release review)

> Implement Docker image support
> --
>
> Key: FLINK-4900
> URL: https://issues.apache.org/jira/browse/FLINK-4900
> Project: Flink
>  Issue Type: Sub-task
>  Components: Cluster Management
>Reporter: Eron Wright 
>Assignee: Mischa Krüger
>  Labels: release
>
> Support the use of a docker image, with both the unified containerizer and 
> the Docker containerizer.
> Use a configuration setting to explicitly configure which image and 
> containerizer to use.





[jira] [Created] (FLINK-5109) Invalid Content-Encoding Header in REST API responses

2016-11-21 Thread JIRA
Móger Tibor László created FLINK-5109:
-

 Summary: Invalid Content-Encoding Header in REST API responses
 Key: FLINK-5109
 URL: https://issues.apache.org/jira/browse/FLINK-5109
 Project: Flink
  Issue Type: Bug
  Components: Web Client, Webfrontend
Affects Versions: 1.2.0
Reporter: Móger Tibor László


On REST API calls the Flink runtime responds with a Content-Encoding header 
containing the value "utf-8". According to the HTTP/1.1 standard this header 
value is invalid. ( https://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.5 ) 
Acceptable values are content codings such as gzip, compress, and deflate; 
otherwise the header should be omitted.
The invalid header may cause malfunctions in projects building against Flink.

The invalid header may be present in earlier versions as well.

Proposed solution: remove the lines in the project where the CONTENT_ENCODING 
header is set to "utf-8". (I could do this in a PR.)

Possible solution, but it may need more knowledge and skill than mine: 
introduce a real content encoding. Doing so may need some configuration, because 
Flink would then have to encode the responses properly (even paying attention to 
the request's Accept-Encoding headers).
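For illustration, the standards-compliant split is: the character set is a parameter of Content-Type, while Content-Encoding carries only content codings like gzip. A minimal Java sketch (the Headers class is hypothetical, not Flink's actual handler code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class Headers {
    // Builds response headers the RFC-compliant way: charset goes into
    // Content-Type; Content-Encoding is reserved for compression codings
    // such as gzip, or omitted entirely.
    static Map<String, String> jsonHeaders(boolean gzipped) {
        Map<String, String> h = new LinkedHashMap<>();
        h.put("Content-Type", "application/json; charset=utf-8"); // not Content-Encoding: utf-8
        if (gzipped) {
            h.put("Content-Encoding", "gzip"); // a real content coding
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(jsonHeaders(false));
        System.out.println(jsonHeaders(true));
    }
}
```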





[jira] [Updated] (FLINK-5109) Invalid Content-Encoding Header in REST API responses

2016-11-21 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Móger Tibor László updated FLINK-5109:
--
Description: 
On REST API calls the Flink runtime responds with the header Content-Encoding, 
containing the value "utf-8". According to the HTTP/1.1 standard this header is 
invalid. ( https://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.5 ) 
Possible acceptable values are: gzip, compress, deflate. Or it should be 
omitted.
The invalid header may cause malfunction in projects building against Flink.

The invalid header may be present in earlier versions as well.

Proposed solution: Remove lines from the project, where CONTENT_ENCODING header 
is set to "utf-8". (I could do this in a PR.)

Possible solution but may need further knowledge and skills than mine: 
Introduce content-encoding. Doing so may need some configuration because then 
Flink would have to encode the responses properly (even paying attention to the 
request's Accept-Encoding headers).

  was:
On REST API calls the Flink runtime responds with the header Content-Encoding, 
containing the value "utf-8". According to the HTTP/1.1 standard this header is 
invalid. ( https://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.5 ) 
Possible acceptable values are: gzip, compress, deflate. Or it should be 
omitted.
The invalid header may cause malfunction in projects building against Flink.

The invalid header may be present in earlier versions as well.

Proposed solution: Remove lines from the project, where CONTENT_ENCODING header 
is set to "utf-8". (I could do this in a PR.)

Possible solution but may need further knowledge and skills than mine: 
Introduce a content-encoding. Doing so may need some configuration because then 
Flink would have to encode the responses properly (even paying attention to the 
request's Accept-Encoding headers).


> Invalid Content-Encoding Header in REST API responses
> -
>
> Key: FLINK-5109
> URL: https://issues.apache.org/jira/browse/FLINK-5109
> Project: Flink
>  Issue Type: Bug
>  Components: Web Client, Webfrontend
>Affects Versions: 1.2.0
>Reporter: Móger Tibor László
>  Labels: http-headers, rest_api
>
> On REST API calls the Flink runtime responds with the header 
> Content-Encoding, containing the value "utf-8". According to the HTTP/1.1 
> standard this header is invalid. ( 
> https://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.5 ) 
> Possible acceptable values are: gzip, compress, deflate. Or it should be 
> omitted.
> The invalid header may cause malfunction in projects building against Flink.
> The invalid header may be present in earlier versions as well.
> Proposed solution: Remove lines from the project, where CONTENT_ENCODING 
> header is set to "utf-8". (I could do this in a PR.)
> Possible solution but may need further knowledge and skills than mine: 
> Introduce content-encoding. Doing so may need some configuration because then 
> Flink would have to encode the responses properly (even paying attention to 
> the request's Accept-Encoding headers).





[jira] [Updated] (FLINK-5109) Invalid Content-Encoding Header in REST API responses

2016-11-21 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Móger Tibor László updated FLINK-5109:
--
Affects Version/s: 1.1.0
   1.1.1
   1.1.2
   1.1.3

> Invalid Content-Encoding Header in REST API responses
> -
>
> Key: FLINK-5109
> URL: https://issues.apache.org/jira/browse/FLINK-5109
> Project: Flink
>  Issue Type: Bug
>  Components: Web Client, Webfrontend
>Affects Versions: 1.1.0, 1.2.0, 1.1.1, 1.1.2, 1.1.3
>Reporter: Móger Tibor László
>  Labels: http-headers, rest_api
>
> On REST API calls the Flink runtime responds with the header 
> Content-Encoding, containing the value "utf-8". According to the HTTP/1.1 
> standard this header is invalid. ( 
> https://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.5 ) 
> Possible acceptable values are: gzip, compress, deflate. Or it should be 
> omitted.
> The invalid header may cause malfunction in projects building against Flink.
> The invalid header may be present in earlier versions as well.
> Proposed solution: Remove lines from the project, where CONTENT_ENCODING 
> header is set to "utf-8". (I could do this in a PR.)
> Possible solution but may need further knowledge and skills than mine: 
> Introduce content-encoding. Doing so may need some configuration because then 
> Flink would have to encode the responses properly (even paying attention to 
> the request's Accept-Encoding headers).





[jira] [Created] (FLINK-5156) Consolidate streaming FieldAccessor functionality

2016-11-24 Thread JIRA
Márton Balassi created FLINK-5156:
-

 Summary: Consolidate streaming FieldAccessor functionality
 Key: FLINK-5156
 URL: https://issues.apache.org/jira/browse/FLINK-5156
 Project: Flink
  Issue Type: Task
Reporter: Márton Balassi


The streaming FieldAccessors (keyedStream.keyBy(...)) have slightly different 
semantics compared to their batch counterparts. 

Currently the streaming ones allow selecting a field within an array (which 
might be dangerous as the array typeinfo does not contain the length of the 
array, thus leading to a potential index out of bounds) and accept not only 
"*", but also "0" to select a whole type.

This functionality should be either removed or documented. The latter can be 
achieved by effectively reverting [1]. Note that said commit was squashed 
before merging.

[1] 
https://github.com/mbalassi/flink/commit/237f07eb113508703c980b14587d66970e7f6251
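The out-of-bounds hazard above can be sketched without Flink: a positional accessor over an array-typed field type-checks fine, because the array's type information carries the element type but not its length. A minimal Java illustration (ArrayAccessor is a hypothetical stand-in, not Gelly/Flink code):

```java
public class ArrayAccessor {
    // Mimics selecting a field within an array-typed field by position.
    // The index cannot be validated against the type info, so an index
    // valid for one record may be out of bounds for the next.
    static int get(int[] field, int pos) {
        return field[pos]; // may throw ArrayIndexOutOfBoundsException at runtime
    }

    public static void main(String[] args) {
        System.out.println(get(new int[]{1, 2, 3}, 2)); // 3
        try {
            get(new int[]{1, 2}, 2); // same accessor, shorter record
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("out of bounds");
        }
    }
}
```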





[jira] [Resolved] (FLINK-3702) DataStream API PojoFieldAccessor doesn't support nested POJOs

2016-11-24 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Márton Balassi resolved FLINK-3702.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Fixed via 1f04542 and 870e219.

> DataStream API PojoFieldAccessor doesn't support nested POJOs
> -
>
> Key: FLINK-3702
> URL: https://issues.apache.org/jira/browse/FLINK-3702
> Project: Flink
>  Issue Type: Improvement
>  Components: DataStream API
>Affects Versions: 1.0.0
>Reporter: Robert Metzger
>Assignee: Gabor Gevay
> Fix For: 1.2.0
>
>
> The {{PojoFieldAccessor}} (which is used by {{.sum(String)}} and similar 
> methods) doesn't support nested POJOs right now.
> As part of FLINK-3697 I'll add a check for a nested POJO and fail with an 
> exception.





[jira] [Updated] (FLINK-1707) Add an Affinity Propagation Library Method

2016-11-26 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josep Rubió updated FLINK-1707:
---
Description: 
This issue proposes adding an implementation of the Affinity Propagation 
algorithm as a Gelly library method and a corresponding example.
The algorithm is described in paper [1] and a description of a vertex-centric 
implementation can be found in [2].

[1]: http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf
[2]: http://event.cwi.nl/grades2014/00-ching-slides.pdf

Design doc:
https://docs.google.com/document/d/1QULalzPqMVICi8jRVs3S0n39pell2ZVc7RNemz_SGA4/edit?usp=sharing

Example spreadsheet:
https://docs.google.com/spreadsheets/d/1CurZCBP6dPb1IYQQIgUHVjQdyLxK0JDGZwlSXCzBcvA/edit?usp=sharing

Graph:
https://docs.google.com/drawings/d/1PC3S-6AEt2Gp_TGrSfiWzkTcL7vXhHSxvM6b9HglmtA/edit?usp=sharing

  was:
This issue proposes adding an implementation of the Affinity Propagation 
algorithm as a Gelly library method and a corresponding example.
The algorithm is described in paper [1] and a description of a vertex-centric 
implementation can be found in [2].

[1]: http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf
[2]: http://event.cwi.nl/grades2014/00-ching-slides.pdf

Design doc:
https://docs.google.com/document/d/1QULalzPqMVICi8jRVs3S0n39pell2ZVc7RNemz_SGA4/edit?usp=sharing

Example spreadsheet:
https://docs.google.com/spreadsheets/d/1CurZCBP6dPb1IYQQIgUHVjQdyLxK0JDGZwlSXCzBcvA/edit?usp=sharing


> Add an Affinity Propagation Library Method
> --
>
> Key: FLINK-1707
> URL: https://issues.apache.org/jira/browse/FLINK-1707
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Reporter: Vasia Kalavri
>Assignee: Josep Rubió
>Priority: Minor
>  Labels: requires-design-doc
> Attachments: Binary_Affinity_Propagation_in_Flink_design_doc.pdf
>
>
> This issue proposes adding an implementation of the Affinity Propagation 
> algorithm as a Gelly library method and a corresponding example.
> The algorithm is described in paper [1] and a description of a vertex-centric 
> implementation can be found in [2].
> [1]: http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf
> [2]: http://event.cwi.nl/grades2014/00-ching-slides.pdf
> Design doc:
> https://docs.google.com/document/d/1QULalzPqMVICi8jRVs3S0n39pell2ZVc7RNemz_SGA4/edit?usp=sharing
> Example spreadsheet:
> https://docs.google.com/spreadsheets/d/1CurZCBP6dPb1IYQQIgUHVjQdyLxK0JDGZwlSXCzBcvA/edit?usp=sharing
> Graph:
> https://docs.google.com/drawings/d/1PC3S-6AEt2Gp_TGrSfiWzkTcL7vXhHSxvM6b9HglmtA/edit?usp=sharing





[jira] [Commented] (FLINK-4632) when yarn nodemanager lost, flink hung

2016-12-02 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15714684#comment-15714684
 ] 

刘喆 commented on FLINK-4632:
---

It appears again.
This time I get the logs.

On JobManager:
2016-12-02 16:27:17,275 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- Sink: Unnamed 
(1/10) (ca18e99f339595bf987913f543ef3f54) switched from DEPLOYING to RUNNING
2016-12-02 16:27:17,292 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- Sink: Unnamed 
(5/10) (733ebe58c797aa6e6ed9e85004d6c201) switched from DEPLOYING to RUNNING
2016-12-02 16:27:48,523 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- Sink: Unnamed 
(9/10) (fa903c45717171026fae9997ec22b594) switched from RUNNING to FAILED
2016-12-02 16:27:48,523 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- Job Flink 
Streaming Java API Skeleton (c7b96c59cb04fd2cd096b23f85cebdb4) switched from 
state RUNNING to FAILING.
org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: 
Connection timed out
at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.exceptionCaught(PartitionRequestClientHandler.java:152)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at 
io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at 
io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at 
io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:79)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at 
io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:835)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:87)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:162)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at 
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at 
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
... 6 more
2016-12-02 16:27:48,524 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- Source: Custom 
Source -> Map (1/10) (d99ba0f0a16c31e717c8044e92e0e8c4) switched from RUNNING 
to CANCELING
2016-12-02 16:27:48,524 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- Source: Custom 
Source -> Map (2/10) (294b46febec5e49a8a859d98ceca2a9e) switched from RUNNING 
to CANCELING



on  TaskManager:
2016-12-02 16:27:18,177 INFO  org.apache.zookeeper.ClientCnxn   
- Socket connection established to 10.10.10.203/10.10.10.203:2181, 
initiating session
2016-12-02 16:27:18,183 INFO  org.apache.zookeeper.ClientCnxn   
- Session establishment complete on server 
10.10.10.203/10.10.10.203:2181, sess

[jira] [Comment Edited] (FLINK-4632) when yarn nodemanager lost, flink hung

2016-12-02 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15714684#comment-15714684
 ] 

刘喆 edited comment on FLINK-4632 at 12/2/16 10:10 AM:
-

It appears again. This time, no container was lost.

[jira] [Comment Edited] (FLINK-4632) when yarn nodemanager lost, flink hung

2016-12-02 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15714684#comment-15714684
 ] 

刘喆 edited comment on FLINK-4632 at 12/2/16 10:10 AM:
-

It appears again. This time, no container was lost, and the job restarted.


[jira] [Comment Edited] (FLINK-4632) when yarn nodemanager lost, flink hung

2016-12-02 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15714684#comment-15714684
 ] 

刘喆 edited comment on FLINK-4632 at 12/2/16 10:11 AM:
-

This time, no container was lost, and the job did not hang but restarted.


[jira] [Comment Edited] (FLINK-4632) when yarn nodemanager lost, flink hung

2016-12-02 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15714684#comment-15714684
 ] 

刘喆 edited comment on FLINK-4632 at 12/2/16 10:32 AM:
-

This time no container was lost, and the job did not hang but restarted.

flink-1.2-SNAPSHOT   commit : dbe70732434ff1396e1829080cd14f26b691489a
scala 2.11
hadoop   2.6.0-cdh5.5.1
java   jdk1.8.0_112


This time I got the logs.

On JobManager:
{noformat}
2016-12-02 16:27:17,275 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- Sink: Unnamed 
(1/10) (ca18e99f339595bf987913f543ef3f54) switched from DEPLOYING to RUNNING
2016-12-02 16:27:17,292 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- Sink: Unnamed 
(5/10) (733ebe58c797aa6e6ed9e85004d6c201) switched from DEPLOYING to RUNNING
2016-12-02 16:27:48,523 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- Sink: Unnamed 
(9/10) (fa903c45717171026fae9997ec22b594) switched from RUNNING to FAILED
2016-12-02 16:27:48,523 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- Job Flink 
Streaming Java API Skeleton (c7b96c59cb04fd2cd096b23f85cebdb4) switched from 
state RUNNING to FAILING.
org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: 
Connection timed out
at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.exceptionCaught(PartitionRequestClientHandler.java:152)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at 
io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at 
io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at 
io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:79)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at 
io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:835)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:87)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:162)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at 
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at 
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
... 6 more
2016-12-02 16:27:48,524 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- Source: Custom 
Source -> Map (1/10) (d99ba0f0a16c31e717c8044e92e0e8c4) switched from RUNNING 
to CANCELING
2016-12-02 16:27:48,524 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- Source: Custom 
Source -> Map (2/10) (294b46febec5e49a8a859d98ceca2a9e) switched from RUNNING 
to CANCELING
{noformat}


on  TaskManager:
{noformat}
2016-12-02 16:27:18,177 INFO  org.apache.zookeeper.ClientCnxn   
  

[jira] [Commented] (FLINK-4632) when yarn nodemanager lost, flink hung

2016-12-02 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15715048#comment-15715048
 ] 

刘喆 commented on FLINK-4632:
---

Thank you Ufuk Celebi.

It depends on the restart strategy.
Here I use the automatic restart strategy; after a restart the job works well for a 
while and then restarts again.
Since it runs on a busy YARN cluster, the network may sometimes be under stress. 
Flink seems to be very fragile on this YARN cluster. 
At the same time, some Spark Streaming jobs on the same cluster work well.
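
For reference, the automatic restart behaviour described above is governed by 
Flink's restart-strategy settings; a minimal flink-conf.yaml sketch (the values 
are illustrative, not taken from this cluster):

```yaml
# Fixed-delay restarts: retry a bounded number of times, pausing between
# attempts, before the job is declared failed for good.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

With a bounded attempt count, a job that keeps hitting the network timeout fails 
visibly instead of cycling through restarts indefinitely.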

> when yarn nodemanager lost,  flink hung
> ---
>
> Key: FLINK-4632
> URL: https://issues.apache.org/jira/browse/FLINK-4632
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination, Streaming
>Affects Versions: 1.2.0, 1.1.2
> Environment: cdh5.5.1  jdk1.7 flink1.1.2  1.2-snapshot   kafka0.8.2
>Reporter: 刘喆
>
> When running Flink streaming on YARN with Kafka as the source, it runs well at 
> first. But after a long run, for example 8 hours and 60,000,000+ messages 
> processed, it hangs: no messages are consumed, one TaskManager is CANCELING, and 
> the exception shows:
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: 
> connection timeout
>   at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.exceptionCaught(PartitionRequestClientHandler.java:152)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>   at 
> io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>   at 
> io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>   at 
> io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:79)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:835)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:87)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:162)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: 连接超时 (Connection timed out)
>   at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>   at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>   at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
>   at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>   ... 6 more
> after applying https://issues.apache.org/jira/browse/FLINK-4625
> it shows:
> java.lang.Exception: TaskManager was lost/killed: 
> ResourceID{resourceId='container_1471620986643_744852_01_001400'} @ 
> 38.slave.adh

[jira] [Commented] (FLINK-4632) when yarn nodemanager lost, flink hung

2016-12-02 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15715047#comment-15715047
 ] 

刘喆 commented on FLINK-4632:
---

Thank you Ufuk Celebi.

It depends on the restart strategy.
Here I use the automatic restart strategy; after a restart the job works well for a 
while and then restarts again.
Since it runs on a busy YARN cluster, the network may sometimes be under stress. 
Flink seems to be very fragile on this YARN cluster. 
At the same time, some Spark Streaming jobs on the same cluster work well.


[jira] [Issue Comment Deleted] (FLINK-4632) when yarn nodemanager lost, flink hung

2016-12-02 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

刘喆 updated FLINK-4632:
--
Comment: was deleted

(was: Thank you Ufuk Celebi.

It depends on the restart strategy.
Here I use the automatic restart strategy; after a restart the job works well for a 
while and then restarts again.
Since it runs on a busy YARN cluster, the network may sometimes be under stress. 
Flink seems to be very fragile on this YARN cluster. 
At the same time, some Spark Streaming jobs on the same cluster work well.)


[jira] [Commented] (FLINK-4300) Improve error message for different Scala versions of JM and Client

2016-12-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15721686#comment-15721686
 ] 

Sergio Fernández commented on FLINK-4300:
-

Any progress on this?

As 
[docker-flink|https://github.com/apache/flink/tree/master/flink-contrib/docker-flink]
 uses {{SCALA_VERSION=2.11}} and I locally have {{2.11.8}}, I still have these 
issues.
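
One quick check, since the flink-dist jar name carries the Scala suffix: the 
filename below is a hypothetical example, not taken from this thread; on a real 
install, inspect the jars under "$FLINK_HOME"/lib instead.

```shell
# A Flink distribution encodes the Scala version it was built for in the
# flink-dist jar name, e.g. flink-dist_2.11-1.1.3.jar (example filename).
jar="flink-dist_2.11-1.1.3.jar"
# Extract the Scala suffix between "flink-dist_" and the Flink version.
scala_version=$(printf '%s\n' "$jar" | sed -n 's/^flink-dist_\([0-9.]*\)-.*/\1/p')
echo "distribution built for Scala $scala_version"
```

If that suffix differs from the Scala version the client job was built with, a 
serialVersionUID mismatch like the one quoted below is the typical symptom.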

> Improve error message for different Scala versions of JM and Client
> ---
>
> Key: FLINK-4300
> URL: https://issues.apache.org/jira/browse/FLINK-4300
> Project: Flink
>  Issue Type: Improvement
>  Components: Client, JobManager
>Affects Versions: 1.1.0
>Reporter: Timo Walther
>
> If a user runs a job (e.g. via RemoteEnvironment) with different Scala 
> versions of JobManager and Client, the job is not executed and has no proper 
> error message. 
> The Client fails only with a meaningless warning:
> {code}
> 16:59:58,677 WARN  akka.remote.ReliableDeliverySupervisor 
>- Association with remote system [akka.tcp://flink@127.0.0.1:6123] has 
> failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> {code}
> JobManager log only contains the following warning:
> {code}
> 2016-08-01 16:59:58,664 WARN  akka.remote.ReliableDeliverySupervisor  
>   - Association with remote system 
> [akka.tcp://flink@192.168.1.142:63372] has failed, address is now gated for 
> [5000] ms. Reason is: [scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -2062608324514658839, local class 
> serialVersionUID = -114498752079829388].
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-4982) quickstart example: Can't see the output

2016-12-06 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15725705#comment-15725705
 ] 

Mischa Krüger commented on FLINK-4982:
--

Ping, still can't find the output^^
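
For what it's worth, with a standalone setup the records a job sends to print() 
usually land in the TaskManager's .out file rather than on the client console; a 
sketch of where to look (the path pattern is the usual convention, and /opt/flink 
is only an assumed default):

```shell
# Hedged sketch: print() output from tasks goes to the TaskManager's stdout
# file, typically $FLINK_HOME/log/flink-<user>-taskmanager-<n>-<host>.out.
log_dir="${FLINK_HOME:-/opt/flink}/log"
found=0
for f in "$log_dir"/*taskmanager*.out; do
  if [ -e "$f" ]; then
    echo "candidate stdout file: $f"
    found=1
  fi
done
if [ "$found" -eq 0 ]; then
  echo "no TaskManager .out files under $log_dir"
fi
```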

> quickstart example: Can't see the output
> 
>
> Key: FLINK-4982
> URL: https://issues.apache.org/jira/browse/FLINK-4982
> Project: Flink
>  Issue Type: Bug
>  Components: Quickstarts
>Reporter: Mischa Krüger
>
> When compiling and running the quickstart WordCount example on Flink, I can't 
> find the output anywhere. The Java file explicitly writes the counted words to 
> stdout; I searched (I hope) every log file, but couldn't find it.





[jira] [Created] (FLINK-5267) TaskManager logs not scrollable

2016-12-06 Thread JIRA
Mischa Krüger created FLINK-5267:


 Summary: TaskManager logs not scrollable
 Key: FLINK-5267
 URL: https://issues.apache.org/jira/browse/FLINK-5267
 Project: Flink
  Issue Type: Bug
  Components: Webfrontend
 Environment: DC/OS 1.8
Reporter: Mischa Krüger


Latest master build; I ran the quickstart example successfully and wanted to see 
the TM logs, but they can't be scrolled. Downloading the log is luckily still 
possible.





[jira] [Created] (FLINK-5298) Web UI crashes when TM log not existant

2016-12-08 Thread JIRA
Mischa Krüger created FLINK-5298:


 Summary: Web UI crashes when TM log not existant
 Key: FLINK-5298
 URL: https://issues.apache.org/jira/browse/FLINK-5298
 Project: Flink
  Issue Type: Bug
Reporter: Mischa Krüger


{code}
java.io.FileNotFoundException: flink-taskmanager.out (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at 
org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleRequestTaskManagerLog(TaskManager.scala:833)
at 
org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:340)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at 
org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at 
org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at 
org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:122)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2016-12-08 16:45:14,995 INFO  
org.apache.flink.mesos.runtime.clusterframework.MesosTaskManager  - Stopping 
TaskManager akka://flink/user/taskmanager#1361882659.
2016-12-08 16:45:14,995 INFO  
org.apache.flink.mesos.runtime.clusterframework.MesosTaskManager  - 
Disassociating from JobManager
2016-12-08 16:45:14,997 INFO  org.apache.flink.runtime.blob.BlobCache   
- Shutting down BlobCache
2016-12-08 16:45:15,006 INFO  
org.apache.flink.runtime.io.disk.iomanager.IOManager  - I/O manager 
removed spill file directory /tmp/flink-io-e61f717b-630c-4a2a-b3e3-62ccb40aa2f9
2016-12-08 16:45:15,006 INFO  
org.apache.flink.runtime.io.network.NetworkEnvironment- Shutting down 
the network environment and its components.
2016-12-08 16:45:15,008 INFO  
org.apache.flink.runtime.io.network.netty.NettyClient - Successful 
shutdown (took 1 ms).
2016-12-08 16:45:15,009 INFO  
org.apache.flink.runtime.io.network.netty.NettyServer - Successful 
shutdown (took 0 ms).
2016-12-08 16:45:15,020 INFO  
org.apache.flink.mesos.runtime.clusterframework.MesosTaskManager  - Task 
manager akka://flink/user/taskmanager is completely shut down.
2016-12-08 16:45:15,023 ERROR org.apache.flink.runtime.taskmanager.TaskManager  
- Actor akka://flink/user/taskmanager#1361882659 terminated, 
stopping process...
{code}
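
The root cause is the unguarded FileInputStream open in 
handleRequestTaskManagerLog; a hedged sketch of the missing guard (shown as shell 
for brevity, not the actual Scala fix): check for the file first and report, 
instead of letting the FileNotFoundException escalate until the actor terminates.

```shell
# flink-taskmanager.out is the file named in the stack trace above.
logfile="flink-taskmanager.out"
if [ -r "$logfile" ]; then
  msg="serving $logfile"
else
  # Report a readable error instead of failing on the open.
  msg="log unavailable: $logfile"
fi
echo "$msg"
```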





[jira] [Updated] (FLINK-5298) TaskManager crashes when TM log not existant

2016-12-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mischa Krüger updated FLINK-5298:
-
Summary: TaskManager crashes when TM log not existant  (was: Web UI crashes 
when TM log not existant)






[jira] [Commented] (FLINK-5298) TaskManager crashes when TM log not existant

2016-12-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15732728#comment-15732728
 ] 

Mischa Krüger commented on FLINK-5298:
--

This happened when the log was requested via the web UI.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-5301) Can't upload job via Web UI when using a proxy

2016-12-08 Thread JIRA
Mischa Krüger created FLINK-5301:


 Summary: Can't upload job via Web UI when using a proxy
 Key: FLINK-5301
 URL: https://issues.apache.org/jira/browse/FLINK-5301
 Project: Flink
  Issue Type: Bug
  Components: Webfrontend
Reporter: Mischa Krüger


Using DC/OS with Flink service in current development 
(https://github.com/mesosphere/dcos-flink-service). For reproduction:
1. Install a DC/OS cluster
2. Follow the instructions in the mentioned repo for setting up a universe 
server with the flink app.
3. Install the flink app via the universe
4. Access the Web UI
5. Upload a job

Experience:
The upload reaches 100%, and then says "Saving..." forever.

Upload works when using ssh forwarding to access the node directly serving the 
Flink Web UI.

DC/OS uses a proxy to access the Web UI. The webpage is delivered by a 
component called the "Admin Router".

Side note:
Interestingly, the new favicon also does not appear.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4118) The docker-flink image is outdated (1.0.2) and can be slimmed down

2016-06-24 Thread JIRA
Ismaël Mejía created FLINK-4118:
---

 Summary: The docker-flink image is outdated (1.0.2) and can be 
slimmed down
 Key: FLINK-4118
 URL: https://issues.apache.org/jira/browse/FLINK-4118
 Project: Flink
  Issue Type: Improvement
Reporter: Ismaël Mejía
Priority: Minor


This issue is to upgrade the docker image and polish some details in it (e.g. 
it can be slimmed down if we remove some unneeded dependencies, and the code 
can be polished).




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-4118) The docker-flink image is outdated (1.0.2) and can be slimmed down

2016-06-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15349470#comment-15349470
 ] 

Ismaël Mejía commented on FLINK-4118:
-

I am working on this and I have a PR ready for review. Could somebody please 
assign this to me? Thanks.

> The docker-flink image is outdated (1.0.2) and can be slimmed down
> --
>
> Key: FLINK-4118
> URL: https://issues.apache.org/jira/browse/FLINK-4118
> Project: Flink
>  Issue Type: Improvement
>Reporter: Ismaël Mejía
>Priority: Minor
>
> This issue is to upgrade the docker image and polish some details in it (e.g. 
> it can be slimmed down if we remove some unneeded dependencies, and the code 
> can be polished).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1707) Add an Affinity Propagation Library Method

2016-06-25 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15349699#comment-15349699
 ] 

Josep Rubió commented on FLINK-1707:


Hi Vasia,

Maybe it does not make sense to continue with this implementation. Even though 
it is a "graph" algorithm, it does not seem to fit well on distributed graph 
platforms. I know there are some implementations of the original AP that should 
work well (I guess you know them); maybe that is what you need for Gelly.

I also think performance should be tested, but I don't have access to a real 
cluster. I've done some tests before for Hadoop with a cluster set up on my 
laptop, but 4 nodes with 3 GB of memory each is the maximum I can reach. Not 
very useful :(

By the way, before doing anything I'll document an example with some iterations 
and ask some concrete questions about the implementation.

Thanks Vasia!

> Add an Affinity Propagation Library Method
> --
>
> Key: FLINK-1707
> URL: https://issues.apache.org/jira/browse/FLINK-1707
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Reporter: Vasia Kalavri
>Assignee: Josep Rubió
>Priority: Minor
>  Labels: requires-design-doc
> Attachments: Binary_Affinity_Propagation_in_Flink_design_doc.pdf
>
>
> This issue proposes adding an implementation of the Affinity Propagation 
> algorithm as a Gelly library method and a corresponding example.
> The algorithm is described in paper [1] and a description of a vertex-centric 
> implementation can be found in [2].
> [1]: http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf
> [2]: http://event.cwi.nl/grades2014/00-ching-slides.pdf
> Design doc:
> https://docs.google.com/document/d/1QULalzPqMVICi8jRVs3S0n39pell2ZVc7RNemz_SGA4/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4120) Lightweight fault tolerance through recomputing lost state

2016-06-26 Thread JIRA
Dénes Vadász created FLINK-4120:
---

 Summary: Lightweight fault tolerance through recomputing lost state
 Key: FLINK-4120
 URL: https://issues.apache.org/jira/browse/FLINK-4120
 Project: Flink
  Issue Type: New Feature
  Components: Core
Reporter: Dénes Vadász


The current fault tolerance mechanism requires that stateful operators write 
their internal state to stable storage during a checkpoint. 

This proposal aims at optimizing out this operation in the cases where the 
operator state can be recomputed from a finite (practically: small) set of 
source records, and those records are already on checkpoint-aware persistent 
storage (e.g. in Kafka). 

The rationale behind the proposal is that the cost of reprocessing is paid only 
on recovery from (presumably rare) failures, whereas the cost of persisting the 
state is paid on every checkpoint. Eliminating the need for persistent storage 
will also simplify system setup and operation.

In the cases where this optimisation is applicable, the state of the operators 
can be restored by restarting the pipeline from a checkpoint taken before the 
pipeline ingested any of the records required for the state re-computation of 
the operators (call this the "null-state checkpoint"), as opposed to a restart 
from the "latest checkpoint". 
The "latest checkpoint" is still relevant for the recovery: the barriers 
belonging to that checkpoint must be inserted into the source streams in the 
position they were originally inserted. Sinks must discard all records until 
this barrier reaches them.

Note the inherent relationship between the "latest" and the "null-state" 
checkpoints: the pipeline must be restarted from the latter to restore the 
state at the former.

For the stateful operators for which this optimization is applicable we can 
define the notion of "current null-state watermark" as the watermark such that 
the operator can correctly (re)compute its current state merely from records 
after this watermark. 
 
For the checkpoint-coordinator to be able to compute the null-state checkpoint, 
each stateful operator should report its "current null-state watermark" as part 
of acknowledging the ongoing checkpoint. The null-state checkpoint of the 
ongoing checkpoint is the most recent checkpoint preceding all the received 
null-state watermarks (assuming the pipeline preserves the relative order of 
barriers and watermarks).
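The coordinator-side selection described above can be sketched as follows. This 
is a minimal illustration under stated assumptions, not Flink API: the class and 
method names are hypothetical, and checkpoints and watermarks are modeled simply 
as positions in the source stream (e.g. offsets) at which barriers were injected.

```java
import java.util.List;

// Hypothetical sketch: derive the "null-state checkpoint" from the
// null-state watermarks reported by the stateful operators.
public class NullStateCheckpointSelector {

    // checkpointPositions: stream positions at which each completed
    // checkpoint's barrier was injected, in any order.
    // reportedNullStateWatermarks: one watermark per stateful operator.
    public static long selectNullStateCheckpoint(List<Long> checkpointPositions,
                                                 List<Long> reportedNullStateWatermarks) {
        // The null-state checkpoint must precede every reported watermark,
        // so it is bounded by the earliest (minimum) watermark.
        long earliestWatermark = reportedNullStateWatermarks.stream()
                .min(Long::compare)
                .orElseThrow(IllegalArgumentException::new);

        // Pick the most recent checkpoint taken at or before that bound.
        return checkpointPositions.stream()
                .filter(p -> p <= earliestWatermark)
                .max(Long::compare)
                .orElseThrow(() -> new IllegalStateException("no suitable checkpoint"));
    }
}
```

With checkpoints at positions 10, 20 and 30 and reported watermarks 25 and 27, 
the selector picks the checkpoint at position 20: the most recent one preceding 
all received null-state watermarks.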




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1707) Add an Affinity Propagation Library Method

2016-06-27 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351211#comment-15351211
 ] 

Josep Rubió commented on FLINK-1707:


Agreed; I put this same statement in the design doc. But the paper does not 
explain that running the calculations of the E and I vertices of the binary 
model in parallel gives different intermediate results, since it follows a 
different scheduling. I think it should still give the same clusters (but I'm 
not an expert in Bayesian networks).

About Capacitated Affinity Propagation: one of the advantages of the binary 
model is that adding constraints to clusters only requires different 
calculations for the E and I messages. In the CAP case only the α(i,j) 
calculation is different, so you just need to modify the functions that update 
the E vertices.

By the way, I'm not sure I understand your answer. Do you mean we should work 
with the original AP algorithm?

Thanks! 

> Add an Affinity Propagation Library Method
> --
>
> Key: FLINK-1707
> URL: https://issues.apache.org/jira/browse/FLINK-1707
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Reporter: Vasia Kalavri
>Assignee: Josep Rubió
>Priority: Minor
>  Labels: requires-design-doc
> Attachments: Binary_Affinity_Propagation_in_Flink_design_doc.pdf
>
>
> This issue proposes adding an implementation of the Affinity Propagation 
> algorithm as a Gelly library method and a corresponding example.
> The algorithm is described in paper [1] and a description of a vertex-centric 
> implementation can be found in [2].
> [1]: http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf
> [2]: http://event.cwi.nl/grades2014/00-ching-slides.pdf
> Design doc:
> https://docs.google.com/document/d/1QULalzPqMVICi8jRVs3S0n39pell2ZVc7RNemz_SGA4/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-4120) Lightweight fault tolerance through recomputing lost state

2016-06-27 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351739#comment-15351739
 ] 

Dénes Vadász commented on FLINK-4120:
-

Whether the state is re-computed or restored from stable storage should be a 
trade-off controlled by the designer of the pipeline. Whether the input records 
are persisted before the pipeline ingests them, and how long they are preserved 
there, should again be at the discretion of the pipeline's designer.
I am arguing in favour of making the re-computation of the state possible; not 
in favour of mandating it.

The use case I am coming with is correlation of log records to restore the 
message flow in a distributed computer system. In such a case the typical 
lifetime of operator state is a few minutes at most. (In fact we wanted to port 
a solution using this optimization (and written long before Spark and Flink 
were born) to Flink, and discovered that Flink currently lacks this.)

Note that if you want to mix recomputing and restoring operators in a pipeline, 
the restoring operators have to restore the state they saved during the 
null-state checkpoint and participate in the re-computation phase in the same 
way they behave during normal operation, unaffected by the latest checkpoint 
barrier. This barrier influences the behaviour of sinks only: before the 
barrier sinks consume and discard their input (as these records were already 
emitted before the failure); after the barrier they emit normally.

I do not quite understand the statement about barriers being non-deterministic; 
if, in view of my explanations, it is still relevant, could you please 
elaborate on it?

Regards
Dénes

> Lightweight fault tolerance through recomputing lost state
> --
>
> Key: FLINK-4120
> URL: https://issues.apache.org/jira/browse/FLINK-4120
> Project: Flink
>  Issue Type: New Feature
>  Components: State Backends, Checkpointing
>Reporter: Dénes Vadász
>Priority: Minor
>
> The current fault tolerance mechanism requires that stateful operators write 
> their internal state to stable storage during a checkpoint. 
> This proposal aims at optimizing out this operation in the cases where the 
> operator state can be recomputed from a finite (practically: small) set of 
> source records, and those records are already on checkpoint-aware persistent 
> storage (e.g. in Kafka). 
> The rationale behind the proposal is that the cost of reprocessing is paid 
> only on recovery from (presumably rare) failures, whereas the cost of 
> persisting the state is paid on every checkpoint. Eliminating the need for 
> persistent storage will also simplify system setup and operation.
> In the cases where this optimisation is applicable, the state of the 
> operators can be restored by restarting the pipeline from a checkpoint taken 
> before the pipeline ingested any of the records required for the state 
> re-computation of the operators (call this the "null-state checkpoint"), as 
> opposed to a restart from the "latest checkpoint". 
> The "latest checkpoint" is still relevant for the recovery: the barriers 
> belonging to that checkpoint must be inserted into the source streams in the 
> position they were originally inserted. Sinks must discard all records until 
> this barrier reaches them.
> Note the inherent relationship between the "latest" and the "null-state" 
> checkpoints: the pipeline must be restarted from the latter to restore the 
> state at the former.
> For the stateful operators for which this optimization is applicable we can 
> define the notion of "current null-state watermark" as the watermark such 
> that the operator can correctly (re)compute its current state merely from 
> records after this watermark. 
>  
> For the checkpoint-coordinator to be able to compute the null-state 
> checkpoint, each stateful operator should report its "current null-state 
> watermark" as part of acknowledging the ongoing checkpoint. The null-state 
> checkpoint of the ongoing checkpoint is the most recent checkpoint preceding 
> all the received null-state watermarks (assuming the pipeline preserves the 
> relative order of barriers and watermarks).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (FLINK-4120) Lightweight fault tolerance through recomputing lost state

2016-06-27 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351739#comment-15351739
 ] 

Dénes Vadász edited comment on FLINK-4120 at 6/27/16 8:23 PM:
--

Whether the state is re-computed or restored from stable storage should be a 
trade-off controlled by the designer of the pipeline. Whether the input records 
are persisted before the pipeline ingests them, and how long they are preserved 
there, is at the discretion of the pipeline's designer, so he/she can align 
these decisions.
I am arguing in favour of making the re-computation of the state possible; not 
in favour of mandating it.

The use case I am coming with is correlation of log records to restore the 
message flow in a distributed computer system. In such a case the typical 
lifetime of operator state is a few minutes at most. (In fact we wanted to port 
a solution using this optimization (and written long before Spark and Flink 
were born) to Flink, and discovered that Flink currently lacks this.)

Note that if you want to mix recomputing and restoring operators in a pipeline, 
the restoring operators have to restore the state they saved during the 
null-state checkpoint and participate in the re-computation phase in the same 
way they behave during normal operation, unaffected by the latest checkpoint 
barrier. This barrier influences the behaviour of sinks only: before the 
barrier sinks consume and discard their input (as these records were already 
emitted before the failure); after the barrier they emit normally.

I do not quite understand the statement about barriers being non-deterministic; 
if, in view of my explanations, it is still relevant, could you please 
elaborate on it?

Regards
Dénes


was (Author: dvadasz):
Whether the state is re-computed or restored from stable storage should be a 
trade-off controlled by the designer of the pipeline. Whether the input records 
are persisted before the pipeline ingests them, and how long they are preserved 
there, should again be at the discretion of the pipeline's designer.
I am arguing in favour of making the re-computation of the state possible; not 
in favour of mandating it.

The use case I am coming with is correlation of log records to restore the 
message flow in a distributed computer system. In such a case the typical 
lifetime of operator state is a few minutes at most. (In fact we wanted to port 
a solution using this optimization (and written long before Spark and Flink 
were born) to Flink, and discovered that Flink currently lacks this.)

Note that if you want to mix recomputing and restoring operators in a pipeline, 
the restoring operators have to restore the state they saved during the 
null-state checkpoint and participate in the re-computation phase in the same 
way they behave during normal operation, unaffected by the latest checkpoint 
barrier. This barrier influences the behaviour of sinks only: before the 
barrier sinks consume and discard their input (as these records were already 
emitted before the failure); after the barrier they emit normally.

I do not quite understand the statement about barriers being non-deterministic; 
if, in view of my explanations, it is still relevant, could you please 
elaborate on it?

Regards
Dénes

> Lightweight fault tolerance through recomputing lost state
> --
>
> Key: FLINK-4120
> URL: https://issues.apache.org/jira/browse/FLINK-4120
> Project: Flink
>  Issue Type: New Feature
>  Components: State Backends, Checkpointing
>Reporter: Dénes Vadász
>Priority: Minor
>
> The current fault tolerance mechanism requires that stateful operators write 
> their internal state to stable storage during a checkpoint. 
> This proposal aims at optimizing out this operation in the cases where the 
> operator state can be recomputed from a finite (practically: small) set of 
> source records, and those records are already on checkpoint-aware persistent 
> storage (e.g. in Kafka). 
> The rationale behind the proposal is that the cost of reprocessing is paid 
> only on recovery from (presumably rare) failures, whereas the cost of 
> persisting the state is paid on every checkpoint. Eliminating the need for 
> persistent storage will also simplify system setup and operation.
> In the cases where this optimisation is applicable, the state of the 
> operators can be restored by restarting the pipeline from a checkpoint taken 
> before the pipeline ingested any of the records required for the state 
> re-computation of the operators (call this the "null-state checkpoint"), as 
> opposed to a restart from the "latest checkpoint". 
> The "latest checkpoint"

[jira] [Created] (FLINK-4136) Add a JobHistory Server service

2016-06-30 Thread JIRA
Márton Balassi created FLINK-4136:
-

 Summary: Add a JobHistory Server service
 Key: FLINK-4136
 URL: https://issues.apache.org/jira/browse/FLINK-4136
 Project: Flink
  Issue Type: New Feature
  Components: Cluster Management
Reporter: Márton Balassi


When running on YARN the lack of a JobHistory server is very inconvenient. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4209) Docker image breaks with multiple NICs

2016-07-13 Thread JIRA
Ismaël Mejía created FLINK-4209:
---

 Summary: Docker image breaks with multiple NICs
 Key: FLINK-4209
 URL: https://issues.apache.org/jira/browse/FLINK-4209
 Project: Flink
  Issue Type: Improvement
  Components: flink-contrib
Reporter: Ismaël Mejía
Priority: Minor


Today the docker image scripts resolve the host by IP, which is an issue when 
the system has multiple network cards. Resolving the host by name instead 
fixes this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4208) Support Running Flink processes in foreground mode

2016-07-13 Thread JIRA
Ismaël Mejía created FLINK-4208:
---

 Summary: Support Running Flink processes in foreground mode
 Key: FLINK-4208
 URL: https://issues.apache.org/jira/browse/FLINK-4208
 Project: Flink
  Issue Type: Improvement
Reporter: Ismaël Mejía
Priority: Minor


Flink clusters are started in daemon mode by default; however, if we want to 
start containers based on Flink, the execution context gets lost. Running the 
Flink processes in the foreground can fix this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (FLINK-4208) Support Running Flink processes in foreground mode

2016-07-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía closed FLINK-4208.
---
Resolution: Fixed

I decided to drop the foreground mode and instead change the docker script to 
use an extra wait, to avoid losing the process in the background.

> Support Running Flink processes in foreground mode
> --
>
> Key: FLINK-4208
> URL: https://issues.apache.org/jira/browse/FLINK-4208
> Project: Flink
>  Issue Type: Improvement
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Flink clusters are started in daemon mode by default; however, if we want to 
> start containers based on Flink, the execution context gets lost. Running the 
> Flink processes in the foreground can fix this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (FLINK-4209) Fix issue on docker with multiple NICs and remove supervisord dependency

2016-07-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-4209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated FLINK-4209:

Summary: Fix issue on docker with multiple NICs and remove supervisord 
dependency  (was: Docker image breaks with multiple NICs)

> Fix issue on docker with multiple NICs and remove supervisord dependency
> 
>
> Key: FLINK-4209
> URL: https://issues.apache.org/jira/browse/FLINK-4209
> Project: Flink
>  Issue Type: Improvement
>  Components: flink-contrib
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Today the docker image scripts resolve the host by IP, which is an issue when 
> the system has multiple network cards. Resolving the host by name instead 
> fixes this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (FLINK-4259) Unclosed FSDataOutputStream in FileCache#copy()

2016-07-28 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Márton Balassi closed FLINK-4259.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Fixed via 5d7e949.

> Unclosed FSDataOutputStream in FileCache#copy()
> ---
>
> Key: FLINK-4259
> URL: https://issues.apache.org/jira/browse/FLINK-4259
> Project: Flink
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Neelesh Srinivas Salian
>Priority: Minor
> Fix For: 1.2.0
>
>
> {code}
> try {
>   FSDataOutputStream lfsOutput = tFS.create(targetPath, false);
>   FSDataInputStream fsInput = sFS.open(sourcePath);
>   IOUtils.copyBytes(fsInput, lfsOutput);
> {code}
> The FSDataOutputStream lfsOutput should be closed upon exit.
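A standard way to guarantee the streams are closed on every exit path is 
try-with-resources. The sketch below illustrates the pattern using plain 
`java.nio.file` streams in place of Flink's `FSDataInputStream`/`FSDataOutputStream` 
and a hypothetical `SafeCopy` helper; it is not the actual `FileCache#copy()` fix.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SafeCopy {
    // Copies sourcePath to targetPath; both streams are closed
    // automatically when the try block exits, normally or exceptionally.
    public static void copy(Path sourcePath, Path targetPath) throws IOException {
        try (InputStream in = Files.newInputStream(sourcePath);
             OutputStream out = Files.newOutputStream(targetPath)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}
```

The same shape applies to the reported code: declaring `lfsOutput` and 
`fsInput` in the resource clause removes the leak without an explicit 
`finally` block.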



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2662) CompilerException: "Bug: Plan generation for Unions picked a ship strategy between binary plan operators."

2016-08-01 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401814#comment-15401814
 ] 

Lorenz Bühmann commented on FLINK-2662:
---

I do have the same problem:

Exception in thread "main" org.apache.flink.optimizer.CompilerException: Bug: 
Plan generation for Unions picked a ship strategy between binary plan operators.
at 
org.apache.flink.optimizer.traversals.BinaryUnionReplacer.collect(BinaryUnionReplacer.java:113)
at 
org.apache.flink.optimizer.traversals.BinaryUnionReplacer.postVisit(BinaryUnionReplacer.java:73)
at 
org.apache.flink.optimizer.traversals.BinaryUnionReplacer.postVisit(BinaryUnionReplacer.java:41)
at 
org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:170)
at 
org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:163)
at 
org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:163)
at 
org.apache.flink.optimizer.plan.SingleInputPlanNode.accept(SingleInputPlanNode.java:199)
at 
org.apache.flink.optimizer.plan.SingleInputPlanNode.accept(SingleInputPlanNode.java:199)
at 
org.apache.flink.optimizer.plan.OptimizedPlan.accept(OptimizedPlan.java:128)
at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:516)
at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:398)
at 
org.apache.flink.client.LocalExecutor.executePlan(LocalExecutor.java:184)
at 
org.apache.flink.api.java.LocalEnvironment.execute(LocalEnvironment.java:90)
at 
org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:855)
at org.apache.flink.api.java.DataSet.collect(DataSet.java:410)
at org.apache.flink.api.java.DataSet.print(DataSet.java:1605)
at org.apache.flink.api.scala.DataSet.print(DataSet.scala:1615)
at 
org.sansa.inference.flink.conformance.FlinkBug$.main(FlinkBug.scala:31)
at org.sansa.inference.flink.conformance.FlinkBug.main(FlinkBug.scala)

I tried both Flink 1.0.3 and Flink 1.1.0-SNAPSHOT.

Any progress on this?

> CompilerException: "Bug: Plan generation for Unions picked a ship strategy 
> between binary plan operators."
> --
>
> Key: FLINK-2662
>     URL: https://issues.apache.org/jira/browse/FLINK-2662
> Project: Flink
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 0.9.1, 0.10.0
>Reporter: Gabor Gevay
> Fix For: 1.0.0
>
>
> I have a Flink program which throws the exception in the jira title. Full 
> text:
> Exception in thread "main" org.apache.flink.optimizer.CompilerException: Bug: 
> Plan generation for Unions picked a ship strategy between binary plan 
> operators.
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.collect(BinaryUnionReplacer.java:113)
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.postVisit(BinaryUnionReplacer.java:72)
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.postVisit(BinaryUnionReplacer.java:41)
>   at 
> org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:170)
>   at 
> org.apache.flink.optimizer.plan.SingleInputPlanNode.accept(SingleInputPlanNode.java:199)
>   at 
> org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:163)
>   at 
> org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:163)
>   at 
> org.apache.flink.optimizer.plan.WorksetIterationPlanNode.acceptForStepFunction(WorksetIterationPlanNode.java:194)
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.preVisit(BinaryUnionReplacer.java:49)
>   at 
> org.apache.flink.optimizer.traversals.BinaryUnionReplacer.preVisit(BinaryUnionReplacer.java:41)
>   at 
> org.apache.flink.optimizer.plan.DualInputPlanNode.accept(DualInputPlanNode.java:162)
>   at 
> org.apache.flink.optimizer.plan.SingleInputPlanNode.accept(SingleInputPlanNode.java:199)
>   at 
> org.apache.flink.optimizer.plan.SingleInputPlanNode.accept(SingleInputPlanNode.java:199)
>   at 
> org.apache.flink.optimizer.plan.OptimizedPlan.accept(OptimizedPlan.java:127)
>   at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:520)
>   at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:402)
>   at 
> org.apache.flink.client.LocalExecutor.getOptimizerPlanAsJSON(LocalExecutor.java:202)
>   at 
> org.apache.flink.api.java.LocalEnvironment.getExecutionPlan(LocalE

[jira] [Updated] (FLINK-1707) Add an Affinity Propagation Library Method

2016-08-03 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josep Rubió updated FLINK-1707:
---
Description: 
This issue proposes adding an implementation of the Affinity Propagation 
algorithm as a Gelly library method and a corresponding example.
The algorithm is described in paper [1] and a description of a vertex-centric 
implementation can be found in [2].

[1]: http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf
[2]: http://event.cwi.nl/grades2014/00-ching-slides.pdf

Design doc:
https://docs.google.com/document/d/1QULalzPqMVICi8jRVs3S0n39pell2ZVc7RNemz_SGA4/edit?usp=sharing

Example spreadsheet:
https://docs.google.com/spreadsheets/d/1CurZCBP6dPb1IYQQIgUHVjQdyLxK0JDGZwlSXCzBcvA/edit?usp=sharing

  was:
This issue proposes adding an implementation of the Affinity Propagation 
algorithm as a Gelly library method and a corresponding example.
The algorithm is described in paper [1] and a description of a vertex-centric 
implementation can be found in [2].

[1]: http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf
[2]: http://event.cwi.nl/grades2014/00-ching-slides.pdf

Design doc:
https://docs.google.com/document/d/1QULalzPqMVICi8jRVs3S0n39pell2ZVc7RNemz_SGA4/edit?usp=sharing


> Add an Affinity Propagation Library Method
> --
>
> Key: FLINK-1707
> URL: https://issues.apache.org/jira/browse/FLINK-1707
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Reporter: Vasia Kalavri
>Assignee: Josep Rubió
>Priority: Minor
>  Labels: requires-design-doc
> Attachments: Binary_Affinity_Propagation_in_Flink_design_doc.pdf
>
>
> This issue proposes adding an implementation of the Affinity Propagation 
> algorithm as a Gelly library method and a corresponding example.
> The algorithm is described in paper [1] and a description of a vertex-centric 
> implementation can be found in [2].
> [1]: http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf
> [2]: http://event.cwi.nl/grades2014/00-ching-slides.pdf
> Design doc:
> https://docs.google.com/document/d/1QULalzPqMVICi8jRVs3S0n39pell2ZVc7RNemz_SGA4/edit?usp=sharing
> Example spreadsheet:
> https://docs.google.com/spreadsheets/d/1CurZCBP6dPb1IYQQIgUHVjQdyLxK0JDGZwlSXCzBcvA/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1707) Add an Affinity Propagation Library Method

2016-08-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15406651#comment-15406651
 ] 

Josep Rubió commented on FLINK-1707:


Hi all,

Sorry for my late response, but I have not had much time to work on this.

I have created a spreadsheet executing the algorithm with the same example I 
have on GitHub. It is a 3-node graph with the following similarities:

1 -> 1: 1.0
1 -> 2: 1.0
1 -> 3: 5.0
2 -> 1: 1.0
2 -> 2: 1.0
2 -> 3: 3.0
3 -> 1: 5.0
3 -> 2: 3.0
3 -> 3: 1.0

The execution in the spreadsheet contains the calculations for the intermediate 
results of the algorithm (eta and beta values, shown as n and b respectively). 
These calculations are implicit in alpha and rho in the implementation. 
You can see that the rho messages in the spreadsheet are the values sent 
from I to E nodes in the implementation, and the alpha messages are the values 
sent from E to I nodes.

The damping factor can be set in the second row.

https://docs.google.com/spreadsheets/d/1CurZCBP6dPb1IYQQIgUHVjQdyLxK0JDGZwlSXCzBcvA/edit?usp=sharing
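For readers without the spreadsheet at hand, the message updates it computes can 
be sketched in plain Java. This is a minimal sketch of the standard affinity 
propagation updates with damping; the class name, the 100-iteration cap, and the 
0.5 damping factor are illustrative assumptions, not values taken from the 
spreadsheet or the Gelly implementation:

```java
public class ApSketch {
    // Runs damped responsibility/availability sweeps over similarity matrix S
    // and returns the exemplar chosen by each node (1-based node ids).
    static int[] exemplars(double[][] S, int iterations, double damping) {
        int n = S.length;
        double[][] R = new double[n][n]; // responsibilities (rho messages, I -> E)
        double[][] A = new double[n][n]; // availabilities (alpha messages, E -> I)
        for (int it = 0; it < iterations; it++) {
            // r(i,k) = s(i,k) - max_{k' != k} (a(i,k') + s(i,k'))
            for (int i = 0; i < n; i++) {
                for (int k = 0; k < n; k++) {
                    double m = Double.NEGATIVE_INFINITY;
                    for (int kp = 0; kp < n; kp++)
                        if (kp != k) m = Math.max(m, A[i][kp] + S[i][kp]);
                    R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - m);
                }
            }
            // a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
            // a(k,k) = sum_{i' != k} max(0, r(i',k))
            for (int i = 0; i < n; i++) {
                for (int k = 0; k < n; k++) {
                    double sum = 0;
                    for (int ip = 0; ip < n; ip++)
                        if (ip != i && ip != k) sum += Math.max(0, R[ip][k]);
                    double a = (i == k) ? sum : Math.min(0, R[k][k] + sum);
                    A[i][k] = damping * A[i][k] + (1 - damping) * a;
                }
            }
        }
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            int best = 0;
            for (int k = 1; k < n; k++)
                if (A[i][k] + R[i][k] > A[i][best] + R[i][best]) best = k;
            result[i] = best + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // The 3-node example above; the diagonal entries act as preferences.
        double[][] S = {{1.0, 1.0, 5.0}, {1.0, 1.0, 3.0}, {5.0, 3.0, 1.0}};
        int[] ex = exemplars(S, 100, 0.5);
        for (int i = 0; i < ex.length; i++)
            System.out.println("node " + (i + 1) + " -> exemplar " + ex[i]);
    }
}
```

Dropping the damping term (setting the factor to 0) removes the need to keep 
the previously sent R and A values, which is the trade-off discussed below.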

About the other issues:

- Although the paper recommends using damping, the algorithm can be implemented 
without it, which avoids keeping in each vertex the old values sent to each 
destination. I don't see any other way to implement damping.
- Calculating values in parallel is not an issue; it is just a different 
scheduling of the algorithm. It could also be done serially by performing the 
calculations only for odd or even vertices in each superstep.
- I will review the other implementation issues you commented on.

Thanks!






[jira] [Assigned] (FLINK-3026) Publish the flink docker container to the docker registry

2016-08-07 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía reassigned FLINK-3026:
---

Assignee: Ismaël Mejía

> Publish the flink docker container to the docker registry
> -
>
> Key: FLINK-3026
> URL: https://issues.apache.org/jira/browse/FLINK-3026
> Project: Flink
>  Issue Type: Task
>Reporter: Omer Katz
>Assignee: Ismaël Mejía
>  Labels: Deployment, Docker
>
> There's a dockerfile that can be used to build a docker container already in 
> the repository. It'd be awesome to just be able to pull it instead of 
> building it ourselves.
> The dockerfile can be found at 
> https://github.com/apache/flink/tree/master/flink-contrib/docker-flink
> It also doesn't point to the latest version of Flink which I fixed in 
> https://github.com/apache/flink/pull/1366





[jira] [Commented] (FLINK-3026) Publish the flink docker container to the docker registry

2016-08-07 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410906#comment-15410906
 ] 

Ismaël Mejía commented on FLINK-3026:
-

I assign myself this issue following the previous discussion with [~aljoscha].






[jira] [Updated] (FLINK-3155) Update Flink docker version to latest stable Flink version

2016-08-07 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated FLINK-3155:

Affects Version/s: 1.1.0

> Update Flink docker version to latest stable Flink version
> --
>
> Key: FLINK-3155
> URL: https://issues.apache.org/jira/browse/FLINK-3155
> Project: Flink
>  Issue Type: Task
>  Components: flink-contrib
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Maximilian Michels
> Fix For: 1.0.0
>
>
> It would be nice to always set the Docker Flink binary URL to point to the 
> latest Flink version. Until then, this JIRA keeps track of the updates for 
> releases.





[jira] [Updated] (FLINK-3155) Update Flink docker version to latest stable Flink version

2016-08-07 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated FLINK-3155:

Fix Version/s: 1.2.0






[jira] [Updated] (FLINK-3155) Update Flink docker version to latest stable Flink version

2016-08-07 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated FLINK-3155:

Fix Version/s: (was: 1.2.0)






[jira] [Updated] (FLINK-3155) Update Flink docker version to latest stable Flink version

2016-08-07 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated FLINK-3155:

Priority: Minor  (was: Major)

> Update Flink docker version to latest stable Flink version
> --
>
> Key: FLINK-3155
> URL: https://issues.apache.org/jira/browse/FLINK-3155
> Project: Flink
>  Issue Type: Task
>  Components: flink-contrib
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Maximilian Michels
>Priority: Minor
> Fix For: 1.0.0
>
>
> It would be nice to always set the Docker Flink binary URL to point to the 
> latest Flink version. Until then, this JIRA keeps track of the updates for 
> releases.





[jira] [Created] (FLINK-4327) Error in link to download version for hadoop 2.7 and scala 2.11

2016-08-07 Thread JIRA
Ismaël Mejía created FLINK-4327:
---

 Summary: Error in link to download version for hadoop 2.7 and scala 
2.11
 Key: FLINK-4327
 URL: https://issues.apache.org/jira/browse/FLINK-4327
 Project: Flink
  Issue Type: Bug
  Components: Project Website
Affects Versions: 1.1.0
Reporter: Ismaël Mejía


On the download page https://flink.apache.org/downloads.html, the link to 
download the release for Hadoop 2.7.0 and Scala 2.11 points to the previous 
version's (1.0.3) URL.





[jira] [Commented] (FLINK-4326) Flink start-up scripts should optionally start services on the foreground

2016-08-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15411605#comment-15411605
 ] 

Ismaël Mejía commented on FLINK-4326:
-

For reference, this is the commit of my proposed solution:
https://github.com/iemejia/flink/commit/af9ac6cdb3f6601d6248abe82df4fd44de4453e5

Notice that I finally closed the pull request, since we could work around this 
via {{&& wait}} in the Docker script, which was my goal, but I still agree that 
having foreground processes has its value. For reference, the JIRA where we 
discussed this early on:
https://issues.apache.org/jira/browse/FLINK-4208

Is there an alternative PR for this?

> Flink start-up scripts should optionally start services on the foreground
> -
>
> Key: FLINK-4326
> URL: https://issues.apache.org/jira/browse/FLINK-4326
> Project: Flink
>  Issue Type: Improvement
>  Components: Startup Shell Scripts
>Affects Versions: 1.0.3
>Reporter: Elias Levy
>
> This has previously been mentioned in the mailing list, but has not been 
> addressed.  Flink start-up scripts start the job and task managers in the 
> background.  This makes it difficult to integrate Flink with most processes 
> supervisory tools and init systems, including Docker.  One can get around 
> this via hacking the scripts or manually starting the right classes via Java, 
> but it is a brittle solution.
> In addition to starting the daemons in the foreground, the start up scripts 
> should use exec instead of running the commends, so as to avoid forks.  Many 
> supervisory tools assume the PID of the process to be monitored is that of 
> the process it first executes, and fork chains make it difficult for the 
> supervisor to figure out what process to monitor.  Specifically, 
> jobmanager.sh and taskmanager.sh should exec flink-daemon.sh, and 
> flink-daemon.sh should exec java.





[jira] [Commented] (FLINK-4326) Flink start-up scripts should optionally start services on the foreground

2016-08-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413252#comment-15413252
 ] 

Ismaël Mejía commented on FLINK-4326:
-

Well, it is good to know that there is interest in the foreground mode. It is 
probably a good idea to invite [~greghogan] to the discussion, since he 
reviewed my previous PR.

What do you think? Should I rebase my previous patch and create a PR for this, 
or does anyone have a better idea of how to do it?






[jira] [Commented] (FLINK-4326) Flink start-up scripts should optionally start services on the foreground

2016-08-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414182#comment-15414182
 ] 

Ismaël Mejía commented on FLINK-4326:
-

Well, that's the question I was wondering about before my previous PR, but then 
I realized that having a centralized point for all the changes (the current 
flink-daemon.sh) was less error-prone. That's the reason I ended up mixing 
flink-daemon with an action like 'start-foreground'. On the other hand, we 
could rename flink-daemon to flink-service; it would do the same but with a 
less confusing name.







[jira] [Commented] (FLINK-4326) Flink start-up scripts should optionally start services on the foreground

2016-08-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15417255#comment-15417255
 ] 

Ismaël Mejía commented on FLINK-4326:
-

A separation of (daemon/console) scripts would be the nicest, no doubt. 
However, I am not sure that removing the PID code and output will be 
appropriate when we run daemons and foreground processes at the same time. How 
do we count the running instances if somebody runs a new process in foreground 
mode? And what would be the logic if we call stop-all: must we kill all the 
processes, even the foreground ones? In these cases I think we need the 
PID/output references, but I am not really sure; maybe we can do such things 
without them.

Independent of this, we must not forget that we should preserve at least the 
same options (start|stop|stop-all) for both jobmanager.sh and taskmanager.sh, 
because they do their magic (building the runtime options) and at the end they 
call the (daemon/console) script. I suppose we will need the new 
start-foreground option in these scripts too, or are there any other ideas of 
how best to do it?







[jira] [Commented] (FLINK-4319) Rework Cluster Management (FLIP-6)

2017-04-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15961748#comment-15961748
 ] 

Markus Müller commented on FLINK-4319:
--

For a reference implementation on Bluemix with Kubernetes support in beta, 
please have a look at github.com/sedgewickmm18/After-HadoopsummitMeetupIoT - it 
should work on all platforms with Kubernetes support.
I'm waiting for FLIP-6 to implement multi-tenancy support, i.e. to have task 
managers dedicated to tenants and thus isolate tenant workloads.

> Rework Cluster Management (FLIP-6)
> --
>
> Key: FLINK-4319
> URL: https://issues.apache.org/jira/browse/FLINK-4319
> Project: Flink
>  Issue Type: Improvement
>  Components: Cluster Management
>Affects Versions: 1.1.0
>Reporter: Stephan Ewen
>
> This is the root issue to track progress of the rework of cluster management 
> (FLIP-6) 
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077





[jira] [Created] (FLINK-6317) History server - wrong default directory

2017-04-18 Thread JIRA
Lorenz Bühmann created FLINK-6317:
-

 Summary: History server - wrong default directory
 Key: FLINK-6317
 URL: https://issues.apache.org/jira/browse/FLINK-6317
 Project: Flink
  Issue Type: Bug
  Components: Web Client
Affects Versions: 1.2.0
Reporter: Lorenz Bühmann
Priority: Minor


When the history server is started without a directory specified in the 
configuration file, it will use some random directory located in the Java Temp 
directory. Unfortunately, a file separator is missing:

{code:title=HistoryServer.java@L139-L143|borderStyle=solid}
String webDirectory = 
config.getString(HistoryServerOptions.HISTORY_SERVER_WEB_DIR);
if (webDirectory == null) {
webDirectory = System.getProperty("java.io.tmpdir") + 
"flink-web-history-" + UUID.randomUUID();
}
webDir = new File(webDirectory);
{code}

It should be 

{code}
webDirectory = System.getProperty("java.io.tmpdir") + File.separator +  
"flink-web-history-" + UUID.randomUUID();
{code}
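Alternatively, a sketch of an equivalent fix (the class and method names below 
are illustrative, not from the Flink source) is to let {{java.io.File}} join the 
parent directory and the child name, which works whether or not 
{{java.io.tmpdir}} ends with a separator:

```java
import java.io.File;
import java.util.UUID;

public class WebDirExample {
    // Joins the temp directory and a child name safely, regardless of whether
    // java.io.tmpdir ends with a file separator.
    static File webHistoryDir(String tmpDir) {
        return new File(tmpDir, "flink-web-history-" + UUID.randomUUID());
    }

    public static void main(String[] args) {
        // Broken concatenation: yields e.g. "/tmpflink-web-history-..." on
        // Linux, where java.io.tmpdir is usually "/tmp" without a trailing slash.
        String broken = System.getProperty("java.io.tmpdir")
                + "flink-web-history-" + UUID.randomUUID();
        File ok = webHistoryDir(System.getProperty("java.io.tmpdir"));
        System.out.println(broken);
        System.out.println(ok.getPath());
    }
}
```

The two-argument {{File(String parent, String child)}} constructor inserts the 
separator itself, so no manual {{File.separator}} concatenation is needed.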





[jira] [Updated] (FLINK-6317) History server - wrong default directory

2017-04-18 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-6317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lorenz Bühmann updated FLINK-6317:
--
Description: 
When the history server is started without a directory specified in the 
configuration file, it will use some random directory located in the Java Temp 
directory. Unfortunately, a file separator is missing:

{code:title=HistoryServer.java@L139-L143|borderStyle=solid}
String webDirectory = 
config.getString(HistoryServerOptions.HISTORY_SERVER_WEB_DIR);
if (webDirectory == null) {
webDirectory = System.getProperty("java.io.tmpdir") + 
"flink-web-history-" + UUID.randomUUID();
}
webDir = new File(webDirectory);
{code}

It should be 

{code}
webDirectory = System.getProperty("java.io.tmpdir") + File.separator +  
"flink-web-history-" + UUID.randomUUID();
{code}



> History server - wrong default directory
> 
>
> Key: FLINK-6317
> URL: https://issues.apache.org/jira/browse/FLINK-6317
> Project: Flink
>  Issue Type: Bug
>  Components: Web Client
>Affects Versions: 1.2.0
>Reporter: Lorenz Bühmann
>Priority: Minor
>  Labels: easyfix
>
> When the history server is started without a directory specified in the 
> configuration file, it will use some random directory located in the Java 
> Temp directory. Unfortunately, a file separator is missing:
> {code:title=HistoryServer.java@L139-L143|borderStyle=solid}
> String webDirectory = 
> config.getString(HistoryServerOptions.HISTORY_SERVER_WEB_DIR);
> if (webDirectory == null) {
>   webDirectory = System.getProperty("java.io.tmpdir") + 
> "flink-web-history-" + UUID.randomUUID();
> }
> webDir = new File(webDirectory);
> {code}
> It should be 
> {code}
> webDirectory = System.getProperty("java.io.tmpdir") + File.separator +  
> "flink-web-history-" + UUID.randomUUID();
> {code}





[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2017-04-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981414#comment-15981414
 ] 

Gábor Hermann commented on FLINK-2147:
--

I would prefer emitting updates on a window basis, as Flink has quite rich 
options for triggering. With overpartitioning, there could be _count-min sketch 
partitions_ (CMS-partitions), more than the number of partitions (i.e. 
subtasks). We could assign a CMS-partition to the input data (based on hashing) 
and keyBy on the CMS-partition. Then, we could fold over the CMS-partition (a 
compact array), which is (AFAIK) internally stored as a keyed state. This way, 
we would not keep state for every key separately (saving memory), while still 
allowing operators to scale (increase/decrease parallelism). Does that make 
sense?

Using windows makes it easier to define _when to delete old data_ and _when to 
emit results_, and to deal with _out-of-orderness_. However, with windows 
there is slightly more memory overhead compared to e.g. storing one count-min 
sketch array per partition.

A question is then *what API should we provide?* The user could specify the 
key, window assigner, trigger, evictor, allowedLateness, and the count-min 
sketch properties (size, hash functions). Then, the window could be translated 
into another window keyed by the CMS-partition (as I described). But should it 
be simply a function that takes a DataStream as input and returns a DataStream 
with the results? Or should we add a special countMinSketch function to 
KeyedDataStream?

Alternatively, we could implement the count-min sketch without windows. The 
user would specify two streams: one that queries the count-min sketch and one 
that writes to it. So the "triggering" is done by a stream. The problem is 
then how to specify when to delete old data and how to deal with 
out-of-orderness.

Another question is *where should we place the API?* In the 
flink-streaming-java module? Or in flink-streaming-contrib? This, of course, 
highly depends on what API we provide.
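To make the per-partition state size concrete, here is a minimal, 
self-contained count-min sketch in plain Java, following the standard 
Cormode/Muthukrishnan update and point-query rules. The class name, seeding 
scheme, and table sizes are illustrative assumptions, not Flink API:

```java
import java.util.Random;

public class CountMinSketch {
    final int depth, width;   // depth = number of hash rows, width = buckets per row
    final long[][] table;     // the compact array that would live in keyed state
    final int[] seeds;        // one hash seed per row

    CountMinSketch(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        Random rnd = new Random(seed);
        for (int i = 0; i < depth; i++) seeds[i] = rnd.nextInt();
    }

    // Maps an item to a bucket in the given row via a seeded, mixed hash.
    int bucket(int row, Object item) {
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        return Math.floorMod(h, width);
    }

    // Update: increment one counter in every row.
    void add(Object item) {
        for (int r = 0; r < depth; r++) table[r][bucket(r, item)]++;
    }

    // Point query: the minimum over the rows; never underestimates the true count.
    long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int r = 0; r < depth; r++)
            min = Math.min(min, table[r][bucket(r, item)]);
        return min;
    }
}
```

Each CMS-partition would hold one such depth-by-width table as its state; the 
estimate is an upper bound on the true frequency, with error controlled by the 
table sizes.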

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
>     URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: DataStream API
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error bounds are also different from those 
> of the Count-min sketch algorithm.





[jira] [Commented] (FLINK-5753) CEP timeout handler.

2017-05-12 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-5753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007743#comment-16007743
 ] 

Michał Jurkiewicz commented on FLINK-5753:
--

[~kkl0u],

My job is operating is eventTime. Here is my configuration:
{code}
final StreamExecutionEnvironment env = 
StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.getConfig().setGlobalJobParameters(conf);
{code}

> CEP timeout handler.
> 
>
> Key: FLINK-5753
> URL: https://issues.apache.org/jira/browse/FLINK-5753
> Project: Flink
>  Issue Type: Bug
>  Components: CEP
>Affects Versions: 1.1.2
>Reporter: Michał Jurkiewicz
>
> I configured the following flink job in my environment:
> {code}
> Pattern patternCommandStarted = Pattern.begin("event-accepted").subtype(Event.class)
> .where(e -> {event accepted where statement})
> .next("second-event-started").subtype(Event.class)
> .where(e -> {event started where statement})
> .within(Time.seconds(30));
> DataStream> events = CEP
>   .pattern(eventsStream.keyBy(e -> e.getEventProperties().get("deviceCode")), patternCommandStarted)
>   .select(eventSelector, eventSelector);
> static class EventSelector implements PatternSelectFunction, PatternTimeoutFunction {}
> {code}
> The problem that I have is related to timeout handling. I observed that if 
> the first event appears, the second event does not appear in the stream, 
> and *no new events appear in the stream*, the timeout handler is not executed.
> Expected result: the timeout handler should be executed even if there are no 
> new events in the stream.
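The observed behaviour is consistent with event-time semantics: a within() 
timeout can only fire once the watermark passes the deadline, and the watermark 
only advances when new elements (or a periodic watermark generator) push it 
forward. A minimal, Flink-free sketch of that rule (all names here are 
illustrative, not CEP internals):

```java
public class EventTimeTimeoutSketch {
    static final long TIMEOUT_MS = 30_000; // the within(Time.seconds(30)) window

    // A partial match times out only when the watermark has passed its deadline.
    static boolean timedOut(long partialMatchStartTs, long currentWatermark) {
        return currentWatermark >= partialMatchStartTs + TIMEOUT_MS;
    }

    public static void main(String[] args) {
        long firstEventTs = 1_000L;
        long watermark = 1_000L; // stalls here if no further events arrive

        // No new events: the watermark never moves, so the timeout never fires.
        System.out.println(timedOut(firstEventTs, watermark));   // false

        // A later event advances the watermark past the deadline.
        watermark = 40_000L;
        System.out.println(timedOut(firstEventTs, watermark));   // true
    }
}
```

In other words, with a stalled source the deadline is never crossed in event 
time, which matches the report that the timeout handler does not run until new 
events arrive.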





[jira] [Commented] (FLINK-4632) when yarn nodemanager lost, flink hung

2016-09-26 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15522303#comment-15522303
 ] 

刘喆 commented on FLINK-4632:
---

I tried the latest GitHub version; the problem is still there.
When the job is running and I kill one TaskManager, it goes into CANCELING and 
then hangs.

> when yarn nodemanager lost,  flink hung
> ---
>
> Key: FLINK-4632
> URL: https://issues.apache.org/jira/browse/FLINK-4632
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination, Streaming
>Affects Versions: 1.2.0, 1.1.2
> Environment: cdh5.5.1  jdk1.7 flink1.1.2  1.2-snapshot   kafka0.8.2
>Reporter: 刘喆
>
> When running flink streaming on yarn, using kafka as source, it runs well at 
> start. But after a long run, for example 8 hours, handling 60,000,000+ 
> messages, it hangs: no messages are consumed, one taskmanager is CANCELING, 
> and the exception shows:
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: 
> connection timeout
>   at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.exceptionCaught(PartitionRequestClientHandler.java:152)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>   at 
> io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>   at 
> io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>   at 
> io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:79)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:835)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:87)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:162)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: 连接超时 (connection timed out)
>   at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>   at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>   at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
>   at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>   ... 6 more
> after applying https://issues.apache.org/jira/browse/FLINK-4625
> it shows:
> java.lang.Exception: TaskManager was lost/killed: 
> ResourceID{resourceId='container_1471620986643_744852_01_001400'} @ 
> 38.slave.adh (dataPort=45349)
>   at 
> org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:162)
>   at 
> org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
>   at 
> org.apache

[jira] [Commented] (FLINK-4632) when yarn nodemanager lost, flink hung

2016-09-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15528781#comment-15528781
 ] 

刘喆 commented on FLINK-4632:
---

I can't reproduce it now.
I only saved the TaskManager's log from the beginning.
If it happens again, I will come back here.
Thanks very much. 

> when yarn nodemanager lost,  flink hung
> ---
>
> Key: FLINK-4632
> URL: https://issues.apache.org/jira/browse/FLINK-4632
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination, Streaming
>Affects Versions: 1.2.0, 1.1.2
> Environment: cdh5.5.1  jdk1.7 flink1.1.2  1.2-snapshot   kafka0.8.2
>Reporter: 刘喆
>Priority: Blocker
>
> When running Flink streaming on YARN with Kafka as the source, it runs well 
> at first. But after a long run, for example 8 hours and 60,000,000+ messages 
> processed, it hangs: no messages are consumed, one TaskManager is CANCELING, 
> and the exception shows:
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: 
> connection timeout
>   at 
> org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.exceptionCaught(PartitionRequestClientHandler.java:152)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>   at 
> io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>   at 
> io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>   at 
> io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:79)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:835)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:87)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:162)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: 连接超时 (Connection timed out)
>   at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>   at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>   at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
>   at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>   ... 6 more
> After applying https://issues.apache.org/jira/browse/FLINK-4625
> it shows:
> java.lang.Exception: TaskManager was lost/killed: 
> ResourceID{resourceId='container_1471620986643_744852_01_001400'} @ 
> 38.slave.adh (dataPort=45349)
>   at 
> org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:162)
>   at 
> org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.ja

[jira] [Commented] (FLINK-4632) when yarn nodemanager lost, flink hung

2016-09-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15528833#comment-15528833
 ] 

刘喆 commented on FLINK-4632:
---

I think it is related to checkpointing.
When I use checkpoints with 'exactly_once' mode, a sub-task may hang while the 
other sub-tasks keep running; after a long time, all tasks hang. At the same 
time, no more checkpoints are produced.
When killing, maybe the checkpoint blocks another thread. I use the JobManager 
as the checkpoint backend. The checkpoint interval is 30 seconds.

Some log as below:
2016-09-27 09:42:59,463 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- send (369/500) 
(e60c3755b5461c181e29cd30400cd6b0) switched from DEPLOYING to RUNNING
2016-09-27 09:42:59,552 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- send (274/500) 
(f983db19c603a51027cf7031e19edb79) switched from DEPLOYING to RUNNING
2016-09-27 09:43:00,256 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- send (497/500) 
(92bde125c4eb920d32aa11b6514f4cf1) switched from DEPLOYING to RUNNING
2016-09-27 09:43:00,477 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- send (373/500) 
(a59fec5a8ce66518d9003e6a480e1854) switched from DEPLOYING to RUNNING
2016-09-27 09:44:05,867 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
checkpoint 1 @ 1474940645865
2016-09-27 09:44:41,782 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
checkpoint 1 (in 35917 ms)
2016-09-27 09:49:05,865 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
checkpoint 2 @ 1474940945865
2016-09-27 09:50:05,866 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 2 
expired before completing.
2016-09-27 09:50:07,390 WARN  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late 
message for now expired checkpoint attempt 2
2016-09-27 09:50:10,572 WARN  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late 
message for now expired checkpoint attempt 2
2016-09-27 09:50:12,207 WARN  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late 
message for now expired checkpoint attempt 2
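The timestamps in the log above already tell part of the story: checkpoint 1 took longer than the stated 30-second trigger interval, the two trigger events are 300 seconds apart, and checkpoint 2 was declared expired roughly 60 seconds after it was triggered (which suggests a 60-second checkpoint timeout; that value is an inference, not stated in the log). A small sketch of the arithmetic:

```python
# Values copied from the CheckpointCoordinator log above; the trigger
# timestamps are epoch milliseconds.
cp1_triggered_ms = 1474940645865   # "Triggering checkpoint 1 @ 1474940645865"
cp1_duration_ms = 35917            # "Completed checkpoint 1 (in 35917 ms)"
cp2_triggered_ms = 1474940945865   # "Triggering checkpoint 2 @ 1474940945865"

stated_interval_ms = 30_000        # "The checkpoint interval is 30 seconds."

# Checkpoint 1 took longer than the stated interval, so the job was already
# unable to keep up with a 30-second checkpoint cadence.
assert cp1_duration_ms > stated_interval_ms

# Gap between the two trigger events: 300 seconds, i.e. ten 30-second
# intervals -- the coordinator is clearly not triggering every 30 seconds.
gap_s = (cp2_triggered_ms - cp1_triggered_ms) // 1000
print(gap_s)  # 300

# Checkpoint 2 was triggered at 09:49:05 and expired at 09:50:05 -- almost
# exactly 60 seconds, suggesting (assumption) a 60-second checkpoint timeout.
expiry_s = (9 * 3600 + 50 * 60 + 5) - (9 * 3600 + 49 * 60 + 5)
print(expiry_s)  # 60
```

If the 60-second reading is right, any checkpoint that stalls behind a hung sub-task will keep expiring the same way attempt 2 did.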




[jira] [Comment Edited] (FLINK-4632) when yarn nodemanager lost, flink hung

2016-09-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15528833#comment-15528833
 ] 

刘喆 edited comment on FLINK-4632 at 9/28/16 8:13 AM:


I think it is related to checkpointing.
When I use checkpoints with 'exactly_once' mode, a sub-task may hang while the 
other sub-tasks keep running; after a long time, all tasks hang. At the same 
time, no more checkpoints are produced.
When killing, maybe the checkpoint blocks another thread. I use the JobManager 
as the checkpoint backend. The checkpoint interval is 30 seconds.

Some log as below:
2016-09-27 09:42:59,463 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- send (369/500) 
(e60c3755b5461c181e29cd30400cd6b0) switched from DEPLOYING to RUNNING
2016-09-27 09:42:59,552 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- send (274/500) 
(f983db19c603a51027cf7031e19edb79) switched from DEPLOYING to RUNNING
2016-09-27 09:43:00,256 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- send (497/500) 
(92bde125c4eb920d32aa11b6514f4cf1) switched from DEPLOYING to RUNNING
2016-09-27 09:43:00,477 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- send (373/500) 
(a59fec5a8ce66518d9003e6a480e1854) switched from DEPLOYING to RUNNING
2016-09-27 09:44:05,867 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
checkpoint 1 @ 1474940645865
2016-09-27 09:44:41,782 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
checkpoint 1 (in 35917 ms)
2016-09-27 09:49:05,865 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
checkpoint 2 @ 1474940945865
2016-09-27 09:50:05,866 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 2 
expired before completing.
2016-09-27 09:50:07,390 WARN  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late 
message for now expired checkpoint attempt 2
2016-09-27 09:50:10,572 WARN  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late 
message for now expired checkpoint attempt 2
2016-09-27 09:50:12,207 WARN  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late 
message for now expired checkpoint attempt 2


And when I don't use checkpoints at all, it works well.






[jira] [Comment Edited] (FLINK-4632) when yarn nodemanager lost, flink hung

2016-09-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15528833#comment-15528833
 ] 

刘喆 edited comment on FLINK-4632 at 9/28/16 8:25 AM:


I think it is related to checkpointing.
When I use checkpoints with 'exactly_once' mode, a sub-task may hang while the 
other sub-tasks keep running; after a long time, all tasks hang. At the same 
time, no more checkpoints are produced.
When killing, maybe the checkpoint blocks another thread. I use the JobManager 
as the checkpoint backend. The checkpoint interval is 30 seconds.

I have 2 screenshots of it:
http://p1.bpimg.com/567571/eb9442e01ede0a24.png
http://p1.bpimg.com/567571/eb9442e01ede0a24.png



Some log as below:
2016-09-27 09:42:59,463 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- send (369/500) 
(e60c3755b5461c181e29cd30400cd6b0) switched from DEPLOYING to RUNNING
2016-09-27 09:42:59,552 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- send (274/500) 
(f983db19c603a51027cf7031e19edb79) switched from DEPLOYING to RUNNING
2016-09-27 09:43:00,256 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- send (497/500) 
(92bde125c4eb920d32aa11b6514f4cf1) switched from DEPLOYING to RUNNING
2016-09-27 09:43:00,477 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- send (373/500) 
(a59fec5a8ce66518d9003e6a480e1854) switched from DEPLOYING to RUNNING
2016-09-27 09:44:05,867 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
checkpoint 1 @ 1474940645865
2016-09-27 09:44:41,782 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
checkpoint 1 (in 35917 ms)
2016-09-27 09:49:05,865 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
checkpoint 2 @ 1474940945865
2016-09-27 09:50:05,866 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 2 
expired before completing.
2016-09-27 09:50:07,390 WARN  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late 
message for now expired checkpoint attempt 2
2016-09-27 09:50:10,572 WARN  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late 
message for now expired checkpoint attempt 2
2016-09-27 09:50:12,207 WARN  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late 
message for now expired checkpoint attempt 2


And when I don't use checkpoints at all, it works well.




and when I don't use checkpoint at all, it work well


[jira] [Comment Edited] (FLINK-4632) when yarn nodemanager lost, flink hung

2016-09-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-4632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15528833#comment-15528833
 ] 

刘喆 edited comment on FLINK-4632 at 9/28/16 11:00 AM:
-

I think it is related to checkpointing.
When I use checkpoints with 'exactly_once' mode, a sub-task may hang while the 
other sub-tasks keep running; after a long time, all tasks hang. At the same 
time, no more checkpoints are produced.
When killing, maybe the checkpoint blocks another thread. I use the JobManager 
as the checkpoint backend. The checkpoint interval is 30 seconds.

I have 2 screenshots of it:
http://p1.bpimg.com/567571/eb9442e01ede0a24.png
http://p1.bpimg.com/567571/6502b8c89fc68229.png



Some log as below:
2016-09-27 09:42:59,463 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- send (369/500) 
(e60c3755b5461c181e29cd30400cd6b0) switched from DEPLOYING to RUNNING
2016-09-27 09:42:59,552 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- send (274/500) 
(f983db19c603a51027cf7031e19edb79) switched from DEPLOYING to RUNNING
2016-09-27 09:43:00,256 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- send (497/500) 
(92bde125c4eb920d32aa11b6514f4cf1) switched from DEPLOYING to RUNNING
2016-09-27 09:43:00,477 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- send (373/500) 
(a59fec5a8ce66518d9003e6a480e1854) switched from DEPLOYING to RUNNING
2016-09-27 09:44:05,867 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
checkpoint 1 @ 1474940645865
2016-09-27 09:44:41,782 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed 
checkpoint 1 (in 35917 ms)
2016-09-27 09:49:05,865 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
checkpoint 2 @ 1474940945865
2016-09-27 09:50:05,866 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 2 
expired before completing.
2016-09-27 09:50:07,390 WARN  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late 
message for now expired checkpoint attempt 2
2016-09-27 09:50:10,572 WARN  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late 
message for now expired checkpoint attempt 2
2016-09-27 09:50:12,207 WARN  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late 
message for now expired checkpoint attempt 2


And when I don't use checkpoints at all, it works well.




[jira] [Created] (FLINK-4706) Is there a way to load user classes first?

2016-09-28 Thread JIRA
刘喆 created FLINK-4706:
-

 Summary: Is there a way to load user classes first?
 Key: FLINK-4706
 URL: https://issues.apache.org/jira/browse/FLINK-4706
 Project: Flink
  Issue Type: New Feature
  Components: Core
Affects Versions: 1.1.2, 1.2.0
Reporter: 刘喆


If some classes in the job jar differ from the Flink runtime or YARN 
environment, the job doesn't work. So how can we load the classes in the job 
jar first? Is there a CLI argument or some configuration?
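The question is about class-loading delegation order: with parent-first delegation (Flink's behavior at the time), the framework's copy of a class shadows the one in the job jar; the usual workarounds are relocating (shading) the conflicting dependency inside the job jar, and later Flink versions added a child-first resolve order. The difference between the two orders can be sketched as a lookup rule (a toy model, not Flink's actual class loader; the class and jar names are made up for illustration):

```python
# Toy model of class-loading delegation. The "classpaths" are plain dicts
# mapping class name -> which jar provides it; real class loaders resolve
# bytecode, but the lookup order is the point here.
framework_classpath = {"com.example.Util": "util-1.0 (flink/yarn lib)"}
job_jar_classpath = {"com.example.Util": "util-2.0 (user job jar)"}

def load(name: str, child_first: bool) -> str:
    """Return which jar wins for `name` under the given delegation order."""
    order = ([job_jar_classpath, framework_classpath] if child_first
             else [framework_classpath, job_jar_classpath])
    for classpath in order:
        if name in classpath:
            return classpath[name]
    raise KeyError(name)  # the moral equivalent of ClassNotFoundException

# Parent-first: the framework's (possibly older) copy shadows the user's.
print(load("com.example.Util", child_first=False))  # util-1.0 (flink/yarn lib)
# Child-first: the job jar's copy wins, which is what the reporter wants.
print(load("com.example.Util", child_first=True))   # util-2.0 (user job jar)
```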



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-4712) Implementing ranking predictions for ALS

2016-09-29 Thread JIRA
Domokos Miklós Kelen created FLINK-4712:
---

 Summary: Implementing ranking predictions for ALS
 Key: FLINK-4712
 URL: https://issues.apache.org/jira/browse/FLINK-4712
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Domokos Miklós Kelen


We started working on implementing ranking predictions for recommender systems. 
Ranking prediction means that besides predicting scores for user-item pairs, the 
recommender system is able to recommend a top-K list for the users.

Details:

In practice, this would mean finding the K items for a particular user with the 
highest predicted rating. It should also be possible to specify whether to 
exclude the already seen items from a particular user's top list. (See for 
example the 'exclude_known' setting of [Graphlab Create's ranking factorization 
recommender|https://turi.com/products/create/docs/generated/graphlab.recommender.ranking_factorization_recommender.RankingFactorizationRecommender.recommend.html#graphlab.recommender.ranking_factorization_recommender.RankingFactorizationRecommender.recommend].)

The output of the topK recommendation function could be in the form of 
DataSet[(Int,Int,Int)], meaning (user, item, rank), similar to Graphlab 
Create's output. However, this is arguable: follow up work includes 
implementing ranking recommendation evaluation metrics (such as precision@k, 
recall@k, ndcg@k), similar to [Spark's 
implementations|https://spark.apache.org/docs/1.5.0/mllib-evaluation-metrics.html#ranking-systems].
 It would be beneficial if we were able to design the API such that it could be 
included in the proposed evaluation framework (see 
[FLINK-2157|https://issues.apache.org/jira/browse/FLINK-2157]), which makes it 
necessary to consider the possible output type DataSet[(Int, Array[Int])] or 
DataSet[(Int, Array[(Int,Double)])] meaning (user, array of items), possibly 
including the predicted scores as well. See [issue todo] for details.

Another question arising is whether to provide this function as a member of the 
ALS class, as a switch-kind of parameter to the ALS implementation (meaning the 
model is either a rating or a ranking recommender model) or in some other way.
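The top-K extraction described above is independent of how the scores are produced: given predicted (user, item, score) triples and the set of already seen pairs, producing (user, item, rank) output with an 'exclude_known' switch looks roughly like this (a plain-Python sketch of the intended semantics, not the FlinkML API; all names are made up):

```python
from collections import defaultdict

def recommend_top_k(predictions, known, k, exclude_known=True):
    """predictions: iterable of (user, item, score); known: set of (user, item)
    pairs seen in training. Returns (user, item, rank) triples with rank
    starting at 1, mirroring the proposed DataSet[(Int, Int, Int)] output."""
    by_user = defaultdict(list)
    for user, item, score in predictions:
        if exclude_known and (user, item) in known:
            continue
        by_user[user].append((score, item))
    out = []
    for user, scored in by_user.items():
        # Best score first; item id breaks ties deterministically.
        scored.sort(key=lambda si: (-si[0], si[1]))
        out.extend((user, item, rank)
                   for rank, (score, item) in enumerate(scored[:k], start=1))
    return out

preds = [(1, 10, 0.9), (1, 11, 0.8), (1, 12, 0.95), (2, 10, 0.5)]
print(recommend_top_k(preds, known={(1, 12)}, k=2))
# [(1, 10, 1), (1, 11, 2), (2, 10, 1)]
```

In a distributed setting the grouping step would be a groupBy on the user key, with the sort-and-truncate applied per group; the alternative output shape DataSet[(Int, Array[(Int, Double)])] would simply keep `scored[:k]` per user instead of flattening it.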





[jira] [Updated] (FLINK-4712) Implementing ranking predictions for ALS

2016-09-29 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Domokos Miklós Kelen updated FLINK-4712:

Description: 
We started working on implementing ranking predictions for recommender systems. 
Ranking prediction means that besides predicting scores for user-item pairs, the 
recommender system is able to recommend a top-K list for the users.

Details:

In practice, this would mean finding the K items for a particular user with the 
highest predicted rating. It should also be possible to specify whether to 
exclude the already seen items from a particular user's top list. (See for 
example the 'exclude_known' setting of [Graphlab Create's ranking factorization 
recommender|https://turi.com/products/create/docs/generated/graphlab.recommender.ranking_factorization_recommender.RankingFactorizationRecommender.recommend.html#graphlab.recommender.ranking_factorization_recommender.RankingFactorizationRecommender.recommend].)

The output of the topK recommendation function could be in the form of 
{{DataSet[(Int,Int,Int)]}}, meaning (user, item, rank), similar to Graphlab 
Create's output. However, this is arguable: follow up work includes 
implementing ranking recommendation evaluation metrics (such as precision@k, 
recall@k, ndcg@k), similar to [Spark's 
implementations|https://spark.apache.org/docs/1.5.0/mllib-evaluation-metrics.html#ranking-systems].
 It would be beneficial if we were able to design the API such that it could be 
included in the proposed evaluation framework (see 
[FLINK-2157|https://issues.apache.org/jira/browse/FLINK-2157]), which makes it 
necessary to consider the possible output type {{DataSet[(Int, Array[Int])]}} 
or {{DataSet[(Int, Array[(Int,Double)])]}} meaning (user, array of items), 
possibly including the predicted scores as well. See [issue todo] for details.

Another question arising is whether to provide this function as a member of the 
ALS class, as a switch-kind of parameter to the ALS implementation (meaning the 
model is either a rating or a ranking recommender model) or in some other way.



