Re: Reading from HBase problem

2015-06-09 Thread Hilmi Yildirim
Hi Ufuk, I used the TableInputFormat from flink-addons. Best Regards, Hilmi On 09.06.2015 at 13:17, Ufuk Celebi wrote: Hey Hilmi, thanks for reporting the issue. Sorry for the inconvenience this has caused. I'm not familiar with HBase in combination with Flink. From what I've seen, there a

Re: Reading from HBase problem

2015-06-09 Thread Ufuk Celebi
Hey Hilmi, thanks for reporting the issue. Sorry for the inconvenience this has caused. I'm not familiar with HBase in combination with Flink. From what I've seen, there are two options: either use Flink's TableInputFormat from flink-addons or the Hadoop TableInputFormat, right? Which one are y

Re: Reading from HBase problem

2015-06-09 Thread fhueske
Thank you very much! From: Hilmi Yildirim Sent: Tuesday, 9 June 2015 11:40 To: user@flink.apache.org Done https://issues.apache.org/jira/browse/FLINK-2188 On 09.06.2015 at 11:26, Fabian Hueske wrote: Would you mind opening a JIRA for this issue? -> https://issues.apa

Re: Reading from HBase problem

2015-06-09 Thread Hilmi Yildirim
Done https://issues.apache.org/jira/browse/FLINK-2188 On 09.06.2015 at 11:26, Fabian Hueske wrote: Would you mind opening a JIRA for this issue? -> https://issues.apache.org/jira/browse/FLINK I can do it as well, but you know all the details. Thanks, Fabian 2015-06-09 11:03 GMT+02:00 Hilmi

Re: Reading from HBase problem

2015-06-09 Thread Fabian Hueske
Would you mind opening a JIRA for this issue? -> https://issues.apache.org/jira/browse/FLINK I can do it as well, but you know all the details. Thanks, Fabian 2015-06-09 11:03 GMT+02:00 Hilmi Yildirim: I want to add that I run the Flink job on a cluster with 13 machines and each machine

Re: Reading from HBase problem

2015-06-09 Thread Hilmi Yildirim
I want to add that I run the Flink job on a cluster with 13 machines, and each machine has 13 processing slots, for a total of 169 processing slots. On 09.06.2015 at 10:59, Hilmi Yildirim wrote: Correct. I also counted the rows with Spark and Hive. Both returned the same

Re: Reading from HBase problem

2015-06-09 Thread Hilmi Yildirim
Correct. I also counted the rows with Spark and Hive. Both returned the same value, which is nearly 100 million rows. But Flink returns 102 million rows. Best Regards, Hilmi On 09.06.2015 at 10:47, Fabian Hueske wrote: OK, so the problem seems to be with the HBase InputFormat. I guess this issue

Re: Reading from HBase problem

2015-06-09 Thread Fabian Hueske
OK, so the problem seems to be with the HBase InputFormat. I guess this issue needs a bit of debugging. We need to check whether records are emitted twice (or more often) and, if that is the case, which records. Unfortunately, this issue only seems to occur with large tables :-( Did I get that right, th
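
The check described above (are records emitted more than once, and if so which ones) can be sketched in plain Java. This is only an illustration under stated assumptions: the class name DuplicateRowCheck, the method findDuplicates, and the sample row keys are hypothetical, and the sketch takes the emitted row keys as an in-memory list rather than reading them through Flink or HBase.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Flink or HBase.
class DuplicateRowCheck {

    // Returns every row key that was seen more than once, with its occurrence count.
    static Map<String, Long> findDuplicates(List<String> emittedRowKeys) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String key : emittedRowKeys) {
            counts.merge(key, 1L, Long::sum); // increment the count for this key
        }
        counts.values().removeIf(n -> n < 2); // keep only keys emitted twice or more
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample keys standing in for the row keys an input format emitted.
        List<String> emitted = List.of("row-1", "row-2", "row-3", "row-2", "row-1", "row-2");
        System.out.println(findDuplicates(emitted)); // prints {row-1=2, row-2=3}
    }
}
```

For a table of the size discussed in this thread, the same counting would have to run as a distributed job grouped by row key; the in-memory map only shows the logic.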

Re: Reading from HBase problem

2015-06-09 Thread Hilmi Yildirim
Hi, I have now tested the "count" method. It returns the same result as the flatmap.groupBy(0).sum(1) method. Furthermore, HBase contains nearly 100 million rows, but the result is 102 million. This means that the HBase input reads more rows than HBase contains. Best Regards, Hilmi On 08.06.201

Re: Reading from HBase problem

2015-06-08 Thread Fabian Hueske
Hi Hilmi, I see two possible reasons: 1) The data source / InputFormat is not working properly, so not all HBase records are read/forwarded, or 2) The aggregation / count is buggy. Robert's suggestion will use an alternative mechanism to do the count. In fact, you can count with groupBy(0).sum() a

Re: Reading from HBase problem

2015-06-08 Thread Robert Metzger
Hi Hilmi, if you just want to count the number of elements, you can also use accumulators, as described here [1]. They are much more lightweight. So you need to make your flatMap function a RichFlatMapFunction, then call getRuntimeContext(). Use a long accumulator to count the elements. If the

Reading from HBase problem

2015-06-08 Thread Hilmi Yildirim
Hi, I implemented a simple Flink batch job which reads from an HBase cluster of 13 machines with nearly 100 million rows. The HBase version is 1.0.0-cdh5.4.1, so I imported hbase-client 1.0.0-cdh5.4.1. I implemented a flatmap which creates a tuple ("a", 1L) for each row. Then, I use group
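
The counting approach described in this message (emit a tuple ("a", 1L) per row, then group on the first field and sum the second) can be illustrated with a plain Java stand-in. This is a sketch of the logic only and does not use the Flink API; the class name ConstantKeyCount, the method countRows, and the sample rows are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for the job's logic, written with Java streams instead of Flink.
class ConstantKeyCount {

    // Maps each row to ("a", 1L), groups by the key, and sums the 1s per key.
    static Map<String, Long> countRows(List<String> rows) {
        return rows.stream()
                .map(row -> Map.entry("a", 1L))
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.summingLong(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        // Three made-up rows; the single key "a" ends up holding the total row count.
        System.out.println(countRows(List.of("r1", "r2", "r3"))); // prints {a=3}
    }
}
```

Because every row maps to the same key, the sum for "a" equals the number of rows read, so any row delivered twice by the input inflates the result directly.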
