Spring release using Apache clients 0.11

2017-07-20 Thread David Espinosa
Hi, does somebody know if there will be any Spring Integration/Kafka release
soon using the Apache 0.11 clients?


Re: Spring release using Apache clients 0.11

2017-07-20 Thread David Espinosa
Thanks Rajini!

On 20 Jul 2017 at 18:41, "Rajini Sivaram" wrote:

> David,
>
> The release plans are here: https://github.com/spring-
> projects/spring-kafka/
> milestone/20?closed=1
>
> We have already included TX and headers support to the current M3 which is
> planned just after the next SF 5.0 RC3, which is expected tomorrow.
>
> Regards,
>
> Rajini
>
> On Thu, Jul 20, 2017 at 5:01 PM, David Espinosa  wrote:
>
> > Hi, somebody know if we will any spring integration/kafka release soon
> > using apache clients 11?
> >
>


Re: Spring release using Apache clients 0.11

2017-07-21 Thread David Espinosa
Hi Gary,
The feature I'm looking to use from the Apache 0.11 clients is custom
header creation (if I'm not wrong, you have already added support for it).
Up to now I have been populating a list of headers (metadata) into the payload
of the message itself, so as soon as I can move them to the Kafka "message
headers" space, it will be great :)

Thanks!
David

2017-07-21 13:16 GMT+02:00 Gary Russell :

> We are also considering releasing a 1.3 version, which will have a subset
> of the 2.0 features, and also support the 0.11 clients, while not requiring
> Spring Framework 5.0 and Java 8.
>
> On 2017-07-20 15:58 (-0400), David Espinosa  wrote:
> > Thanks Rajini!
> >
> > On 20 Jul 2017 at 18:41, "Rajini Sivaram" wrote:
> >
> > > David,
> > >
> > > The release plans are here: https://github.com/spring-
> > > projects/spring-kafka/
> > > milestone/20?closed=1
> > >
> > > We have already included TX and headers support to the current M3
> which is
> > > planned just after the next SF 5.0 RC3, which is expected tomorrow.
> > >
> > > Regards,
> > >
> > > Rajini
> > >
> > > On Thu, Jul 20, 2017 at 5:01 PM, David Espinosa 
> wrote:
> > >
> > > > Hi, somebody know if we will any spring integration/kafka release
> soon
> > > > using apache clients 11?
> > > >
> > >
> >
>


jute.maxbuffer in ZooKeeper

2017-08-07 Thread David Espinosa
Hi all,
In fact this is a question about ZooKeeper, but it is closely related to Kafka.
I'm running a test to check that we can create up to 100k topics in
a Kafka cluster, so that we can manage multitenancy this way.

After a proper setup, I have managed to create those 100k topics in Kafka,
but now I find that I can't retrieve the list of topics using
kafka-topics.sh; this is the error I get:

[2017-08-07 08:12:37,085] WARN Session 0x15dbb9894d3 for server ...,
unexpected error, closing socket connection and attempting reconnect
(org.apache.zookeeper.ClientCnxn)
java.io.IOException: Packet len4194320 is out of range!
at
org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:112)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)

I have read that this is related to a ZooKeeper property named
jute.maxbuffer, which has to be set as a Java system property. But
after several tries I'm not getting any result, so I would like to ask whether
somebody has run into this error and how you solved it.
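For reference, the stack trace above comes from the ZooKeeper client side of
the connection, so the property has to reach the JVM running the tool as well
as the ZooKeeper servers. A rough sketch of what I have been trying (the value
and host are just examples):

import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class ListTopicsWithBiggerBuffer {
    public static void main(String[] args) throws Exception {
        // jute.maxbuffer is read as a JVM system property by the ZooKeeper client classes,
        // so it must be in place before they are used -- e.g. passed as -Djute.maxbuffer=...
        // to the tool's JVM, and set on the ZooKeeper server JVMs as well.
        System.setProperty("jute.maxbuffer", String.valueOf(8 * 1024 * 1024)); // example value

        ZooKeeper zk = new ZooKeeper("zookeeper-host:2181", 30000, event -> { });
        List<String> topics = zk.getChildren("/brokers/topics", false); // the znode Kafka uses
        System.out.println(topics.size() + " topics");
        zk.close();
    }
}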

Thanks in advance,
David


java.io.IOException: Packet len4194320 is out of range!

2017-08-07 Thread David Espinosa
Hi,
I'm getting this error when trying to connect to ZooKeeper once I have created
70k+ topics.
I have played with the Java system property jute.maxbuffer with no success.
Has anybody seen this error before?

Thanks in advance,
David


Kafka, Data Lake and Event Sourcing

2017-08-21 Thread David Espinosa
Hi,
My company is currently planning to create a Data Lake. As we have also
started to use Kafka as our Event Store, and therefore to implement some
Event Sourcing on it, we are wondering whether it would be a good idea to use
the same approach to build the Data Lake.

So, one of the ideas in our mind is to use Kafka as our primary data source
for the Data Lake.

Does anyone have experience with this?

Thanks in advance,
David


GDPR compliance

2017-11-22 Thread David Espinosa
Hi all,
I would like to double-check with you how to apply some GDPR requirements to
my Kafka topics; concretely the "right to be forgotten", which forces us to
delete some data contained in the messages. So not deleting the message,
but editing it.
To do that, my intention is to replicate the topic and apply a
transformation over it.
I think frameworks like Kafka Streams or Apache Storm could fit here.
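As an illustration, the Kafka Streams variant would be something like this (a
simplified sketch assuming Kafka Streams 1.0+; topic names and the redaction
logic are placeholders):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class RedactionStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "gdpr-redaction");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("events-raw");   // original topic
        raw.mapValues(RedactionStream::redact)                        // edit the message, keep the rest
           .to("events-redacted");                                    // replicated, transformed topic
        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder transformation: blank out whichever personal fields must be forgotten.
    private static String redact(String value) {
        return value.replaceAll("\"email\":\"[^\"]*\"", "\"email\":\"\"");
    }
}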

Has anybody had to solve this problem?

Thanks in advance.


Re: GDPR compliance

2017-11-23 Thread David Espinosa
Hi Scott, and thanks for your reply.
From what you say, I guess that when you are asked to delete some user
data (that's the "right to be forgotten" in GDPR), what you are really
doing is blocking access to it. I had a similar approach, based on
Greg Young's idea of encrypting any private data and forgetting
the key when the data has to be deleted.
Sadly, after some checks our legal department has concluded that this
approach "blocks" the data but does not delete it, and as a consequence it
could cause us problems. If my guess about your solution is right, you could
have the same problems.
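For completeness, the approach we evaluated was roughly the following (a
simplified sketch; key storage, cipher mode and error handling are deliberately
glossed over):

import java.util.HashMap;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class CryptoShreddingSketch {
    // Hypothetical per-user key store; in practice a secured topic, database or KMS.
    private final Map<String, SecretKey> keysByUser = new HashMap<>();

    public byte[] encryptForUser(String userId, byte[] personalData) throws Exception {
        SecretKey key = keysByUser.computeIfAbsent(userId, id -> newKey());
        Cipher cipher = Cipher.getInstance("AES"); // pick an authenticated mode in real code
        cipher.init(Cipher.ENCRYPT_MODE, key);
        return cipher.doFinal(personalData);
    }

    // "Forgetting" a user: discard the key, leaving their records in Kafka unreadable.
    public void forgetUser(String userId) {
        keysByUser.remove(userId);
    }

    private static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}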

Thanks

2017-11-22 19:59 GMT+01:00 Scott Reynolds :

> We are using Kafka Connect consumers that consume from the raw unredacted
> topic and apply transformations and produce to a redacted topic. Using
> kafka connect allows us to set it all up with an HTTP request and doesn't
> require additional infrastructure.
>
> Then we wrote a KafkaPrincipal builder to authenticate each consumer to
> their service names. KafkaPrincipal class is specified in the
> server.properties file on the brokers. To provide topic level access
> control we just configured SimpleAclAuthorizer. The net result is, some
> consumers can only read redacted topic and very few have consumers can read
> unredacted.
>
> On Wed, Nov 22, 2017 at 10:47 AM David Espinosa  wrote:
>
> > Hi all,
> > I would like to double check with you how we want to apply some GDPR into
> > my kafka topics. In concrete the "right to be forgotten", what forces us
> to
> > delete some data contained in the messages. So not deleting the message,
> > but editing it.
> > For doing that, my intention is to replicate the topic and apply a
> > transformation over it.
> > I think that frameworks like Kafka Streams or Apache Storm.
> >
> > Did anybody had to solve this problem?
> >
> > Thanks in advance.
> >
> --
>
> Scott Reynolds
> Principal Engineer
> [image: twilio] <http://www.twilio.com/?utm_source=email_signature>
> MOBILE (630) 254-2474
> EMAIL sreyno...@twilio.com
>


Re: GDPR compliance

2018-01-26 Thread David Espinosa
Thanks a lot. I think that's the only way to ensure GDPR compliance.
In a second iteration, my plan is to anonymize instead of removing,
maybe identifying PII fields using Avro custom types.
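For reference, the compacted-topic route that Ben describes in the quoted reply
below boils down to giving every record a unique key and producing a tombstone
(null value) for that key when it has to be forgotten. A minimal sketch (topic,
key and broker address are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TombstoneSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A null value is a tombstone: on a topic with cleanup.policy=compact,
            // compaction eventually removes the earlier records with this key from disk.
            producer.send(new ProducerRecord<>("events-compacted", "record-unique-key", null));
        }
    }
}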

Thanks again,

2017-11-28 15:54 GMT+01:00 Ben Stopford :

> You should also be able to manage this with a compacted topic. If you give
> each message a unique key you'd then be able to delete, or overwrite
> specific records. Kafka will delete them from disk when compaction runs. If
> you need to partition for ordering purposes you'd need to use a custom
> partitioner that extracts a partition key from the unique key before it
> does the hash.
>
> B
>
> On Sun, Nov 26, 2017 at 10:40 AM Wim Van Leuven <
> wim.vanleu...@highestpoint.biz> wrote:
>
> > Thanks, Lars, for the most interesting read!
> >
> >
> >
> > On Sun, 26 Nov 2017 at 00:38 Lars Albertsson  wrote:
> >
> > > Hi David,
> > >
> > > You might find this presentation useful:
> > > https://www.slideshare.net/lallea/protecting-privacy-in-practice
> > >
> > > It explains privacy building blocks primarily in a batch processing
> > > context, but most of the principles are applicable for stream
> > > processing as well, e.g. splitting non-PII and PII data ("ejected
> > > record" slide), encrypting PII data ("lost key" slide).
> > >
> > > Regards,
> > >
> > >
> > >
> > > Lars Albertsson
> > > Data engineering consultant
> > > www.mapflat.com
> > > https://twitter.com/lalleal
> > > +46 70 7687109
> > > Calendar: http://www.mapflat.com/calendar
> > >
> > >
> > > On Wed, Nov 22, 2017 at 7:46 PM, David Espinosa 
> > wrote:
> > > > Hi all,
> > > > I would like to double check with you how we want to apply some GDPR
> > into
> > > > my kafka topics. In concrete the "right to be forgotten", what forces
> > us
> > > to
> > > > delete some data contained in the messages. So not deleting the
> > message,
> > > > but editing it.
> > > > For doing that, my intention is to replicate the topic and apply a
> > > > transformation over it.
> > > > I think that frameworks like Kafka Streams or Apache Storm.
> > > >
> > > > Did anybody had to solve this problem?
> > > >
> > > > Thanks in advance.
> > >
> >
>


Confluent REST proxy and Kafka Headers

2018-01-26 Thread David Espinosa
Hi all,
Does somebody know if it's possible to retrieve Kafka message headers
(available since 0.11) using the Confluent REST Proxy?

Thanks in advance,


Re: Recommended max number of topics (and data separation)

2018-01-28 Thread David Espinosa
Hi Monty,

I'm also planning to use a large number of topics in Kafka, so recently I
ran a test on a 3-node Kafka cluster where I created 100k topics with
one partition each and sent 1M messages in total.
These are my conclusions:

   - There is no limitation in Kafka itself regarding the number of topics,
   but there is in ZooKeeper and in the system where the Kafka nodes run.
   - ZooKeeper starts having problems from about 70k topics, which can be
   worked around by raising a buffer parameter on the JVM (-Djute.maxbuffer),
   at the cost of reduced performance.
   - The number of open file descriptors needed is equivalent to [number of
   topics] x [number of partitions per topic]. I set the limit to 128k in my
   test to avoid problems.
   - The system needs a large amount of memory for page caching.

So, after creating 100k topics with the required setup (system + JVM) but
seeing problems at 70k, I feel safe not creating more than 50k, and ZooKeeper
will always be my first suspect if a problem comes up. I think that with proper
resources (memory) and system setup (open file descriptors), you don't have
any real limitation regarding partitions.
By the way, I used long topic names (about 30 characters), which can be
important for ZooKeeper.
Hope this information helps.

David

2018-01-28 2:22 GMT+01:00 Monty Hindman :

> I'm designing a system and need some more clarity regarding Kafka's
> recommended limits on the number of topics and/or partitions. At a high
> level, our system would work like this:
>
> - A user creates a job X (X is a UUID).
> - The user uploads data for X to an input topic: X.in.
> - Workers process the data, writing results to an output topic: X.out.
> - The user downloads the data from X.out.
>
> It's important for the system that data for different jobs be kept
> separate, and that input and output data be kept separate. By "separate" I
> mean that there needs to be a reasonable way for users and the system's
> workers to query for the data they need (by job-id and by input-vs-output)
> and not get the data they don't need.
>
> Based on expected usage and our data retention policy, we would not expect
> to need more than 12,000 active jobs at any one time -- in other words,
> 24,000 topics. If we were to have 5 partitions per topic (our cluster has 5
> brokers), that would imply 120,000 partitions. [These number refer only to
> main/primary partitions, not any replicas that might exist.]
>
> Those numbers seem to be far larger than the suggested limits I see online.
> For example, the Kafka FAQ on these matters seems to imply that the most
> relevant limit is the number of partitions (rather than topics) and sort of
> implies that 10,000 partitions might be a suggested guideline (
> https://goo.gl/fQs2md). Also implied is that systems should use fewer
> topics and instead partition the data within topics if further separation
> is needed (the FAQ entry uses the example of partitioning by user ID, which
> is roughly analogous to job ID in my use case).
>
> The guidance in the FAQ is unclear to me:
>
> - Does the suggested limit of 10,000 refer to the total number of
> partitions (ie, main partitions plus any replicas) or just the main
> partitions?
>
> - If the most important limitation is number of partitions (rather than
> number of topics), how does the suggested strategy of using fewer topics
> and then partitioning by some other attribute (ie job ID) help at all?
>
> - Is my use case just a bad fit for Kafka? Or, is there a way for us to use
> Kafka while still supporting the kinds of query patterns that we need (ie,
> by job ID and by input-vs-output)?
>
> Thanks in advance for any guidance.
>
> Monty
>


Re: Recommended max number of topics (and data separation)

2018-01-30 Thread David Espinosa
Hi Andrey,
My topics are replicated with a replication factor equal to the number of
nodes, 3 in this test.
I didn't know about KIP-227.
The problems I see at 70k topics coming from ZooKeeper are related to any
operation where ZooKeeper has to retrieve topic metadata. Just listing topics
at 50k or 60k you will experience a big delay in the response. I have no more
details about these problems, but it is easy to reproduce the latency in the
topic list request.
Thanks for pointing me to the vm.max_map_count parameter; it wasn't on my
radar. Could you tell me what value you use?
It's the other way around with topic naming: I think the longer the topic
names are, the sooner jute.maxbuffer overflows.
David


2018-01-30 4:40 GMT+01:00 Andrey Falko :

> On Sun, Jan 28, 2018 at 8:45 AM, David Espinosa  wrote:
> > Hi Monty,
> >
> > I'm also planning to use a big amount of topics in Kafka, so recently I
> > made a test within a 3 nodes kafka cluster where I created 100k topics
> with
> > one partition. Sent 1M messages in total.
>
> Are your topic partitions replicated?
>
> > These are my conclusions:
> >
> >- There is not any limitation on kafka regarding the number of topics
> >but on Zookeeper and in the system where Kafka nodes is allocated.
>
> There are also the problems being addressed in KIP-227:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 227%3A+Introduce+Incremental+FetchRequests+to+Increase+
> Partition+Scalability
>
> >- Zookeeper will start having problems from 70k topics, which can be
> >solved modifying a buffer parameter on the JVM (-Djute.maxbuffer).
> >Performance is reduced.
>
> What kind of problems do you see at 70k topics? If performance is
> reduced w/ modifying jute.maxbuffer, won't that effect the performance
> of kafka interms of how long it takes to recover from broker failure,
> creating/deleting topics, producing and consuming?
>
> >- Open file descriptors of the system are equivalent to [number of
> >topics]X[number of partitions per topic]. Set to 128k in my test to
> avoid
> >problems.
> >- System needs a big amount of memory for page caching.
>
> I also had to tune vm.max_map_count much higher.
>
> >
> > So, after creating 100k with the required setup (system+JVM) but seeing
> > problems at 70k, I feel safe by not creating more than 50k, and always
> will
> > have Zookeeper as my first suspect if a problem comes. I think with
> proper
> > resources (memory) and system setup (open file descriptors), you don't
> have
> > any real limitation regarding partitions.
>
> I can confirm the 50k number. After about 40k-45k topics, I start
> seeing slow down in consume offset commit latencies that eclipse 50ms.
> Hopefully KIP-227 will alleviate that problem and leave ZK as the last
> remaining hurdle. I'm testing with 3x replication per partition and 10
> brokers.
>
> > By the way, I used long topic names (about 30 characters), which can be
> > important for ZK.
>
> I'd like to learn more about this, are you saying that long topic
> names would improve ZK performance because that relates to bumping up
> jute.maxbuffer?
>
> > Hope this information is of your help.
> >
> > David
> >
> > 2018-01-28 2:22 GMT+01:00 Monty Hindman :
> >
> >> I'm designing a system and need some more clarity regarding Kafka's
> >> recommended limits on the number of topics and/or partitions. At a high
> >> level, our system would work like this:
> >>
> >> - A user creates a job X (X is a UUID).
> >> - The user uploads data for X to an input topic: X.in.
> >> - Workers process the data, writing results to an output topic: X.out.
> >> - The user downloads the data from X.out.
> >>
> >> It's important for the system that data for different jobs be kept
> >> separate, and that input and output data be kept separate. By
> "separate" I
> >> mean that there needs to be a reasonable way for users and the system's
> >> workers to query for the data they need (by job-id and by
> input-vs-output)
> >> and not get the data they don't need.
> >>
> >> Based on expected usage and our data retention policy, we would not
> expect
> >> to need more than 12,000 active jobs at any one time -- in other words,
> >> 24,000 topics. If we were to have 5 partitions per topic (our cluster
> has 5
> >> brokers), that would imply 120,000 partitions. [These number refer only
> to
> >> main/primary partitions, not any replicas that might exist.]
> >>
> 

Re: Recommended max number of topics (and data separation)

2018-01-31 Thread David Espinosa
I used:
-Djute.maxbuffer=50111000
and the gain was that I could increase the number of topics from 70k to
100k :P

2018-01-30 23:25 GMT+01:00 Andrey Falko :

> On Tue, Jan 30, 2018 at 1:38 PM, David Espinosa  wrote:
> > Hi Andrey,
> > My topics are replicated with a replicated factor equals to the number of
> > nodes, 3 in this test.
> > Didn't know about the kip-227.
> > The problems I see at 70k topics coming from ZK are related to any
> > operation where ZK has to retrieve topics metadata. Just listing topics
> at
> > 50K or 60k you will experience a big delay in the response. I have no
> more
> > details about these problems, but is easy to reproduce the latency in the
> > topics list request.
>
> AFAIK kafka doesn't do a full list as part of normal operations from
> ZK. If you have requirements in your consumer/producer code on doing
> --describe, then that would be a problem. I think that can be worked
> around. Based on my profiling data so far, while things are working in
> non-failure mode, none of the ZK functions pop up as "hot methods".
>
> > Thanks me for pointing me to this parameter,  vm.max_map_count, it wasn't
> > on my radar. Could you tell me what value you use?
>
> I set it to the max allowable on Amzn Linux: vm.max_map_count=1215752192
>
> > The other way around about topic naming, I think the longer the topic
> names
> > are the sooner jute.maxbuffer overflows.
>
> I see; what value(s) have you tried with and how much gain did you you see?
>
> > David
> >
> >
> > 2018-01-30 4:40 GMT+01:00 Andrey Falko :
> >
> >> On Sun, Jan 28, 2018 at 8:45 AM, David Espinosa 
> wrote:
> >> > Hi Monty,
> >> >
> >> > I'm also planning to use a big amount of topics in Kafka, so recently
> I
> >> > made a test within a 3 nodes kafka cluster where I created 100k topics
> >> with
> >> > one partition. Sent 1M messages in total.
> >>
> >> Are your topic partitions replicated?
> >>
> >> > These are my conclusions:
> >> >
> >> >- There is not any limitation on kafka regarding the number of
> topics
> >> >but on Zookeeper and in the system where Kafka nodes is allocated.
> >>
> >> There are also the problems being addressed in KIP-227:
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> >> 227%3A+Introduce+Incremental+FetchRequests+to+Increase+
> >> Partition+Scalability
> >>
> >> >- Zookeeper will start having problems from 70k topics, which can
> be
> >> >solved modifying a buffer parameter on the JVM (-Djute.maxbuffer).
> >> >Performance is reduced.
> >>
> >> What kind of problems do you see at 70k topics? If performance is
> >> reduced w/ modifying jute.maxbuffer, won't that effect the performance
> >> of kafka interms of how long it takes to recover from broker failure,
> >> creating/deleting topics, producing and consuming?
> >>
> >> >- Open file descriptors of the system are equivalent to [number of
> >> >topics]X[number of partitions per topic]. Set to 128k in my test to
> >> avoid
> >> >problems.
> >> >- System needs a big amount of memory for page caching.
> >>
> >> I also had to tune vm.max_map_count much higher.
> >>
> >> >
> >> > So, after creating 100k with the required setup (system+JVM) but
> seeing
> >> > problems at 70k, I feel safe by not creating more than 50k, and always
> >> will
> >> > have Zookeeper as my first suspect if a problem comes. I think with
> >> proper
> >> > resources (memory) and system setup (open file descriptors), you don't
> >> have
> >> > any real limitation regarding partitions.
> >>
> >> I can confirm the 50k number. After about 40k-45k topics, I start
> >> seeing slow down in consume offset commit latencies that eclipse 50ms.
> >> Hopefully KIP-227 will alleviate that problem and leave ZK as the last
> >> remaining hurdle. I'm testing with 3x replication per partition and 10
> >> brokers.
> >>
> >> > By the way, I used long topic names (about 30 characters), which can
> be
> >> > important for ZK.
> >>
> >> I'd like to learn more about this, are you saying that long topic
> >> names would improve ZK performance because that relates to bumping up
> >> jute.maxbuffer?
> >>
> >> > Hope

Log Compaction configuration over all topics in cluster

2018-05-09 Thread David Espinosa
Hi all,
I would like to apply log compaction configuration to every topic in my
Kafka cluster, as default properties. These configuration properties are:

   - cleanup.policy
   - delete.retention.ms
   - segment.ms
   - min.cleanable.dirty.ratio

I have tried placing them in the server.properties file, but they are not
applied; I could only apply them when creating the topic with the kafka-topics
command.
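In the meantime, the only working option I have is applying them per topic at
creation time; for reference, a sketch of doing that programmatically with the
Java AdminClient (kafka-clients 0.11+; topic name, replication and values are
just examples):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative

        Map<String, String> topicConfigs = new HashMap<>();
        topicConfigs.put("cleanup.policy", "compact");
        topicConfigs.put("delete.retention.ms", "86400000");
        topicConfigs.put("segment.ms", "604800000");
        topicConfigs.put("min.cleanable.dirty.ratio", "0.1");

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("my-topic", 1, (short) 3).configs(topicConfigs);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}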

Does somebody know how to apply those properties as defaults for any topic
created?

Thanks in advance,
David.


How to set log compaction policies at cluster level

2018-06-14 Thread David Espinosa
Hi all,

I would like to apply log compaction configuration to every topic in my
Kafka cluster, as default properties. These configuration properties are:

   - cleanup.policy
   - delete.retention.ms
   - segment.ms
   - min.cleanable.dirty.ratio

I have tried placing them in the server.properties file, but they are not
applied; I could only apply them when creating the topic with the kafka-topics
command.

Does somebody know how to apply those properties as defaults for any topic
created?

Thanks in advance,
David.


Is it possible to have an infinite delete.retention.ms?

2018-06-25 Thread David Espinosa
Hi all,

I would like to set up a compaction policy on a topic where a message can be
deleted (GDPR...) using a tombstone with the same key as the message to be
removed. My problem is that I would also like to use empty-payload messages
to identify that an entity has been deleted, but those messages
get deleted by compaction, so my whole data integrity goes to trash.

My idea is to modify the delete.retention.ms property so that
tombstones are not deleted. My questions are:

   - Can I set an infinite value for delete.retention.ms the same way I
   can do it with retention.ms=-1?
   - Does this property (delete.retention.ms) have any consequence on the
   compaction interval?
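For what it's worth, one option I'm considering is pushing a very large
delete.retention.ms onto the topic (Long.MAX_VALUE rather than -1, since I'm
not sure -1 is accepted for this property). A sketch with the Java AdminClient
(topic name and broker address are placeholders):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class KeepTombstones {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic =
                new ConfigResource(ConfigResource.Type.TOPIC, "my-compacted-topic");
            Config config = new Config(Collections.singletonList(
                new ConfigEntry("delete.retention.ms", String.valueOf(Long.MAX_VALUE))));
            // Note: the (non-incremental) alterConfigs call replaces the topic's existing overrides.
            admin.alterConfigs(Collections.singletonMap(topic, config)).all().get();
        }
    }
}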

Thanks in advance,
David


How to properly set infinite retention

2018-07-30 Thread David Espinosa
Hi all,
I would like to set infinite retention for all topics created in the
cluster by default.
I have tried with:

*log.retention.ms=-1* in *server.properties*

But messages get deleted after approximately 10 days.

Which broker-level configuration should I use for infinite retention?

Thanks in advance,
David


Re: How to properly set infinite retention

2018-07-30 Thread David Espinosa
Hi, thanks a lot for the reply.

The thing is that I need compaction to delete some messages (for GDPR
purposes), and for that I need the log cleaner to be enabled (with
cleanup.policy=compact).

David


On Mon, 30 Jul 2018 at 11:27, M. Manna () wrote:

> I believe you can simply disable log cleaner.
>
> On Mon, 30 Jul 2018, 10:07 David Espinosa,  wrote:
>
> > Hi all,
> > I would like to set infinite retention for all topics created in the
> > cluster by default.
> > I have tried with:
> >
> > *log.retention.ms <http://log.retention.ms>=-1* at *server.properties*
> >
> > But messages get deleted approx after 10 days.
> >
> > Which configuration at broker level should I use for infinite retention?
> >
> > Thanks in advance,
> > David
> >
>


Re: How to properly set infinite retention

2018-07-30 Thread David Espinosa
Thanks a lot! I will try that!
David

On Mon, 30 Jul 2018 at 13:26, Kamal Chandraprakash (<
kamal.chandraprak...@gmail.com>) wrote:

> log.retention.ms = 9223372036854775807 (Long.MAX_VALUE)
>
>
> On Mon, Jul 30, 2018 at 3:04 PM David Espinosa  wrote:
>
> > Hi thanks a lot for the reply.
> >
> > The thing is that I need compaction to delete some messages (for GDPR
> > purposes), and for that I need the log cleaner to be enabled (with
> > policy=compact).
> >
> > David
> >
> >
> > On Mon, 30 Jul 2018 at 11:27, M. Manna () wrote:
> >
> > > I believe you can simply disable log cleaner.
> > >
> > > On Mon, 30 Jul 2018, 10:07 David Espinosa,  wrote:
> > >
> > > > Hi all,
> > > > I would like to set infinite retention for all topics created in the
> > > > cluster by default.
> > > > I have tried with:
> > > >
> > > > *log.retention.ms <http://log.retention.ms>=-1* at
> *server.properties*
> > > >
> > > > But messages get deleted approx after 10 days.
> > > >
> > > > Which configuration at broker level should I use for infinite
> > retention?
> > > >
> > > > Thanks in advance,
> > > > David
> > > >
> > >
> >
>


Include Keys in FileStreamSink connector

2018-08-08 Thread David Espinosa
Hi all,

I'm trying to back up the whole content of an Avro topic into a file, and
later restore the Kafka topic from the file using Kafka Connect. I'm
using the AvroConverter for both key and value, but the key is not included in
the dump file.

Does somebody know how to include keys using Kafka Connect? (I'm using Kafka
Connect in distributed mode.)

By the way, this is the request that I'm using to create the connector:

{
 "name": "mytopic-connector",
 "config": {
   "connector.class":
"org.apache.kafka.connect.file.FileStreamSourceConnector",
   "tasks.max": "1",
   "topic": "mytopic",
   "file": "/data/backups/mytopic.txt",
   "key.converter": "io.confluent.connect.avro.AvroConverter",
   "value.converter": "io.confluent.connect.avro.AvroConverter",
   "key.converter.schema.registry.url": "http://schema-registry:8081";,
   "value.converter.schema.registry.url": "http://schema-registry:8081";
 }
} (editado)


Best way to read all messages and then close

2018-09-14 Thread David Espinosa
Hi all,

Although the usage of Kafka is stream oriented, for a concrete use case I
need to read all the messages existing in a topic and, once all of them have
been read, close the consumer.

What's the best way or framework for doing this?

Thanks in advance,
David,


Re: Best way to read all messages and then close

2018-09-17 Thread David Espinosa
Thank you all for your responses!
I also asked this on the Confluent community Slack (
https://confluentcommunity.slack.com) and I got this approach:

   1. Query the partition's high watermark offset
   2. Set the consumer to consume from the beginning
   3. Break out when you've reached the high offset

I still have some doubts regarding the implementation, but it seems a good
approach (I'm using a single partition, so a single loop per topic would be
enough).
What do you think?
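Concretely, this is roughly what I have in mind (a sketch assuming a single
partition, string records and placeholder names):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReadAllThenClose {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative
        props.put("group.id", "read-all-once");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        TopicPartition tp = new TopicPartition("my-topic", 0);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            // 1. Query the partition's high watermark before starting.
            long highWatermark = consumer.endOffsets(Collections.singletonList(tp)).get(tp);
            // 2. Start from the beginning.
            consumer.seekToBeginning(Collections.singletonList(tp));
            // 3. Poll until the position reaches the high watermark, then close.
            while (consumer.position(tp) < highWatermark) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.key() + " -> " + record.value());
                }
            }
        } // messages appended after the watermark was read are intentionally ignored
    }
}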

On Sat, 15 Sep 2018 at 0:30, John Roesler () wrote:

> Specifically, you can monitor the "records-lag-max" (
> https://docs.confluent.io/current/kafka/monitoring.html#fetch-metrics)
> metric. (or the more granular one per partition).
>
> Once this metric goes to 0, you know that you've caught up with the tail of
> the log.
>
> Hope this helps,
> -John
>
> On Fri, Sep 14, 2018 at 2:02 PM Matthias J. Sax 
> wrote:
>
> > Using Kafka Streams this is a little tricky.
> >
> > The API itself has no built-in mechanism to do this. You would need to
> > monitor the lag of the application, and if the lag is zero (assuming you
> > don't write new data into the topic in parallel), terminate the
> > application.
> >
> >
> > -Matthias
> >
> > On 9/14/18 4:19 AM, Henning Røigaard-Petersen wrote:
> > > Spin up a consumer, subscribe to EOF events, assign all partitions from
> > the beginning, and keep polling until all partitions has reached EOF.
> > > Though, if you have concurrent writers, new messages may be appended
> > after you observe EOF on a partition, so you are never guaranteed to have
> > read all messages at the time you choose to close the consumer.
> > >
> > > /Henning Røigaard-Petersen
> > >
> > > -Original Message-
> > > From: David Espinosa 
> > > Sent: 14. september 2018 09:46
> > > To: users@kafka.apache.org
> > > Subject: Best way for reading all messages and close
> > >
> > > Hi all,
> > >
> > > Although the usage of Kafka is stream oriented, for a concrete use case
> > I need to read all the messages existing in a topic and once all them has
> > been read then closing the consumer.
> > >
> > > What's the best way or framework for doing this?
> > >
> > > Thanks in advance,
> > > David,
> > >
> >
> >
>


Partitions as a mechanism to keep multitenant data segregated

2017-05-23 Thread David Espinosa
Hi,

In order to keep the data from different customers in our application
physically separated, we are using a custom partitioner to drive messages to
a concrete partition of a topic. We know that we are losing parallelism
per topic this way, but our multitenancy requirements are higher
than our throughput requirements.
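For context, the custom partitioner is essentially the following (a simplified
sketch; our real implementation keeps a lookup table from customer id to its
dedicated partition, a hash is shown here only to keep it short):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class CustomerPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        // The record key carries the customer id, so each customer always lands
        // on the same partition of the topic.
        int numPartitions = cluster.partitionsForTopic(topic).size();
        return (key.toString().hashCode() & 0x7fffffff) % numPartitions;
    }

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public void close() { }
}

It is registered on the producer via the partitioner.class property.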

So, in order to increase the number of customers working on a cluster, we
increase the number of partitions of a topic dynamically as a new
customer arrives, using the Kafka AdminUtils.
Our problem arises with the new Kafka consumer when a new partition
is added to the topic: the consumer doesn't get updated with the new
partition, and therefore messages driven into that new partition never
arrive at this consumer unless we reload the consumer itself. What was
surprising was to see that with the old consumer (the one that talks
to ZooKeeper), a consumer does get messages from a newly added partition.

Is there a way to emulate the old consumer's behaviour in the new consumer
when new partitions are added?

Thanks in advance,
David


Re: Partitions as a mechanism to keep multitenant data segregated

2017-05-23 Thread David Espinosa
Thanks for the answer, Tom.
Indeed, I will not have more than 10 or 20 customers per cluster, so that's
also the maximum number of partitions per topic.
Still a bad idea?

2017-05-23 16:48 GMT+02:00 Tom Crayford :

> Hi there,
>
> I don't know about the consumer, but I'd *strongly* recommend not designing
> your application around this. Kafka has severe and notable stability
> concerns with large numbers of partitions, and requiring "one partition per
> customer" is going to be limiting, unless you only ever expect to have
> *very* small customer numbers (hundreds at most, ever). Instead, use a hash
> function and a key, as recommended to land customers on the same partition.
>
> Thanks
>
> Tom Crayford
> Heroku Kafka
>
> On Tue, May 23, 2017 at 9:46 AM, David Espinosa  wrote:
>
> > Hi,
> >
> > In order to keep separated (physically) the data from different customers
> > in our application, we are using a custom partitioner to drive messages
> to
> > a concrete partition of a topic. We know that we are loosing parallelism
> > per topic this way, but our requirements regarding multitenancy are
> higher
> > than our throughput requirements.
> >
> > So, in order to increase the number of customers working on a cluster, we
> > are increasing the number of partitions dinamically per topic as the new
> > customer arrives using kafka AdminUtilities.
> > Our problem arrives when using the new kafka consumer and a new partition
> > is added into the topic, as this consumer doesn't get updated with the
> "new
> > partition" and therefore messages driven into that new partition never
> > arrives to this consumer unless we reload the consumer itself. What was
> > surprising was to check that using the old consumer (configured to deal
> > with Zookeeper), a consumer does get messages from a new added partition.
> >
> > Is there a way to emulate the old consumer behaviour when new partitions
> > are added in the new consumer?
> >
> > Thanks in advance,
> > David
> >
>


Bridging ActiveMQ and Kafka

2017-05-25 Thread David Espinosa
Hi All,

I want to migrate our system, which currently uses ActiveMQ, to Kafka. In
order to do a gradual migration, I would like to create a bridge between
ActiveMQ and Kafka, so that a producer and a consumer can work on
different message brokers until the migration is complete and all my
services have been migrated to Kafka.

I have seen some GitHub projects like kalinka (
https://github.com/dcsolutions/kalinka) and have also seen that I could do it
with Apache Camel.
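In case it clarifies what I mean, a minimal sketch of such a bridge in Camel's
Java DSL (assuming Camel 2.x with the activemq-camel and camel-kafka
components; queue, topic and host names are placeholders):

import org.apache.activemq.camel.component.ActiveMQComponent;
import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class ActiveMqKafkaBridge {
    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();
        // Point the "activemq" component at the existing broker.
        context.addComponent("activemq",
            ActiveMQComponent.activeMQComponent("tcp://activemq-host:61616"));

        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // Forward everything arriving on the ActiveMQ queue to a Kafka topic.
                from("activemq:queue:orders")
                    .to("kafka:orders?brokers=kafka-host:9092");
            }
        });

        context.start();
        Thread.currentThread().join(); // keep the bridge running
    }
}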

I would like to ask about any experience or advice you can share on
bridging ActiveMQ and Kafka.

Thanks in advance,
David.