Hi,

Here is where you can add the logic that decides which topic to send to:
https://github.com/apache/hive/blob/98e2a3582d5d239de6744a016d4f481312c43df2/kafka-handler/src/java/org/apache/hadoop/hive/kafka/TransactionalKafkaWriter.java#L142
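To make the routing idea concrete, here is a minimal sketch; it is not code from the Hive kafka-handler. It assumes a hypothetical `TopicClient` interface and a `<baseTopic>-<partitionValue>` naming scheme (neither comes from Hive, Kafka, or Pulsar), and it closes the least recently used client once more than `maxOpen` are open.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for a per-topic producer; not a Hive, Kafka, or Pulsar type. */
interface TopicClient extends AutoCloseable {
    void send(byte[] payload);

    @Override
    void close();
}

/** Routes each record to a topic derived from its partition value. */
final class PartitionTopicRouter implements AutoCloseable {
    private final String baseTopic;
    private final Function<String, TopicClient> clientFactory;
    private final LinkedHashMap<String, TopicClient> openClients;

    PartitionTopicRouter(String baseTopic, int maxOpen, Function<String, TopicClient> clientFactory) {
        if (maxOpen < 1) {
            throw new IllegalArgumentException("maxOpen must be at least 1");
        }
        this.baseTopic = baseTopic;
        this.clientFactory = clientFactory;
        // Access-ordered map: when the limit is exceeded, the least recently
        // used client is closed and dropped.
        this.openClients = new LinkedHashMap<String, TopicClient>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, TopicClient> eldest) {
                if (size() > maxOpen) {
                    eldest.getValue().close();
                    return true;
                }
                return false;
            }
        };
    }

    /** Assumed naming scheme: baseTopic + "-" + partitionValue. */
    String topicFor(String partitionValue) {
        return baseTopic + "-" + partitionValue;
    }

    /** Sends the payload to the topic that belongs to the given partition value. */
    void write(String partitionValue, byte[] payload) {
        String topic = topicFor(partitionValue);
        TopicClient client = openClients.get(topic);
        if (client == null) {
            client = clientFactory.apply(topic);
            openClients.put(topic, client); // may evict and close the eldest client
        }
        client.send(payload);
    }

    int openClientCount() {
        return openClients.size();
    }

    @Override
    public void close() {
        openClients.values().forEach(TopicClient::close);
        openClients.clear();
    }
}
```

A real writer would also have to flush or commit a client before evicting it; the sketch leaves that out.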
Keep in mind that you might have a lot of topics, and thus many open clients at the same time.

On Sun, Apr 28, 2019 at 12:49 AM PengHui Li <codelipeng...@gmail.com> wrote:

> @Slim
> I have updated the Google doc. I think I lost important information in the
> previous image. Sorry for that.
>
> Slim Bouguerra <bs...@apache.org> wrote on Fri, Apr 26, 2019 at 11:43 PM:
>
> > Thanks, I can see the document now.
> > If my understanding is correct, you want to achieve the following:
> > A Hive user will submit a SQL query like
> > select * from pulsar_table where column_used_to_partition_pulsar_topic = 'value'
> > and you want to scan only the Pulsar topic that matches that filter.
> > Assuming my understanding is correct, you first need to clarify: is the
> > partition column used by Pulsar user-defined? Does it change over time?
> > Can you list partitions with some RPC calls to Pulsar?
> > But in a nutshell, what you are trying to do is very close to how we filter
> > partitions in Kafka.
> > This is how we do it:
> > 1/ Intercept the filter expression at the split generation phase.
> > https://github.com/apache/hive/blob/master/kafka-handler/src/java/org/apache/hadoop/hive/kafka/KafkaInputFormat.java#L120
> > 2/ IF the filter has predicates on kafka_Partition, kafka_offsets or
> > kafka_timestamp, THEN extract that predicate (partition IN 1,3,4) to be
> > used for split generation.
> > (https://github.com/apache/hive/blob/master/kafka-handler/src/java/org/apache/hadoop/hive/kafka/KafkaScanTrimmer.java#L116)
> > 3/ Let the Hive split do the rest of the work.
> >
> > Hope that answers your question.
> >
> > On Fri, Apr 26, 2019 at 2:29 AM PengHui Li <codelipeng...@gmail.com> wrote:
> >
> > > @Slim I have copied the image to Google Docs; I hope it works.
> > > https://docs.google.com/document/d/1K_JE_a47bu1I7va1GwUK36vdOKZGqGFWTt4qPuTRShg/edit?usp=sharing
> > >
> > > Slim Bouguerra <slim.bougue...@gmail.com> wrote on Fri, Apr 26, 2019 at 12:13 AM:
> > >
> > > > Hey, sorry, your image is not showing. Not sure why.
> > > >
> > > > On Wed, Apr 24, 2019 at 6:53 AM PengHui Li <codelipeng...@gmail.com> wrote:
> > > >
> > > > > Sorry for taking so long to reply.
> > > > >
> > > > > I drew a simple picture; I hope it helps with the question.
> > > > > The main point is to avoid reading messages from unnecessary topics
> > > > > when reading data from a partitioned Hive table.
> > > > > [image: image.png]
> > > > >
> > > > > Slim Bouguerra <bs...@apache.org> wrote on Sat, Apr 20, 2019 at 12:16 AM:
> > > > >
> > > > >> Hi, I am not sure I am getting the question 100%. Can you share a design doc
> > > > >> or outline the big picture you have in mind? FYI, I am not very familiar with
> > > > >> Pulsar, so please account for that :D
> > > > >> But let me point out that Hive does not have the notion of partitions for
> > > > >> tables backed by storage handlers. That is because, by definition, the table
> > > > >> is not stored by Hive, so Hive cannot control the layout.
> > > > >>
> > > > >> I will be happy to look at any POC.
> > > > >> Looking forward to hearing from you.
> > > > >>
> > > > >> On Wed, Apr 17, 2019 at 7:25 PM PengHui Li <codelipeng...@gmail.com> wrote:
> > > > >>
> > > > >> > @Slim
> > > > >> >
> > > > >> > I want to use a different Pulsar topic to store the data for each Hive
> > > > >> > partition. Is there a way to do this, and does this idea make sense?
> > > > >> >
> > > > >> > Can you give me some advice?
> > > > >> >
> > > > >> > 李鹏辉gmail <codelipeng...@gmail.com> wrote on Mon, Apr 15, 2019 at 6:22 PM:
> > > > >> >
> > > > >> > > I already have a simple implementation that can write data and query data.
> > > > >> > > I read the design document and the implementation for Kafka.
> > > > >> > > Table partitioning there differs somewhat from what I have in mind.
> > > > >> > >
> > > > >> > > I want Hive table partition locations to work with Pulsar topics:
> > > > >> > > different table partitions correspond to different topics.
> > > > >> > > But I can't get the partition where the data will be written.
> > > > >> > >
> > > > >> > > I know that the drawback of doing this is that it loses the order of
> > > > >> > > the stream data itself,
> > > > >> > > but it can reduce unnecessary data reading when querying.
> > > > >> > >
> > > > >> > > Best Regards
> > > > >> > >
> > > > >> > > Penghui
> > > > >> > > Beijing, China
> > > > >> > >
> > > > >> > > > On Apr 13, 2019, at 21:43, Jörn Franke <jornfra...@gmail.com> wrote:
> > > > >> > > >
> > > > >> > > > I think you need to develop a custom Hive SerDe + custom
> > > > >> > > > Hadoop InputFormat + custom Hive OutputFormat.
> > > > >> > > >
> > > > >> > > >> On 12.04.2019 at 17:35, 李鹏辉gmail <codelipeng...@gmail.com> wrote:
> > > > >> > > >>
> > > > >> > > >> Hi guys,
> > > > >> > > >>
> > > > >> > > >> I have been working on integrating Hive and Pulsar recently, but I have
> > > > >> > > >> now run into some problems and hope to get help here.
> > > > >> > > >>
> > > > >> > > >> First of all, let me briefly describe the motivation.
> > > > >> > > >>
> > > > >> > > >> Pulsar can be used as an infinite stream that keeps both historic data
> > > > >> > > >> and streaming data, so we want to use Pulsar as a storage extension for
> > > > >> > > >> Hive.
> > > > >> > > >> In this way, Hive can read the data in Pulsar naturally, and can also
> > > > >> > > >> write data into Pulsar.
> > > > >> > > >> We will benefit from the same data providing both interactive query
> > > > >> > > >> and streaming capabilities.
> > > > >> > > >>
> > > > >> > > >> As an improvement, supporting data partitioning can make queries more
> > > > >> > > >> efficient (e.g. partition by date or any other field).
> > > > >> > > >>
> > > > >> > > >> But:
> > > > >> > > >>
> > > > >> > > >> - How do I get the Hive table partition definition?
> > > > >> > > >> - When a user inserts data into a Hive table, how do I get the partition
> > > > >> > > >> where the data should be stored?
> > > > >> > > >> - When a user selects data from a Hive table, how do I determine which
> > > > >> > > >> partition the data is in?
> > > > >> > > >>
> > > > >> > > >> If Hive already exposes some mechanism to support this, please show me
> > > > >> > > >> how to use it.
> > > > >> > > >>
> > > > >> > > >> Best regards
> > > > >> > > >>
> > > > >> > > >> Penghui
> > > > >> > > >> Beijing, China
> > > >
> > > > --
> > > > B-Slim
> > > > _______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______
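
For reference, the three filtering steps quoted earlier in this thread (intercept the filter, extract the predicate on the partition column, generate splits only for what matches) could look roughly like this for the topic-per-partition case. This is a sketch under stated assumptions, not how the Kafka handler does it: `InPredicate` and `TopicPruner` are hypothetical names, the filter is assumed to be already flattened into AND-ed `column IN (values)` predicates rather than a Hive expression tree, and the `<baseTopic>-<partitionValue>` topic naming is made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical simplified predicate: column IN (values). Equality is an IN with one value. */
final class InPredicate {
    final String column;
    final Set<String> values;

    InPredicate(String column, Set<String> values) {
        this.column = column;
        this.values = values;
    }
}

final class TopicPruner {
    /**
     * Returns the topics to scan for a filter made of AND-ed predicates.
     * Predicates on other columns are ignored here and left to normal row filtering.
     * If no predicate mentions the partition column, every known topic is returned.
     */
    static List<String> topicsToScan(List<InPredicate> conjuncts,
                                     String partitionColumn,
                                     String baseTopic,
                                     Set<String> knownPartitionValues) {
        Set<String> selected = new LinkedHashSet<>(knownPartitionValues);
        for (InPredicate predicate : conjuncts) {
            if (predicate.column.equals(partitionColumn)) {
                // Each AND-ed predicate can only narrow the candidate set.
                selected.retainAll(predicate.values);
            }
        }
        List<String> topics = new ArrayList<>();
        for (String value : selected) {
            topics.add(baseTopic + "-" + value); // assumed naming scheme
        }
        return topics;
    }
}
```

The set of known partition values has to come from somewhere, which is why the question about listing partitions through Pulsar RPC calls matters.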