@Slim
I have updated the Google Doc. I think some important information was lost
in the previous image. Sorry about that.

Slim Bouguerra <bs...@apache.org> wrote on Fri, Apr 26, 2019 at 11:43 PM:

> Thanks, I can see the document now.
> If my understanding is correct, you want to achieve the following:
> A Hive user will submit a SQL query like select * from pulsar_table where
> column_used_to_partition_pulsar_topic = 'value'
> and you want to scan only the Pulsar topics that match that filter.
> Assuming my understanding is correct, you first need to clarify: is the
> partition column used by Pulsar user defined? Does it change over time?
> Can you list partitions with some RPC calls to Pulsar?
> But in a nutshell, what you are trying to do is very close to how we
> filter partitions in Kafka.
> This is how we do it:
> 1/ Intercept the filter expression at the split generation phase:
>
> https://github.com/apache/hive/blob/master/kafka-handler/src/java/org/apache/hadoop/hive/kafka/KafkaInputFormat.java#L120
> 2/ If the filter has predicates on the Kafka partition, offset, or
> timestamp columns, then extract that predicate (e.g. partition IN (1, 3, 4))
> to be used for split generation:
>
> https://github.com/apache/hive/blob/master/kafka-handler/src/java/org/apache/hadoop/hive/kafka/KafkaScanTrimmer.java#L116
>
> 3/ Let the Hive splits do the rest of the work (a rough sketch of 1/ and
> 2/ follows below).
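>
> (For illustration only, a rough, untested sketch of what steps 1/ and 2/
> might look like on the Pulsar side. PulsarTopicPruner, listTopics() and
> trimTopics() are made-up placeholders, not existing Hive or Pulsar APIs;
> only the filter interception mirrors what the Kafka handler does.)
>
> import java.util.List;
> import org.apache.hadoop.hive.ql.exec.SerializationUtilities;
> import org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc;
> import org.apache.hadoop.hive.ql.plan.TableScanDesc;
> import org.apache.hadoop.mapred.JobConf;
>
> public class PulsarTopicPruner {
>
>   // Returns the topics the table scan actually needs to read; the input
>   // format would then create one split per surviving topic.
>   public List<String> topicsToScan(JobConf conf) {
>     // All candidate topics backing the table (e.g. listed via the Pulsar
>     // admin API).
>     List<String> topics = listTopics(conf);
>
>     // 1/ Intercept the filter expression Hive pushes down at split
>     //    generation time.
>     String serialized = conf.get(TableScanDesc.FILTER_EXPR_CONF_STR);
>     if (serialized != null) {
>       ExprNodeGenericFuncDesc filter =
>           SerializationUtilities.deserializeExpression(serialized);
>       // 2/ If the filter has predicates on the column used to choose the
>       //    topic, drop the topics that cannot match (analogous to
>       //    KafkaScanTrimmer).
>       topics = trimTopics(topics, filter);
>     }
>     return topics;
>   }
>
>   // Placeholder helpers, only here so the sketch reads end to end.
>   private List<String> listTopics(JobConf conf) {
>     throw new UnsupportedOperationException("topic discovery not shown");
>   }
>
>   private List<String> trimTopics(List<String> topics,
>       ExprNodeGenericFuncDesc filter) {
>     return topics; // a real implementation would walk the expression tree
>   }
> }
>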
>
> Hope that answers your question.
>
>
> On Fri, Apr 26, 2019 at 2:29 AM PengHui Li <codelipeng...@gmail.com>
> wrote:
>
> > @Slim I have copied the image to Google Docs; hope it works there.
> >
> >
> >
> https://docs.google.com/document/d/1K_JE_a47bu1I7va1GwUK36vdOKZGqGFWTt4qPuTRShg/edit?usp=sharing
> >
> > Slim Bouguerra <slim.bougue...@gmail.com> wrote on Fri, Apr 26, 2019 at 12:13 AM:
> >
> > > Hey, sorry, your image is not showing. Not sure why.
> > >
> > > On Wed, Apr 24, 2019 at 6:53 AM PengHui Li <codelipeng...@gmail.com>
> > > wrote:
> > >
> > > > Sorry for taking so long to reply.
> > > >
> > > > I drew a simple picture; I hope it helps with the question.
> > > > The main point is to avoid reading messages from unnecessary topics
> > > > when reading data from a partitioned Hive table.
> > > > [image: image.png]
> > > >
> > > > Slim Bouguerra <bs...@apache.org> wrote on Sat, Apr 20, 2019 at 12:16 AM:
> > > >
> > > >> Hi, I am not sure I am getting the question 100%. Can you share a
> > > >> design doc or outline the big picture in your mind? FYI, I am not
> > > >> very familiar with Pulsar, so please account for that :D
> > > >> But let me point out that Hive does not have the notion of
> > > >> partitions for tables backed by storage handlers; that is because,
> > > >> by definition, the table is not stored by Hive, so Hive cannot
> > > >> control the layout.
> > > >>
> > > >> Will be happy to look at any POC.
> > > >> Looking forward to hearing from you.
> > > >>
> > > >> On Wed, Apr 17, 2019 at 7:25 PM PengHui Li <codelipeng...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > @Slim
> > > >> >
> > > >> > I want to use a different Pulsar topic to store the data for each
> > > >> > Hive partition. Is there a way to do this, or does this idea make
> > > >> > sense?
> > > >> >
> > > >> > Can you give me some advice?
> > > >> >
> > > >> >
> > > >> > 李鹏辉gmail <codelipeng...@gmail.com> wrote on Mon, Apr 15, 2019 at 6:22 PM:
> > > >> >
> > > >> > > I already have a simple implementation that can write and query
> > > >> > > data. I read the design document and the implementation of the
> > > >> > > Kafka handler. Table partitioning there works somewhat
> > > >> > > differently from what I have in mind.
> > > >> > >
> > > >> > > I want Hive table partition locations to map to Pulsar topics:
> > > >> > > different table partitions correspond to different topics.
> > > >> > > But I can't get the partition the data will be written to.
> > > >> > >
> > > >> > > I know the drawback of doing this is that it loses the ordering
> > > >> > > of the stream data itself, but it can reduce unnecessary data
> > > >> > > reads when querying.
> > > >> > >
> > > >> > > Best regards
> > > >> > >
> > > >> > > Penghui
> > > >> > > Beijing, China
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > > On Apr 13, 2019, at 21:43, Jörn Franke <jornfra...@gmail.com> wrote:
> > > >> > > >
> > > >> > > > I think you need to develop a custom Hive SerDe + a custom
> > > >> > > > Hadoop InputFormat + a custom Hive OutputFormat.
> > > >> > > >
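> > > >> > > > A very rough, untested sketch of that wiring, purely for
> > > >> > > > illustration (the Pulsar* classes are placeholders that would
> > > >> > > > have to be written; the handler itself extends Hive's
> > > >> > > > DefaultStorageHandler):
> > > >> > > >
> > > >> > > > import org.apache.hadoop.hive.ql.metadata.DefaultStorageHandler;
> > > >> > > > import org.apache.hadoop.hive.serde2.AbstractSerDe;
> > > >> > > > import org.apache.hadoop.mapred.InputFormat;
> > > >> > > > import org.apache.hadoop.mapred.OutputFormat;
> > > >> > > >
> > > >> > > > public class PulsarStorageHandler extends DefaultStorageHandler {
> > > >> > > >   @Override
> > > >> > > >   public Class<? extends InputFormat> getInputFormatClass() {
> > > >> > > >     return PulsarInputFormat.class;  // custom InputFormat
> > > >> > > >   }
> > > >> > > >
> > > >> > > >   @Override
> > > >> > > >   public Class<? extends OutputFormat> getOutputFormatClass() {
> > > >> > > >     return PulsarOutputFormat.class; // custom OutputFormat
> > > >> > > >   }
> > > >> > > >
> > > >> > > >   @Override
> > > >> > > >   public Class<? extends AbstractSerDe> getSerDeClass() {
> > > >> > > >     return PulsarSerDe.class;        // custom SerDe
> > > >> > > >   }
> > > >> > > > }
> > > >> > > >
> > > >> > > > The table DDL would then reference the handler's fully
> > > >> > > > qualified class name via STORED BY.
> > > >> > > >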
> > > >> > > >> On 12.04.2019 at 17:35, 李鹏辉gmail <codelipeng...@gmail.com> wrote:
> > > >> > > >>
> > > >> > > >> Hi guys,
> > > >> > > >>
> > > >> > > >> I have been working on an integration of Hive and Pulsar
> > > >> > > >> recently, but I have run into some problems and hope to get
> > > >> > > >> help here.
> > > >> > > >>
> > > >> > > >> First of all, let me briefly describe the motivation.
> > > >> > > >>
> > > >> > > >> Pulsar can be used as an infinite stream keeping both historic
> > > >> > > >> and streaming data, so we want to use Pulsar as a storage
> > > >> > > >> extension for Hive.
> > > >> > > >> That way Hive can naturally read the data in Pulsar, and can
> > > >> > > >> also write data into Pulsar.
> > > >> > > >> We would benefit from the same data serving both interactive
> > > >> > > >> queries and streaming.
> > > >> > > >>
> > > >> > > >> As an improvement, supporting data partitioning can make
> > > >> > > >> queries more efficient (e.g. partitioning by date or any other
> > > >> > > >> field).
> > > >> > > >>
> > > >> > > >> But:
> > > >> > > >>
> > > >> > > >> - How do I get the Hive table's partition definition?
> > > >> > > >> - When a user inserts data into the Hive table, how do I find
> > > >> > > >> out which partition the data should be stored in?
> > > >> > > >> - When a user selects data from the Hive table, how do I
> > > >> > > >> determine which partition the data is in?
> > > >> > > >>
> > > >> > > >> If Hive already exposes a mechanism to support this, please
> > > >> > > >> show me how to use it.
> > > >> > > >>
> > > >> > > >> Best regards
> > > >> > > >>
> > > >> > > >> Penghui
> > > >> > > >> Beijing, China
> > > >> > > >>
> > > >> > > >>
> > > >> > > >>
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > > --
> > > B-Slim
> > > _______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______
> > >
> >
>
