Re: Hive Pulsar Integration

Slim Bouguerra Fri, 26 Apr 2019 08:43:04 -0700

Thanks, I can see the document now.
If my understanding is correct, you want to achieve the following:
A Hive user will submit a SQL query like select * from pulsar_table where
column_used_to_partition_pulsar_topic = 'value'
and you want to scan only the Pulsar topics that match that filter.
Assuming my understanding is correct, you first need to clarify a few things:
is the partition column used by Pulsar user-defined? Does it change over time?
Can you list the partitions with some RPC calls to Pulsar?
But in a nutshell, what you are trying to do is very close to how we filter
partitions in Kafka.
This is how we do it:
1/ intercept the filter expression at the split generation phase.
https://github.com/apache/hive/blob/master/kafka-handler/src/java/org/apache/hadoop/hive/kafka/KafkaInputFormat.java#L120
2/ If the filter has predicates on kafka_Partition, kafka_offsets, or
kafka_timestamp, then extract those predicates (e.g. partition IN 1,3,4) to be
used for split generation.
(https://github.com/apache/hive/blob/master/kafka-handler/src/java/org/apache/hadoop/hive/kafka/KafkaScanTrimmer.java#L116)
3/ Let the Hive split do the rest of the work
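
To make the three steps above concrete, here is a minimal sketch in plain Java.
It is not the Kafka handler's code and uses no Hive or Pulsar APIs: the class
TopicPruner, the InPredicate filter model, and the mapping from partition value
to topic name are all hypothetical, and it assumes the filter has already been
flattened into a list of AND-ed "column IN (values)" conjuncts.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: none of these names come from Hive or Pulsar.
final class TopicPruner {

    // A deliberately tiny filter model: one "column IN (values)" conjunct.
    // A real handler would walk Hive's deserialized filter expression instead.
    static final class InPredicate {
        final String column;
        final Set<String> values;

        InPredicate(String column, Set<String> values) {
            this.column = column;
            this.values = values;
        }
    }

    // Step 2: keep only the predicates on the partition column.
    // Returns empty when the filter does not restrict that column at all.
    static Optional<Set<String>> allowedPartitionValues(
            List<InPredicate> conjuncts, String partitionColumn) {
        Set<String> allowed = null;
        for (InPredicate p : conjuncts) {
            if (!p.column.equals(partitionColumn)) {
                continue; // other predicates are left for Hive to evaluate (step 3)
            }
            if (allowed == null) {
                allowed = new HashSet<>(p.values);
            } else {
                allowed.retainAll(p.values); // AND of two IN lists is their intersection
            }
        }
        return Optional.ofNullable(allowed);
    }

    // Steps 1 and 2 at split generation: pick the topics that need scanning.
    // topicByPartitionValue is an assumed lookup from partition value to topic name.
    static List<String> topicsToScan(
            Map<String, String> topicByPartitionValue,
            List<InPredicate> conjuncts,
            String partitionColumn) {
        Optional<Set<String>> allowed = allowedPartitionValues(conjuncts, partitionColumn);
        List<String> topics = new ArrayList<>();
        for (Map.Entry<String, String> e : topicByPartitionValue.entrySet()) {
            // With no restriction on the partition column, every topic is scanned.
            if (!allowed.isPresent() || allowed.get().contains(e.getKey())) {
                topics.add(e.getValue());
            }
        }
        return topics;
    }
}
```

With topics keyed by a date column, a filter such as dt IN ('2019-04-26') AND
user IN ('bob') would select only the topic for 2019-04-26, and the predicate
on user would still be applied by Hive to the rows that come back.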


Hope that answers your question.


On Fri, Apr 26, 2019 at 2:29 AM PengHui Li <codelipeng...@gmail.com> wrote:

> @Slim I have copied the image to Google Docs; I hope it works now.
>
>
> https://docs.google.com/document/d/1K_JE_a47bu1I7va1GwUK36vdOKZGqGFWTt4qPuTRShg/edit?usp=sharing
>
> On Fri, Apr 26, 2019 at 12:13 AM Slim Bouguerra <slim.bougue...@gmail.com> wrote:
>
> > Hey sorry your image is not showing? Not sure why.
> >
> > On Wed, Apr 24, 2019 at 6:53 AM PengHui Li <codelipeng...@gmail.com>
> > wrote:
> >
> > > Sorry for taking so long to reply.
> > >
> > > I drew a simple picture; I hope it helps with the question.
> > > The main point is to avoid reading messages from unnecessary topics
> > > when reading data from a partitioned Hive table.
> > > [image: image.png]
> > >
> > > On Sat, Apr 20, 2019 at 12:16 AM Slim Bouguerra <bs...@apache.org> wrote:
> > >
> > >> Hi, I am not sure I am getting the question 100%. Can you share a design
> > >> doc or outline the big picture in your mind? FYI, I am not very familiar
> > >> with Pulsar, so please account for that :D
> > >> But let me point out that Hive does not have the notion of partitions for
> > >> tables backed by storage handlers. That is because, by definition, the
> > >> table is not stored by Hive, so Hive cannot control the layout.
> > >>
> > >> I will be happy to look at any POC.
> > >> Looking forward to hearing from you.
> > >>
> > >> On Wed, Apr 17, 2019 at 7:25 PM PengHui Li <codelipeng...@gmail.com>
> > >> wrote:
> > >>
> > >> > @Slim
> > >> >
> > >> > I want to use a different Pulsar topic to store the data for each Hive
> > >> > partition. Is there a way to do this, or does this idea make sense?
> > >> >
> > >> > Can you give me some advice?
> > >> >
> > >> >
> > >> > On Mon, Apr 15, 2019 at 6:22 PM 李鹏辉gmail <codelipeng...@gmail.com> wrote:
> > >> >
> > >> > > I already have a simple implementation that can write and query data.
> > >> > > I read the design document and implementation for Kafka.
> > >> > > Its notion of table partitions differs somewhat from what I have in mind.
> > >> > >
> > >> > > I want Hive table partition locations to map to Pulsar topics, with
> > >> > > different table partitions corresponding to different topics.
> > >> > > But I can't get the partition where the data will be written.
> > >> > >
> > >> > > I know the drawback of doing this is that it loses the order of the
> > >> > > stream data itself.
> > >> > > But it can reduce unnecessary data reading when querying.
> > >> > >
> > >> > > Best Regards
> > >> > >
> > >> > > Penghui
> > >> > > Beijing,China
> > >> > >
> > >> > >
> > >> > >
> > >> > > > On Apr 13, 2019, at 21:43, Jörn Franke <jornfra...@gmail.com> wrote:
> > >> > > >
> > >> > > > I think you need to develop a custom Hive SerDe + custom
> > >> > > > Hadoop InputFormat + custom Hive OutputFormat
> > >> > > >
> > >> > > >> On 12.04.2019 at 17:35, 李鹏辉gmail <codelipeng...@gmail.com> wrote:
> > >> > > >>
> > >> > > >> Hi guys,
> > >> > > >>
> > >> > > >> I’ve been working on integrating Hive and Pulsar recently, but I have
> > >> > > >> run into some problems and hope to get help here.
> > >> > > >>
> > >> > > >> First of all, let me briefly describe the motivation.
> > >> > > >>
> > >> > > >> Pulsar can be used as an infinite stream that keeps both historic
> > >> > > >> and streaming data, so we want to use Pulsar as a storage extension
> > >> > > >> for Hive.
> > >> > > >> In this way, Hive can read the data in Pulsar naturally, and can
> > >> > > >> also write data into Pulsar.
> > >> > > >> We will benefit from having the same data serve both interactive
> > >> > > >> queries and streaming.
> > >> > > >>
> > >> > > >> As an improvement, supporting data partitioning can make queries
> > >> > > >> more efficient (e.g. partition by date or any other field).
> > >> > > >>
> > >> > > >> But
> > >> > > >>
> > >> > > >> - How do we get the Hive table partition definition?
> > >> > > >> - When a user inserts data into a Hive table, how do we get the
> > >> > > >>   partition the data should be stored in?
> > >> > > >> - When a user selects data from a Hive table, how do we determine
> > >> > > >>   which partition the data is in?
> > >> > > >>
> > >> > > >> If Hive already exposes some mechanism to support this, please show
> > >> > > >> me how to use it.
> > >> > > >>
> > >> > > >> Best regards
> > >> > > >>
> > >> > > >> Penghui
> > >> > > >> Beijing, China
> > >> > > >>
> > >> > > >>
> > >> > > >>
> > >> > >
> > >> > >
> > >> >
> > >>
> > > --
> >
> > B-Slim
> > _______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______
> >
>
