Sorry for taking so long to reply.

I drew a simple picture; I hope it helps clarify the question.
The main point is to avoid reading messages from unnecessary topics
when reading data from a partitioned Hive table.
[image: image.png]
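
In case the picture does not come through, here is a rough sketch of the same idea in plain Python. It uses no Hive or Pulsar APIs. The helper names (`topic_for_partition`, `prune_topics`) and the topic naming scheme are made up for illustration and have not been checked against Pulsar's topic-name rules.

```python
from typing import Dict, List


def topic_for_partition(base_topic: str, partition: Dict[str, str]) -> str:
    """Map one Hive partition spec to one topic name (hypothetical scheme)."""
    # Sort the keys so the same partition always yields the same topic name.
    suffix = "-".join(f"{key}={value}" for key, value in sorted(partition.items()))
    return f"{base_topic}-{suffix}" if suffix else base_topic


def prune_topics(
    base_topic: str,
    partitions: List[Dict[str, str]],
    predicate: Dict[str, str],
) -> List[str]:
    """Return only the topics whose partition values match an equality filter."""
    return [
        topic_for_partition(base_topic, partition)
        for partition in partitions
        if all(partition.get(key) == value for key, value in predicate.items())
    ]


if __name__ == "__main__":
    partitions = [{"dt": "2019-04-19"}, {"dt": "2019-04-20"}]
    # A query filtered on dt = '2019-04-20' would only need to read one topic.
    print(prune_topics("persistent://public/default/events", partitions, {"dt": "2019-04-20"}))
```

With one topic per partition, a query that filters on the partition column only has to read the matching topics, at the cost of losing ordering across partitions.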

On Sat, Apr 20, 2019 at 12:16 AM, Slim Bouguerra <bs...@apache.org> wrote:

> Hi, I am not sure I am getting the question 100%. Can you share a design doc
> or outline the big picture in your mind? FYI, I am not very familiar with
> Pulsar, so please account for that :D
> But let me point out that Hive does not have the notion of partitions for
> tables backed by storage handlers. That is because, by definition, the table
> is not stored by Hive, so Hive cannot control the layout.
>
> I will be happy to look at any POC.
> Looking forward to hearing from you.
>
> On Wed, Apr 17, 2019 at 7:25 PM PengHui Li <codelipeng...@gmail.com>
> wrote:
>
> > @Slim
> >
> > I want to use a different Pulsar topic to store the data for each Hive
> > partition. Is there a way to do this, and does this idea make sense?
> >
> > Can you give me some advice?
> >
> >
> > On Mon, Apr 15, 2019 at 6:22 PM, 李鹏辉gmail <codelipeng...@gmail.com> wrote:
> >
> > > I already have a simple implementation that can write and query data.
> > > I read the design document and the implementation for Kafka.
> > > Its handling of table partitions differs from what I have in mind.
> > >
> > > I want Hive table partition locations to work with Pulsar topics:
> > > different table partitions correspond to different topics.
> > > But I can’t get the partition to which the data will be written.
> > >
> > > I know that the drawback of doing this is that it loses the ordering
> > > of the stream data itself.
> > > But it can reduce unnecessary data reads when querying.
> > >
> > > Best Regards
> > >
> > > Penghui
> > > Beijing,China
> > >
> > >
> > >
> > > > On Apr 13, 2019, at 21:43, Jörn Franke <jornfra...@gmail.com> wrote:
> > > >
> > > > I think you need to develop a custom Hive SerDe + custom
> > > > Hadoop InputFormat + custom Hive OutputFormat.
> > > >
> > > >> On 12.04.2019 at 17:35, 李鹏辉gmail <codelipeng...@gmail.com> wrote:
> > > >>
> > > >> Hi guys,
> > > >>
> > > >> I have recently been working on integrating Hive and Pulsar, but I
> > > >> have run into some problems and hope to get help here.
> > > >>
> > > >> First of all, let me briefly describe the motivation.
> > > >>
> > > >> Pulsar can be used as an infinite stream that keeps both historical
> > > >> and streaming data, so we want to use Pulsar as a storage extension
> > > >> for Hive.
> > > >> In this way, Hive can read the data in Pulsar naturally and can also
> > > >> write data into Pulsar.
> > > >> We would benefit from having the same data serve both interactive
> > > >> queries and streaming.
> > > >>
> > > >> As an improvement, supporting data partitioning can make queries more
> > > >> efficient (e.g. partitioning by date or any other field).
> > > >>
> > > >> But:
> > > >>
> > > >> - How do I get the Hive table's partition definition?
> > > >> - When a user inserts data into a Hive table, how do I determine
> > > >>   which partition the data should be stored in?
> > > >> - When a user selects data from a Hive table, how do I determine
> > > >>   which partition the data is in?
> > > >>
> > > >> If Hive already exposes some mechanism to support this, please show
> > > >> me how to use it.
> > > >>
> > > >> Best regards
> > > >>
> > > >> Penghui
> > > >> Beijing, China
> > > >>
> > > >>
> > > >>
> > >
> > >
> >
>
