We use Hive to manage hundreds of millions of machine log data files. These 
files are semi-structured: we don't care about the full structure of a file up 
front, and the files don't have a format that's easy to understand.

Even for data with less structure (e.g. medical notes), there is always 
metadata about the data and its context.
This metadata and the 'blob' of data can fit in a row of a Hive table. We use 
UDFs and UDTFs to parse the blob portion of the data on an as-needed basis.
Another pattern is using a sequence file: the value contains the blob, and the 
key contains the concatenated metadata object (think Avro encoding).

Storage can be on HDFS or in HBase. The choice depends more on read and write 
access pattern requirements than on what level of structure the data has. 
Likewise, the choice of processing tool (Pig / Hive / MapReduce) is better 
driven by the type of data flows (data pipelines) you need to build than by how 
much structure the data has. The one exception is nested data: I find Pig 
handles this more easily than Hive does.

The trick to managing semi-structured data via Hive/Pig is to use UDFs to parse 
what you need when you need it. All of the tools above support UDFs; MapReduce 
does it too, because it's already operating at the 'assembly language' level 
anyway.
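
To make the "parse what you need when you need it" idea concrete, here is a 
minimal sketch in Python rather than a real Hive UDF. It assumes rows arrive as 
tab-separated lines of the form host, timestamp, blob (the shape Hive streams 
to an external script through its TRANSFORM clause) and, purely for 
illustration, that the blob is JSON. The column layout, the function names, and 
the "status" field are all hypothetical and not taken from our actual tables.

```python
import json


def extract_field(blob, field, default=""):
    """Parse the blob only when asked, and pull out a single field.

    Assumption: the blob is JSON. Real log blobs may need a custom parser.
    """
    try:
        record = json.loads(blob)
    except ValueError:
        # Unparseable blob: fall back to the default instead of failing the job.
        return default
    if not isinstance(record, dict):
        return default
    return str(record.get(field, default))


def transform(lines, field):
    """Turn 'host<TAB>ts<TAB>blob' rows into 'host<TAB>ts<TAB>value' rows."""
    for line in lines:
        # Split on the first two tabs only, so tabs inside the blob survive.
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) != 3:
            continue  # skip rows that do not have all three columns
        host, ts, blob = parts
        yield "\t".join([host, ts, extract_field(blob, field)])
```

The metadata columns pass through untouched and the blob is only parsed for 
the one field a query asks for, so the table never needs a full schema for the 
blob. Feeding `transform` from standard input would let the same logic run as a 
streaming script.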

- Douglas

From: Bill Busch <bigdat...@outlook.com>
Reply-To: <user@hive.apache.org>
Date: Wed, 3 Dec 2014 20:59:46 -0500
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: RE: Question

MapReduce can be used for both structured and unstructured data. Hive is a 
storage and retrieval mechanism (e.g. a database). The trouble with an RDBMS is 
that you either have to parse the unstructured data into a structured 
row/column format OR store it as an object. Both approaches have issues, in 
performance and in semantics. Hence, there is a whole world of NoSQL databases 
out there that are not row-column structured. These databases can handle more 
schema-less/unstructured objects and will allow you to manipulate your 
information more elegantly. I would check out the Wikipedia page on NoSQL 
databases and focus on Key-Value, Columnar, or Document databases.

________________________________
Date: Thu, 4 Dec 2014 07:06:16 +0530
Subject: Re: Question
From: mohan.25fe...@gmail.com
To: user@hive.apache.org

Thanks Gabriel for the prompt response

I see online blogs saying MapReduce is for unstructured data, Pig is for 
semi-structured data, and Hive is only for structured data. Can you please 
justify this?


Thanks in advance



On Thu, Dec 4, 2014 at 6:56 AM, Gabriel Eisbruch 
<gabrieleisbr...@gmail.com> wrote:
Hi Mohan,
   We are using Hive for unstructured (or semi-structured) data using map 
columns. For example, we use standard columns for fixed data and map columns 
for dynamic data.
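
As a rough illustration of that split (a sketch only, not the actual loading 
code), the Python below separates a record into fixed column values plus one 
dictionary holding everything else, which is the shape that would load into 
ordinary columns and a Hive MAP<STRING,STRING> column. The column names "host" 
and "ts" and the helper name `to_row` are hypothetical.

```python
# Hypothetical fixed columns; everything else is treated as dynamic data.
FIXED_COLUMNS = ("host", "ts")


def to_row(record):
    """Split a dict into (fixed values..., dynamic map).

    Fixed fields missing from the record become None; dynamic values are
    stringified so they fit a string-to-string map column.
    """
    fixed = tuple(record.get(name) for name in FIXED_COLUMNS)
    dynamic = {
        key: str(value)
        for key, value in record.items()
        if key not in FIXED_COLUMNS
    }
    return fixed + (dynamic,)
```

New attributes can then appear in the data without a schema change, since 
they simply become new keys in the map.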

Gabriel.

2014-12-03 22:19 GMT-03:00 Mohan Krishna 
<mohan.25fe...@gmail.com>:
Is Hive only for structured data, or does it handle unstructured data as well?

