> SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;
…
> 0:01 [8.59M rows, 113MB] [11M rows/s, 146MB/s]
I'm hoping this is not rewriting to the approx_distinct() in Presto.
> I got similar performance with Hive + LLAP too.
This is a logical plan issue, so I don't know if LLAP helps a lot.
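One way to check whether the engine rewrites the aggregation is to look at the plan (a sketch; assumes a Presto CLI session against the same `accounts` table):

```sql
-- Show the logical plan to confirm the query runs an exact COUNT(DISTINCT id),
-- not a rewrite to approx_distinct(). (Presto syntax.)
EXPLAIN SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;
```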
A cou
For A) I’d recommend mapping an EXTERNAL table to the raw/original source
files…then you can just run a SELECT query from the EXTERNAL source and INSERT
into your destination.
LOAD DATA can be very useful when you are trying to move data between two
tables that share the same schema but 1 table
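A minimal sketch of the EXTERNAL-table approach (table names, columns, and the HDFS path are hypothetical; assumes the raw files are comma-delimited text):

```sql
-- Map an EXTERNAL table over the raw delimited files; no data is moved or copied.
CREATE EXTERNAL TABLE staging_accounts (
  id string,
  payload string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/accounts';

-- Then SELECT from the EXTERNAL source and INSERT into the destination table,
-- letting Hive do the conversion to the destination's storage format.
INSERT INTO TABLE accounts
SELECT id, payload FROM staging_accounts;
```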
Hi,
I have an ORC table with around 9 million rows. It has an ID column. We are
running a query to make sure there are no duplicate IDs. This is the query
*SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;*
This is the output from the presto shell
presto:test> SELECT COUNT(*), COUNT(DISTINCT id)
Right, that makes sense, Dudu.
So basically, if we have our data in "some form", and a goal of loading it
into a parquet, partitioned table in Hive, we have two choices:
A. Load this data into a temporary table first. Presumably, for this we
should be able to do a LOAD INPATH, from delimited data
“LOAD” is very misleading here. It is all done at the metadata level.
The data is not being touched. The data is not being verified. The “system”
does not have any clue whether the files' format matches the table definition
and whether they can actually be used.
The data files are being “moved” (again, a metada
Thanks, Dudu. I think there's a disconnect here. We're using LOAD INPATH on a
few tables to achieve the effect of actual insertion of records. Is it not the
case that the LOAD causes the data to get inserted into Hive?
Based on that I'd like to understand whether we can get away with using LOAD
I just want to verify that you understand the following:
· LOAD DATA INPATH is just a HDFS file movement operation.
You can achieve the same results by using hdfs dfs -mv …
· LOAD DATA LOCAL INPATH is just a file copying operation from the
shell to the HDFS.
You can achieve
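The file-movement equivalence above can be sketched roughly like this (all paths are hypothetical; assumes the file layout already matches the table):

```shell
# LOAD DATA INPATH '/staging/accounts.csv' INTO TABLE accounts;
# is, at the HDFS level, roughly equivalent to:
hdfs dfs -mv /staging/accounts.csv /user/hive/warehouse/accounts/

# LOAD DATA LOCAL INPATH instead copies a file from the local shell into HDFS:
hdfs dfs -put /tmp/accounts.csv /user/hive/warehouse/accounts/
```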
Dudu,
This is still in design stages, so we have a way to get the data from its
source. The data is *not* in the Parquet format. It's up to us to format
it the best and most efficient way. We can roll with CSV or Parquet;
ultimately the data must make it into a pre-defined PARQUET, PARTITIONED
t
Are your files already in Parquet format?
From: Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
Sent: Tuesday, April 04, 2017 7:03 PM
To: user@hive.apache.org
Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED
AS PARQUET table?
Thanks, Dudu.
Just to re-iterate; the way I'm reading your response is that yes, we can
use LOAD INPATH for a PARQUET, PARTITIONED table, provided that the data in
the delimited file is properly formatted. Then we can LOAD it into the
table (mytable in my example) directly and avoid the creation
Since LOAD DATA INPATH only moves files the answer is very simple.
If your files are already in a format that matches the destination table
(storage type, number and types of columns, etc.) then yes; if not, then no.
But –
You don’t need to load the files into intermediary table.
You sh
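Assuming the files are already valid Parquet matching the destination table's column layout, the direct load Dudu describes would look roughly like this (the path and partition values are hypothetical):

```sql
-- Move pre-existing Parquet files straight into one partition of the
-- destination table; Hive only relocates the files, it does not inspect them.
LOAD DATA INPATH '/staging/parquet/2017-04-04/'
INTO TABLE db.mytable
PARTITION (`date` = '2017-04-04', content_type = 'comments');
```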
We have a table such as the following defined:
CREATE TABLE IF NOT EXISTS db.mytable (
`item_id` string,
`timestamp` string,
`item_comments` string)
PARTITIONED BY (`date` string, `content_type` string)
STORED AS PARQUET;
Currently we insert data into this PARQUET, PARTITIONED table as follows,
using an
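One common pattern for inserting into such a table (a sketch, not necessarily the sender's exact statement; `db.source_table` is hypothetical, and dynamic partitioning must be enabled):

```sql
-- Enable dynamic partitioning so partition values come from the SELECT list.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Partition columns go last in the SELECT, matching the PARTITION clause order.
INSERT INTO TABLE db.mytable PARTITION (`date`, content_type)
SELECT item_id, `timestamp`, item_comments, `date`, content_type
FROM db.source_table;
```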
You have write access now, Sankar. Welcome to the Hive wiki team!
-- Lefty
On Mon, Apr 3, 2017 at 11:15 PM, Sankar Hariappan <
shariap...@hortonworks.com> wrote:
> Hi,
>
> I’m currently working on Hive Replication feature and need access to
> update some wiki pages.
> Confluence ID: sankarh
>