Re: Loading data containing newlines

2016-01-15 Thread Alexander Pivovarov
ueries on Files - Concise syntax for running SQL queries over files of any supported format without registering a table. https://issues.apache.org/jira/browse/SPARK-11197 I think now it's more clear why all companies move to Spark to do ETL. On Fri, Jan 15, 2016 at 3:06 PM, Alexander Pivovarov

Re: Loading data containing newlines

2016-01-15 Thread Alexander Pivovarov
ed by Peridale Technology > Ltd, its subsidiaries or their employees, unless expressly so stated. It is > the responsibility of the recipient to ensure that this email is virus > free, therefore neither Peridale Technology Ltd, its subsidiaries nor their > employees accept any responsibility. >

Re: Loading data containing newlines

2016-01-13 Thread Alexander Pivovarov
Time to use Spark and Spark SQL in addition to Hive? It's probably going to happen sooner or later anyway. I sent you a Spark solution yesterday. (You just need to write the unbzip2AndCsvToListOfArrays(file: String): List[Array[String]] function using BZip2CompressorInputStream and the Super CSV API) you
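For anyone who wants to try the idea without Spark, here is a minimal Python sketch of the same kind of helper. It is not the Spark solution mentioned above: the function name unbzip2_and_csv_to_rows is hypothetical, and the standard-library bz2 and csv modules stand in for BZip2CompressorInputStream and Super CSV. It assumes UTF-8 input and standard double-quote CSV quoting.

```python
import bz2
import csv


def unbzip2_and_csv_to_rows(path):
    """Read a bzip2-compressed CSV file into a list of rows (lists of strings)."""
    # newline="" leaves line endings untouched so the csv module itself can
    # decide which newlines end a record and which sit inside a quoted field.
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return [row for row in csv.reader(handle)]
```

A quoted field that contains a newline comes back as one value with the newline preserved, so the record is not split in two.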

RE: Loading data containing newlines

2016-01-12 Thread Alexander Pivovarov
an give it a > different line delimiter, but Hive 1.2.1 does not support it: "FAILED: > SemanticException 3:20 LINES TERMINATED BY only supports newline '\n' right > now." > > > > *From:* Alexander Pivovarov [mailto:apivova...@gmail.com] > *Sent:* Tuesday,

Re: Loading data containing newlines

2016-01-12 Thread Alexander Pivovarov
Try the CSV serde. It should correctly parse a quoted field value that contains a newline https://cwiki.apache.org/confluence/display/Hive/CSV+Serde Hadoop should automatically read bz2 files On Tue, Jan 12, 2016 at 9:40 AM, Gerber, Bryan W wrote: > We are attempting to load CSV text files (compressed

Re: Create table from ORC or Parquet file?

2015-12-09 Thread Alexander Pivovarov
at table, so I assume you only care about reading it. Is that > right? > > .. Owen > > On Wed, Dec 2, 2015 at 9:53 PM, Alexander Pivovarov > wrote: > >> Hi Everyone >> >> Is it possible to create Hive table from ORC or Parquet file without >> specifying field names and their types. ORC or Parquet files contain field >> name and type information inside. >> >> Alex >> > >

Create table from ORC or Parquet file?

2015-12-02 Thread Alexander Pivovarov
Hi Everyone Is it possible to create Hive table from ORC or Parquet file without specifying field names and their types. ORC or Parquet files contain field name and type information inside. Alex

Re: join 2 tables located on different clusters

2015-06-25 Thread Alexander Pivovarov
ssue https://issues.apache.org/jira/browse/HIVE-6 On Wed, Jun 24, 2015 at 4:08 PM, Alexander Pivovarov wrote: > I tried on local hadoop/hive instance (hive is the latest from master > branch) > > mydev is ha alias to remote ha name node. > > $ hadoop fs -ls hdfs://mydev/tmp/et1 > Found

Re: join 2 tables located on different clusters

2015-06-24 Thread Alexander Pivovarov
This > can be done however assuming both clusters have network access to each other > > On Wed, Jun 24, 2015 at 4:33 PM, Alexander Pivovarov > wrote: > >> Hello Everyone >> >> Can I define external table on cluster_1 pointing to hdfs location on >> cl

join 2 tables located on different clusters

2015-06-24 Thread Alexander Pivovarov
Hello Everyone Can I define external table on cluster_1 pointing to hdfs location on cluster_2? I tried and got some strange exception in hive FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.reflect.InvocationTargetException) I w

Re: Hive on Spark VS Spark SQL

2015-05-20 Thread Alexander Pivovarov
Thank you Xuefu! Excellent explanation and comparison! We should put it to Hive on Spark wiki. https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark On Wed, May 20, 2015 at 10:45 AM, Xuefu Zhang wrote: > I have been working on HIve on Spark, and knows a little about SparkSQL. > Here a

RE: How to compare data in two tables?

2015-04-27 Thread Alexander Pivovarov
cipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore nei

How to compare data in two tables?

2015-04-27 Thread Alexander Pivovarov
Hi Everyone Let's say I have a Hive table in 2 datacenters. The table format can be textfile or ORC. There is a Sqoop job running every day which adds data to the table. Each datacenter has its own instance of the Sqoop job. Ideally the data in these two tables should be the same. The same mean
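One generic way to approach this kind of comparison (an assumption for illustration, not something proposed in the thread) is to compute an order-independent fingerprint of each table in its own datacenter and compare only the fingerprints. The sketch below does this in plain Python; table_fingerprint and the \x01 field separator are hypothetical choices.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, checksum) for an iterable of rows.

    The checksum is the sum of per-row hashes modulo 2**64, so it does not
    depend on the order in which the rows are read.
    """
    count = 0
    total = 0
    for row in rows:
        # Render NULLs as \N (as Hive text files do) and join the fields
        # with a separator that is unlikely to occur in the data.
        fields = ["\\N" if value is None else str(value) for value in row]
        digest = hashlib.md5("\x01".join(fields).encode("utf-8")).digest()
        total = (total + int.from_bytes(digest[:8], "big")) % (1 << 64)
        count += 1
    return count, total
```

Two tables with the same rows in a different order give the same pair. A mismatch says the tables differ but not where, so a per-partition fingerprint is the natural next step.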

Re: create table fails with exception unable to rename tmp file

2015-04-11 Thread Alexander Pivovarov
Maybe the user that runs the Hive CLI does not have write permission on hdfs://zhangj05-a:8020/user/hive/warehouse/reporting.db Who is the owner of hdfs://zhangj05-a:8020/user/hive/warehouse/reporting.db? Which user runs the Hive CLI? On Sat, Apr 11, 2015 at 11:07 AM, Jie Zhang wrote: > Hi, > > I hit the foll

Re: Unsubscribe Me

2015-04-07 Thread Alexander Pivovarov
Ashish, Read The Friendly Manual below https://hive.apache.org/mailing_lists.html On Tue, Apr 7, 2015 at 2:15 PM, Ashish Garg wrote: > Hello Admin, > > Please unsubscribe me. > > Regards, > Ashish Garg >

Re: CamelCase using InitCap Function in Hive 0.13

2015-04-01 Thread Alexander Pivovarov
Vivek, You can see the version in two places 1. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions string initcap(string A): Returns string, with the first letter of each word in uppercase, all other letters in lowercase. Words are delimited
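To make the quoted description concrete, here is a small Python sketch of the same behaviour. It is only an illustration, not Hive's implementation, and it assumes words are separated by whitespace (the quoted text above is cut off before it names the delimiter).

```python
import re


def initcap(text):
    """Uppercase the first letter of each word and lowercase the rest."""
    # \S+ matches one word (a run of non-whitespace characters); the
    # whitespace between words is left exactly as it was.
    return re.sub(r"\S+", lambda match: match.group(0).capitalize(), text)
```

For example, initcap("hELLO wORLD") gives "Hello World".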

Re: adding a local jar for UDF test

2015-04-01 Thread Alexander Pivovarov
I can suggest 3 options 1. you can use JUnit test to test your UDF (e.g. TestGenericUDFLastDay) 2. you can create q file and test your UDF via mvn (look at udf_last_day.q) mvn clean install -DskipTests -Phadoop-2 cd itest/qtest mvn test -Dtest=TestCliDriver -Dqfile=udf_last_day.q -Dtest.output.ov

Re: [ANNOUNCE] New Hive Committers - Jimmy Xiang, Matt McCline, and Sergio Pena

2015-03-23 Thread Alexander Pivovarov
Congrats to Matt, Jimmy and Sergio! On Mon, Mar 23, 2015 at 11:30 AM, Chaoyu Tang wrote: > Congratulations to Jimmy and Sergio! > > On Mon, Mar 23, 2015 at 2:08 PM, Carl Steinbach wrote: > >> The Apache Hive PMC has voted to make Jimmy Xiang, Matt McCline, and >> Sergio Pena committers on the A

Re: sorting in hive -- general

2015-03-07 Thread Alexander Pivovarov
A sort by query produces multiple independent files; order by produces just one file. Usually sort by is used with distribute by. In older Hive versions (0.7) they might be used to implement a local sort within a partition, similar to RANK() OVER (PARTITION BY A ORDER BY B) On Sat, Mar 7, 2015 at 3:02 PM, ma
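The difference can be sketched in plain Python. This is a toy model, not how Hive executes the query: each inner list stands for one reducer's output file, and the crc32-based partitioning is an assumption made only to keep the example deterministic.

```python
import zlib


def order_by(rows, sort_key):
    """Global sort: one output, totally ordered."""
    return sorted(rows, key=sort_key)


def distribute_by_sort_by(rows, distribute_key, sort_key, num_reducers):
    """Send each row to a reducer by its distribute key, then sort per reducer."""
    outputs = [[] for _ in range(num_reducers)]
    for row in rows:
        key_bytes = str(distribute_key(row)).encode("utf-8")
        outputs[zlib.crc32(key_bytes) % num_reducers].append(row)
    # Each output is sorted on its own; there is no order across outputs.
    return [sorted(part, key=sort_key) for part in outputs]
```

All rows with the same distribute key land in the same output, which is what makes a per-key local sort possible.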

Re: Create custom UDF

2015-03-05 Thread Alexander Pivovarov
Several useful common udf methods we added to GenericUDF recently https://issues.apache.org/jira/browse/HIVE-9744 you can look at the following UDFs as an example: https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLevenshtein.java https://githu

Re: how to access array type?

2015-03-02 Thread Alexander Pivovarov
hive> create table test1 (c1 array<int>) row format delimited collection items terminated by ','; OK hive> insert into test1 select array(1,2,3) from dual; OK hive> select * from test1; OK [1,2,3] hive> select c1[0] from test1; OK 1 $ hadoop fs -cat /apps/hive/warehouse/test1/00_0 1,2,3 On Su

Re: HS2 standalone JDBC jar not standalone

2015-03-02 Thread Alexander Pivovarov
yes, we even have a ticket for that https://issues.apache.org/jira/browse/HIVE-9600 btw can anyone test jdbc driver with kerberos enabled? https://issues.apache.org/jira/browse/HIVE-9599 On Mon, Mar 2, 2015 at 10:01 AM, Nick Dimiduk wrote: > Heya, > > I've like to use jmeter against HS2/JDBC a

Re: [ANNOUNCE] New Hive PMC Member - Sergey Shelukhin

2015-02-25 Thread Alexander Pivovarov
Congrats! On Wed, Feb 25, 2015 at 12:33 PM, Vaibhav Gumashta < vgumas...@hortonworks.com> wrote: > Congrats Sergey! > > On 2/25/15, 9:06 AM, "Vikram Dixit" wrote: > > >Congrats Sergey! > > > >On 2/25/15, 8:43 AM, "Carl Steinbach" wrote: > > > >>I am pleased to announce that Sergey Shelukhin has

get max partition column value

2015-02-25 Thread Alexander Pivovarov
Hi Everyone Let's say I have a table partitioned by period string. How do I select the max period? If I run select max(period) from invoice; hive 0.13.1 runs MR which is slow OK STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Tez Edges:
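One possible workaround (an assumption for illustration, not an answer given in this thread) is to read the partition names, for instance the output of SHOW PARTITIONS invoice, and take the maximum on the client side, which avoids scanning any data. The helper name max_partition_value is hypothetical, and the sketch assumes lines of the usual col=value/col2=value2 form without escaped characters.

```python
def max_partition_value(partition_lines, column):
    """Return the largest value of `column` found in partition names, or None."""
    prefix = column + "="
    values = []
    for line in partition_lines:
        # A multi-level partition looks like "period=2015-01/region=us".
        for part in line.strip().split("/"):
            if part.startswith(prefix):
                values.append(part[len(prefix):])
    # Values are compared as strings, matching a string partition column.
    return max(values) if values else None
```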

Re: CSV file reading in hive

2015-02-13 Thread Alexander Pivovarov
hive csv serde is available for all hive versions https://github.com/ogrodnek/csv-serde DEFAULT_ESCAPE_CHARACTER \ DEFAULT_QUOTE_CHARACTER " DEFAULT_SEPARATOR , add jar path/to/csv-serde.jar; (or put it to hive/hadoop/mr classpath on all boxes on cluster) -- you can use custom separ

Re: [ANNOUNCE] New Hive Committers -- Chao Sun, Chengxiang Li, and Rui Li

2015-02-09 Thread Alexander Pivovarov
Congrats! On Mon, Feb 9, 2015 at 12:31 PM, Carl Steinbach wrote: > The Apache Hive PMC has voted to make Chao Sun, Chengxiang Li, and Rui Li > committers on the Apache Hive Project. > > Please join me in congratulating Chao, Chengxiang, and Rui! > > Thanks. > > - Carl > >

Re: Hiveserver2 memory / thread leak v 0.13.1 (hdp-2.1.5)

2015-02-06 Thread Alexander Pivovarov
rg/jira/browse/HIVE-7353. > > Thanks, > —Vaibhav > > From: Alexander Pivovarov > Reply-To: "user@hive.apache.org" > Date: Wednesday, February 4, 2015 at 6:03 PM > To: "user@hive.apache.org" > Subject: Hiveserver2 memory / thread leak v 0.13.1 (hdp-2.1.5) &

Re: Re: How to query data by page in Hive?

2015-02-05 Thread Alexander Pivovarov
ROW_NUMBER doc http://docs.oracle.com/cd/B28359_01/server.111/b28286/functions144.htm#SQLRF06100 On Thu, Feb 5, 2015 at 4:48 PM, r7raul1...@163.com wrote: > *Table structure :* > CREATE TABLE `u_data`( > `userid` int, > `movieid` int, > `rating` int, > `unixtime` string) > ROW FORMAT DELIMITED

Re: Which [open-souce] SQL engine atop Hadoop?

2015-02-02 Thread Alexander Pivovarov
, Alexander Pivovarov wrote: > I like Tez engine for hive (aka Stinger initiative) > > - faster than MR engine. especially for complex queries with lots of > nested sub-queries > - stable > - min latency is 5-7 sec (0 sec for select count(*) ...) > - capable to process huge

Re: Which [open-souce] SQL engine atop Hadoop?

2015-02-02 Thread Alexander Pivovarov
I like the Tez engine for Hive (aka Stinger initiative) - faster than the MR engine, especially for complex queries with lots of nested sub-queries - stable - min latency is 5-7 sec (0 sec for select count(*) ...) - capable of processing huge datasets (not limited by RAM as Spark is) On Mon, Feb 2, 2015 at 6

Re: Hive wiki write access

2015-02-02 Thread Alexander Pivovarov
Thank you, Lefty! On Mon, Feb 2, 2015 at 3:59 PM, Lefty Leverenz wrote: > Done. Welcome to the Hive wiki team, Alexander! > > -- Lefty > > On Mon, Feb 2, 2015 at 2:14 PM, Alexander Pivovarov > wrote: > >> Hi Everyone >> >> Can I get write access to hive

Hive wiki write access

2015-02-02 Thread Alexander Pivovarov
Hi Everyone Can I get write access to hive wiki? I need to put descriptions for several UDFs added recently (init_cap, add_months, last_day, greatest, least) Confluence username: apivovarov https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Re: Enhancing Query Join to speed up Query

2013-06-13 Thread Alexander Pivovarov
Basically 1. if you join tables, try to filter out as much as possible in WHERE (to reduce the amount of data sent from the map step to the reduce step) 2. if you join a big table with a small table (< 500 MB) use the SELECT /*+ MAPJOIN(small_table) */ hint to avoid the reduce step. 3. if you join big table with big table make
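The idea behind point 2 can be sketched in a few lines of Python. This is a toy illustration of a map-side hash join, not Hive's implementation: the small table is loaded into an in-memory dictionary and the big table is streamed past it, so no shuffle or reduce step is needed. The function and parameter names are hypothetical.

```python
def map_join(big_rows, small_rows, big_key, small_key):
    """Inner-join two row iterables by hashing the small side in memory."""
    lookup = {}
    for small in small_rows:
        lookup.setdefault(small_key(small), []).append(small)
    # The big side is only iterated once and never held in memory.
    for big in big_rows:
        for small in lookup.get(big_key(big), []):
            yield big + small
```

This only works while the small side fits in memory, which is why the hint is suggested for small tables only.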

Re: Sequence file compression in Hive

2013-06-10 Thread Alexander Pivovarov
Sachin, it works SET hive.exec.compress.output=true; SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; SET mapred.output.compression.type=BLOCK; create table data1_seq STORED AS SEQUENCEFILE as select * from date1; hadoop fs -cat /user/hive/warehouse/data1_seq/00_0

Re: Need rank(), can't build m6d's version

2013-04-01 Thread Alexander Pivovarov
http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/ On Mon, Apr 1, 2013 at 3:45 PM, Keith Wiley wrote: > I need rank() in Hive. I have't had much luck with Edward Capriolo's on > git and it comes with no documentation. It depends on hive-test (also by >
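As a reference for what the query has to compute, here is a plain-Python sketch of "top N records in each group". It is not the approach from the linked post, the names are hypothetical, and it numbers rows like ROW_NUMBER() (ties get different numbers) rather than like RANK().

```python
from itertools import groupby


def top_n_per_group(rows, group_key, order_key, n):
    """Return (position, row) pairs for the first n rows of each group."""
    ordered = sorted(rows, key=lambda row: (group_key(row), order_key(row)))
    result = []
    for _, group in groupby(ordered, key=group_key):
        for position, row in enumerate(group, start=1):
            if position > n:
                break
            result.append((position, row))
    return result
```

Sorting by (group, order) first is the same trick as distributing by the group column and sorting inside each group.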

Re: Function definition in hive

2013-02-22 Thread Alexander Pivovarov
https://cwiki.apache.org/Hive/hiveplugins.html Creating Custom UDFs First, you need to create a new class that extends UDF, with one or more methods named evaluate. package com.example.hive.udf; import org.apache.hadoop.hive.ql.exec.UDF; import org.apache.hadoop.io.Text; public final class Lower

Re: Join not working in HIVE

2012-12-17 Thread Alexander Pivovarov
Hive supports only equi-joins. I recommend reading the Hive manual before using it (e.g. http://hive.apache.org/docs/r0.9.0/language_manual/joins.html https://cwiki.apache.org/Hive/languagemanual-joins.html). In the first sentence it says "Only equality joins, outer joins, and left semi joins are

Re: best way to load millions of gzip files in hdfs to one table in hive?

2012-10-02 Thread Alexander Pivovarov
Options 1. create a table and put the files under the table dir 2. create an external table and point it to the files dir 3. if the files are small then I recommend creating a new set of files using a simple MR program and specifying the number of reduce tasks. The goal is to make file size > hdfs block size (it saves NN me
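For option 3, the number of reduce tasks can be estimated from the total input size. The sketch below is a back-of-the-envelope helper under stated assumptions: the name is hypothetical, data is assumed to spread evenly across reducers, and the output is assumed to be about the same size as the input.

```python
def reducers_for_compaction(total_input_bytes, hdfs_block_bytes):
    """Pick a reducer count so each output file is at least one HDFS block."""
    # Floor division keeps the average output file at or above the block size;
    # max(1, ...) covers inputs smaller than a single block.
    return max(1, total_input_bytes // hdfs_block_bytes)
```

For example, 10 GiB of small files with a 128 MiB block size (an assumed value, check your cluster's setting) gives 80 reducers.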