[jira] [Created] (HIVE-25947) Compactor job queue cannot be set per table via compactor.mapred.job.queue.name

2022-02-10 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-25947:
--

 Summary: Compactor job queue cannot be set per table via 
compactor.mapred.job.queue.name
 Key: HIVE-25947
 URL: https://issues.apache.org/jira/browse/HIVE-25947
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


Before HIVE-20723 it was possible to schedule the compaction for each table on 
specific job queues by putting {{compactor.mapred.job.queue.name}} in the table 
properties. 

{code:sql}
CREATE TABLE person (name STRING, age INT) STORED AS ORC TBLPROPERTIES(
'transactional'='true',
'compactor.mapred.job.queue.name'='root.user2');

ALTER TABLE person COMPACT 'major' WITH OVERWRITE
TBLPROPERTIES('compactor.mapred.job.queue.name'='root.user2');
{code}

This is no longer possible (after HIVE-20723); to achieve the same effect one 
now needs to use the {{compactor.hive.compactor.job.queue}} property instead.

{code:sql}
CREATE TABLE person (name STRING, age INT) STORED AS ORC TBLPROPERTIES(
'transactional'='true',
'compactor.hive.compactor.job.queue'='root.user2');

ALTER TABLE person COMPACT 'major' WITH OVERWRITE
TBLPROPERTIES('compactor.hive.compactor.job.queue'='root.user2');
{code}




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [DISCUSS] Properties for scheduling compactions on specific queues

2022-02-10 Thread Stamatis Zampetakis
Hi all,

@Janos: The patch didn't go through but I get the idea. FYI: on most Apache
lists attachments are not allowed.

Since people do not have strong feelings about this, I am inclined to move
forward with Janos's suggestion and accept all three properties for
specifying the compaction queue.

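Concretely, the idea is that any of the following would be an accepted way to
pick the queue for a table-level compaction (just a sketch of the proposal;
the exact semantics are what HIVE-25947 will settle):

ALTER TABLE person COMPACT 'major'
  WITH OVERWRITE TBLPROPERTIES('compactor.mapred.job.queue.name'='root.user2');

ALTER TABLE person COMPACT 'major'
  WITH OVERWRITE TBLPROPERTIES('compactor.mapreduce.job.queuename'='root.user2');

ALTER TABLE person COMPACT 'major'
  WITH OVERWRITE TBLPROPERTIES('compactor.hive.compactor.job.queue'='root.user2');
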
I just logged HIVE-25947 about this.

Best,
Stamatis

On Tue, Feb 8, 2022 at 2:45 PM Janos Kovacs  wrote:

> Hi Stamatis,
>
> The attached one is a more generic proposal: basically it moves target
> queue resolution out to CompactorUtil.getCompactorJobQueueName, as it is
> already in use for the StatsUpdater.
> There, the old properties are then used based on a new fallback config
> prop. The fallback applies only to the statement (ci.props) and table
> (t.props) properties, not to the global configuration.
>
> It's just a mock-up (I didn't even check if it compiles), but it shows the
> logic, which should be good enough for the discussion.
>
> R, Janos
>
>
> Stamatis Zampetakis  wrote (on Mon, Feb 7, 2022 at 23:44):
>
>> Thanks Janos for the feedback.
>>
>> If I understand well, your suggestion is to support all of the properties
>> below for table-level compactions and treat them as equivalent:
>> * compactor.mapred.job.queue.name
>> * compactor.mapreduce.job.queuename
>> * compactor.hive.compactor.job.queue
>>
>> It is something that crossed my mind as well but I am slightly skeptical,
>> because this way we explicitly state that people are free to use whatever
>> they like. It might also have as a consequence MR properties affecting Tez
>> (as happens a bit with HIVE-25595), which from my perspective is not that
>> great. I am also thinking that it will lead to more requests for accepting
>> these MR-specific properties in the query-based compactor, which cannot
>> (and probably will never) use MR as the underlying engine. We should also
>> keep in mind that the MR engine was deprecated ~6 years ago and the MR
>> compactor may follow soon.
>>
>> I am fine implementing this specific change (accepting all properties
>> above) as long as someone from the people contributing to the compactor
>> confirms it is the desired path going forward.
>>
>> Best,
>> Stamatis
>>
>> On Mon, Feb 7, 2022 at 11:50 AM Janos Kovacs  wrote:
>>
>> > Hi Stamatis,
>> >
>> > I agree that the [compactor.]hive.compactor.queue.name is a better
>> > solution as hive now also supports query based compaction, not only MR.
>> > ...although I think this needs to be backward compatible!
>> >
>> > What do you think about a logic similar to this:
>> >
>> > --- a/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorMR.java  2022-02-07 10:31:28.0 +0100
>> > +++ b/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorMR.java  2022-02-07 10:33:25.0 +0100
>> > @@ -145,10 +145,19 @@
>> >      overrideMRProps(job, t.getParameters()); // override MR properties from tblproperties if applicable
>> >      if (ci.properties != null) {
>> >        overrideTblProps(job, t.getParameters(), ci.properties);
>> >      }
>> >
>> > +    // make queue configuration backward compatible:
>> > +    // at this point overrideMRProps and overrideTblProps have already consolidated
>> > +    // the final value, so we only need to read it back from job.get(TABLE_PROPS)
>> > +    String queueNameLegacy = (new StringableMap(job.get(TABLE_PROPS)))
>> > +        .toProperties().getProperty("compactor.mapred.job.queue.name");
>> > +    if (queueNameLegacy != null && queueNameLegacy.length() > 0) {
>> > +      HiveConf.setVar(job, ConfVars.COMPACTOR_JOB_QUEUE, queueNameLegacy);
>> > +    }
>> > +
>> >      String queueName = HiveConf.getVar(job, ConfVars.COMPACTOR_JOB_QUEUE);
>> >      if (queueName != null && queueName.length() > 0) {
>> >        job.setQueueName(queueName);
>> >      }
>> >
>> >
>> > Of course this can be wrapped around with a new config if needed, like
>> > hive.compaction.queue.name.use.legacy or whatever...
>> > FYI: we might also want to check legacy config not only for
>> > "compactor.mapred.job.queue.name" but also for
>> > "compactor.mapreduce.job.queuename" as the first one was already on the
>> > deprecated list as pointed out by Peter Vary.
>> >
>> > Please also note that the change introduced by HIVE-25595 is currently not
>> > compatible with the new config as it was developed for the old
>> > compactor.mapred... property:
>> > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorUtil.java#L31
>> > This also needs to be handled - for both the new prop name and backward
>> > compatibility.
>> >
>> > R, Janos
>> >
>> >
>> > On 2022/01/31 09:50:49 Stamatis Zampetakis wrote:
>> > > Hi all,
>> > >
>> > > This email is an attempt to converge on which Hive/Tez/MR properties
>> > > someone should use in order to schedule a compaction on specific
>> queues.
>> > > For those who are not familiar with how queues are used the YARN
>

[jira] [Created] (HIVE-25948) Optimize Iceberg writes by directing records either Clustered- or Fanoutwriter

2022-02-10 Thread Jira
Ádám Szita created HIVE-25948:
-

 Summary: Optimize Iceberg writes by directing records either 
Clustered- or Fanoutwriter
 Key: HIVE-25948
 URL: https://issues.apache.org/jira/browse/HIVE-25948
 Project: Hive
  Issue Type: Improvement
Reporter: Ádám Szita
Assignee: Ádám Szita


Currently Hive writes Iceberg tables with ClusteredWriter. This has a smaller 
memory footprint as it keeps only one writer open at a time, but it requires 
the records to be sorted.

However, if the data cardinality is low, FanoutWriter is a better choice for 
performance.

We should add support so that either writer can be used; the decision could be 
made similarly to how SortedDynPartitionOptimizer currently does it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: Time to Remove Hive-on-Spark

2022-02-10 Thread Zoltan Haindrich

Hey,

I think there is no real interest in this feature; we don't have users/contributors backing it - the last development was around October 2018, and there have been ~2 bugfix 
commits since then... we should stop carrying dead weight. Another 2 weeks have gone by since Stamatis reminded us that after 1.5 years(!) nothing has changed.


+1 on removing it

cheers,
Zoltan

you may inspect some of the recent changes with:
git log -c `find . -type f -path '**/spark/**'|grep -v xml|grep -v 
properties|grep -v q.out`


On 1/28/22 2:32 PM, Stamatis Zampetakis wrote:

Hi team,

Almost one year has passed since the last exchange in this discussion and
if I am not wrong there has been no effort to revive Hive-on-Spark. To be
more precise, I don't think I have seen any Spark related JIRA for quite
some time now and although I don't want to rush into conclusions, there
does not seem to be any community member involved in maintaining or adding
new features in this part of the code.

Keeping dead code in the repository does not do any good to the project and
puts a non-negligible burden on future maintainers.

Clearly, we cannot make a new Hive release where a major feature is
completely untested, so either someone commits to re-enabling/fixing the
respective tests soon, or we move forward with the work started by David and
drop support for Hive-on-Spark.

I would like to ask the community: is there anyone who can take up this
maintenance task and enable/fix the Spark-related tests in the next month or so?

Best,
Stamatis

On Sat, Feb 27, 2021 at 4:17 AM Edward Capriolo 
wrote:


I do not know how it works for most of the world. But in Cloudera, where the
Tez options were never popular, Hive-on-Spark represents a solid way to get
things done with lower latency for small datasets.

As for the Spark adoption: you know, a while ago I came up with some ways to
make Hive more Spark-like. One of them was finding a way to make "compile"
a Hive keyword so folks could build UDFs on the fly. It was such an
uphill climb. Folks found a way to make it disabled by default for security.
Then later, when things moved from CLI to Beeline, it was like the ONLY thing
that I found not ported. It was extremely frustrating.






On Mon, Jul 27, 2020 at 3:19 PM David  wrote:


Hello Xuefu,

I am not part of the Cloudera Hive product team, though I volunteer to
work on small projects from time to time. Perhaps someone from that team
can chime in with some of their thoughts, but personally, I think that in
the long run, there will be more of a merge between Hive-on-Spark and other
Spark-native offerings. I'm not sure what the differentiation will be
going forward. With that said, are there any developers on this mailing
list who are willing to take on the maintenance effort of keeping HoS
moving forward?

http://www.russellspitzer.com/2017/05/19/Spark-Sql-Thriftserver/

https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.4/bk_spark-component-guide/content/config-sts.html

Thanks.

On Thu, Jul 23, 2020 at 12:35 PM Xuefu Zhang  wrote:


Previous reasoning seemed to suggest a lack of user adoption. Now we are
concerned about ongoing maintenance effort. Both are valid considerations.
However, I think we should have ways to find out the answers. Therefore, I
suggest the following be carried out:

1. Send out the proposal (removing Hive on Spark) to users including
u...@hive.apache.org and get their feedback.
2. Ask if any developers on this mailing list are willing to take on the
maintenance effort.

I'm concerned about user impact because I can still see issues being
reported on HoS from time to time. I'm more concerned about the future of
Hive if we narrow Hive's neutrality on execution engines, which will possibly
force more Hive users to migrate to other alternatives such as Spark SQL,
which is already eroding Hive's user base.

Being open and neutral used to be among Hive's most admired strengths.

Thanks,
Xuefu


On Wed, Jul 22, 2020 at 8:46 AM Alan Gates 

wrote:



An important point here is I don't believe David is proposing to remove
Hive on Spark from the 2 or 3 lines, but only from trunk. Continuing to
support it in existing 2 and 3 lines makes sense, but since no one has
maintained it on trunk for some time and it does not work with many of the
newer features, it should be removed from trunk.

Alan.

On Tue, Jul 21, 2020 at 4:10 PM Chao Sun  wrote:


Thanks David. FWIW Uber is still running Hive on Spark (2.3.4) on a very
large scale in production right now and I don't think we have any plan to
change it soon.



On Tue, Jul 21, 2020 at 11:28 AM David  wrote:


Hello,

Thanks for the feedback.

Just a quick recap: I did propose this @dev and I received unanimous +1's
from the community. After a couple months, I created the PR.

Certainly open to discussion, but there hasn't been any discussion thus far
because there have been no objections until this point.

HoS has low adoption, heavy technical debt, and the manner i

Nightlies

2022-02-10 Thread Zoltan Haindrich

Hey,

I've built a preview of a nightly build; it could be tried out using the 
following:

git clone https://github.com/kgyrtkirk/hive-dev-box
cd hive-dev-box
./hdb run nightlytest
sw hive http://ci.hive.apache.org/job/hive-nightly/4/artifact/archive/apache-hive-4.0.0-nightly-dd23fa9147-20220210_160351-bin.tar.gz
reinit_metastore
hive_launch

The patch which fixes the pom.xml issues etc. is not merged yet, but it's here:
https://github.com/apache/hive/pull/3013
The job is at: http://ci.hive.apache.org/job/hive-nightly/

cheers,
Zoltan




[jira] [Created] (HIVE-25949) predicate pushdown does not seem to work correctly for external tables pointing to MySQL

2022-02-10 Thread Witold Drabicki (Jira)
Witold Drabicki created HIVE-25949:
--

 Summary: predicate pushdown does not seem to work correctly for 
external tables pointing to MySQL
 Key: HIVE-25949
 URL: https://issues.apache.org/jira/browse/HIVE-25949
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 2.3.7
 Environment: Hive runs on *GCP Dataproc,* image version is 
*1.5.56-debian10* (Hive v {*}2.3.7{*})

*MySQL* server version is {*}5.7.36{*}.

The following jars are used:
{code:java}
add jar gs://d-test-bucket-1/commons-pool-1.6.jar;
add jar gs://d-test-bucket-1/hive-jdbc-handler-2.3.7.jar;
add jar gs://d-test-bucket-1/commons-dbcp-1.4.jar;
add jar gs://d-test-bucket-1/mysql-connector-java-8.0.27.jar;  (identical 
behavior when using mysql-connector-java-5.1.49){code}
Reporter: Witold Drabicki


The following external table has been created:

 
{code:java}
CREATE EXTERNAL TABLE test_table_ext (
  col1 integer,
  col2 integer,
  col3 bigint,
  col4 decimal(38,15),
...
  col13 decimal(38,15),
  partition_column integer
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
    "hive.sql.database.type" = "MYSQL",
    "hive.sql.jdbc.driver" = "com.mysql.cj.jdbc.Driver",
    "hive.sql.jdbc.url" = "jdbc:mysql:///",
    "hive.sql.dbcp.username" = "",
    "hive.sql.dbcp.password" = "",
    "hive.sql.table" = "TEST_TABLE",
    "hive.sql.schema" = "",
    "hive.sql.query" = "select col1, col2, col3, col4, ..., col13, 
partition_column from .TEST_TABLE",
    "hive.sql.dbcp.maxActive" = "1",
    "hive.sql.numPartitions" = "1"
);
{code}
 

Corresponding MySQL table:

 
{code:java}
CREATE TABLE TEST_TABLE (
  col1 int(11),
  col2 int(11),
  col3 bigint(20),
  col4 decimal(38,15),
...
  col13 decimal(38,15),
  partition_column int(11)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
PARTITION BY LIST (partition_column)
{code}
 

The MySQL table has *470678* rows and *363* partitions.

Running *ANALYZE TABLE COMPUTE STATISTICS* calculates the number of rows in the 
table as {*}11303242{*}, which is incorrect.

The following 2 queries have been tested:

1) *select count(*) from test_table_ext where col2 = 7602*

2) *select count(*) from test_table_ext where partition_column = 20220207 and 
col2 = 7602*

*hive.optimize.ppd* setting is set to *true*

Query plan for query #2 is the following:

 
{code:java}
    Plan optimized by CBO.
    Vertex dependency in root stage
    Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
    Stage-0
    Fetch Operator
        limit:-1
        Stage-1
        Reducer 2
        File Output Operator [FS_7]
            Group By Operator [GBY_5] (rows=1 width=8)
            Output:["_col0"],aggregations:["count(VALUE._col0)"]
            <-Map 1 [CUSTOM_SIMPLE_EDGE]
            PARTITION_ONLY_SHUFFLE [RS_4]
                Group By Operator [GBY_3] (rows=1 width=8)
                Output:["_col0"],aggregations:["count()"]
                Select Operator [SEL_2] (rows=1 width=0)
                    Filter Operator [FIL_8] (rows=1 width=0)
                    predicate:((UDFToDouble(partition_column) = 2.0220207E7) 
and (UDFToDouble(col2) = 7602.0))
                    TableScan [TS_0] (rows=11303242 width=0)
                        
wd@test_table_ext,test_table_ext,Tbl:PARTIAL,Col:NONE,Output:["col2","partition_column"]
{code}
Since query #2 specifies the partition value, the expected behavior is that it 
should be noticeably faster than query #1, which has to scan all partitions, but 
these 2 queries take the same amount of time (150 seconds). This has been 
tested multiple times.

When the same queries are executed directly in MySQL, query #2 runs noticeably 
faster than query #1 (0.01 vs 0.26 seconds).
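
Note (not part of the original report, just a hedged way to narrow this down): 
the plan above evaluates both predicates in a Hive-side Filter Operator with 
UDFToDouble conversions instead of inside the TableScan, which suggests the 
filter may not be reaching MySQL at all. Comparing the plans with pushdown 
toggled should make that visible:
{code:sql}
-- Sketch using only the table and predicates from this report:
-- compare the generated plans with predicate pushdown on and off.
SET hive.optimize.ppd=true;
EXPLAIN SELECT count(*) FROM test_table_ext
WHERE partition_column = 20220207 AND col2 = 7602;

SET hive.optimize.ppd=false;
EXPLAIN SELECT count(*) FROM test_table_ext
WHERE partition_column = 20220207 AND col2 = 7602;
{code}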



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HIVE-25950) unix_timestamp returns NULL for incomplete date

2022-02-10 Thread roxxette (Jira)
roxxette created HIVE-25950:
---

 Summary: unix_timestamp returns NULL for incomplete date
 Key: HIVE-25950
 URL: https://issues.apache.org/jira/browse/HIVE-25950
 Project: Hive
  Issue Type: Bug
  Components: Clients, UDF
Affects Versions: 3.1.0
Reporter: roxxette


select from_unixtime(unix_timestamp('202202', 'yyyyMM'), 'yyyyMMdd');

returns 20220201 in HDI 3.6 (Hive 1.2.1)

returns NULL in HDI 4.0 (Hive 3.1.0)

 

Turns out Hive 1.2.1 has default values for year (1970), month (01), and day 
(01), while Hive 3.1.0 doesn't.

 

So in Hive 3.1.0 we have to parse a date with all 3 parts (y/m/d); otherwise 
the result is NULL.
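
A possible workaround (an assumption on my side, not verified on HDI 4.0) is to 
pad the incomplete value so that all three parts are present before parsing:
{code:sql}
-- Hypothetical workaround: append an explicit day so the Hive 3.1.0 parser
-- sees year, month and day; the expected result would again be 20220201.
select from_unixtime(unix_timestamp(concat('202202', '01'), 'yyyyMMdd'), 'yyyyMMdd');
{code}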



--
This message was sent by Atlassian Jira
(v8.20.1#820001)