[
https://issues.apache.org/jira/browse/IGNITE-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067736#comment-15067736
]
Sergey Soldatov commented on IGNITE-2143:
-----------------------------------------
Well, I did some testing and here is the result:
Query:
{code}
SELECT a.name, a.lang, a.desc, count(a.name) AS cnt, a.url
FROM (SELECT repo_name AS name, lang AS lang, desc AS desc, repo_url AS url
      FROM gitrecord
      WHERE type = 'WatchEvent'
        AND create_at BETWEEN '2015-01-01T00:01:00Z' AND '2015-01-01T03:59:00Z') AS a
INNER JOIN (SELECT repo_url AS url
            FROM gitrecord
            WHERE type = 'CreateEvent'
              AND create_at BETWEEN '2015-01-01T00:01:00Z' AND '2015-01-01T03:59:00Z') AS b
  ON a.url = b.url
GROUP BY a.name, a.lang, a.desc, a.url
HAVING cnt > 5
ORDER BY cnt DESC
LIMIT 25
{code}
I used a single-node Ignite to avoid problems with affinity. Without any indexes the
query takes 300 seconds. With indexes on create_at and type it takes 40 seconds. The
same query in Spark runs in less than 5 seconds.
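For reference, the intended semantics of the query can be pinned down with plain Scala collections (the class, field names, and sample rows below are made up for illustration, not from the actual dataset):

```scala
// Toy model of the query above: join WatchEvents with CreateEvents on repo url,
// group, keep groups with cnt > 5, order by cnt descending, take 25.
case class Rec(name: String, lang: String, desc: String, url: String,
               eventType: String, createAt: String)

val lo = "2015-01-01T00:01:00Z"
val hi = "2015-01-01T03:59:00Z"
def inWindow(r: Rec) = r.createAt >= lo && r.createAt <= hi

val rows: List[Rec] =
  // six WatchEvents for the same repo, all inside the window
  (1 to 6).toList.map(i =>
    Rec("ignite", "Java", "grid", "u1", "WatchEvent", s"2015-01-01T01:0$i:00Z")) ++
  List(
    Rec("ignite", "Java", "grid", "u1", "CreateEvent", "2015-01-01T02:00:00Z"),
    // watched but never created in the window -> dropped by the inner join
    Rec("spark", "Scala", "etl", "u2", "WatchEvent", "2015-01-01T01:30:00Z"))

// subquery a: WatchEvents in the window; subquery b: urls of CreateEvents in the window
val a = rows.filter(r => r.eventType == "WatchEvent" && inWindow(r))
val b = rows.filter(r => r.eventType == "CreateEvent" && inWindow(r)).map(_.url).toSet

// INNER JOIN on url, GROUP BY, HAVING cnt > 5, ORDER BY cnt DESC, LIMIT 25
val result = a.filter(r => b(r.url))
  .groupBy(r => (r.name, r.lang, r.desc, r.url))
  .map { case (key, g) => (key, g.size) }
  .filter { case (_, cnt) => cnt > 5 }
  .toList
  .sortBy { case (_, cnt) => -cnt }
  .take(25)
```

On this sample data only the ("ignite", ..., "u1") group survives, with a count of 6.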
Here is the explain for the indexed version:
[[SELECT
A.NAME AS __C0,
A.LANG AS __C1,
A.DESC AS __C2,
COUNT(A.NAME) AS __C3,
A.URL AS __C4,
COUNT(A.NAME) AS __C5
FROM (
SELECT
REPO_URL AS URL
FROM "gitrecords".GITRECORD
/* "gitrecords"."create_at_idx": CREATE_AT >= '2015-01-01T00:01:00Z'
AND CREATE_AT <= '2015-01-01T03:59:00Z'
*/
WHERE (TYPE = 'CreateEvent')
AND ((CREATE_AT >= '2015-01-01T00:01:00Z')
AND (CREATE_AT <= '2015-01-01T03:59:00Z'))
) B
/* SELECT
REPO_URL AS URL
FROM "gitrecords".GITRECORD
/++ "gitrecords"."create_at_idx": CREATE_AT >= '2015-01-01T00:01:00Z'
AND CREATE_AT <= '2015-01-01T03:59:00Z'
++/
WHERE (TYPE = 'CreateEvent')
AND ((CREATE_AT >= '2015-01-01T00:01:00Z')
AND (CREATE_AT <= '2015-01-01T03:59:00Z'))
*/
INNER JOIN (
SELECT
REPO_NAME AS NAME,
LANG AS LANG,
DESC AS DESC,
REPO_URL AS URL
FROM "gitrecords".GITRECORD
/* "gitrecords"."create_at_idx": CREATE_AT >= '2015-01-01T00:01:00Z'
AND CREATE_AT <= '2015-01-01T03:59:00Z'
*/
WHERE (TYPE = 'WatchEvent')
AND ((CREATE_AT >= '2015-01-01T00:01:00Z')
AND (CREATE_AT <= '2015-01-01T03:59:00Z'))
) A
/* SELECT
REPO_NAME AS NAME,
LANG AS LANG,
DESC AS DESC,
REPO_URL AS URL
FROM "gitrecords".GITRECORD
/++ "gitrecords"."create_at_idx": CREATE_AT >= '2015-01-01T00:01:00Z'
AND CREATE_AT <= '2015-01-01T03:59:00Z'
++/
WHERE (REPO_URL IS ?1)
AND ((TYPE = 'WatchEvent')
AND ((CREATE_AT >= '2015-01-01T00:01:00Z')
AND (CREATE_AT <= '2015-01-01T03:59:00Z'))): URL = B.URL
*/
ON 1=1
WHERE A.URL = B.URL
GROUP BY A.NAME, A.LANG, A.DESC, A.URL], [SELECT
__C0 AS NAME,
__C1 AS LANG,
__C2 AS DESC,
CAST(SUM(__C3) AS BIGINT) AS CNT,
__C4 AS URL
FROM PUBLIC.__T0
/* "gitrecords"."merge_scan" */
WHERE __C5
GROUP BY __C0, __C1, __C2, __C4
ORDER BY 4 DESC
LIMIT 25]]
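Note that the plan comments show only "create_at_idx" being used; the TYPE predicate is then filtered row by row after the index range scan. A toy model of how many rows each subquery touches under the single-column index versus a hypothetical composite (type, create_at) index (the event counts below are invented):

```scala
// Compare rows touched: create_at range scan alone vs. type + create_at together.
case class Ev(eventType: String, createAt: String)
def ts(h: Int, m: Int) = f"2015-01-01T$h%02d:$m%02d:00Z"

// 100 invented events inside the query's time window, only 5 of them CreateEvents.
val events: Seq[Ev] =
  (0 until 80).map(i => Ev("PushEvent", ts(1 + i / 60, i % 60))) ++
  (0 until 15).map(i => Ev("WatchEvent", ts(2, 30 + i))) ++
  (0 until 5).map(i => Ev("CreateEvent", ts(3, i)))

val lo = "2015-01-01T00:01:00Z"
val hi = "2015-01-01T03:59:00Z"
def inWindow(e: Ev) = e.createAt >= lo && e.createAt <= hi

// Single-column create_at index: the range scan touches every event in the window.
val touchedBySingleColumnIdx = events.count(inWindow)
// Hypothetical composite (type, create_at) index: only rows of the requested type.
val touchedByCompositeIdx = events.count(e => e.eventType == "CreateEvent" && inWindow(e))
```

If most events in the window are neither WatchEvents nor CreateEvents, the composite index would cut the scanned row count dramatically, which may explain part of the remaining 40-second runtime.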
Hope it helps.
> inner join produce wrong result and is very slow
> ------------------------------------------------
>
> Key: IGNITE-2143
> URL: https://issues.apache.org/jira/browse/IGNITE-2143
> Project: Ignite
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5
> Environment: 3 nodes in docker containers.
> Reporter: Sergey Soldatov
> Priority: Critical
> Attachments: config.xml, ignite-cache-1.0.0.jar
>
>
> I have the following query for processing GitHub Archive records.
> {code}
> select lang, count(distinct ForkTable.repo_url) as cnt
> from (select repo_url, lang
>       from gitrecord
>       where type = 'ForkEvent' and lang is not null
>       group by lang, repo_url) as ForkTable
> inner join (select repo_url, html_url
>             from gitrecord
>             where type = 'PullRequestEvent'
>             group by repo_url, html_url) as PullTable
>   on ForkTable.repo_url = PullTable.repo_url
> group by lang
> order by cnt desc
> {code}
> And there are two major problems:
> 1. It produces an incorrect result when running in cluster mode.
> 2. It's really slow: for 200k rows it takes 500+ seconds, compared to 20 seconds
> on Spark for the same query.
> Steps to reproduce:
> 1. Download the GitHub Archive for one day and put it into HDFS ("/Data" in my case):
> {code}wget http://data.githubarchive.org/2015-01-01-{0..23}.json.gz{code}
> 2. Copy the attached ignite-cache-1.0.0.jar to Ignite's lib dir
> 3. Run Ignite with the attached config.xml
> 4. Run spark-shell:
> {code}
> spark-shell --packages org.apache.ignite:ignite-spark:1.5.0-b1 \
>   --repositories http://www.gridgainsystems.com/nexus/content/repositories/external \
>   --jars /usr/lib/ignite-hadoop/libs/ignite-cache-1.0.0.jar \
>   --master spark://{spark-master}:7077
> {code}
> 5. Load the data and execute the query:
> {code}
> import org.apache.ignite._
> import org.apache.ignite.configuration._
> import org.apache.ignite.cache.query._
> import org.apache.ignite.spark._
> import org.apache.ignite.lang.IgniteBiPredicate
> import io.dtk._
> val df = sqlContext.read.json("/Data/*.gz")
> Ignition.setClientMode(true)
> val cacheName = "gitrecords"
> val rdd = df.select("id", "repo.url", "payload.forkee.language", "type",
>     "payload.pull_request.head.repo.html_url")
>   .map(row => (row.getString(0),
>     new GitRecord(0, row.getString(1), row.getString(2), row.getString(3), row.getString(4))))
> val cfg = new CacheConfiguration[String, GitRecord]()
>   .setName(cacheName)
>   .setIndexedTypes(classOf[String], classOf[GitRecord])
>   .setLoadPreviousValue(true)
> val ic = new IgniteContext[String, GitRecord](sc, "/usr/lib/ignite-hadoop/config/config.xml")
> val sharedRDD = ic.fromCache(cfg)
> sharedRDD.savePairs(rdd)
> sharedRDD.sql("""select lang, count(distinct ForkTable.repo_url) as cnt
>   from (select repo_url, lang from gitrecord
>         where type = 'ForkEvent' and lang is not null
>         group by lang, repo_url) as ForkTable
>   inner join (select repo_url, html_url from gitrecord
>               where type = 'PullRequestEvent'
>               group by repo_url, html_url) as PullTable
>   on ForkTable.repo_url = PullTable.repo_url
>   group by lang order by cnt desc""").show
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)