[ https://issues.apache.org/jira/browse/IGNITE-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067736#comment-15067736 ]
Sergey Soldatov commented on IGNITE-2143:
-----------------------------------------

Well, I did some testing and here is the result.

Query:
{code}
SELECT a.name, a.lang, a.desc, count(a.name) AS cnt, a.url
FROM (
  SELECT repo_name AS name, lang AS lang, desc AS desc, repo_url AS url
  FROM gitrecord
  WHERE type='WatchEvent'
    AND create_at BETWEEN '2015-01-01T00:01:00Z' AND '2015-01-01T03:59:00Z') AS a
INNER JOIN (
  SELECT repo_url AS url
  FROM gitrecord
  WHERE type='CreateEvent'
    AND create_at BETWEEN '2015-01-01T00:01:00Z' AND '2015-01-01T03:59:00Z') AS b
ON a.url=b.url
GROUP BY a.name, a.lang, a.desc, a.url
HAVING cnt > 5
ORDER BY cnt DESC
LIMIT 25
{code}

I used a single Ignite node to avoid affinity problems. Without any indexes the query takes 300 seconds. With indexes on create_at and type it takes 40 seconds. The same query in Spark runs in less than 5 seconds.

Here is the explain output for the indexed version:
{code}
[[SELECT A.NAME AS __C0, A.LANG AS __C1, A.DESC AS __C2, COUNT(A.NAME) AS __C3, A.URL AS __C4, COUNT(A.NAME) AS __C5
FROM (
  SELECT REPO_URL AS URL FROM "gitrecords".GITRECORD
    /* "gitrecords"."create_at_idx": CREATE_AT >= '2015-01-01T00:01:00Z' AND CREATE_AT <= '2015-01-01T03:59:00Z' */
  WHERE (TYPE = 'CreateEvent') AND ((CREATE_AT >= '2015-01-01T00:01:00Z') AND (CREATE_AT <= '2015-01-01T03:59:00Z'))
) B
  /* SELECT REPO_URL AS URL FROM "gitrecords".GITRECORD
     /++ "gitrecords"."create_at_idx": CREATE_AT >= '2015-01-01T00:01:00Z' AND CREATE_AT <= '2015-01-01T03:59:00Z' ++/
     WHERE (TYPE = 'CreateEvent') AND ((CREATE_AT >= '2015-01-01T00:01:00Z') AND (CREATE_AT <= '2015-01-01T03:59:00Z')) */
INNER JOIN (
  SELECT REPO_NAME AS NAME, LANG AS LANG, DESC AS DESC, REPO_URL AS URL FROM "gitrecords".GITRECORD
    /* "gitrecords"."create_at_idx": CREATE_AT >= '2015-01-01T00:01:00Z' AND CREATE_AT <= '2015-01-01T03:59:00Z' */
  WHERE (TYPE = 'WatchEvent') AND ((CREATE_AT >= '2015-01-01T00:01:00Z') AND (CREATE_AT <= '2015-01-01T03:59:00Z'))
) A
  /* SELECT REPO_NAME AS NAME, LANG AS LANG, DESC AS DESC, REPO_URL AS URL FROM "gitrecords".GITRECORD
     /++ "gitrecords"."create_at_idx": CREATE_AT >= '2015-01-01T00:01:00Z' AND CREATE_AT <= '2015-01-01T03:59:00Z' ++/
     WHERE (REPO_URL IS ?1) AND ((TYPE = 'WatchEvent') AND ((CREATE_AT >= '2015-01-01T00:01:00Z') AND (CREATE_AT <= '2015-01-01T03:59:00Z'))): URL = B.URL */
ON 1=1
WHERE A.URL = B.URL
GROUP BY A.NAME, A.LANG, A.DESC, A.URL],
[SELECT __C0 AS NAME, __C1 AS LANG, __C2 AS DESC, CAST(SUM(__C3) AS BIGINT) AS CNT, __C4 AS URL
FROM PUBLIC.__T0 /* "gitrecords"."merge_scan" */
WHERE __C5
GROUP BY __C0, __C1, __C2, __C4
ORDER BY 4 DESC LIMIT 25]]
{code}

Hope it can help.

> inner join produce wrong result and is very slow
> ------------------------------------------------
>
>                 Key: IGNITE-2143
>                 URL: https://issues.apache.org/jira/browse/IGNITE-2143
>             Project: Ignite
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5
>        Environment: 3 nodes in docker containers.
>           Reporter: Sergey Soldatov
>           Priority: Critical
>        Attachments: config.xml, ignite-cache-1.0.0.jar
>
> I have the following query for gitarchive records processing:
> {code}
> select
>   lang, count(distinct ForkTable.repo_url) as cnt
> from (
>   select repo_url, lang from gitrecord where type = 'ForkEvent' and lang is not null
>   group by lang, repo_url) as ForkTable
> inner join (
>   select repo_url, html_url from gitrecord where type='PullRequestEvent'
>   group by repo_url, html_url) as PullTable
> on
>   ForkTable.repo_url = PullTable.repo_url
> group by lang
> order by cnt desc
> {code}
> There are two major problems:
> 1. It produces an incorrect result when run in cluster mode.
> 2. It is really slow: for 200k rows it takes 500+ seconds, compared to 20 seconds in Spark for the same query.
>
> Steps to reproduce:
> 1. Download the github archive for one day and put it into hdfs ("/Data" in my case):
> {code}wget http://data.githubarchive.org/2015-01-01-{0..23}.json.gz{code}
> 2. Copy the attached ignite-cache-1.0.0.jar to Ignite's lib dir.
> 3. Run Ignite with the attached config.xml.
> 4.
Run spark-shell:
> {code}
> spark-shell --packages org.apache.ignite:ignite-spark:1.5.0-b1 \
>   --repositories http://www.gridgainsystems.com/nexus/content/repositories/external \
>   --jars /usr/lib/ignite-hadoop/libs/ignite-cache-1.0.0.jar \
>   --master spark://{spark-master}:7077
> {code}
> 5. Load the data and execute the query:
> {code}
> import org.apache.ignite._
> import org.apache.ignite.configuration._
> import org.apache.ignite.cache.query._
> import org.apache.ignite.spark._
> import org.apache.ignite.lang.IgniteBiPredicate
> import io.dtk._
>
> val df = sqlContext.read.json("/Data/*.gz")
> Ignition.setClientMode(true)
> val cacheName = "gitrecords"
> val rdd = df.select("id", "repo.url", "payload.forkee.language", "type", "payload.pull_request.head.repo.html_url")
>   .map(row => (row.getString(0), new GitRecord(0, row.getString(1), row.getString(2), row.getString(3), row.getString(4))))
> val cfg = new CacheConfiguration[String, GitRecord]()
>   .setName(cacheName)
>   .setIndexedTypes(classOf[String], classOf[GitRecord])
>   .setLoadPreviousValue(true)
> val ic = new IgniteContext[String, GitRecord](sc, "/usr/lib/ignite-hadoop/config/config.xml")
> val sharedRDD = ic.fromCache(cfg)
> sharedRDD.savePairs(rdd)
> sharedRDD.sql("select lang, count(distinct ForkTable.repo_url) as cnt from (select repo_url, lang from gitrecord where type = 'ForkEvent' and lang is not null group by lang, repo_url) as ForkTable inner join (select repo_url, html_url from gitrecord where type='PullRequestEvent' group by repo_url, html_url) as PullTable on ForkTable.repo_url = PullTable.repo_url group by lang order by cnt desc").show
> {code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
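Editor's note: the comment mentions running a single node "to avoid problems with affinity", which is also the likely cause of problem 1 (wrong results in cluster mode). The following toy sketch, in plain Python with invented data and no Ignite involved, illustrates the general mechanism: if cache entries are partitioned by the cache key (here, the event id) rather than collocated by the join key (repo_url), a join executed independently on each node's local partitions misses cross-node matches.

{code}
# Toy illustration of why a distributed INNER JOIN can return wrong results
# when rows are not collocated by the join key. All data here is made up.

def join_count(fork_rows, pull_rows):
    """Distinct fork repo_urls that also appear among the pull repo_urls."""
    pull_urls = set(pull_rows)
    return len({url for url in fork_rows if url in pull_urls})

# Global data: repo urls seen in ForkEvents and in PullRequestEvents.
fork = ["r1", "r2", "r3"]
pull = ["r1", "r3"]

# Correct single-node answer: r1 and r3 match.
assert join_count(fork, pull) == 2

# Cluster mode: entries hashed by event id, so the two sides of a match
# can land on different nodes.
node_a = (["r1", "r2"], ["r3"])   # (fork rows, pull rows) on node A
node_b = (["r3"], ["r1"])         # (fork rows, pull rows) on node B

# Each node joins only its local partitions; partial results are merged.
distributed = join_count(*node_a) + join_count(*node_b)

# r1's fork row is on node A but its pull row is on node B (and vice versa
# for r3), so neither match is found locally: the merged answer is 0.
assert distributed == 0
{code}

Collocating both event types by repo_url (affinity key) would make the local joins complete, which is why the single-node run sidesteps the correctness issue.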
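Editor's note: to pin down what the WatchEvent/CreateEvent query from the comment should return, here is a small reference computation in plain Python over invented rows. It simplifies the CreateEvent side to a semi-join (a set of distinct urls, matching the plan's `URL = B.URL` lookup rewrite) and drops the time range and HAVING clause, which would filter nothing on this toy data; duplicate CreateEvent rows per url would inflate the SQL counts, which this sketch does not model.

{code}
# Reference semantics for: count WatchEvents per (name, lang, url) for repos
# that also have a CreateEvent, ordered by count descending. Toy data only.
from collections import Counter

rows = [
    # (type, repo_name, lang, repo_url)
    ("WatchEvent",  "repoA", "Java",  "urlA"),
    ("WatchEvent",  "repoA", "Java",  "urlA"),
    ("WatchEvent",  "repoB", "Scala", "urlB"),
    ("CreateEvent", "repoA", "Java",  "urlA"),
]

# Urls that have at least one CreateEvent (semi-join side).
created = {r[3] for r in rows if r[0] == "CreateEvent"}

# GROUP BY name, lang, url over WatchEvents whose url was created.
watches = Counter((r[1], r[2], r[3]) for r in rows
                  if r[0] == "WatchEvent" and r[3] in created)

# ORDER BY cnt DESC: repoB is dropped (urlB has no CreateEvent).
result = sorted(watches.items(), key=lambda kv: -kv[1])
assert result == [(("repoA", "Java", "urlA"), 2)]
{code}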