[
https://issues.apache.org/jira/browse/IGNITE-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067736#comment-15067736
]
Sergey Soldatov commented on IGNITE-2143:
-----------------------------------------
Well, I did some testing and here is the result:
Query:
{code}
SELECT a.name, a.lang, a.desc, count(a.name) AS cnt, a.url
FROM (SELECT repo_name AS name, lang AS lang, desc AS desc, repo_url AS url
      FROM gitrecord
      WHERE type = 'WatchEvent'
        AND create_at BETWEEN '2015-01-01T00:01:00Z' AND '2015-01-01T03:59:00Z') AS a
INNER JOIN (SELECT repo_url AS url
            FROM gitrecord
            WHERE type = 'CreateEvent'
              AND create_at BETWEEN '2015-01-01T00:01:00Z' AND '2015-01-01T03:59:00Z') AS b
  ON a.url = b.url
GROUP BY a.name, a.lang, a.desc, a.url
HAVING cnt > 5
ORDER BY cnt DESC
LIMIT 25
{code}
I used a single-node Ignite to avoid problems with affinity. Without any indexes the
query takes 300 seconds. With indexes on create_at and type it takes 40 seconds. The
same query in Spark runs in less than 5 seconds.
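For reference, the intended semantics of the query can be pinned down with plain Scala collections (the class, field names, and sample rows below are made up for illustration, not from the actual dataset):

```scala
// Toy model of the query above: join WatchEvents with CreateEvents on repo url,
// group, keep groups with cnt > 5, order by cnt descending, take 25.
case class Rec(name: String, lang: String, desc: String, url: String,
               eventType: String, createAt: String)

val lo = "2015-01-01T00:01:00Z"
val hi = "2015-01-01T03:59:00Z"
def inWindow(r: Rec) = r.createAt >= lo && r.createAt <= hi

val rows: List[Rec] =
  // six WatchEvents for the same repo, all inside the window
  (1 to 6).toList.map(i =>
    Rec("ignite", "Java", "grid", "u1", "WatchEvent", s"2015-01-01T01:0$i:00Z")) ++
  List(
    Rec("ignite", "Java", "grid", "u1", "CreateEvent", "2015-01-01T02:00:00Z"),
    // watched but never created in the window -> dropped by the inner join
    Rec("spark", "Scala", "etl", "u2", "WatchEvent", "2015-01-01T01:30:00Z"))

// subquery a: WatchEvents in the window; subquery b: urls of CreateEvents in the window
val a = rows.filter(r => r.eventType == "WatchEvent" && inWindow(r))
val b = rows.filter(r => r.eventType == "CreateEvent" && inWindow(r)).map(_.url).toSet

// INNER JOIN on url, GROUP BY, HAVING cnt > 5, ORDER BY cnt DESC, LIMIT 25
val result = a.filter(r => b(r.url))
  .groupBy(r => (r.name, r.lang, r.desc, r.url))
  .map { case (key, g) => (key, g.size) }
  .filter { case (_, cnt) => cnt > 5 }
  .toList
  .sortBy { case (_, cnt) => -cnt }
  .take(25)
```

On this sample data only the ("ignite", ..., "u1") group survives, with a count of 6.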
Here is the explain for the indexed version:
[[SELECT
A.NAME AS __C0,
A.LANG AS __C1,
A.DESC AS __C2,
COUNT(A.NAME) AS __C3,
A.URL AS __C4,
COUNT(A.NAME) AS __C5
FROM (
SELECT
REPO_URL AS URL
FROM "gitrecords".GITRECORD
/* "gitrecords"."create_at_idx": CREATE_AT >= '2015-01-01T00:01:00Z'
AND CREATE_AT <= '2015-01-01T03:59:00Z'
*/
WHERE (TYPE = 'CreateEvent')
AND ((CREATE_AT >= '2015-01-01T00:01:00Z')
AND (CREATE_AT <= '2015-01-01T03:59:00Z'))
) B
/* SELECT
REPO_URL AS URL
FROM "gitrecords".GITRECORD
/++ "gitrecords"."create_at_idx": CREATE_AT >= '2015-01-01T00:01:00Z'
AND CREATE_AT <= '2015-01-01T03:59:00Z'
++/
WHERE (TYPE = 'CreateEvent')
AND ((CREATE_AT >= '2015-01-01T00:01:00Z')
AND (CREATE_AT <= '2015-01-01T03:59:00Z'))
*/
INNER JOIN (
SELECT
REPO_NAME AS NAME,
LANG AS LANG,
DESC AS DESC,
REPO_URL AS URL
FROM "gitrecords".GITRECORD
/* "gitrecords"."create_at_idx": CREATE_AT >= '2015-01-01T00:01:00Z'
AND CREATE_AT <= '2015-01-01T03:59:00Z'
*/
WHERE (TYPE = 'WatchEvent')
AND ((CREATE_AT >= '2015-01-01T00:01:00Z')
AND (CREATE_AT <= '2015-01-01T03:59:00Z'))
) A
/* SELECT
REPO_NAME AS NAME,
LANG AS LANG,
DESC AS DESC,
REPO_URL AS URL
FROM "gitrecords".GITRECORD
/++ "gitrecords"."create_at_idx": CREATE_AT >= '2015-01-01T00:01:00Z'
AND CREATE_AT <= '2015-01-01T03:59:00Z'
++/
WHERE (REPO_URL IS ?1)
AND ((TYPE = 'WatchEvent')
AND ((CREATE_AT >= '2015-01-01T00:01:00Z')
AND (CREATE_AT <= '2015-01-01T03:59:00Z'))): URL = B.URL
*/
ON 1=1
WHERE A.URL = B.URL
GROUP BY A.NAME, A.LANG, A.DESC, A.URL], [SELECT
__C0 AS NAME,
__C1 AS LANG,
__C2 AS DESC,
CAST(SUM(__C3) AS BIGINT) AS CNT,
__C4 AS URL
FROM PUBLIC.__T0
/* "gitrecords"."merge_scan" */
WHERE __C5
GROUP BY __C0, __C1, __C2, __C4
ORDER BY 4 DESC
LIMIT 25]]
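Note that the plan comments show only "create_at_idx" being used; the TYPE predicate is then filtered row by row after the index range scan. A toy model of how many rows each subquery touches under the single-column index versus a hypothetical composite (type, create_at) index (the event counts below are invented):

```scala
// Compare rows touched: create_at range scan alone vs. type + create_at together.
case class Ev(eventType: String, createAt: String)
def ts(h: Int, m: Int) = f"2015-01-01T$h%02d:$m%02d:00Z"

// 100 invented events inside the query's time window, only 5 of them CreateEvents.
val events: Seq[Ev] =
  (0 until 80).map(i => Ev("PushEvent", ts(1 + i / 60, i % 60))) ++
  (0 until 15).map(i => Ev("WatchEvent", ts(2, 30 + i))) ++
  (0 until 5).map(i => Ev("CreateEvent", ts(3, i)))

val lo = "2015-01-01T00:01:00Z"
val hi = "2015-01-01T03:59:00Z"
def inWindow(e: Ev) = e.createAt >= lo && e.createAt <= hi

// Single-column create_at index: the range scan touches every event in the window.
val touchedBySingleColumnIdx = events.count(inWindow)
// Hypothetical composite (type, create_at) index: only rows of the requested type.
val touchedByCompositeIdx = events.count(e => e.eventType == "CreateEvent" && inWindow(e))
```

If most events in the window are neither WatchEvents nor CreateEvents, the composite index would cut the scanned row count dramatically, which may explain part of the remaining 40-second runtime.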
Hope it helps.
> inner join produce wrong result and is very slow
> ------------------------------------------------
>
> Key: IGNITE-2143
> URL: https://issues.apache.org/jira/browse/IGNITE-2143
> Project: Ignite
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5
> Environment: 3 nodes in docker containers.
> Reporter: Sergey Soldatov
> Priority: Critical
> Attachments: config.xml, ignite-cache-1.0.0.jar
>
>
> I have the following query for processing GitHub Archive records.
> {code}
> select lang, count(distinct ForkTable.repo_url) as cnt
> from (select repo_url, lang
>       from gitrecord
>       where type = 'ForkEvent' and lang is not null
>       group by lang, repo_url) as ForkTable
> inner join (select repo_url, html_url
>             from gitrecord
>             where type = 'PullRequestEvent'
>             group by repo_url, html_url) as PullTable
>   on ForkTable.repo_url = PullTable.repo_url
> group by lang
> order by cnt desc
> {code}
> And there are two major problems:
> 1. It produces an incorrect result when running in cluster mode.
> 2. It's really slow: for 200k rows it takes 500+ seconds, compared to 20 seconds
> on Spark for the same query.
> Steps to reproduce:
> 1. Download the GitHub Archive for one day and put it into HDFS ("/Data" in my case):
> {code}wget http://data.githubarchive.org/2015-01-01-{0..23}.json.gz{code}
> 2. Copy the attached ignite-cache-1.0.0.jar to Ignite's lib dir
> 3. Run Ignite with the attached config.xml
> 4. Run spark-shell:
> {code}
> spark-shell --packages org.apache.ignite:ignite-spark:1.5.0-b1 \
>   --repositories http://www.gridgainsystems.com/nexus/content/repositories/external \
>   --jars /usr/lib/ignite-hadoop/libs/ignite-cache-1.0.0.jar \
>   --master spark://{spark-master}:7077
> {code}
> 5. Load the data and execute the query:
> {code}
> import org.apache.ignite._
> import org.apache.ignite.configuration._
> import org.apache.ignite.cache.query._
> import org.apache.ignite.spark._
> import org.apache.ignite.lang.IgniteBiPredicate
> import io.dtk._
> val df = sqlContext.read.json("/Data/*.gz")
> Ignition.setClientMode(true)
> val cacheName = "gitrecords"
> val rdd = df.select("id", "repo.url", "payload.forkee.language", "type",
>     "payload.pull_request.head.repo.html_url")
>   .map(row => (row.getString(0),
>     new GitRecord(0, row.getString(1), row.getString(2), row.getString(3), row.getString(4))))
> val cfg = new CacheConfiguration[String, GitRecord]()
>   .setName(cacheName)
>   .setIndexedTypes(classOf[String], classOf[GitRecord])
>   .setLoadPreviousValue(true)
> val ic = new IgniteContext[String, GitRecord](sc, "/usr/lib/ignite-hadoop/config/config.xml")
> val sharedRDD = ic.fromCache(cfg)
> sharedRDD.savePairs(rdd)
> sharedRDD.sql("""select lang, count(distinct ForkTable.repo_url) as cnt
>   from (select repo_url, lang from gitrecord
>         where type = 'ForkEvent' and lang is not null
>         group by lang, repo_url) as ForkTable
>   inner join (select repo_url, html_url from gitrecord
>               where type = 'PullRequestEvent'
>               group by repo_url, html_url) as PullTable
>   on ForkTable.repo_url = PullTable.repo_url
>   group by lang order by cnt desc""").show
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)