[jira] [Commented] (IGNITE-2143) inner join produce wrong result and is very slow

Sergi Vladykin (JIRA) Sat, 26 Dec 2015 00:07:11 -0800

    [ 
https://issues.apache.org/jira/browse/IGNITE-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071833#comment-15071833
 ]


Sergi Vladykin commented on IGNITE-2143:
----------------------------------------

As you can see from the query plan your index on `type` field is never used. 
You need a single compound index on (type, create_at, url) fields in this exact 
order.

Notice that the bug with HAVING (IGNITE-2205) will be fixed in 1.5, so now the 
query will work wrong.

Also you should notice that Ignite SQL is designed for throughput first, while 
you are measuring latency of analytic query. I mean Spark works the following 
way: it splits dataset in a number of chunks and attempts to process these 
chunks as parallel as possible, while Ignite runs each query part just in a 
single thread on each node. If you choose correct indexes you can have 
performance much better than Spark (due to better algorithmic efficiency) + 
good throughput but obviously not in all cases. In the future we will improve 
Ignite SQL so that it will allow tuning for better parallelism and handle 
analytic (latency targeted, disregarding throughput) use cases better.

> inner join produce wrong result and is very slow
> ------------------------------------------------
>
>                 Key: IGNITE-2143
>                 URL: https://issues.apache.org/jira/browse/IGNITE-2143
>             Project: Ignite
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5
>         Environment: 3 nodes in docker containers.
>            Reporter: Sergey Soldatov
>            Priority: Critical
>         Attachments: config.xml, ignite-cache-1.0.0.jar
>
>
> I have following query for gitarchive records processing.  
> {code}
> select 
>      lang, count(distinct ForkTable.repo_url) as cnt 
> from (
>      select repo_url, lang from gitrecord where type = 'ForkEvent' and 
>      lang is not null group by lang, repo_url) as ForkTable 
> inner join (
>      select repo_url, html_url from gitrecord where type='PullRequestEvent'   
>  
>      group by repo_url, html_url) as PullTable 
> on 
>      ForkTable.repo_url = PullTable.repo_url 
> group by lang 
> order by cnt desc
> {code}
> And there are two major problems:
> 1. It produces an incorrect result if it's running in cluster mode.
> 2. It's really slow. for 200k rows it takes 500+ seconds comparing to 20 on 
> Spark for the same query
> Steps to reproduce:
> 1. Download github archive for 1 day and put it to hdfs ("/Data" in my case):
> {code}wget http://data.githubarchive.org/2015-01-01-{0..23}.json.gz{code}
> 2. copy attached ignite-cache-1.0.0.jar to Ignite's lib dir 
> 3. run ignite with attached config.xml 
> 4. run spark-shell
> {code}
> spark-shell --packages org.apache.ignite:ignite-spark:1.5.0-b1 --repositories 
> http://www.gridgainsystems.com/nexus/content/repositories/external --jars 
> /usr/lib/ignite-hadoop/libs/ignite-cache-1.0.0.jar --master 
> spark://{spark-master}:7077
> {code}
> 5. load data and execute the query:
> {code}
> import org.apache.ignite._
> import org.apache.ignite.configuration._
> import org.apache.ignite.cache.query._
> import org.apache.ignite.spark._
> import org.apache.ignite.lang.IgniteBiPredicate
> import io.dtk._
> val df = sqlContext.read.json("/Data/*.gz")
> Ignition.setClientMode(true)
> val cacheName = "gitrecords"
> val rdd = df.select("id", "repo.url", 
> "payload.forkee.language","type","payload.pull_request.head.repo.html_url").map(row
>  => (row.getString(0), new GitRecord(0, 
> row.getString(1),row.getString(2),row.getString(3),row.getString(4))))
> val cfg = new 
> CacheConfiguration[String,GitRecord]().setName(cacheName).setIndexedTypes(classOf[String],classOf[GitRecord]).setLoadPreviousValue(true)
> val ic = new IgniteContext[String,GitRecord](sc, 
> "/usr/lib/ignite-hadoop/config/config.xml")
> val sharedRDD = ic.fromCache(cfg)
> sharedRDD.savePairs(rdd)
> sharedRDD.sql("select lang, count(distinct ForkTable.repo_url) as cnt from 
> (select repo_url, lang from gitrecord where type = 'ForkEvent' and lang is 
> not null group by lang, repo_url) as ForkTable inner join (select repo_url, 
> html_url from gitrecord where type='PullRequestEvent' group by repo_url, 
> html_url) as PullTable on ForkTable.repo_url = PullTable.repo_url group by 
> lang order by cnt desc").show
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (IGNITE-2143) inner join produce wrong result and is very slow

Reply via email to