Shuffle Write Size

2015-12-24 Thread gsvic
Is there any formula with which I could determine the Shuffle Write size before execution? For example, in a Sort Merge join, in the stage in which the first table is being loaded, the shuffle write is 429.2 MB. The table is 5.5 GB in HDFS with a block size of 128 MB, so it is being loaded in 45 tasks/part
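The task count mentioned follows directly from the HDFS split math (a back-of-the-envelope sketch; the shuffle write size itself depends on row serialization and compression, so there is no simple closed-form formula for the 429.2 MB):

```python
import math

def num_splits(file_size_mb: float, block_size_mb: float) -> int:
    """Number of HDFS splits, and hence map tasks, for a file."""
    return math.ceil(file_size_mb / block_size_mb)

# A table of roughly 5.5 GB with a 128 MB block size yields about
# 44-45 splits, depending on the exact file size:
print(num_splits(5.5 * 1024, 128))
```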

Hash Partitioning & Sort Merge Join

2015-11-18 Thread gsvic
In the case of a Sort Merge join in which a shuffle (exchange) will be performed, I have the following questions (please correct me if my understanding is not correct): Let's say that relation A is a JSONRelation (640 MB) on HDFS where the block size is 64 MB. This will produce a Scan JSONRelation()
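The exchange the question refers to routes each row to a partition by hashing its join key; a minimal sketch of that idea (an illustration, not Spark's actual HashPartitioner):

```python
def hash_partition(rows, key, num_partitions):
    """Route each row to a partition by hashing its join key."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts

rows = [{"id": i, "value": i * 10} for i in range(8)]
parts = hash_partition(rows, "id", 4)
assert sum(len(p) for p in parts) == len(rows)  # every row lands in exactly one partition
```

Rows with equal keys always land in the same partition, which is what lets each reducer sort and merge its slice of both relations independently.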

Re: Are map tasks spilling data to disk?

2015-11-17 Thread gsvic
So, in the case of a Sort Merge join in which a shuffle (exchange) will be performed, I have the following questions (please correct me if my understanding is not correct): Let's say that relation A is a JSONRelation (640 MB) on HDFS where the block size is 64 MB. This will produce a Scan JSONRelatio
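After the exchange, each partition holds the matching key ranges from both relations, which are then sorted and merged; the merge phase of a sort-merge join can be sketched as follows (an illustration, not Spark's implementation):

```python
def sort_merge_join(left, right, key):
    """Sort both sides on the key, then merge with two cursors."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            # Emit all pairs that share this key value.
            k, j0 = left[i][key], j
            while i < len(left) and left[i][key] == k:
                j = j0
                while j < len(right) and right[j][key] == k:
                    out.append({**left[i], **right[j]})
                    j += 1
                i += 1
    return out

a = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
b = [{"id": 2, "b": "z"}, {"id": 3, "b": "w"}]
print(sort_merge_join(a, b, "id"))  # one matching pair, on id=2
```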

Map Tasks - Disk Spill (?)

2015-11-15 Thread gsvic
According to this paper, Spark's map tasks write their results to disk. My actual question is, in BroadcastHashJoin
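For context, a BroadcastHashJoin builds an in-memory hash table from the small (broadcast) relation and probes it with each row of the large one, which is why its map side needs no exchange; a minimal sketch with dict rows (not Spark's code):

```python
from collections import defaultdict

def broadcast_hash_join(small, large, key):
    """Build a hash table on the small side, probe with the large side."""
    table = defaultdict(list)
    for row in small:           # build phase: the broadcast relation
        table[row[key]].append(row)
    out = []
    for row in large:           # probe phase: streamed through, no shuffle
        for match in table[row[key]]:
            out.append({**match, **row})
    return out

dim = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
fact = [{"id": 1, "v": 10}, {"id": 2, "v": 20}, {"id": 1, "v": 30}]
print(len(broadcast_hash_join(dim, fact, "id")))  # 3 joined rows
```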

Are map tasks spilling data to disk?

2015-11-15 Thread gsvic
According to this paper, Spark's map tasks write their results to disk. My actual question is, in BroadcastHashJoin

Map Tasks - Disk I/O

2015-11-11 Thread gsvic
According to this paper, Spark's map tasks write their results to disk. My actual question is, in BroadcastHashJoin

Re: Build a specific module only

2015-11-04 Thread gsvic
Ok, thank you very much Ted. -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Build-a-specific-module-only-tp14943p14953.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. ---

Build a specific module only

2015-11-04 Thread gsvic
Is it possible to build a specific Spark module without building the whole project? For example, I am trying to build the sql-core project with "build/mvn -pl sql/core install -DskipTests" and I am getting the following error: [error] bad symbolic reference. A signature in WebUI.class refers to term

RE: ShuffledHashJoin Possible Issue

2015-10-19 Thread gsvic
Hi Hao, each table is created with the following Python code snippet: data = [{'id': 'A%d' % i, 'value': ceil(random()*10)} for i in range(0, 50)]; with open('A.json', 'w+') as output: json.dump(data, output). Tables A and B contain 10 and 50 tuples respectively. In the Spark shell I type sq
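For reference, a runnable version of that snippet with the imports it relies on (json, random, and ceil) added:

```python
import json
from math import ceil
from random import random

# 50 records with a string id and a random value of at most 10, as in the thread.
data = [{'id': 'A%d' % i, 'value': ceil(random() * 10)} for i in range(0, 50)]
with open('A.json', 'w+') as output:
    json.dump(data, output)
```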

ShuffledHashJoin Possible Issue

2015-10-18 Thread gsvic
I am doing some experiments with join algorithms in SparkSQL and I am facing the following issue: I have constructed two "dummy" JSON tables, t1.json and t2.json. Each of them has two columns, ID and Value. The ID is an incremental integer (unique) and the Value a random value. I am running an equi-
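The two dummy tables described (an incremental unique ID plus a random Value) can be generated along these lines; the one-object-per-line layout is an assumption, chosen because that is the form Spark's JSON reader accepts:

```python
import json
import random

def make_table(path, n):
    """Write n rows with a unique incremental ID and a random Value,
    one JSON object per line."""
    with open(path, 'w') as f:
        for i in range(n):
            f.write(json.dumps({"ID": i, "Value": random.random()}) + "\n")

make_table("t1.json", 100)
make_table("t2.json", 100)
```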

Task Execution

2015-09-30 Thread gsvic
Concerning task execution, does a worker execute its assigned tasks in parallel or sequentially?

Re: RDD: Execution and Scheduling

2015-09-22 Thread gsvic
I already have, but I needed some clarifications. Thanks for all your help!

Re: RDD: Execution and Scheduling

2015-09-20 Thread gsvic
Concerning answers 1 and 2: 1) How does Spark determine that a node is a "slow node", and how slow is that? 2) How does an RDD choose a location as a preferred location, and with which criteria? Could you please also include links to the source files for the two questions above?

Re: RDD: Execution and Scheduling

2015-09-17 Thread gsvic
Concerning answers 1 and 2: 1) How does Spark determine that a node is a "slow node", and how slow is that? 2) How does an RDD choose a location as a preferred location, and with which criteria? Could you please also include links to the source files for the two questions above?

RDD: Execution and Scheduling

2015-09-17 Thread gsvic
After reading some parts of the Spark source code, I would like to ask some questions about RDD execution and scheduling. At first, please correct me if I am wrong about the following: 1) The number of partitions equals the number of tasks that will be executed in parallel (e.g., when an RDD is repartitio
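The correspondence in point (1), one task per partition, can be modeled with a toy thread pool (a simplification; Spark's scheduler additionally caps effective parallelism by the available executor cores):

```python
from concurrent.futures import ThreadPoolExecutor

# Four partitions -> four tasks; each task processes exactly one partition.
partitions = [list(range(i * 3, i * 3 + 3)) for i in range(4)]

def run_task(partition):
    return sum(partition)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_task, partitions))

print(results)  # [3, 12, 21, 30]
```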

Re: SQLContext.read.json("path") throws java.io.IOException

2015-08-26 Thread gsvic
No, I created the file by appending each JSON record in a loop without a line break. I've just changed that and now it works fine. Thank you very much for your support.
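The fix described, one complete JSON object per line rather than a single multi-line document, matches what sqlContext.read.json expects; the difference can be sketched as:

```python
import json

records = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]

# What Spark's JSON reader expects: one self-contained object per line.
with open("good.json", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

# What fails: one pretty-printed document spanning many lines.
with open("bad.json", "w") as f:
    json.dump(records, f, indent=2)

with open("good.json") as f:
    parsed = [json.loads(line) for line in f]
assert parsed == records  # each line parses on its own
```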

Re: SQLContext.read.json("path") throws java.io.IOException

2015-08-26 Thread gsvic
ngle line? > On Wed, Aug 26, 2015 at 6:47 AM, gsvic <[hidden email]> wrote: >> Hi, I have the following issue. I am trying to load a 2.5G JSON file from a >> 10-node H

SQLContext.read.json("path") throws java.io.IOException

2015-08-26 Thread gsvic
Hi, I have the following issue. I am trying to load a 2.5G JSON file from a 10-node Hadoop cluster. Actually, I am trying to create a DataFrame using sqlContext.read.json("hdfs://master:9000/path/file.json"). The JSON file contains a parsed table (relation) from the TPC-H benchmark. After finis