Shuffle Write Size

2015-12-24 Thread gsvic
Is there any formula with which I could determine the Shuffle Write size before execution? For example, in a Sort Merge join, in the stage in which the first table is being loaded, the shuffle write is 429.2 MB. The table is 5.5 GB in HDFS with a block size of 128 MB, so it is being loaded in 45 tasks/part
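The task count mentioned follows directly from the HDFS split math (a back-of-the-envelope sketch; the shuffle write size itself depends on row serialization and compression, so there is no simple closed-form formula for the 429.2 MB):

```python
import math

def num_splits(file_size_mb: float, block_size_mb: float) -> int:
    """Number of HDFS splits, and hence map tasks, for a file."""
    return math.ceil(file_size_mb / block_size_mb)

# A table of roughly 5.5 GB with a 128 MB block size yields about
# 44-45 splits, depending on the exact file size:
print(num_splits(5.5 * 1024, 128))
```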

Hash Partitioning & Sort Merge Join

2015-11-18 Thread gsvic
In the case of a Sort Merge join in which a shuffle (exchange) will be performed, I have the following questions (please correct me if my understanding is not correct): Let's say that relation A is a JSONRelation (640 MB) on HDFS where the block size is 64 MB. This will produce a Scan JSONRelation()
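The exchange the question refers to routes each row to a partition by hashing its join key; a minimal sketch of that idea (an illustration, not Spark's actual HashPartitioner):

```python
def hash_partition(rows, key, num_partitions):
    """Route each row to a partition by hashing its join key."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts

rows = [{"id": i, "value": i * 10} for i in range(8)]
parts = hash_partition(rows, "id", 4)
assert sum(len(p) for p in parts) == len(rows)  # every row lands in exactly one partition
```

Rows with equal keys always land in the same partition, which is what lets each reducer sort and merge its slice of both relations independently.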

Re: Are map tasks spilling data to disk?

2015-11-17 Thread gsvic
So, in the case of a Sort Merge join in which a shuffle (exchange) will be performed, I have the following questions (please correct me if my understanding is not correct): Let's say that relation A is a JSONRelation (640 MB) on HDFS where the block size is 64 MB. This will produce a Scan JSONRelatio
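After the exchange, each partition holds the matching key ranges from both relations, which are then sorted and merged; the merge phase of a sort-merge join can be sketched as follows (an illustration, not Spark's implementation):

```python
def sort_merge_join(left, right, key):
    """Sort both sides on the key, then merge with two cursors."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            # Emit all pairs that share this key value.
            k, j0 = left[i][key], j
            while i < len(left) and left[i][key] == k:
                j = j0
                while j < len(right) and right[j][key] == k:
                    out.append({**left[i], **right[j]})
                    j += 1
                i += 1
    return out

a = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
b = [{"id": 2, "b": "z"}, {"id": 3, "b": "w"}]
print(sort_merge_join(a, b, "id"))  # one matching pair, on id=2
```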

Map Tasks - Disk Spill (?)

2015-11-15 Thread gsvic
According to this paper, Spark's map tasks write their results to disk. My actual question is, in BroadcastHashJoin
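For context, a BroadcastHashJoin builds an in-memory hash table from the small (broadcast) relation and probes it with each row of the large one, which is why its map side needs no exchange; a minimal sketch with dict rows (not Spark's code):

```python
from collections import defaultdict

def broadcast_hash_join(small, large, key):
    """Build a hash table on the small side, probe with the large side."""
    table = defaultdict(list)
    for row in small:           # build phase: the broadcast relation
        table[row[key]].append(row)
    out = []
    for row in large:           # probe phase: streamed through, no shuffle
        for match in table[row[key]]:
            out.append({**match, **row})
    return out

dim = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
fact = [{"id": 1, "v": 10}, {"id": 2, "v": 20}, {"id": 1, "v": 30}]
print(len(broadcast_hash_join(dim, fact, "id")))  # 3 joined rows
```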

Are map tasks spilling data to disk?

2015-11-15 Thread gsvic
According to this paper, Spark's map tasks write their results to disk. My actual question is, in BroadcastHashJoin

Map Tasks - Disk I/O

2015-11-11 Thread gsvic
According to this paper, Spark's map tasks write their results to disk. My actual question is, in BroadcastHashJoin

Re: Build a specific module only

2015-11-04 Thread gsvic
Ok, thank you very much Ted. -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Build-a-specific-module-only-tp14943p14953.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. ---

Build a specific module only

2015-11-04 Thread gsvic
Is it possible to build a specific Spark module without building the whole project? For example, I am trying to build the sql-core project with "build/mvn -pl sql/core install -DskipTests" and I am getting the following error: [error] bad symbolic reference. A signature in WebUI.class refers to term

RE: ShuffledHashJoin Possible Issue

2015-10-19 Thread gsvic
Hi Hao, each table is created with the following Python code snippet: data = [{'id': 'A%d' % i, 'value': ceil(random()*10)} for i in range(0, 50)]; with open('A.json', 'w+') as output: json.dump(data, output). Tables A and B contain 10 and 50 tuples respectively. In the Spark shell I type sq
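For reference, a runnable version of that snippet with the imports it relies on (json, random, and ceil) added:

```python
import json
from math import ceil
from random import random

# 50 records with a string id and a random value of at most 10, as in the thread.
data = [{'id': 'A%d' % i, 'value': ceil(random() * 10)} for i in range(0, 50)]
with open('A.json', 'w+') as output:
    json.dump(data, output)
```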

ShuffledHashJoin Possible Issue

2015-10-18 Thread gsvic
I am doing some experiments with join algorithms in SparkSQL and I am facing the following issue: I have constructed two "dummy" JSON tables, t1.json and t2.json. Each of them has two columns, ID and Value. The ID is an incremental integer (unique) and the Value a random value. I am running an equi-
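The two dummy tables described (an incremental unique ID plus a random Value) can be generated along these lines; the one-object-per-line layout is an assumption, chosen because that is the form Spark's JSON reader accepts:

```python
import json
import random

def make_table(path, n):
    """Write n rows with a unique incremental ID and a random Value,
    one JSON object per line."""
    with open(path, 'w') as f:
        for i in range(n):
            f.write(json.dumps({"ID": i, "Value": random.random()}) + "\n")

make_table("t1.json", 100)
make_table("t2.json", 100)
```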

Task Execution

2015-09-30 Thread gsvic
Concerning task execution, does a worker execute its assigned tasks in parallel or sequentially?

Re: RDD: Execution and Scheduling

2015-09-22 Thread gsvic
I already have, but I needed some clarifications. Thanks for all your help!

Re: RDD: Execution and Scheduling

2015-09-20 Thread gsvic
Concerning answers 1 and 2: 1) How does Spark determine that a node is a "slow node", and how slow is that? 2) How does an RDD choose a location as a preferred location, and with which criteria? Could you please also include links to the source files for the two questions above?

Re: RDD: Execution and Scheduling

2015-09-17 Thread gsvic
Concerning answers 1 and 2: 1) How does Spark determine that a node is a "slow node", and how slow is that? 2) How does an RDD choose a location as a preferred location, and with which criteria? Could you please also include links to the source files for the two questions above?

RDD: Execution and Scheduling

2015-09-17 Thread gsvic
After reading some parts of the Spark source code, I would like to ask some questions about RDD execution and scheduling. At first, please correct me if I am wrong about the following: 1) The number of partitions equals the number of tasks that will be executed in parallel (e.g., when an RDD is repartitio
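The correspondence in point (1), one task per partition, can be modeled with a toy thread pool (a simplification; Spark's scheduler additionally caps effective parallelism by the available executor cores):

```python
from concurrent.futures import ThreadPoolExecutor

# Four partitions -> four tasks; each task processes exactly one partition.
partitions = [list(range(i * 3, i * 3 + 3)) for i in range(4)]

def run_task(partition):
    return sum(partition)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_task, partitions))

print(results)  # [3, 12, 21, 30]
```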

Re: SQLContext.read.json("path") throws java.io.IOException

2015-08-26 Thread gsvic
No, I created the file by appending each JSON record in a loop without a line break. I've just changed that and now it works fine. Thank you very much for your support.
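The fix described, one complete JSON object per line rather than a single multi-line document, matches what sqlContext.read.json expects; the difference can be sketched as:

```python
import json

records = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]

# What Spark's JSON reader expects: one self-contained object per line.
with open("good.json", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

# What fails: one pretty-printed document spanning many lines.
with open("bad.json", "w") as f:
    json.dump(records, f, indent=2)

with open("good.json") as f:
    parsed = [json.loads(line) for line in f]
assert parsed == records  # each line parses on its own
```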

Re: SQLContext.read.json("path") throws java.io.IOException

2015-08-26 Thread gsvic
ngle line? > On Wed, Aug 26, 2015 at 6:47 AM, gsvic <[hidden email]> wrote: >> Hi, I have the following issue. I am trying to load a 2.5G JSON file from a >> 10-node H

SQLContext.read.json("path") throws java.io.IOException

2015-08-26 Thread gsvic
Hi, I have the following issue. I am trying to load a 2.5G JSON file from a 10-node Hadoop cluster. Actually, I am trying to create a DataFrame using sqlContext.read.json("hdfs://master:9000/path/file.json"). The JSON file contains a parsed table (relation) from the TPC-H benchmark. After finis