Like you, there are lots of people coming from the MapReduce world and trying to
understand the internals of Spark. I hope the following helps in some way.
End users only have the concept of a Job: I want to run a word count
job on this one big file; that is the job I want to run. How many stage
Take a look at this JIRA: https://issues.apache.org/jira/browse/SPARK-6910
Yong
> Date: Mon, 1 Jun 2015 12:26:16 -0700
> From: oke...@gmail.com
> To: user@spark.apache.org
> Subject: SparkSQL's performance gets degraded depending on number of
> partitions of Hive tables..is it normal?
>
>
>
Yes. Hive UDFs and DISTRIBUTE BY are both supported by Spark SQL.
If you are using Spark 1.4, you can try the Hive analytic window functions
(https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics), most
of which are already supported in Spark 1.4, so you don't need the
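For illustration, a minimal sketch of calling one of those window functions through a HiveContext in Spark 1.4 (the table and column names are placeholders, not from this thread):

// Hypothetical table sales(store, sale_day, amount); RANK() is one of the Hive
// window functions that Spark SQL 1.4 exposes through the HiveContext.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("""
  SELECT store, sale_day, amount,
         RANK() OVER (PARTITION BY store ORDER BY amount DESC) AS rnk
  FROM sales
""").show()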
Hi, Spark Users:
I have a problem where Spark cannot recognize the string type in the
Parquet schema generated by Hive.
Version of all components:
Spark 1.3.1, Hive 0.12.0, Parquet 1.3.2
I generated a detailed low-level table in Parquet format using MapReduce Java
code. This table can be read
l binary fields :)
Cheng
On 9/25/15 2:03 PM, java8964 wrote:
Hi, Spark Users:
I have a problem where Spark cannot recognize the
string type in the Parquet schema generated by Hive.
Version of
Your employee field is in fact an array of structs, not just a struct.
If you are using the HiveContext, then you can refer to it like the following:
select id from member where employee[0].name = 'employee0'
The employee[0] is pointing to the 1st element of the array.
If you want to query all the elements in the
I don't think you can do that in Standalone mode before 1.5.
The best you can do is to have multiple workers per box. One worker can and will
only start one executor before Spark 1.5.
What you can do is set "SPARK_WORKER_INSTANCES", which controls how many
worker instances you can start per box.
You have 2 options:
Option 1:
Use lateral view explode, as you did below. But if you want to remove the
duplicate, then use distinct after that.
For example:
col1, col2, ArrayOf(Struct)
After explode:
col1, col2, employee0
col1, col2, employee1
col1, col2, employee0
Then select distinct col1, col2 f
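A sketch of option 1 as HiveQL run through Spark SQL (table and column names are placeholders; LATERAL VIEW needs the HiveContext):

// Explode the array of structs into one row per element, then de-duplicate the
// flattened rows with DISTINCT.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("""
  SELECT DISTINCT col1, col2, e.name
  FROM my_table LATERAL VIEW explode(employees) exploded AS e
""")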
I am not sure about the original explanation of shuffle write.
In the word count example, the shuffle is needed, as Spark has to group by the
word (reduceByKey is more accurate here). Imagine that you have 2 mappers reading
the data; then each mapper will generate the (word, count) tuple output in
segment
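For reference, the word-count shape being described, as a generic sketch (the path is a placeholder):

// Each map task emits (word, 1) pairs; reduceByKey combines them map-side first and
// then shuffles, so all counts for the same word end up in one reduce task.
val counts = sc.textFile("hdfs:///data/big.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)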
"The 2000 bytes sent to the driver is the final output aggregated on the
reducers' end, and merged back to the driver." Which part of our word count
code takes care of this part? And yes, there are only 273 distinct words in the
text, so that's not a surprise.
Thanks again,
Hope to get a reply
For every shuffle write, it always writes to disk. What is the meaning
of these properties:
spark.shuffle.memoryFraction
spark.shuffle.spill
Thanks,
Kartik
On Fri, Oct 2, 2015 at 6:22 AM, java8964 wrote:
No problem.
From the mapper side, Spark is very similar to MapReduce;
You want to implement a custom InputFormat for your MPP, which can provide the
location preference information to Spark.
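A rough sketch of what such an InputFormat could look like (every name, the shard-to-host mapping, and the dummy record reader are hypothetical; a real implementation would talk to the MPP's client API):

import java.io.{DataInput, DataOutput}
import org.apache.hadoop.io.{Text, Writable}
import org.apache.hadoop.mapreduce._
import scala.collection.JavaConverters._

// One split per database shard; getLocations() is what Spark reads as the
// preferred host(s) for the corresponding partition.
class MppSplit(var host: String, var shardId: Int) extends InputSplit with Writable {
  def this() = this("", 0)                               // Hadoop requires a no-arg constructor
  override def getLength: Long = 0L                      // unknown size is acceptable
  override def getLocations: Array[String] = Array(host)
  override def write(out: DataOutput): Unit = { out.writeUTF(host); out.writeInt(shardId) }
  override def readFields(in: DataInput): Unit = { host = in.readUTF(); shardId = in.readInt() }
}

class MppInputFormat extends InputFormat[Text, Text] {
  override def getSplits(context: JobContext): java.util.List[InputSplit] = {
    // In reality, ask the MPP catalog which host owns which shard.
    val shardHosts = Seq("mpp-node-1", "mpp-node-2")
    shardHosts.zipWithIndex.map { case (h, i) => new MppSplit(h, i): InputSplit }.asJava
  }
  override def createRecordReader(split: InputSplit, ctx: TaskAttemptContext): RecordReader[Text, Text] =
    new RecordReader[Text, Text] {
      private val mppSplit = split.asInstanceOf[MppSplit]
      private var done = false
      override def initialize(s: InputSplit, c: TaskAttemptContext): Unit = ()  // open the DB connection here
      override def nextKeyValue(): Boolean = { val more = !done; done = true; more }  // emits one dummy row
      override def getCurrentKey: Text = new Text(mppSplit.host)
      override def getCurrentValue: Text = new Text(s"rows of shard ${mppSplit.shardId}")
      override def getProgress: Float = if (done) 1.0f else 0.0f
      override def close(): Unit = ()
    }
}

// Partitions of this RDD will prefer the hosts reported by getLocations().
val mppRdd = sc.newAPIHadoopRDD(sc.hadoopConfiguration,
  classOf[MppInputFormat], classOf[Text], classOf[Text])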
Yong
> Date: Mon, 5 Oct 2015 10:53:27 -0700
> From: vjan...@sankia.com
> To: user@spark.apache.org
> Subject: Building RDD for a Custom MPP Database
>
> Hi
> I have to build a
Hi, Sparkers:
In this case, I want to use Spark as an ETL engine to load the data from
Cassandra, and save it into HDFS.
Here is the environment information:
Spark 1.3.1, Cassandra 2.1, HDFS/Hadoop 2.2
I am using the Cassandra Spark Connector 1.3.x, with which I have no problem
querying the C*
is related: SPARK-10501
On Fri, Oct 9, 2015 at 7:28 AM, java8964 wrote:
Hi, Sparkers:
In this case, I want to use Spark as an ETL engine to load the data from
Cassandra, and save it into HDFS.
Here is the environment information:
Spark 1.3.1, Cassandra 2.1, HDFS/Hadoop 2.2
I am using the
My guess is to use the same UDAF (collect_set) as in Hive.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)
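As an illustration, a hedged sketch of calling that Hive UDAF from Spark SQL (table and column names are placeholders; Hive UDAFs need the HiveContext):

// Group rows by k and collect the distinct v values of each group into an array column.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("SELECT k, collect_set(v) AS vs FROM my_table GROUP BY k").show()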
Yong
From: sliznmail...@gmail.com
Date: Wed, 14 Oct 2015 02:45:48 +
Subject: Re: Spark DataFrame GroupBy into List
To: m
Hi, Sparkers:
I wonder if I can convert an RDD of my own Java class into a DataFrame in Spark
1.3.
Here is what I am trying to achieve: I want to load the data from Cassandra, and
store it into HDFS using either Avro or Parquet format. I want to test if I
can do this in Spark.
I am using Spark 1.3.1
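A minimal sketch of the conversion in Spark 1.3 (the case class and paths are made up; createDataFrame also accepts a JavaRDD plus a bean class for plain Java classes):

import org.apache.spark.sql.SQLContext

case class Person(id: Long, name: String)                 // placeholder for the real class
val sqlContext = new SQLContext(sc)
val rdd = sc.parallelize(Seq(Person(1L, "a"), Person(2L, "b")))
val df = sqlContext.createDataFrame(rdd)                   // schema inferred from the case class
df.saveAsParquetFile("hdfs:///tmp/person.parquet")         // Spark 1.3 API; 1.4+ uses df.write.parquet(...)
// With the spark-avro package, df.save("hdfs:///tmp/person.avro", "com.databricks.spark.avro") writes Avro.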
Not sure the window function can work for his case.
If you do a "sum() over (partition by)", that will return a total sum per
partition, instead of the cumulative sum wanted in this case.
I saw there is a "cume_dist", but no "cume_sum".
Do we really have a "cume_sum" in the Spark window functions, or a
My mistake. I didn't notice that "UNBOUNDED PRECEDING" is already supported.
So a cumulative sum should work then.
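For reference, a sketch of the running total with UNBOUNDED PRECEDING (hypothetical table txn(account, ts, amount); needs a HiveContext and Spark 1.4+):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("""
  SELECT account, ts, amount,
         SUM(amount) OVER (PARTITION BY account ORDER BY ts
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
  FROM txn
""")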
Thanks
Yong
From: java8...@hotmail.com
To: mich...@databricks.com; deenar.toras...@gmail.com
CC: spanayo...@msn.com; user@spark.apache.org
Subject: RE: Spark SQL running totals
Date: Thu, 15
Maybe you need the Hive part?
Yong
Date: Mon, 26 Oct 2015 11:34:30 -0400
Subject: Problem with make-distribution.sh
From: yana.kadiy...@gmail.com
To: user@spark.apache.org
Hi folks,
The building-Spark instructions
(http://spark.apache.org/docs/latest/building-spark.html) suggest that
./make-distr
Won't you be able to use a CASE statement to generate a virtual column (like
partition_num), then use analytic SQL partitioned by this virtual column?
In this case, the full dataset will be just scanned once.
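One hedged reading of the suggestion (the table, the bucket boundaries, and the window function are all placeholders):

// The CASE expression derives the virtual partition column during the same single scan,
// and the analytic function is then partitioned by it (window functions need the HiveContext).
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("""
  SELECT id, amount, bucket,
         ROW_NUMBER() OVER (PARTITION BY bucket ORDER BY amount) AS rn
  FROM (SELECT id, amount,
               CASE WHEN amount < 100 THEN 'small' ELSE 'large' END AS bucket
        FROM events) t
""")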
Yong
Date: Thu, 29 Oct 2015 10:51:53 -0700
Subject: RDD's filter() or using 'where' conditi
bucket_id=2
..
Am I right?
Thanks
Anfernee
On Thu, Oct 29, 2015 at 11:07 AM, java8964 wrote:
Won't you be able to use a CASE statement to generate a virtual column (like
partition_num), then use analytic SQL partitioned by this virtual column?
In this case, the full dataset will be just scan
Any reason that Spark Cassandra connector won't work for you?
Yong
To: bryan.jeff...@gmail.com; user@spark.apache.org
From: bryan.jeff...@gmail.com
Subject: RE: Cassandra via SparkSQL/Hive JDBC
Date: Tue, 10 Nov 2015 22:42:13 -0500
Anyone have thoughts or a similar use-case for SparkSQL / Cassand
In my Spark application, I want to access the passed-in configuration, but it
doesn't work. How should I do that?
object myCode extends Logging {
// starting point of the application
def main(args: Array[String]): Unit = {
val sparkContext = new SparkContext()
val runtimeEnvironment = sp
prefix. So try something like
--conf spark.runtime.environment=passInValue .
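A small sketch of reading such a property back inside the application (the property name follows the example above; the default value is made up):

// Only settings whose names start with "spark." are forwarded by --conf,
// so the application reads them from the SparkConf attached to the SparkContext.
import org.apache.spark.SparkContext
val sparkContext = new SparkContext()
val runtimeEnvironment = sparkContext.getConf.get("spark.runtime.environment", "dev")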
Regards,
Varun
On Thu, Nov 12, 2015 at 9:51 PM, java8964 wrote:
In my Spark application, I want to access the passed-in configuration, but it
doesn't work. How should I do that?
object myCode extends Logging {
// s
Hi, I have one question related to Spark-Avro; not sure if this is the best
place to ask.
I have the following Scala case class, populated with data in the Spark
application, and I tried to save it in Avro format in HDFS:
case class Claim ( ..)
case class Coupon ( account_id: Long .
Hi, Spark users:
We currently are using Spark 1.2.2 + Hive 0.12 + Hadoop 2.2.0 on our production
cluster, which has 42 data/task nodes.
There is one dataset stored as Avro files of about 3 TB. Our business has a complex
query running against the dataset, which is stored in a nested structure with an Array of
St
...@databricks.com
Date: Fri, 7 Aug 2015 11:32:21 -0700
Subject: Re: Spark SQL query AVRO file
To: java8...@hotmail.com
CC: user@spark.apache.org
Have you considered trying Spark SQL's native support for avro data?
https://github.com/databricks/spark-avro
On Fri, Aug 7, 2015 at 11:30 AM, java8964
it
using HiveQL
CREATE TEMPORARY TABLE episodes
USING com.databricks.spark.avro
OPTIONS (path "src/test/resources/episodes.avro")
On Fri, Aug 7, 2015 at 11:42 AM, java8964 wrote:
Hi, Michael:
I am not sure how spark-avro can help in this case.
My understanding is that to use Spa
Currently we have an IBM BigInsight cluster with 1 namenode + 1 JobTracker + 42
data/task nodes, which runs BigInsight V3.0.0.2, corresponding to Hadoop
2.2.0 with MR1.
Since IBM BigInsight doesn't come with Spark, we built Spark 1.2.2 with
Hadoop 2.2.0 + Hive 0.12 ourselves, and dep
gured log4j)
I think your executors are thrashing or spilling to disk. check memory
metrics/swapping
On 11 August 2015 at 23:19, java8964 wrote:
Currently we have a IBM BigInsight cluster with 1 namenode + 1 JobTracker + 42
data/task nodes, which runs with BigInsight V3.0.0.2, corresponding wit
Hi, This email is sent to both dev and user list, just want to see if someone
familiar with Spark/Maven build procedure can provide any help.
I am building Spark 1.2.2 with the following command:
mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -Phive -Phive-0.12.0
The spark-assembly-1.2.2-hadoop2.2.0.jar
I still want to check if anyone can provide any help related to Spark 1.2.2
hanging on our production cluster when reading big HDFS data (7800 Avro
blocks), while it looks fine for small data (769 Avro blocks).
I enabled the debug level in the Spark log4j, and attached the log file in case it
helps.
I am comparing the Spark logs line by line between the hanging case (big
dataset) and the non-hanging case (small dataset).
In the hanging case, Spark's log looks identical to the non-hanging case for
reading the first block of data from HDFS.
But after that, starting from line 438 in the spark
l to serve
classes is not responsive. I'd try running outside of the repl and see if that
works.
sorry not a full diagnosis but maybe this'll help a bit.
On Tue, Aug 11, 2015 at 3:19 PM, java8964 wrote:
Currently we have a IBM BigInsight cluster with 1 namenode + 1 JobTracker + 42
da
From the log, it looks like the OS user who is running Spark cannot open any
more files.
Check your ulimit setting for that user:
ulimit -a
open files (-n) 65536
> Date: Tue, 18 Aug 2015 22:06:04 -0700
> From: swethakasire...@gmail.com
> To: user@spark.apache.org
> Subject: F
Hi, Sparkers:
After the first 2 weeks of Spark in our production cluster, being more familiar with
Spark, we are more confident about avoiding "Lost Executor" errors due to memory issues. So
far, most of our jobs won't fail or slow down due to a "Lost executor".
But sometimes, I observed that individual tasks failed
The closest information I can find online related to this error is
https://issues.apache.org/jira/browse/SPARK-3633
But it is quite different in our case. In our case, we never saw the "(Too many
open files)" error; the log just simply shows the 120 sec timeout.
I checked all the GC output from al
I believe "spark-shell -i scriptFile" is there. We also use it, at least in
Spark 1.3.1.
"dse spark" will just wrap "spark-shell" command, underline it is just invoking
"spark-shell".
I don't know too much about the original problem though.
Yong
Date: Fri, 21 Aug 2015 18:19:49 +0800
Subject: Re:
What version of Spark are you using, or which one comes with DSE 4.7?
We just cannot reproduce it in Spark.
yzhang@localhost>$ more test.spark
val pairs = sc.makeRDD(Seq((0,1),(0,2),(1,20),(1,30),(2,40)))
pairs.reduceByKey((x,y) => x + y).collect
yzhang@localhost>$ ~/spark/bin/spark-shell --master local -i t
In the test job I am running in Spark 1.3.1 on our stage cluster, I can see the
following information on the application's stage page:

Metric      Min     25th percentile   Median    75th percentile   Max
Duration    0 ms    1.1 min           1.5 min   1.7 min           3.4 min
GC Time     11 s    16 s              21 s      25 s              54 s

From the GC output log, I can see it is abou
Was your Spark built with Hive?
I met the same problem before, because the hive-exec jar in Maven itself
includes "protobuf" classes, which will be included in the Spark jar.
Yong
Date: Tue, 25 Aug 2015 12:39:46 -0700
Subject: Re: Protobuf error when streaming from Kafka
From: lcas...@gmail.com
T
Hi, On our production environment, we have a unique problems related to Spark
SQL, and I wonder if anyone can give me some idea what is the best way to
handle this.
Our production Hadoop cluster is IBM BigInsight Version 3, which comes with
Hadoop 2.2.0 and Hive 0.12.
Right now, we build spark 1
ssue? Do I need to build spark from source
code?
On Tue, Aug 25, 2015 at 1:06 PM, Cassa L wrote:
I downloaded below binary version of spark.
spark-1.4.1-bin-cdh4
On Tue, Aug 25, 2015 at 1:03 PM, java8964 wrote:
Did your spark build with Hive?
I met the same problem before because the hive-e
What version of Hive are you using? And did you compile against the right version
of Hive when you compiled Spark?
BTW, spark-avro works great in our experience, but still, some non-tech people
just want to use a SQL shell in Spark, like the Hive CLI.
Yong
From: mich...@databricks.com
Date: Wed, 2
f this issue might be because of querying across different schema versions
of the data?
Thanks,
Giri
On Thu, Aug 27, 2015 at 5:39 AM, java8964 wrote:
What version of Hive are you using? And did you compile against the right version
of Hive when you compiled Spark?
BTY, spark-avro works great for our ex
Or would RDD.max() and RDD.min() not work for you?
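What that suggestion amounts to, as a tiny sketch on made-up numbers:

val values = sc.parallelize(Seq(3.0, 7.5, 1.2))
val (lowest, highest) = (values.min(), values.max())   // 1.2 and 7.5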
Yong
Subject: Re: Calculating Min and Max Values using Spark Transformations?
To: as...@wso2.com
CC: user@spark.apache.org
From: jfc...@us.ibm.com
Date: Fri, 28 Aug 2015 09:28:43 -0700
If you already loaded csv data into a dataframe, why not registe
There are several possibilities here.
1) Keep in mind that 7 GB of data will need way more than 7 GB of heap, as deserialized
Java objects need much more space than the raw data. A rough rule is to multiply by 6 to
8 times, so 7 GB of data needs about 50 GB of heap space.
2) You should monitor the Spark UI, to check how many record
For text files, this merge works fine, but for binary formats like "ORC",
"Parquet" or "Avro", I am not sure this will work.
These kinds of formats in fact are not append-able, as they write the detailed
data information either in the head or in the tail part of the file.
You have to use the format-specific A
It is a bad idea to cross a major version change of protobuf, as it most likely
won't work.
But if you really want to give it a try, set "user classpath first", so the
protobuf 3 coming with your jar will be used.
The setting depends on your deployment mode; check this for the parameter:
https:/
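To the best of my knowledge, the relevant settings for Spark 1.3+ are the ones below (the older experimental name was spark.files.userClassPathFirst); sketched here on a SparkConf:

// Ask Spark to prefer classes from the user's jar (e.g. protobuf 3) over its own copies.
val conf = new org.apache.spark.SparkConf()
  .set("spark.driver.userClassPathFirst", "true")
  .set("spark.executor.userClassPathFirst", "true")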
package all my custom
classes and their dependencies, including protobuf 3. The problem is how to
configure spark-shell to use my uber jar first.
java8964 -- I appreciate the link and will try the configuration. Looks
promising. However, the "user classpath first" attribute does not apply
When you saw this error, did any executor die due to whatever error?
Did you check to see if any executor restarted during your job?
It is hard to help you with just the stack trace. You need to tell us the whole
picture of what happens when your jobs are running.
Yong
From: qhz...@apache.org
Date: Tue, 15 Sep 20
ch the block, and after
several retries, the executor just dies with such an error. And for your
question, I did not see any executor restart during the job. PS: the
operator I am using during that stage is rdd.glom().mapPartitions().
On Tue, Sep 15, 2015 at 11:44 PM, java8964 wrote:
When you saw
er: key already cancelled ?
sun.nio.ch.selectionkeyi...@3011c7c9
java.nio.channels.CancelledKeyException
at
org.apache.spark.network.nio.ConnectionManager.run(ConnectionManager.scala:461)
at
org.apache.spark.network.nio.ConnectionManager$$anon$7.run(ConnectionManager.scala:193)
java8964 wrote on Wed, Sep 16, 2015 at 8
Or at least tell us how many partitions you are using.
Yong
> Date: Tue, 22 Sep 2015 02:06:15 -0700
> From: belevts...@gmail.com
> To: user@spark.apache.org
> Subject: Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables
>
> Could it be that your data is skewed? Do you have variable-len
Your performance problem sounds like it is in the driver, which is trying to
broadcast 10k files all by itself, and that becomes the bottleneck.
What you want is just to transfer the data from Avro format, per file, to another
format. In MR, most likely each mapper processes one file, and you utilized the
w
oductname),''),lower(regexp_replace(regexp_replace(substr(productcategory,2,length(productcategory)-2),'\"',''),\",\",'
') inputlist from landing where dt='2015-9' and userid != '' and userid is
not null and userid
That is interesting.
I don't have any Mesos experience, but just want to know the reason why it does
so.
Yong
> Date: Wed, 23 Sep 2015 15:53:54 -0700
> Subject: Debugging too many files open exception issue in Spark shuffle
> From: dbt...@dbtsai.com
> To: user@spark.apache.org
>
> Hi,
>
> Recen
select userid from landing
where dt='2015-9' and userid != '' and userid is not null and userid is not
NULL and pagetype = 'productDetail' group by userid
""".stripMargin)
@java8964
I tried with sql.shuffle.partitions = 1 but no luck. I
t;,'
') inputlist from landing where
dt='${dateUtil.getYear}-${dateUtil.getMonth}' and day >= '${day}' and userid !=
'' and userid is not null and userid is not NULL and pagetype = 'productDetail'
group by userid
""".stripMa
I am evaluating Spark for our production usage. Our production cluster is
Hadoop 2.2.0 without Yarn. So I want to test Spark with Standalone deployment
running with Hadoop.
What I have in mind is to test a very complex Hive query, which joins 6
tables, with lots of nested structures and explo
Finally I gave up after too many failed retries.
From the log on the worker side, it looks like it failed with a JVM OOM, as below:
15/02/05 17:02:03 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception
in thread Thread[Driver Heartbeater,5,main]
java.lang.OutOfMemoryError: Java heap s
pass in a level of parallelism
as a second parameter to a suitable operation in your code.
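For example, both of these accept an explicit level of parallelism (the path and the numbers are placeholders):

val lines  = sc.textFile("hdfs:///data/input", 200)    // minimum number of partitions
val counts = lines.flatMap(_.split("\\s+"))
  .map(w => (w, 1))
  .reduceByKey(_ + _, 200)                             // number of reduce-side partitions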
Deb
On Thu, Feb 5, 2015 at 1:03 PM, java8964 wrote:
I am evaluating Spark for our production usage. Our production cluster is
Hadoop 2.2.0 without Yarn. So I want to test Spark with Standalone deployment
Hi, I have some questions about how Spark runs jobs concurrently.
For example, suppose I set up Spark on one standalone test box, which has 24 cores
and 64 GB of memory. I set the Worker memory to 48 GB and the Executor memory to 4 GB,
and use spark-shell to run some jobs. Here is something confusing
utor is using. I
> suppose in theory you could write a function that starts its own
> threads too, but that's not generally a good idea or necessary.
>
> Did you read the docs on the site?
> http://spark.apache.org/docs/latest/cluster-overview.html
> http://spark.apache.org/
Hi,
I am using Spark 1.2.0 with Hadoop 2.2. Now I have 2 csv files, each with 8
fields. I know that the first field in both files is an ID. I want to find all
the IDs that exist in the first file but NOT in the 2nd file.
I came up with the following code in spark-shell.
case class origAsLeft
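For reference, a hedged sketch of the same idea without the case classes (the CSV layout is assumed; subtractByKey is an alternative to the left outer join):

// Key each file by its first field (the ID), then keep the IDs from file1 whose
// left outer join against file2 found no match (the right side is None, not null).
val left  = sc.textFile("hdfs:///data/file1.csv").map(_.split(",")).map(f => (f(0), f))
val right = sc.textFile("hdfs:///data/file2.csv").map(_.split(",")).map(f => (f(0), f))
val onlyInFirst = left.leftOuterJoin(right)
  .filter { case (_, (_, rightSide)) => rightSide.isEmpty }
  .keys
// Shorter equivalent: left.subtractByKey(right).keys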
OK. I think I have to use "None" instead of null; then it works. Still switching
over from Java.
I can also just use the field name, as I assumed.
Great experience.
From: java8...@hotmail.com
To: user@spark.apache.org
Subject: spark left outer join with java.lang.UnsupportedOperationException:
empty
Hi, Sparkers:
I just happened to search Google for something related to Spark's
RangePartitioner, and found an old thread on this email list here:
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Partition-td991.html
I followed the code example mentioned in that email thread
Hi,
I am new to Spark, and I am trying to test Spark SQL performance vs Hive. I
set up a standalone box with 24 cores and 64 GB of memory.
We have one SQL query in mind to test. Here is the basic setup on this one box
for the SQL we are trying to run:
1) Dataset 1: a 6.6 GB Avro file with Snappy compre
Can someone share some ideas about how to tune the GC time?
Thanks
From: java8...@hotmail.com
To: user@spark.apache.org
Subject: Spark performance tuning
Date: Fri, 20 Feb 2015 16:04:23 -0500
Hi,
I am new to Spark, and I am trying to test the Spark SQL performance vs Hive. I
setup a standalo
Hi, Sparkers:
I come from the Hadoop MapReducer world, and try to understand some internal
information of spark. From the web and this list, I keep seeing people talking
about increase the parallelism if you get the OOM error. I tried to read
document as much as possible to understand the RDD pa
Can anyone share any thoughts related to my questions?
Thanks
From: java8...@hotmail.com
To: user@spark.apache.org
Subject: Help me understand the partition, parallelism in Spark
Date: Wed, 25 Feb 2015 21:58:55 -0500
Hi, Sparkers:
I come from the Hadoop MapReducer world, and try to understand
kely to
hit an OOM. But there are many other possible sources of OOM, so this is
definitely not the *only* solution.
Sorry I can't comment in particular about Spark SQL -- hopefully somebody more
knowledgeable can comment on that.
On Wed, Feb 25, 2015 at 8:58 PM, java8964 wrote:
Hi, S
Hi, Currently most of the data in our production is stored as Avro + Snappy. I want
to test the benefits if we store the data in Parquet format. I changed our
ETL to generate Parquet format instead of Avro, and want to test a simple
SQL query in Spark SQL, to verify the benefits from Parquet.
I g
This is a Java problem, not really a Spark one.
From this page:
http://stackoverflow.com/questions/18520972/converting-java-file-url-to-file-path-platform-independent-including-u
you can see that using java.nio.* on JDK 7 will fix this issue. But the Path
class in Hadoop uses java.io.* instead o
sc.textFile(…)?
Ningjun
From: java8964 [mailto:java8...@hotmail.com]
Sent: Monday, March 09, 2015 5:33 PM
To: Wang, Ningjun (LNG-NPV); user@spark.apache.org
Subject: RE: sc.textFile() on windows cannot access UNC path
This is a Java problem, not really a Spark one.
From this p
Or another option is to use "Scala-IDE", which is built on top of Eclipse,
instead of pure Eclipse, so Scala comes with it.
Yong
> From: so...@cloudera.com
> Date: Tue, 10 Mar 2015 18:40:44 +
> Subject: Re: Compilation error
> To: mohitanch...@gmail.com
> CC: t...@databricks.com; user@spark.a
You need to include the Hadoop native library in your spark-shell/spark-sql,
assuming your Hadoop native library includes the native Snappy library:
spark-sql --driver-library-path point_to_your_hadoop_native_library
In spark-sql, you can just use any command as if you were in the Hive CLI.
Yong
Date: Wed,
RangePartitioner?
At least for join, you can implement your own partitioner, to utilize the
sorted data.
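A hedged sketch of what such a partitioner could look like (the key type and range boundaries are made up); if both inputs are partitioned with the same partitioner before the join, Spark joins the co-partitioned data without reshuffling it:

import org.apache.spark.Partitioner

// Route each integer key to the range it falls into, mirroring how the data was sorted.
class KeyRangePartitioner(boundaries: Array[Int]) extends Partitioner {
  override def numPartitions: Int = boundaries.length + 1
  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Int]
    val idx = boundaries.indexWhere(k <= _)
    if (idx < 0) boundaries.length else idx
  }
}

val byRange = new KeyRangePartitioner(Array(100, 200, 300))
val a = sc.parallelize(Seq((42, "a"), (250, "b"))).partitionBy(byRange)
val b = sc.parallelize(Seq((42, "x"), (999, "y"))).partitionBy(byRange)
val joined = a.join(b)   // co-partitioned inputs, so the join avoids another shuffle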
Just my 2 cents.
Date: Wed, 11 Mar 2015 17:38:04 -0400
Subject: can spark take advantage of ordered data?
From: jcove...@gmail.com
To: User@spark.apache.org
Hello all,
I am wondering if spark
Hi, I am new to Spark. I am trying to understand the memory benefits of using
KryoSerializer.
I have a one-box standalone test environment, with 24 cores and 24 GB of
memory. I installed Hadoop 2.2 plus Spark 1.2.0.
I put one text file of about 1.2 GB into HDFS. Here are the settings in the
spark-en
Here is what I think:
mapPartitions is for a specialized map that is called only once for each
partition. The entire content of the respective partition is available as a
sequential stream of values via the input argument (Iterator[T]). The
combined result iterators are automatically converted into a new RDD.
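A small illustration of that contract on made-up data:

// The function runs once per partition, receives an Iterator over that partition's
// elements, and returns an Iterator; per-partition setup (e.g. a connection) goes inside.
val rdd  = sc.parallelize(1 to 10, 2)
val sums = rdd.mapPartitions(iter => Iterator(iter.sum))
sums.collect()   // one partial sum per partition: Array(15, 40)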
I read the Spark code a little bit, trying to understand my own question.
It looks like the difference is really between
org.apache.spark.serializer.JavaSerializer and
org.apache.spark.serializer.KryoSerializer, both having the method named
writeObject.
In my test case, for each line of my text f
jects with no circular references. I think
that will improve the performance a little, though I dunno how much.
It might be worth running your experiments again with slightly more complicated
objects and see what you observe.
Imran
On Thu, Mar 19, 2015 at 12:57 PM, java8964 wrote:
I read th
Do you think it is the ulimit for the user running Spark on your nodes?
Can you run "ulimit -a" under the user who is running spark on the executor
node? Does the result make sense for the data you are trying to process?
Yong
From: szheng.c...@gmail.com
To: user@spark.apache.org
Subject: com.esotericsof
The files sound too small to be 2 blocks in HDFS.
Did you set the defaultParallelism to be 3 in your spark?
Yong
Subject: Re: 2 input paths generate 3 partitions
From: zzh...@hortonworks.com
To: rvern...@gmail.com
CC: user@spark.apache.org
Date: Fri, 27 Mar 2015 23:15:38 +
Hi Rares,
T
I think the jar file has to be local. HDFS is not supported for this yet in Spark.
See this answer:
http://stackoverflow.com/questions/28739729/spark-submit-not-working-when-application-jar-is-in-hdfs
> Date: Sun, 29 Mar 2015 22:34:46 -0700
> From: n.e.trav...@gmail.com
> To: user@spark.apache.org
> Sub
You can use HiveContext instead of SQLContext; it should support all of
HiveQL, including LATERAL VIEW explode.
SQLContext does not support that yet.
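A minimal sketch of the switch (table and column names are placeholders):

// HiveQL-only constructs such as LATERAL VIEW explode go through the HiveContext.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("SELECT id, e FROM member LATERAL VIEW explode(employee) t AS e")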
BTW, nice coding format in the email.
Yong
Date: Tue, 31 Mar 2015 18:18:19 -0400
Subject: Re: SparkSql - java.util.NoSuchElementException:
It is hard to say what the reason could be without more detailed information. If you
provide some more information, maybe people here can help you better.
1) What is your worker's memory setting? It looks like your nodes have
128 GB of physical memory each, but what do you specify for the worker's heap
I think implementing your own InputFormat and using SparkContext.hadoopFile()
is the best option for your case.
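As a side note (not the custom-InputFormat route itself): for fixed-length binary records, SparkContext already has a helper, sketched here with made-up values:

// Splits the file into 1024-byte records and returns them as an RDD[Array[Byte]].
val records = sc.binaryRecords("hdfs:///data/big.bin", 1024)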
Yong
From: kvi...@vt.edu
Date: Thu, 2 Apr 2015 17:31:30 -0400
Subject: Re: Reading a large file (binary) into RDD
To: freeman.jer...@gmail.com
CC: user@spark.apache.org
The file has a
Hmm, I just tested my own Spark 1.3.0 build. I have the same problem, but I
cannot reproduce it on Spark 1.2.1
If we check the code change below:
Spark 1.3 branch:
https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
vs
Spark
I tried to check out Spark SQL 1.3.0. I installed it, following the
online documentation here:
http://spark.apache.org/docs/latest/sql-programming-guide.html
In the example, it shows something like this:
// Select everybody, but increment the age by 1
df.select("name", df("age") + 1).show()
//
The import command was already run.
Forgot to mention, the rest of the examples related to "df" all work; just this
one caused a problem.
Thanks
Yong
Date: Fri, 3 Apr 2015 10:36:45 +0800
From: fightf...@163.com
To: java8...@hotmail.com; user@spark.apache.org
Subject: Re: Cannot run the example in the Spa
rhttp://polyglotprogramming.com
On Thu, Apr 2, 2015 at 6:53 PM, java8964 wrote:
I think implementing your own InputFormat and using SparkContext.hadoopFile()
is the best option for your case.
Yong
From: kvi...@vt.edu
Date: Thu, 2 Apr 2015 17:31:30 -0400
Subject: Re: Reading a large file (binary) into R
cartesian is an expensive operation. If you have M records in locations, then
locations.cartesian(locations) will generate an MxM result. If locations is a big
RDD, it is hard to do locations.cartesian(locations) efficiently.
Yong
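A tiny illustration of that blow-up:

val locations = sc.parallelize(1 to 1000)
locations.cartesian(locations).count()   // 1,000,000 pairs from 1,000 records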
> Date: Tue, 7 Apr 2015 10:04:12 -0700
> From: mas.ha...@gmail.c
It is hard to guess why OOM happens without knowing your application's logic
and the data size.
Without knowing that, I can only guess based on some common experience:
1) Increase "spark.default.parallelism"
2) Increase your executor memory; maybe 6g is just not enough
3) Your environment is kind
If you are using the Spark Standalone deployment, make sure you set the
worker memory (SPARK_WORKER_MEMORY) over 20G, and that you do have 20G of physical memory.
Yong
> Date: Tue, 7 Apr 2015 20:58:42 -0700
> From: li...@adobe.com
> To: user@spark.apache.org
> Subject: EC2 spark-submit --executor-memory
>
> Dear Spark team,
>
>
Spark uses the Hadoop TextInputFormat to read the file. Since Hadoop almost
only supports Linux, UTF-8 is the only encoding supported, as it is the
one used on Linux.
If you have data in another encoding, you may want to vote for this JIRA:
https://issues.apache.org/jira/browse/MAPREDUCE-232
Yong
Can you try a local-file URI, something like this:
hiveContext.hql("SET hive.metastore.warehouse.dir=file:///home/spark/hive/warehouse")
Yong
> Date: Thu, 9 Apr 2015 04:59:00 -0700
> From: inv...@gmail.com
> To: user@spark.apache.org
> Subject: SQL can't not create Hive database
>
> H
If it is really due to data skew, would the hanging task have a much bigger Shuffle
Write Size in this case?
In this case, the shuffle write size for that task is 0, and the rest of the I/O of
this task is not much larger than that of the quickly finished tasks; is that normal?
I am also interested in this case, as fro
Really not an expert here, but try the following ideas:
1) I assume you are using YARN; then this blog is very good on resource
tuning:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
2) If 12G is a hard limit in this case, then you have no option but to lower yo