What's the benefit of RDD checkpoint against RDD save

2016-03-23 Thread Todd
Hi, I have a long computing chain; when I get the last RDD after a series of transformations, I have two choices for this last RDD: 1. Call checkpoint on the RDD to materialize it to disk. 2. Call RDD.saveXXX to save it to HDFS, and read it back for further processing. I would ask which choice i
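A minimal sketch of the two options, assuming a SparkContext sc; the paths and the sample pipeline are placeholders:

    // Option 1: checkpoint. Truncates the lineage; the materialized RDD
    // stays usable in this application without an explicit reload.
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // hypothetical directory
    val last = sc.parallelize(1 to 1000).map(_ * 2).filter(_ % 3 == 0)
    last.cache()       // cache first, or the chain is recomputed for the checkpoint write
    last.checkpoint()
    last.count()       // the first action triggers the actual write

    // Option 2: save to HDFS and read it back for further processing.
    last.saveAsObjectFile("hdfs:///tmp/last-rdd")    // hypothetical path
    val reloaded = sc.objectFile[Int]("hdfs:///tmp/last-rdd")
    reloaded.count()

Checkpointing keeps the handle inside the running application, while save-and-reload survives it; that is the main practical difference between the two choices.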

Code Example of Structured Streaming of 2.0

2016-05-16 Thread Todd
Hi, Are there code examples about how to use the structured streaming feature? Thanks.

Re:Re: Code Example of Structured Streaming of 2.0

2016-05-17 Thread Todd
Thanks Ted! At 2016-05-17 16:16:09, "Ted Yu" wrote: Please take a look at: [SPARK-13146][SQL] Management API for continuous queries [SPARK-14555] Second cut of Python API for Structured Streaming On Mon, May 16, 2016 at 11:46 PM, Todd wrote: Hi, Are there code examples

Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Todd
Hi, We have a requirement to do count(distinct) in a processing batch against all the streaming data (e.g., the last 24 hours' data); that is, when we do count(distinct), we actually want to compute distinct against the last 24 hours' data. Does Structured Streaming support this scenario? Thanks!
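Spark 2.0 could not do this (see the reply below); as a hedged sketch against the later 2.1+ API, an approximate distinct count over a sliding 24-hour event-time window would look roughly like this, where the streaming DataFrame events and its columns ts and userId are assumptions:

    import org.apache.spark.sql.functions.{approx_count_distinct, col, window}

    // Exact count(distinct) is not supported on streaming DataFrames,
    // but an approximate distinct count over a sliding window is.
    val distinctUsers = events
      .withWatermark("ts", "1 hour")                      // bounds the kept state
      .groupBy(window(col("ts"), "24 hours", "1 hour"))   // sliding 24-hour window
      .agg(approx_count_distinct("userId").as("approxDistinct"))

    distinctUsers.writeStream
      .outputMode("append")
      .format("console")
      .start()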

Re:Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Todd
TH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 17 May 2016 at 20:02, Michael Armbrust wrote: In 2.0 you won't be able to do this. The long term vision would be to make

How to use Kafka as data source for Structured Streaming

2016-05-17 Thread Todd
Hi, I am wondering whether Structured Streaming supports Kafka as a data source. I briefly read the source code (mainly the parts related to the DataSourceRegister trait), and didn't find any Kafka data source there. Thanks.

How to change output mode to Update

2016-05-17 Thread Todd
scala> records.groupBy("name").count().write.trigger(ProcessingTime("30 seconds")).option("checkpointLocation", "file:///home/hadoop/jsoncheckpoint").startStream("file:///home/hadoop/jsonresult") org.apache.spark.sql.AnalysisException: Aggregations are not supported on streaming DataFrames/Datasets
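For reference, in the API as released in Spark 2.0 the pipeline above is written with writeStream, and the aggregation is accepted once the output mode is Complete (an Update mode arrived later). A sketch, with records assumed to be a streaming DataFrame:

    import org.apache.spark.sql.streaming.ProcessingTime

    val query = records.groupBy("name").count()
      .writeStream
      .outputMode("complete")              // aggregations need Complete (or, later, Update)
      .trigger(ProcessingTime("30 seconds"))
      .option("checkpointLocation", "file:///home/hadoop/jsoncheckpoint")
      .format("console")                   // the file sink only supports Append mode
      .start()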

Re:Re: How to change output mode to Update

2016-05-17 Thread Todd
At 2016-05-18 12:10:11, "Ted Yu" wrote: Have you tried adding: .mode(SaveMode.Overwrite) On Tue, May 17, 2016 at 8:55 PM, Todd wrote: scala> records.groupBy("name").count().write.trigger(ProcessingTime("30 seconds")).option("checkpointLoca

Re:Re: Re: How to change output mode to Update

2016-05-17 Thread Todd
eaming("mode() can only be called on non-continuous queries") this.mode = saveMode this } On Wed, May 18, 2016 at 12:25 PM, Todd wrote: Thanks Ted. I didn't try, but I think SaveMode and OutputMode are different things. Currently, the spark code contains two output modes, Append an

Does Structured Streaming support Kafka as data source?

2016-05-18 Thread Todd
Hi, I briefly read the Spark code, and it looks like Structured Streaming doesn't support Kafka as a data source yet?

Does spark support Apache Arrow

2016-05-19 Thread Todd
From the official site http://arrow.apache.org/, Apache Arrow is used for columnar in-memory storage. I have two quick questions: 1. Does Spark support Apache Arrow? 2. When a DataFrame is cached in memory, the data is saved in a columnar in-memory format. What is the relationship between this featur

How spark depends on Guava

2016-05-22 Thread Todd
Hi, In the Spark code, the Guava Maven dependency scope is provided; my question is, how does Spark depend on Guava at runtime? I looked into spark-assembly-1.6.1-hadoop2.6.1.jar and didn't find class entries like com.google.common.base.Preconditions etc...

Re:How spark depends on Guava

2016-05-22 Thread Todd
Can someone please take a look at my question? I am using spark-shell in local mode and yarn-client mode. Spark code uses the Guava library, so Spark should have Guava in place at run time. Thanks. At 2016-05-23 11:48:58, "Todd" wrote: Hi, In the Spark code, the Guava Maven dependency scope i

Re:Re: How spark depends on Guava

2016-05-23 Thread Todd
wrote: I got curious so I tried sbt dependencyTree. Looks like Guava comes into spark core from a couple of places. -Mat matschaffer.com On Mon, May 23, 2016 at 2:32 PM, Todd wrote: Can someone please take a look at my question? I am using spark-shell in local mode and yarn-client mode. Spark

Re:why spark 1.6 use Netty instead of Akka?

2016-05-23 Thread Todd
As far as I know, there would be an Akka version conflict when using Akka as a Spark Streaming source. At 2016-05-23 21:19:08, "Chaoqiang" wrote: >I want to know why spark 1.6 use Netty instead of Akka? Is there some >difficult problems which Akka can not solve, but using Netty can s

Re:how to config spark thrift jdbc server high available

2016-05-23 Thread Todd
There is a JIRA that adds Spark Thrift Server HA; the patch works, but it still hasn't been merged into the master branch. At 2016-05-23 20:10:26, "qmzhang" <578967...@qq.com> wrote: >Dear guys, please help... > >In hive,we can enable hiveserver2 high available by using dynamic service >discove

How data locality is honored when spark is running on yarn

2016-01-27 Thread Todd
Hi, I am kind of confused about how data locality is honored when Spark is running on YARN (client or cluster mode); can someone please elaborate on this? Thanks!

Compile error when compiling spark 2.0.0 snapshot code base in IDEA

2016-01-27 Thread Todd
Hi, I am able to maven install the whole Spark project (from GitHub) in my IDEA. But when I run the SparkPi example, IDEA compiles the code again and the following exception is thrown. Has someone met this problem? Thanks a lot. Error:scalac: while compiling: D:\opensourceprojects\sp

Re:Hive on Spark knobs

2016-01-28 Thread Todd
Did you run hive on spark with spark 1.5 and hive 1.1? I think hive on spark doesn't support spark 1.5. There are compatibility issues. At 2016-01-28 01:51:43, "Ruslan Dautkhanov" wrote: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started There are quite a lot

Required file not found: sbt-interface.jar

2015-11-02 Thread Todd
Hi, I am trying to build spark 1.5.1 in my environment, but encounter the following error complaining "Required file not found: sbt-interface.jar". The error message is below, and I am building with: ./make-distribution.sh --name spark-1.5.1-bin-2.6.0 --tgz --with-tachyon -Phadoop-2.6 -Dhadoop.vers

[Spark R]could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'

2015-11-06 Thread Todd
I am launching SparkR with the following script: ./sparkR --driver-memory 12G and I try to load a local 3G csv file with the following code, > a=read.transactions("/home/admin/datamining/data.csv",sep="\t",format="single",cols=c(1,2)) but I encounter an error: could not allocate memory (2048 Mb) in C

How to use --principal and --keytab in SparkSubmit

2015-11-08 Thread Todd
Hi, I am starting the Spark Thrift Server with the following script, ./start-thriftserver.sh --master yarn-client --driver-memory 1G --executor-memory 2G --driver-cores 2 --executor-cores 2 --num-executors 4 --hiveconf hive.server2.thrift.port=10001 --hiveconf hive.server2.thrift.bind.host=$(hostname

How 'select name,age from TBL_STUDENT where age = 37' is optimized when caching it

2015-11-16 Thread Todd
Hi, When I cache the DataFrame and run the query, val df = sqlContext.sql("select name,age from TBL_STUDENT where age = 37") df.cache() df.show println(df.queryExecution) I got the following execution plan; from the optimized logical plan, I can see the whole analyzed logical
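A sketch of the probe, extending the snippet above; after the first action materializes the cache, the optimized plan should contain an InMemoryRelation node instead of the original table scan:

    val df = sqlContext.sql("select name, age from TBL_STUDENT where age = 37")
    df.cache()
    df.show()                                 // materializes the cache
    println(df.queryExecution.optimizedPlan)  // expect an InMemoryRelation here
    df.explain(true)                          // parsed, analyzed, optimized and physical plans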

About Databricks's spark-sql-perf

2015-08-13 Thread Todd
Hi, I got a question about the spark-sql-perf project by Databricks at https://github.com/databricks/spark-sql-perf/ The Tables.scala (https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/bigdata/Tables.scala) and BigData (https://github.com/dat

Re:Re: About Databricks's spark-sql-perf

2015-08-13 Thread Todd
y requires that you have already created the data/tables. I'll work on updating the README as the QA period moves forward. On Thu, Aug 13, 2015 at 6:49 AM, Todd wrote: Hi, I got a question about the spark-sql-perf project by Databricks at https://github.com/databricks/spark-sql-perf/

Materials for deep insight into Spark SQL

2015-08-13 Thread Todd
Hi, I would ask whether there are slides, blogs or videos on how Spark SQL is implemented, the process or the whole picture of what happens when Spark SQL executes code. Thanks!

Re:Re: Materials for deep insight into Spark SQL

2015-08-14 Thread Todd
bricks.com/document/d/1Hc_Ehtr0G8SQUg69cmViZsMi55_Kf3tISD9GPGU5M1Y/edit FYI On Thu, Aug 13, 2015 at 8:54 PM, Todd wrote: Hi, I would ask whether there are slides, blogs or videos on the topic about how spark sql is implemented, the process or the whole picture when spark sql executes the code, Thanks!.

Can't understand the size of raw RDD and its DataFrame

2015-08-15 Thread Todd
Hi, With the following code snippet, I cached the raw RDD (which is already in memory, but just for illustration) and its DataFrame. I thought that the df cache would take less space than the rdd cache, which turns out to be wrong, because from the UI I see the rdd cache takes 168B, while the df cache takes 272

Re:Re: Can't understand the size of raw RDD and its DataFrame

2015-08-15 Thread Todd
otprint of dataframe to be lower when it contains more information ( RDD + Schema) On Sat, Aug 15, 2015 at 6:35 PM, Todd wrote: Hi, With following code snippet, I cached the raw RDD(which is already in memory, but just for illustration) and its DataFrame. I thought that the df cache would take

Understanding the two jobs run with spark sql join

2015-08-16 Thread Todd
Hi, I have a basic Spark SQL join run in local mode. I checked the UI and see that two jobs are run. Their DAG graphs are pasted at the end. I have several questions here: 1. It looks like Job 0 and Job 1 have the same DAG stages, but stage 3 and stage 4 are skipped. I would ask wha

Paper on Spark SQL

2015-08-17 Thread Todd
Hi, I can't access http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf. Could someone help try to see if it is available and reply with it?Thanks!

Re:Re: Regarding rdd.collect()

2015-08-18 Thread Todd
One Spark application can have many jobs, e.g., first call rdd.count, then call rdd.collect. At 2015-08-18 15:37:14, "Hemant Bhanawat" wrote: It is still in memory for future rdd transformations and actions. This is interesting. You mean Spark holds the data in memory between two job execut
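A minimal sketch of that two-job case, assuming a SparkContext sc and a hypothetical input path:

    val words = sc.textFile("hdfs:///tmp/input.txt")  // hypothetical path
      .flatMap(_.split(" "))
    words.cache()              // keeps the partitions in memory between the two jobs
    val n = words.count()      // action #1 -> Job 0
    val all = words.collect()  // action #2 -> Job 1, served from the cache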

Re:Changed Column order in DataFrame.Columns call and insertIntoJDBC

2015-08-18 Thread Todd
Take a look at the doc for the method: /** * Applies a schema to an RDD of Java Beans. * * WARNING: Since there is no guaranteed ordering for fields in a Java Bean, * SELECT * queries will return the columns in an undefined order. * @group dataframes * @since 1.3

Why there are overlapping for tasks on the EventTimeline UI

2015-08-18 Thread Todd
Hi, The following is copied from the Spark EventTimeline UI. I don't understand why there is overlap between tasks. I think they should run sequentially, one by one, within an executor (there is one core per executor). The blue part of each task is the scheduler delay time. Does it mean it is the d

Re:Why there are overlapping for tasks on the EventTimeline UI

2015-08-18 Thread Todd
I think I found the answer: on the UI, the recorded time of each task is when it is put into the thread pool. Then the UI makes sense. At 2015-08-18 17:40:07, "Todd" wrote: Hi, Following is copied from the spark EventTimeline UI. I don't understand why there are overlappin

Re:How to automatically relaunch a Driver program after crashes?

2015-08-19 Thread Todd
There is an option for the spark-submit (Spark standalone or Mesos with cluster deploy mode only) --supervise If given, restarts the driver on failure. At 2015-08-19 14:55:39, "Spark Enthusiast" wrote: Folks, As I see, the Driver program is a single point of failure. N

Does spark sql support column indexing

2015-08-19 Thread Todd
I can't find any discussion of whether Spark SQL supports column indexing. If it does, is there a guide on how to do it? Thanks.

Re:Re: How to automatically relaunch a Driver program after crashes?

2015-08-19 Thread Todd
o relaunch if driver runs as a Hadoop Yarn Application? On Wednesday, 19 August 2015 12:49 PM, Todd wrote: There is an option for the spark-submit (Spark standalone or Mesos with cluster deploy mode only) --supervise If given, restarts the driver on failure. At 201

blogs/articles/videos on how to analyse spark performance

2015-08-19 Thread Todd
Hi, I would ask if there are blogs/articles/videos on how to analyse Spark performance at runtime, e.g., tools that can be used or something related.

Re:SPARK sql: Need JSON back instead of row

2015-08-21 Thread Todd
Please try DataFrame.toJSON; it will give you an RDD of JSON strings. At 2015-08-21 15:59:43, "smagadi" wrote: >val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND >age <= 19") > >I need teenagers to be a JSON object rather than a simple row. How can we get >that done ? >
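A sketch of the round trip in the 1.x API, where toJSON returns an RDD[String]:

    val teenagers = sqlContext.sql(
      "SELECT name FROM people WHERE age >= 13 AND age <= 19")

    val asJson: org.apache.spark.rdd.RDD[String] = teenagers.toJSON
    asJson.collect().foreach(println)   // e.g. {"name":"Justin"}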

What does Attribute and AttributeReference mean in Spark SQL

2015-08-24 Thread Todd
There are many case classes and concepts such as Attribute/AttributeReference/Expression in Spark SQL. I would ask what Attribute/AttributeReference/Expression mean. Given a SQL query like select a,b from c, are a and b two Attributes? Is a + b an expression? It looks like I misunderstand it b

Test case for the spark sql catalyst

2015-08-24 Thread Todd
Hi, Are there test cases for the Spark SQL Catalyst, such as tests for the rules that transform an unresolved query plan? Thanks!

Re:RE: Test case for the spark sql catalyst

2015-08-24 Thread Todd
Thanks Chenghao! At 2015-08-25 13:06:40, "Cheng, Hao" wrote: Yes, check the source code under:https://github.com/apache/spark/tree/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst From: Todd [mailto:bit1...@163.com] Sent: Tuesday, August 25, 2015 1:01

Exception throws when running spark pi in Intellij Idea that scala.collection.Seq is not found

2015-08-25 Thread Todd
I cloned the code from https://github.com/apache/spark to my machine. It compiles successfully, but when I run SparkPi, it throws the exception below complaining that scala.collection.Seq is not found. I have installed Scala 2.10.4 on my machine, and use the default profiles: window, scala2.1

Re:Re: Exception throws when running spark pi in Intellij Idea that scala.collection.Seq is not found

2015-08-25 Thread Todd
of modules: https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html On Tue, Aug 25, 2015 at 12:18 PM, Todd wrote: I cloned the code from https://github.com/apache/spark to my machine. It can compile successfully, But when I run the sparkpi, it throws an

Re:Re: What does Attribute and AttributeReference mean in Spark SQL

2015-08-25 Thread Todd
oject ['a] LocalRelation [a#0] scala> parsedQuery.analyze res11: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan=Project [a#0] LocalRelation [a#0] The #0 after a is a unique identifier (within this JVM) that says where the data is coming from, even as plans are rearranged due to optimiza

How to increase data scale in Spark SQL Perf

2015-08-25 Thread Todd
Hi, Spark SQL Perf itself contains benchmark data generation. I am using spark-shell to run Spark SQL Perf to generate the data, with 10G memory for both driver and executor. When I increase the scale factor to 30 and run the job, I get the following error. When I jstack it to s

Re:Re: How to increase data scale in Spark SQL Perf

2015-08-25 Thread Todd
(Interpreted frame) At 2015-08-25 19:32:56, "Ted Yu" wrote: Looks like you were attaching images to your email which didn't go through. Consider using third party site for images - or paste error in text. Cheers On Tue, Aug 25, 2015 at 4:22 AM, Todd wrote: Hi, The

Re:Re: How to increase data scale in Spark SQL Perf

2015-08-25 Thread Todd
e you able to get more detailed error message ? Thanks On Aug 25, 2015, at 6:57 PM, Todd wrote: Thanks Ted Yu. Following are the error message: 1. The exception that is shown on the UI is : Exception in thread "Thread-113" Exception in thread "Thread-126" Exception in

BlockNotFoundException when running spark word count on Tachyon

2015-08-25 Thread Todd
I am using Tachyon in the Spark program below, but I encounter a BlockNotFoundException. Does someone know what's wrong, and is there a guide on how to configure Spark to work with Tachyon? Thanks! conf.set("spark.externalBlockStore.url", "tachyon://10.18.19.33:19998") conf.set("spark.ex

Re:Re:Re: How to increase data scale in Spark SQL Perf

2015-08-26 Thread Todd
Sorry for the noise, it's my bad... I have worked it out now. At 2015-08-26 13:20:57, "Todd" wrote: I think the answer is No. I only see such messages on the console, and #2 is the thread stack trace. I am thinking that Spark SQL Perf forks many dsdgen processes to gene

Re:Re: How to increase data scale in Spark SQL Perf

2015-08-26 Thread Todd
Increase the number of executors, :-) At 2015-08-26 16:57:48, "Ted Yu" wrote: Mind sharing how you fixed the issue ? Cheers On Aug 26, 2015, at 1:50 AM, Todd wrote: Sorry for the noise, it's my bad... I have worked it out now. At 2015-08-26 13:20:57, "Todd"

spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-10 Thread Todd
Hi, I am using data generated with sparksqlperf(https://github.com/databricks/spark-sql-perf) to test the spark sql performance (spark on yarn, with 10 nodes) with the following code (The table store_sales is about 90 million records, 6G in size) val outputDir="hdfs://tmp/spark_perf/scaleFact

Re:Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-10 Thread Todd
Code Generation: false At 2015-09-11 02:02:45, "Michael Armbrust" wrote: I've been running TPC-DS SF=1500 daily on Spark 1.4.1 and Spark 1.5 on S3, so this is surprising. In my experiments Spark 1.5 is either the same or faster than 1.4 with only

Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-10 Thread Todd
the query again? In our previous testing, it’s about 20% slower for sort merge join. I am not sure if there anything else slow down the performance. Hao From: Jesse F Chen [mailto:jfc...@us.ibm.com] Sent: Friday, September 11, 2015 1:18 PM To: Michael Armbrust Cc: Todd; user@spark.

Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-10 Thread Todd
t we found it probably causes the performance reduce dramatically. From: Todd [mailto:bit1...@163.com] Sent: Friday, September 11, 2015 2:17 PM To: Cheng, Hao Cc: Jesse F Chen; Michael Armbrust; user@spark.apache.org Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared w

Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-10 Thread Todd
/spark-sql? It’s a new feature in Spark 1.5, and it’s true by default, but we found it probably causes the performance reduce dramatically. From: Todd [mailto:bit1...@163.com] Sent: Friday, September 11, 2015 2:17 PM To: Cheng, Hao Cc: Jesse F Chen; Michael Armbrust; user@spark.apache.o

Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Todd
here is no table to show queries and execution plan information. At 2015-09-11 14:39:06, "Todd" wrote: Thanks Hao. Yes, it is still slow with SMJ. Let me try the option you suggested. At 2015-09-11 14:34:46, "Cheng, Hao" wrote: You mean the performance is stil

Re:Re: RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-13 Thread Todd
>> >> >> On 2015-09-11 15:58, Cheng, Hao wrote: >> >> Can you confirm whether the query really runs in cluster mode, not local mode? Can you print the call stack of the executor while the query is running? >> >> >> >> BTW: spark.shuffle.

A signature in Logging.class refers to type Logger in package org.slf4j which is not available.

2015-02-11 Thread Todd
After compiling the Spark 1.2.0 codebase in IntelliJ IDEA and running the LocalPi example, I got the following slf4j-related issue. Does anyone know how to fix this? Thanks. Error:scalac: bad symbolic reference. A signature in Logging.class refers to type Logger in package org.slf4j which is not av

Re:Re: A signature in Logging.class refers to type Logger in package org.slf4j which is not available.

2015-02-11 Thread Todd
ed Yu" wrote: Spark depends on slf4j 1.7.5 Please check your classpath and make sure slf4j is included. Cheers On Wed, Feb 11, 2015 at 6:20 AM, Todd wrote: After compiling the Spark 1.2.0 codebase in Intellj Idea, and run the LocalPi example,I got the following slf4j related i

Re:Re: How can I read this avro file using spark & scala?

2015-02-11 Thread Todd
Databricks provides sample code on its website, but I can't find it right now. At 2015-02-12 00:43:07, "captainfranz" wrote: >I am confused as to whether avro support was merged into Spark 1.2 or it is >still an independent library. >I see some people writing sqlContext.avroFile similarly

Re:Is Databricks log analysis reference app only based on Java API

2015-02-18 Thread Todd
Sorry for the noise, I have found it. At 2015-02-18 23:34:40, "Todd" wrote: It looks like the log analysis reference app provided by Databricks at https://github.com/databricks/reference-apps only has a Java API? I'd like to see a Scala version.

Is Databricks log analysis reference app only based on Java API

2015-02-18 Thread Todd
It looks like the log analysis reference app provided by Databricks at https://github.com/databricks/reference-apps only has a Java API? I'd like to see a Scala version.

I think I am almost lost in the internals of Spark

2015-01-06 Thread Todd
I am a bit new to Spark; so far I have tried simple things like word count, and the examples given in the Spark SQL programming guide. Now I am investigating the internals of Spark, but I think I am almost lost, because I cannot grasp the whole picture of what Spark does when it executes the word

Build spark source code with Maven in Intellij Idea

2015-01-08 Thread Todd
Hi, I have imported the Spark source code into IntelliJ IDEA as an SBT project. I tried to do a Maven install in IntelliJ IDEA by clicking Install on the Spark Project Parent POM (root), but it failed. I would ask which profiles should be checked. What I want to achieve is starting Spark in the IDE, and Hadoop

Re:Re: EventBatch and SparkFlumeProtocol not found in spark codebase?

2015-01-09 Thread Todd
Thanks Sean. I followed the guide and imported the codebase into IntelliJ IDEA as a Maven project, with the profiles hadoop-2.4 and yarn. In the Maven project view, I ran Maven Install against the module Spark Project Parent POM (root). After a pretty long time, all the modules were built successfully. But

Re: Question about Serialization in Storage Level

2015-05-21 Thread Todd Nist
From the docs, https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence: Storage Level MEMORY_ONLY: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're n

spark.executor.extraClassPath - Values not picked up by executors

2015-05-22 Thread Todd Nist
utor classpath are duplicated? This is a newly installed spark-1.3.1-bin-hadoop2.6 standalone cluster just to ensure I had nothing from testing in the way. I can set the SPARK_CLASSPATH in the $SPARK_HOME/spark-env.sh and it will pick up the jar and append it fine. Any suggestions on what is going on here? Seems to just ignore whatever I have in the spark.executor.extraClassPath. Is there a different way to do this? TIA. -Todd

Re: spark.executor.extraClassPath - Values not picked up by executors

2015-05-23 Thread Todd Nist
w. As for the spark-cassandra-connector 1.3.0-SNAPSHOT, I am building that from master. Haven't hit any issue with it yet. -Todd On Fri, May 22, 2015 at 9:39 PM, Yana Kadiyska wrote: > Todd, I don't have any answers for you...other than the file is actually > named spark-de

Re: Spark SQL and Streaming Results

2015-06-05 Thread Todd Nist
There used to be a project, StreamSQL (https://github.com/thunderain-project/StreamSQL), but it appears a bit dated and I do not see it in the Spark repo, though I may have missed it. @TD Is this project still active? I'm not sure what the status is, but it may provide some insights on how to achieve w

Re: How to pass arguments dynamically, that needs to be used in executors

2015-06-11 Thread Todd Nist
: Broadcast cmdLineArg = sc.broadcast(Integer.parseInt(args[12])); Then just reference the broadcast variable in your workers. It will get shipped once to all nodes in the cluster and can be referenced by them. HTH. -Todd On Thu, Jun 11, 2015 at 8:23 AM, gaurav sharma wrote: > Hi, > > I am us
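The same pattern as a minimal Scala sketch; the argument index and the sample data are placeholders:

    // Parse once on the driver, broadcast, and read .value inside the closure.
    val threshold = sc.broadcast(args(12).toInt)   // assumes main(args: Array[String])
    val kept = sc.parallelize(1 to 100).filter(_ > threshold.value)
    println(kept.count())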

Re: Spark 1.4 release date

2015-06-12 Thread Todd Nist
It was released yesterday. On Friday, June 12, 2015, ayan guha wrote: > Hi > > When is official spark 1.4 release date? > Best > Ayan >

Re: Spark DataFrame Reduce Job Took 40s for 6000 Rows

2015-06-15 Thread Todd Nist
ere; -Todd On Mon, Jun 15, 2015 at 10:57 AM, Proust GZ Feng wrote: > Thanks a lot Akhil, after try some suggestions in the tuning guide, there > seems no improvement at all. > > And below is the job detail when running locally(8cores) which took 3min > to complete the job, we can

Re: Spark 1.4 on HortonWork HDP 2.2

2015-06-19 Thread Todd Nist
You can get HDP with at least 1.3.1 from Hortonworks: http://hortonworks.com/hadoop-tutorial/using-apache-spark-technical-preview-with-hdp-2-2/ For your convenience, from the docs: wget -nv http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.2.4.4/hdp.repo -O /etc/yum.repos.d/HDP-TP.repo

Re: Setting JVM heap start and max sizes, -Xms and -Xmx, for executors

2015-07-02 Thread Todd Nist
You should use: spark.executor.memory from the docs <https://spark.apache.org/docs/latest/configuration.html>: spark.executor.memory (default 512m): Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g). -Todd On Thu, Jul 2, 2015 at 3:36 PM, Mulugeta
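The programmatic equivalent, a sketch (the same value can be given as --executor-memory 2g on spark-submit):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("mem-example")            // hypothetical app name
      .set("spark.executor.memory", "2g")   // sets the executor JVM max heap (-Xmx);
                                            // a separate start size (-Xms) is not exposed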

Re: Setting JVM heap start and max sizes, -Xms and -Xmx, for executors

2015-07-02 Thread Todd Nist
limitation at this time. -Todd On Thu, Jul 2, 2015 at 4:13 PM, Mulugeta Mammo wrote: > thanks but my use case requires I specify different start and max heap > sizes. Looks like spark sets start and max sizes same value. > > On Thu, Jul 2, 2015 at 1:08 PM, Todd Nist wrote: >

Re: [X-post] Saving SparkSQL result RDD to Cassandra

2015-07-09 Thread Todd Nist
utput stream and therefore materialized. Change it to a map, foreach or some other form of transform. HTH -Todd On Thu, Jul 9, 2015 at 5:24 PM, Su She wrote: > Hello All, > > I also posted this on the Spark/Datastax thread, but thought it was also > 50% a spark question (or mostly

Re: Saving RDD into cassandra keyspace.

2015-07-10 Thread Todd Nist
conf = new SparkConf(true) .set("spark.cassandra.connection.host", "127.0.0.1") HTH -Todd On Fri, Jul 10, 2015 at 5:24 AM, Prateek . wrote: > Hi, > > > > I am beginner to spark , I want save the word and its count to cassandra > key

Re: spark streaming job to hbase write

2015-07-15 Thread Todd Nist
There are three connector packages listed on the Spark Packages web site: http://spark-packages.org/?q=hbase HTH. -Todd On Wed, Jul 15, 2015 at 2:46 PM, Shushant Arora wrote: > Hi > > I have a requirement of writing to an HBase table from a Spark streaming app > after some processing. >

Re: Use rank with distribute by in HiveContext

2015-07-16 Thread Todd Nist
ranking functions (SQL name / DataFrame API name): rank / rank, dense_rank / denseRank, percent_rank / percentRank, ntile / ntile, row_number / rowNumber. HTH. -Todd On Thu, Jul 16, 2015 at 8:10 AM, Lior Chaga wrote: > Does spark HiveContext support the rank() ... distribute by syntax (as in > the following article- > http://www.edwardcapriol
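A sketch of both spellings, assuming a HiveContext-backed table t(k, v) and its DataFrame df:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.rank

    // SQL: inside an OVER clause, Hive accepts DISTRIBUTE BY/SORT BY
    // as synonyms for PARTITION BY/ORDER BY.
    val bySql = sqlContext.sql(
      "SELECT k, v, rank() OVER (PARTITION BY k ORDER BY v DESC) AS r FROM t")

    // DataFrame API equivalent (Spark 1.4+):
    val w = Window.partitionBy("k").orderBy(df("v").desc)
    val byApi = df.withColumn("r", rank().over(w))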

Re: Building a REST Service with Spark back-end

2016-03-02 Thread Todd Nist
-apache-spark/ Not sure if that is of value to you or not. HTH. -Todd On Tue, Mar 1, 2016 at 7:30 PM, Don Drake wrote: > I'm interested in building a REST service that utilizes a Spark SQL > Context to return records from a DataFrame (or IndexedRDD?) and even > add/update records.

Re: Spark Streaming, very slow processing and increasing scheduling delay of kafka input stream

2016-03-10 Thread Todd Nist
(KafkaUtils.createDirectStream) or Receiver (KafkaUtils.createStream)? You may find this discussion of value on SO: http://stackoverflow.com/questions/28901123/org-apache-spark-shuffle-metadatafetchfailedexception-missing-an-output-locatio -Todd On Mon, Mar 7, 2016 at 5:52 PM, Vinti Maheshwari wrote

Re: "bootstrapping" DStream state

2016-03-10 Thread Todd Nist
n / interval)) val counts = eventsStream.map(event => { (event.timestamp - event.timestamp % interval, event) }).updateStateByKey[Long](PrintEventCountsByInterval.counter _, new HashPartitioner(3), initialRDD = initialRDD) counts.print() HTH. -Todd On Thu, Mar 10, 2016 at 1:35 AM, Zalzber
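Filled out as a self-contained sketch, assuming a case class Event(timestamp: Long), a DStream eventsStream of Events, and a bucket width interval in milliseconds:

    import org.apache.spark.HashPartitioner

    val interval = 60 * 1000L
    // Hypothetical historical counts per time bucket, computed before the stream starts.
    val initialRDD = sc.parallelize(Seq((1457568000000L, 10L)))

    def counter(events: Seq[Event], state: Option[Long]): Option[Long] =
      Some(state.getOrElse(0L) + events.size)

    // Requires ssc.checkpoint(...) to be set, as with any stateful DStream operation.
    val counts = eventsStream
      .map(event => (event.timestamp - event.timestamp % interval, event))
      .updateStateByKey[Long](counter _, new HashPartitioner(3), initialRDD = initialRDD)

    counts.print()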

Re: Correct way to use spark streaming with apache zeppelin

2016-03-12 Thread Todd Nist
/granturing/a09aed4a302a7367be92 HTH. -Todd On Sat, Mar 12, 2016 at 6:21 AM, Chris Miller wrote: > I'm pretty new to all of this stuff, so bear with me. > > Zeppelin isn't really intended for realtime dashboards as far as I know. > Its reporting features (tables, grap

Re: Apache Flink

2016-04-17 Thread Todd Nist
e as complex event processing > engine. https://stratio.atlassian.net/wiki/display/DECISION0x9/Home I have not used it, only read about it but it may be of some interest to you. -Todd On Sun, Apr 17, 2016 at 5:49 PM, Peyman Mohajerian wrote: > Microbatching is certainly not a waste of ti

Re: How to change akka.remote.startup-timeout in spark

2016-04-21 Thread Todd Nist
I believe you can adjust it by setting the following: spark.akka.timeout 100s Communication timeout between Spark nodes. HTH. -Todd On Thu, Apr 21, 2016 at 9:49 AM, yuemeng (A) wrote: > When I run a spark application,sometimes I get follow ERROR: > > 16/04/21 09:26:45 ERROR Spa

Re: Spark 1.6.1. How to prevent serialization of KafkaProducer

2016-04-21 Thread Todd Nist
Have you looked at these: http://allegro.tech/2015/08/spark-kafka-integration.html http://mkuthan.github.io/blog/2016/01/29/spark-kafka-integration2/ Full example here: https://github.com/mkuthan/example-spark-kafka HTH. -Todd On Thu, Apr 21, 2016 at 2:08 PM, Alexander Gallego wrote

Re: Spark SQL Transaction

2016-04-23 Thread Todd Nist
issue the commit: if (supportsTransactions) { conn.commit() } HTH -Todd On Sat, Apr 23, 2016 at 8:57 AM, Andrés Ivaldi wrote: > Hello, so I executed Profiler and found that implicit isolation was turn > on by JDBC driver, this is the default behavior of MSSQL JDBC driver, but > it's p

Re: Unit testing framework for Spark Jobs?

2016-05-18 Thread Todd Nist
Perhaps these may be of some use: https://github.com/mkuthan/example-spark http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/ https://github.com/holdenk/spark-testing-base On Wed, May 18, 2016 at 2:14 PM, swetha kasireddy wrote: > Hi Lars, > > Do you have any examples for the methods

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Todd Nist
What version of Spark are you using? I do not believe that 1.6.x is compatible with 0.9.0.1 due to changes in the kafka clients between 0.8.2.2 and 0.9.0.x. See this for more information: https://issues.apache.org/jira/browse/SPARK-12177 -Todd On Tue, Jun 7, 2016 at 7:35 AM, Dominik Safaric

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Todd Nist
Streaming within its checkpoints by default. You can also manage them yourself if desired. How are you dealing with offsets ? Can you verify the offsets on the broker: kafka-run-class.sh kafka.tools.GetOffsetShell --topic --broker-list --time -1 -Todd On Tue, Jun 7, 2016 at 8:17 AM, Dominik

Re: Load selected rows with sqlContext in the dataframe

2016-07-21 Thread Todd Nist
You can set the dbtable to this: .option("dbtable", "(select * from master_schema where 'TID' = '100_0')") HTH, Todd On Thu, Jul 21, 2016 at 10:59 AM, sujeet jog wrote: > I have a table of size 5GB, and want to load selective rows into datafra
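In context, a sketch of the full read; the URL and credentials are placeholders, and most databases also require an alias on the derived table:

    val df = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")   // placeholder URL
      .option("dbtable", "(select * from master_schema where TID = '100_0') AS t")
      .option("user", "user")                            // placeholder credentials
      .option("password", "pass")
      .load()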

Re: HiveThriftServer2.startWithContext no more showing tables in 1.6.2

2016-07-21 Thread Todd Nist
-cant-find-my-tables-in-spark-sql-using-beeline.html HTH. -Todd On Thu, Jul 21, 2016 at 10:30 AM, Marco Colombo wrote: > Thanks. > > That is just a typo. I'm using on 'spark://10.0.2.15:7077' (standalone). > Same url used in --master in spark-submit > > &g

Spark Job Doesn't End on Mesos

2016-08-09 Thread Todd Leo
-5050-4372-0034' However, the process doesn’t quit after all. This is critical, because I’d like to use SparkLauncher to submit such jobs. If my job doesn’t end, jobs will pile up and fill up the memory. Pls help. :-| — BR, Todd Leo

Re: Questions on Kerberos usage with YARN and JDBC

2015-12-11 Thread Todd Simmer
Windows DNS and what it's pointing at. Can you do a kinit *username* on that host? It should tell you if it can find the KDC. Let me know if that's helpful at all. Todd On Fri, Dec 11, 2015 at 1:50 PM, Mike Wright wrote: > As part of our implementation, we are utilizing a ful

Re: Securing objects on the thrift server

2015-12-15 Thread Todd Nist
see https://issues.apache.org/jira/browse/SPARK-11043, it is resolved in 1.6. On Tue, Dec 15, 2015 at 2:28 PM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > The one coming with spark 1.5.2. > > > > y > > > > *From:* Ted Yu [mailto:yuzhih...@gmail.com] > *Sent:* December-15-15 1:59 PM

Re: looking for a easier way to count the number of items in a JavaDStream

2015-12-16 Thread Todd Nist
collects(), just to obtain the count of records on the DStream. HTH. -Todd On Wed, Dec 16, 2015 at 3:34 PM, Bryan Cutler wrote: > To follow up with your other issue, if you are just trying to count > elements in a DStream, you can do that without an Accumulator. foreachRDD > is meant to
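A sketch of the two accumulator-free ways to get a per-batch count, assuming a DStream named stream:

    // DStream.count() yields a stream of single-element RDDs holding each batch's count.
    stream.count().foreachRDD { rdd =>
      rdd.collect().foreach(c => println(s"records in batch: $c"))
    }

    // Or count each batch RDD directly; rdd.count() returns the value to the driver.
    stream.foreachRDD { rdd =>
      println(s"records in batch: ${rdd.count()}")
    }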

Re: problem building spark on centos

2016-01-06 Thread Todd Nist
Tests HTH. -Todd On Wed, Jan 6, 2016 at 2:20 PM, Jade Liu wrote: > I’ve changed the scala version to 2.10. > > With this command: > build/mvn -X -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean > package > Build was successful. > > But make a runnable vers

Re: problem building spark on centos

2016-01-06 Thread Todd Nist
That should read "I think you're missing the --name option". Sorry about that. On Wed, Jan 6, 2016 at 3:03 PM, Todd Nist wrote: > Hi Jade, > > I think you "--name" option. The makedistribution should look like this: > > ./make-distribution.sh --name h
