Re: Lost task - connection closed

2015-02-11 Thread Arush Kharbanda
Hi, Can you share the code you are trying to run? Thanks, Arush On Wed, Feb 11, 2015 at 9:12 AM, Tianshuo Deng wrote: > I have seen the same problem. It causes some tasks to fail, but not the > whole job to fail. > Hope someone could shed some light on what could be the cause of this. > > On Mon

Re: Writing to HDFS from spark Streaming

2015-02-11 Thread Sean Owen
That kinda dodges the problem by ignoring generic types. But it may be simpler than the 'real' solution, which is a bit ugly. (But first, to double-check: are you importing the correct TextOutputFormat? There are two versions. You use .mapred. with the old API and .mapreduce. with the new API.) H
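
For reference, a minimal sketch of the two save paths being distinguished -- saveAsHadoopFile takes the old org.apache.hadoop.mapred format and saveAsNewAPIHadoopFile the new org.apache.hadoop.mapreduce one (data and output paths are invented):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.{TextOutputFormat => OldTextOutputFormat}
    import org.apache.hadoop.mapreduce.lib.output.{TextOutputFormat => NewTextOutputFormat}

    val sc = new SparkContext(new SparkConf().setAppName("text-output-sketch"))
    val pairs = sc.parallelize(Seq("a", "b")).map(s => (new Text(s), new Text(s)))

    // old API: saveAsHadoopFile expects an org.apache.hadoop.mapred.OutputFormat
    pairs.saveAsHadoopFile("hdfs:///tmp/out-old", classOf[Text], classOf[Text],
      classOf[OldTextOutputFormat[Text, Text]])

    // new API: saveAsNewAPIHadoopFile expects an org.apache.hadoop.mapreduce.OutputFormat
    pairs.saveAsNewAPIHadoopFile("hdfs:///tmp/out-new", classOf[Text], classOf[Text],
      classOf[NewTextOutputFormat[Text, Text]])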

high GC in the Kmeans algorithm

2015-02-11 Thread lihu
Hi, I run the KMeans (MLlib) on a cluster with 12 workers. Every worker has 128G RAM and 24 cores. I run 48 tasks on one machine; the total data is just 40GB. When the dimension of the data set is about 10^7, the duration of every task is about 30s, but the cost of GC is about 20s. When I

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Franc Carter
One additional comment I would make is that you should be careful with Updates in Cassandra. It does support them, but large amounts of Updates (i.e. changing existing keys) tend to cause fragmentation. If you are (mostly) adding new keys (e.g. new records in the time series) then Cassandra can b

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Christian Betz
Hi Regarding the Cassandra Data model, there's an excellent post on the ebay tech blog: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/. There's also a slideshare for this somewhere. Happy hacking Chris Von: Franc Carter mailto:franc.car...@rozettatech.

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Franc Carter
I forgot to mention that if you do decide to use Cassandra I'd highly recommend jumping on the Cassandra mailing list; if we had taken some of the advice on that list, things would have been considerably smoother. Cheers On Wed, Feb 11, 2015 at 8:12 PM, Christian Betz < christian.b...@performanc

Re: high GC in the Kmeans algorithm

2015-02-11 Thread Sean Owen
Good, worth double-checking that's what you got. That's barely 1GB per task though. Why run 48 if you have 24 cores? On Wed, Feb 11, 2015 at 9:03 AM, lihu wrote: > I give 50GB to the executor, so it seem that there is no reason the memory > is not enough. > > On Wed, Feb 11, 2015 at 4:50 PM, Se

Re: high GC in the Kmeans algorithm

2015-02-11 Thread lihu
I just want to make the best use of the CPU, and test the performance of Spark when there are a lot of tasks on a single node. On Wed, Feb 11, 2015 at 5:29 PM, Sean Owen wrote: > Good, worth double-checking that's what you got. That's barely 1GB per > task though. Why run 48 if you have 24 cores? > > O

Re: Bug in ElasticSearch and Spark SQL: Using SQL to query out data, from JSON documents is totally wrong!

2015-02-11 Thread Costin Leau
Aris, if you encountered a bug, it's best to raise an issue with the es-hadoop/spark project, namely here [1]. When using SparkSQL the underlying data needs to be present - this is mentioned in the docs as well [2]. As for the order, that does look like a bug and shouldn't occur. Note the reas

Re: using spark in web services

2015-02-11 Thread Arush Kharbanda
Hi, Are you able to run the code after eliminating all the Spark code, to find out if the issue is with Jetty or with Spark itself? It could also be due to conflicting Jetty versions in Spark and the one you are trying to use. You can check the dependency graph in Maven and check if there are any

apply function to all elements of a row matrix

2015-02-11 Thread Donbeo
Hi, I have a row matrix x: scala> x res3: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@63949747 and I would like to apply a function to each element of this matrix. I was looking for something like: x map (e => exp(-e*e)) How can I d
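
One way to get that effect is to transform the matrix's underlying RDD[Vector] and wrap the result in a new RowMatrix -- a minimal sketch, assuming x exists as above:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // apply f(e) = exp(-e * e) to every entry of x
    val transformed = x.rows.map { v =>
      Vectors.dense(v.toArray.map(e => math.exp(-e * e)))
    }
    val y = new RowMatrix(transformed)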

Re: SparkSQL + Tableau Connector

2015-02-11 Thread Todd Nist
Hi Arush, So yes, I want to create the tables through Spark SQL. I have placed the hive-site.xml file inside the $SPARK_HOME/conf directory; I thought that was all I should need to do to have the thriftserver use it. Perhaps my hive-site.xml is wrong; it currently looks like this: hive.met

Re: How to efficiently utilize all cores?

2015-02-11 Thread Harika
Hi Aplysia, Thanks for the reply. Could you be more specific in terms of what part of the document to look at, as I have already seen it and tried a few of the relevant settings to no avail. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-efficie

What do you think about the level of resource manager and file system?

2015-02-11 Thread Fangqi (Roy)
Hi guys~ Comparing these two architectures (two architecture diagrams were attached to the original mail), why does BDAS put YARN and Mesos under HDFS? Do you have any special consideration, or is it just an easy way to express the AMPLab stack? Best regards!

Re: SparkSQL + Tableau Connector

2015-02-11 Thread Arush Kharbanda
Hi, I used this, though it's using an embedded driver and is not a good approach. It works. You can configure some other metastore type also. I have not tried the metastore URIs. javax.jdo.option.ConnectionURL jdbc:derby:;databaseName=/opt/bigdata/spark-1.2.0/metastore_db;create=true

Question related to Spark SQL

2015-02-11 Thread Ashish Mukherjee
Hi, I am planning to use Spark for a Web-based adhoc reporting tool on massive data-sets on S3. Real-time queries with filters, aggregations and joins could be constructed from UI selections. Online documentation seems to suggest that SharkQL is deprecated and users should move away from it. I u

Re: Question related to Spark SQL

2015-02-11 Thread Arush Kharbanda
I am implementing this approach currently: 1. Create data tables in spark-sql and cache them. 2. Configure the Hive metastore to read the cached tables and share the same metastore as spark-sql (you get the Spark caching advantage). 3. Run Spark code to fetch from the cached tables. In the Spark co
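
A minimal sketch of step 1 above -- creating and caching a table in a HiveContext so later queries against the same context reuse the in-memory copy (the table name and schema are invented):

    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)
    sqlContext.sql("CREATE TABLE IF NOT EXISTS events (id INT, name STRING)")
    // subsequent queries through this context (or a thrift server sharing it) hit the cached copy
    sqlContext.sql("CACHE TABLE events")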

Re: Question related to Spark SQL

2015-02-11 Thread VISHNU SUBRAMANIAN
Hi Ashish, In order to answer your question, I assume that you are planning to process data and cache it in memory. If you are using the thrift server that comes with Spark then you can query on top of it. And multiple applications can use the cached data as internally all the requests go to t

Spark Streaming: Flume receiver with Kryo serialization

2015-02-11 Thread Antonio Jesus Navarro
Hi, I want to include Kryo serialization in a project if possible, and first I'm trying to run FlumeEventCount with Kryo. If I comment out the setAll method it runs correctly, but if I use the Kryo params it returns several errors. 15/02/11 11:42:16 ERROR SparkDeploySchedulerBackend: Asked to remove non-existe
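
For reference, a typical way to switch on Kryo (not the poster's exact setAll parameters; the registrator class is hypothetical):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("FlumeEventCount")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.MyKryoRegistrator") // hypothetical registrator class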

OutOfMemoryError with random forest and small training dataset

2015-02-11 Thread poiuytrez
Hello guys, I am trying to run a Random Forest on 30MB of data. I have a cluster of 4 machines. Each machine has 106 GB of RAM and 16 cores. I am getting: 15/02/11 11:01:23 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-3] shutting down Actor

what is behind matrix multiplications?

2015-02-11 Thread Donbeo
In Spark it is possible to multiply a distributed matrix x and a local matrix w: val x = new RowMatrix(distribuited_data) val w: Matrix = Matrices.dense(local_data) val result = x.multiply(w). What is the process behind this command? Is the matrix w replicated on each worker? Is there a refe

How to log using log4j to local file system inside a Spark application that runs on YARN?

2015-02-11 Thread Emre Sevinc
Hello, I'm building an Apache Spark Streaming application and cannot make it log to a file on the local filesystem *when running it on YARN*. How can I achieve this? I've set up the log4j.properties file so that it can successfully write to a log file in the /tmp directory on the local file system (shown below

Re: Can't access remote Hive table from spark

2015-02-11 Thread guxiaobo1982
Hi Zhan, My Single Node Cluster of Hadoop is installed by Ambari 1.7.0. I tried to create the /user/xiaobogu directory in HDFS, but it failed with both user xiaobogu and root: [xiaobogu@lix1 current]$ hadoop dfs -mkdir /user/xiaobogu DEPRECATED: Use of this script to execute hdfs command is depr

Hive/Hbase for low latency

2015-02-11 Thread Siddharth Ubale
Hi, I am new to Spark. We have recently moved from Apache Storm to Apache Spark to build our OLAP tool. Now, earlier we were using HBase & Phoenix. We need to re-think what to use in the case of Spark. Should we go ahead with HBase or Hive or Cassandra for query processing with Spark SQL? Please

A signature in Logging.class refers to type Logger in package org.slf4j which is not available.

2015-02-11 Thread Todd
After compiling the Spark 1.2.0 codebase in IntelliJ IDEA and running the LocalPi example, I got the following slf4j-related issue. Does anyone know how to fix this? Thanks Error:scalac: bad symbolic reference. A signature in Logging.class refers to type Logger in package org.slf4j which is not av

Re: Hive/Hbase for low latency

2015-02-11 Thread VISHNU SUBRAMANIAN
Hi Siddharth, It depends on what you are trying to solve. But the connectivity between Cassandra and Spark is good. The answer depends upon what exactly you are trying to solve. Thanks, Vishnu On Wed, Feb 11, 2015 at 7:47 PM, Siddharth Ubale < siddharth.ub...@syncoms.com> wrote: > Hi , > > > > I

Re: Hive/Hbase for low latency

2015-02-11 Thread Ted Yu
Connectivity to HBase is also available. You can take a look at: examples//src/main/python/hbase_inputformat.py examples//src/main/python/hbase_outputformat.py examples//src/main/scala/org/apache/spark/examples/HBaseTest.scala examples//src/main/scala/org/apache/spark/examples/pythonconverters/HBa

Spark ML pipeline

2015-02-11 Thread Jianguo Li
Hi, I really like the pipeline in the spark.ml in Spark1.2 release. Will there be more machine learning algorithms implemented for the pipeline framework in the next major release? Any idea when the next major release comes out? Thanks, Jianguo

Re: A signature in Logging.class refers to type Logger in package org.slf4j which is not available.

2015-02-11 Thread Ted Yu
Spark depends on slf4j 1.7.5 Please check your classpath and make sure slf4j is included. Cheers On Wed, Feb 11, 2015 at 6:20 AM, Todd wrote: > After compiling the Spark 1.2.0 codebase in Intellj Idea, and run the > LocalPi example,I got the following slf4j related issue. Does anyone know > h
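
If the project builds with sbt, a minimal sketch of pinning slf4j to that version (the artifact choice is an assumption about the build):

    // build.sbt -- pin slf4j to the 1.7.5 that Spark 1.2 depends on
    libraryDependencies ++= Seq(
      "org.slf4j" % "slf4j-api"     % "1.7.5",
      "org.slf4j" % "slf4j-log4j12" % "1.7.5"
    )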

Re: Re: A signature in Logging.class refers to type Logger in package org.slf4j which is not available.

2015-02-11 Thread Todd
Thanks for the reply. I have the following Maven dependencies which looks correct to me? Maven: org.slf4j:slf4j-log4j12:1.7.5 Maven: org.slf4j:jcl-over-slf4j:1.7.5 Maven: org.slf4j:jul-to-slf4j:1.7.5 Maven: org.slf4j:slf4j-api:1.7.5 Maven: log4j:log4j:1.2.17 At 2015-02-11 23:27:54, "Ted Yu"

Re: Why does spark write huge file into temporary local disk even without on-disk persist or checkpoint?

2015-02-11 Thread Peng Cheng
You are right. I've checked the overall stage metrics and looks like the largest shuffling write is over 9G. The partition completed successfully but its spilled file can't be removed until all others are finished. It's very likely caused by a stupid mistake in my design. A lookup table grows const

Re: spark sql registerFunction with 1.2.1

2015-02-11 Thread Yin Huai
Regarding backticks: Right. You need backticks to quote the column name timestamp because timestamp is a reserved keyword in our parser. On Tue, Feb 10, 2015 at 3:02 PM, Mohnish Kodnani wrote: > actually i tried in spark shell , got same error and then for some reason > i tried to back tick the
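
For example, with an invented table and column:

    // `timestamp` must be backtick-quoted because it is a reserved word in the parser
    sqlContext.sql("SELECT `timestamp`, value FROM events WHERE `timestamp` > 100")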

Re: Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-11 Thread Andrew Or
Hi Jianshi, For YARN, there may be an issue with how a recent patch changes the accessibility of the shuffle files by the external shuffle service: https://issues.apache.org/jira/browse/SPARK-5655. It is likely that you will hit this with 1.2.1, actually. For this reason I would have to recommen

Re: How can I read this avro file using spark & scala?

2015-02-11 Thread captainfranz
I am confused as to whether avro support was merged into Spark 1.2 or it is still an independent library. I see some people writing sqlContext.avroFile similarly to jsonFile but this does not work for me, nor do I see this in the Scala docs. -- View this message in context: http://apache-spark-

Re: Re: How can I read this avro file using spark & scala?

2015-02-11 Thread Todd
Databricks provides sample code on its website... but I can't find it right now. At 2015-02-12 00:43:07, "captainfranz" wrote: >I am confused as to whether avro support was merged into Spark 1.2 or it is >still an independent library. >I see some people writing sqlContext.avroFile similarly

Re: Re: How can I read this avro file using spark & scala?

2015-02-11 Thread VISHNU SUBRAMANIAN
Check this link. https://github.com/databricks/spark-avro Home page for Spark-avro project. Thanks, Vishnu On Wed, Feb 11, 2015 at 10:19 PM, Todd wrote: > Databricks provides a sample code on its website...but i can't find it for > now. > > > > > > > At 2015-02-12 00:43:07, "captainfranz" wro
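
A minimal sketch of the spark-avro usage that project describes, assuming the package is on the classpath (the path is invented):

    import com.databricks.spark.avro._

    // avroFile comes from the spark-avro package, not Spark itself
    val episodes = sqlContext.avroFile("hdfs:///data/episodes.avro")
    episodes.registerTempTable("episodes")
    sqlContext.sql("SELECT * FROM episodes LIMIT 10").collect()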

SPARK_LOCAL_DIRS Issue

2015-02-11 Thread TJ Klein
Hi, Using Spark 1.2 I ran into issues setting SPARK_LOCAL_DIRS to a different path than the local directory. On our cluster we have a folder for temporary files (in a central file system), which is called /scratch. When setting SPARK_LOCAL_DIRS=/scratch/ I get: An error occurred while calling z:o

Re: Question related to Spark SQL

2015-02-11 Thread VISHNU SUBRAMANIAN
I didn't mean that. When you try the above approach, only one client will have access to the cached data. But when you expose your data through a thrift server the case is quite different. In the case of the thrift server all the requests go to the thrift server and spark will be able to take the advan

Re: SPARK_LOCAL_DIRS Issue

2015-02-11 Thread Charles Feduke
A central location, such as NFS? If they are temporary for the purpose of further job processing you'll want to keep them local to the node in the cluster, i.e., in /tmp. If they are centralized you won't be able to take advantage of data locality and the central file store will become a bottlenec

Re: Hive/Hbase for low latency

2015-02-11 Thread Ravi Kiran
Hi Siddharth, With v 4.3 of Phoenix, you can use the PhoenixInputFormat and OutputFormat classes to pull/push to Phoenix from Spark. HTH Thanks Ravi On Wed, Feb 11, 2015 at 6:59 AM, Ted Yu wrote: > Connectivity to hbase is also avaliable. You can take a look at: > > examples//src/main/p

getting the cluster elements from kmeans run

2015-02-11 Thread Harini Srinivasan
Hi, Is there a way to get the elements of each cluster after running kmeans clustering? I am using the Java version. thanks

Re: getting the cluster elements from kmeans run

2015-02-11 Thread VISHNU SUBRAMANIAN
You can use model.predict(point) that will help you identify the cluster center and map it to the point. rdd.map(x => (x,model.predict(x))) Thanks, Vishnu On Wed, Feb 11, 2015 at 11:06 PM, Harini Srinivasan wrote: > Hi, > > Is there a way to get the elements of each cluster after running kmean

Re: getting the cluster elements from kmeans run

2015-02-11 Thread Suneel Marthi
KMeansModel only returns the "cluster centroids". To get the # of elements in each cluster, try calling kmeans.predict() on each of the points in the data used to build the model. See https://github.com/OryxProject/oryx/blob/master/oryx-app-mllib/src/main/java/com/cloudera/oryx/app/mllib/kmeans/K
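
Putting the two replies together, a minimal Scala sketch (the input path, k, and iteration count are invented):

    import org.apache.spark.SparkContext._
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // hypothetical input: one space-separated point per line
    val points = sc.textFile("hdfs:///data/points.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
    val model = KMeans.train(points, 3, 20)   // k = 3, 20 iterations -- example values

    // cluster id -> all points assigned to that cluster
    val membersByCluster = points.map(p => (model.predict(p), p)).groupByKey()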

Re: OutOfMemoryError with random forest and small training dataset

2015-02-11 Thread poiuytrez
cat ../hadoop/spark-install/conf/spark-env.sh export SCALA_HOME=/home/hadoop/scala-install export SPARK_WORKER_MEMORY=83971m export SPARK_MASTER_IP=spark-m export SPARK_DAEMON_MEMORY=15744m export SPARK_WORKER_DIR=/hadoop/spark/work export SPARK_LOCAL_DIRS=/hadoop/spark/tmp export SPARK_LOG_

Re: Can't access remote Hive table from spark

2015-02-11 Thread Zhan Zhang
You need to have the right HDFS account, e.g., hdfs, to create the directory and assign permission. Thanks. Zhan Zhang On Feb 11, 2015, at 4:34 AM, guxiaobo1982 mailto:guxiaobo1...@qq.com>> wrote: Hi Zhan, My Single Node Cluster of Hadoop is installed by Ambari 1.7.0, I tried to create the /user/xia

Re: Can spark job server be used to visualize streaming data?

2015-02-11 Thread Su She
Thank you Felix and Kelvin. I think I'll definitely be using the k-means tools in MLlib. It seems the best way to stream data is by storing it in HBase and then using an API in my viz to extract data? Does anyone have any thoughts on this? Thanks! On Tue, Feb 10, 2015 at 11:45 PM, Felix C wrote: > Chec

iteratively modifying an RDD

2015-02-11 Thread rok
I was having trouble with memory exceptions when broadcasting a large lookup table, so I've resorted to processing it iteratively -- but how can I modify an RDD iteratively? I'm trying something like : rdd = sc.parallelize(...) lookup_tables = {...} for lookup_table in lookup_tables : rdd

Re: Need a partner

2015-02-11 Thread Nagesh sarvepalli
Hello, Hope the link below helps you kick-start. It has videos and hand-outs for practice. http://spark-summit.org/2014 Regards Nagesh On Wed, Feb 11, 2015 at 5:56 AM, prabeesh k wrote: > Also you can refer this course in edx: Introduction to Big Data with > Apache Spark >

Re: How to log using log4j to local file system inside a Spark application that runs on YARN?

2015-02-11 Thread Marcelo Vanzin
For Yarn, you need to upload your log4j.properties separately from your app's jar, because of some internal issues that are too boring to explain here. :-) Basically: spark-submit --master yarn --files log4j.properties blah blah blah Having to keep it outside your app jar is sub-optimal, and I

Re: iteratively modifying an RDD

2015-02-11 Thread Charles Feduke
If you use mapPartitions to iterate the lookup_tables does that improve the performance? This link is to Spark docs 1.1 because both latest and 1.2 for Python give me a 404: http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#mapPartitions On Wed Feb 11 2015 at 1:48:42 PM rok

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-11 Thread Davies Liu
Could you share a short script to reproduce this problem? On Tue, Feb 10, 2015 at 8:55 PM, Rok Roskar wrote: > I didn't notice other errors -- I also thought such a large broadcast is a > bad idea but I tried something similar with a much smaller dictionary and > encountered the same problem. I'm

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
Yes I actually do use mapPartitions already On Feb 11, 2015 7:55 PM, "Charles Feduke" wrote: > If you use mapPartitions to iterate the lookup_tables does that improve > the performance? > > This link is to Spark docs 1.1 because both latest and 1.2 for Python give > me a 404: > http://spark.apach

Re: iteratively modifying an RDD

2015-02-11 Thread Davies Liu
We have moved to use Sphinx to generate the Python API docs, so the link is different than 1.0/1 http://spark.apache.org/docs/1.2.0/api/python/pyspark.html#pyspark.RDD.mapPartitions On Wed, Feb 11, 2015 at 10:55 AM, Charles Feduke wrote: > If you use mapPartitions to iterate the lookup_tables do

Re: Open file limit settings for Spark on Yarn job

2015-02-11 Thread Arun Luthra
I'm using Spark 1.1.0 with sort-based shuffle. I found that I can work around the issue by applying repartition(N) with a small enough N after creating the RDD, though I'm losing some speed/parallelism by doing this. For my algorithm I need to stay with groupByKey. On Tue, Feb 10, 2015 at 11:41 P

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-11 Thread Rok Roskar
I think the problem was related to the broadcasts being too large -- I've now split it up into many smaller operations but it's still not quite there -- see http://apache-spark-user-list.1001560.n3.nabble.com/iteratively-modifying-an-RDD-td21606.html Thanks, Rok On Wed, Feb 11, 2015, 19:59 Davie

Re: SPARK_LOCAL_DIRS Issue

2015-02-11 Thread Tassilo Klein
Thanks for the info. The file system in use is a Lustre file system. Best, Tassilo On Wed, Feb 11, 2015 at 12:15 PM, Charles Feduke wrote: > A central location, such as NFS? > > If they are temporary for the purpose of further job processing you'll > want to keep them local to the node in the

Re: spark sql registerFunction with 1.2.1

2015-02-11 Thread Mohnish Kodnani
that explains a lot... Is there a list of reserved keywords ? On Wed, Feb 11, 2015 at 7:56 AM, Yin Huai wrote: > Regarding backticks: Right. You need backticks to quote the column name > timestamp because timestamp is a reserved keyword in our parser. > > On Tue, Feb 10, 2015 at 3:02 PM, Mohnis

Re: iteratively modifying an RDD

2015-02-11 Thread Davies Liu
On Wed, Feb 11, 2015 at 10:47 AM, rok wrote: > I was having trouble with memory exceptions when broadcasting a large lookup > table, so I've resorted to processing it iteratively -- but how can I modify > an RDD iteratively? > > I'm trying something like : > > rdd = sc.parallelize(...) > lookup_ta

exception with json4s render

2015-02-11 Thread Jonathan Haddad
I'm trying to use the json4s library in a spark job to push data back into kafka. Everything was working fine when I was hard coding a string, but now that I'm trying to render a string from a simple map it's failing. The code works in sbt console. working console code: https://gist.github.com/r

Re: what is behind matrix multiplications?

2015-02-11 Thread Reza Zadeh
Yes, the local matrix is broadcast to each worker. Here is the code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L407 In 1.3 we will have Block matrix multiplication too, which will allow distributed matrix multiplicati

Re: exception with json4s render

2015-02-11 Thread Mohnish Kodnani
I was getting a similar error after I upgraded to Spark 1.2.1 from 1.1.1. Are you by any chance using json4s 3.2.11? I downgraded to 3.2.10 and that seemed to work. But I didn't spend much more time debugging the issue than that. On Wed, Feb 11, 2015 at 11:13 AM, Jonathan Haddad wrote: >

Re: SPARK_LOCAL_DIRS Issue

2015-02-11 Thread Charles Feduke
Take a look at this: http://wiki.lustre.org/index.php/Running_Hadoop_with_Lustre Particularly: http://wiki.lustre.org/images/1/1b/Hadoop_wp_v0.4.2.pdf (linked from that article) to get a better idea of what your options are. If it's possible to avoid writing to [any] disk I'd recommend that rout

Re: SPARK_LOCAL_DIRS Issue

2015-02-11 Thread Tassilo Klein
Thanks a lot. I will have a look at it. On Wed, Feb 11, 2015 at 2:20 PM, Charles Feduke wrote: > Take a look at this: > > http://wiki.lustre.org/index.php/Running_Hadoop_with_Lustre > > Particularly: http://wiki.lustre.org/images/1/1b/Hadoop_wp_v0.4.2.pdf > (linked from that article) > > to get

Re: SPARK_LOCAL_DIRS Issue

2015-02-11 Thread Charles Feduke
And just glancing at the Spark source code around where the stack trace originates: val lockFile = new File(localDir, lockFileName) val raf = new RandomAccessFile(lockFile, "rw") // Only one executor entry. // The FileLock is only used to control synchronization for executors dow

Re: exception with json4s render

2015-02-11 Thread Jonathan Haddad
Actually, yes, I was using 3.2.11. I thought I would need the UUID encoder that seems to have been added in that version, but I'm not using it. I've downgraded to 3.2.10 and it seems to work. I searched through the spark repo and it looks like it's got 3.2.10 in a pom. I don't know the first th

Re: SPARK_LOCAL_DIRS Issue

2015-02-11 Thread Tassilo Klein
Thanks. Yes, I think it might not always make sense to lock files, particularly if every executor is getting its own path. On Wed, Feb 11, 2015 at 2:31 PM, Charles Feduke wrote: > And just glancing at the Spark source code around where the stack trace > originates: > > val lockFile = new File(lo

Re: exception with json4s render

2015-02-11 Thread Mohnish Kodnani
Same here... I am a newbie to all this as well. But this is just what I found, and I lack the expertise to figure out why things don't work with json4s 3.2.11. Maybe someone in the group with more expertise can take a crack at it. But this is what unblocked me and let me move forward. On Wed, Feb 11, 20

Re: exception with json4s render

2015-02-11 Thread Charles Feduke
I was having a similar problem to this trying to use the Scala Jackson module yesterday. I tried setting `spark.files.userClassPathFirst` to true but I was still having problems due to the older version of Jackson that Spark has a dependency on. (I think its an old org.codehaus version.) I ended u

RE: SparkSQL + Tableau Connector

2015-02-11 Thread Andrew Lee
I'm using mysql as the metastore DB with Spark 1.2. I simply copied the hive-site.xml to /etc/spark/ and added the mysql JDBC JAR to spark-env.sh in /etc/spark/, and everything works fine now. My setup looks like this: Tableau => Spark ThriftServer2 => HiveServer2. It's talking to Tableau Desktop 8.3. In

Re: Mesos coarse mode not working (fine grained does)

2015-02-11 Thread Hans van den Bogert
Bumping 1on1 conversation to mailinglist: On 10 Feb 2015, at 13:24, Hans van den Bogert wrote: > > It’s self built, I can’t otherwise as I can’t install packages on the cluster > here. > > The problem seems with libtool. When compiling Mesos on a host with apr-devel > and apr-util-devel the

RE: Is the Thrift server right for me?

2015-02-11 Thread Andrew Lee
I have ThriftServer2 up and running, however, I notice that it relays the query to HiveServer2 when I pass the hive-site.xml to it. I'm not sure if this is the expected behavior, but based on what I have up and running, the ThriftServer2 invokes HiveServer2 that results in MapReduce or Tez query

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
Aha great! Thanks for the clarification! On Feb 11, 2015 8:11 PM, "Davies Liu" wrote: > On Wed, Feb 11, 2015 at 10:47 AM, rok wrote: > > I was having trouble with memory exceptions when broadcasting a large > lookup > > table, so I've resorted to processing it iteratively -- but how can I > modi

RE: Is the Thrift server right for me?

2015-02-11 Thread Judy Nash
It should relay the queries to spark (i.e. you shouldn't see any MR job on Hadoop & you should see activities on the spark app on headnode UI). Check your hive-site.xml. Are you directing to the hive server 2 port instead of spark thrift port? Their default ports are both 1. From: Andrew Le

RE: Is the Thrift server right for me?

2015-02-11 Thread Andrew Lee
Thanks Judy. You are right. The query is going to Spark ThriftServer2. I have it setup on a different port number. I got the wrong perception b/c there were other jobs running at the same time. It should be Spark jobs instead of Hive jobs. From: judyn...@exchange.microsoft.com To: alee...@hotmail

RE: SparkSQL + Tableau Connector

2015-02-11 Thread Andrew Lee
Sorry folks, it is executing Spark jobs instead of Hive jobs. I mis-read the logs since there were other activities going on on the cluster. From: alee...@hotmail.com To: ar...@sigmoidanalytics.com; tsind...@gmail.com CC: user@spark.apache.org Subject: RE: SparkSQL + Tableau Connector Date: Wed,

Re: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down

2015-02-11 Thread Lan
Hi Alexey and Daniel, I'm using Spark 1.2.0 and still having the same error, as described below. Do you have any news on this? Really appreciate your responses!!! "a Spark cluster of 1 master VM SparkV1 and 1 worker VM SparkV4 (the error is the same if I have 2 workers). They are connected witho

No executors allocated on yarn with latest master branch

2015-02-11 Thread Anders Arpteg
Hi, Compiled the latest master of Spark yesterday (2015-02-10) for Hadoop 2.2 and failed executing jobs in yarn-cluster mode for that build. Works successfully with spark 1.2 (and also master from 2015-01-16), so something has changed since then that prevents the job from receiving any executors o

Re: Bug in ElasticSearch and Spark SQL: Using SQL to query out data from JSON documents is totally wrong!

2015-02-11 Thread Aris
Thank you Costin. I wrote out to the user list, I got no replies there. I will take this exact message and put it on the Github bug tracking system. One quick clarification: I read the elasticsearch documentation thoroughly, and I saw the warning about structured data vs. unstructured data, but i

A spark join and groupbykey that is making my containers on EC2 go over their memory limits

2015-02-11 Thread Sina Samangooei
Hello, I have many questions about joins, but arguably just one, specifically about memory and containers that are overstepping their limits, as per errors dotted around all over the place, but something like: http://mail-archives.apache.org/mod_mbox/spark-issues/201405.mbox/%3CJIRA.12716648.14

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
the runtime for each consecutive iteration is still roughly twice as long as for the previous one -- is there a way to reduce whatever overhead is accumulating? On Feb 11, 2015, at 8:11 PM, Davies Liu wrote: > On Wed, Feb 11, 2015 at 10:47 AM, rok wrote: >> I was having trouble with memory e

Re: Spark ML pipeline

2015-02-11 Thread Reynold Xin
Yes. Next release (Spark 1.3) is coming out end of Feb / early Mar. On Wed, Feb 11, 2015 at 7:22 AM, Jianguo Li wrote: > Hi, > > I really like the pipeline in the spark.ml in Spark1.2 release. Will > there be more machine learning algorithms implemented for the pipeline > framework in the next m

Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Corey Nolet
I think the word "partition" here is a tad different than the term "partition" that we use in Spark. Basically, I want something similar to Guava's Iterables.partition [1], that is, If I have an RDD[People] and I want to run an algorithm that can be optimized by working on 30 people at a time, I'd

Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Mark Hamstra
rdd.mapPartitions { iter => val grouped = iter.grouped(batchSize) for (group <- grouped) { ... } } On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet wrote: > I think the word "partition" here is a tad different than the term > "partition" that we use in Spark. Basically, I want something similar

Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Corey Nolet
Doesn't iter still need to fit entirely into memory? On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra wrote: > rdd.mapPartitions { iter => > val grouped = iter.grouped(batchSize) > for (group <- grouped) { ... } > } > > On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet wrote: > >> I think the word

Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Mark Hamstra
No, only each group should need to fit. On Wed, Feb 11, 2015 at 2:56 PM, Corey Nolet wrote: > Doesn't iter still need to fit entirely into memory? > > On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra > wrote: > >> rdd.mapPartitions { iter => >> val grouped = iter.grouped(batchSize) >> for (gro
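
A fuller sketch of the mapPartitions/grouped approach Mark suggests -- grouped is lazy, so only one batch needs to be materialized at a time (Person and processBatch are invented stand-ins for the poster's types):

    case class Person(name: String, age: Int)
    def processBatch(batch: Seq[Person]): Int = batch.size   // stand-in for the real per-batch work

    val people = sc.parallelize(Seq(Person("a", 30), Person("b", 40), Person("c", 50)))
    val batchSize = 30
    val results = people.mapPartitions { iter =>
      // only one group of at most `batchSize` elements is held in memory at a time
      iter.grouped(batchSize).map(processBatch)
    }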

Re: How to do broadcast join in SparkSQL

2015-02-11 Thread Dima Zhiyanov
Hello Has Spark implemented computing statistics for Parquet files? Or is there any other way I can enable broadcast joins between parquet file RDDs in Spark Sql? Thanks Dima -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-do-broadcast-join-in-Spar

Re: iteratively modifying an RDD

2015-02-11 Thread Davies Liu
On Wed, Feb 11, 2015 at 2:43 PM, Rok Roskar wrote: > the runtime for each consecutive iteration is still roughly twice as long as > for the previous one -- is there a way to reduce whatever overhead is > accumulating? Sorry, I didn't fully understand you question, which two are you comparing?

Containers on EC2 instances go over their memory limits

2015-02-11 Thread Sina Samangooei
Hello, I have many questions about joins, but arguably just one, specifically about memory and containers that are overstepping their limits, as per errors dotted around all over the place, but something like: http://mail-archives.apache.org/mod_mbox/spark-issues/201405.mbox/%3CJIRA.12716648.14

Re: How to do broadcast join in SparkSQL

2015-02-11 Thread Ted Yu
See earlier thread: http://search-hadoop.com/m/JW1q5BZhf92 On Wed, Feb 11, 2015 at 3:04 PM, Dima Zhiyanov wrote: > Hello > > Has Spark implemented computing statistics for Parquet files? Or is there > any other way I can enable broadcast joins between parquet file RDDs in > Spark Sql? > > Thanks
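
Not something stated in that link, but for context: Spark SQL only broadcasts a join side whose estimated size falls under spark.sql.autoBroadcastJoinThreshold, and without table statistics that estimate may never qualify. A hedged sketch of raising the threshold (the value is arbitrary):

    // bump the size (in bytes) under which Spark SQL will broadcast one side of a join
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)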

Re: Similar code in Java

2015-02-11 Thread Eduardo Costa Alfaia
Thanks Ted. > On Feb 10, 2015, at 20:06, Ted Yu wrote: > > Please take a look at: > examples/scala-2.10/src/main/java/org/apache/spark/examples/streaming/JavaDirectKafkaWordCount.java > which was checked in yesterday. > > On Sat, Feb 7, 2015 at 10:53 AM, Eduardo Costa Alfaia

SPARK_LOCAL_DIRS and SPARK_WORKER_DIR

2015-02-11 Thread gtinside
Hi , What is the difference between SPARK_LOCAL_DIRS and SPARK_WORKER_DIR ? Also does spark clean these up after the execution ? Regards, Gaurav -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-LOCAL-DIRS-and-SPARK-WORKER-DIR-tp21612.html Sent from t

Re: No executors allocated on yarn with latest master branch

2015-02-11 Thread Sandy Ryza
Hi Anders, I just tried this out and was able to successfully acquire executors. Any strange log messages or additional color you can provide on your setup? Does yarn-client mode work? -Sandy On Wed, Feb 11, 2015 at 1:28 PM, Anders Arpteg wrote: > Hi, > > Compiled the latest master of Spark y

Strongly Typed SQL in Spark

2015-02-11 Thread jay vyas
Hi Spark, is there anything in the works for a typesafe, HQL-like API for building Spark queries from case classes? I.e., where, given a domain object "Product" with a "cost" associated with it, we can do something like: query.select(Product).filter({ _.cost > 50.00f }).join(ProductMetaData).by

Re: SparkSQL + Tableau Connector

2015-02-11 Thread Todd Nist
First sorry for the long post. So back to tableau and Spark SQL, I'm still missing something. TL;DR To get the Spark SQL Temp table associated with the metastore are there additional steps required beyond doing the below? Initial SQL on connection: create temporary table test using org.apache.
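
For reference, the data-source DDL that the truncated "Initial SQL" above is using looks roughly like the sketch below; the Parquet source class and path are assumptions for illustration, not Todd's actual values:

    // data source class and path below are assumptions, not the original values
    sqlContext.sql("""
      CREATE TEMPORARY TABLE test
      USING org.apache.spark.sql.parquet
      OPTIONS (path 'hdfs:///data/test.parquet')
    """)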

Spark based ETL pipelines

2015-02-11 Thread Jagat Singh
Hi, I want to work on a use case something like the one below. Just want to know if something similar has already been done that can be reused. The idea is to use Spark for an ETL / Data Science / Streaming pipeline. So when data comes inside the cluster front door we will do the following steps: 1) Upload
