Re: Read a HDFS file from Spark source code

2014-11-11 Thread rapelly kartheek
Hi Sean, I was following this link: http://mund-consulting.com/Blog/Posts/file-operations-in-HDFS-using-java.aspx But I was facing a FileSystem ambiguity error. I really don't have any idea how to go about doing this. Can you please help me with how to start off? On Wed, Nov 12, 2014

Re: How did the RDD.union work

2014-11-11 Thread qiaou
Thanks for your reply and patience. Best regards -- qiaou Sent using Sparrow (http://www.sparrowmailapp.com/?sig) On Wednesday, November 12, 2014, at 3:45 PM, Shixiong Zhu wrote: > The `conf` object will be sent to other nodes via Broadcast. > > Here is the scaladoc of Broadcast: > http://spark.apache.org/docs/lates

Re: How did the RDD.union work

2014-11-11 Thread Shixiong Zhu
The `conf` object will be sent to other nodes via Broadcast. Here is the scaladoc of Broadcast: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.broadcast.Broadcast In addition, the object v should not be modified after it is broadcast in order to ensure that all nodes ge
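For reference, a minimal Scala sketch of the pattern being described (the lookup map is a placeholder); the broadcast value is read on the workers and should not be modified after it is created:

    // assumes an existing SparkContext `sc`
    val lookup = Map("a" -> 1, "b" -> 2)           // placeholder lookup table
    val bc = sc.broadcast(lookup)                  // shipped once per executor

    val resolved = sc.parallelize(Seq("a", "b", "c"))
      .map(k => bc.value.getOrElse(k, 0))          // read, never mutate, bc.value
      .collect()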

Task time measurement

2014-11-11 Thread Romi Kuntsman
Hello, Currently in the Spark standalone console, I can only see how long the entire job took. How can I know how long it was in WAITING and how long in RUNNING, and also, while running, how much each of the jobs inside took? Thanks, *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com

Re: Spark and Play

2014-11-11 Thread John Meehan
You can also build a Play 2.2.x + Spark 1.1.0 fat jar with sbt-assembly for, e.g., yarn-client support or for use with spark-shell when debugging: play.Project.playScalaSettings libraryDependencies ~= { _ map { case m if m.organization == "com.typesafe.play" => m.exclude("commons-logging", "com

About Join operator in PySpark

2014-11-11 Thread 夏俊鸾
Hi all, I have noticed that the "Join" operator has been implemented with union and groupByKey operators instead of the cogroup operator in PySpark; this change will probably generate more shuffle stages. For example: rdd1 = sc.makeRDD(...).partitionBy(2) rdd2 = sc.makeRDD(...).partitionBy(2) r

Re: ISpark class not found

2014-11-11 Thread MEETHU MATHEW
Hi, I was also trying ISpark, but I couldn't even start the notebook. I am getting the following error: ERROR:tornado.access:500 POST /api/sessions (127.0.0.1) 10.15ms referer=http://localhost:/notebooks/Scala/Untitled0.ipynb How did you start the notebook? Thanks & Regards, Meethu M O

Re: How did the RDD.union work

2014-11-11 Thread qiaou
This works! But can you explain why it should be used like this? -- qiaou Sent using Sparrow (http://www.sparrowmailapp.com/?sig) On Wednesday, November 12, 2014, at 3:18 PM, Shixiong Zhu wrote: > You need to create a new configuration for each RDD. Therefore, "val > hbaseConf = HBaseConfigUtil.getHBaseConfiguration" should be

Re: Pyspark Error when broadcast numpy array

2014-11-11 Thread bliuab
Dear Liu: Thank you for your reply. I will set up an experimental environment for spark-1.1 and test it. On Wed, Nov 12, 2014 at 2:30 PM, Davies Liu-2 [via Apache Spark User List] < ml-node+s1001560n1868...@n3.nabble.com> wrote: > Yes, your broadcast should be about 300M, much smaller than 2G,

Re: How did the RDD.union work

2014-11-11 Thread Shixiong Zhu
You need to create a new configuration for each RDD. Therefore, "val hbaseConf = HBaseConfigUtil.getHBaseConfiguration" should be changed to "val hbaseConf = new Configuration(HBaseConfigUtil.getHBaseConfiguration)" Best Regards, Shixiong Zhu 2014-11-12 14:53 GMT+08:00 qiaou : > ok here is the
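For reference, a minimal sketch of the suggested change (HBaseConfigUtil, startRowKey and stopRowKey come from the original thread; TableInputFormat is the standard HBase input format):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    // Copy the shared configuration so every RDD gets its own independent object.
    val hbaseConf = new Configuration(HBaseConfigUtil.getHBaseConfiguration)
    hbaseConf.set(TableInputFormat.SCAN_ROW_START, startRowKey)  // per-RDD scan range
    hbaseConf.set(TableInputFormat.SCAN_ROW_STOP, stopRowKey)

    val results = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result]).values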

Re: groupBy for DStream

2014-11-11 Thread Akhil Das
1. Use foreachRDD over the DStream, and on each RDD you can call groupBy(). 2. DStream.count() returns a new DStream in which each RDD has a single element generated by counting each RDD of this DStream. Thanks Best Regards On Wed, Nov 12, 2014 at 2:49 AM, SK wrote: > > Hi. > > 1) I dont
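For reference, a minimal sketch of point 1 (the grouping key is a placeholder):

    import org.apache.spark.SparkContext._   // pair RDD operations (needed pre-1.3)

    // assumes an existing DStream[String] named `lines`
    lines.foreachRDD { rdd =>
      val counts = rdd.groupBy(line => line.take(1))   // placeholder grouping key
                      .mapValues(_.size)
      counts.collect().foreach(println)                // runs on the driver
    }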

spark sql - save to Parquet file - Unsupported datatype TimestampType

2014-11-11 Thread tridib
Hi Friends, I am trying to save a json file to Parquet. I got the error "Unsupported datatype TimestampType". Does Parquet not support dates? Which Parquet version does Spark use? Is there any workaround? Here is the stacktrace: java.lang.RuntimeException: Unsupported datatype TimestampType at

Re: How did the RDD.union work

2014-11-11 Thread qiaou
ok here is the code def hbaseQuery:(String)=>RDD[Result] = { val generateRdd = (area:String)=>{ val startRowKey = s"$area${RowKeyUtils.convertToHex(startId, 10)}" val stopRowKey = s"$area${RowKeyUtils.convertToHex(endId, 10)}" println

Re: Imbalanced shuffle read

2014-11-11 Thread Akhil Das
When you call groupByKey(), try providing the number of partitions, like groupByKey(100), depending on your data/cluster size. Thanks Best Regards On Wed, Nov 12, 2014 at 6:45 AM, ankits wrote: > Im running a job that uses groupByKey(), so it generates a lot of shuffle > data. Then it process

Re: How did the RDD.union work

2014-11-11 Thread Shixiong Zhu
Could you provide the code of hbaseQuery? Maybe it does not support being executed in parallel. Best Regards, Shixiong Zhu 2014-11-12 14:32 GMT+08:00 qiaou : > Hi: > I got a problem with using the union method of RDD > things like this > I get a function like > def hbaseQuery(area:string):RD

Re: spark-shell exception while running in YARN mode

2014-11-11 Thread hmxxyy
The Pi example gives the same error in yarn mode: HADOOP_CONF_DIR=/home/gs/conf/current ./spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client ../examples/target/spark-examples_2.10-1.2.0-SNAPSHOT.jar What could be wrong here? -- View this message in context: http://apache

Re: MLLIB usage: BLAS dependency warning

2014-11-11 Thread Xiangrui Meng
Could you try "jar tf" on the assembly jar and grep "netlib-native_system-linux-x86_64.so"? -Xiangrui On Tue, Nov 11, 2014 at 7:11 PM, jpl wrote: > Hi, > I am having trouble using the BLAS libs with the MLLib functions. I am > using org.apache.spark.mllib.clustering.KMeans (on a single machine)

How did the RDD.union work

2014-11-11 Thread qiaou
Hi: I got a problem with using the union method of RDD. Things like this: I have a function like def hbaseQuery(area:String):RDD[Result]= ??? When I use hbaseQuery('aa').union(hbaseQuery('bb')).count() it returns 0; however, when used like this: sc.parallelize(hbaseQuery('aa').collect.toL

Re: Pyspark Error when broadcast numpy array

2014-11-11 Thread Davies Liu
Yes, your broadcast should be about 300M, much smaller than 2G; I didn't read your post carefully. Broadcast in Python has been much improved since 1.1, so I think it will work in 1.1 or the upcoming 1.2 release; could you upgrade to 1.1? Davies On Tue, Nov 11, 2014 at 8:37 PM, bliuab wrote: > Dea

Is there a way to clone a JavaRDD without persisting it

2014-11-11 Thread Steve Lewis
In my problem I have a number of intermediate JavaRDDs and would like to be able to look at their sizes without destroying the RDD for subsequent processing. persist will do this, but these are big, persist seems expensive, and I am unsure of which StorageLevel is needed. Is there a way to clone
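For reference, a minimal sketch of one common pattern here (the storage level is a judgment call; MEMORY_AND_DISK_SER keeps the cached data compact and spills to disk instead of recomputing):

    import org.apache.spark.storage.StorageLevel

    // assumes an existing RDD named `intermediate`
    val cached = intermediate.persist(StorageLevel.MEMORY_AND_DISK_SER)
    val size = cached.count()    // inspect the size without losing the RDD
    // ... downstream stages keep using `cached`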

Re: Read a HDFS file from Spark source code

2014-11-11 Thread Samarth Mailinglist
Instead of a file path, use an HDFS URI. For example (in Python): data = sc.textFile("hdfs://localhost/user/someuser/data") On Wed, Nov 12, 2014 at 10:12 AM, rapelly kartheek wrote: > Hi > > I am trying to access a file in HDFS from spark "source code". Basically, > I am tweaking the spark
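The Scala equivalent, plus a sketch of going through the Hadoop FileSystem API directly (host, port and paths are placeholders), which is the usual route when working inside Spark's own source; importing org.apache.hadoop.fs.FileSystem explicitly is one common way to resolve a FileSystem name ambiguity:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Through the SparkContext:
    val data = sc.textFile("hdfs://localhost:8020/user/someuser/data")

    // Through the Hadoop API:
    val fs = FileSystem.get(new URI("hdfs://localhost:8020"), new Configuration())
    val in = fs.open(new Path("/user/someuser/data"))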

Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread Ryan Compton
Fwiw if you do decide to handle language detection on your machine this library works great on tweets https://github.com/carrotsearch/langid-java On Tue, Nov 11, 2014, 7:52 PM Tobias Pfeiffer wrote: > Hi, > > On Wed, Nov 12, 2014 at 5:42 AM, SK wrote: >> >> But getLang() is one of the methods o

spark-shell exception while running in YARN mode

2014-11-11 Thread hmxxyy
I am following the 1.1.0 document to run spark-shell in yarn client mode, just getting exceptions flooding out. bin/spark-shell --master yarn-client Spark assembly has been built with Hive, including Datanucleus jars on classpath Using Spark's default log4j profile: org/apache/spark/log4j-default

Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread Nick Pentreath
Feel free to add that converter as an option in the Spark examples via a PR :) — Sent from Mailbox On Wed, Nov 12, 2014 at 3:27 AM, alaa wrote: > Hey freedafeng, I'm exactly where you are. I want the output to show the > rowkey and all column qualifiers that correspond to it. How did you write

Re: Strange behavior of spark-shell while accessing hdfs

2014-11-11 Thread hmxxyy
Thanks guys for the info. I have to use yarn to access a kerberos cluster. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Strange-behavior-of-spark-shell-while-accessing-hdfs-tp18549p18677.html Sent from the Apache Spark User List mailing list archive at N

Read a HDFS file from Spark source code

2014-11-11 Thread rapelly kartheek
Hi, I am trying to access a file in HDFS from Spark "source code". Basically, I am tweaking the Spark source code and need to access a file in HDFS from within it. I really don't understand how to go about doing this. Can someone please help me out in this regard? Thank you!!

Re: External table partitioned by date using Spark SQL

2014-11-11 Thread ehalpern
I just realized my mistake. The name of the partition subdirectory needs to include the field name and value. With this fix, the partitioned table is working as expected. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Externa
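For reference, a minimal sketch of the Hive-style layout this implies (bucket name, schema and paths are placeholders): each partition directory is named field=value and registered against the table:

    // assumes an existing SparkContext `sc`
    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS events (payload STRING)
              PARTITIONED BY (year INT)
              LOCATION 's3n://my-bucket/events'""")
    hc.sql("ALTER TABLE events ADD PARTITION (year=2014) " +
           "LOCATION 's3n://my-bucket/events/year=2014'")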

Re: Pyspark Error when broadcast numpy array

2014-11-11 Thread bliuab
Dear Liu: Thank you very much for your help. I will apply that patch. By the way, since I succeeded in broadcasting an array of size 30M, and the log said that such an array takes around 230MB of memory, I think the numpy array that leads to the error is much smaller than 2G. On Wed, Nov 12, 2014 at

Re: Pyspark Error when broadcast numpy array

2014-11-11 Thread Davies Liu
This PR fixes the problem: https://github.com/apache/spark/pull/2659 cc @josh Davies On Tue, Nov 11, 2014 at 7:47 PM, bliuab wrote: > In spark-1.0.2, I have come across an error when I try to broadcast a quite > large numpy array(with 35M dimension). The error information except the > java.lang.N

How to solve this core dump error

2014-11-11 Thread shiwentao
I got a core dump when I used Spark 1.1.0. The environment is shown here. Software environment: OS: CentOS 6.3; JVM: Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode). Hardware environment: memory: 64G. I run three Spark processes with JVM -Xmx args like this: -Xmx 28G -Xmx 2G

Re: SVMWithSGD default threshold

2014-11-11 Thread Sean Owen
I think you need to use setIntercept(true) to get it to allow a non-zero intercept. I also kind of agree that's not an obvious or intuitive default. Is your data set highly imbalanced, with lots of positive examples? That could explain why predictions are heavily skewed. Iterations should defini
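For reference, a minimal sketch of fitting with an intercept (the iteration count is a placeholder):

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // assumes an existing RDD[LabeledPoint] named `training`
    val svm = new SVMWithSGD()
    svm.setIntercept(true)                  // allow a non-zero intercept
    svm.optimizer.setNumIterations(100)     // placeholder
    val model = svm.run(training)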

Re: Still struggling with building documentation

2014-11-11 Thread Sean Owen
(I don't think that's the same issue. This looks like some local problem with tool installation?) On Tue, Nov 11, 2014 at 9:56 PM, Patrick Wendell wrote: > The doc build appears to be broken in master. We'll get it patched up > before the release: > > https://issues.apache.org/jira/browse/SPARK-

Re: Converting Apache log string into map using delimiter

2014-11-11 Thread Sean Owen
I think it would be faster/more compact as: z.map(_.map { element => val tokens = element.split("=") (tokens(0), tokens(1)) }.toMap) (That's probably 95% right but I didn't compile or test it.) On Wed, Nov 12, 2014 at 12:18 AM, YaoPau wrote: > OK I got it working with: > > z.map(row

Re: concat two Dstreams

2014-11-11 Thread Sean Owen
Concatenate? No. It doesn't make sense in this context to think about one potentially infinite stream coming after another one. Do you just want the union of batches from two streams? Yes, just union(). You can union() with non-streaming RDDs too. On Tue, Nov 11, 2014 at 10:41 PM, Josh J wrote:
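For reference, a minimal sketch of both cases (the streams and the static RDD are assumed to exist):

    import org.apache.spark.streaming.dstream.DStream

    // assumes two DStream[String]s, streamA and streamB, and a static RDD[String] staticRdd
    val combined: DStream[String] = streamA.union(streamB)

    // union with a non-streaming RDD, applied batch by batch
    val withStatic: DStream[String] = streamA.transform(rdd => rdd.union(staticRdd))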

Re: groupBy for DStream

2014-11-11 Thread Sean Owen
A DStream is a sequence of RDDs. Just groupBy each RDD. Likewise, count() does not return a count over all history. It returns a count of each RDD in the stream, not one count. You can head or take an RDD in the stream, but it doesn't make as much sense to talk about the first element of the entir

Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread Tobias Pfeiffer
Hi, On Wed, Nov 12, 2014 at 5:42 AM, SK wrote: > > But getLang() is one of the methods of twitter4j.Status since version 3.0.6 > according to the doc at: >http://twitter4j.org/javadoc/twitter4j/Status.html#getLang-- > > What version of twitter4j does Spark Streaming use? > 3.0.3 https://gith

External table partitioned by date using Spark SQL

2014-11-11 Thread ehalpern
I have large json files stored in S3, grouped under a sub-key for each year like this: I've defined an external table that's partitioned by year to keep year-limited queries efficient. The table definition looks like this: But alas, a simple query like: yields no results. If I remove the

Re: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Tobias Pfeiffer
Hi, also there is Spindle which was introduced on this list some time ago. I haven't looked into it deeply, but you might gain some valuable insights from their architecture, they are also using Spark to fulfill requests coming from the web. Tobias

Pyspark Error when broadcast numpy array

2014-11-11 Thread bliuab
In spark-1.0.2, I have come across an error when I try to broadcast a quite large numpy array (35M dimensions). The error information, apart from the java.lang.NegativeArraySizeException error, and the details are listed below. Moreover, when broadcasting a relatively smaller numpy array (30M dimensions), every

RE: Help with processing multiple RDDs

2014-11-11 Thread Khandeshi, Ami
I am running as Local in client mode. I have allocated as high as 85g to the driver, executor, and daemon. When I look at Java processes, I see two: 20974 SparkSubmitDriverBootstrapper 21650 Jps 21075 SparkSubmit I have tried repartition before, but my understanding is that comes with

MLLIB usage: BLAS dependency warning

2014-11-11 Thread jpl
Hi, I am having trouble using the BLAS libs with the MLLib functions. I am using org.apache.spark.mllib.clustering.KMeans (on a single machine) and running the Spark-shell with the kmeans example code (from https://spark.apache.org/docs/latest/mllib-clustering.html) which runs successfully but I

Re: Help with processing multiple RDDs

2014-11-11 Thread buring
I think you can try setting a lower spark.storage.memoryFraction, for example 0.4: conf.set("spark.storage.memoryFraction","0.4") //default 0.6 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-processing-multiple-RDDs-tp18628p18659.html Sent from the Ap

Re: Strange behavior of spark-shell while accessing hdfs

2014-11-11 Thread ramblingpolak
Only YARN mode is supported with Kerberos. You can't use a spark:// master with Kerberos. Tobias Pfeiffer wrote > When you give a "spark://*" master, Spark will run on a different machine, > where you have not yet authenticated to HDFS, I think. I don't know how to > solve this, though, maybe som

Re: Strange behavior of spark-shell while accessing hdfs

2014-11-11 Thread ramblingpolak
You need to set the Spark configuration property spark.yarn.access.namenodes to your namenode, e.g. spark.yarn.access.namenodes=hdfs://mynamenode:8020. Similarly, I'm curious whether you're also running high-availability HDFS with an HA nameservice. I currently have HA HDFS and Kerberos and I've noti

Re: Status of MLLib exporting models to PMML

2014-11-11 Thread Sean Owen
Yes, although I think this difference is on purpose, as part of that commercial strategy. If future versions change license, it would still be possible to not upgrade, or to fork / recreate the bean classes. Not worried so much, but it is a good point. On Nov 11, 2014 10:06 PM, "DB Tsai" wrote: > I also

Re: Strange behavior of spark-shell while accessing hdfs

2014-11-11 Thread Tobias Pfeiffer
Hi, On Tue, Nov 11, 2014 at 2:04 PM, hmxxyy wrote: > > If I run bin/spark-shell without connecting a master, it can access a hdfs > file on a remote cluster with kerberos authentication. [...] However, if I start the master and slave on the same host and using > bin/spark-shell --master spark:/

RE: Help with processing multiple RDDs

2014-11-11 Thread Kapil Malik
Hi, How is 78g distributed between driver, daemon, and executor? Can you please paste the logs regarding "that I don't have enough memory to hold the data in memory"? Are you collecting any data in the driver? Lastly, did you try doing a repartition to create smaller and evenly distributed partitions?

Re: thrift jdbc server probably running queries as hive query

2014-11-11 Thread Cheng Lian
Hey Sadhan, Sorry for my previous abrupt reply. Submitting an MR job is definitely wrong here; I'm investigating. Would you mind providing the Spark/Hive/Hadoop versions you are using? If you're using the most recent master branch, a concrete commit SHA1 would be very helpful. Thanks! Cheng On

Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread alaa
Hey freedafeng, I'm exactly where you are. I want the output to show the rowkey and all column qualifiers that correspond to it. How did you write HBaseResultToStringConverter to do what you wanted it to do? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/py

ISpark class not found

2014-11-11 Thread Laird, Benjamin
I've been experimenting with the ISpark extension to IScala (https://github.com/tribbloid/ISpark) Objects created in the REPL are not being loaded correctly on worker nodes, leading to a ClassNotFound exception. This does work correctly in spark-shell. I was curious if anyone has used ISpark an

Imbalanced shuffle read

2014-11-11 Thread ankits
I'm running a job that uses groupByKey(), so it generates a lot of shuffle data. Then it processes this and writes files to HDFS in a forEachPartition block. Looking at the forEachPartition stage details in the web console, all but one executor are idle (SUCCESS in 50-60ms), and one is RUNNING with a

Does spark can't work with HBase?

2014-11-11 Thread gzlj
Hello all, I have tested reading an HBase table with Spark 1.1 using SparkContext.newAPIHadoopRDD. I found the performance is much slower than reading from Hive. I also tried reading data using HFileScanner on one region HFile, but the performance is not good. So, how do I improve the performance of Spark reading fro

Re: how to create a Graph in GraphX?

2014-11-11 Thread ankurdave
You should be able to construct the edges in a single map() call without using collect(): val edges: RDD[Edge[String]] = sc.textFile(...).map { line => val row = line.split(","); Edge(row(0).toLong, row(1).toLong, row(2)) } val graph: Graph[Int, String] = Graph.fromEdges(edges, defaultValue = 1) -- View th

SVMWithSGD default threshold

2014-11-11 Thread Caron
I'm hoping to get a linear classifier on a dataset. I'm using SVMWithSGD to train the data. After running with the default options: val model = SVMWithSGD.train(training, numIterations), I don't think SVM has done the classification correctly. My observations: 1. the intercept is always 0.0 2. th

"overloaded method value updateStateByKey ... cannot be applied to ..." when Key is a Tuple2

2014-11-11 Thread spr
I am creating a workflow; I have an existing call to updateStateByKey that works fine, but when I created a second use where the key is a Tuple2, it's now failing with the dreaded "overloaded method value updateStateByKey with alternatives ... cannot be applied to ..." Comparing the two uses I'm
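For reference, a minimal sketch of the shape that typically compiles with a Tuple2 key (the stream and the update logic are placeholders; naming the state type explicitly often helps the compiler pick the right overload):

    import org.apache.spark.streaming.StreamingContext._   // pair DStream operations (pre-1.3)
    import org.apache.spark.streaming.dstream.DStream

    // assumes an existing DStream[((String, String), Int)] named `pairs`
    val updateFunc = (values: Seq[Int], state: Option[Int]) =>
      Some(values.sum + state.getOrElse(0))

    val running = pairs.updateStateByKey[Int](updateFunc)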

Re: Converting Apache log string into map using delimiter

2014-11-11 Thread YaoPau
OK I got it working with: z.map(row => (row.map(element => element.split("=")(0)) zip row.map(element => element.split("=")(1))).toMap) But I'm guessing there is a more efficient way than to create two separate lists and then zip them together and then convert the result into a map. -- View th

in function prototypes?

2014-11-11 Thread spr
I am creating a workflow; I have an existing call to updateStateByKey that works fine, but when I created a second use where the key is a Tuple2, it's now failing with the dreaded "overloaded method value updateStateByKey with alternatives ... cannot be applied to ..." Comparing the two uses I'm

Converting Apache log string into map using delimiter

2014-11-11 Thread YaoPau
I have an RDD of logs that look like this: /no_cache/bi_event?Log=0&pg_inst=517638988975678942&pg=fow_mwe&ver=c.2.1.8&site=xyz.com&pid=156431807121222351&rid=156431666543211500&srch_id=156431666581865115&row=6&seq=1&tot=1&tsp=1&cmp=thmb_12&co_txt_url=Viewing&et=click&thmb_type=p&ct=u&c=579855&lnx=

Re: Native / C/C++ code integration

2014-11-11 Thread Paul Wais
More thoughts. I took a deeper look at BlockManager, RDD, and friends. Suppose one wanted to get native code access to un-deserialized blocks. This task looks very hard. An RDD behaves much like a Scala iterator of deserialized values, and interop with BlockManager is all on deserialized data.

Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread freedafeng
Just wrote a custom converter in Scala to replace HBaseResultToStringConverter. Just a couple of lines of code. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-tp18613p18639.html Sent from the Apach

RE: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Mohammed Guller
David, Here is what I would suggest: 1 - Does a new SparkContext get created in the web tier for each new request for processing? Create a single SparkContext that gets shared across multiple web requests. Depending on the framework that you are using for the web-tier, it should not be difficul

RE: Spark and Play

2014-11-11 Thread Mohammed Guller
Actually, it is possible to integrate Spark 1.1.0 with Play 2.2.x Here is a sample build.sbt file: name := """xyz""" version := "0.1 " scalaVersion := "2.10.4" libraryDependencies ++= Seq( jdbc, anorm, cache, "org.apache.spark" %% "spark-core" % "1.1.0", "com.typesafe.akka" %% "akka-

Re: S3 table to spark sql

2014-11-11 Thread Rishi Yadav
Simple: scala> val date = new java.text.SimpleDateFormat("mmdd").parse(fechau3m) should work. Replace "mmdd" with the format fechau3m is in. If you want to do it at the case class level: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) //HiveContext always a good idea import s
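For reference, a minimal sketch of carrying a real date at the case-class level (the format string and fields are placeholders; Spark SQL maps java.sql.Timestamp to its timestamp type):

    import java.sql.Timestamp
    import java.text.SimpleDateFormat

    case class TrxWithDate(id: String, fechau3m: Timestamp)

    // assumes an existing RDD[(String, String)] of (id, raw date string) named `raw`
    val fmt = new SimpleDateFormat("yyyyMMdd")    // placeholder pattern
    val withDates = raw.map { case (id, s) =>
      TrxWithDate(id, new Timestamp(fmt.parse(s).getTime))
    }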

Re: concat two Dstreams

2014-11-11 Thread Josh J
I think it's just called "union" On Tue, Nov 11, 2014 at 2:41 PM, Josh J wrote: > Hi, > > Is it possible to concatenate or append two Dstreams together? I have an > incoming stream that I wish to combine with data that's generated by a > utility. I then need to process the combined Dstream. > >

concat two Dstreams

2014-11-11 Thread Josh J
Hi, Is it possible to concatenate or append two Dstreams together? I have an incoming stream that I wish to combine with data that's generated by a utility. I then need to process the combined Dstream. Thanks, Josh

Re: which is the recommended workflow engine for Apache Spark jobs?

2014-11-11 Thread Harry Brundage
We've had success with Azkaban from LinkedIn over Oozie and Luigi. http://azkaban.github.io/ Azkaban has support for many different job types, a fleshed out web UI with decent log reporting, a decent failure / retry model, a REST api, and I think support for multiple executor slaves is coming in t

Help with processing multiple RDDs

2014-11-11 Thread akhandeshi
I have been struggling to process a set of RDDs. Conceptually, it is not a large data set. It seems that no matter how much I provide to the JVM or how I partition, I can't seem to process this data. I am caching the RDD. I have tried persist(disk and memory), persist(memory) and persist(off_heap) with no su

Re: Status of MLLib exporting models to PMML

2014-11-11 Thread DB Tsai
I also worry that the author of JPMML changed the license of jpmml-evaluator in the interest of his commercial business, and that he might change the license of jpmml-model in the future. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.c

Re: Spark and Play

2014-11-11 Thread Patrick Wendell
Hi There, Because Akka versions are not binary compatible with one another, it might not be possible to integrate Play with Spark 1.1.0. - Patrick On Tue, Nov 11, 2014 at 8:21 AM, Akshat Aranya wrote: > Hi, > > Sorry if this has been asked before; I didn't find a satisfactory answer > when sear

Re: Still struggling with building documentation

2014-11-11 Thread Patrick Wendell
The doc build appears to be broken in master. We'll get it patched up before the release: https://issues.apache.org/jira/browse/SPARK-4326 On Tue, Nov 11, 2014 at 10:50 AM, Alessandro Baretta wrote: > Nichols and Patrick, > > Thanks for your help, but, no, it still does not work. The latest mast

Re: closure serialization behavior driving me crazy

2014-11-11 Thread Sandy Ryza
I tried turning on the extended debug info. The Scala output is a little opaque (lots of "- field (class "$iwC$$iwC$$iwC$$iwC$$iwC$$iwC", name: "$iw", type: "class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC""), but it seems like, as expected, somehow the full array of OLSMultipleLinearRegression objects i

groupBy for DStream

2014-11-11 Thread SK
Hi. 1) I don't see a groupBy() method for a DStream object. Not sure why that is not supported. Currently I am using filter() to separate out the different groups. I would like to know if there is a way to convert a DStream object to a regular RDD so that I can apply RDD methods like groupBy.

Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread SK
Small typo in my code in the previous post. That should be: tweets.filter(_.getLang()=="en") -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/filtering-out-non-English-tweets-using-TwitterUtils-tp18614p18622.html Sent from the Apache Spark User List mail

Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread SK
Thanks for the response. I tried the following : tweets.filter(_.getLang()="en") I get a compilation error: value getLang is not a member of twitter4j.Status But getLang() is one of the methods of twitter4j.Status since version 3.0.6 according to the doc at: http://twitter4j.org/javadoc

failed to create a table with python (single node)

2014-11-11 Thread Pagliari, Roberto
I'm executing this example from the documentation (in single node mode) # sc is an existing SparkContext. from pyspark.sql import HiveContext sqlContext = HiveContext(sc) sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") # Queries can be expressed in HiveQL. results = sqlC

Re: pyspark get column family and qualifier names from hbase table

2014-11-11 Thread freedafeng
I checked the source and found the following: class HBaseResultToStringConverter extends Converter[Any, String] { override def convert(obj: Any): String = { val result = obj.asInstanceOf[Result] Bytes.toStringBinary(result.value()) } } I feel using 'result.value()' here is a big limitation
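For reference, a minimal sketch of a converter that keeps family and qualifier as well (assumes an HBase client that exposes the Cell API, i.e. 0.96+; the output format is arbitrary):

    import org.apache.hadoop.hbase.CellUtil
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.api.python.Converter

    class HBaseResultToFullStringConverter extends Converter[Any, String] {
      override def convert(obj: Any): String = {
        val result = obj.asInstanceOf[Result]
        result.rawCells().map { cell =>
          Bytes.toStringBinary(CellUtil.cloneFamily(cell)) + ":" +
            Bytes.toStringBinary(CellUtil.cloneQualifier(cell)) + "=" +
            Bytes.toStringBinary(CellUtil.cloneValue(cell))
        }.mkString(",")
      }
    }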

Failed jobs showing as SUCCEEDED on web UI

2014-11-11 Thread Brett Meyer
I'm running a Python script using spark-submit on YARN in an EMR cluster, and if I have a job that fails due to ExecutorLostFailure or if I kill the job, it still shows up on the web UI with a FinalStatus of SUCCEEDED. Is this due to PySpark, or is there potentially some other issue with the job f

Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread Tathagata Das
You could get all the tweets in the stream, and then apply a "filter" transformation on the DStream of tweets to filter away non-English tweets. The tweets in the DStream are of type twitter4j.Status, which has a field describing the language. You can use that in the filter. Though in practice, a lot
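For reference, a minimal sketch (assumes a twitter4j version on the classpath that exposes Status.getLang, i.e. 3.0.6 or later, and Twitter credentials already configured):

    import org.apache.spark.streaming.twitter.TwitterUtils

    // assumes an existing StreamingContext named `ssc`
    val tweets = TwitterUtils.createStream(ssc, None)
    val english = tweets.filter(_.getLang == "en")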

Re: Status of MLLib exporting models to PMML

2014-11-11 Thread Sean Owen
Yes, jpmml-evaluator is AGPL, but things like jpmml-model are not; they're 3-clause BSD: https://github.com/jpmml/jpmml-model So some of the scoring components are off-limits for an AL2 project but the core model components are OK. On Tue, Nov 11, 2014 at 7:40 PM, DB Tsai wrote: > JPMML evalua

filtering out non English tweets using TwitterUtils

2014-11-11 Thread SK
Hi, Is there a way to extract only the English language tweets when using TwitterUtils.createStream()? The "filters" argument specifies the strings that need to be contained in the tweets, but I am not sure how this can be used to specify the language. thanks -- View this message in context:

Re: Status of MLLib exporting models to PMML

2014-11-11 Thread DB Tsai
JPMML evaluator just changed their license to AGPL or commercial license, and I think AGPL is not compatible with apache project. Any advice? https://github.com/jpmml/jpmml-evaluator Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com Linked

pyspark get column family and qualifier names from hbase table

2014-11-11 Thread freedafeng
Hello there, I am wondering how to get the column family names and column qualifier names when using pyspark to read an HBase table with multiple column families. I have an HBase table as follows: hbase(main):007:0> scan 'data1' ROW COLUMN+CELL

S3 table to spark sql

2014-11-11 Thread Franco Barrientos
How can I create a date field in Spark SQL? I have an S3 table and I load it into an RDD. val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class trx_u3m(id: String, local: String, fechau3m: String, rubro: Int, sku: String, unidades: Double, mon

Re: How to execute a function from class in distributed jar on each worker node?

2014-11-11 Thread aaronjosephs
I'm not sure that this will work but it makes sense to me. Basically you write the functionality in a static block in a class and broadcast that class. Not sure what your use case is but I need to load a native library and want to avoid running the init in mapPartitions if it's not necessary (just

Re: Still struggling with building documentation

2014-11-11 Thread Alessandro Baretta
Nichols and Patrick, Thanks for your help, but, no, it still does not work. The latest master produces the following scaladoc errors: [error] /home/alex/git/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:55: not found: type Type [error] protected

Re: scala.MatchError

2014-11-11 Thread Michael Armbrust
Xiangrui is correct that it must be a Java bean; also, nested classes are not yet supported in Java. On Tue, Nov 11, 2014 at 10:11 AM, Xiangrui Meng wrote: > I think you need a Java bean class instead of a normal class. See > example here: > http://spark.apache.org/docs/1.1.0/sql-programming-guid

Re: scala.MatchError

2014-11-11 Thread Xiangrui Meng
I think you need a Java bean class instead of a normal class. See example here: http://spark.apache.org/docs/1.1.0/sql-programming-guide.html (switch to the java tab). -Xiangrui On Tue, Nov 11, 2014 at 7:18 AM, Naveen Kumar Pokala wrote: > Hi, > > > > This is my Instrument java constructor. > > >

Re: MLLib Decision Tress algorithm hangs, others fine

2014-11-11 Thread Xiangrui Meng
Could you provide more information? For example, Spark version, dataset size (number of instances/number of features), cluster size, and error messages from both the driver and the executor. -Xiangrui On Mon, Nov 10, 2014 at 11:28 AM, tsj wrote: > Hello all, > > I have some text data that I am running

Re: Broadcast failure with variable size of ~ 500mb with "key already cancelled ?"

2014-11-11 Thread Davies Liu
There is an open PR [1] to support broadcasts larger than 2G, could you try it? [1] https://github.com/apache/spark/pull/2659 On Tue, Nov 11, 2014 at 6:39 AM, Tom Seddon wrote: > Hi, > > Just wondering if anyone has any advice about this issue, as I am > experiencing the same thing. I'm working w

Re: Status of MLLib exporting models to PMML

2014-11-11 Thread Xiangrui Meng
Vincenzo sent a PR and included k-means as an example. Sean is helping review it. PMML standard is quite large. So we may start with simple model export, like linear methods, then move forward to tree-based. -Xiangrui On Mon, Nov 10, 2014 at 11:27 AM, Aris wrote: > Hello Spark and MLLib folks, >

data locality, task distribution

2014-11-11 Thread Nathan Kronenfeld
Can anyone point me to a good primer on how spark decides where to send what task, how it distributes them, and how it determines data locality? I'm trying a pretty simple task - it's doing a foreach over cached data, accumulating some (relatively complex) values. So I see several inconsistencies

Re: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Evan R. Sparks
For sharing RDDs across multiple jobs, you could also have a look at Tachyon. It provides an HDFS-compatible in-memory storage layer that keeps data in memory across multiple jobs/frameworks - http://tachyon-project.org/ . On Tue, Nov 11, 2014 at 8:11 AM, Sonal Goyal wrote: > I believe the S

What should be the number of partitions after a union and a subtractByKey

2014-11-11 Thread Darin McBeath
Assume the following, where updatePairRDD and deletePairRDD are both HashPartitioned. Before the union, each of these has 512 partitions. The newly created updateDeletePairRDD has 1024 partitions. Is this the general/expected behavior for a union (for the number of partitions to double)? J

Re: thrift jdbc server probably running queries as hive query

2014-11-11 Thread Sadhan Sood
Hi Cheng, I made sure the only hive server running on the machine is hivethriftserver2. /usr/lib/jvm/default-java/bin/java -cp /usr/lib/hadoop/lib/hadoop-lzo.jar::/mnt/sadhan/spark-3/sbin/../conf:/mnt/sadhan/spark-3/spark-assembly-1.2.0-SNAPSHOT-hadoop2.3.0-cdh5.0.2.jar:/etc/hadoop/conf -Xms512m

Spark and Play

2014-11-11 Thread Akshat Aranya
Hi, Sorry if this has been asked before; I didn't find a satisfactory answer when searching. How can I integrate a Play application with Spark? I'm getting into issues of akka-actor versions. Play 2.2.x uses akka-actor 2.0, whereas Play 2.3.x uses akka-actor 2.3.4, neither of which work fine wi

Re: Best practice for multi-user web controller in front of Spark

2014-11-11 Thread Sonal Goyal
I believe the Spark Job Server by Ooyala can help you share data across multiple jobs, take a look at http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server. It seems to fit closely to what you need. Best Regards, Sonal Founder, Nube Technologies

Re: Cassandra spark connector exception: "NoSuchMethodError: com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;"

2014-11-11 Thread shahab
Thanks Helena. I think I will wait for the new release and try it. Again thanks, /Shahab On Tue, Nov 11, 2014 at 3:41 PM, Helena Edelson wrote: > Hi, > It looks like you are building from master > (spark-cassandra-connector-assembly-1.2.0). > - Append this to your com.google.guava declaration:

Combining data from two tables in two databases postgresql, JdbcRDD.

2014-11-11 Thread akshayhazari
I want to be able to perform a query on two tables in different databases. I want to know whether it can be done. I've heard about the union of two RDDs, but here I want to connect to something like different partitions of a table. Any help is appreciated. import java.io.Serializable; //import org.ju
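For reference, a minimal sketch of querying two databases with JdbcRDD and unioning the results (URLs, credentials, SQL and bounds are placeholders):

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    // assumes an existing SparkContext `sc`
    def tableRdd(url: String) = new JdbcRDD(
      sc,
      () => DriverManager.getConnection(url, "user", "password"),
      "SELECT id, name FROM customers WHERE ? <= id AND id <= ?",
      1, 100000, 4,                                  // lower bound, upper bound, partitions
      (rs: ResultSet) => (rs.getInt(1), rs.getString(2)))

    val combined = tableRdd("jdbc:postgresql://host1/db1")
      .union(tableRdd("jdbc:postgresql://host2/db2"))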

scala.MatchError

2014-11-11 Thread Naveen Kumar Pokala
Hi, This is my Instrument java constructor. public Instrument(Issue issue, Issuer issuer, Issuing issuing) { super(); this.issue = issue; this.issuer = issuer; this.issu
