Streaming Receiver Imbalance Problem

2015-09-22 Thread SLiZn Liu
Hi spark users, In our Spark Streaming app via Kafka integration on Mesos, we initialized 3 receivers to receive 3 Kafka partitions, but a record receiving rate imbalance has been observed: with spark.streaming.receiver.maxRate set to 120, sometimes 1 of the receivers receives very close to the limit whil

Re: How to speed up MLlib LDA?

2015-09-22 Thread Marko Asplund
Hi, I did some profiling for my LDA prototype code that requests topic distributions from a model. According to Java Mission Control more than 80 % of execution time during sample interval is spent in the following methods: org.apache.commons.math3.util.FastMath.log(double); count: 337; 47.07% or

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-22 Thread Zhiliang Zhu
Dear Sujit, Since you are experienced with Spark, may I ask whether it is convenient for you to comment on my dilemma while using Spark to deal with an R background application ... Thank you very much! Zhiliang On Tuesday, September 22, 2015 1:45 AM, Zhiliang Zhu wrote: Hi

RE: Support of other languages?

2015-09-22 Thread Sun, Rui
Although the data is RDD[Array[Byte]] whose content is not meaningful to Spark Core, it has to be on heap, as Spark Core manipulates RDD transformations on heap. SPARK-10399 is irrelevant: it aims to manipulate off-heap data using a C++ library via JNI. This is done in-process. From: Rahul Palamu

Re: spark.mesos.coarse impacts memory performance on mesos

2015-09-22 Thread Tim Chen
Hi Utkarsh, Just to be sure: you originally set coarse to false but then to true? Or is it the other way around? Also, what's the exception/stack trace when the driver crashed? Coarse-grain mode pre-starts all the Spark executor backends, so it has the least overhead compared to fine-grain. There is

Re: passing SparkContext as parameter

2015-09-22 Thread Petr Novak
To complete design pattern: http://stackoverflow.com/questions/30450763/spark-streaming-and-connection-pool-implementation Petr On Mon, Sep 21, 2015 at 10:02 PM, Romi Kuntsman wrote: > Cody, that's a great reference! > As shown there - the best way to connect to an external database from the >
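
For reference, a minimal sketch of the pattern from that link, with a hypothetical connection pool standing in for whatever client library is actually used (ConnectionPool and insert are placeholders, not a real API):

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // Obtain the connection on the executor, never on the driver,
        // so nothing non-serializable has to cross the wire.
        val connection = ConnectionPool.getConnection()          // hypothetical pool
        records.foreach(record => connection.insert(record))     // hypothetical write call
        ConnectionPool.returnConnection(connection)
      }
    }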

Re: passing SparkContext as parameter

2015-09-22 Thread Petr Novak
And probably the original source code https://gist.github.com/koen-dejonghe/39c10357607c698c0b04 On Tue, Sep 22, 2015 at 10:37 AM, Petr Novak wrote: > To complete design pattern: > > http://stackoverflow.com/questions/30450763/spark-streaming-and-connection-pool-implementation > > Petr > > On Mo

Re: Deploying spark-streaming application on production

2015-09-22 Thread Petr Novak
If MQTT can be configured with a long enough timeout for ACK and can buffer enough events while waiting for a Spark job restart, then I think one could do even without WAL, assuming that the Spark job shuts down gracefully. Possibly saving its own custom metadata somewhere, e.g. ZooKeeper, if required to rest

Re: Deploying spark-streaming application on production

2015-09-22 Thread Petr Novak
Ahh, the problem probably is async ingestion into Spark receiver buffers, hence the WAL is required, I would say. Petr On Tue, Sep 22, 2015 at 10:53 AM, Petr Novak wrote: > If MQTT can be configured with long enough timeout for ACK and can buffer > enough events while waiting for Spark Job restart then

Re: spark + parquet + schema name and metadata

2015-09-22 Thread Borisa Zivkovic
thanks for the answer. I need this in order to be able to track schema metadata. Basically when I create parquet files from Spark I want to be able to "tag" them in some way (giving the schema an appropriate name or attaching some key/values) and then it is fairly easy to get basic metadata about parque

Re: Deploying spark-streaming application on production

2015-09-22 Thread Petr Novak
Or if there is an option on the MQTT server to block event ingestion towards Spark but still keep receiving and buffering events in MQTT and wait for ACK, then it would be possible just to gracefully shut down the Spark job to finish what is in its buffers and restart. Petr On Tue, Sep 22, 2015 at 10:53 AM

Re: Troubleshooting "Task not serializable" in Spark/Scala environments

2015-09-22 Thread Huy Banh
The header should already be sent from the driver to workers by Spark, and therefore it works in spark-shell. In Scala IDE, the code is inside an app class, so you need to check whether the app class is serializable. On Tue, Sep 22, 2015 at 9:13 AM Alexis Gillain <alexis.gill...@googlemail.com> wrote: >

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-22 Thread dmytro
Could it be that your data is skewed? Do you have variable-length column types? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750p24762.html Sent from the Apache Spark User List mailing list ar

Re: Lost tasks in Spark SQL join jobs

2015-09-22 Thread Akhil Das
If you look a bit into the error logs, you can possibly see other issues like GC overhead etc., which cause the next set of tasks to fail. Thanks Best Regards On Thu, Sep 17, 2015 at 9:26 AM, Gang Bai wrote: > Hi all, > > I’m joining two tables on a specific attribute. The job is like > `sqlCont

Re: Spark Web UI + NGINX

2015-09-22 Thread Akhil Das
Can you not just tunnel it? Like on Machine A: ssh -L 8080:127.0.0.1:8080 machineB And on your local machine: ssh -L 80:127.0.0.1:8080 machineA And then simply open http://localhost/ which will show the Spark UI running on machineB. The people at DigitalOcean have made a wonderful article on how to

Re: passing SparkContext as parameter

2015-09-22 Thread Priya Ch
I have a scenario like this - I read a dstream of messages from Kafka. Now if my rdd contains 10 messages, for each message I need to query the Cassandra DB, do some modification and update the records in the DB. If there is no option of passing sparkContext to workers to read/write into the DB, the only opti

Uneven distribution of tasks among workers in Spark/GraphX 1.5.0

2015-09-22 Thread dmytro
I have a large list of edges as a 5000 partition RDD. Now, I'm doing a simple but shuffle-heavy operation: val g = Graph.fromEdges(edges, ...).partitionBy(...) val subs = Graph(g.collectEdges(...), g.edges).collectNeighbors() subs.saveAsObjectFile("hdfs://...") The job gets divided into 9 stages.

Re: Has anyone used the Twitter API for location filtering?

2015-09-22 Thread Akhil Das
That's because sometimes getPlace returns null and calling getLang on null throws either a NullPointerException or NoSuchMethodError. You need to filter out those statuses which don't include location data. Thanks Best Regards On Fri, Sep 18, 2015 at 12:46 AM, Jo Sunad wrote: > I've bee

Re: Cache after filter Vs Writing back to HDFS

2015-09-22 Thread Akhil Das
Instead of .map you can try doing a .mapPartitions and see the performance. Thanks Best Regards On Fri, Sep 18, 2015 at 2:47 AM, Gavin Yue wrote: > For a large dataset, I want to filter out something and then do the > computing intensive work. > > What I am doing now: > > Data.filter(somerules)
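
A minimal sketch of the difference, assuming the per-record work needs some expensive setup (data, somerules and buildParser are placeholders for the poster's own code):

    // With map, any setup done inside the function is repeated for every record:
    val out1 = data.filter(somerules).map(record => buildParser().parse(record))

    // With mapPartitions, the setup is done once per partition and reused:
    val out2 = data.filter(somerules).mapPartitions { records =>
      val parser = buildParser()                  // hypothetical expensive setup, once per partition
      records.map(record => parser.parse(record))
    }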

Re: SparkContext declared as object variable

2015-09-22 Thread Akhil Das
It's a "value", not a variable. And what are you parallelizing here? Thanks Best Regards On Fri, Sep 18, 2015 at 11:21 PM, Priya Ch wrote: > Hello All, > > Instead of declaring sparkContext in main, declared as object variable > as - > > object sparkDemo > { > > val conf = new SparkConf > va

Re: Stopping criteria for gradient descent

2015-09-22 Thread Yanbo Liang
Hi Nishanth, The convergence tolerance is a condition which decides iteration termination. In LogisticRegression with SGD optimization, it depends on the difference of weight vectors. But in GBT it depends on the validate error on the held out test set. 2015-09-18 4:09 GMT+08:00 nishanthps : > H

Re: Spark Lost executor && shuffle.FetchFailedException

2015-09-22 Thread Akhil Das
If you can look a bit deeper in the executor logs, then you might find the root cause for this issue. Also make sure the ports (seems 34869 here) are accessible between all the machines. Thanks Best Regards On Mon, Sep 21, 2015 at 12:40 PM, wrote: > Hi All: > > > When I write the data to the hi

Re: AWS_CREDENTIAL_FILE

2015-09-22 Thread Akhil Das
No, you can either set the configurations within your SparkConf's hadoop configuration: val hadoopConf = sparkContext.hadoopConfiguration hadoopConf.set("fs.s3n.awsAccessKeyId", s3Key) hadoopConf.set("fs.s3n.awsSecretAccessKey", s3Secret) or you can set it in the environment as: export A
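
Putting the first option together as a short sketch (the key values and the bucket path are placeholders):

    val hadoopConf = sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3n.awsAccessKeyId", s3Key)           // placeholder values
    hadoopConf.set("fs.s3n.awsSecretAccessKey", s3Secret)

    // Subsequent s3n:// reads pick up the credentials from the Hadoop configuration
    val logs = sparkContext.textFile("s3n://some-bucket/some-prefix/*")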

Re: AWS_CREDENTIAL_FILE

2015-09-22 Thread Steve Loughran
On 22 Sep 2015, at 10:40, Akhil Das <ak...@sigmoidanalytics.com> wrote: or you can set it in the environment as: export AWS_ACCESS_KEY_ID= export AWS_SECRET_ACCESS_KEY= didn't think the Hadoop code looked at those. There aren't any references to the env vars in that Hadoop s3n codeb

Re: AWS_CREDENTIAL_FILE

2015-09-22 Thread Gourav Sengupta
Hi, I think that it is a very bad practice to use your keys in nodes. Please start EC2 nodes/ EMR Clusters with proper roles and you do not have to worry about any keys at all. Kindly refer to AWS documentation for further details. Regards, Gourav On Mon, Sep 21, 2015 at 4:34 PM, Michel Lemay

Re: Slow Performance with Apache Spark Gradient Boosted Tree training runs

2015-09-22 Thread Yashwanth Kumar
Hi vkutsenko, Can you just repartition the input labeled RDD, like: data = MLUtils.loadLibSVMFile(jsc.sc(), "s3://somebucket/somekey/plaintext_libsvm_file").toJavaRDD().*repartition(5)*; Here, I used 5, since you have 5 cores. Also for further benchmark and performance tuning: h

SparkR for accumulo

2015-09-22 Thread madhvi.gupta
Hi, I want to process Accumulo data in R through SparkR. Can anyone help me and let me know how to get Accumulo data into Spark to be used in R? -- Thanks and Regards Madhvi Gupta - To unsubscribe, e-mail: user-unsubscr...@spar

Querying on multiple Hive stores using Apache Spark

2015-09-22 Thread Karthik
I have a spark application which will successfully connect to hive and query on hive tables using spark engine. To build this, I just added hive-site.xml to classpath of the application and spark will read the hive-site.xml to connect to its metastore. This method was suggested in spark's mailing

Re: Why are executors on slave never used?

2015-09-22 Thread Joshua Fox
Thank you Hemant and Andrew, I got it working. On Mon, Sep 21, 2015 at 11:48 PM, Andrew Or wrote: > Hi Joshua, > > What cluster manager are you using, standalone or YARN? (Note that > standalone here does not mean local mode). > > If standalone, you need to do `setMaster("spark://[CLUSTER_URL]:7

Re: Creating BlockMatrix with java API

2015-09-22 Thread Yanbo Liang
This is because the distributed matrices like BlockMatrix/RowMatrix/IndexedRowMatrix/CoordinateMatrix do not provide Java-friendly constructors. I have filed SPARK-10757 to track this issue. 2015-09-18 3:36 GMT+08:00 Pulasthi Supun Wickramasinghe

Why RDDs are being dropped by Executors?

2015-09-22 Thread Uthayan Suthakar
Hello All, We have a Spark Streaming job that reads data from a DB (three tables) and caches them into memory ONLY at the start; then it will happily carry out the incremental calculation with the new data. What we've noticed occasionally is that one of the RDDs caches only 90% of the data. Therefore,

Why is 1 executor overworked and other sit idle?

2015-09-22 Thread Chirag Dewan
Hi, I am using Spark to access around 300m rows in Cassandra. My job is pretty simple as I am just mapping my row into a CSV format and saving it as a text file. public String call(CassandraRow row) throws Excepti

Re: passing SparkContext as parameter

2015-09-22 Thread Priya Ch
Suppose I use rdd.joinWithCassandra("keySpace", "table1"), does this do a full table scan, which is not required at any cost? On Tue, Sep 22, 2015 at 3:03 PM, Artem Aliev wrote: > All that code should look like: > stream.filter(...).map(x=>(key, > )).joinWithCassandra().map(...).save
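
A hedged sketch of what Artem describes, assuming the spark-cassandra-connector is on the classpath; the keyspace, table, column names and helper functions are placeholders. joinWithCassandraTable pulls only the rows matching the keys on the left side rather than scanning the whole table:

    import com.datastax.spark.connector._

    stream.foreachRDD { rdd =>
      rdd
        .map(message => Tuple1(extractKey(message)))         // extractKey is a hypothetical helper
        .joinWithCassandraTable("key_space", "table1")       // fetches only the matching rows
        .map { case (Tuple1(key), row) =>
          (key, modify(row.getString("value")))              // hypothetical column and modification
        }
        .saveToCassandra("key_space", "table1", SomeColumns("key", "value"))
    }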

Heap Space Error

2015-09-22 Thread Yusuf Can Gürkan
I run the code below and getting error: val dateUtil = new DateUtil() val usersInputDF = sqlContext.sql( s""" | select userid,concat_ws(' ',collect_list(concat_ws(' ',if(productname is not NULL,lower(productname),''),lower(regexp_replace(regexp_replace(substr(productcategory,2,length(pr

Re: MLlib inconsistent documentation

2015-09-22 Thread Yashwanth Kumar
Hi, I guess the double values are the number of visits rather than a visit flag (obviously it should be more useful than a visit flag, i.e. 1/0). This is based on the assumption that while doing matrix factorisation, ratings trained using implicit feedback cannot be binary, as that gives poor feature values. In tur

spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Daniel Haviv
Hi, We are trying to load around 10k avro files (each file holds only one record) using spark-avro but it takes over 15 minutes to load. It seems that most of the work is being done at the driver, where it creates a broadcast variable for each file. Any idea why it is behaving that way? Thank you.

Re: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Jonathan Coveney
having a file per record is pretty inefficient on almost any file system On Tuesday, September 22, 2015, Daniel Haviv <daniel.ha...@veracity-group.com> wrote: > Hi, > We are trying to load around 10k avro files (each file holds only one > record) using spark-avro but it takes over 15 min

Re: Heap Space Error

2015-09-22 Thread Ted Yu
Have you tried suggestions given in this thread ? http://stackoverflow.com/questions/26256061/using-itext-java-lang-outofmemoryerror-requested-array-size-exceeds-vm-limit Can you pastebin complete stack trace ? What release of Spark are you using ? Cheers > On Sep 22, 2015, at 4:28 AM, Yusuf C

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Adrian Tanase
By looking through the docs and source code, I think you can get away with rdd.zipWithIndex to get the index of each line in the file, as long as you define the parallelism upfront: sc.textFile("README.md", 4) You can then just do .groupBy(…).mapValues(_.sortBy(…).head) - I’m skimming through s
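
A sketch of that pipeline, with a hypothetical extractKey function standing in for however the key is parsed out of each line:

    val lines = sc.textFile("README.md", 4)     // parallelism defined upfront, as above

    val firstPerKey = lines
      .zipWithIndex()                                                 // (line, original line number)
      .map { case (line, index) => (extractKey(line), (line, index)) }
      .groupByKey()
      .mapValues(_.toSeq.sortBy(_._2).head._1)                        // keep the earliest line per key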

Re: Invalid checkpoint url

2015-09-22 Thread Adrian Tanase
Have you tried simply ssc.checkpoint("checkpoint")? This should create it in the local folder; it has always worked for me when developing in local mode. For the others (/tmp/..) make sure you have rights to write there. -adrian From: srungarapu vamsi Date: Tuesday, September 22, 2015 at 7:59

RE: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-22 Thread java8964
Or at least tell us how many partitions you are using. Yong > Date: Tue, 22 Sep 2015 02:06:15 -0700 > From: belevts...@gmail.com > To: user@spark.apache.org > Subject: Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables > > Could it be that your data is skewed? Do you have variable-len

Re: Why is 1 executor overworked and other sit idle?

2015-09-22 Thread Ted Yu
Have you tried using repartition to spread the load ? Cheers > On Sep 22, 2015, at 4:22 AM, Chirag Dewan wrote: > > Hi, > > I am using Spark to access around 300m rows in Cassandra. > > My job is pretty simple as I am just mapping my row into a CSV format and > saving it as a text file. >

Re: Spark Streaming distributed job

2015-09-22 Thread Adrian Tanase
I think you need to dig into the custom receiver implementation. As long as the source is distributed and partitioned, the downstream .map, .foreachXX are all distributed as you would expect. You could look at how the “classic” Kafka receiver is instantiated in the streaming guide and try to st

Py4j issue with Python Kafka Module

2015-09-22 Thread ayan guha
Hi I have added spark assembly jar to SPARK CLASSPATH >>> print os.environ['SPARK_CLASSPATH'] D:\sw\spark-streaming-kafka-assembly_2.10-1.3.0.jar Now I am facing below issue with a test topic >>> ssc = StreamingContext(sc, 2) >>> kvs = KafkaUtils.createDirectStream(ssc,['spark'],{"metadata.br

Re: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Daniel Haviv
I agree, but it's a constraint I have to deal with. The idea is to load these files and merge them into ORC. When using Hive on Tez it takes less than a minute. Daniel > On 22 Sep 2015, at 16:00, Jonathan Coveney wrote: > > having a file per record is pretty inefficient on almost any file system

Performance Spark SQL vs Dataframe API faster

2015-09-22 Thread sanderg
Is there a difference in performance between writing a spark job using only SQL statements and writing it using the dataframe api or does it translate to the same thing under the hood? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Performance-Spark-SQL-vs-

spark on mesos gets killed by cgroups for too much memory

2015-09-22 Thread oggie
I'm using spark 1.2.2 on mesos 0.21. I have a java job that is submitted to mesos from marathon. I also have cgroups configured for mesos on each node. Even though the job, when running, uses 512MB, it tries to take over 3GB at startup and is killed by cgroups. When I start mesos-slave, it's sta

RE: Performance Spark SQL vs Dataframe API faster

2015-09-22 Thread Cheng, Hao
Yes, should be the same, as they are just different frontend, but the same thing in optimization / execution. -Original Message- From: sanderg [mailto:s.gee...@wimionline.be] Sent: Tuesday, September 22, 2015 10:06 PM To: user@spark.apache.org Subject: Performance Spark SQL vs Dataframe

Re: Invalid checkpoint url

2015-09-22 Thread srungarapu vamsi
Yes, I tried ssc.checkpoint("checkpoint"), it works for me as long as i don't use reduceByKeyAndWindow. When i start using "reduceByKeyAndWindow" it complains me with the error "Exception in thread "main" org.apache.spark.SparkException: Invalid checkpoint directory: file:/home/ubuntu/checkpoint/3

Help getting started with Kafka

2015-09-22 Thread Yana Kadiyska
Hi folks, I'm trying to write a simple Spark job that dumps out a Kafka queue into HDFS. Being very new to Kafka, not sure if I'm messing something up on that side...My hope is to read the messages presently in the queue (or at least the first 100 for now) Here is what I have: Kafka side: ./bin/

Error while saving parquet

2015-09-22 Thread gtinside
Please refer to the code snippet below . I get following error */tmp/temp/trade.parquet/part-r-00036.parquet is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [20, -28, -93, 93] at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:418) at org

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Philip Weaver
Thanks. If textFile can be used in a way that preserves order, then both the partition index and the index within each partition should be consistent, right? I overcomplicated the question by asking about removing duplicates. Fundamentally I think my question is, how does one sort lines in a file

Re: Help getting started with Kafka

2015-09-22 Thread Cody Koeninger
You need type parameters for the call to createRDD indicating the type of the key / value and the decoder to use for each. See https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/BasicRDD.scala Also, you need to check to see if offsets 0 through 100 are still actua
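
A sketch of the call with the type parameters filled in, assuming plain String keys and values (broker address, topic and offsets are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")      // placeholder broker
    val offsetRanges = Array(OffsetRange.create("mytopic", 0, 0L, 100L))   // partition 0, offsets [0, 100)

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)

    println(rdd.count())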

RE: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread java8964
Your performance problem sounds like it is in the driver, which is trying to broadcast 10k files by itself alone, which becomes the bottleneck. What you want is just to transfer the data from AVRO format per file to another format. In MR, most likely each mapper processes one file, and you utilized the w

Re: Help getting started with Kafka

2015-09-22 Thread Yana Kadiyska
Thanks a lot Cody! I was punting on the decoders by calling count (or trying to, since my types require a custom decoder) but your sample code is exactly what I was trying to achieve. The error message threw me off, will work on the decoders now On Tue, Sep 22, 2015 at 10:50 AM, Cody Koeninger wr

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Sean Owen
I don't know of a way to do this, out of the box, without maybe digging into custom InputFormats. The RDD from textFile doesn't have an ordering. I can't imagine a world in which partitions weren't iterated in line order, of course, but there's also no real guarantee about ordering among partitions

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Adrian Tanase
just give zipWithIndex a shot, use it early in the pipeline. I think it provides exactly the info you need, as the index is the original line number in the file, not the index in the partition. Sent from my iPhone On 22 Sep 2015, at 17:50, Philip Weaver <philip.wea...@gmail.com> wrote:

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Philip Weaver
I have used the mapPartitionsWithIndex/zipWithIndex solution and so far it has done the correct thing. On Tue, Sep 22, 2015 at 8:38 AM, Adrian Tanase wrote: > just give zipWithIndex a shot, use it early in the pipeline. I think it > provides exactly the info you need, as the index is the origina

Apache Spark job in local[*] is slower than regular 1-thread Python program

2015-09-22 Thread juljoin
Hello, I am trying to figure Spark out and I still have some problems with its speed that I can't figure out. In short, I wrote two programs that loop through a 3.8Gb file and filter each line depending on whether a certain word is present. I wrote a one-thread python program doing the job and I obt

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Sean Owen
The point is that this only works if you already knew the file was presented in order within and across partitions, which was the original problem anyway. I don't think it is in general, but in practice, I do imagine it's already in the expected order from textFile. Maybe under the hood this ends u

Re: Spark Web UI + NGINX

2015-09-22 Thread Ruslan Dautkhanov
It should be really simple to set up. Check this Hue + NGINX setup page http://gethue.com/using-nginx-to-speed-up-hue-3-8-0/ In that config file change 1) > server_name NGINX_HOSTNAME; to "Machine A, with a public IP" 2) > server HUE_HOST1: max_fails=3; > server HUE_HOST2: max_fails=3; to

Re: Using Spark for portfolio manager app

2015-09-22 Thread Thúy Hằng Lê
That's a great answer, Adrian. I found a lot of information here. I have a direction for the application now, I will try your suggestion :) On Tuesday, September 22, 2015, Adrian Tanase wrote: > >1. reading from kafka has exactly once guarantees - we are using it in >production today (wi

Re: Py4j issue with Python Kafka Module

2015-09-22 Thread Saisai Shao
I think you're using the wrong version of the kafka assembly jar; the Python API for the direct Kafka stream is not supported in Spark 1.3.0, you'd better change to version 1.5.0. It looks like you're using Spark 1.5.0, so why did you choose Kafka assembly 1.3.0? D:\sw\spark-streaming-kafka-assembly_2.10-1.3

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Philip Weaver
The indices are definitely necessary. My first solution was just reduceByKey { case (v, _) => v } and that didn't work. I needed to look at both values and see which had the lower index. On Tue, Sep 22, 2015 at 8:54 AM, Sean Owen wrote: > The point is that this only works if you already knew the
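
Spelled out, the index-aware reduce looks roughly like this (extractKey is again a hypothetical parser for whatever the key is):

    val deduped = sc.textFile("input.txt")
      .zipWithIndex()
      .map { case (line, index) => (extractKey(line), (line, index)) }
      .reduceByKey { case (a, b) => if (a._2 <= b._2) a else b }    // keep the pair with the lower index
      .mapValues { case (line, _) => line }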

Re: spark + parquet + schema name and metadata

2015-09-22 Thread Cheng Lian
I see, this makes sense. We should probably add this in Spark SQL. However, there's one corner case to note about user-defined Parquet metadata. When committing a write job, ParquetOutputCommitter writes Parquet summary files (_metadata and _common_metadata), and user-defined key-value metadat

Re: SparkR - calling as.vector() with rdd dataframe causes error

2015-09-22 Thread Luciano Resende
localDF is a pure R data frame and as.vector will work with no problems. As for calling it on the SparkR objects, try calling collect before you call as.vector (or in your case, the algorithms); that should solve your problem. On Mon, Sep 21, 2015 at 8:48 AM, Ellen Kraffmiller < ellen.kraffmil...@

Re: Creating BlockMatrix with java API

2015-09-22 Thread Pulasthi Supun Wickramasinghe
Hi Yanbo, Thanks for the reply. I thought I might be missing something. Anyway, I moved to using Scala since it is the complete API. Best Regards, Pulasthi On Tue, Sep 22, 2015 at 7:03 AM, Yanbo Liang wrote: > This is due to the distributed matrices like > BlockMatrix/RowMatrix/IndexedRowMatri

Re: How to speed up MLlib LDA?

2015-09-22 Thread Pedro Rodriguez
I helped some with the LDA and worked quite a bit on a Gibbs version. I don't know if the Gibbs version might help, but since it is not (yet) in MLlib, Intel Analytics kindly created a spark package with their adapted version plus a couple other LDA algorithms: http://spark-packages.org/package/int

Re: Count for select not matching count for group by

2015-09-22 Thread Michael Armbrust
This looks like something is wrong with predicate pushdown. Can you include the output of calling explain, and tell us what format the data is stored in? On Mon, Sep 21, 2015 at 8:06 AM, Michael Kelly wrote: > Hi, > > I'm seeing some strange behaviour with spark 1.5, I have a dataframe > that I

Re: spark + parquet + schema name and metadata

2015-09-22 Thread Cheng Lian
Michael reminded me that although we don't support direct manipulation over Parquet metadata, you can still save/query metadata to/from Parquet via DataFrame per-column metadata. For example: import sqlContext.implicits._ import org.apache.spark.sql.types.MetadataBuilder val path = "file:///tm
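
Continuing past the truncation with a sketch of the per-column metadata round trip (the path and key names are placeholders, and this assumes the metadata survives the Parquet round trip as described):

    import org.apache.spark.sql.types.MetadataBuilder
    import sqlContext.implicits._

    val path = "file:///tmp/with_metadata.parquet"           // placeholder path

    val metadata = new MetadataBuilder()
      .putString("schemaName", "trades")                     // hypothetical key/value tags
      .putString("owner", "analytics")
      .build()

    val df = sqlContext.range(10).select($"id".as("id", metadata))
    df.write.parquet(path)

    // Read it back: the metadata travels with the column schema
    val loaded = sqlContext.read.parquet(path)
    println(loaded.schema("id").metadata.getString("schemaName"))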

Re: How to speed up MLlib LDA?

2015-09-22 Thread Charles Earl
It seems that the Vowpal Wabbit version is most similar to what is in https://github.com/intel-analytics/TopicModeling/blob/master/src/main/scala/org/apache/spark/mllib/topicModeling/OnlineHDP.scala Although the Intel seems to implement the Hierarchical Dirichlet Process (topics and subtopics) as

Re: Has anyone used the Twitter API for location filtering?

2015-09-22 Thread Jo Sunad
Thanks Akhil, but I can't seem to get any tweets that include location data. For example, when I do stream.filter(status => status.getPlace().getName) and run the code for 20 minutes I only get null values. It seems like Twitter might purposely be removing the Place for free users? On Tue, Sep 22

Re: How to speed up MLlib LDA?

2015-09-22 Thread Marko Asplund
How optimized are the Commons math3 methods that showed up in profiling? Are there any higher performance alternatives to these? marko

Spark 1.5 UDAF ArrayType

2015-09-22 Thread Deenar Toraskar
Hi I am trying to write a UDAF ArraySum that does an element-wise sum of arrays of Doubles, returning an array of Doubles, following the sample in https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html. I am getting the following
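
For context, a minimal sketch of such a UDAF against the Spark 1.5 UserDefinedAggregateFunction API, assuming fixed-length arrays of non-null Doubles; whether it runs as-is on 1.5.0 depends on the bug discussed in the reply below:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    // Element-wise sum of arrays of Doubles, all assumed to have the given length.
    class ArraySum(length: Int) extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("values", ArrayType(DoubleType)) :: Nil)
      def bufferSchema: StructType = StructType(StructField("sums", ArrayType(DoubleType)) :: Nil)
      def dataType: DataType = ArrayType(DoubleType)
      def deterministic: Boolean = true

      def initialize(buffer: MutableAggregationBuffer): Unit =
        buffer(0) = Seq.fill(length)(0.0)

      def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
        val sums = buffer.getAs[Seq[Double]](0)
        val values = input.getAs[Seq[Double]](0)
        buffer(0) = sums.zip(values).map { case (s, v) => s + v }
      }

      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
        val a = buffer1.getAs[Seq[Double]](0)
        val b = buffer2.getAs[Seq[Double]](0)
        buffer1(0) = a.zip(b).map { case (x, y) => x + y }
      }

      def evaluate(buffer: Row): Any = buffer.getAs[Seq[Double]](0)
    }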

Re: Spark 1.5 UDAF ArrayType

2015-09-22 Thread Michael Armbrust
I think that you are hitting a bug (which should be fixed in Spark 1.5.1). I'm hoping we can cut an RC for that this week. Until then you could try building branch-1.5. On Tue, Sep 22, 2015 at 11:13 AM, Deenar Toraskar wrote: > Hi > > I am trying to write an UDAF ArraySum, that does element wis

How to share memory in a broadcast between tasks in the same executor?

2015-09-22 Thread Clément Frison
Hello, My team and I have a 32-core machine and we would like to use a huge object - for example a large dictionary - in a map transformation and use all our cores in parallel by sharing this object among some tasks. We broadcast our large dictionary. dico_br = sc.broadcast(dico) We use it in a

Re: How to share memory in a broadcast between tasks in the same executor?

2015-09-22 Thread Utkarsh Sengar
If the broadcast variable doesn't fit in memory, I think it is not the right fit for you. You can think about fitting it with an RDD as a tuple with other data you are working on. Say you are working on RDD (rdd in your case); run a map/reduce to convert it to RDD> so now you have relevant data from the
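
A sketch of that join-based alternative, with hypothetical names for the dictionary and the key function:

    // Represent the large dictionary as a keyed RDD instead of a broadcast variable.
    val dicoRDD = sc.parallelize(dico.toSeq)                  // dico: Map[String, Payload], hypothetical

    val enriched = rdd
      .map(record => (extractWord(record), record))           // extractWord is a hypothetical key function
      .join(dicoRDD)                                          // (word, (record, payload))
      .map { case (_, (record, payload)) => process(record, payload) }   // hypothetical processing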

pyspark question: create RDD from csr_matrix

2015-09-22 Thread jeff saremi
I've tried desperately to create an RDD from a matrix I have. Every combination failed. I have a sparse matrix returned from a call to dv = DictVectorizer(); sv_tf = dv.fit_transform(tf) which is supposed to be a matrix of document terms and their frequencies. I need to convert this to an

Spark as standalone or with Hadoop stack.

2015-09-22 Thread Shiv Kandavelu
Hi All, We currently have a Hadoop cluster with Yarn as the resource manager. We are planning to use HBase as the data store due to the C-P aspects of the CAP Theorem. We now want to do extensive data processing, both on stored data in HBase as well as stream processing from online website / API. W

Re: Spark as standalone or with Hadoop stack.

2015-09-22 Thread Sean Owen
Who told you Mesos would make Spark 100x faster? does it make sense that just the resource manager could make that kind of difference? This sounds entirely wrong, or, maybe a mishearing. I don't know if Mesos is somehow easier to use with Cassandra, but it's relatively harder to use it with HBase,

Re: Deploying spark-streaming application on production

2015-09-22 Thread Adrian Tanase
btw I re-read the docs and I want to clarify that reliable receiver + WAL gives you at least once, not exactly once semantics. Sent from my iPhone On 21 Sep 2015, at 21:50, Adrian Tanase <atan...@adobe.com> wrote: I'm wondering, isn't this the canonical use case for WAL + reliable recei

unsubscribe

2015-09-22 Thread Stuart Layton
-- Stuart Layton

Re: Spark 1.5 UDAF ArrayType

2015-09-22 Thread Michael Armbrust
Check out: http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types On Tue, Sep 22, 2015 at 12:49 PM, Deenar Toraskar < deenar.toras...@thinkreactive.co.uk> wrote: > Michael > > Thank you for your prompt answer. I will repost after I try this again on > 1.5.1 or branch-1.5. In ad

Re: Spark as standalone or with Hadoop stack.

2015-09-22 Thread Ted Yu
bq. it's relatively harder to use it with HBase I agree with Sean. I work on HBase. To my knowledge, no one runs HBase on top of Mesos. On Tue, Sep 22, 2015 at 12:31 PM, Sean Owen wrote: > Who told you Mesos would make Spark 100x faster? does it make sense > that just the resource manager could

KafkaProducer using Cassandra as source

2015-09-22 Thread kali.tumm...@gmail.com
Hi All, I am a newbie in Spark. I managed to write a Kafka producer in Spark where data comes from a Cassandra table, but I have a few questions. The Spark data output from Cassandra looks like below. [2,Joe,Smith] [1,Barack,Obama] I would like something like this in my kafka consumer, so need to rem

Re: Spark 1.5 UDAF ArrayType

2015-09-22 Thread Deenar Toraskar
Michael Thank you for your prompt answer. I will repost after I try this again on 1.5.1 or branch-1.5. In addition a blog post on SparkSQL data types would be very helpful. I am familiar with the Hive data types, but there is very little documentation on Spark SQL data types. Regards Deenar On 2

Re: How to share memory in a broadcast between tasks in the same executor?

2015-09-22 Thread Deenar Toraskar
Clement, In local mode all worker threads run in the driver VM. Your dictionary should not be copied 32 times; in fact it won't be broadcast at all. Have you tried increasing spark.driver.memory to ensure that the driver uses all the memory on the machine? Deenar On 22 September 2015 at 19:42, Clé

Partitions on RDDs

2015-09-22 Thread XIANDI
I'm always confused by partitions. We may have many RDDs in the code. Do we need to partition all of them? Do the RDDs get rearranged among all the nodes whenever we do a partition? What is a wise way of doing partitions? -- View this message in context: http://apache-spark-user-list.100

HDP 2.3 support for Spark 1.5.x

2015-09-22 Thread Krishna Sankar
Guys, - We have HDP 2.3 installed just now. It comes with Spark 1.3.x. The current wisdom is that it will support the 1.4.x train (which is good, need DataFrame et al). - What is the plan to support Spark 1.5.x ? Can we install 1.5.0 on HDP 2.3 ? Or will Spark 1.5.x support be in HD

SPARK_WORKER_INSTANCES was detected (set to '2')…This is deprecated in Spark 1.0+

2015-09-22 Thread Jacek Laskowski
Hi, This is for Spark 1.6.0-SNAPSHOT (SHA1 a96ba40f7ee1352288ea676d8844e1c8174202eb). I've been toying with Spark Standalone cluster and have the following file in conf/spark-env.sh: ➜ spark git:(master) ✗ cat conf/spark-env.sh SPARK_WORKER_CORES=2 SPARK_WORKER_MEMORY=2g # multiple Spark worke

Re: Spark as standalone or with Hadoop stack.

2015-09-22 Thread Jacek Laskowski
On Tue, Sep 22, 2015 at 10:03 PM, Ted Yu wrote: > To my knowledge, no one runs HBase on top of Mesos. Hi, That sentence caught my attention. Could you explain the reasons for not running HBase on Mesos, i.e. what makes Mesos inappropriate for HBase? Jacek -

Re: Partitions on RDDs

2015-09-22 Thread Richard Eggert
In general, RDDs get partitioned automatically without programmer intervention. You generally don't need to worry about them unless you need to adjust the size/number of partitions or the partitioning scheme according to the needs of your application. Partitions get redistributed among nodes whene
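
A short sketch of the adjustments Richard mentions, changing the number of partitions and choosing a partitioning scheme (the numbers, path and key choice are arbitrary):

    import org.apache.spark.HashPartitioner

    val rdd = sc.textFile("hdfs:///data/input", 200)         // hint at the initial partition count

    // Only change how many partitions there are (incurs a full shuffle):
    val wider = rdd.repartition(400)

    // Co-locate records by key, so later joins/aggregations on that key avoid another shuffle:
    val byKey = rdd.map(line => (line.take(8), line))        // arbitrary key choice
      .partitionBy(new HashPartitioner(400))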

Re: Invalid checkpoint url

2015-09-22 Thread Tathagata Das
Are you getting this error in local mode? On Tue, Sep 22, 2015 at 7:34 AM, srungarapu vamsi wrote: > Yes, I tried ssc.checkpoint("checkpoint"), it works for me as long as i > don't use reduceByKeyAndWindow. > > When i start using "reduceByKeyAndWindow" it complains me with the error > "Exceptio

Re: HDP 2.3 support for Spark 1.5.x

2015-09-22 Thread Zhan Zhang
Hi Krishna, For the time being, you can download from upstream, and it should run OK on HDP 2.3. For HDP-specific problems, you can ask in the Hortonworks forum. Thanks. Zhan Zhang On Sep 22, 2015, at 3:42 PM, Krishna Sankar <ksanka...@gmail.com> wrote: Guys, * We have HDP 2.3

Re: WAL on S3

2015-09-22 Thread Michal Čizmazia
I am trying to use pluggable WAL, but it can be used only with checkpointing turned on. Thus I still need to have a Hadoop-compatible file system. Is there something like pluggable checkpointing? Or can WAL be used without checkpointing? What happens when WAL is available but the checkpoint director

Re: WAL on S3

2015-09-22 Thread Tathagata Das
1. Currently, the WAL can be used only with checkpointing turned on, because it does not make sense to recover from WAL if there is not checkpoint information to recover from. 2. Since the current implementation saves the WAL in the checkpoint directory, they share the fate -- if checkpoint direct

Re: Apache Spark job in local[*] is slower than regular 1-thread Python program

2015-09-22 Thread Richard Eggert
Maybe it's just my phone, but I don't see any code. On Sep 22, 2015 11:46 AM, "juljoin" wrote: > Hello, > > I am trying to figure Spark out and I still have some problems with its > speed, I can't figure them out. In short, I wrote two programs that loop > through a 3.8Gb file and filter each li

Re: Why is 1 executor overworked and other sit idle?

2015-09-22 Thread Richard Eggert
If there's only one partition, by definition it will only be handled by one executor. Repartition to divide the work up. Note that this will also result in multiple output files, however. If you absolutely need them to be combined into a single file, I suggest using the Unix/Linux 'cat' command t
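
A sketch of that suggestion, assuming the Cassandra scan comes back in too few partitions; the partition count and output path are arbitrary, rowToCsv stands in for the existing mapping function, and the resulting part-* files can be combined afterwards with cat as Richard recommends:

    val csvLines = cassandraRows.map(rowToCsv)               // rowToCsv: the existing mapping function

    // Spread the mapping and writing across executors; produces many part-* files:
    csvLines.repartition(64).saveAsTextFile("hdfs:///out/csv")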

Re: WAL on S3

2015-09-22 Thread Michal Čizmazia
My understanding of pluggable WAL was that it eliminates the need for having a Hadoop-compatible file system [1]. What is the use of pluggable WAL when it can be only used together with checkpointing which still requires a Hadoop-compatible file system? [1]: https://issues.apache.org/jira/browse/
