Hi,
When we run MLlib word2vec (spark-1.1.0), the driver gets stuck at 100% CPU
usage. Here is the jstack output:
"main" prio=10 tid=0x40112800 nid=0x46f2 runnable
[0x4162e000]
java.lang.Thread.State: RUNNABLE
at
java.io.ObjectOutputStream$BlockDataOutputStream.drain(Object
Hi,
I have set up a Spark 1.2 standalone cluster and am trying to run Hive on Spark
by following the link below.
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
I got the latest build of Hive on Spark from git and was trying to run a few
queries. Queries are runn
Can you confirm your EMR version? Could it be because of the classpath
entries for EMRFS? You might face issues using S3 without them.
Thanks,
Aniket
On Mon, Jan 5, 2015, 11:16 AM Adam Gilmore wrote:
> Just an update on this - I found that the script by Amazon was the culprit
> - not exact
@Brad, I'm guessing that the additional memory usage is coming from the
shuffle performed by coalesce, so that at least explains the memory blowup.
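For what it's worth, a quick sketch of the distinction (here `data` just stands in for the loaded RDD): repartition(n) is simply coalesce(n, shuffle = true), while a plain coalesce cannot increase the partition count without a shuffle.

val repartitioned = data.repartition(256)              // always shuffles
val alsoShuffled  = data.coalesce(256, shuffle = true) // equivalent to repartition(256)
val narrowed      = data.coalesce(30)                  // no shuffle, can only reduce partitions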
On Sun, Jan 4, 2015 at 10:16 PM, Akhil Das
wrote:
> You can try:
>
> - Using KryoSerializer
> - Enabling RDD Compression
> - Setting storage type to
What are you trying to do? Can you paste the whole code? I used to see this
sort of exception when I closed the fs object inside map/mapPartitions, etc.
Thanks
Best Regards
On Mon, Jan 5, 2015 at 6:43 AM, Michael Albert <
m_albert...@yahoo.com.invalid> wrote:
> Greetings!
>
> So, I think I have dat
You can try:
- Using KryoSerializer
- Enabling RDD Compression
- Setting storage type to MEMORY_ONLY_SER or MEMORY_AND_DISK_SER
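For example, a minimal sketch of those settings (the app name and input path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("MemoryTuningExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.rdd.compress", "true")   // compress serialized RDD partitions
val sc = new SparkContext(conf)

// Keep partitions as serialized bytes in memory, spilling to disk when they don't fit.
val data = sc.textFile("hdfs:///path/to/json")
data.persist(StorageLevel.MEMORY_AND_DISK_SER)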
Thanks
Best Regards
On Sun, Jan 4, 2015 at 11:53 PM, Brad Willard wrote:
> I have a 10 node cluster with 600gb of ram. I'm loading a fairly large
> dataset from json
Just an update on this - I found that the script by Amazon was the culprit
- not exactly sure why. When I installed Spark manually onto the EMR (and
did the manual configuration of all the EMR stuff), it worked fine.
On Mon, Dec 22, 2014 at 11:37 AM, Adam Gilmore
wrote:
> Hi all,
>
> I've just
Boromir,
You may like to take a look at how we make Spray and Spark work together
at the PredictionIO project: https://github.com/PredictionIO/PredictionIO
Simon
On Sun, Jan 4, 2015 at 8:31 PM, Chester At Work
wrote:
> Just a guess here, may not be correct.
>
> Spray needs to start akka
I saw that in the source, which is why I was wondering.
I was mainly reading:
http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/
"A query that tries to parse the organizationId and userId from the 2
logTypes should be able to do so correctly, though they are positioned
differently
I'm trying to compare the performance of Spark running on Mesos vs YARN.
However, I am having trouble configuring the Spark workload to
run in a similar way on Mesos and YARN.
When running Spark on YARN, you can specify the number of executors per
node. So if I have a node with 4 CPUs
Hello,
It really depends on your requirements: what kind of machine learning
algorithm, your budget, whether you are doing something really new or
integrating it with an existing application, etc. You can run MongoDB as
well as a cluster. I don't think this question can be answered generally,
but depe
The issue is happening when I try to concatenate column values in the query
like "col1+'$'+col3". For some reason, this issue is not manifesting itself
when I do a single IF query.
Is there a concat function in SparkSQL? I can't find anything in the
documentation.
Thanks, RK
On Sunday, Ja
Sure, here is a ticket. https://issues.apache.org/jira/browse/SPARK-5081
--- Original Message ---
Sender : Josh Rosen
Date : 2015-01-05 06:14 (GMT+09:00)
Title : Re: Shuffle write increases in spark 1.2
If you have a small reproduction for this issue, can you open a ticke
Hello,
I am trying to launch a Spark app (client mode for a standalone cluster) from
a Spray server, using the following code.
When I run it as
$> java -cp SprayServer
the SimpleApp.getA() call from SprayService returns -1 (which means it sees
the logData RDD as null for HTTP requests), but the s
BTW, I am seeing this issue in Spark 1.1.1.
On Sunday, January 4, 2015 7:29 PM, RK wrote:
When I use a single IF statement like "select IF(col1 != "", col1+'$'+col3,
col2+'$'+col3) from my_table", it works fine.
However, when I use a nested IF like "select IF(col1 != "", col1+'$'+col
When I use a single IF statement like "select IF(col1 != "", col1+'$'+col3,
col2+'$'+col3) from my_table", it works fine.
However, when I use a nested IF like "select IF(col1 != "", col1+'$'+col3,
IF(col2 != "", col2+'$'+col3, '$')) from my_table", I am getting the following
exception.
Exception
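If the goal is plain string concatenation, the Hive concat UDF may be worth trying instead of '+'. A sketch, assuming the query is run through a HiveContext (table and column names taken from the query above):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("NestedIfConcat"))
val hiveCtx = new HiveContext(sc)

// Same nested IF, but with concat() doing the string concatenation.
val result = hiveCtx.sql(
  "SELECT IF(col1 != '', concat(col1, '$', col3), " +
  "IF(col2 != '', concat(col2, '$', col3), '$')) FROM my_table")
result.collect().foreach(println)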
Hi,
I wonder if anyone knows when a Python API will be added for Gradient Boosted
Trees? I see that Java and Scala APIs were added for the 1.2 release, and would
love to be able to build GBMs in pyspark too.
cheers
chris
Christopher Thom
QUANTIUM
Level 25, 8 Chifley, 8-12 Chifley Square
Sydney
Recently we have wanted to use Spark to calculate the average shortest path
distance between each reachable pair of nodes in a very big graph.
Has anyone ever tried this? We hope to discuss the problem.
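In case it helps the discussion, one possible approach is GraphX's ShortestPaths, which computes unweighted hop distances from a set of landmark vertices; sampling landmarks gives an approximation of the all-pairs average. A sketch (the edge-list path and sample size are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{GraphLoader, VertexId}
import org.apache.spark.graphx.lib.ShortestPaths

val sc = new SparkContext(new SparkConf().setAppName("AvgShortestPath"))

// Edge list with one "srcId dstId" pair per line.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges.txt")

// Exact all-pairs is usually infeasible on a very big graph, so sample landmark vertices.
val landmarks: Seq[VertexId] = graph.vertices.takeSample(false, 100).map(_._1).toSeq

// Each vertex ends up with a map of hop distances to the landmarks it can reach.
val result = ShortestPaths.run(graph, landmarks)

val distances = result.vertices
  .flatMap { case (_, spMap) => spMap.values.map(_.toDouble) }
  .filter(_ > 0) // drop each landmark's zero distance to itself

println("Approximate average shortest path distance: " + distances.sum / distances.count)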
Greetings!
So, I think I have data saved so that each partition (part-r-0, etc.) is
exactly what I want to translate into an output file of a format not related to
Hadoop.
I believe I've figured out how to tell Spark to read the data set without
re-partitioning (in another post I mentioned this
Greetings!
I would like to know if the code below will read "one-partition-at-a-time",
and whether I am reinventing the wheel.
If I may explain, upstream code has managed (I hope) to save an RDD such
that each partition file (e.g., part-r-0, part-r-1) contains exactly the
data subset whi
Ah, so I guess this *is* still an issue since we needed to use a bitmap for
tracking zero-sized blocks (see
https://issues.apache.org/jira/browse/SPARK-3740; this isn't just a
performance issue; it's necessary for correctness). This will require a
bit more effort to fix, since we'll either have to
If you have a small reproduction for this issue, can you open a ticket at
https://issues.apache.org/jira/browse/SPARK ?
On December 29, 2014 at 7:10:02 PM, Kevin Jung (itsjb.j...@samsung.com) wrote:
Hi all,
The size of the shuffle write shown in the Spark web UI is much different when I
execute
It doesn’t seem like there’s a whole lot of clues to go on here without seeing
the job code. The original "org.apache.spark.SparkException: PairwiseRDD:
unexpected value: List([B@130dc7ad)" error suggests that maybe there’s an issue
with PySpark’s serialization / tracking of types, but it’s har
import org.apache.spark.{SparkConf, SparkContext}

val sconf = new SparkConf().setMaster("local")
  .setAppName("MedicalSideFx-CassandraLogsMessageTypeCount")
val sc = new SparkContext(sconf)
val inputDir = "/path/to/cassandralogs.txt"
sc.textFile(inputDir).map(line => line.replace("\"", "")).map(line =>
(line.split(' ')(0) + " " + line.split(' ')(2
Compiling with YARN set to true is not required for Spark to work with
Hadoop 2.2.0 in standalone mode.
-Sandy
On Fri, Jan 2, 2015 at 12:06 PM, Moep wrote:
> Well that's confusing. I have the same issue. So you're saying I have to
> compile Spark with Yarn set to true to make it work with Hadoo
Go through the Spark API documentation. Basically you have to group by
(date, message_type) and then do a count.
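Roughly, a sketch of that, assuming the log lines look like the samples in the question and leaving aside how the rows are actually read from Cassandra:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("MessageTypeCount"))

// Placeholder input: lines of the form  2014-06-25 12:01:39 "START"
val lines = sc.textFile("/path/to/cassandralogs.txt")

val counts = lines
  .map(_.replace("\"", ""))
  .map { line =>
    val parts = line.split(' ')
    ((parts(0), parts(2)), 1)   // key by (date, message_type)
  }
  .reduceByKey(_ + _)

counts.collect().foreach { case ((date, msgType), n) => println(date + " " + msgType + " " + n) }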
On Sun, Jan 4, 2015, 9:58 PM Dinesh Vallabhdas
wrote:
> A spark cassandra newbie question. Thanks in advance for the help.
> I have a cassandra table with 2 columns message_timestamp(t
I have a 10 node cluster with 600gb of ram. I'm loading a fairly large
dataset from json files. When I load the dataset it is about 200gb however
it only creates 60 partitions. I'm trying to repartition to 256 to increase
cpu utilization however when I do that it balloons in memory to way over 2x
t
The error you provided says that the build was unsuccessful. If you write what
you did (what command you used) and the whole error trace, someone might be able
to help you ...
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-building-spark-1-2-0-tp20961p20
Yeah, but none of the sites open.
On Sun, Jan 4, 2015 at 10:35 PM, Ted Yu wrote:
> Have you used Google to find some way of accessing github :-)
>
>
>
> On Jan 4, 2015, at 8:46 AM, Kartheek.R wrote:
>
> The problem is that my network is not able to access github.com for
> cloning some de
Have you used Google to find some way of accessing github :-)
> On Jan 4, 2015, at 8:46 AM, Kartheek.R wrote:
>
> The problem is that my network is not able to access github.com for cloning
> some dependencies as github is blocked in India. What are the other possible
> ways for this problem
The problem is that my network is not able to access github.com for cloning
some dependencies, as github is blocked in India. What are the other
possible ways around this problem?
Thank you!
On Sun, Jan 4, 2015 at 9:45 PM, Rapelly Kartheek
wrote:
> Hi,
>
> I get the following error when I build sp
A Spark Cassandra newbie question. Appreciate the help.
I have a cassandra table with 2 columns message_timestamp(timestamp) and
message_type(text). The data is of the form
2014-06-25 12:01:39 "START"
2014-06-25 12:02:39 "START"
2014-06-25 12:02:39 "PAUSE"
2014-06-25 14:02:39 "STOP"
A Spark Cassandra newbie question. Thanks in advance for the help. I have a
Cassandra table with 2 columns message_timestamp(timestamp) and
message_type(text). The data is of the form
2014-06-25 12:01:39 "START"
2014-06-25 12:02:39 "START"
2014-06-25 12:02:39 "PAUSE"
2014-06-25 14:02:39 "STOP"
2014
Hi,
I get the following error when I build spark-1.2.0 using sbt:
[error] Nonzero exit code (128): git clone
https://github.com/ScrapCodes/sbt-pom-reader.git
/home/karthik/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader
[error] Use 'last' for the full log.
Any help please?
Thanks
--
V
Hi,
We have recently upgraded to the latest version of Spark and suddenly the Spark
SQL queries are performing badly, i.e. response time has increased from sub-second
to 5-6 seconds. What information do I need to provide so that I can find the
reason for this performance degradation?
Regards,
Abhideep
Hi J_soft,
mvn does not produce a tar package by default. You get many jar files - each
project has its own jar (e.g. mllib has
mllib/target/spark-mllib_2.10-1.2.0.jar).
However, if you want one big tar package with all dependencies, look here:
https://github.com/apache/spark/tree/master/assembly
An
Hi Pengcheng YIN,
RDD cache / persist calls do not trigger evaluation.
The unpersist call is blocking (it does have an async flavor, but I am not sure
what the SLAs on its behavior are).
val rdd = sc.textFile().map()
rdd.persist() // This does not trigger actual storage
while(true){
val count = rdd.filt
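For reference, a self-contained version of that pattern (the input path and transformations are placeholders); nothing is read or cached until the first action runs:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("PersistExample"))

val rdd = sc.textFile("/path/to/input").map(_.toUpperCase)
rdd.persist()                              // only marks the RDD for caching
val first = rdd.filter(_.nonEmpty).count() // first action: evaluates and fills the cache
val second = rdd.count()                   // reuses the cached partitions
rdd.unpersist()                            // blocking by default; unpersist(blocking = false) is async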
Please check the document added by Andrew. I could run tasks with Spark 1.2.0.
*
https://github.com/apache/spark/pull/3731/files#diff-c3cbe4cabe90562520f22d2306aa9116R86
*
https://github.com/apache/spark/pull/3757/files#diff-c3cbe4cabe90562520f22d2306aa9116R101
Thanks,
- Tsuyoshi
On Sun, Jan 4
Now I understand your requirement; maybe there are some limitations in the current
MetricsSystem, and I think we can improve it as well.
Thanks
Jerry
From: Enno Shioji [mailto:eshi...@gmail.com]
Sent: Sunday, January 4, 2015 5:46 PM
To: Shao, Saisai
Cc: user@spark.apache.org
Subject: Re: Better way of mea
Hi Jerry, thanks for your answer.
I had looked at MetricsSystem, but I couldn't see how I could use it in my
use case, which is:
stream
  .map { i =>
    Metriker.mr.meter(Metriker.metricName("testmetric123")).mark(i)
    i * 2
  }
From what I can see, a Source