RE: Possible long lineage issue when using DStream to update a normal RDD

2015-05-08 Thread Shao, Saisai
IIUC, only checkpoint will clean the lineage information; cache will not cut the lineage. Also, checkpoint will put the data in HDFS, not on local disk :) I think you can use foreachRDD to do such RDD update work; it looks OK as far as I can tell from your code snippet. From: Chunnan Yao [mailto:yaochun...@gmail.c
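A minimal sketch of the foreachRDD-based update described above, with periodic checkpointing to truncate the lineage. The DStream type, the reduceByKey merge logic and the checkpoint interval are assumptions, not from the thread; sc.setCheckpointDir must already point at an HDFS directory.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    def keepRunningTotals(dstream: DStream[(String, Int)], init: RDD[(String, Int)]): Unit = {
      var stateRdd = init
      var batches = 0
      dstream.foreachRDD { batchRdd =>
        stateRdd = stateRdd.union(batchRdd).reduceByKey(_ + _)
        batches += 1
        if (batches % 10 == 0) {
          stateRdd.checkpoint() // cut the growing lineage; data goes to the checkpoint dir (HDFS)
          stateRdd.count()      // force materialization so the checkpoint actually happens
        }
      }
    }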

Re: (no subject)

2015-05-08 Thread Akhil Das
Since it's loading 24 records, it could be that your CSV is corrupted (maybe the newline char isn't \n but \r\n, if it comes from a Windows environment. You can check this with *cat -v yourcsvfile.csv | more*). Thanks Best Regards On Fri, May 8, 2015 at 11:23 AM, wrote: > Hi guys, > > I
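If the file does turn out to have Windows (\r\n) line endings, one quick workaround (a sketch, reusing the file name from the thread and assuming a SparkContext `sc`) is to strip the trailing carriage return before splitting fields:

    val lines = sc.textFile("yourcsvfile.csv").map(_.stripSuffix("\r"))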

Re: YARN mode startup takes too long (10+ secs)

2015-05-08 Thread Zoltán Zvara
So does this sleep occur before allocating resources for the first few executors to start the job? On Fri, May 8, 2015 at 6:23 AM Taeyun Kim wrote: > I think I’ve found the (maybe partial, but major) reason. > > > > It’s between the following lines, (it’s newly captured, but essentially > the sam

Re: Master node memory usage question

2015-05-08 Thread Akhil Das
What's your use case and what are you trying to achieve? Maybe there's a better way of doing it. Thanks Best Regards On Fri, May 8, 2015 at 10:20 AM, Richard Alex Hofer wrote: > Hi, > I'm working on a project in Spark and am trying to understand what's going > on. Right now to try and understand

Re: Getting data into Spark Streaming

2015-05-08 Thread Akhil Das
I don't think you can use rawSocketStream since the RSVP is from a web server and you will have to send a GET request first to initialize the communication. You are better off writing a custom receiver for your use case. For a st

updateStateByKey - how to generate a stream of state changes?

2015-05-08 Thread minisaw
Imagine an input stream transformed by updateStateByKey, based on some state. As an output of the transformation, I would like to have a stream of state changes only - not the stream of states themselves. What is the natural way of obtaining such a stream?
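One possible approach (a sketch, not from the thread): carry a "changed" flag inside the state itself and filter the resulting stream down to the keys whose state actually changed in the batch. The Int state, the summing update logic and the `input` DStream are placeholder assumptions.

    // input: DStream[(String, Int)] of (key, increment) pairs -- hypothetical
    val stateStream = input.updateStateByKey[(Int, Boolean)] {
      (values: Seq[Int], prev: Option[(Int, Boolean)]) =>
        val old  = prev.map(_._1).getOrElse(0)
        val next = old + values.sum
        Some((next, next != old))               // remember whether this batch changed the state
    }
    val changesOnly = stateStream
      .filter { case (_, (_, changed)) => changed }
      .mapValues { case (state, _) => state }   // stream of state changes, not of all states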

Re: AWS-Credentials fails with org.apache.hadoop.fs.s3.S3Exception: FORBIDDEN

2015-05-08 Thread Akhil Das
Have a look at this SO question, it has discussion on various ways of accessing S3. Thanks Best Regards On Fri, May 8, 2015 at 1:21 AM, in4maniac wrote: > Hi Guys, > > I think th

SparkStreaming + Flume/PDI+Kafka

2015-05-08 Thread GARCIA MIGUEL, DAVID
Hi! I've been using Spark for the last few months and it is awesome. I'm pretty new to this topic so don't be too harsh on me. Recently I've been doing some simple tests with Spark Streaming for log processing and I'm considering different ETL input solutions such as Flume or PDI+Kafka. My use case

RE: The explanation of input text format using LDA in Spark

2015-05-08 Thread Yang, Yuhao
Hi Cui, Try to read the scala version of LDAExample, https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala The matrix you're referring to is the corpus after vectorization. One example, given a dict, [apple, orange, banana] 3 doc
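As a hypothetical illustration of that vectorization (the documents and counts below are made up): with the dict [apple, orange, banana], each document becomes a vector of term counts, which is the (docId, counts) corpus that MLlib's LDA expects.

    import org.apache.spark.mllib.linalg.Vectors

    val corpus = sc.parallelize(Seq(
      (0L, Vectors.dense(2.0, 0.0, 1.0)),  // doc 0: "apple apple banana"
      (1L, Vectors.dense(0.0, 3.0, 0.0)),  // doc 1: "orange orange orange"
      (2L, Vectors.dense(1.0, 1.0, 1.0))   // doc 2: "apple orange banana"
    ))
    // corpus is the RDD[(Long, Vector)] that LDA.run() takes as input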

[SparkSQL] cannot filter by a DateType column

2015-05-08 Thread Haopu Wang
I want to filter a DataFrame based on a Date column. If the DataFrame object is constructed from a scala case class, it's working (either compare as String or Date). But if the DataFrame is generated by specifying a Schema to an RDD, it doesn't work. Below is the exception and test code. D

RE: YARN mode startup takes too long (10+ secs)

2015-05-08 Thread Taeyun Kim
I think so. In fact, the flow is: allocator.allocateResources() -> sleep -> allocator.allocateResources() -> sleep … But I guess that on the first allocateResources() the allocation is not fulfilled. So sleep occurs. From: Zoltán Zvara [mailto:zoltan.zv...@gmail.com] Sent: Friday, May 08,

Re: AWS-Credentials fails with org.apache.hadoop.fs.s3.S3Exception: FORBIDDEN

2015-05-08 Thread in4maniac
Hi guys... I realised that it was a bug in my code that caused it to break. I was running the filter on a SchemaRDD when I was supposed to be running it on an RDD. But I still don't understand why the stderr was about an S3 request rather than a type-checking error such as "No tuple position

Re: Unable to join table across data sources using sparkSQL

2015-05-08 Thread Ishwardeep Singh
Finally got it working. I was trying to access Hive using the JDBC driver the same way I was trying to access Teradata. It took me some time to figure out that the default sqlContext created by Spark supports Hive and that it uses the hive-site.xml in the Spark conf folder to access Hive. I had to use my data

[SQL][Dataframe] Change data source after saveAsParquetFile

2015-05-08 Thread Peter Rudenko
Hi, I have a question:

    val data = sc.textFile("s3:///")
    val df = data.toDF
    df.saveAsParquetFile("hdfs://")
    df.someAction(...)

If some workers die during someAction, would the recomputation download the files from S3 or from the HDFS Parquet? Thanks, Peter Rudenko

Re: Virtualenv pyspark

2015-05-08 Thread Nicholas Chammas
This is an interesting question. I don't have a solution for you, but you may be interested in taking a look at Anaconda Cluster. It's made by the same people behind Conda (an alternative to pip focused on data science packages) and may offer a better way of

filterRDD and flatMap

2015-05-08 Thread hmaeda
Dear Usergroup, I am struggling to use the SparkR package that comes with Apache Spark 1.4.0. I am having trouble getting the tutorials in the original amplabs-extra/SparkR-pkg working. Please see my stackoverflow question with a bounty for more details...here (http://stackoverflow.com/questions/

Re: [SQL][Dataframe] Change data source after saveAsParquetFile

2015-05-08 Thread ayan guha
From S3. As the dependency of df will be on S3. And because RDDs are not replicated. On 8 May 2015 23:02, "Peter Rudenko" wrote: > Hi, i have a next question: > > val data = sc.textFile("s3:///")val df = data.toDF > df.saveAsParquetFile("hdfs://") > df.someAction(...) > > if during someAction s
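One workaround, a sketch not taken from the thread (paths are placeholders): if later actions should depend on the HDFS copy rather than S3, re-read the saved Parquet file and continue from that DataFrame.

    val data = sc.textFile("s3n://bucket/input")        // placeholder S3 path
    val df = data.toDF()                                 // as in the original snippet
    df.saveAsParquetFile("hdfs:///tmp/data.parquet")     // placeholder HDFS path

    val dfFromHdfs = sqlContext.parquetFile("hdfs:///tmp/data.parquet")
    dfFromHdfs.count()   // recomputing this lineage reads the Parquet copy, not S3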

Re: [SQL][Dataframe] Change data source after saveAsParquetFile

2015-05-08 Thread Peter Rudenko
Hm, thanks. Do you know what this setting means: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L1178 ? Thanks, Peter Rudenko On 2015-05-08 17:48, ayan guha wrote: From S3. As the dependency of df will be on s3. And because rdds are

Submit Spark application in cluster mode and supervised

2015-05-08 Thread James King
I have two hosts host01 and host02 (let's call them) I run one Master and two Workers on host01 I also run one Master and two Workers on host02 Now I have 1 LIVE Master on host01 and a STANDBY Master on host02 The LIVE Master is aware of all Workers in the cluster Now I submit a Spark application

Re: Submit Spark application in cluster mode and supervised

2015-05-08 Thread James King
BTW I'm using Spark 1.3.0. Thanks On Fri, May 8, 2015 at 5:22 PM, James King wrote: > I have two hosts host01 and host02 (lets call them) > > I run one Master and two Workers on host01 > I also run one Master and two Workers on host02 > > Now I have 1 LIVE Master on host01 and a STANDBY Master

Cluster mode and supervised app with multiple Masters

2015-05-08 Thread James King
Why does this not work ./spark-1.3.0-bin-hadoop2.4/bin/spark-submit --class SomeApp --deploy-mode cluster --supervise --master spark://host01:7077,host02:7077 Some.jar With exception: Caused by: java.lang.NumberFormatException: For input string: "7077,host02:7077" It seems to accept only one ma

dependencies on java-netlib and jblas

2015-05-08 Thread John Niekrasz
Newbie question... Can I use any of the main ML capabilities of MLlib in a Java-only environment, without any native library dependencies? According to the documentation, java-netlib provides a JVM fallback. This suggests that native netlib libraries are not required. It appears that such a fall

Re: dependencies on java-netlib and jblas

2015-05-08 Thread Sonal Goyal
Hi John, I have been using MLlib without installing the jblas native dependency. Functionally I have not got stuck. I still need to explore if there are any performance hits. Best Regards, Sonal Founder, Nube Technologies On Fri, May

Re: spark-shell breaks for scala 2.11 (with yarn)?

2015-05-08 Thread Koert Kuipers
i searched the jiras but couldn't find any recent mention of this. let me try with the 1.4.0 branch and see if it goes away... On Wed, May 6, 2015 at 3:05 PM, Koert Kuipers wrote: > hello all, > i build spark 1.3.1 (for cdh 5.3 with yarn) twice: for scala 2.10 and > scala 2.11. i am running on a secu

Re: Master node memory usage question

2015-05-08 Thread Richard Alex Hofer
Our use case is a bit complicated. Essentially our RDD is storing an N x N matrix A (where each value is a row) and we want to do a computation which involves another N x N matrix B. However, we have the unfortunate property that to update the i-th column of A we need to first update columns 1

parallelism on binary file

2015-05-08 Thread tog
Hi, I have an application that currently runs using MR. It currently starts extracting information from a proprietary binary file that is copied to HDFS. The application starts by creating business objects from information extracted from the binary files. Later those objects are used for further pro
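A possible starting point in Spark (a sketch; the path, the parser and the output type are assumptions): sc.binaryFiles gives one record per file as (path, PortableDataStream), so parallelism initially follows the number of files and can be adjusted with repartition afterwards.

    import org.apache.spark.input.PortableDataStream

    // hypothetical parser for the proprietary format
    def parseProprietaryFormat(path: String, bytes: Array[Byte]): Seq[String] = Seq.empty

    val files = sc.binaryFiles("hdfs:///data/input/*.bin")            // one record per file
    val businessObjects = files.flatMap { case (path, stream: PortableDataStream) =>
      parseProprietaryFormat(path, stream.toArray())                  // toArray reads the whole file
    }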

Cassandra number of Tasks

2015-05-08 Thread Vijay Pawnarkar
I am using the Spark Cassandra connector to work with a table with 3 million records. Using .where() API to work with only a certain rows in this table. Where clause filters the data to 1 rows. CassandraJavaUtil.javaFunctions(sparkContext) .cassandraTable(KEY_SPACE, MY_TABLE, CassandraJavaUtil

Re: Submit Spark application in cluster mode and supervised

2015-05-08 Thread Silvio Fiorito
If you’re using multiple masters with ZooKeeper then you should set your master URL to be spark://host01:7077,host02:7077 and the property spark.deploy.recoveryMode=ZOOKEEPER. See here for more info: http://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper From:

Re: [SparkSQL] cannot filter by a DateType column

2015-05-08 Thread Michael Armbrust
What version of Spark are you using? It appears that at least in master we are doing the conversion correctly, but it's possible older versions of applySchema do not. If you can reproduce the same bug in master, can you open a JIRA? On Fri, May 8, 2015 at 1:36 AM, Haopu Wang wrote: > I want to

Spark Cassandra connector number of Tasks

2015-05-08 Thread vijaypawnarkar
I am using the Spark Cassandra connector to work with a table with 3 million records. Using .where() API to work with only a certain rows in this table. Where clause filters the data to 1 rows. CassandraJavaUtil.javaFunctions(sparkContext) .cassandraTable(KEY_SPACE, MY_TABLE, CassandraJavaUtil

Re: [SQL][Dataframe] Change data source after saveAsParquetFile

2015-05-08 Thread Michael Armbrust
That's a feature flag for a new code path for reading parquet files. It's only there in case bugs are found in the old path and will be removed once we are sure the new path is solid. On Fri, May 8, 2015 at 8:04 AM, Peter Rudenko wrote: > Hm, thanks. > Do you know what this setting mean: > https

Re: Map one RDD into two RDD

2015-05-08 Thread anshu shukla
Any update to the above mail? Can anyone tell me the logic - I have to filter tweets and submit tweets with a particular #hashtag1 to SparkSQL databases, and tweets with #hashtag2 will be passed to a sentiment analysis phase. "Problem is how to split the input data in two streams using hashtags." On F

Re: dependencies on java-netlib and jblas

2015-05-08 Thread Sean Owen
Yes, at this point I believe you'll find jblas used for historical reasons, to not change some APIs. I don't believe it's used for much if any computation in 1.4. On May 8, 2015 5:04 PM, "John Niekrasz" wrote: > Newbie question... > > Can I use any of the main ML capabilities of MLlib in a Java-o

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-08 Thread Michal Haris
+dev On 6 May 2015 10:45, "Michal Haris" wrote: > Just wanted to check if somebody has seen similar behaviour or knows what > we might be doing wrong. We have a relatively complex spark application > which processes half a terabyte of data at various stages. We have profiled > it in several ways

Lambda architecture using Apache Spark

2015-05-08 Thread rafac
I am implementing the lambda architecture using Apache Spark for both streaming and batch processing. For real-time queries I'm using Spark Streaming with Cassandra, and for batch queries I am using Spark SQL and Spark MLlib. The problem I'm facing now is: I need to implement one serving layer, i.e

Spark + Kinesis + Stream Name + Cache?

2015-05-08 Thread Mike Trienis
Hi All, I am submitting the assembled fat jar file by the command: bin/spark-submit --jars /spark-streaming-kinesis-asl_2.10-1.3.0.jar --class com.xxx.Consumer -0.1-SNAPSHOT.jar It reads the data file from kinesis using the stream name defined in a configuration file. It turns out that it re

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-08 Thread Josh Rosen
Do you have any more specific profiling data that you can share? I'm curious to know where AppendOnlyMap.changeValue is being called from. On Fri, May 8, 2015 at 1:26 PM, Michal Haris wrote: > +dev > On 6 May 2015 10:45, "Michal Haris" wrote: > > > Just wanted to check if somebody has seen sim

RE: Lambda architecture using Apache Spark

2015-05-08 Thread Mohammed Guller
Why are you not using Cassandra for storing the pre-computed views? Mohammed -Original Message- From: rafac [mailto:rafaelme...@hotmail.com] Sent: Friday, May 8, 2015 1:48 PM To: user@spark.apache.org Subject: Lambda architecture using Apache Spark I am implementing the lambda architec

Spark streaming updating a large window more frequently

2015-05-08 Thread Ankur Chauhan
Hi, I am pretty new to Spark/Spark Streaming so please excuse my naivety. I have a streaming event stream that is timestamped and I would like to aggregate it into, let's say, hourly buckets. Now the simple answer is to use a window operation with a window length of 1 hr and a sliding interval of 1 hr

RE: Spark streaming updating a large window more frequently

2015-05-08 Thread Mohammed Guller
If I understand you correctly, you need Window duration of 1 hour and sliding interval of 5 seconds. Mohammed -Original Message- From: Ankur Chauhan [mailto:achau...@brightcove.com] Sent: Friday, May 8, 2015 2:27 PM To: u...@spark.incubator.apache.org Subject: Spark streaming updating
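A sketch of that suggestion, assuming a hypothetical DStream[(Long, Long)] named `events` of (epochMillis, value) pairs: a 1-hour window re-evaluated every 5 seconds, keyed by the hour each event falls in.

    import org.apache.spark.streaming.{Minutes, Seconds}

    val hourlyAggregates = events
      .map { case (ts, v) => (ts - ts % (60 * 60 * 1000), v) }   // hour bucket -> value
      .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Minutes(60), Seconds(5))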

CREATE TABLE ignores database when using PARQUET option

2015-05-08 Thread Carlos Pereira
Hi, I would like to create a Hive table on top of an existing parquet file as described here: https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html Due to network restrictions, I need to store the metadata definition in a different path than '/user/hive/warehouse', so I

Re: CREATE TABLE ignores database when using PARQUET option

2015-05-08 Thread Michael Armbrust
This is an unfortunate limitation of the datasource api which does not support multiple databases. For parquet in particular (if you aren't using schema merging), you can create a Hive table using STORED AS PARQUET today. I hope to fix this limitation in Spark 1.5. On Fri, May 8, 2015 at 2:41 P
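A sketch of the STORED AS PARQUET route mentioned above, issued through a HiveContext; the hiveContext handle, database, columns and location below are placeholder assumptions.

    hiveContext.sql("""
      CREATE EXTERNAL TABLE mydb.events (id INT, name STRING)
      STORED AS PARQUET
      LOCATION '/data/warehouse/events'
    """)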

Hash Partitioning and Dataframes

2015-05-08 Thread Daniel, Ronald (ELS-SDG)
Hi, How can I ensure that a batch of DataFrames I make are all partitioned based on the value of one column common to them all? For RDDs I would partitionBy a HashPartitioner, but I don't see that in the DataFrame API. If I partition the RDDs that way, then do a toDF(), will the partitioning be
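A sketch of the RDD-side approach the question describes (the case class, join column and partition count are assumptions); whether this co-partitioning survives the toDF() call is exactly what is being asked.

    import org.apache.spark.HashPartitioner
    import sqlContext.implicits._

    case class Record(key: String, value: Double)

    val partitioned = recordsRdd              // hypothetical RDD[Record]
      .keyBy(_.key)                           // key on the column common to all DataFrames
      .partitionBy(new HashPartitioner(64))

    val df = partitioned.values.toDF()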

Re: CREATE TABLE ignores database when using PARQUET option

2015-05-08 Thread Carlos Pereira
Thanks Michael for the quick return. I was looking forward to the automatic schema inference (I think that's what you mean by 'schema merging'?), and I think the STORED AS would still require me to define the table columns, right? Anyway, I am glad to hear you guys are already working to fix this in future r

Re: Hash Partitioning and Dataframes

2015-05-08 Thread Michael Armbrust
What are you trying to accomplish? Internally Spark SQL will add Exchange operators to make sure that data is partitioned correctly for joins and aggregations. If you are going to do other RDD operations on the result of dataframe operations and you need to manually control the partitioning, call

Re: CREATE TABLE ignores database when using PARQUET option

2015-05-08 Thread Michael Armbrust
Actually, I was talking about the support for inferring different but compatible schemata from various files, automatically merging them into a single schema. However, you are right that I think you need to specify the columns / types if you create it as a Hive table. On Fri, May 8, 2015 at 3:11

RE: Hash Partitioning and Dataframes

2015-05-08 Thread Daniel, Ronald (ELS-SDG)
Just trying to make sure that something I know in advance (the joins will always have an equality test on one specific field) is used to optimize the partitioning so the joins only use local data. Thanks for the info. Ron From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, Ma

Re: Spark + Kinesis + Stream Name + Cache?

2015-05-08 Thread Mike Trienis
- [Kinesis stream name]: The Kinesis stream that this streaming application receives from
- The application name used in the streaming context becomes the Kinesis application name
- The application name must be unique for a given account and region.
- The Kinesis backe

Spark SQL: STDDEV working in Spark Shell but not in a standalone app

2015-05-08 Thread barmaley
Given a table registered from a data frame, I'm able to execute queries like sqlContext.sql("SELECT STDDEV(col1) FROM table") from the Spark Shell just fine. However, when I run exactly the same code in a standalone app on a cluster, it throws an exception: "java.util.NoSuchElementException: key not foun

Re: Spark SQL and Hive interoperability

2015-05-08 Thread jdavidmitchell
So, why isn't this comment/question being posted to the list?

Re: Spark SQL: STDDEV working in Spark Shell but not in a standalone app

2015-05-08 Thread Yin Huai
Can you attach the full stack trace? Thanks, Yin On Fri, May 8, 2015 at 4:44 PM, barmaley wrote: > Given a registered table from data frame, I'm able to execute queries like > sqlContext.sql("SELECT STDDEV(col1) FROM table") from Spark Shell just > fine. > However, when I run exactly the same

Re: Spark + Kinesis + Stream Name + Cache?

2015-05-08 Thread Chris Fregly
hey mike- as you pointed out here from my docs, changing the stream name is sometimes problematic due to the way the Kinesis Client Library manages leases and checkpoints, etc in DynamoDB. I noticed this directly while developing the Kinesis connector which is why I highlighted the issue here.

Re: Map one RDD into two RDD

2015-05-08 Thread ayan guha
Do as Evo suggested. Rdd1=rdd.filter, rdd2=rdd.filter On 9 May 2015 05:19, "anshu shukla" wrote: > Any update to above mail > and Can anyone tell me logic - I have to filter tweets and submit tweets > with particular #hashtag1 to SparkSQL databases and tweets with > #hashtag2 will be passed
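A sketch of that two-filter approach on the original tweet stream; the socket source and the hashtag literals are placeholders.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    val tweets = ssc.socketTextStream("localhost", 9999)       // placeholder source

    val forSparkSQL  = tweets.filter(_.contains("#hashtag1"))  // goes to the SparkSQL sink
    val forSentiment = tweets.filter(_.contains("#hashtag2"))  // goes to sentiment analysis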

Re: CREATE TABLE ignores database when using PARQUET option

2015-05-08 Thread ayan guha
I am just wondering if CREATE TABLE supports the syntax CREATE TABLE db.tablename, instead of the two-step process of USE db and then CREATE TABLE tablename? On 9 May 2015 08:17, "Michael Armbrust" wrote: > Actually, I was talking about the support for inferring different but > compatible schemata

Re: Spark + Kinesis + Stream Name + Cache?

2015-05-08 Thread Mike Trienis
Hey Chris! I was happy to see the documentation outlining that issue :-) However, I must have got into a pretty terrible state because I had to delete and recreate the kinesis streams as well as the DynamoDB tables. Thanks for the reply, everything is sorted. Mike On Fri, May 8, 2015 at 7:55

Using Pandas/Scikit Learning in Pyspark

2015-05-08 Thread Bin Wang
Hey there, I have a CDH cluster where the default Python installed on those Redhat Linux machines is Python 2.6. I am thinking about developing a Spark application using pyspark and I want to be able to use the Pandas and scikit-learn packages. The Anaconda Python interpreter has the most functionality out of the box

spark and binary files

2015-05-08 Thread tog
Hi, I have an application that currently runs using MR. It currently starts extracting information from a proprietary binary file that is copied to HDFS. The application starts by creating business objects from information extracted from the binary files. Later those objects are used for further pro