Re: Worker dies with java.io.IOException: Stream closed

2015-07-12 Thread Akhil Das
Can you dig a bit more into the worker logs? Also make sure that Spark has permission to write to /opt/ on that machine, as it's one machine that's always throwing up. Thanks Best Regards On Sat, Jul 11, 2015 at 11:18 PM, gaurav sharma wrote: > Hi All, > > I am facing this issue in my production environme

Re: reduceByKeyAndWindow with initial state

2015-07-12 Thread Imran Alam
I'm talking about the variant with inverseReduce. For example, if my batch duration is 1s and window duration is 10s, when I start the streaming job I'd want to start with a complete window instead of an empty window, given I already have the RDDs for the batches that are missing at startup. After 1s,
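A minimal sketch of the invertible variant under discussion (types and durations are illustrative, and a checkpoint directory is required for this API):

    import org.apache.spark.streaming.Seconds

    // assuming pairs is a DStream[(String, Long)] built elsewhere
    ssc.checkpoint("/tmp/checkpoints")  // required for the inverse-reduce variant
    val windowedCounts = pairs.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,  // values entering the window
      (a: Long, b: Long) => a - b,  // values leaving the window
      Seconds(10),                  // window duration
      Seconds(1))                   // slide duration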

Including additional scala libraries in sparkR

2015-07-12 Thread Michal Haris
I have a Spark program with a custom optimised RDD for HBase scans and updates. I have a small library of objects in Scala to support efficient serialisation, partitioning etc. I would like to use R as an analysis and visualisation front-end. I have tried to use rJava (i.e. not using sparkR) and I go

Re: S3 vs HDFS

2015-07-12 Thread Steve Loughran
On 11 Jul 2015, at 19:20, Aaron Davidson <ilike...@gmail.com> wrote: Note that if you use multi-part upload, each part becomes 1 block, which allows for multiple concurrent readers. One would typically use fixed-size block sizes which align with Spark's default HDFS block size (64 MB, I
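As a hedged illustration of aligning part size with a block size, assuming the Hadoop 2.6+ s3a client (the property name comes from the S3A documentation, not from this thread):

    // make each multipart upload part 64 MB, matching the HDFS default mentioned above
    sc.hadoopConfiguration.set("fs.s3a.multipart.size", (64 * 1024 * 1024).toString)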

Re: Worker dies with java.io.IOException: Stream closed

2015-07-12 Thread gaurav sharma
The logs I pasted are from the worker logs only. Spark does have permission to write into /opt; it's not like the worker is not able to start. It runs perfectly for days, but then abruptly dies. And it's not always this machine; sometimes it's some other machine. It happens once in a while, but wh

Re: createDirectStream and Stats

2015-07-12 Thread gaurav sharma
Hi guys, I too am facing a similar challenge with directstream. I have 3 Kafka partitions and am running Spark on 18 cores, with parallelism level set to 48. I am running a simple map-reduce job on the incoming stream. Though the reduce stage takes milliseconds to seconds for around 15 million packets, the
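For reference, a minimal direct-stream sketch against the Kafka 0.8 integration (broker list and topic name are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("events")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

Note that the direct stream creates exactly one Spark partition per Kafka partition, so with 3 Kafka partitions a repartition() may be needed before the 48-way parallelism can be used.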

Few basic spark questions

2015-07-12 Thread Oded Maimon
Hi All, we are evaluating Spark for real-time analytics. What we are trying to do is the following: - READER APP - use a custom receiver to get data from rabbitmq (written in scala) - ANALYZER APP - use a spark R application to read the data (windowed), analyze it every minute and save the r
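A skeleton of the custom-receiver piece described above (class name and storage level are illustrative; the RabbitMQ wiring itself is omitted):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class RabbitMQReceiver(queue: String)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        // open the rabbitmq connection on a background thread and call
        // store(message) for each message consumed
      }

      def onStop(): Unit = {
        // close the rabbitmq connection
      }
    }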

Re: Spark performance

2015-07-12 Thread santoshv98
Ravi, Spark (or, for that matter, Big Data solutions like Hive) is suited for large analytical loads, where “scaling up” starts to pale in comparison to “scaling out” with regard to performance, versatility (types of data) and cost. Without going into the details of MsSQL architecture, there is

Re: Spark performance

2015-07-12 Thread Michael Segel
Not necessarily. It depends on the use case and what you intend to do with the data. 4-6 TB will easily fit on an SMP box and can be efficiently searched by an RDBMS. Again it depends on what you want to do and how you want to do it. Informix’s IDS engine with its extensibility could still o

How can the RegressionMetrics produce negative R2 and explained variance?

2015-07-12 Thread afarahat
Hello; I am using the ALS recommender in MLlib. To select the optimal rank, I have a number of users who used multiple items as my test set. I then get the predictions for these users and compare them to the observed values. I use RegressionMetrics to estimate the R^2. I keep getting a negative value. r
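For context, the evaluation path being described presumably looks something like this sketch (variable names invented):

    import org.apache.spark.mllib.evaluation.RegressionMetrics

    // predictionAndObservations: RDD[(Double, Double)] of (predicted, observed) ratings
    val metrics = new RegressionMetrics(predictionAndObservations)
    println(s"R2 = ${metrics.r2}, explained variance = ${metrics.explainedVariance}")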

Re: How can the RegressionMetrics produce negative R2 and explained variance?

2015-07-12 Thread Sean Owen
In general, a negative R2 means the line that was fit is a very poor fit -- the mean would give a smaller squared error. But it can also mean you are applying R2 where it doesn't apply. Here, you're not performing a linear regression; why are you using R2? On Sun, Jul 12, 2015 at 4:22 PM, afarahat wrote: >
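For reference, the definition that makes a negative value possible:

    R^2 = 1 - SS_res / SS_tot, where SS_res = Σ (y_i - ŷ_i)² and SS_tot = Σ (y_i - ȳ)²

Whenever the model's squared error SS_res exceeds SS_tot (the error of always predicting the mean), R² comes out negative.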

Re: How can the RegressionMetrics produce negative R2 and explained variance?

2015-07-12 Thread Feynman Liang
This might be a bug... R^2 should always be in [0,1] and variance should never be negative. Can you give more details on which version of Spark you are running? On Sun, Jul 12, 2015 at 8:37 AM, Sean Owen wrote: > In general, R2 means the line that was fit is a very poor fit -- the > mean would

Re: Caching in spark

2015-07-12 Thread Ruslan Dautkhanov
Hi Akhil, I'm curious whether RDDs are stored internally in a columnar format as well, or whether it is only when an RDD is cached in a SQL context that it is converted to columnar format. What about data frames? Thanks! -- Ruslan Dautkhanov On Fri, Jul 10, 2015 at 2:07 AM, Akhil Das wrote: > > https://s

Re: Is it possible to change the default port number 7077 for spark?

2015-07-12 Thread maxdml
Q1: You can change the port number on the master in the file conf/spark-defaults.conf. I don't know what the impact on a Cloudera distro would be, though. Q2: Yes: a Spark worker needs to be present on each node that you want to make available to the driver. Q3: You can submit an application from
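For the standalone master specifically, the port is conventionally set through SPARK_MASTER_PORT; a sketch, assuming a plain (non-CDH-managed) install and an illustrative port:

    # conf/spark-env.sh on the master node
    export SPARK_MASTER_PORT=7177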

Re: Issues when combining Spark and a third party java library

2015-07-12 Thread Max Demoulin
Yes, Thank you. -- Henri Maxime Demoulin 2015-07-12 2:53 GMT-04:00 Akhil Das : > Did you try setting the HADOOP_CONF_DIR? > > Thanks > Best Regards > > On Sat, Jul 11, 2015 at 3:17 AM, maxdml wrote: > >> Also, it's worth noting that I'm using the prebuilt version for hadoop 2.4 >> and higher f

Re: Real-time data visualization with Zeppelin

2015-07-12 Thread Ruslan Dautkhanov
I don't think it is a Zeppelin problem. RDDs are "immutable". Unless you integrate something like IndexedRDD http://spark-packages.org/package/amplab/spark-indexedrdd into Zeppelin, I think it's not possible. -- Ruslan Dautkhanov On Wed, Jul 8, 2015 at 3:24 PM, Brandon White wrote: > Can you us

Re: How to upgrade Spark version in CDH 5.4

2015-07-12 Thread Ruslan Dautkhanov
Good question. I'd like to know the same. Although I think you'll lose supportability. -- Ruslan Dautkhanov On Wed, Jul 8, 2015 at 2:03 AM, Ashish Dutt wrote: > > Hi, > I need to upgrade spark version 1.3 to version 1.4 on CDH 5.4. > I checked the documentation here >

Re: Does spark supports the Hive function posexplode function?

2015-07-12 Thread Ruslan Dautkhanov
You can see what Spark SQL functions are supported in Spark by doing the following in a notebook: %sql show functions https://forums.databricks.com/questions/665/is-hive-coalesce-function-supported-in-sparksql.html I think Spark SQL support is currently around Hive ~0.11? -- Ruslan Dautkhanov
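Outside a notebook, the same check can be run from a HiveContext; a minimal sketch, assuming SHOW FUNCTIONS is passed through to Hive:

    // hiveContext: org.apache.spark.sql.hive.HiveContext
    hiveContext.sql("SHOW FUNCTIONS").show(1000)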

Re: Real-time data visualization with Zeppelin

2015-07-12 Thread andy petrella
Heya, You might be looking for something like this I guess: https://www.youtube.com/watch?v=kB4kRQRFAVc. The Spark-Notebook (https://github.com/andypetrella/spark-notebook/) can bring that to you actually, it uses fully reactive bilateral communication streams to update data and viz, plus it hide

Re: Spark equivalent for Oracle's analytical functions

2015-07-12 Thread Ruslan Dautkhanov
Should be part of Spark 1.4 https://issues.apache.org/jira/browse/SPARK-1442 I don't see it in the documentation though https://spark.apache.org/docs/latest/sql-programming-guide.html -- Ruslan Dautkhanov On Mon, Jul 6, 2015 at 5:06 AM, gireeshp wrote: > Is there any equivalent of Oracle's *
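For what it's worth, the Spark 1.4 window-function API looks roughly like this (column names invented; in 1.4 window functions need a HiveContext):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val w = Window.partitionBy("department").orderBy(desc("salary"))
    df.withColumn("rank", rowNumber.over(w)).show()  // rowNumber became row_number in later releases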

Re: How to upgrade Spark version in CDH 5.4

2015-07-12 Thread Sean Owen
Yeah, it won't technically be supported, and you shouldn't go modifying the actual installation, but if you just make your own build of 1.4 for CDH 5.4 and use that build to launch YARN-based apps, I imagine it will Just Work for most any use case. On Sun, Jul 12, 2015 at 7:34 PM, Ruslan Dautkhano

Master vs. Slave Nodes Clarification

2015-07-12 Thread algermissen1971
Hi, I have a question that I am having real trouble figuring out for myself: does the master node in a Spark cluster need to be a node similar to the slave nodes, or should I rather view it as a coordinating node that does not need much computing or storage power? For example, when using Sp

Re: How to upgrade Spark version in CDH 5.4

2015-07-12 Thread David Sabater Dinter
As Sean suggested you can actually build Spark 1.4 for CDH 5.4.x and also include Hive libraries for 0.13.1, but *this will be completely unsupported by Cloudera*. I would suggest to do that only if you just want to experiment with new features from Spark 1.4. I.e. Run SparkSQL with sort-merge join
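A sketch of the kind of build invocation meant, assuming the CDH Hadoop version string (check the Building Spark docs for the exact profiles):

    build/mvn -Pyarn -Phive -Phive-thriftserver \
      -Dhadoop.version=2.6.0-cdh5.4.0 -DskipTests clean package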

Re: Does spark supports the Hive function posexplode function?

2015-07-12 Thread David Sabater Dinter
It seems this feature was added in Hive 0.13. https://issues.apache.org/jira/browse/HIVE-4943 I would assume this is supported as Spark is by default compiled using Hive 0.13.1. On Sun, Jul 12, 2015 at 7:42 PM, Ruslan Dautkhanov wrote: > You can see what Spark SQL functions are supported in Spa

Spark Standalone Mode not working in a cluster

2015-07-12 Thread Eduardo
My installation of Spark is not working correctly in my local cluster. I downloaded spark-1.4.0-bin-hadoop2.6.tgz and untarred it in a directory visible to all nodes (these nodes are all accessible by ssh without a password). In addition, I edited conf/slaves so that it contains the names of the nodes.

Re: Does spark supports the Hive function posexplode function?

2015-07-12 Thread ayan guha
Spark already provides an explode function on lateral views. Please see https://issues.apache.org/jira/browse/SPARK-5573. On Mon, Jul 13, 2015 at 6:47 AM, David Sabater Dinter < david.sabater.maill...@gmail.com> wrote: > It seems this feature was added in Hive 0.13. > https://issues.apache.org/ji
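To illustrate the difference, a hedged sketch through a HiveContext (table and column names invented); posexplode also returns each element's position, which plain explode does not:

    hiveContext.sql(
      """SELECT t.pos, t.val
        |FROM my_table
        |LATERAL VIEW posexplode(my_array) t AS pos, val""".stripMargin).show()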

SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread Jerrick Hoang
Hi all, I'm new to Spark and this question may be trivial or may already have been answered, but when I do a 'describe table' from the SparkSQL CLI it seems to try looking at all records in the table (which takes a really long time for a big table) instead of just giving me the metadata of the table. Would a

Re: SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread Ted Yu
Which Spark release do you use ? Cheers On Sun, Jul 12, 2015 at 5:03 PM, Jerrick Hoang wrote: > Hi all, > > I'm new to Spark and this question may be trivial or has already been > answered, but when I do a 'describe table' from SparkSQL CLI it seems to > try looking at all records at the table

Re: SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread ayan guha
Describe computes statistics, so it will try to query the table. The one you are looking for is df.printSchema() On Mon, Jul 13, 2015 at 10:03 AM, Jerrick Hoang wrote: > Hi all, > > I'm new to Spark and this question may be trivial or has already been > answered, but when I do a 'describe table'
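In other words, a metadata-only sketch (table name invented):

    val df = sqlContext.table("my_table")
    df.printSchema()  // prints column names and types without scanning the table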

Re: Spark Streaming - Inserting into Tables

2015-07-12 Thread Yin Huai
Hi Brandon, Can you explain what you meant by "It simply does not work"? You did not see new data files? Thanks, Yin On Fri, Jul 10, 2015 at 11:55 AM, Brandon White wrote: > Why does this not work? Is insert into broken in 1.3.1? It does not throw > any errors, fail, or throw exceptions. I

Re: SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread Yin Huai
Jerrick, Let me ask a few clarification questions. What is the version of Spark? Is the table a hive table? What is the format of the table? Is the table partitioned? Thanks, Yin On Sun, Jul 12, 2015 at 6:01 PM, ayan guha wrote: > Describe computes statistics, so it will try to query the tabl

Re: Spark Streaming - Inserting into Tables

2015-07-12 Thread Brandon White
Hi Yin, Yes, there were no new rows. I fixed it by doing a .remember on the context. Obviously, this is not ideal. On Sun, Jul 12, 2015 at 6:31 PM, Yin Huai wrote: > Hi Brandon, > > Can you explain what you meant by "It simply does not work"? You did > not see new data files? > > Thanks, > >
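The workaround mentioned, as a sketch (the duration is illustrative):

    import org.apache.spark.streaming.Minutes

    // keep generated RDDs around longer than the default cleanup window
    ssc.remember(Minutes(5))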

Re: SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread Jerrick Hoang
Sorry all for not being clear. I'm using Spark 1.4, the table is a Hive table, and the table is partitioned. On Sun, Jul 12, 2015 at 6:36 PM, Yin Huai wrote: > Jerrick, > > Let me ask a few clarification questions. What is the version of Spark? Is > the table a hive table? What is the format

Re: Ordering of Batches in Spark streaming

2015-07-12 Thread anshu shukla
Can anyone shed some light on HOW SPARK DOES *ORDERING OF BATCHES*? On Sat, Jul 11, 2015 at 9:19 AM, anshu shukla wrote: > Thanks Ayan, > > I was curious to know *how Spark does it*. Is there any *documentation* > where I can get the details about that? Will you please point me

Re: RECEIVED SIGNAL 15: SIGTERM

2015-07-12 Thread Jong Wook Kim
Based on my experience, YARN containers can get SIGTERM when - they produce too many logs and use up the hard drive - they use more off-heap memory than what is given by the spark.yarn.executor.memoryOverhead configuration. It might be due to too many classes loaded (less than MaxPermGen but more tha
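The overhead setting referred to, as a sketch (the value, in MB, is illustrative):

    import org.apache.spark.SparkConf

    // give each executor extra off-heap headroom beyond the JVM heap
    val conf = new SparkConf().set("spark.yarn.executor.memoryOverhead", "1024")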

Re: RECEIVED SIGNAL 15: SIGTERM

2015-07-12 Thread Ruslan Dautkhanov
>> the executor receives a SIGTERM (from whom???) From the YARN Resource Manager. Check if YARN fair-scheduler preemption and/or speculative execution are turned on; then it's quite possible and not a bug. -- Ruslan Dautkhanov On Sun, Jul 12, 2015 at 11:29 PM, Jong Wook Kim wrote: > Based on

Re: Caching in spark

2015-07-12 Thread Akhil Das
There was a discussion on that earlier; let me re-post it for you. For the following code: val df = sqlContext.parquetFile(path) -- df remains columnar (actually it just reads from the columnar Parquet file on disk). For the following code: val cdf = df.cache() -- cdf is

Re: Few basic spark questions

2015-07-12 Thread Oded Maimon
any help / idea will be appreciated :) thanks Regards, Oded Maimon Scene53. On Sun, Jul 12, 2015 at 4:49 PM, Oded Maimon wrote: > Hi All, > we are evaluating spark for real-time analytic. what we are trying to do > is the following: > >- READER APP- use custom receiver to get data from rab

Re: javaRDD.saveasTextfile saves each line enclosed by square brackets

2015-07-12 Thread dineh210
Hi, can anyone please help me with this post? It seems to be a showstopper for our current project. Thanks in advance. Regards Dinesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/javaRDD-saveasTextfile-saves-each-line-enclosed-by-square-brackets-tp23726p23
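If the RDD holds Rows (or Java lists), their default toString adds the square brackets; assuming that is the case here, one common fix is to format each record explicitly before saving. A Scala sketch of the idea (output path invented):

    // render each Row as comma-separated text instead of relying on toString
    rdd.map(row => row.mkString(",")).saveAsTextFile("/output/path")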

Re: Master vs. Slave Nodes Clarification

2015-07-12 Thread Akhil Das
You are a bit confused about the master node, slave nodes, and the driver machine. 1. The master node can be kept as a smaller machine in your dev environment; in production you will mostly be using a Mesos or YARN cluster manager. 2. Now, if you are running your driver program (the streaming job) on the mas

Re: Spark Standalone Mode not working in a cluster

2015-07-12 Thread Akhil Das
Just make sure you are having the same installation of spark-1.4.0-bin-hadoop2.6 everywhere. (including the slaves, master, and from where you start the spark-shell). Thanks Best Regards On Mon, Jul 13, 2015 at 4:34 AM, Eduardo wrote: > My installation of spark is not working correctly in my lo
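For completeness, the standard standalone launch sequence being assumed here (host names invented):

    # conf/slaves on the master: one worker host per line
    worker1
    worker2

    # then, from the master:
    sbin/start-all.sh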