It really shouldn’t; if anything, running as superuser should ALLOW you to bind
to ports 0, 1, etc.
It seems very strange that it should even be trying to bind to these ports -
maybe a JVM issue?
I wonder if the old Apple JVM implementations could have used some different
native libraries for cor
Hi All,
I have one master & 2 workers on my local machine. I wrote the following
Java program to count the number of lines in the README.md file (I am using a
Maven project to do this):
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.Spa
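A minimal Scala sketch of the line-count job described above (the original program is Java); the standalone master URL and the README.md path are assumptions, not taken from the post:

import org.apache.spark.{SparkConf, SparkContext}

object LineCount {
  def main(args: Array[String]): Unit = {
    // Assumed local standalone master; adjust to your own master URL.
    val conf = new SparkConf().setAppName("LineCount").setMaster("spark://localhost:7077")
    val sc = new SparkContext(conf)
    // Assumed path: README.md in the working directory.
    val lines = sc.textFile("README.md")
    println(s"Line count: ${lines.count()}")
    sc.stop()
  }
}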
Is there a graph partitioning implementation (e.g. spectral) that can be used in
Spark?
I guess it should at least be a Java/Scala lib,
maybe even tuned to work with GraphX.
It should just work with these steps. You don't need to configure much. As
mentioned, some settings on your machine are overriding default spark
settings.
Even running as super-user should not be a problem. It works just fine as
super-user as well.
Can you tell us what version of Java you are usi
Hi,
I have a spark-streaming application that basically keeps track of a
string->string dictionary.
So I have messages coming in with updates, like:
"A"->"B"
And I need to update the dictionary.
This seems like a simple use case for the updateStateByKey method.
However, my issue is that when t
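A minimal sketch of the updateStateByKey approach being described, assuming the input is a DStream[(String, String)] of key -> new-value updates and the state is simply the latest value per key; a checkpoint directory must also be set on the StreamingContext (ssc.checkpoint(...)):

import org.apache.spark.streaming.dstream.DStream

def trackDictionary(updates: DStream[(String, String)]): DStream[(String, String)] =
  updates.updateStateByKey[String] { (newValues: Seq[String], current: Option[String]) =>
    // Keep the most recent update for this key, or the existing value if none arrived.
    newValues.lastOption.orElse(current)
  }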
I doubt that's the problem. That's how HDFS lists a directory.
Output of a few more commands below.
$ hadoop fs -ls /tmp/
Found 7 items
drwxrwxrwx - hdfs supergroup 0 2016-03-10 11:09
/tmp/.cloudera_health_monitoring_canary_files
-rw-r--r-- 3 ndsuser1 supergroup 447873024 2016-
+1 Paul. Both have some pros and cons.
Hope this helps.
Avro:
Pros:
1) Plays nicely with other tools, third-party or otherwise, and covers the cases
where you specifically need a data type Avro has, like binary; gladly that list
is shrinking all the time (yay, nested types in Impala).
2) Good for event data that cha
Hi,
I am doing a few basic operations like map -> reduceByKey -> filter, which is
very similar to word count, and I'm saving the result where the count >
threshold.
Currently the batch window is every 10s, but I would like to save the results
to Redshift at a lower frequency instead of every
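One way to write less often than the 10 s batch interval, roughly in line with the question above: re-window the counts onto a longer slide before saving. The 60 s window/slide and the placeholder sink are assumptions, not from the post:

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

def saveEveryMinute(counts: DStream[(String, Long)], threshold: Long): Unit =
  counts
    .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(60), Seconds(60)) // aggregate six 10 s batches, emit once a minute
    .filter { case (_, count) => count > threshold }
    .foreachRDD { rdd =>
      // Placeholder: replace with the actual Redshift write (e.g. JDBC or the spark-redshift package).
      println(s"Would write ${rdd.count()} rows to Redshift")
    }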
Hi
I'm also having this issue and cannot get the tasks to work inside Mesos.
In my case, the spark-submit command is the following.
$SPARK_HOME/bin/spark-submit \
--class com.mycompany.SparkStarter \
--master mesos://mesos-dispatcher:7077 \
--name SparkStarterJob \
--driver-memory 1G \
--exe
Still, I think this information is not enough to explain the reason.
1. Does your YARN cluster have enough resources to start all 10 executors?
2. Would you please try the latest version, 1.6.0 or the master branch, to see
if this is a bug that has already been fixed?
3. You could add
"log4j.logger.org.apache.spark.Ex
Sorry, the last configuration is also --conf
spark.dynamicAllocation.cachedExecutorIdleTimeout=60s; the "--conf" was lost
when I copied it into the mail.
-- Forwarded message --
From: Jy Chen
Date: 2016-03-10 10:09 GMT+08:00
Subject: Re: Dynamic allocation doesn't work on YARN
To: Saisai Sha
Hi,
My Spark version is 1.5.1 with Hadoop 2.5.0-cdh5.2.0. These are my
configurations of dynamic allocation:
--master yarn-client --conf spark.dynamicAllocation.enabled=true --conf
spark.shuffle.service.enabled=true --conf
spark.dynamicAllocation.minExecutors=0 --conf
spark.dynamicAllocation.initia
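For reference, the same settings expressed through SparkConf instead of --conf flags; a minimal sketch that mirrors the flags quoted in this thread (the external shuffle service still has to be running on each NodeManager):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("yarn-client")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "0")
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "60s")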
bq. Question is how to get maven repository
As you may have noted, the version has SNAPSHOT in it.
Please check out the latest code from the master branch and build it yourself.
The 2.0 release is still a few months away - though a backport of the
hbase-spark module should come in the 1.3 release.
On Wed, Mar 9, 2016 at 9
bq. epoch2numUDF = udf(foo, FloatType())
Is it possible that the return value from foo is not a FloatType?
On Wed, Mar 9, 2016 at 3:09 PM, Andy Davidson wrote:
> I need to convert time stamps into a format I can use with matplotlib
> plot_date(). epoch2num() works fine if I use it in my driver how e
I need to convert time stamps into a format I can use with matplotlib
plot_date(). epoch2num() works fine if I use it in my driver, however I get
a numpy constructor error if I use it in a UDF.
Any idea what the problem is?
Thanks
Andy
P.S. I am using Python 3 and Spark 1.6.
from pyspark.sql.functio
bq. drwxr-xr-x - tomcat7 supergroup 0 2016-03-09 23:16 /tmp/swg
If I read the above line correctly, the size of the file was 0.
On Wed, Mar 9, 2016 at 10:00 AM, srimugunthan dhandapani <
srimugunthan.dhandap...@gmail.com> wrote:
> Hi all
> I am working in cloudera CDH5.6 and version
We ended up implementing custom Hadoop InputFormats and RecordReaders by
extending FileInputFormat / RecordReader, and using sc.newAPIHadoopFile to
read it as an RDD.
On Wed, Mar 9, 2016 at 9:15 AM Ruslan Dautkhanov
wrote:
> We have a huge binary file in a custom serialization format (e.g. heade
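A rough sketch of how the pieces described above fit together; CustomBinaryInputFormat is a hypothetical class you would write by extending FileInputFormat[LongWritable, BytesWritable] with a matching RecordReader that parses the header and its items:

import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def readCustomBinary(sc: SparkContext, path: String): RDD[Array[Byte]] =
  sc.newAPIHadoopFile[LongWritable, BytesWritable, CustomBinaryInputFormat](path)
    .map { case (_, bytes) => bytes.copyBytes() } // copy, because Hadoop reuses Writable instances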
bq. there is a varying number of items for that record
If the combination of items is very large, using a case class would be
tedious.
On Wed, Mar 9, 2016 at 9:57 AM, Saurabh Bajaj
wrote:
> You can load that binary up as a String RDD, then map over that RDD and
> convert each row to your case cla
That’s very strange. I just un-set my SPARK_HOME env param, downloaded a fresh
1.6.0 tarball,
unzipped it to a local dir (~/Downloads), and it ran just fine - the driver port
is some randomly generated large number.
So SPARK_HOME is definitely not needed to run this.
Aida, you are not running thi
Hi Jakob,
Tried running the command env | grep SPARK; nothing comes back.
Tried env | grep Spark, which matches the directory I created for Spark once I
downloaded the tgz file; it comes back with PWD=/Users/aidatefera/Spark.
Tried running ./bin/spark-shell; it comes back with the same error as below, i.e.
could
Sorry, I had a typo in my previous message:
> try running just "/bin/spark-shell"
please remove the leading slash (/)
On Wed, Mar 9, 2016 at 1:39 PM, Aida Tefera wrote:
> Hi there, tried echo $SPARK_HOME but nothing comes back so I guess I need to
> set it. How would I do that?
>
> Thanks
>
> Sent
As Tristan mentioned, it looks as though Spark is trying to bind on
port 0 and then 1 (which is not allowed). Could it be that some
environment variables from your previous installation attempts are
polluting your configuration?
What does running "env | grep SPARK" show you?
Also, try running just
Hi Tristan, my apologies, I meant to write Spark and not SCALA
I feel a bit lost at the moment...
Perhaps I have missed steps that are implicit to more experienced people.
Apart from downloading Spark and then following Jakob's steps:
1. curl http://apache.arvixe.com/spark/spark-1.6.0/spark-1.6.
SPARK_HOME and SCALA_HOME are different. I was just wondering whether Spark is
looking for the config files in a different dir from where you’re running it.
If you have not set SPARK_HOME, it should look in the current directory for the
/conf dir.
The defaults should be relatively safe, I’ve be
I don't think I set the SCALA_HOME environment variable
Also, I'm unsure whether or not launching the scripts defaults to a
single machine (localhost)
Sent from my iPhone
> On 9 Mar 2016, at 19:59, Tristan Nixon wrote:
>
> Also, do you have the SPARK_HOME environment variable set in you
Also, do you have the SPARK_HOME environment variable set in your shell, and if
so what is it set to?
> On Mar 9, 2016, at 1:53 PM, Tristan Nixon wrote:
>
> There should be a /conf sub-directory wherever you installed spark, which
> contains several configuration files.
> I believe that the tw
There should be a /conf sub-directory wherever you installed spark, which
contains several configuration files.
I believe that the two that you should look at are
spark-defaults.conf
spark-env.sh
> On Mar 9, 2016, at 1:45 PM, Aida Tefera wrote:
>
> Hi Tristan, thanks for your message
>
> When
hive 1.6.0 in embedded mode doesn't connect to the metastore --
https://issues.apache.org/jira/browse/SPARK-9686
https://forums.databricks.com/questions/6512/spark-160-not-able-to-connect-to-hive-metastore.html
On Wed, Mar 9, 2016 at 10:48 AM, Suniti Singh
wrote:
> Hi,
>
> I am able to reproduce this
Oh yeah, I already added it after your earlier message; have a look.
On Wed, Mar 9, 2016 at 7:45 PM, Mohammed Guller wrote:
> My book on Spark was recently published. I would like to request it to be
> added to the Books section on Spark's website.
>
>
>
> Here are the details about the book.
>
>
Hi Tristan, thanks for your message
When I look at the spark-defaults.conf.template it shows a spark
example (spark://master:7077) where the port is 7077.
When you say look to the conf scripts, what do you mean?
Sent from my iPhone
> On 9 Mar 2016, at 19:32, Tristan Nixon wrote:
>
> Yeah, accor
My book on Spark was recently published. I would like to request it to be added
to the Books section on Spark's website.
Here are the details about the book.
Title: Big Data Analytics with Spark
Author: Mohammed Guller
Link:
www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/
Yeah, to be clear, I'm talking about having only one constructor for a
direct stream, that will give you a stream of ConsumerRecord.
Different needs for topic subscription, starting offsets, etc could be
handled by calling appropriate methods after construction but before
starting the stream.
On
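For context, a sketch of the Spark 1.6-era constructor without the messageHandler argument that the thread refers to; the broker address and topic name are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

def buildStream(ssc: StreamingContext) = {
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder broker
  val topics = Set("events")                                      // placeholder topic
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)
}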
Yeah, according to the standalone documentation
http://spark.apache.org/docs/latest/spark-standalone.html
the default port should be 7077, which means that something must be overriding
this on your installation - look to the conf scripts!
> On Mar 9, 2016, at 1:26 PM, Tristan Nixon wrote:
>
>
Looks like it’s trying to bind on port 0, then 1.
Often the low-numbered ports are restricted to system processes and
“established” servers (web, ssh, etc.) and
so user programs are prevented from binding on them. The default should be to
run on a high-numbered port like 8080 or such.
What do yo
I'd probably prefer to keep it the way it is, unless it's becoming more
like the function without the messageHandler argument.
Right now I have code like this, but I wish it were more similar looking:
if (parsed.partitions.isEmpty()) {
JavaPairInputDStream kvstream = KafkaUtils
Hi Ted and Saurabh,
If I use --conf arguments with pyspark I am able to connect. Any idea how I
can set these values programmatically? (I work on a notebook server and cannot
easily reconfigure the server
This works
extraPkgs="--packages com.databricks:spark-csv_2.11:1.3.0 \
--package
Hi Jakob,
Thanks for your suggestion. I downloaded a pre-built version with Hadoop and
followed your steps.
I posted the result on the forum thread; not sure if you can see it?
I was just wondering whether this means it has been successfully installed as
there are a number of warning/error mess
You can also package an alternative log4j config in your jar files
> On Mar 9, 2016, at 12:20 PM, Ashic Mahtab wrote:
>
> Found it.
>
> You can pass in the jvm parameter log4j.configuration. The following works:
>
> -Dlog4j.configuration=file:path/to/log4j.properties
>
> It doesn't work with
Hi,
I am able to reproduce this error only when using spark 1.6.0 and hive
1.6.0. The hive-site.xml is in the classpath but somehow spark rejects the
classpath search for hive-site.xml and starts using the default metastore,
Derby.
16/03/09 10:37:52 INFO MetaStoreDirectSql: Using direct SQL, underl
Hi everyone, thanks for all your support
I went with your suggestion Cody/Jakob and downloaded a pre-built version
with Hadoop this time and I think I am finally making some progress :)
ukdrfs01:spark-1.6.0-bin-hadoop2.6 aidatefera$ ./bin/spark-shell --master
local[2]
log4j:WARN No appenders cou
You can also use --files, which doesn't require the file scheme.
On Wed, Mar 9, 2016 at 11:20 AM Ashic Mahtab wrote:
> Found it.
>
> You can pass in the jvm parameter log4j.configuration. The following works:
>
> -Dlog4j.configuration=file:path/to/log4j.properties
>
> It doesn't work without the
Found it.
You can pass in the jvm parameter log4j.configuration. The following works:
-Dlog4j.configuration=file:path/to/log4j.properties
It doesn't work without the file: prefix though. Tested in 1.6.0.
Cheers, Ashic.
From: as...@live.com
To: user@spark.apache.org
Subject: Specify log4j propertie
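A hedged example of wiring this into spark-submit; the class name, jar and paths are placeholders, and --files (as noted above) ships the file to the executors:

spark-submit \
  --class com.example.MyApp \
  --driver-java-options "-Dlog4j.configuration=file:/path/to/log4j.properties" \
  --files /path/to/log4j.properties \
  my-app.jar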
Could you wrap the ZipInputStream in a List, since a subtype of
TraversableOnce[?] is required?
case (name, content) => List(new ZipInputStream(content.open))
Xinh
On Wed, Mar 9, 2016 at 7:07 AM, Benjamin Kim wrote:
> Hi Sabarish,
>
> I found a similar posting online where I should use the S3
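Putting the pieces of this thread together, a minimal sketch that reads zip archives with sc.binaryFiles and flatMaps over them, assuming each archive contains a single CSV file (the path is a placeholder):

import java.util.zip.ZipInputStream
import scala.io.Source
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def readZippedCsv(sc: SparkContext, path: String): RDD[String] =
  sc.binaryFiles(path).flatMap { case (name, content) =>
    val zis = new ZipInputStream(content.open())
    zis.getNextEntry // position at the single file inside the archive
    Source.fromInputStream(zis).getLines()
  }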
Hi all
I am working in Cloudera CDH 5.6 and the version of Spark is 1.5.0-cdh5.6.0.
I have a strange problem: Spark Streaming works on a directory in the local
filesystem but doesn't work for HDFS.
My spark streaming program:
package com.oreilly.learningsparkexamples.java;
import java.util.concurren
Hello, is it possible to provide a log4j properties file when submitting jobs to
a cluster? I know that by default Spark looks for a log4j.properties file in
the conf directory. I'm looking for a way to specify a different
log4j.properties file (external to the application) without pointing to a
You can load that binary up as a String RDD, then map over that RDD and
convert each row to your case class representing the data. In the map stage
you could also map the input string into an RDD of JSON values and use the
following function to convert it into a DF
http://spark.apache.org/docs/late
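A small sketch of the case-class route described above; Record, the tab delimiter and the field layout are illustrative stand-ins for the custom format, not part of the original advice:

import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SQLContext}

case class Record(id: Long, items: Seq[String]) // hypothetical shape of one row

def toDataFrame(sc: SparkContext, path: String): DataFrame = {
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._
  sc.textFile(path)
    .map { line =>
      val fields = line.split('\t') // assumed delimiter
      Record(fields(0).toLong, fields.drop(1))
    }
    .toDF()
}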
You might want to avoid that unionAll(), which seems to be repeated over
1000 times. Could you do a collect() in each iteration, and collect your
results in a local Array instead of a DataFrame? How many rows are returned
in "temp1"?
Xinh
On Tue, Mar 8, 2016 at 10:00 PM, Angel Angel
wrote:
> He
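A sketch of that suggestion, assuming each iteration's result ("temp1") is small enough to collect to the driver; the list of per-iteration DataFrames is a placeholder:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.{DataFrame, Row}

def gatherResults(perIterationResults: Seq[DataFrame]): Array[Row] = {
  val results = ArrayBuffer.empty[Row]
  for (df <- perIterationResults) {
    results ++= df.collect() // collect each small result locally instead of chaining unionAll
  }
  results.toArray
}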
I tried with Avro input, something like /data/txn_*/*, and it works for me.
On Wed, Mar 9, 2016 at 12:12 PM, Ted Yu wrote:
> Koert:
> I meant org.apache.hadoop.mapred.FileInputFormat doesn't support multi
> level wildcard.
>
> Cheers
>
> On Wed, Mar 9, 2016 at 8:22 AM, Koert Kuipers wrote:
>
>> i
We have a huge binary file in a custom serialization format (e.g. header
tells the length of the record, then there is a varying number of items for
that record). This is produced by an old C++ application.
What would be the best approach to deserialize it into a Hive table or a Spark
RDD?
Format is kn
I am trying to integrate Spark Streaming with HBase. I am calling the following
APIs to connect to HBase:
HConnection hbaseConnection = HConnectionManager.createConnection(conf);
hBaseTable = hbaseConnection.getTable(hbaseTable);
Since I cannot get the connection and broadcast the connection each API
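A common workaround for the problem being described: since the HBase connection is not serializable and cannot be broadcast, open it per partition inside foreachRDD/foreachPartition. The table name, column family and qualifier below are placeholders:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HConnectionManager, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.dstream.DStream

def writeToHBase(records: DStream[(String, String)], tableName: String): Unit =
  records.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      val conf = HBaseConfiguration.create() // built on the executor, not shipped from the driver
      val connection = HConnectionManager.createConnection(conf)
      val table = connection.getTable(tableName)
      partition.foreach { case (rowKey, value) =>
        val put = new Put(Bytes.toBytes(rowKey))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
        table.put(put)
      }
      table.close()
      connection.close()
    }
  }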
Koert:
I meant org.apache.hadoop.mapred.FileInputFormat doesn't support multi
level wildcard.
Cheers
On Wed, Mar 9, 2016 at 8:22 AM, Koert Kuipers wrote:
> i use multi level wildcard with hadoop fs -ls, which is the exact same
> glob function call
>
> On Wed, Mar 9, 2016 at 9:24 AM, Ted Yu wro
Hi,
We're having a similar issue. We have a standalone cluster running 1.5.2
with Hive working fine having dropped hive-site.xml into the conf folder.
We've just updated to 1.6.0, using the same configuration. Now when
starting a spark-shell we get the following:
java.lang.RuntimeException: java.
Hello,
I'd like to know if this architecture is correct or not. We are studying
Spark as our ETL engine; we have a UI designer for the graph, which gives us
a model that we want to translate into the corresponding Spark executions.
That brings us to Akka FSM, using the same SparkContext for all actors; we
My mistake, it's not Akka FSM, it's Akka Flow Graphs.
On Wed, Mar 9, 2016 at 1:46 PM, Andrés Ivaldi wrote:
> Hello,
>
> I'd like to know if this architecture is correct or not. We are studying
> Spark as our ETL engine, we have a UI designer for the graph, this give us
> a model that we want to tra
Hi
Batch interval is 5min. I actually managed to fix the issue by turning off
dynamic allocation and the external shuffle service.
This seems to have helped, and now the scheduling delay is between 0-5 ms and
the processing time is about 2.8 min, which is lower than my batch interval.
I also noticed tha
I agree with Ted's assessment.
The Big Data space is getting crowded with an amazing array of tools and
utilities, some disappearing like meteors. Hadoop is definitely a keeper. So
are Hive and Spark. Hive is the most stable Data Warehouse on Big Data and
Spark is offering an array of impressive tools.
I use multi-level wildcards with hadoop fs -ls, which is the exact same glob
function call.
On Wed, Mar 9, 2016 at 9:24 AM, Ted Yu wrote:
> Hadoop glob pattern doesn't support multi level wildcard.
>
> Thanks
>
> On Mar 9, 2016, at 6:15 AM, Koert Kuipers wrote:
>
> if its based on HadoopFsRelatio
Hi,
Unless I've misunderstood what you want to achieve, you could use:
sqlContext.read.json(sc.textFile("/mnt/views-p/base/2016/01/*/*-xyz.json"))
Regards,
Christophe.
On 09/03/16 15:24, Ted Yu wrote:
Hadoop glob pattern doesn't support multi level wildcard.
Thanks
On Mar 9, 2016, at 6:15 AM,
Hi guys,
Thanks for responding.
Re SPARK_CLASSPATH (Daoyuan): I think you are right. We tried it, and that’s
what the warning we got said.
Re SparkConf (Daoyuan): We need the custom jar in the driver code, so I don’t
know how that would work.
Re EMR -u (Sonal): The documentation says that thi
bq. it is kind of columnar NoSQL database.
The storage format in HBase is not columnar.
I would suggest you build upon what you already know (Spark and Hive) and
expand on that. Also, if your work uses Big Data technologies, those would
be the first to consider getting to know better.
On Wed, Ma
Hi Sabarish,
I found a similar posting online where I should use the S3 listKeys.
http://stackoverflow.com/questions/24029873/how-to-read-multiple-text-files-into-a-single-rdd.
Is this what you were thinking?
And, your assumption is correct. The zipped CSV file contains only a single
file. I f
Hi Gurus,
I am relatively new to Big Data and know some about Spark and Hive.
I was wondering whether I need to pick up skills on HBase as well. I am not sure
how it works, but I know that it is a kind of columnar NoSQL database.
I know it is good to know something new in Big Data space. Just wondering if
Could someone verify this for me?
On 8 March 2016 at 14:06, diplomatic Guru wrote:
> Hello all,
>
> I'm using Random Forest for my machine learning (batch), I would like to
> use online prediction using Streaming job. However, the document only
> states linear algorithm for regression job. Coul
In GraphX, partitioning relates to edges, not to vertices - vertices are
partitioned however the RDD that was used to create the graph was
partitioned.
-
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-acti
Hadoop glob pattern doesn't support multi level wildcard.
Thanks
> On Mar 9, 2016, at 6:15 AM, Koert Kuipers wrote:
>
> if its based on HadoopFsRelation shouldn't it support it? HadoopFsRelation
> handles globs
>
>> On Wed, Mar 9, 2016 at 8:56 AM, Ted Yu wrote:
>> This is currently not supp
If it's based on HadoopFsRelation, shouldn't it support it? HadoopFsRelation
handles globs.
On Wed, Mar 9, 2016 at 8:56 AM, Ted Yu wrote:
> This is currently not supported.
>
> On Mar 9, 2016, at 4:38 AM, Jakub Liska wrote:
>
> Hey,
>
> is something like this possible?
>
> sqlContext.read.json("/m
This is currently not supported.
> On Mar 9, 2016, at 4:38 AM, Jakub Liska wrote:
>
> Hey,
>
> is something like this possible?
>
> sqlContext.read.json("/mnt/views-p/base/2016/01/*/*-xyz.json")
>
> I switched to DataFrames because my source files changed from TSV to JSON
> but now I'm not
I am not Spark committer.
So I cannot be the shepherd :-)
> On Mar 9, 2016, at 2:27 AM, James Hammerton wrote:
>
> Hi Ted,
>
> Finally got round to creating this:
> https://issues.apache.org/jira/browse/SPARK-13773
>
> I hope you don't mind me selecting you as the shepherd for this ticket.
Hi Daniel
The bottleneck in Spark ML is most likely (a) the fact that the weight
vector itself is dense, and (b) the related communication via the driver. A
tree aggregation mechanism is used for computing gradient sums (see
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apac
Hey,
is something like this possible?
sqlContext.read.json("/mnt/views-p/base/2016/01/*/*-xyz.json")
I switched to DataFrames because my source files changed from TSV to JSON
but now I'm not able to load the files as I did before. I get this error if
I try that :
https://github.com/apache/spark
Hi,
What’s your batch interval? If the processing time is constantly bigger
than your batch interval, it is totally normal that your scheduling delay
goes up.
2016-03-08 23:28 GMT+01:00 jleaniz :
> Hi,
>
> I have a streaming application that reads batches from Flume, does some
> transformatio
> On 8 Mar 2016, at 18:06, Aida wrote:
>
> Detected Maven Version: 3.0.3 is not in the allowed range 3.3.3.
I'd look at that error message and fix it
On 8 Mar 2016, at 16:34, Eddie Esquivel <eduardo.esqui...@gmail.com> wrote:
Hello All,
In the Spark documentation under "Hardware Requirements" it very clearly states:
We recommend having 4-8 disks per node, configured without RAID (just as
separate mount points)
My question is why not
Hi Gerhard,
I just stumbled upon some documentation on EMR - link below. It seems there is
a -u option to add jars in S3 to your classpath; have you tried that?
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-configure.html
Best Regards,
Sonal
Founder, Nube Technologie
Hi Ted,
Finally got round to creating this:
https://issues.apache.org/jira/browse/SPARK-13773
I hope you don't mind me selecting you as the shepherd for this ticket.
Regards,
James
On 7 March 2016 at 17:50, James Hammerton wrote:
> Hi Ted,
>
> Thanks for getting back - I realised my mistake
No, the site itself (the part not under docs/) is in the ASF SVN repo.
A PR wouldn't help. Just request here or dev@ with details.
On Wed, Mar 9, 2016 at 7:30 AM, Jan Štěrba wrote:
> You could try creating a pull-request on github.
>
> -Jan
> --
> Jan Sterba
> https://twitter.com/honzasterba | ht
Thanks for the reply.
I am now trying to configure yarn.web-proxy.address according to
https://issues.apache.org/jira/browse/SPARK-5837, but cannot start the
standalone web proxy server.
I am using CDH 5.0.1 and below is the error log:
sbin/yarn-daemon.sh: line 44:
/opt/cloudera/parcels/CDH/lib/
Would you please send out your dynamic allocation configuration so we
can understand the situation better?
On Wed, Mar 9, 2016 at 4:29 PM, Jy Chen wrote:
> Hello everyone:
>
> I'm trying the dynamic allocation in Spark on YARN. I have followed
> configuration steps and started the shuffle service.
>
> Now it c
Hi,
I have a non-secure Hadoop 2.7.2 cluster on EC2 with Spark 1.5.2.
I am submitting my Spark Scala script through a shell script using an Oozie
workflow.
I am submitting the job as the hdfs user, but it is running as user "yarn", so
all the output gets stored under the user/yarn directory only.
When I
Hi Andrew,
Thanks for the reply. I run SparkPi in a cluster with 1 master + 2 slaves based
on YARN. I did not specify the deploy mode, so I think it should be in client
mode. I checked the console log and did not find the EventLoggingListener
keyword. It seems the spark-defaults.conf is not passed correctly.
The
Oozie may be able to do this for you and integrate with Spark.
> On 09 Mar 2016, at 06:03, Benjamin Kim wrote:
>
> I am wondering if anyone can help.
>
> Our company stores zipped CSV files in S3, which has been a big headache from
> the start. I was wondering if anyone has created a way to i
Hello everyone:
I'm trying the dynamic allocation in Spark on YARN. I have followed
configuration steps and started the shuffle service.
Now it can request executors when the workload is heavy, but it cannot
remove executors. I tried opening the spark shell and not running any command;
no executor is
Spark will skip a stage if it has already been computed by another job. That
means the common parent RDD of each job only needs to be computed once. But it
is still multiple sequential jobs, not concurrent jobs.
On Wed, Mar 9, 2016 at 3:29 PM, Jan Štěrba wrote:
> Hi Andy,
>
> its nice to see that we are not
You can use S3's listKeys API and do a diff between consecutive listKeys to
identify what's new.
Are there multiple files in each zip? Single-file archives are processed
just like text, as long as the compression is one of the supported formats.
Regards
Sab
On Wed, Mar 9, 2016 at 10:33 AM, Benjami