Hi,
I have a Twitter Spark stream initialized in the following way:
val ssc: StreamingContext = SparkLauncher.getSparkScalaStreamingContext()
val config = getTwitterConfigurationBuilder.build()
val auth: Option[twitter4j.auth.Authorization] = Some(
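For reference, a minimal sketch of how such a stream is usually wired together with spark-streaming-twitter; the SparkLauncher and getTwitterConfigurationBuilder helpers above are specific to the poster's code, so the names below are assumptions:

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.twitter.TwitterUtils
import twitter4j.auth.{Authorization, OAuthAuthorization}
import twitter4j.conf.ConfigurationBuilder

// Assumed setup: build twitter4j OAuth from a ConfigurationBuilder whose
// consumer/access keys were set elsewhere, then create the tweet DStream.
def tweetStream(ssc: StreamingContext) = {
  val config = new ConfigurationBuilder().build()
  val auth: Option[Authorization] = Some(new OAuthAuthorization(config))
  TwitterUtils.createStream(ssc, auth)
}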
Hi,
I have built a Spark application with IDEA. When I run SparkPi, IDEA throws
an exception like this:
Exception in thread "main" java.lang.NoClassDefFoundError: javax/servlet/FilterRegistration
  at org.spark-project.jetty.servlet.ServletContextHandler.<init>(ServletContextHandler.java:136)
  at org.sp
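This error usually means the Servlet 3.0 API classes are missing from (or shadowed on) the IDE's run classpath. The thread is cut off before any answer, so the sbt line below is only a commonly suggested fix, not one confirmed here:

// Hypothetical build.sbt addition: provides javax.servlet.FilterRegistration (Servlet 3.0)
libraryDependencies += "javax.servlet" % "javax.servlet-api" % "3.0.1"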
I wanted to ask a general question about Hadoop/YARN and Apache Spark
integration. I know that Hadoop on a physical cluster has rack awareness,
i.e. it attempts to minimise network traffic by saving replicated blocks
within a rack. I wondered whether, when Spark is configured to use YARN
This is likely due to data skew. If you are using key-value pairs, one key
may have far more records than the other keys. Do you have any groupBy
operations? (A quick way to check for skew is sketched below.)
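A minimal sketch of one way to check, assuming a key-value RDD; the names and types are illustrative, not from the thread:

import org.apache.spark.rdd.RDD

// Print the ten heaviest keys; a very lopsided count here usually explains
// one straggling task in a groupBy/reduceByKey stage.
def showHeaviestKeys(pairs: RDD[(String, String)]): Unit = {
  val keyCounts = pairs.mapValues(_ => 1L).reduceByKey(_ + _)
  keyCounts.sortBy(_._2, ascending = false).take(10).foreach(println)
}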
David
On Tue, Jul 14, 2015 at 9:43 AM, shahid wrote:
> hi
>
> I have a 10 node cluster. I loaded the data onto HDFS, so the no. of
> par
It might be a network issue. The error states that it failed to bind the
server IP address.
Chester
Sent from my iPhone
> On Jul 18, 2015, at 11:46 AM, Amjad ALSHABANI wrote:
>
> Does anybody have any idea about the error I'm having? I am really
> clueless... and would appreciate any idea :)
>
> Thanks i
Does anybody have any idea about the error I'm having? I am really
clueless... and would appreciate any idea :)
Thanks in advance
Amjad
On Jul 17, 2015 5:37 PM, "Amjad ALSHABANI" wrote:
> Hello,
>
> First of all, I'm a newbie in Spark.
>
> I'm trying to start the spark-shell with a YARN cluster by run
Try this (replace ... with the appropriate values for your environment):
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
val sc = new SparkContext(...)
val documents = sc.wholeTextFile
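The snippet is cut off above; based on the HashingTF and Vector imports, a hedged guess at how it continues (the path and whitespace tokenization are placeholders, not from the thread):

// Read whole files, tokenize each document, and hash the terms into TF vectors.
val documents: RDD[Seq[String]] =
  sc.wholeTextFiles("hdfs:///path/to/docs")
    .map { case (_, text) => text.split("\\s+").toSeq }
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)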
Hi all,
I'm aware of the support for schema evolution via DataFrame API. Just
wondering what would be the best way to go about dealing with schema
evolution with Hive metastore tables. So, say I create a table via SparkSQL
CLI, how would I deal with Parquet schema evolution?
Thanks,
J
I am facing the same issue; I tried this but got a compilation error for
the "$" in the explode function, so I had to modify it as below to make it
work:
df.select(explode(new Column("entities.user_mentions")).as("mention"))
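For what it's worth, the "$" syntax only compiles when the SQLContext implicits are in scope. A small hedged sketch of that variant; the df name and column path come from the snippet above, everything else is an assumption:

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.explode

// The $"..." string-to-Column syntax comes from the SQLContext's implicits.
def explodeMentions(sqlContext: SQLContext, df: DataFrame): DataFrame = {
  import sqlContext.implicits._
  df.select(explode($"entities.user_mentions").as("mention"))
}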
On Wed, Jun 24, 2015 at 2:48 PM, Michael Armbrust
wrote:
> Star
Here is a related thread:
http://search-hadoop.com/m/q3RTtPmjSJ1Dod92
> On Jul 15, 2015, at 7:41 AM, k0ala wrote:
>
> Hi,
>
> I have been working a bit with RDDs, and am now taking a look at DataFrames.
> The schema definition using case classes looks very attractive:
>
> https://spark.apache.
Hi
I am trying to use a DataFrame and save it to Elasticsearch using the new
Hadoop API (because I am using Python). Can anyone guide me on whether this
is even possible?
--
Best Regards,
Ayan Guha
I have a large dataset stored in a BigQuery table and I would like to load
it into a PySpark RDD for ETL data processing.
I realized that BigQuery supports the Hadoop Input / Output format
https://cloud.google.com/hadoop/writing-with-bigquery-connector
and pyspark should be able to use this int
Hi.
What I would do in your case would be something like this.
Let's call the two datasets qs and ds, where qs is an array of vectors and
ds is an RDD[(dsID: Long, Vector)].
Do the following:
1) Create a k-NN class that can keep track of the k nearest neighbors seen
so far. It must have a qsID a
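The message is cut off here; a rough sketch of what such a k-NN tracker might look like (the class and field names and the max-heap eviction strategy are assumptions, not the author's actual design):

import scala.collection.mutable

// Keeps the k closest (distance, dsID) pairs seen so far for one query vector.
class KNNTracker(val qsID: Long, k: Int) extends Serializable {
  // Max-heap on distance, so the current worst neighbor is cheap to evict.
  private val heap =
    mutable.PriorityQueue.empty[(Double, Long)](Ordering.by[(Double, Long), Double](_._1))

  def offer(distance: Double, dsID: Long): Unit = {
    if (heap.size < k) heap.enqueue((distance, dsID))
    else if (distance < heap.head._1) { heap.dequeue(); heap.enqueue((distance, dsID)) }
  }

  def neighbors: Seq[(Double, Long)] = heap.toSeq.sortBy(_._1)
}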
Hi.
You can use a broadcast variable to make data available to all the nodes in
your cluster, and it can live longer than just the current distributed task.
For example, if you need to access a large structure in multiple sub-tasks,
instead of sending that structure again and again with each sub-t
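A minimal sketch of the pattern being described, with a made-up lookup table standing in for the large structure:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Ship the lookup table to every executor once, then reuse it from any task.
def enrich(sc: SparkContext, ids: RDD[Int]): RDD[String] = {
  val lookup = Map(1 -> "a", 2 -> "b", 3 -> "c")          // imagine this is large
  val lookupBc = sc.broadcast(lookup)                      // broadcast once per node
  ids.map(id => lookupBc.value.getOrElse(id, "unknown"))   // read it in each task
}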
Even if I remove the numpy calls (no matrices loaded), the same exception
occurs.
Can anyone tell me what createDataFrame does internally? Are there any
alternatives to it?
On Fri, Jul 17, 2015 at 6:43 PM, Akhil Das
wrote:
> I suspect it's numpy filling up memory.
>
> Thanks
> Best Regards
>
> On F
Hi,
On Windows, in local mode, using PySpark, I got an error about "excessively
deep recursion".
I'm using a module for lemmatizing/stemming, which uses a DLL and some
binary files (the module is a Python wrapper around C code).
Spark version 1.4.0.
Any idea what is going on?
Hi.
Assuming you have the data in an RDD, you can save the RDD (regardless of
structure) with myRDD.saveAsObjectFile("path"), where "path" can be
"hdfs:///myfolderonHDFS" or a path on the local file system.
Alternatively, you can also use .saveAsTextFile().
Regards,
Gylfi.
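A small sketch of both options with placeholder paths (adjust them to your environment):

import org.apache.spark.SparkContext

def saveBothWays(sc: SparkContext): Unit = {
  val data = sc.parallelize(Seq((1, "a"), (2, "b")))
  // Java-serialized binary files; read back with sc.objectFile[(Int, String)]("...")
  data.saveAsObjectFile("hdfs:///myfolderonHDFS/objects")
  // Plain text, one record's toString per line
  data.saveAsTextFile("hdfs:///myfolderonHDFS/text")
}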
You could even try changing the block size of the input data on HDFS (it can
be done on a per-file basis), and that would get all workers going right from
the get-go in Spark.
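One hedged way to do that from Spark itself, assuming the data is rewritten (dfs.blocksize only affects files written after it is set, and this is not the only way to change a file's block size):

import org.apache.spark.SparkContext

// Illustrative only: rewrite the input with 64 MB HDFS blocks so a later job
// sees more input splits and therefore starts with more tasks.
def rewriteWithSmallerBlocks(sc: SparkContext, in: String, out: String): Unit = {
  sc.hadoopConfiguration.set("dfs.blocksize", (64L * 1024 * 1024).toString)
  sc.textFile(in).saveAsTextFile(out)
}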
You may want to look into using the pipe command:
http://blog.madhukaraphatak.com/pipe-in-spark/
http://spark.apache.org/docs/0.6.0/api/core/spark/rdd/PipedRDD.html
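A minimal sketch of RDD.pipe, assuming the external command reads lines from stdin and writes lines to stdout (the tr command is just an example):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Each partition's elements are written to the command's stdin, one per line;
// the command's stdout lines become the elements of the resulting RDD.
def shout(sc: SparkContext, words: Seq[String]): RDD[String] =
  sc.parallelize(words).pipe("tr 'a-z' 'A-Z'")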
Hi.
"All transformations in Spark are lazy, in that they do not compute their
results right away. Instead, they just remember the transformations applied
to some base dataset (e.g. a file). The transformations are only computed
when an action requires a result to be returned to the driver program
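To make the quoted point concrete, a tiny sketch with a placeholder path: nothing is read until the action on the last line.

import org.apache.spark.SparkContext

def countErrors(sc: SparkContext): Long = {
  val lines  = sc.textFile("hdfs:///logs/app.log")   // transformation: nothing runs yet
  val errors = lines.filter(_.contains("ERROR"))     // still only a recorded lineage
  errors.count()                                     // action: the job executes here
}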
Hi.
If I just look at the two pics, I see that there is only one sub-task that
takes all the time: the flatMapToPair at Coef... line 52.
I also see that there are only two partitions making up the input, and thus
probably only two workers active.
Try repartitioning the data into mo
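A hedged sketch of the repartitioning suggestion; the partition count is an arbitrary example, not a recommendation from the thread:

import org.apache.spark.rdd.RDD

// Spread the input over more partitions so more than two tasks can run in parallel.
def spreadOut[T](input: RDD[T]): RDD[T] =
  input.repartition(32)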
I think you have a mistake in the call to jdbc(); it should be:
jdbc(self, url, table, mode, properties)
You had used properties as the third parameter.
On Fri, Jul 17, 2015 at 10:15 AM, Young, Matthew T
wrote:
> Hello,
>
> I am testing Spark interoperation with SQL Server via JDBC with Microsoft’s
>
Hi.
To be honest, I don't really understand your problem statement :( but let's
just talk about how .flatMap works.
Unlike .map(), which only allows a one-to-one transformation, .flatMap()
allows 0, 1 or many outputs per item processed, but the output must take the
form of a sequence of the same
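A tiny sketch of the map vs. flatMap difference being described:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def mapVsFlatMap(sc: SparkContext): (RDD[Array[String]], RDD[String]) = {
  val lines = sc.parallelize(Seq("a b c", "d e", "f"))
  val mapped    = lines.map(_.split(" "))        // exactly one output per line (an array)
  val flattened = lines.flatMap(_.split(" "))    // 0..n outputs per line (individual words)
  (mapped, flattened)
}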