Somewhat related, though this JIRA is on 1.6.
https://issues.apache.org/jira/browse/SPARK-13288#
Hello,
I am learning SparkR by myself and have little computing background.
I am following the examples on
http://spark.apache.org/docs/latest/sparkr.html
and running
sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3")
sqlContext <- sparkRSQL.init(sc)
people <- read.df(sqlC
terminal_type = 0: 260,000,000 rows, covering almost half of the whole
data. terminal_type = 25066: just 3,800 rows.
orc
tblproperties("orc.compress"="SNAPPY","orc.compress.size"="262141","orc.stripe.size"="268435456","orc.row.index.stride"="10","orc.create.index"="true","orc.bloom.filter.columns"
In Spark SQL, a timestamp is the number of microseconds since the epoch, so
it has nothing to do with the timezone.
When you compare it against unix_timestamp or a string, it's better to
convert these into timestamps and then compare them.
In your case, the where clause should be:
where (created > cast('{0}' as timesta
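A minimal sketch of that pattern in Scala (the DataFrame `df`, the `created` column, and the literal value are illustrative assumptions):

    import org.apache.spark.sql.functions.lit

    // Compare a timestamp column against a timestamp, not a raw string:
    // cast the string literal to timestamp before the comparison.
    val recent = df.where(df("created") > lit("2016-03-17 00:00:00").cast("timestamp"))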
Hi all,
I had a couple of questions.
1. Is there documentation on how to add the graphframes or any other
package for that matter on the google dataproc managed spark clusters ?
2. Is there a way to add a package to an existing pyspark context through a
jupyter notebook ?
--aj
I am using the following code snippet in Scala:
val dict: RDD[String] = sc.textFile("path/to/csv/file")
val dict_broadcast = sc.broadcast(dict.collectAsMap())
On compiling, it generates this error:
scala:42: value collectAsMap is not a member of
org.apache.spark.rdd.RDD[String]
val dict_broad
Hi,
I am dynamically doing union all and adding new column too
val dfresult =
> dfAcStamp.select("Col1","Col1","Col3","Col4","Col5","Col6","col7","col8","col9")
> val schemaL = dfresult.schema
> var dffiltered = sqlContext.createDataFrame(sc.emptyRDD[Row], schemaL)
> for ((key,values) <- lcrMap) {
bq. $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
Do you mind showing more of your code involving the map() ?
On Thu, Mar 17, 2016 at 8:32 AM, Dirceu Semighini Filho <
dirceu.semigh...@gmail.com> wrote:
> Hello,
> I found a strange behavior after executing a prediction with MLlib.
> My code
Please refrain from posting such messages on this email thread.
This is specific to the Spark ecosystem and not an avenue to advertise an
entity/company.
Thank you.
-
Neelesh S. Salian
Cloudera
One way is to rename the columns using toDF.
For eg:
val df = Seq((1, 2),(1,4),(2,3) ).toDF("a","b")
df.printSchema()
val renamedf = df.groupBy('a).agg(sum('b)).toDF("mycola", "mycolb")
renamedf.printSchema()
Best regards,
Sunitha
> On Mar 18, 2016, at 9:10 AM, andres.fernan...@wellsfargo.c
> But I guess I cannot add a package once i launch the pyspark context right ?
Correct. Potentially, if you really really wanted to, you could maybe
(with lots of pain) load packages dynamically with some class-loader
black magic, but Spark does not provide that functionality.
On Thu, Mar 17, 201
There's 1 topic per partition, so you're probably better off dealing
with topics that way rather than at the individual message level.
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
Look at the discussion of "HasOffsetRanges"
If you r
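A short sketch of that pattern (assuming `stream` is a direct Kafka stream created with KafkaUtils.createDirectStream):

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    stream.foreachRDD { rdd =>
      // Each RDD partition maps to exactly one Kafka topic/partition
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsetRanges.foreach { o =>
        println(s"topic=${o.topic} partition=${o.partition} from=${o.fromOffset} until=${o.untilOffset}")
      }
    }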
It turned out that Col1 appeared twice in the select :-)
> On Mar 16, 2016, at 7:29 PM, Divya Gehlot wrote:
>
> Hi,
> I am dynamically doing union all and adding new column too
>
>> val dfresult =
>> dfAcStamp.select("Col1","Col1","Col3","Col4","Col5","Col6","col7","col8","col9")
>> val schem
Hi,
I tried to replicate the example of joining DStream with lookup RDD from
http://spark.apache.org/docs/latest/streaming-programming-guide.html#transform-operation.
It works fine, but when I enable checkpointing for the StreamingContext
and let the application recover from a previously cr
OK. I did take a look at them. So once I have an accumulator for a HashSet,
how can I check if a particular key is already present in the HashSet
accumulator? I don't see any .contains method there. My requirement is that
I need to keep accumulating the keys in the HashSet across all the tasks in
v
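For what it's worth, a Spark 1.x set-valued accumulator can be sketched with AccumulableParam as below; note, however, that accumulator values are only reliably readable on the driver, so a contains() check from inside running tasks is not something accumulators support. (This is an illustrative sketch, not the poster's code; `sc` and an RDD[String] of keys are assumed.)

    import scala.collection.mutable
    import org.apache.spark.AccumulableParam

    // Accumulates String keys into a mutable.HashSet[String]
    object HashSetParam extends AccumulableParam[mutable.HashSet[String], String] {
      def addAccumulator(acc: mutable.HashSet[String], elem: String) = acc += elem
      def addInPlace(a: mutable.HashSet[String], b: mutable.HashSet[String]) = a ++= b
      def zero(initial: mutable.HashSet[String]) = mutable.HashSet.empty[String]
    }

    val keysAcc = sc.accumulable(mutable.HashSet.empty[String])(HashSetParam)
    keysRdd.foreach(k => keysAcc += k)   // keysRdd: RDD[String], accumulated across all tasks
    val seen = keysAcc.value             // driver side only
    val present = seen.contains("someKey")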
I'm having trouble with that for pyspark, yarn and graphframes. I'm using:-
pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5
which starts and gives me a REPL, but when I try
from graphframes import *
I get
No module named graphframes
without '--master yarn' it
Hi all,
I'm running a query that looks like the following:
Select col1, count(1)
From (Select col2, count(1) from tab2 group by col2)
Inner join tab1 on (col1=col2)
Group by col1
This creates a very large shuffle, 10 times the data size, as if the subquery
was executed for each row.
Anything ca
I would recommend against writing unit tests for Spark programs, and
instead focus on integration tests of jobs or pipelines of several
jobs. You can still use a unit test framework to execute them. Perhaps
this is what you meant.
You can use any of the popular unit test frameworks to drive your
t
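A minimal sketch of that idea (ScalaTest driving a job end to end against a local master; the job body and names are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.FunSuite

    class WordCountPipelineSpec extends FunSuite {
      test("pipeline produces the expected counts end to end") {
        val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("it-test"))
        try {
          // stand-in for the real job/pipeline under test
          val counts = sc.parallelize(Seq("a", "b", "a")).map((_, 1)).reduceByKey(_ + _).collectAsMap()
          assert(counts("a") == 2)
        } finally {
          sc.stop()
        }
      }
    }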
Any objections? Please articulate your use case. SparkEnv is a weird one
because it was documented as "private" but not marked as so in class
visibility.
* NOTE: This is not intended for external use. This is exposed for Shark
and may be made private
* in a future release.
I do see Hive
I am using python spark 1.6 and the --packages
datastax:spark-cassandra-connector:1.6.0-M1-s_2.10
I need to convert a timestamp string into a Unix epoch timestamp. The
unix_timestamp() function assumes the current time zone. However, my
string data is UTC and encodes the time zone as zero. I
Is that happening only at startup, or during processing? If that's
happening during normal operation of the stream, you don't have enough
resources to process the stream in time.
There's not a clean way to deal with that situation, because it's a
violation of preconditions. If you want to modify
Hi guys, I'm new to MLlib on Spark. After reading the documentation, it seems
that MLlib does not support deep learning. I want to know: is there any way
to implement deep learning on Spark?
Must I use a third-party package like Caffe or TensorFlow?
or
Is a deep learning module listed in the MLlib de
Good morning. I have a dataframe and would like to group by on two fields, and
perform a sum aggregation on more than 500 fields, though I would like to keep
the same name for the 500 fields (instead of sum(Field)). I do have the
field names in an array. Could anybody help with this ques
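One possible approach (a sketch; `df`, the grouping keys and `fieldNames` are illustrative) is to build the aggregate expressions from the array and alias each one back to its original name:

    import org.apache.spark.sql.functions.sum

    val fieldNames: Array[String] = Array("f1", "f2", "f3")  // in practice, the ~500 names
    val aggExprs = fieldNames.map(c => sum(c).alias(c))      // keep the original column name
    val result = df.groupBy("key1", "key2").agg(aggExprs.head, aggExprs.tail: _*)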
On Thu, Mar 17, 2016 at 3:51 AM, charles li wrote:
> Hi, Alexander,
>
> that's awesome, and when will that feature be released ? Since I want to
> know the opportunity cost between waiting for that release and use caffe or
> tensorFlow ?
>
I don't expect MLlib will be able to compete with major
Regarding bloom filters,
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-12417
From: Joseph
Sent: Wednesday, March 16, 2016 9:46:25 AM
To: user
Cc: user; user
Subject: Re: Re: The build-in indexes in ORC file does
Hi, Alexander,
that's awesome, and when will that feature be released? I want to
know the opportunity cost between waiting for that release and using Caffe or
TensorFlow.
great thanks again
On Thu, Mar 17, 2016 at 10:32 AM, Ulanov, Alexander <
alexander.ula...@hpe.com> wrote:
> Hi Charles
Thank you for the info, Steve.
I always believed (IMO) that there is an optimal point where one can
plot the projected NN memory (assuming 1GB --> 40TB of data) against the number
of nodes. For example, heuristically, how many nodes would be sufficient for
1PB of storage with nodes each having 512GB of mem
Hi,
I am adding a new column and renaming it at the same time, but the renaming
doesn't have any effect.
dffiltered =
> dffiltered.unionAll(dfresult.withColumn("Col1",lit("value1").withColumn("Col2",lit("value2")).cast("int")).withColumn("Col3",lit("values3")).withColumnRenamed("Col1","Col1Rename").drop
Any way to cache the subquery or force a broadcast join without persisting it?
y
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: March-17-16 8:59 PM
To: Younes Naguib
Cc: user@spark.apache.org
Subject: Re: Subquery performance
Try running EXPLAIN on both version of the query.
Likel
Hi,
I'm testing collaborative filtering with MLlib.
Making a model by ALS.trainImplicit (or train) seems scalable as far as I
have tested,
but I'm wondering how I can get all the recommendation results efficiently.
The predictAll method can get all the results,
but it needs the whole user-product
This is the line where NPE came from:
if (conf.get(SCAN) != null) {
So Configuration instance was null.
On Fri, Mar 18, 2016 at 9:58 AM, Lubomir Nerad
wrote:
> The HBase version is 1.0.1.1.
>
> Thanks,
> Lubo
>
>
> On 18.3.2016 17:29, Ted Yu wrote:
>
> I looked at the places in SparkContex
On 11 Mar 2016, at 16:25, Mich Talebzadeh
mailto:mich.talebza...@gmail.com>> wrote:
Hi Steve,
My argument has always been that if one is going to use Solid State Disks
(SSD), it makes sense to have them for the NN disks, for start-up from fsimage etc.
Obviously the NN lives in memory. Would you also reromme
CALL FOR PAPERS
11th Workshop on Virtualization in High-Performance Cloud Computing (VHPC
'16)
held in conjunction with the International Supercomputing Conference - High
Performance,
June 19-23, 2016, Frankfurt, Germany.
===
That's a networking error when the driver is attempting to contact
leaders to get the latest available offsets.
If it's a transient error, you can look at increasing the value of
spark.streaming.kafka.maxRetries, see
http://spark.apache.org/docs/latest/configuration.html
If it's not a transient
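For example, a hedged sketch of bumping that setting when building the conf (the value is arbitrary, and it can equally be passed via --conf on spark-submit):

    import org.apache.spark.SparkConf

    // spark.streaming.kafka.maxRetries controls how often the driver retries
    // fetching offsets from the Kafka leaders (default is 1)
    val conf = new SparkConf()
      .setAppName("my-streaming-app")                // name is a placeholder
      .set("spark.streaming.kafka.maxRetries", "5")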
I am using Spark streaming and reading data from Kafka using
KafkaUtils.createDirectStream. I have the "auto.offset.reset" set to
smallest.
But in some Kafka partitions, I get kafka.common.OffsetOutOfRangeException
and my spark job crashes.
I want to understand if there is a graceful way to handl
My suspicion is that your input file partitions are small, hence a small number of
tasks are started. Can you provide some more details, like how you load the
files and how the result size gets to around 500GB?
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)
https://in.linkedin.com/in/ri
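If small inputs do turn out to be the cause, one hedged workaround is to raise the parallelism before or during the join; the paths, key parsing and partition count below are illustrative only:

    // Build pair RDDs and ask the join for more partitions so the work
    // spreads across the cluster instead of 1-2 tasks.
    val left  = sc.textFile("hdfs:///path/to/fileA").map(l => (l.split(",")(0), l))
    val right = sc.textFile("hdfs:///path/to/fileB").map(l => (l.split(",")(0), l))
    val joined = left.join(right, numPartitions = 40)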
Err, whoops, looks like this is a user app and not building Spark itself,
so you'll have to change your deps to use the 2.11 versions of Spark.
e.g. spark-streaming_2.10 -> spark-streaming_2.11.
On Wed, Mar 16, 2016 at 7:07 PM Josh Rosen wrote:
> See the instructions in the Spark documentation:
On Wed, Mar 16, 2016 at 3:29 PM, Mridul Muralidharan
wrote:
> b) Shuffle manager (to get shuffle reader)
>
What's the use case for shuffle manager/reader? This seems like using super
internal APIs in applications.
I've been trying to get log4j2 and logback to play nicely with Spark
1.6.0 so I can properly offload my logs to a remote server.
I've attempted the following things:
1. Setting logback/log4j2 on the class path for both the driver and worker
nodes
2. Passing -Dlog4j.configurationFile= and -Dl
Hi,
I have a spark application for batch processing in standalone cluster. The
job is to query the database and then do some transformation, aggregation,
and several actions such as indexing the result into the elasticsearch.
If I don't call sc.stop(), the Spark application won't stop and take
Hi,
regarding 1, packages are resolved locally. That means that when you
specify a package, spark-submit will resolve the dependencies and
download any jars on the local machine, before shipping* them to the
cluster. So, without a priori knowledge of dataproc clusters, it
should be no different to
How much data are you querying? What is the query? How selective is it supposed
to be? What is the block size?
> On 16 Mar 2016, at 11:23, Joseph wrote:
>
> Hi all,
>
> I know that ORC provides three levels of indexes within each file: file
> level, stripe level, and row level.
> The fi
Hi all,
The Apache Beam Spark runner is now available at:
https://github.com/apache/incubator-beam/tree/master/runners/spark Check it out!
The Apache Beam (http://beam.incubator.apache.org/) project is a unified model
for building data pipelines using Google’s Dataflow programming model, and now
I just wrote a blog post on Unit testing Apache Spark with py.test
https://engblog.nextdoor.com/unit-testing-apache-spark-with-py-test-3b8970dc013b
If you prefer using the py.test framework, then it might be useful.
-vikas
On Wed, Mar 2, 2016 at 10:59 AM, radoburansky
wrote:
> I am sure you ha
Cody et al.,
I am seeing a similar error. I've increased the number of retries. Once
I've got a job up and running I'm seeing it retry correctly. However, I am
having trouble getting the job started - number of retries does not seem to
help with startup behavior.
Thoughts?
Regards,
Bryan Jeff
You need to use wholeTextFiles to read the whole file at once. Otherwise,
it can be split.
DB Tsai - Sent From My Phone
On Mar 17, 2016 12:45 AM, "Blaž Šnuderl" wrote:
> Hi.
>
> We have json data stored in S3 (json record per line). When reading the
> data from s3 using the following code we sta
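A hedged sketch of that suggestion for Spark 1.x, assuming each file holds one (possibly multi-line) JSON document; the bucket and path are placeholders:

    // Read each file as a single (path, content) record so a document is never split across tasks
    val files = sc.wholeTextFiles("s3n://my-bucket/json-data/*")
    val df = sqlContext.read.json(files.values)   // one JSON document per record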
bq. .drop("Col9")
Could it be due to the above ?
On Wed, Mar 16, 2016 at 7:29 PM, Divya Gehlot
wrote:
> Hi,
> I am dynamically doing union all and adding new column too
>
> val dfresult =
>> dfAcStamp.select("Col1","Col1","Col3","Col4","Col5","Col6","col7","col8","col9")
>> val schemaL = dfresu
Experts,
Please share your valued advice.
I have spark 1.5.2 set up as standalone for now and I have started the master
as below
start-master.sh
I also have modified the conf/slaves file to have
# A Spark Worker will be started on each of the machines listed below.
localhost
workerhost
On the localhost I
It is defined in:
core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
On Thu, Mar 17, 2016 at 8:55 PM, Shishir Anshuman wrote:
> I am using following code snippet in scala:
>
>
> *val dict: RDD[String] = sc.textFile("path/to/csv/file")*
> *val dict_broadcast=sc.broadcast(dict.collect
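In other words, collectAsMap only exists on RDDs of key/value pairs (via PairRDDFunctions). A minimal sketch, assuming each CSV line is "key,value":

    import org.apache.spark.rdd.RDD

    val dict: RDD[(String, String)] = sc.textFile("path/to/csv/file").map { line =>
      val fields = line.split(",")
      (fields(0), fields(1))            // a pair RDD, so collectAsMap is available
    }
    val dict_broadcast = sc.broadcast(dict.collectAsMap())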
On 19 Mar 2016, at 02:25, Koert Kuipers
mailto:ko...@tresata.com>> wrote:
Spark on YARN is nice because I can bring my own Spark. I am worried that the
shuffle service forces me to use some "sanctioned" Spark version that is
officially "installed" on the cluster.
So... can I safely install th
The HBase version is 1.0.1.1.
Thanks,
Lubo
On 18.3.2016 17:29, Ted Yu wrote:
I looked at the places in SparkContext.scala where NewHadoopRDD is
constructed.
It seems the Configuration object shouldn't be null.
Which hbase release are you using (so that I can see which line the
NPE came from)
Hi All,
On running concurrent Spark jobs (with a huge number of tasks) against the same
SparkContext, there is high scheduler delay. We have the FAIR scheduling policy set
and we also tried a different pool for each job, but still no
improvement. What are the tuning options to improve scheduler delay?
Thanks,
Doesn't FileInputFormat require type parameters? Like so:
class RawDataInputFormat[LW <: LongWritable, RD <: RDRawDataRecord]
extends FileInputFormat[LW, RD]
I haven't verified this but it could be related to the compile error
you're getting.
On Thu, Mar 17, 2016 at 9:53 AM, Benyi Wang wrote:
>
Hello, guys!
I've been developing a kind of framework on top of Spark, and my idea is to
bundle the framework jars and some extra configs with Spark and pass it
to other developers for their needs, so that devs can use this bundle and
run the usual Spark stuff but with the extra flavor that the framewor
Hi All,
I have to join 2 files, both not very big (say a few MBs only), but the result can be
huge, say 500GB to TBs of data. Now I have tried using Spark's join()
function, but I'm noticing that the join executes on only 1 or 2 nodes at
most. Since I have a cluster of 5 nodes, I tri
Hello,
Looking at
https://spark.apache.org/docs/1.5.1/api/python/_modules/pyspark/sql/types.html
and can't wrap my head around how to convert string data type names to
actual pyspark.sql.types data types.
Does pyspark.sql.types have an interface to return StringType() for "string",
IntegerType()
Hi,
Scala version: 2.11.7 (had to upgrade the Scala version to enable case
classes to accept more than 22 parameters).
Spark version: 1.6.1.
PFB pom.xml
Getting the below error when trying to set up Spark in the IntelliJ IDE:
16/03/16 18:36:44 INFO spark.SparkContext: Running Spark version 1.6.1
Exception
We are considering deploying a notebook server for use by two kinds of users:
1. Interactive dashboard
   1. I.e. forms allow users to select data sets and visualizations
   2. Review real time graphs of data captured by our Spark streams
2. General notebooks for Data Scientists
My concern is inter
If you prefer the py.test framework, I just wrote a blog post with some
examples:
Unit testing Apache Spark with py.test
https://engblog.nextdoor.com/unit-testing-apache-spark-with-py-test-3b8970dc013b
On Fri, Feb 5, 2016 at 11:43 AM, Steve Annessa
wrote:
> Thanks for all of the responses.
>
>
Hello,
I am trying to figure out how to unzip zip files in Spark Streaming. Within
each zip file will be a series of xml files which will also need parsing.
Are there libraries that work with DStream that parse a zip or parse an XML
file? I have seen the Databricks XML library, but I do not think
On 17 Mar 2016, at 16:01, Allen George
mailto:allen.geo...@gmail.com>> wrote:
Hi guys,
I'm having a problem where respawning a failed executor during a job that
reads/writes parquet on S3 causes subsequent tasks to fail because of missing
AWS keys.
Setup:
I'm using Spark 1.5.2 with Hadoop 2
See the instructions in the Spark documentation:
https://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211
On Wed, Mar 16, 2016 at 7:05 PM satyajit vegesna
wrote:
>
>
> Hi,
>
> Scala version:2.11.7(had to upgrade the scala verison to enable case
> clasess to accept more tha
Hi,
I am testing some parallel processing of Spark applications.
I have a two node spark cluster and currently running two worker processes
on each in Yarn-client mode. The master has 12 cores and 24GB of RAM. The
worker node has 4GB of RAM and 2 cores (well an old 32 bit host). The OS on
both is
Btw, here is a great article about accumulators and all their related
traps!
http://imranrashid.com/posts/Spark-Accumulators/ (I'm not the author)
On 16 March 2016 at 18:24, swetha kasireddy
wrote:
> OK. I did take a look at them. So once I have an accumulater for a
> HashSet, how can I check if
Sorry for the late reply. Yes, RDRawDataRecord is my object. It is defined in
another Java project (jar); I get it with Maven. My MapReduce program also
uses it and works.
On Fri, Mar 18, 2016 at 12:48 AM, Mich Talebzadeh wrote:
> Hi Tony,
>
> Is
>
> com.kiisoo.aegis.bd.common.hdfs.RDRawDataRecord
>
> O
Hi,
I have a Spark Streaming (with Kafka) job; after running for several days, it
failed with the following errors:
ERROR DirectKafkaInputDStream:
ArrayBuffer(java.nio.channels.ClosedChannelException)
Any idea what could be wrong? Could it be a Spark Streaming buffer overflow
issue?
Regards
*** from the
On Thu, Mar 17, 2016 at 3:02 PM, Andy Davidson
wrote:
> I am using pyspark 1.6.0 and
> datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series
> data
>
> The data is originally captured by a spark streaming app and written to
> Cassandra. The value of the timestamp comes from
>
>
Try to troubleshoot why it is happening; maybe some messages are too big to
be read from the topic? I remember getting that error, and that was the cause.
On Fri, Mar 18, 2016 at 11:16 AM Ramkumar Venkataraman <
ram.the.m...@gmail.com> wrote:
> I am using Spark streaming and reading data from Kafka
Hi,
I am facing an issue while deduplicating the keys in RDD (Code Snippet
below).
I have a few sequence files; some of them have duplicate entries. I am trying
to drop duplicate values for each key.
Here are two methods with code snippets:
val path = "path/to/sequence/file"
val rdd1 = ctx.sequenc
Solved the issue by setting up the same heartbeat interval and pauses in
both actor systems
akka {
loggers = ["akka.event.slf4j.Slf4jLogger"]
loglevel = DEBUG
logging-filter = "akka.event.slf4j.Slf4jLoggingFilter"
log-dead-letters = on
log-dead-letters-during-shutdown = on
daemonic =
I looked at the places in SparkContext.scala where NewHadoopRDD is
constructed.
It seems the Configuration object shouldn't be null.
Which hbase release are you using (so that I can see which line the NPE
came from) ?
Thanks
On Fri, Mar 18, 2016 at 8:05 AM, Lubomir Nerad
wrote:
> Hi,
>
> I tri
Thanks Jakob, Felix. I am aware you can do it with --packages, but I was
wondering if there is a way to do something like "!pip install "
like I do for other packages from a Jupyter notebook for Python. But I guess
I cannot add a package once I launch the pyspark context, right?
On Thu, Mar 17, 2016
For some reason, writing data from the Spark shell to CSV using the `csv
package` takes almost an hour to dump to disk. Am I going crazy or did I do
this wrong? I tried writing to Parquet first and it's fast as normal.
On my MacBook Pro (16GB, 2.2 GHz Intel Core i7, 1TB) the machine's CPUs go
crazy and i
Have you tried df.saveAsParquetFile? I think that method is in the DataFrame API.
HTH
Marco
On 19 Mar 2016 7:18 pm, "Vincent Ohprecio" wrote:
>
> For some reason writing data from Spark shell to csv using the `csv
> package` takes almost an hour to dump to disk. Am I going crazy or did I do
> this wrong? I tri
Could you try to cast the timestamp as long?
Internally, timestamps are stored as microseconds in UTC; you will get
seconds in UTC if you cast it to long.
On Thu, Mar 17, 2016 at 1:28 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:
> I am using python spark 1.6 and the --packages
> data
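A small sketch of that suggestion (the DataFrame and column name are illustrative):

    // Casting a timestamp column to long yields seconds since the Unix epoch, in UTC
    val withEpoch = df.select(df("event_ts").cast("long").alias("epoch_seconds"))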
Try running EXPLAIN on both version of the query.
Likely when you cache the subquery, we know that it's going to be small, so we
use a broadcast join instead of shuffling the data.
On Thu, Mar 17, 2016 at 5:53 PM, Younes Naguib <
younes.nag...@tritondigital.com> wrote:
> Hi all,
>
>
>
> I’m running
Sounds like you're using one of the KafkaUtils.createDirectStream
overloads that needs to do some broker communication in order to even
construct the stream, because you aren't providing topicpartitions?
Just wrap your construction attempt in a try / catch and retry in
whatever way makes sense for
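A hedged sketch of that retry idea (the retry count and sleep are placeholders, and `ssc`, `kafkaParams` and `topics` are assumed to exist in the application):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkException
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.InputDStream
    import org.apache.spark.streaming.kafka.KafkaUtils

    def createStreamWithRetry(ssc: StreamingContext,
                              kafkaParams: Map[String, String],
                              topics: Set[String],
                              attempts: Int = 5): InputDStream[(String, String)] = {
      try {
        KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
      } catch {
        case e: SparkException if attempts > 1 =>
          Thread.sleep(5000) // give the brokers a moment before retrying
          createStreamWithRetry(ssc, kafkaParams, topics, attempts - 1)
      }
    }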
> On 17 Mar 2016, at 12:28, Mich Talebzadeh wrote:
>
> Thanks Steve,
>
> For NN it all depends how fast you want a start-up. 1GB of NameNode memory
> accommodates around 42T so if you are talking about 100GB of NN memory then
> SSD may make sense to speed up the start-up. Raid 10 is the best
Thanks Steve,
For NN it all depends how fast you want a start-up. 1GB of NameNode
memory accommodates around 42T so if you are talking about 100GB of NN
memory then SSD may make sense to speed up the start-up. Raid 10 is the
best one that one can get assuming all internal disks.
In general it is
No, sadly, it's not an option.
End users are not my team members, it's for customers, so I have to bundle
the framework and ship it.
There is more to my project than just libs, so end users will have to use
bundle anyway.
On Wed, Mar 16, 2016 at 6:41 PM, Silvio Fiorito <
silvio.fior...@granturing.
Can you call sc.stop() after indexing into Elasticsearch?
> On Mar 16, 2016, at 9:17 PM, Imre Nagi wrote:
>
> Hi,
>
> I have a spark application for batch processing in standalone cluster. The
> job is to query the database and then do some transformation, aggregation,
> and several actions
It complains about the file path "./examples/src/main/resources/people.json".
You can try to use an absolute path instead of a relative path, and make sure the
absolute path is correct.
If that still does not work, you can prefix the path with "file://" in case the
default file scheme for Hadoop is HD
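For instance, a hedged Scala equivalent of that suggestion (the absolute path is a placeholder):

    // Force the local filesystem explicitly when HDFS is the cluster's default scheme
    val people = sqlContext.read.json("file:///opt/spark/examples/src/main/resources/people.json")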
Unfortunately I can't share any snippet quickly as the code is generated,
but for now at least can share the plan. (See it here -
http://pastebin.dqd.cz/RAhm/)
After I've increased spark.sql.autoBroadcastJoinThreshold to 30 from
10 it went through without any problems. With 10 it was a
Hi Ted, thanks for answering.
The map is just that, whenever I try inside the map it throws this
ClassNotFoundException, even if I do map(f => f) it throws the exception.
What is bothering me is that when I do a take or a first it returns the
result, which makes me conclude that the previous code is
I like using the new DataFrame APIs on Spark ML, compared to using RDDs in
the older SparkMLlib. But it seems some of the older APIs are missing. In
particular, '*.mllib.clustering.DistributedLDAModel' had two APIs that I
need now:
topDocumentsPerTopic
topTopicsPerDocument
How can I get at the
I am using pyspark 1.6.0 and
datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series
data
The data is originally captured by a spark streaming app and written to
Cassandra. The value of the timestamp comes from
Rdd.foreachRDD(new VoidFunction2, Time>()
});
I am
Hi Vince,
We had a similar case a while back. I tried two solutions in both Spark on
Hive metastore and Hive on Spark engine.
Hive version 2
Spark as Hive engine 1.3.1
Basically
--1 Move .CSV data into HDFS:
--2 Create an external table (all columns as string)
--3 Create the ORC table (majority
You are running pyspark in Spark client deploy mode. I have run into the same
error as well, and I'm not sure if this is graphframes-specific - the Python
process can't find the graphframes Python code when it is loaded as a Spark
package.
To work around this, I extract the graphframes Python dire
Running the following:
#fix schema for gaid which should not be Double
> from pyspark.sql.types import *
> customSchema = StructType()
> for (col, typ) in tsp_orig.dtypes:
>     if col == 'Agility_GAID':
>         typ = 'string'
>     customSchema.add(col, typ, True)
Getting
ValueError: Could not parse
Thanks - I'll give that a try
cheers
On 20 March 2016 at 09:42, Felix Cheung wrote:
> You are running pyspark in Spark client deploy mode. I have ran into the
> same error as well and I'm not sure if this is graphframes specific - the
> python process can't find the graphframes Python code when
Slight update I suppose?
For some reason, sometimes it will connect and continue and the job will be
completed. But most of the time I still run into this error and the job is
killed and the application doesn't finish.
Still have no idea why this is happening. I could really use some help here.
Spark 1.5 is the latest that I have access to and where this problem
happens.
I don't see it fixed in master, but I might be wrong. Diff attached.
https://raw.githubusercontent.com/apache/spark/branch-1.5/python/pyspark/sql/types.py
https://raw.githubusercontent.com/apache/spark/d57daf1f7732a7ac
Can you give a bit more detail?
- Release of Spark
- Symptom of the renamed column not being recognized
Please take a look at "withColumnRenamed" test in:
sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
On Thu, Mar 17, 2016 at 2:02 AM, Divya Gehlot
wrote:
> Hi,
> I am adding a new
I have data that I pull in using a SQL context and then convert to an RDD.
The problem is that the type in the RDD is [Any, Iterable[Any]],
and I need to have the type RDD[Array[String]] -- i.e., convert the Iterable to an
Array.
Here’s more detail:
val zdata = sqlContext.read.parquet("s3://.. pa
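A hedged sketch of the conversion itself, assuming `grouped` has type RDD[(Any, Iterable[Any])]:

    import org.apache.spark.rdd.RDD

    // Drop the key and turn each group's Iterable[Any] into an Array[String]
    val asArrays: RDD[Array[String]] = grouped.map { case (_, values) =>
      values.map(_.toString).toArray
    }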
For some, like graphframes, that are Spark packages, you could also use
--packages in the command line of spark-submit or pyspark.
See http://spark.apache.org/docs/latest/submitting-applications.html
_
From: Jakob Odersky
Sent: Thursday, March 17, 2016 6:40 PM
Subj
For your last point, spark-submit has:
if [ -z "${SPARK_HOME}" ]; then
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
Meaning the script would determine the proper SPARK_HOME variable.
FYI
On Wed, Mar 16, 2016 at 4:22 AM, Леонид Поляков wrote:
> Hello, guys!
>
>
>
> I’ve been develop
I would say change
class RawDataInputFormat[LW <: LongWritable, RD <: RDRawDataRecord]
extends FileInputFormat
to
class RawDataInputFormat[LongWritable, RDRawDataRecord] extends FileInputFormat
On Thu, Mar 17, 2016 at 9:48 AM, Mich Talebzadeh
wrote:
> Hi Tony,
>
> Is
>
> com.kiisoo.aegis.b
I did some tests on Hive running on MR to get rid of Spark effects.
In an ORC table that has been partitioned, partition elimination with
predicate push down works and the query is narrowed to the partition
itself. I can see that from the number of rows within that partition.
For example below sa
It makes no sense for the worker; the issue is with the executor classpath, not the
driver classpath.
Please answer the actual question, which is not in the "P.S." - that one is just a
note about the driver.
Thanks, Leonid
On Wed, Mar 16, 2016 at 6:21 PM, Ted Yu wrote:
> For your last point, spark-submit has:
>
>