Re: Build k-NN graph for large dataset

2015-08-26 Thread Jaonary Rabarisoa
Thank you all for these links. I'll check them. On Wed, Aug 26, 2015 at 5:05 PM, Charlie Hack wrote: > +1 to all of the above esp. Dimensionality reduction and locality > sensitive hashing / min hashing. > > There's also an algorithm implemented in MLlib called DIMSUM which was > developed at T

Re: spark streaming 1.3 kafka buffer size

2015-08-26 Thread Shushant Arora
Can I change the params fetch.message.max.bytes or spark.streaming.kafka.maxRatePerPartition at run time across batches? Say I detect some failure condition in my system and decide to consume in the next batch interval only 10 messages per partition, and if that succeeds I reset the max limit to unlimi

spark streaming 1.3 kafka topic error

2015-08-26 Thread Shushant Arora
Hi, my streaming application gets killed with the error below: 15/08/26 21:55:20 ERROR kafka.DirectKafkaInputDStream: ArrayBuffer(kafka.common.NotLeaderForPartitionException, kafka.common.NotLeaderForPartitionException, kafka.common.NotLeaderForPartitionException, kafka.common.NotLeaderForPartitionExcep

Re: Differing performance in self joins

2015-08-26 Thread Michael Armbrust
-dev +user I'd suggest running .explain() on both dataframes to understand the performance better. The problem is likely that we have a pattern that looks for cases where you have an equality predicate where either side can be evaluated using one side of the join. We turn this into a hash join.
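A minimal sketch of that suggestion for the self-join case, run in the spark-shell (the JSON path and join column are illustrative, not from the original thread):

  // assumes an existing SQLContext named sqlContext, as in the spark-shell
  val df = sqlContext.read.json("people.json")
  val selfJoin = df.join(df, "id")   // self join on a shared column name
  selfJoin.explain(true)             // prints the logical and physical plans; look for the chosen join strategy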

Re: How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-26 Thread Michael Armbrust
I'd suggest setting sbt to fork when running tests. On Wed, Aug 26, 2015 at 10:51 AM, Mike Trienis wrote: > Thanks for your response Yana, > > I can increase the MaxPermSize parameter and it will allow me to run the > unit test a few more times before I run out of memory. > > However, the primar
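A hedged sketch of the corresponding build.sbt settings (sbt 0.13 style; the memory values are illustrative):

  // run tests in a forked JVM so PermGen is reclaimed between runs
  fork in Test := true
  javaOptions in Test ++= Seq("-Xmx2g", "-XX:MaxPermSize=512m")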

Re: Spark cluster multi tenancy

2015-08-26 Thread Jerrick Hoang
Would be interested to know the answer too. On Wed, Aug 26, 2015 at 11:45 AM, Sadhan Sood wrote: > Interestingly, if there is nothing running on dev spark-shell, it recovers > successfully and regains the lost executors. Attaching the log for that. > Notice, the "Registering block manager .." st

Re: query avro hive table in spark sql

2015-08-26 Thread Michael Armbrust
I'd suggest looking at http://spark-packages.org/package/databricks/spark-avro On Wed, Aug 26, 2015 at 11:32 AM, gpatcham wrote: > Hi, > > I'm trying to query hive table which is based on avro in spark SQL and > seeing below errors. > > 15/08/26 17:51:12 WARN avro.AvroSerdeUtils: Encountered Avr
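For reference, a minimal sketch of reading Avro through that package (the package coordinates and file path are placeholders, and an existing sqlContext is assumed):

  // e.g. started with --packages com.databricks:spark-avro_2.10:1.0.0
  val df = sqlContext.read.format("com.databricks.spark.avro").load("/path/to/episodes.avro")
  df.printSchema()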

error accessing vertexRDD

2015-08-26 Thread dizzy5112
Hi all, a question on an issue I'm having with a vertexRDD. If I kick off my spark shell with something like this: then run: it will finish and give me the count, but I see a few errors (see below). This is okay for this small dataset, but when trying with a large data set it doesn't finish because

Re: Join with multiple conditions (In reference to SPARK-7197)

2015-08-26 Thread Michal Monselise
Davies, I created an issue - SPARK-10246 On Tue, Aug 25, 2015 at 12:53 PM, Davies Liu wrote: > It's good to support this, could you create a JIRA for it and target for > 1.6? > > On Tue, Aug 25, 2015 at 11:21 AM, Michal Monselise > wrote: > >

Re: Help! Stuck using withColumn

2015-08-26 Thread Silvio Fiorito
Hi Saif, In both cases you’re referencing columns that don’t exist in the current DataFrame. In the first email you did a select and then a withColumn for ‘month_date_cur’ on the resulting DF, but that column does not exist because you did a select for only ‘month_balance’. In the second email

Spark.ml vs Spark.mllib

2015-08-26 Thread njoshi
Hi, We are in the process of developing a new product/Spark application. While the official Spark 1.4.1 page invites users and developers to use *Spark.mllib* and optionally contribute to *Spark.ml*, this

RE: Help! Stuck using withColumn

2015-08-26 Thread Saif.A.Ellafi
I can reproduce this even more simply with the following: val gf = sc.parallelize(Array(3,6,4,7,3,4,5,5,31,4,5,2)).toDF("ASD") val ff = sc.parallelize(Array(4,6,2,3,5,1,4,6,23,6,4,7)).toDF("GFD") gf.withColumn("DSA", ff.col("GFD")) org.apache.spark.sql.AnalysisException: resolved attribute(s) GFD#42
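withColumn can only reference columns of the DataFrame it is called on, so pulling ff's column into gf fails. A rough sketch of one workaround in the spark-shell, zipping the underlying RDDs (this assumes both sides have the same number of partitions and elements, as they do here):

  import sqlContext.implicits._
  val gf = sc.parallelize(Array(3, 6, 4, 7, 3, 4, 5, 5, 31, 4, 5, 2)).toDF("ASD")
  val ff = sc.parallelize(Array(4, 6, 2, 3, 5, 1, 4, 6, 23, 6, 4, 7)).toDF("GFD")
  // zip the row RDDs and rebuild a two-column DataFrame
  val combined = gf.rdd.zip(ff.rdd)
    .map { case (r1, r2) => (r1.getInt(0), r2.getInt(0)) }
    .toDF("ASD", "DSA")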

Help! Stuck using withColumn

2015-08-26 Thread Saif.A.Ellafi
This simple command call: val final_df = data.select("month_balance").withColumn("month_date", data.col("month_date_curr")) Is throwing: org.apache.spark.sql.AnalysisException: resolved attribute(s) month_date_curr#324 missing from month_balance#234 in operator !Project [month_balance#234, mon

Re: Does the driver program always run local to where you submit the job from?

2015-08-26 Thread Jerry
Thanks! On Wed, Aug 26, 2015 at 2:06 PM, Marcelo Vanzin wrote: > On Wed, Aug 26, 2015 at 2:03 PM, Jerry wrote: > > Assuming you're submitting the job from the terminal; when main() is called, > if I > > try to open a file locally, can I assume the machine is always the one I > > submitted the job fro

Re: Does the driver program always run local to where you submit the job from?

2015-08-26 Thread Marcelo Vanzin
On Wed, Aug 26, 2015 at 2:03 PM, Jerry wrote: > Assuming you're submitting the job from the terminal; when main() is called, if I > try to open a file locally, can I assume the machine is always the one I > submitted the job from? See the "--deploy-mode" option. "client" works as you describe; "cluster

Does the driver program always run local to where you submit the job from?

2015-08-26 Thread Jerry
Assuming you're submitting the job from the terminal; when main() is called, if I try to open a file locally, can I assume the machine is always the one I submitted the job from? Currently I'm working off of a single machine, but I'm wondering if I'll run into issues when I move over to a cluster. The fi

suggest configuration for debugging spark streaming, kafka

2015-08-26 Thread Joanne Contact
Hi, I have an Ubuntu box with 4GB memory and dual cores. Do you think it will be enough to run Spark Streaming and Kafka? I'm trying to install standalone-mode Spark and Kafka so I can debug them in an IDE. Do I need to install Hadoop? Thanks! J

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-26 Thread Susan Zhang
Ah, I was using the UI coupled with the job logs, which indicated that offsets were being "processed" even though they corresponded to 0 events. Looks like I wasn't matching up timestamps correctly: the 0 event batches were queued/processed when offsets were getting skipped: 15/08/26 11:26:05 INFO storage

Dataframe collect() work but count() fails

2015-08-26 Thread Srikanth
Hello, I'm seeing a strange behavior where count() on a DataFrame errors as shown below but collect() works fine. This is what I tried from spark-shell. solrRDD.queryShards() returns a JavaRDD. val rdd = solrRDD.queryShards(sc, query, "_version_", 2).rdd > rdd: org.apache.spark.rdd.RDD[org.apache.

Re: Can RDD be shared accross the cluster by other drivers?

2015-08-26 Thread Ted Yu
Can you outline your use case ? This seems to be related: SPARK-2389 globally shared SparkContext / shared Spark "application" FYI On Wed, Aug 26, 2015 at 10:47 AM, Tao Lu wrote: > Hi, Guys, > > Is it possible that RDD created by driver A be used driver B? > > Thanks! >

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-26 Thread Cody Koeninger
I'd be less concerned about what the streaming ui shows than what's actually going on with the job. When you say you were losing messages, how were you observing that? The UI, or actual job output? The log lines you posted indicate that the checkpoint was restored and those offsets were processe

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-26 Thread Susan Zhang
Compared offsets, and it continues from checkpoint loading: 15/08/26 11:24:54 INFO DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData: Restoring KafkaRDD for time 1440612035000 ms [(install-json,5,826112083,826112446), (install-json,4,825772921,825773536), (install-json,1,831654775,8316

Can RDD be shared accross the cluster by other drivers?

2015-08-26 Thread Tao Lu
Hi Guys, Is it possible for an RDD created by driver A to be used by driver B? Thanks!

Re: Question on take function - Spark Java API

2015-08-26 Thread Pankaj Wahane
Thanks Sonal.. I shall try doing that.. > On 26-Aug-2015, at 1:05 pm, Sonal Goyal wrote: > > You can try using wholeTextFile which will give you a pair rdd of fileName, > content. flatMap through this and manipulate the content. > > Best Regards, > Sonal > Founder, Nube Technologies

Re: Spark cluster multi tenancy

2015-08-26 Thread Sadhan Sood
Interestingly, if there is nothing running on dev spark-shell, it recovers successfully and regains the lost executors. Attaching the log for that. Notice, the "Registering block manager .." statements in the very end after all executors were lost. On Wed, Aug 26, 2015 at 11:27 AM, Sadhan Sood wr

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-26 Thread Cody Koeninger
When the kafka rdd is actually being iterated on the worker, there should be an info line of the form log.info(s"Computing topic ${part.topic}, partition ${part.partition} " + s"offsets ${part.fromOffset} -> ${part.untilOffset}") You should be able to compare that to log of offsets du

Re: Spark cluster multi tenancy

2015-08-26 Thread Sadhan Sood
Attaching log for when the dev job gets stuck (once all its executors are lost due to preemption). This is a spark-shell job running in yarn-client mode. On Wed, Aug 26, 2015 at 10:45 AM, Sadhan Sood wrote: > Hi All, > > We've set up our spark cluster on aws running on yarn (running on hadoop >

query avro hive table in spark sql

2015-08-26 Thread gpatcham
Hi, I'm trying to query hive table which is based on avro in spark SQL and seeing below errors. 15/08/26 17:51:12 WARN avro.AvroSerdeUtils: Encountered AvroSerdeException determining schema. Returning signal schema to indicate problem org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Neither

Feedback: Feature request

2015-08-26 Thread Murphy, James
Hey all, In working with the DecisionTree classifier, I found it difficult to extract rules that could easily facilitate visualization with libraries like D3. So for example, using : print(model.toDebugString()), I get the following result = If (feature 0 <= -35.0) If (feature 24 <= 176.0
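One way to get a machine-readable tree is to walk the model's node structure instead of parsing toDebugString. A rough sketch against the MLlib 1.4-era API (field names may differ in other versions; the JSON shape is just an illustration for D3):

  import org.apache.spark.mllib.tree.model.Node

  // Recursively render the tree as a nested JSON-like string
  def toJson(node: Node): String = {
    if (node.isLeaf) {
      s"""{"predict": ${node.predict.predict}}"""
    } else {
      val split = node.split.get
      s"""{"feature": ${split.feature}, "threshold": ${split.threshold}, "left": ${toJson(node.leftNode.get)}, "right": ${toJson(node.rightNode.get)}}"""
    }
  }
  // usage: toJson(model.topNode), where model is the trained DecisionTreeModel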

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-26 Thread Susan Zhang
Thanks for the suggestions! I tried the following: I removed createOnError = true And reran the same process to reproduce. Double checked that checkpoint is loading: 15/08/26 10:10:40 INFO DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData: Restoring KafkaRDD for time 1440608825000 m

Re: Efficient sampling from a Hive table

2015-08-26 Thread Thomas Dudziak
Using TABLESAMPLE(0.1) is actually way worse. Spark first spends 12 minutes to discover all split files on all hosts (for some reason) before it even starts the job, and then it creates 3.5 million tasks (the partition has ~32k split files). On Wed, Aug 26, 2015 at 9:36 AM, Jörn Franke wrote: >

Re: How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-26 Thread Mike Trienis
Thanks for your response Yana, I can increase the MaxPermSize parameter and it will allow me to run the unit test a few more times before I run out of memory. However, the primary issue is that running the same unit test in the same JVM (multiple times) results in increased memory (each run of th

Spark cluster multi tenancy

2015-08-26 Thread Sadhan Sood
Hi All, We've set up our spark cluster on aws running on yarn (running on hadoop 2.3) with fair scheduling and preemption turned on. The cluster is shared for prod and dev work where prod runs with a higher fair share and can preempt dev jobs if there are not enough resources available for it. It

Re: Spark 1.3.1 saveAsParquetFile hangs on app exit

2015-08-26 Thread cingram
spark-shell-hang-on-exit.tdump

Persisting sorted parquet tables for future sort merge joins

2015-08-26 Thread Jason
I want to persist a large _sorted_ table to Parquet on S3 and then read this in and join it using the Sorted Merge Join strategy against another large sorted table. The problem is: even though I sort these tables on the join key beforehand, once I persist them to Parquet, they lose the information

Re: JDBC Streams

2015-08-26 Thread Jörn Franke
I would use Sqoop. It has been designed exactly for these types of scenarios. Spark streaming does not make sense here. On Sun, Jul 5, 2015 at 1:59 AM, ayan guha wrote: > Hi All > > I have a requirement to connect to a DB every few minutes and bring data to > HBase. Can anyone suggest if spark str

Re: Efficient sampling from a Hive table

2015-08-26 Thread Jörn Franke
Have you tried tablesample? You'll find the exact syntax in the documentation, but it does exactly what you want. On Wed, Aug 26, 2015 at 6:12 PM, Thomas Dudziak wrote: > Sorry, I meant without reading from all splits. This is a single partition > in the table. > > On Wed, Aug 26, 2015 at 8:53 AM,

Just Released V1.0.4 Low Level Receiver Based Kafka-Spark-Consumer in Spark Packages having built-in Back Pressure Controller

2015-08-26 Thread Dibyendu Bhattacharya
Dear All, I have just released version 1.0.4 of the Low Level Receiver based Kafka-Spark-Consumer on spark-packages.org. You can find the latest release here: http://spark-packages.org/package/dibbhatt/kafka-spark-consumer Here is the github location: https://github.com/dibbhatt/kafka-spark-consumer

Re: Efficient sampling from a Hive table

2015-08-26 Thread Thomas Dudziak
Sorry, I meant without reading from all splits. This is a single partition in the table. On Wed, Aug 26, 2015 at 8:53 AM, Thomas Dudziak wrote: > I have a sizeable table (2.5T, 1b rows) that I want to get ~100m rows from > and I don't particularly care which rows. Doing a LIMIT unfortunately > r

Re: Building spark-examples takes too much time using Maven

2015-08-26 Thread Ted Yu
Can you provide a bit more information? Do the Spark artifacts packaged by you have the same names / paths (in the maven repo) as the ones published by Apache Spark? Is Zinc running on the machine where you performed the build? Cheers On Wed, Aug 26, 2015 at 7:56 AM, Muhammad Haseeb Javed < 11besem

Re: spark streaming 1.3 kafka buffer size

2015-08-26 Thread Cody Koeninger
see http://kafka.apache.org/documentation.html#consumerconfigs fetch.message.max.bytes in the kafka params passed to the constructor On Wed, Aug 26, 2015 at 10:39 AM, Shushant Arora wrote: > whats the default buffer in spark streaming 1.3 for kafka messages. > > Say In this run it has to fet
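A minimal sketch of passing that setting through the Spark 1.3 direct stream constructor (broker, topic, and buffer size are placeholders, and an existing StreamingContext named ssc is assumed):

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils

  val kafkaParams = Map(
    "metadata.broker.list" -> "broker1:9092",
    "fetch.message.max.bytes" -> (8 * 1024 * 1024).toString)  // illustrative 8MB per fetch
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("mytopic"))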

Efficient sampling from a Hive table

2015-08-26 Thread Thomas Dudziak
I have a sizeable table (2.5T, 1b rows) that I want to get ~100m rows from and I don't particularly care which rows. Doing a LIMIT unfortunately results in two stages where the first stage reads the whole table, and the second then performs the limit with a single worker, which is not very efficien
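For what it's worth, a hedged sketch of sampling without a LIMIT, using DataFrame.sample so every partition contributes rows instead of a single worker applying the limit (table name, fraction, and output path are illustrative; this still scans the input, it just avoids the single-task second stage):

  // ~10% of 1b rows is roughly 100m rows
  val sampled = sqlContext.table("big_table").sample(withReplacement = false, fraction = 0.1)
  sampled.write.parquet("/tmp/sample")   // or use the sampled rows directly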

Re: How to add a new column with date duration from 2 date columns in a dataframe

2015-08-26 Thread Dhaval Patel
Thanks Davies. HiveContext seems neat to use :) On Thu, Aug 20, 2015 at 3:02 PM, Davies Liu wrote: > As Aram said, there two options in Spark 1.4, > > 1) Use the HiveContext, then you got datediff from Hive, > df.selectExpr("datediff(d2, d1)") > 2) Use Python UDF: > ``` > >>> from datetime impor
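A small sketch of the HiveContext variant Davies describes (column names d1/d2 come from the quoted example; the dates and the SparkContext named sc are assumptions):

  import org.apache.spark.sql.hive.HiveContext
  val hc = new HiveContext(sc)
  val df = hc.createDataFrame(Seq(("2015-08-01", "2015-08-26"))).toDF("d1", "d2")
  // datediff is a Hive UDF, so it is available through selectExpr on a HiveContext DataFrame
  df.selectExpr("d1", "d2", "datediff(d2, d1) as duration_days").show()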

spark streaming 1.3 kafka buffer size

2015-08-26 Thread Shushant Arora
What's the default buffer in spark streaming 1.3 for kafka messages? Say in this run it has to fetch messages from offset 1 to 1. Will it fetch them all in one go, or does it internally fetch messages in smaller batches? Is there any setting to configure the number of offsets fetched in one batch?

Re: JDBC Streams

2015-08-26 Thread Cody Koeninger
Yes On Wed, Aug 26, 2015 at 10:23 AM, Chen Song wrote: > Thanks Cody. > > Are you suggesting to put the cache in global context in each executor > JVM, in a Scala object for example. Then have a scheduled task to refresh > the cache (or triggered by the expiry if Guava)? > > Chen > > On Wed, Aug

Re: JDBC Streams

2015-08-26 Thread Chen Song
Thanks Cody. Are you suggesting putting the cache in a global context in each executor JVM, in a Scala object for example, and then having a scheduled task refresh the cache (or having it triggered by the expiry if using Guava)? Chen On Wed, Aug 26, 2015 at 10:51 AM, Cody Koeninger wrote: > If your data only change

Re: DataFrame/JDBC very slow performance

2015-08-26 Thread Dhaval Patel
Thanks Michael, much appreciated! Nothing should be held in memory for a query like this (other than a single count per partition), so I don't think that is the problem. There is likely an error buried somewhere. For your above comments - I don't get any error but just get the NULL as return val

Re: Build k-NN graph for large dataset

2015-08-26 Thread Charlie Hack
+1 to all of the above esp.  Dimensionality reduction and locality sensitive hashing / min hashing.  There's also an algorithm implemented in MLlib called DIMSUM which was developed at Twitter for this purpose. I've been meaning to try it and would be interested to hear about results you get. 
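For context, DIMSUM is exposed in MLlib as RowMatrix.columnSimilarities. A toy sketch (the data is made up; note that it compares columns, so the items to compare must be laid out as columns of the matrix):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  val mat = new RowMatrix(sc.parallelize(Seq(
    Vectors.dense(1.0, 0.0, 2.0),
    Vectors.dense(0.5, 1.0, 0.0),
    Vectors.dense(0.0, 1.0, 1.0))))
  // a threshold > 0 enables DIMSUM sampling; low-similarity pairs may be approximated or dropped
  val sims = mat.columnSimilarities(0.1)
  sims.entries.collect().foreach(println)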

Re: Build k-NN graph for large dataset

2015-08-26 Thread Michael Malak
Yes. And a paper that describes using grids (actually varying grids) is  http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf  In the Spark GraphX In Action book that Robin East and I are writing, we implement a drastically simplified version of this in chapter

Building spark-examples takes too much time using Maven

2015-08-26 Thread Muhammad Haseeb Javed
I checked out the master branch and started playing around with the examples. I want to build a jar of the examples as I wish to run them using the modified spark jar that I have. However, packaging spark-examples takes too much time as maven tries to download the jar dependencies rather than use the

Re: JDBC Streams

2015-08-26 Thread Cody Koeninger
If your data only changes every few days, why not restart the job every few days, and just broadcast the data? Or you can keep a local per-jvm cache with an expiry (e.g. guava cache) to avoid many mysql reads On Wed, Aug 26, 2015 at 9:46 AM, Chen Song wrote: > Piggyback on this question. > > I
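A plain-Scala sketch of the per-JVM cache idea (no Guava dependency; the MySQL loader is a stand-in you would supply, and the TTL is illustrative):

  object RefData {
    private val ttlMs = 10 * 60 * 1000L                       // 10-minute expiry
    @volatile private var cached: (Long, Map[String, String]) = (0L, Map.empty)

    // loader is whatever reads the reference table from MySQL; called at most once per TTL per JVM
    def get(loader: () => Map[String, String]): Map[String, String] = {
      val now = System.currentTimeMillis()
      if (now - cached._1 > ttlMs) synchronized {
        if (now - cached._1 > ttlMs) cached = (now, loader())
      }
      cached._2
    }
  }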

Re: Spark 1.3.1 saveAsParquetFile hangs on app exit

2015-08-26 Thread Cheng Lian
Could you please show the jstack result of the hung process? Thanks! Cheng On 8/26/15 10:46 PM, cingram wrote: I have a simple test that is hanging when using s3a with spark 1.3.1. Is there something I need to do to cleanup the S3A file system? The write to S3 appears to have worked but this job

Re: JDBC Streams

2015-08-26 Thread Chen Song
Piggyback on this question. I have a similar use case but a bit different. My job is consuming a stream from Kafka and I need to join the Kafka stream with some reference table from MySQL (kind of data validation and enrichment). I need to process this stream every 1 min. The data in MySQL is not

Spark 1.3.1 saveAsParquetFile hangs on app exit

2015-08-26 Thread cingram
I have a simple test that is hanging when using s3a with spark 1.3.1. Is there something I need to do to cleanup the S3A file system? The write to S3 appears to have worked but this job hangs in the spark-shell and using spark-submit. Any help would be greatly appreciated. TIA. import sqlContext.i

application logs for long lived job on YARN

2015-08-26 Thread Chen Song
When running a long-lived job on YARN like Spark Streaming, I found that container logs are gone after days on executor nodes, although the job itself is still running. I am using cdh5.4.0 and have log aggregation enabled. Because the local logs are gone on executor nodes, I don't see any aggregated lo

Re: Finding the number of executors.

2015-08-26 Thread Virgil Palanciuc
As I was writing a long-ish message to explain how it doesn't work, it dawned on me that maybe the driver connects to executors only after there's some work to do (while I was trying to find the number of executors BEFORE starting the actual work). So the solution was to simply execute a dummy task (

Re: Issue with building Spark v1.4.1-rc4 with Scala 2.11

2015-08-26 Thread Ted Yu
Have you run dev/change-version-to-2.11.sh ? Cheers On Wed, Aug 26, 2015 at 7:07 AM, Felix Neutatz wrote: > Hi everybody, > > I tried to build Spark v1.4.1-rc4 with Scala 2.11: > ../apache-maven-3.3.3/bin/mvn -Dscala-2.11 -DskipTests clean install > > Before running this, I deleted: > ../.m2/re

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-26 Thread Cody Koeninger
The first thing that stands out to me is createOnError = true Are you sure the checkpoint is actually loading, as opposed to failing and starting the job anyway? There should be info lines that look like INFO DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData: Restoring KafkaRDD for t

Fwd: Issue with building Spark v1.4.1-rc4 with Scala 2.11

2015-08-26 Thread Felix Neutatz
Hi everybody, I tried to build Spark v1.4.1-rc4 with Scala 2.11: ../apache-maven-3.3.3/bin/mvn -Dscala-2.11 -DskipTests clean install Before running this, I deleted: ../.m2/repository/org/apache/spark ../.m2/repository/org/spark-project My changes to the code: I just changed line 174 of org.apac

Setting number of CORES from inside the Topology (JAVA code )

2015-08-26 Thread anshu shukla
Hey, I need to set the number of cores from inside the topology. It works fine when set in spark-env.sh, but I am unable to do it by setting a key/value on the conf. SparkConf sparkConf = new SparkConf().setAppName("JavaCustomReceiver").setMaster("local[4]"); if(toponame.equals("IdentityTopolog

Spark-on-YARN LOCAL_DIRS location

2015-08-26 Thread michael.england
Hi, I am having issues with /tmp space filling up during Spark jobs because Spark-on-YARN uses the yarn.nodemanager.local-dirs for shuffle space. I noticed this message appears when submitting Spark-on-YARN jobs: WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the

Re: Build k-NN graph for large dataset

2015-08-26 Thread Kristina Rogale Plazonic
If you don't want to compute all N^2 similarities, you need to implement some kind of blocking first. For example, LSH (locality-sensitive hashing). A quick search gave this link to a Spark implementation: http://stackoverflow.com/questions/2771/spark-implementation-for-locality-sensitive-hashi

Re: Custom Offset Management

2015-08-26 Thread Cody Koeninger
That argument takes a function from MessageAndMetadata to whatever you want your stream to contain. See https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerBatch.scala#L57 On Wed, Aug 26, 2015 at 7:55 AM, Deepesh Maheshwari < deepesh.maheshwar...@gm
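A hedged sketch of that messageHandler argument against the Spark 1.3 direct API (broker, topic, and offset values are placeholders, and an existing StreamingContext named ssc is assumed):

  import kafka.common.TopicAndPartition
  import kafka.message.MessageAndMetadata
  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils

  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
  val fromOffsets = Map(TopicAndPartition("mytopic", 0) -> 0L)   // offsets you stored yourself
  // R = (topic, offset, payload): keep the offset next to each record for explicit bookkeeping
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
      (String, Long, String)](
    ssc, kafkaParams, fromOffsets,
    (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.offset, mmd.message()))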

Custom Offset Management

2015-08-26 Thread Deepesh Maheshwari
Hi Folks, My Spark application interacts with kafka to get data through the Java API. I am using the Direct Approach (No Receivers), which uses Kafka’s simple consumer API to read data, so kafka offsets need to be handled explicitly. In case of a Spark failure I need to save the offset state of kafka

Re: Relation between threads and executor core

2015-08-26 Thread Jem Tucker
Sam, This may be of interest; as far as I can see, it suggests that a spark 'task' is always executed as a single thread in the JVM. http://0x0fff.com/spark-architecture/ Thanks, Jem On Wed, Aug 26, 2015 at 10:06 AM Samya MAITI wrote: > Thanks Jem, I do understand your suggestion. Actually

Re: Build k-NN graph for large dataset

2015-08-26 Thread Robin East
You could try dimensionality reduction (PCA or SVD) first. I would imagine that even if you could successfully compute similarities in the high-dimensional space you would probably run into the curse of dimensionality. > On 26 Aug 2015, at 12:35, Jaonary Rabarisoa wrote: > > Dear all, > > I'm

Build k-NN graph for large dataset

2015-08-26 Thread Jaonary Rabarisoa
Dear all, I'm trying to find an efficient way to build a k-NN graph for a large dataset. Precisely, I have a large set of high dimensional vectors (say d >> 1) and I want to build a graph where those high dimensional points are the vertices and each one is linked to the k-nearest neighbor base

JobScheduler: Error generating jobs for time for custom InputDStream

2015-08-26 Thread Juan Rodríguez Hortalá
Hi, I've developed a ScalaCheck property for testing Spark Streaming transformations. To do that I had to develop a custom InputDStream, which is very similar to QueueInputDStream but has a method for adding new test cases for dstreams, which are objects of type Seq[Seq[A]], to the DStream. You ca

Re: Performance issue with Spark join

2015-08-26 Thread Hemant Bhanawat
Spark joins are different from traditional database joins because of the lack of index support. Spark has to shuffle data between various nodes to perform joins, so joins are bound to be much slower than count, which is just a parallel scan of the data. Still, to ensure that nothing is wro

Performance issue with Spark join

2015-08-26 Thread lucap
Hi, I'm trying to perform an ETL using Spark, but as soon as I start performing joins, performance degrades a lot. Let me explain what I'm doing and what I found out until now. First of all, I'm reading avro files that are on a Cloudera cluster, using commands like this: val tab1 = sc.hadoopFile(

RE: Relation between threads and executor core

2015-08-26 Thread Samya MAITI
Thanks Jem, I do understand your suggestion. Actually --executor-cores alone doesn’t control the number of tasks; it is also governed by spark.task.cpus (the number of cores dedicated to each task’s execution). Reframing my question: how many threads can be spawned per executor core? Is it in use
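As a rough sketch of how those two settings interact (the values are illustrative):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.executor.cores", "4")   // task slots per executor, same as --executor-cores
    .set("spark.task.cpus", "1")        // cores reserved for each task
  // concurrent tasks per executor = spark.executor.cores / spark.task.cpus = 4 here,
  // and each task normally runs as a single thread unless your own code spawns more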

Re:Re: How to increase data scale in Spark SQL Perf

2015-08-26 Thread Todd
Increase the number of executors, :-) At 2015-08-26 16:57:48, "Ted Yu" wrote: Mind sharing how you fixed the issue ? Cheers On Aug 26, 2015, at 1:50 AM, Todd wrote: Sorry for the noise, It's my bad...I have worked it out now. At 2015-08-26 13:20:57, "Todd" wrote: I think the an

Re: How to increase data scale in Spark SQL Perf

2015-08-26 Thread Ted Yu
Mind sharing how you fixed the issue ? Cheers > On Aug 26, 2015, at 1:50 AM, Todd wrote: > > > Sorry for the noise, It's my bad...I have worked it out now. > > At 2015-08-26 13:20:57, "Todd" wrote: > > I think the answer is No. I only see such message on the console..and #2 is > the th

Re: Relation between threads and executor core

2015-08-26 Thread Jem Tucker
Hi Samya, When submitting an application with spark-submit the cores per executor can be set with --executor-cores, meaning you can run that many tasks per executor concurrently. The page below has some more details on submitting applications: https://spark.apache.org/docs/latest/submitting-appli

Re:Re:Re: How to increase data scale in Spark SQL Perf

2015-08-26 Thread Todd
Sorry for the noise, It's my bad...I have worked it out now. At 2015-08-26 13:20:57, "Todd" wrote: I think the answer is No. I only see such message on the console..and #2 is the thread stack trace。 I am thinking is that in Spark SQL Perf forks many dsdgen process to generate data when the

Re: DataFrame Parquet Writer doesn't keep schema

2015-08-26 Thread Petr Novak
The same as https://mail.google.com/mail/#label/Spark%2Fuser/14f64c75c15f5ccd Please follow the discussion there. On Tue, Aug 25, 2015 at 5:02 PM, Petr Novak wrote: > Hi all, > when I read parquet files with "required" fields aka nullable=false they > are read correctly. Then I save them (df.wr

Relation between threads and executor core

2015-08-26 Thread Samya
Hi All, A few basic queries: 1. Is there a way we can control the number of threads per executor core? 2. Does the parameter “executor-cores” also have a say in deciding how many threads are run? Regards, Sam

Re: reduceByKey not working on JavaPairDStream

2015-08-26 Thread Sean Owen
I don't see that you invoke any action in this code. It won't do anything unless you tell it to perform an action that requires the transformations. On Wed, Aug 26, 2015 at 7:05 AM, Deepesh Maheshwari wrote: > Hi, > I have applied mapToPair and then a reduceByKey on a DStream to obtain a > JavaPa
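A minimal Scala sketch of the missing piece, an output operation that forces the transformations to run (the master, host/port, and batch interval are placeholders):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("wordcount").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(10))
  val counts = ssc.socketTextStream("localhost", 9999)
    .flatMap(_.split(" "))
    .map(w => (w, 1L))
    .reduceByKey(_ + _)
  counts.print()          // without an output operation like print/foreachRDD, nothing executes
  ssc.start()
  ssc.awaitTermination()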

SPARK_DIST_CLASSPATH, primordial class loader & app ClassNotFound

2015-08-26 Thread Night Wolf
Hey all, I'm trying to do some stuff with a YAML file in the Spark driver using SnakeYAML library in scala. When I put the snakeyaml v1.14 jar on the SPARK_DIST_CLASSPATH and try to de-serialize some objects from YAML into classes in my app JAR on the driver (only the driver). I get the exception

Re: Question on take function - Spark Java API

2015-08-26 Thread Sonal Goyal
You can try using wholeTextFiles, which will give you a pair RDD of (fileName, content). flatMap through this and manipulate the content. Best Regards, Sonal Founder, Nube Technologies Check out Reifier at Spark Summit 2015
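A small sketch of that suggestion (the input path is a placeholder):

  // (fileName, fileContent) pairs, one per file
  val files = sc.wholeTextFiles("hdfs:///data/input")
  // flatMap through the content to pull out the lines you need from each file
  val lines = files.flatMap { case (name, content) => content.split("\n").map(line => (name, line)) }
  lines.take(5).foreach(println)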

Re: MLlib Prefixspan implementation

2015-08-26 Thread Feynman Liang
ReversedPrefix is used because scala's List uses a linked list, which has constant time append to head but linear time append to tail. I'm aware that there are use cases for the gap constraints. My question was more about whether any users of Spark/MLlib have an immediate application for these fea

Re: MLlib Prefixspan implementation

2015-08-26 Thread alexis GILLAIN
A first use case of gap constraint is included in the article. Another application would be customer-shopping sequence analysis where you want to put a constraint on the duration between two purchases for them to be considered as a pertinent sequence. Additional question regarding the code : what'

RE: SparkR: exported functions

2015-08-26 Thread Felix Cheung
I believe that is done explicitly while the final API is being figured out. For the moment you could use DataFrame read.df() > From: csgilles...@gmail.com > Date: Tue, 25 Aug 2015 18:26:50 +0100 > Subject: SparkR: exported functions > To: user@spark.apache.org > > Hi, > > I've just started play

Re: BlockNotFoundException when running spark word count on Tachyon

2015-08-26 Thread Dibyendu Bhattacharya
The URL seems to have changed .. here is the one .. http://tachyon-project.org/documentation/Tiered-Storage-on-Tachyon.html On Wed, Aug 26, 2015 at 12:32 PM, Dibyendu Bhattacharya < dibyendu.bhattach...@gmail.com> wrote: > Sometime back I was playing with Spark and Tachyon and I also found this

Re: BlockNotFoundException when running spark word count on Tachyon

2015-08-26 Thread Dibyendu Bhattacharya
Some time back I was playing with Spark and Tachyon and I also found this issue. The issue here is that TachyonBlockManager puts the blocks with the WriteType.TRY_CACHE configuration. Because of this, blocks are evicted from the Tachyon cache when memory is full, and when Spark tries to find the block it throws