Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread bipin
I was running the spark shell and sql with the --jars option containing the paths when I got my error. I am not sure what the correct way to add jars is. I tried placing the jar inside the directory you said but still get the error. I will give the code you posted a try. Thanks. -- View this message

Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread bipin
I am running the queries from spark-sql. I don't think it can communicate with the thrift server. Can you tell me how I should run the queries to make it work? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-sql-tp22399p22516.ht

Spark SQL query key/value in Map

2015-04-16 Thread jc.francisco
Hi, I'm new with both Cassandra and Spark and am experimenting with what Spark SQL can do as it will affect my Cassandra data model. What I need is a model that can accept arbitrary fields, similar to Postgres's Hstore. Right now, I'm trying out the map type in Cassandra but I'm getting the excep

Re: Passing Elastic Search Mappings in Spark Conf

2015-04-16 Thread Deepak Subhramanian
Thanks Nick. I understand that we can configure the index by creating the index with the mapping first. I thought it will be a good feature to be added in the es-hadoop /es-spark as we can have the full mapping and code in a single space especially for simple mappings on a particular field. It mak

Streaming problems running 24x7

2015-04-16 Thread Miquel
Hello, I'm having problems running a spark streaming job for more than a few hours (3 or 4). It begins working OK, but it degrades until failure. Some of the symptoms: - Consumed memory and CPU keep getting higher and higher, and finally some error is thrown (java.lang.Exception: Could not

RE: How to do dispatching in Streaming?

2015-04-16 Thread Evo Eftimov
Also you can have each message type in a different topic (needs to be arranged upstream from your Spark Streaming app ie in the publishing systems and the messaging brokers) and then for each topic you can have a dedicated instance of InputReceiverDStream which will be the start of a dedicated D

RE: How to do dispatching in Streaming?

2015-04-16 Thread Evo Eftimov
And yet another way is to demultiplex at one point which will yield separate DStreams for each message type which you can then process in independent DAG pipelines in the following way: MessageType1DStream = MainDStream.filter(message type1) MessageType2DStream = MainDStream.filter(message t
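A minimal sketch of that demultiplexing approach, assuming a hypothetical Message type with a msgType field and an existing input DStream (e.g. from a receiver):

    import org.apache.spark.streaming.dstream.DStream

    // Message is a placeholder type; mainDStream would come from e.g. a Kafka receiver
    case class Message(msgType: String, payload: String)

    def demultiplex(mainDStream: DStream[Message]): (DStream[Message], DStream[Message]) = {
      // each filter produces an independent DStream that can head its own processing pipeline
      val type1Stream = mainDStream.filter(_.msgType == "type1")
      val type2Stream = mainDStream.filter(_.msgType == "type2")
      (type1Stream, type2Stream)
    }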

Re: executor failed, cannot find compute-classpath.sh

2015-04-16 Thread TimMalt
Hi, has this issue been resolved? I am currently running into similar problems. I am using spark-1.3.0-bin-hadoop2.4 on Windows and Ubuntu. I have setup all path on my Windows machine in an identical manner as on my Ubuntu server (using cygwin, so everything is somewhere under /usr/local/spark...)

Re: How to do dispatching in Streaming?

2015-04-16 Thread Gerard Maas
From experience, I'd recommend using the dstream.foreachRDD method and doing the filtering within that context. Extending the example of TD, something like this: dstream.foreachRDD { rdd => rdd.cache() messageType.foreach (msgTyp => val selection = rdd.filter(msgTyp.match(_))
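The snippet above is cut off; a sketch of how that pattern might continue, where messageTypes and process() are hypothetical names and the unpersist call is an addition:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    // dstream is an existing DStream[String]; process() stands in for per-type handling
    def dispatch(dstream: DStream[String], messageTypes: Seq[String],
                 process: (String, RDD[String]) => Unit): Unit = {
      dstream.foreachRDD { rdd =>
        rdd.cache()                          // reuse the same batch RDD across all filters
        messageTypes.foreach { msgType =>
          val selection = rdd.filter(_.contains(msgType))
          process(msgType, selection)        // independent handling per message type
        }
        rdd.unpersist()                      // release the cached batch
      }
    }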

RE: How to do dispatching in Streaming?

2015-04-16 Thread Evo Eftimov
Ooops – what does “more work” mean in a Parallel Programming paradigm, and does it always translate into “inefficiency”? Here are a few laws of physics in this space: 1. More Work, if done AT THE SAME time AND fully utilizing the cluster resources, is a GOOD thing 2. More Work whi

Re: Save org.apache.spark.mllib.linalg.Matri to a file

2015-04-16 Thread Spico Florin
Thank you very much for your suggestions, Ignacio! I have posted my solution here: http://stackoverflow.com/questions/29649904/save-spark-org-apache-spark-mllib-linalg-matrix-to-a-file/29671193#29671193 Best regards, Florin On Wed, Apr 15, 2015 at 5:28 PM, Ignacio Blasco wrote: > You can tu

Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread bipin
Looks like a good option. BTW v3.0 is around the corner. http://slick.typesafe.com/news/2015/04/02/slick-3.0.0-RC3-released.html Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-sql-tp22399p22521.html Sent from the Apach

Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-16 Thread Nathan McCarthy
Its JTDS 1.3.1; http://sourceforge.net/projects/jtds/files/jtds/1.3.1/ I put that jar in /tmp on the driver/machine I’m running spark shell from. Then I ran with ./bin/spark-shell --jars /tmp/jtds-1.3.1.jar --master yarn-client So I’m guessing that --jars doesn’t set the class path for the prim
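One workaround sometimes used for this kind of driver-classpath problem (an assumption here, not something confirmed in this thread) is to put the JDBC driver jar on the driver and executor classpaths explicitly, relying on --jars to ship it into the executor working directory on YARN:

    ./bin/spark-shell --master yarn-client \
      --jars /tmp/jtds-1.3.1.jar \
      --driver-class-path /tmp/jtds-1.3.1.jar \
      --conf spark.executor.extraClassPath=./jtds-1.3.1.jar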

Re: Streaming problems running 24x7

2015-04-16 Thread Akhil Das
I used to hit this issue when my processing time exceeds the batch duration. Here's a few workarounds: - Use storage level MEMORY_AND_DISK - Enable WAL and check pointing Above two will slow down things a little bit. If you want low latency, what you can try is: - Use storage level as MEMORY_ON
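A rough sketch of those settings, assuming an existing StreamingContext ssc and a Kafka receiver; the checkpoint path, ZooKeeper host, group, and topic names are placeholders:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    // checkpointing is required for stateful operations and WAL recovery
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")

    val lines = KafkaUtils.createStream(
      ssc, "zk-host:2181", "consumer-group", Map("my-topic" -> 1),
      StorageLevel.MEMORY_AND_DISK_SER)   // spill received blocks to disk instead of dropping them

    // the write ahead log is a SparkConf setting, e.g.:
    //   sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")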

custom input format in spark

2015-04-16 Thread Shushant Arora
Hi, how do I specify a custom input format in spark and control isSplitable within a file? I need to read a file from HDFS; the file format is custom and the requirement is that a file should not be split in between when an executor node gets that partition of the input dir. Can anyone share a sample in java? Thanks Sh

ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-16 Thread rkrist
Hello guys, after upgrading spark to 1.3.0 (and performing the necessary code changes) an issue appeared making me unable to handle Date fields (java.sql.Date) with the Spark SQL module. An exception appears in the console when I try to execute an SQL query on a DataFrame (see below). When I tried to e

Re: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-16 Thread rkrist
...one additional note: implementation of org.apache.spark.sql.columnar.IntColumnStats is IMHO wrong. Small hint - what will be the resulting upper and lower values for column containing no data (empty RDD or null values in Int column across the whole RDD)? Shouldn't they be null? -- View this

MLLib SVMWithSGD : java.lang.OutOfMemoryError: Java heap space

2015-04-16 Thread sarath
Hi, I'm trying to train an SVM on the KDD2010 dataset (available from libsvm), but I'm getting a "java.lang.OutOfMemoryError: Java heap space" error. The dataset is really sparse and has around 8 million data points and 20 million features. I'm using a cluster of 8 nodes (each with 8 cores and 64G RAM)

[SQL] DROP TABLE should also uncache table

2015-04-16 Thread Arush Kharbanda
Hi, as per JIRA this issue is resolved, but I am still facing it. SPARK-2734 - DROP TABLE should also uncache table -- [image: Sigmoid Analytics] *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.c

Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread ARose
I take it back. My solution only works when you set the master to "local". I get the same error when I try to run it on the cluster. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-sql-tp22399p22525.html Sent from the Ap

Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread Denny Lee
Bummer - out of curiosity, if you were to use the classpath.first or perhaps copy the jar to the slaves could that actually do the trick? The latter isn't really all that efficient but just curious if that could do the trick. On Thu, Apr 16, 2015 at 7:14 AM ARose wrote: > I take it back. My so

[ThriftServer] Urgent -- very slow Metastore query from Spark

2015-04-16 Thread Yana Kadiyska
Hi Sparkers, hoping for insight here: running a simple describe mytable here where mytable is a partitioned Hive table. Spark produces the following times: Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189 ​ Whereas Hive over the same

Problem with Spark SQL UserDefinedType and sbt assembly

2015-04-16 Thread Jaonary Rabarisoa
Dear all, Here is an issue that is driving me mad. I wrote a UserDefinedType in order to be able to store a custom type in a parquet file. In my code I just create a DataFrame with my custom data type and write it into a parquet file. When I run my code directly inside IDEA everything works like a charm

Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread mas
I have a big data file, and I aim to create an index on the data. I want to partition the data based on a user-defined function in Spark-GraphX (Scala). Further, I want to keep track of the node to which a particular data partition is sent and on which it is processed, so I could fetch the required data by accessing the

RE: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread Evo Eftimov
How do you intend to "fetch the required data" - from within Spark or using an app / code / module outside Spark -Original Message- From: mas [mailto:mas.ha...@gmail.com] Sent: Thursday, April 16, 2015 4:08 PM To: user@spark.apache.org Subject: Data partitioning and node tracking in Spa

Re: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread MUHAMMAD AAMIR
I want to use Spark functions/APIs to do this task. My basic purpose is to index the data, divide it, and send it to multiple nodes. Then, at access time, I want to reach the right node and data partition. I don't have any clue how to do this. Thanks, On Thu, Apr 16, 2015 at 5:13 PM, Evo Ef

RE: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread Evo Eftimov
Well you can use a [Key, Value] RDD and partition it based on hash function on the Key and even a specific number of partitions (and hence cluster nodes). This will a) index the data, b) divide it and send it to multiple nodes. Re your last requirement - in a cluster programming environment/fram
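A minimal sketch of that approach, assuming an existing SparkContext sc; the keys, values, and partition count are placeholders:

    import org.apache.spark.HashPartitioner

    // data: a [K, V] RDD keyed by e.g. a record id
    val data = sc.parallelize(Seq(("id-1", "payload-1"), ("id-2", "payload-2")))

    // hash-partition onto a fixed number of partitions (roughly matching the cluster size)
    val indexed = data.partitionBy(new HashPartitioner(16)).cache()

    // a point lookup only touches the single partition the key hashes to
    val hits = indexed.lookup("id-1")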

Re: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread MUHAMMAD AAMIR
Thanks a lot for the reply. It is indeed useful, but to be more precise, I have 3D data and want to index it using an octree. Thus I aim to build a two-level indexing mechanism, i.e. first at the global level I want to partition and send the data to the nodes, then at node level I again want to use an octree to

RE: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread Evo Eftimov
Well you can have a two-level index structure, still without any need for physical cluster node awareness. Level 1 Index is the previously described partitioned [K,V] RDD – this gets you to the value (RDD element) you need on the respective cluster node. Level 2 Index – it will be built and

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Wang, Ningjun (LNG-NPV)
Does anybody have a solution for this? From: Wang, Ningjun (LNG-NPV) Sent: Tuesday, April 14, 2015 10:41 AM To: user@spark.apache.org Subject: How to join RDD keyValuePairs efficiently I have an RDD that contains millions of Document objects. Each document has a unique Id that is a string. I n

Re: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Sean Owen
This would be much, much faster if your set of IDs was simply a Set, and you passed that to a filter() call that just filtered in the docs that matched an ID in the set. On Thu, Apr 16, 2015 at 4:51 PM, Wang, Ningjun (LNG-NPV) wrote: > Does anybody have a solution for this? > > > > > > From: Wang
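A sketch of that filter-based approach; docs and Document are placeholders, and the broadcast of the id set is an addition here (useful when the set is shipped to many tasks):

    // docs is an existing RDD[(String, Document)] keyed by docid
    val wantedIds: Set[String] = Set("doc-1", "doc-42", "doc-99")
    val idsBc = sc.broadcast(wantedIds)

    // filter instead of join: keep only the documents whose id is in the set
    val matched = docs.filter { case (docId, _) => idsBc.value.contains(docId) }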

dataframe can not find fields after loading from hive

2015-04-16 Thread Cesar Flores
I have a data frame in which I load data from a hive table. And my issue is that the data frame is missing the columns that I need to query. For example: val newdataset = dataset.where(dataset("label") === 1) gives me an error like the following: ERROR yarn.ApplicationMaster: User class threw e

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
Ningjun, to speed up your current design you can do the following: 1. partition the large doc RDD based on a hash function on the key, i.e. the docid 2. persist the large dataset in memory to be available for subsequent queries without reloading and repartitioning for every search query 3. parti

Spark on Windows

2015-04-16 Thread Arun Lists
We run Spark on Mac and Linux but also need to run it on Windows 8.1 and Windows Server. We ran into problems with the Scala 2.10 binary bundle for Spark 1.3.0 but managed to get it working. However, on Mac/Linux, we are on Scala 2.11.6 (we built Spark from the sources). On Windows, however despit

Re: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Akhil Das
You could try repartitioning your RDD using a custom partitioner (HashPartitioner etc) and caching the dataset into memory to speedup the joins. Thanks Best Regards On Tue, Apr 14, 2015 at 8:10 PM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexis.com> wrote: > I have an RDD that contains milli

Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
Hi all, I have the code below where distinct is running for a long time. blockingRdd is the combination of and it will have 400K records JavaPairRDD completeDataToprocess=blockingRdd.flatMapValues( new Function>(){ @Override public Iterable call(String v1) throws Exception { return ckdao.getSingelkey

Re: Distinct is very slow

2015-04-16 Thread Akhil Das
Open the driver ui and see which stage is taking time; you can look at whether it's adding any GC time etc. Thanks Best Regards On Thu, Apr 16, 2015 at 9:56 PM, Jeetendra Gangele wrote: > Hi All I have below code whether distinct is running for more time. > > blockingRdd is the combination of and

Re: MLLib SVMWithSGD : java.lang.OutOfMemoryError: Java heap space

2015-04-16 Thread Akhil Das
Try increasing your driver memory. Thanks Best Regards On Thu, Apr 16, 2015 at 6:09 PM, sarath wrote: > Hi, > > I'm trying to train an SVM on KDD2010 dataset (available from libsvm). But > I'm getting "java.lang.OutOfMemoryError: Java heap space" error. The > dataset > is really sparse and have

Re: Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
I already checked and GC is taking 1 sec for each task. Is this too much? If yes, how do I avoid this? On 16 April 2015 at 21:58, Akhil Das wrote: > Open the driver ui and see which stage is taking time, you can look > whether its adding any GC time etc. > > Thanks > Best Regards > > On Thu, Apr 16

Re: custom input format in spark

2015-04-16 Thread Akhil Das
You can simply override the isSplitable method in your custom inputformat class and make it return false. Here's a sample code snippet: http://stackoverflow.com/questions/17875277/reading-file-as-single-record-in-hadoop#answers-header Thanks Best Regards On Thu, Apr 16, 2015 at 4:18 PM, Shush
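A minimal sketch of such an input format (Scala here; the same override works from Java), extending the standard text input format from the new Hadoop API:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.JobContext
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // returning false keeps each input file in a single split (and hence a single partition)
    class NonSplittableTextInputFormat extends TextInputFormat {
      override protected def isSplitable(context: JobContext, file: Path): Boolean = false
    }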

Re: custom input format in spark

2015-04-16 Thread Shushant Arora
Is it for spark? On Thu, Apr 16, 2015 at 10:05 PM, Akhil Das wrote: > You can simply override the isSplitable method in your custom inputformat > class and make it return false. > > Here's a sample code snippet: > > > http://stackoverflow.com/questions/17875277/reading-file-as-single-record-in-h

Custom partioner

2015-04-16 Thread Jeetendra Gangele
Hi all, I have an RDD which has 1 million keys and each key is repeated for around 7000 values, so in total there will be around 1M*7K records in the RDD. Each key is created from ZipWithIndex, so keys start from 0 to M-1. The problem with ZipWithIndex is that it takes a Long for the key, which is 8 bytes. Can I redu

Re: Distinct is very slow

2015-04-16 Thread Akhil Das
Can you paste your complete code? Did you try repartitioning/increasing the level of parallelism to speed up the processing? Since you have 16 cores, and I'm assuming your 400k records aren't bigger than a 10G dataset. Thanks Best Regards On Thu, Apr 16, 2015 at 10:00 PM, Jeetendra Gangele wrote: > I

Re: custom input format in spark

2015-04-16 Thread Akhil Das
You can plug in the native hadoop input formats with Spark's sc.newApiHadoopFile etc which takes in the inputformat. Thanks Best Regards On Thu, Apr 16, 2015 at 10:15 PM, Shushant Arora wrote: > Is it for spark? > > On Thu, Apr 16, 2015 at 10:05 PM, Akhil Das > wrote: > >> You can simply overr
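Putting the two together, a rough sketch assuming an existing SparkContext sc and the NonSplittableTextInputFormat class sketched in the earlier reply; the input path is a placeholder:

    import org.apache.hadoop.io.{LongWritable, Text}

    // each unsplit file arrives in a single partition
    val records = sc.newAPIHadoopFile(
      "hdfs:///data/input",
      classOf[NonSplittableTextInputFormat],
      classOf[LongWritable],
      classOf[Text])

    val lines = records.map { case (_, text) => text.toString }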

General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Manish Gupta 8
Hi, Is there a document/link that describes the general configuration settings to achieve maximum Spark Performance while running on CDH5? In our environment, we made a lot of changes (and are still doing so) to get decent performance; otherwise our 6 node dev cluster with default configurations lags

Re: Super slow caching in 1.3?

2015-04-16 Thread Christian Perez
Hi Michael, Good question! We checked 1.2 and found that it is also slow caching the same flat parquet file. Caching other file formats of the same data was faster by up to a factor of ~2. Note that the parquet file was created in Impala but the other formats were written by Spark SQL. Cheers,

RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Evo Eftimov
Well, there are a number of performance tuning guidelines in dedicated sections of the spark documentation - have you read and applied them? Secondly, any performance problem within a distributed cluster environment has two aspects: 1. Infrastructure 2. App Algorithms You s

RE: Super slow caching in 1.3?

2015-04-16 Thread Evo Eftimov
Michael what exactly do you mean by "flattened" version/structure here e.g.: 1. An Object with only primitive data types as attributes 2. An Object with no more than one level of other Objects as attributes 3. An Array/List of primitive types 4. An Array/List of Objects This question is in ge

Re: [SQL] DROP TABLE should also uncache table

2015-04-16 Thread Yin Huai
Can your code that can reproduce the problem? On Thu, Apr 16, 2015 at 5:42 AM, Arush Kharbanda wrote: > Hi > > As per JIRA this issue is resolved, but i am still facing this issue. > > SPARK-2734 - DROP TABLE should also uncache table > > > -- > > [image: Sigmoid Analytics]

RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Manish Gupta 8
Thanks Evo. Yes, my concern is only regarding the infrastructure configurations. Basically, configuring Yarn (Node Manager) + Spark is a must, and the default settings never work. And what really happens is we make changes as and when an issue is faced because of one of the numerous default configurat

RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Evo Eftimov
Essentially, to change the performance yield of a software cluster infrastructure platform like spark, you play with different permutations of: - Number of CPU cores used by Spark Executors on every cluster node - Amount of RAM allocated for each executor How disks and net

Re: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Sean Owen
I don't think there's anything specific to CDH that you need to know, other than it ought to set things up sanely for you. Sandy did a couple posts about tuning: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/ http://blog.cloudera.com/blog/2015/03/how-to-tune-your-

saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
I am using Spark Streaming where during each micro-batch I output data to S3 using saveAsTextFile. Right now each batch of data is put into its own directory containing 2 objects, "_SUCCESS" and "part-0." How do I output each batch into a common directory? Thanks, Vadim

Re: saveAsTextFile

2015-04-16 Thread Sean Owen
You can't, since that's how it's designed to work. Batches are saved in different "files", which are really directories containing partitions, as is common in Hadoop. You can move them later, or just read them where they are. On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy wrote: > I am using S

Re: Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
No, I did not try the partitioning; below is the full code public static void matchAndMerge(JavaRDD matchRdd,JavaSparkContext jsc) throws IOException{ long start = System.currentTimeMillis(); JavaPairRDD RddForMarch =matchRdd.zipWithIndex().mapToPair(new PairFunction, Long, MatcherReleventDat

Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
Thanks Sean. I want to load each batch into Redshift. What's the best/most efficient way to do that? Vadim > On Apr 16, 2015, at 1:35 PM, Sean Owen wrote: > > You can't, since that's how it's designed to work. Batches are saved > in different "files", which are really directories containing >

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
The reason for this is as follows: 1. You are saving data on HDFS 2. HDFS as a cluster/server side Service has a Single Writer / Multiple Reader multithreading model 3. Hence each thread of execution in Spark has to write to a separate file in HDFS 4. Moreover the

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Nope Sir, it is possible - check my reply earlier -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Thursday, April 16, 2015 6:35 PM To: Vadim Bichutskiy Cc: user@spark.apache.org Subject: Re: saveAsTextFile You can't, since that's how it's designed to work. Batches ar

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Basically you need to unbundle the elements of the RDD and then store them wherever you want - use foreachPartition and then foreach -Original Message- From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] Sent: Thursday, April 16, 2015 6:39 PM To: Sean Owen Cc: user@spark.apache.or
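A rough sketch of that pattern, with writeRecord standing in for whatever sink is actually used (a JDBC insert, a Redshift staging write, etc.):

    import org.apache.spark.streaming.dstream.DStream

    // dstream is an existing DStream[String]
    def sink(dstream: DStream[String], writeRecord: String => Unit): Unit = {
      dstream.foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          // open one connection/writer per partition here, then write each element
          partition.foreach(record => writeRecord(record))
        }
      }
    }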

Re: Spark on Windows

2015-04-16 Thread Matei Zaharia
You could build Spark with Scala 2.11 on Mac / Linux and transfer it over to Windows. AFAIK it should build on Windows too, the only problem is that Maven might take a long time to download dependencies. What errors are you seeing? Matei > On Apr 16, 2015, at 9:23 AM, Arun Lists wrote: > > We

Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
Thanks Evo for your detailed explanation. > On Apr 16, 2015, at 1:38 PM, Evo Eftimov wrote: > > The reason for this is as follows: > > 1. You are saving data on HDFS > 2. HDFS as a cluster/server side Service has a Single Writer / Multiple > Reader multithreading model > 3.

Re: saveAsTextFile

2015-04-16 Thread Sean Owen
Just copy the files? it shouldn't matter that much where they are as you can find them easily. Or consider somehow sending the batches of data straight into Redshift? no idea how that is done but I imagine it's doable. On Thu, Apr 16, 2015 at 6:38 PM, Vadim Bichutskiy wrote: > Thanks Sean. I want

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Also, to juggle the multithreading model of both spark and HDFS even further, you can publish the data from spark first to a message broker, e.g. kafka, from where a predetermined number (from 1 to infinity) of parallel consumers will retrieve it and store it in HDFS in one or more finely controlled

Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
Copy should be doable but I'm not sure how to specify a prefix for the directory while keeping the filename (ie part-0) fixed in copy command. > On Apr 16, 2015, at 1:51 PM, Sean Owen wrote: > > Just copy the files? it shouldn't matter that much where they are as > you can find them easil

Random pairs / RDD order

2015-04-16 Thread abellet
Hi everyone, I have a large RDD and I am trying to create a RDD of a random sample of pairs of elements from this RDD. The elements composing a pair should come from the same partition for efficiency. The idea I've come up with is to take two random samples and then use zipPartitions to pair each

Re: regarding ZipWithIndex

2015-04-16 Thread Jeetendra Gangele
Hi Ted. This works for me. But since a Long here takes 8 bytes, can I reduce it to 4 bytes? It's just an index and I feel 4 bytes is more than enough. Is there any method which takes Integer or similar for the index? On 13 April 2015 at 01:59, Ted Yu wrote: > bq. will return something like JavaPairRDD

Re: regarding ZipWithIndex

2015-04-16 Thread Ted Yu
The Long in RDD[(T, Long)] is type parameter. You can create RDD with Integer as the first type parameter. Cheers On Thu, Apr 16, 2015 at 11:07 AM, Jeetendra Gangele wrote: > Hi Ted. > This works for me. But since Long takes here 8 bytes. Can I reduce it to 4 > bytes. its just a index and I fee
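A small sketch of narrowing the index after zipWithIndex, assuming an existing RDD docs with fewer than Int.MaxValue elements:

    // safe only while the element count stays below ~2.1 billion
    val keyedByInt = docs.zipWithIndex().map { case (doc, idx) => (idx.toInt, doc) }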

Re: regarding ZipWithIndex

2015-04-16 Thread Jeetendra Gangele
In type T I already have Object ... I have an RDD and then I am calling ZipWithIndex on this RDD and getting an RDD; on this I am running MapToPair and converting into an RDD so that I can use it later for other operations like lookup and join. On 16 April 2015 at 23:42, Ted Yu wrote: > The Long in RDD[(T,

Re: Super slow caching in 1.3?

2015-04-16 Thread Michael Armbrust
Here are the types that we specialize; other types will be much slower. This is only for Spark SQL; normal RDDs do not serialize data that is cached. I'll also note that until yesterday we were missing FloatType https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sq

RE: Super slow caching in 1.3?

2015-04-16 Thread Evo Eftimov
Well normal RDDs can also be serialized if you select that type of Memory Persistence …. Ok thanks, so just to confirm: IF a “normal” RDD is not going to be persisted in-memory as Serialized objects (which would mean it has to be persisted as “actual/hydrated” objects) THEN there are no

StackOverflowError from KafkaReceiver when rate limiting used

2015-04-16 Thread Jeff Nadler
I've got a Kafka topic on which lots of data has built up, and a streaming app with a rate limit. During maintenance for example records will build up on Kafka and we'll burn them off on restart. The rate limit keeps the job stable while burning off the backlog. Sometimes on the first or second i

Re: StackOverflowError from KafkaReceiver when rate limiting used

2015-04-16 Thread Sean Owen
Yeah, this really shouldn't be recursive. It can't be optimized since it's not a final/private method. I think you're welcome to try a PR to un-recursivize it. On Thu, Apr 16, 2015 at 7:31 PM, Jeff Nadler wrote: > > I've got a Kafka topic on which lots of data has built up, and a streaming > app

Re: Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
At the distinct level I will have 7000 times more elements in my RDD, so should I repartition? Because its parent will definitely have fewer partitions. How do I see the number of partitions through java code? On 16 April 2015 at 23:07, Jeetendra Gangele wrote: > No I did not tried the partitioning below is

spark.dynamicAllocation.minExecutors

2015-04-16 Thread Michael Stone
The default for spark.dynamicAllocation.minExecutors is 0, but that value causes a runtime error and a message that the minimum is 1. Perhaps the default should be changed to 1? Mike Stone - To unsubscribe, e-mail: user-unsubs

Re: Random pairs / RDD order

2015-04-16 Thread Sean Owen
Use mapPartitions, and then take two random samples of the elements in the partition, and return an iterator over all pairs of them? Should be pretty simple assuming your sample size n is smallish since you're returning ~n^2 pairs. On Thu, Apr 16, 2015 at 7:00 PM, abellet wrote: > Hi everyone, >
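A sketch of what Sean describes, using RDD[String] for concreteness and a smallish per-partition sample size n:

    import scala.util.Random
    import org.apache.spark.rdd.RDD

    // roughly n*n pairs come back from each partition
    def randomPairs(data: RDD[String], n: Int): RDD[(String, String)] =
      data.mapPartitions { iter =>
        val elems = iter.toArray
        if (elems.isEmpty) Iterator.empty
        else {
          def sample() = Array.fill(n)(elems(Random.nextInt(elems.length)))
          val (left, right) = (sample(), sample())
          // all pairs of the two samples, both drawn from the same partition
          (for (a <- left; b <- right) yield (a, b)).iterator
        }
      }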

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Sean Owen
IIRC that was fixed already in 1.3 https://github.com/apache/spark/commit/b2047b55c5fc85de6b63276d8ab9610d2496e08b On Thu, Apr 16, 2015 at 7:41 PM, Michael Stone wrote: > The default for spark.dynamicAllocation.minExecutors is 0, but that value > causes a runtime error and a message that the min

Re: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-16 Thread Michael Armbrust
Filed: https://issues.apache.org/jira/browse/SPARK-6967 Shouldn't they be null? > Statistics are only used to eliminate partitions that can't possibly hold matching values. So while you are right this might result in a false positive, that will not result in a wrong answer.

Re: [SQL] DROP TABLE should also uncache table

2015-04-16 Thread Arush Kharbanda
Yes, i am able to reproduce the problem. Do you need the scripts to create the tables? On Thu, Apr 16, 2015 at 10:50 PM, Yin Huai wrote: > Can your code that can reproduce the problem? > > On Thu, Apr 16, 2015 at 5:42 AM, Arush Kharbanda < > ar...@sigmoidanalytics.com> wrote: > >> Hi >> >> As pe

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Michael Stone
On Thu, Apr 16, 2015 at 07:47:51PM +0100, Sean Owen wrote: IIRC that was fixed already in 1.3 https://github.com/apache/spark/commit/b2047b55c5fc85de6b63276d8ab9610d2496e08b From that commit: + private val minNumExecutors = conf.getInt("spark.dynamicAllocation.minExecutors", 0) ... + if (max

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Sean Owen
Yes, look what it was before -- would also reject a minimum of 0. That's the case you are hitting. 0 is a fine minimum. On Thu, Apr 16, 2015 at 8:09 PM, Michael Stone wrote: > On Thu, Apr 16, 2015 at 07:47:51PM +0100, Sean Owen wrote: >> >> IIRC that was fixed already in 1.3 >> >> >> https://gith

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Michael Stone
On Thu, Apr 16, 2015 at 08:10:54PM +0100, Sean Owen wrote: Yes, look what it was before -- would also reject a minimum of 0. That's the case you are hitting. 0 is a fine minimum. How can 0 be a fine minimum if it's rejected? Changing the value is easy enough, but in general it's nice for defau

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Marcelo Vanzin
I think Michael is referring to this: """ Exception in thread "main" java.lang.IllegalArgumentException: You must specify at least 1 executor! Usage: org.apache.spark.deploy.yarn.Client [options] """ spark-submit --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.minExecut

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Michael Stone
On Thu, Apr 16, 2015 at 12:16:13PM -0700, Marcelo Vanzin wrote: I think Michael is referring to this: """ Exception in thread "main" java.lang.IllegalArgumentException: You must specify at least 1 executor! Usage: org.apache.spark.deploy.yarn.Client [options] """ Yes, sorry, there were too man

Re: Spark on Windows

2015-04-16 Thread Arun Lists
Thanks, Matei! We'll try that and let you know if it works. You are correct in inferring that some of the problems we had were with dependencies. We also had problems with the spark-submit scripts. I will get the details from the engineer who worked on the Windows builds and provide them to you.

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Sean Owen
Looks like that message would be triggered if spark.dynamicAllocation.initialExecutors was not set, or 0, if I read this right. Yeah, that might have to be positive. This requires you set initial executors to 1 if you want 0 min executors. Hm, maybe that shouldn't be an error condition in the args
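Based on that reading, one untested workaround (an assumption, not confirmed in the thread) is to keep the minimum at 0 but set the initial executor count explicitly:

    spark-submit \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.dynamicAllocation.minExecutors=0 \
      --conf spark.dynamicAllocation.initialExecutors=1 \
      ...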

MLlib - Naive Bayes Problem

2015-04-16 Thread riginos
I have a big dataset of categories of cars and descriptions of cars. So I want to give a description of a car and have the program classify the category of that car. So I decided to use multinomial naive Bayes. I created a unique id for each word and replaced my whole category,description data. //My

Re: Re: spark streaming printing no output

2015-04-16 Thread jay vyas
Empty folders generally mean that you just need to increase the window intervals; i.e. spark streaming saveAsTextFiles will save folders for each interval regardless On Wed, Apr 15, 2015 at 5:03 AM, Shushant Arora wrote: > Its printing on console but on HDFS all folders are still empty . > >

Re: [SQL] DROP TABLE should also uncache table

2015-04-16 Thread Yin Huai
Oh, just noticed that I missed "attach"... Yeah, your scripts will be helpful. Thanks! On Thu, Apr 16, 2015 at 12:03 PM, Arush Kharbanda < ar...@sigmoidanalytics.com> wrote: > Yes, i am able to reproduce the problem. Do you need the scripts to create > the tables? > > On Thu, Apr 16, 2015 at 10:5

Re: Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
Akhil, any thought on this? On 16 April 2015 at 23:07, Jeetendra Gangele wrote: > No I did not tried the partitioning below is the full code > > public static void matchAndMerge(JavaRDD > matchRdd,JavaSparkContext jsc) throws IOException{ > long start = System.currentTimeMillis(); > JavaPai

Re: dataframe can not find fields after loading from hive

2015-04-16 Thread Cesar Flores
Never mind. I found the solution: val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd, hiveLoadedDataFrame.schema) which translates to converting the data frame to an rdd and back again to a data frame. Not the prettiest solution, but at least it solves my problems. Thanks, Cesar Flores On

Re: Spark SQL query key/value in Map

2015-04-16 Thread Yin Huai
For Map type column, fields['driver'] is the syntax to retrieve the map value (in the schema, you can see "fields: map"). The syntax of fields.driver is used for struct type. On Thu, Apr 16, 2015 at 12:37 AM, jc.francisco wrote: > Hi, > > I'm new with both Cassandra and Spark and am experimentin
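A small illustration of the distinction, assuming an existing SQLContext sqlContext; the "readings" table, the "location" struct column, and its "city" field are hypothetical, while the fields map column comes from the thread:

    // map column: bracket syntax selects the value for a key
    val byKey = sqlContext.sql("SELECT fields['driver'] FROM readings")

    // struct column: dot syntax addresses a named field instead
    val byField = sqlContext.sql("SELECT location.city FROM readings")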

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Wang, Ningjun (LNG-NPV)
Evo > partition the large doc RDD based on the hash function on the key ie the docid What API do I use to do this? By the way, loading the entire dataset into memory causes an OutOfMemory problem because it is too large (I only have one machine with 16GB and 4 cores). I found something called

When querying ElasticSearch, score is 0

2015-04-16 Thread Andrejs Abele
Hi, I have data in my ElasticSearch server. When I query it using the REST interface, I get results and a score for each result, but when I run the same query in spark using the ElasticSearch API, I get results and metadata, but the score is shown as 0 for each record. My configuration is ... val conf = new

Re: Problem with Spark SQL UserDefinedType and sbt assembly

2015-04-16 Thread Jaonary Rabarisoa
Any ideas ? On Thu, Apr 16, 2015 at 5:04 PM, Jaonary Rabarisoa wrote: > Dear all, > > Here is an issue that gets me mad. I wrote a UserDefineType in order to be > able to store a custom type in a parquet file. In my code I just create a > DataFrame with my custom data type and write in into a pa

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
You can use def partitionBy(partitioner: Partitioner): RDD[(K, V)] Return a copy of the RDD partitioned using the specified partitioner The https://github.com/amplab/spark-indexedrdd stuff looks pretty cool and is something which adds valuable functionality to spark e.g. the point lookups PRO

Re: Timeout errors from Akka in Spark 1.2.1

2015-04-16 Thread N B
Hi Guillaume, Interesting that you brought up Shuffle. In fact we are experiencing this issue of shuffle files being left behind and not being cleaned up. Since this is a Spark streaming application, it is expected to stay up indefinitely, so shuffle files being left is a big problem right now. Si

Re: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Jeetendra Gangele
Does this same functionality exist with Java? On 17 April 2015 at 02:23, Evo Eftimov wrote: > You can use > > def partitionBy(partitioner: Partitioner): RDD[(K, V)] > Return a copy of the RDD partitioned using the specified partitioner > > The https://github.com/amplab/spark-indexedrdd stuff lo

AMP Lab Indexed RDD - question for Data Bricks AMP Labs

2015-04-16 Thread Evo Eftimov
Can somebody from Data Bricks shed more light on this Indexed RDD library https://github.com/amplab/spark-indexedrdd? It seems to come from AMP Labs and most of the Data Bricks guys are from there. What is especially interesting is whether the Point Lookup (and the other primitives) can work fr

Re: regarding ZipWithIndex

2015-04-16 Thread Jeetendra Gangele
Can you please guide me on how I can extend RDD and convert it in the way you are suggesting? On 16 April 2015 at 23:46, Jeetendra Gangele wrote: > I type T i already have Object ... I have RDD and then I am > calling ZipWithIndex on this RDD and getting RDD on this I am > running MapToPair and con

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
Yes simply look for partitionby in the javadoc for e.g. PairJavaRDD From: Jeetendra Gangele [mailto:gangele...@gmail.com] Sent: Thursday, April 16, 2015 9:57 PM To: Evo Eftimov Cc: Wang, Ningjun (LNG-NPV); user Subject: Re: How to join RDD keyValuePairs efficiently Does this same functiona
