Re: multi-threaded Spark jobs

2016-01-25 Thread Igor Berman
IMHO, you are making a mistake. Spark manages tasks and cores internally. When you open new threads inside an executor you are "over-provisioning" the executor (e.g. tasks on other cores will be preempted). On 26 January 2016 at 07:59, Elango Cheran wrote: > Hi everyone, > I've gone through the eff

multi-threaded Spark jobs

2016-01-25 Thread Elango Cheran
Hi everyone, I've gone through the effort of figuring out how to modify a Spark job to have an operation become multi-threaded inside an executor. I've written up an explanation of what worked, what didn't work, and why: http://www.elangocheran.com/blog/2016/01/using-clojure-to-create-multi-threa

Fwd: Spark partition size tuning

2016-01-25 Thread Jia Zou
Dear all, First, an update: the local file system data partition size can be tuned by: sc.hadoopConfiguration().setLong("fs.local.block.size", blocksize) However, I also need to tune the Spark data partition size for input data that is stored in Tachyon (default is 512MB), but the above method can't w
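A minimal sketch of the tuning that does work for local files, assuming a Scala SparkContext named sc and an illustrative 32 MB block size (the Tachyon case is the open question and is not covered here):

    sc.hadoopConfiguration.setLong("fs.local.block.size", 32L * 1024 * 1024)  // split size for file:// input
    val rdd = sc.textFile("file:///data/input.txt")                           // path is illustrative
    println(rdd.partitions.length)                                            // reflects the configured block size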

spark-sql[1.4.0] not compatible hive sql when using in with date_sub or regexp_replace

2016-01-25 Thread our...@cnsuning.com
Hi all, when migrating Hive SQL to Spark SQL I encountered an incompatibility problem. Please give me some suggestions. Hive table description and data format as following: 1 use spark; drop table spark.test_or1; CREATE TABLE `spark.test_or1`( `statis_date` string, `lbl_nm` string) r

Re: Dataframe, Spark SQL - Drops First 8 Characters of String on Amazon EMR

2016-01-25 Thread awzurn
Sorry for the bump, but wondering if anyone else has seen this before. We're hoping to either resolve this soon, or move on with further steps to move this into an issue. Thanks in advance, Andrew Zurn -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Datafr

Re: Can Spark read input data from HDFS centralized cache?

2016-01-25 Thread Ted Yu
Please see also: http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html According to Chris Nauroth, an hdfs committer, it's extremely difficult to use the feature correctly. The feature also brings operational complexity. Since off-heap memory is used

Re: Spark Streaming Write Ahead Log (WAL) not replaying data after restart

2016-01-25 Thread Shixiong(Ryan) Zhu
You need to define a create function and use StreamingContext.getOrCreate. See the example here: http://spark.apache.org/docs/latest/streaming-programming-guide.html#how-to-configure-checkpointing On Thu, Jan 21, 2016 at 3:32 AM, Patrick McGloin wrote: > Hi all, > > To have a simple way of testi
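A minimal sketch of the getOrCreate pattern from the linked guide, assuming the Spark 1.6 Scala API; the checkpoint directory is illustrative:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/wal-test-checkpoint"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("wal-test")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // build the full DStream graph here, before returning ssc
      ssc
    }

    // restores from the checkpoint (and replays WAL data) if one exists, otherwise calls createContext
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()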

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-25 Thread Ram Sriharsha
Hi David What happens if you provide the class labels via metadata instead of letting OneVsRest determine the labels? Ram On Mon, Jan 25, 2016 at 3:06 PM, David Brooks wrote: > Hi, > > I've run into an exception using MLlib OneVsRest with logistic regression > (v1.6.0, but also in previous ver

Re: Spark Streaming - Custom ReceiverInputDStream ( Custom Source) In java

2016-01-25 Thread Tathagata Das
See how other Java wrapper classes use JavaSparkContext.fakeClassTag, for example: https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaMapWithStateDStream.scala On Fri, Jan 22, 2016 at 2:00 AM, Nagu Kothapalli wrote: > Hi > > Anyone have any i

MLlib OneVsRest causing intermittent exceptions

2016-01-25 Thread David Brooks
Hi, I've run into an exception using MLlib OneVsRest with logistic regression (v1.6.0, but also in previous versions). The issue is intermittent. When running multiclass classification with K-fold cross validation, there are scenarios where the split does not contain instances for every target l

RE: a question about web ui log

2016-01-25 Thread Mohammed Guller
I am not sure whether you can copy the log files from Spark workers to your local machine and view them from the Web UI. In fact, if you are able to copy the log files locally, you can just view them directly in any text editor. I suspect what you really want to see is the application history. Her

Re: cast column string -> timestamp in Parquet file

2016-01-25 Thread Cheng Lian
The following snippet may help: sqlContext.read.parquet(path).withColumn("col_ts", $"col".cast(TimestampType)).drop("col") Cheng On 1/21/16 6:58 AM, Muthu Jayakumar wrote: DataFrame and udf. This may be more performant than doing an RDD transformation as you'll only transform just the colu
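Expanded into a slightly fuller sketch, assuming a Scala sqlContext and that the string column is called col (names and path are illustrative):

    import org.apache.spark.sql.types.TimestampType
    import sqlContext.implicits._                     // enables the $"..." column syntax

    val df = sqlContext.read.parquet("/path/to/data.parquet")
    val withTs = df.withColumn("col_ts", $"col".cast(TimestampType)).drop("col")
    withTs.printSchema()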

Generic Dataset Aggregator

2016-01-25 Thread Deenar Toraskar
Hi All https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html I have been converting my UDAFs to Dataset (Datasets are cool BTW) Aggregators. I have an ArraySum aggregator that does an element-wise sum of arrays. I have got the simple version working, but

understanding iterative algorithms in Spark

2016-01-25 Thread Raghava
Hello All, I am new to Spark and I am trying to understand how iterative application of operations is handled in Spark. Consider the following program in Scala. var u = sc.textFile(args(0)+"s1.txt").map(line => { line.split("\\|") match { case Array(x,y) => (y.toInt,x.toInt)}})

Standalone scheduler issue - one job occupies the whole cluster somehow

2016-01-25 Thread Mikhail Strebkov
Hi all, Recently we started having issues with one of our background processing scripts which we run on Spark. The cluster runs only two jobs. One job runs for days, and another usually takes a couple of hours. Both jobs have a cron schedule. The cluster is small, just 2 slaves, 24 cores, 25.4 G

Re: Datasets and columns

2016-01-25 Thread Michael Armbrust
There is no public API for custom encoders yet, but since your class looks like a bean you should be able to use the `bean` method instead of `kryo`. This will expose the actual columns. On Mon, Jan 25, 2016 at 2:04 PM, Steve Lewis wrote: > Ok when I look at the schema it looks like KRYO makes o
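A sketch of the bean-encoder suggestion, assuming Spark 1.6, a Scala SQLContext named sqlContext, and that MyType is a JavaBean (no-arg constructor plus getters/setters):

    import org.apache.spark.sql.{Dataset, Encoders}

    val beanEncoder = Encoders.bean(classOf[MyType])          // instead of Encoders.kryo(classOf[MyType])
    val datasetMyType: Dataset[MyType] =
      sqlContext.createDataset(rddMyType)(beanEncoder)        // rddMyType: RDD[MyType]
    datasetMyType.printSchema()                               // one column per bean property, not one binary column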

Re: Datasets and columns

2016-01-25 Thread Steve Lewis
Ok, when I look at the schema it looks like Kryo makes one column. Is there a way to do a custom encoder with my own columns? On Jan 25, 2016 1:30 PM, "Michael Armbrust" wrote: > The encoder is responsible for mapping your class onto some set of > columns. Try running: datasetMyType.printSchema() >

Re: Running kafka consumer in local mode - error - connection timed out

2016-01-25 Thread Supreeth
Hmm, thanks for the response. The current value I have for socket.timeout.ms is 12. I am not sure if this needs a higher value; there isn't much in the logs. The retry aspect makes sense, I can work around the same. -S On Mon, Jan 25, 2016 at 11:51 AM, Cody Koeninger wrote: > Should be socket.t

Re: Can Spark read input data from HDFS centralized cache?

2016-01-25 Thread Ted Yu
Have you read this thread ? http://search-hadoop.com/m/uOzYttXZcg1M6oKf2/HDFS+cache&subj=RE+hadoop+hdfs+cache+question+do+client+processes+share+cache+ Cheers On Mon, Jan 25, 2016 at 1:23 PM, Jia Zou wrote: > I configured HDFS to cache file in HDFS's cache, like following: > > hdfs cacheadmin

Re: mapWithState and context start when checkpoint exists

2016-01-25 Thread Andrey Yegorov
Thank you! What would be the best alternative to simulate a stream for testing purposes, e.g. from a sequence or a text file? In production I'll use Kafka as a source but locally I wanted to mock it. Worst case I'll have to set up/tear down a Kafka cluster in tests, but I think having a mock will b

Re: streaming textFileStream problem - got only ONE line

2016-01-25 Thread Shixiong(Ryan) Zhu
Did you move the file into "hdfs://helmhdfs/user/patcharee/cerdata/", or write into it directly? `textFileStream` requires that files must be written to the monitored directory by "moving" them from another location within the same file system. On Mon, Jan 25, 2016 at 6:30 AM, patcharee wrote: >
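A hedged sketch of the "move into the monitored directory" pattern, assuming the writer is a Hadoop client on the same HDFS; paths are illustrative:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    // write the file completely under a staging directory first...
    val staged = new Path("hdfs://helmhdfs/user/patcharee/_staging/part-00000")
    // ...then rename it into the monitored directory (a move within the same file system)
    fs.rename(staged, new Path("hdfs://helmhdfs/user/patcharee/cerdata/part-00000"))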

Re: Datasets and columns

2016-01-25 Thread Michael Armbrust
The encoder is responsible for mapping your class onto some set of columns. Try running: datasetMyType.printSchema() On Mon, Jan 25, 2016 at 1:16 PM, Steve Lewis wrote: > assume I have the following code > > SparkConf sparkConf = new SparkConf(); > > JavaSparkContext sqlCtx= new JavaSparkContex

Re: mapWithState and context start when checkpoint exists

2016-01-25 Thread Shixiong(Ryan) Zhu
Hey Andrey, `ConstantInputDStream` doesn't support checkpoint as it contains an RDD field. It cannot resume from checkpoints. On Mon, Jan 25, 2016 at 1:12 PM, Andrey Yegorov wrote: > Hi, > > I am new to spark (and scala) and hope someone can help me with the issue > I got stuck on in my experim

Can Spark read input data from HDFS centralized cache?

2016-01-25 Thread Jia Zou
I configured HDFS to cache files in HDFS's cache, like the following: hdfs cacheadmin -addPool hibench hdfs cacheadmin -addDirective -path /HiBench/Kmeans/Input -pool hibench But I didn't see much performance impact, no matter how I configure dfs.datanode.max.locked.memory Is it possible that Spa

Re: Trouble dropping columns from a DataFrame that has other columns with dots in their names

2016-01-25 Thread Joshua TAYLOR
Thanks Michael, hopefully those will get some attention for a not too distant release. Do you think that this is related to, or separate from, a similar issue [1] that I filed a bit earlier, regarding the way that StringIndexer (and perhaps other ML components) handles some of these columns? (I'v

Datasets and columns

2016-01-25 Thread Steve Lewis
assume I have the following code SparkConf sparkConf = new SparkConf(); JavaSparkContext sqlCtx= new JavaSparkContext(sparkConf); JavaRDD rddMyType= generateRDD(); // some code Encoder evidence = Encoders.kryo(MyType.class); Dataset datasetMyType= sqlCtx.createDataset( rddMyType.rdd(), evidence

Re: bug for large textfiles on windows

2016-01-25 Thread Christopher Bourez
Here is a pic of memory. If I put --conf spark.driver.memory=3g, it increases the displayed memory, but the problem remains... for a file that is only 13M. Christopher Bourez 06 17 17 50 60 On Mon, Jan 25, 2016 at 10:06 PM, Christopher Bourez < christopher.bou...@gmail.com> wrote: > The same probl

mapWithState and context start when checkpoint exists

2016-01-25 Thread Andrey Yegorov
Hi, I am new to spark (and scala) and hope someone can help me with the issue I got stuck on in my experiments/learning. mapWithState from spark 1.6 seems to be a great way for the task I want to implement with spark but unfortunately I am getting error "RDD transformations and actions can only b

Re: Sharing HiveContext in Spark JobServer / getOrCreate

2016-01-25 Thread Deenar Toraskar
On 25 January 2016 at 21:09, Deenar Toraskar < deenar.toras...@thinkreactive.co.uk> wrote: > No I hadn't. This is useful, but in some cases we do want to share the > same temporary tables between jobs so really wanted a getOrCreate > equivalent on HIveContext. > > Deenar > > > > On 25 January 2016

Re: Trouble dropping columns from a DataFrame that has other columns with dots in their names

2016-01-25 Thread Michael Armbrust
Looks like you found a bug. I've filed them here: SPARK-12987 - Drop fails when columns contain dots SPARK-12988 - Can't drop columns that contain dots On Fri, Jan 22, 2016 at 3:18 PM, Joshua

Trouble dropping columns from a DataFrame that has other columns with dots in their names

2016-01-25 Thread Joshua TAYLOR
(Apologies if this has arrived more than once. I've subscribed to the list, and tried posting via email with no success. This an intentional repost to see if things are going through yet.) I've been having lots of trouble with DataFrames whose columns have dots in their names today. I know that

Re: bug for large textfiles on windows

2016-01-25 Thread Christopher Bourez
The same problem occurs on my desktop at work. What's great with AWS Workspace is that you can easily reproduce it. I created the test file with the commands:
    for i in {0..30}; do
      VALUE="$RANDOM"
      for j in {0..6}; do VALUE="$VALUE;$RANDOM"; done
      echo $VALUE >> test.csv
    done
Christoph

Re: bug for large textfiles on windows

2016-01-25 Thread Christopher Bourez
Josh, Thanks a lot ! You can download a video I created : https://s3-eu-west-1.amazonaws.com/christopherbourez/public/video.mov I created a sample file of 13 MB as explained : https://s3-eu-west-1.amazonaws.com/christopherbourez/public/test.csv Here are the commands I ran: I created an Aws Wo

Re: Spark RDD DAG behaviour understanding in case of checkpointing

2016-01-25 Thread Tathagata Das
First of all, if you are running batches of 15 minutes, and you don't need second-level latencies, it might be just easier to run batch jobs in a for loop - you will have greater control over what is going on. And if you are using reduceByKeyAndWindow without the inverseReduceFunction, then Spark ha

Re: NA value handling in sparkR

2016-01-25 Thread Deborah Siegel
Maybe not ideal, but since read.df is inferring all columns from the csv containing "NA" as string type, one could filter them rather than using dropna(). filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA") head(filtered_aq) Perhaps it would be better to have an option for read.d

Re: Spark RDD DAG behaviour understanding in case of checkpointing

2016-01-25 Thread Cody Koeninger
Where are you calling checkpointing? Metadata checkpointing for a Kafka direct stream should just be the offsets, not the data. TD can better speak to reduceByKeyAndWindow behavior when restoring from a checkpoint, but ultimately the only available choices would be to replay the prior window data from

Re: Running kafka consumer in local mode - error - connection timed out

2016-01-25 Thread Cody Koeninger
Should be socket.timeout.ms on the map of kafka config parameters. The lack of retry is probably due to the differences between running spark in local mode vs standalone / mesos / yarn. On Mon, Jan 25, 2016 at 1:19 PM, Supreeth wrote: > We are running a Kafka Consumer in local mode using Spar
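A sketch of where that setting goes, assuming the Spark 1.6 direct-stream API; broker list, timeout value and topic name are illustrative:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092,broker2:9092",
      "socket.timeout.ms"    -> "60000")        // passed straight through to the Kafka consumer

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my_topic"))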

Re: Spark integration with HCatalog (specifically regarding partitions)

2016-01-25 Thread Elliot West
Thanks for your response Jorge and apologies for my delay in replying. I took your advice with case 5 and declared the column names explicitly instead of the wildcard. This did the trick and I can now add partitions to an existing table. I also tried removing the 'partitionBy("id")' call as suggest

Re: how to build spark with out hive

2016-01-25 Thread Ted Yu
Spark 1.5.2 depends on slf4j 1.7.10. Looks like there was another version of slf4j on the classpath. FYI On Mon, Jan 25, 2016 at 12:19 AM, kevin wrote: > HI,all > I need to test hive on spark ,to use spark as the hive's execute > engine. > I download the spark source 1.5.2 from apache

Running kafka consumer in local mode - error - connection timed out

2016-01-25 Thread Supreeth
We are running a Kafka Consumer in local mode using Spark Streaming, KafkaUtils.createDirectStream. The job runs as expected, however once in a very long time, I see the following exception. Wanted to check if others have faced a similar issue, and what are the right timeout parameters to c

Re: How to setup a long running spark streaming job with continuous window refresh

2016-01-25 Thread Tathagata Das
You can use a 1 minute tumbling window dstream.window(Minutes(1), Minutes(1)).foreachRDD { rdd => // calculate stats per key } On Thu, Jan 21, 2016 at 4:59 AM, Santoshakhilesh < santosh.akhil...@huawei.com> wrote: > Hi, > > I have following scenario in my project; > > 1.I will continue to
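Slightly expanded, assuming dstream is a DStream of (key, numeric value) pairs; the per-key stat here is just a sum and count as an illustration:

    import org.apache.spark.streaming.Minutes

    dstream.window(Minutes(1), Minutes(1)).foreachRDD { rdd =>
      // one RDD per minute, containing exactly that minute's records
      val statsPerKey = rdd.mapValues(v => (v, 1L))
        .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }   // (sum, count) per key
      statsPerKey.collect().foreach(println)
    }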

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-01-25 Thread Nirav Patel
I haven't gone through many details of the Spark Catalyst optimizer and Tungsten project, but we have been advised by Databricks support to use DataFrame to resolve issues with the OOM errors that we are getting during Join and GroupBy operations. We use spark 1.3.1 and it looks like it can not perform external

Re: Concurrent Spark jobs

2016-01-25 Thread emlyn
Jean wrote > Have you considered using pools? > http://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools > > I haven't tried that by myself, but it looks like pool setting is applied > per thread so that means it's possible to configure fair scheduler, so > that more, than one
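The mechanism the linked doc describes boils down to two settings; a minimal sketch in Scala, assuming a pool named pool_a is defined in fairscheduler.xml (names are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("concurrent-jobs").set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // inside each thread that submits jobs:
    sc.setLocalProperty("spark.scheduler.pool", "pool_a")
    sc.parallelize(1 to 1000000).count()               // this job is scheduled in pool_a
    sc.setLocalProperty("spark.scheduler.pool", null)  // clear the pool for later jobs on this thread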

Re: bug for large textfiles on windows

2016-01-25 Thread Josh Rosen
Hi Christopher, What would be super helpful here is a standalone reproduction. Ideally this would be a single Scala file or set of commands that I can run in `spark-shell` in order to reproduce this. Ideally, this code would generate a giant file, then try to read it in a way that demonstrates the

Re: Spark master takes more time with local[8] than local[1]

2016-01-25 Thread nsalian
Hi, Thanks for the question. Is it possible for you to elaborate on your application? The flow of the application will help in understanding what could potentially cause things to slow down. Do the logs give you any idea of what goes on? Have you had a chance to look? Thank you. - Neelesh S. Salian

Re: Sharing HiveContext in Spark JobServer / getOrCreate

2016-01-25 Thread Ted Yu
Have you noticed the following method of HiveContext ? * Returns a new HiveContext as new session, which will have separated SQLConf, UDF/UDAF, * temporary tables and SessionState, but sharing the same CacheManager, IsolatedClientLoader * and Hive client (both of execution and metadata) w
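For reference, a minimal sketch of the newSession approach Ted quotes, assuming Spark 1.6 (note that temporary tables then become per-session rather than shared):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)        // created once for the shared SparkContext
    val sessionCtx = hiveContext.newSession()    // separate SQLConf/UDFs/temp tables; shared CacheManager and Hive client
    sessionCtx.sql("SELECT 1").show()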

Re: a question about web ui log

2016-01-25 Thread Philip Lee
As I mentioned before, I am trying to see the Spark log on a cluster via ssh-tunnel. 1) The error on the application details UI is probably from monitoring port 4044. The Web UI port is 8088, right? So how could I see the job web UI view and the application details UI view in the web UI on my local machine? 2

Re: [Spark] Reading avro file in Spark 1.3.0

2016-01-25 Thread Kevin Mellott
I think that you may be looking at documentation pertaining to the more recent versions of Spark. Try looking at the examples linked below, which applies to the Spark 1.3 version. There aren't many Java examples, but the code should be very similar to the Scala ones (i.e. using "load" instead of "r
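A hedged sketch of the 1.3-style call, assuming the spark-avro package is on the classpath; the path is illustrative:

    // Spark 1.3: use load(...) rather than the read() API that later versions added
    val df = sqlContext.load("/data/events.avro", "com.databricks.spark.avro")
    df.printSchema()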

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-01-25 Thread Mark Hamstra
What do you think is preventing you from optimizing your own RDD-level transformations and actions? AFAIK, nothing that has been added in Catalyst precludes you from doing that. The fact of the matter is, though, that there is less type and semantic information available to Spark from the raw RDD

Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-01-25 Thread Nirav Patel
Hi, Perhaps I should write a blog about why Spark is focusing more on making it easier to write Spark jobs while hiding the underlying performance optimization details from seasoned Spark users. It's one thing to provide such an abstract framework that does optimization for you so you don't have to worry

Re: Determine Topic MetaData Spark Streaming Job

2016-01-25 Thread Gerard Maas
That's precisely what this constructor does: KafkaUtils.createDirectStream[...](ssc, kafkaConfig, topics) Is there a reason to do that yourself? In that case, look at how it's done in Spark Streaming for inspiration: https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/ap

Re: Determine Topic MetaData Spark Streaming Job

2016-01-25 Thread Ashish Soni
Correct. What I am trying to achieve is that before the streaming job starts, I query the topic metadata from Kafka, determine all the partitions and provide those to the direct API. So my question is: should I consider passing all the partitions from the command line, or query Kafka to find and provide them, wha

Re: Determine Topic MetaData Spark Streaming Job

2016-01-25 Thread Gerard Maas
What are you trying to achieve? Looks like you want to provide offsets but you're not managing them, and I'm assuming you're using the direct stream approach. In that case, use the simpler constructor that takes the kafka config and the topics. Let it figure out the offsets (it will contact kaf

a question about web ui log

2016-01-25 Thread Philip Lee
Hello, a question about web UI log. I could see the web interface log after forwarding the port on my cluster to my local machine and clicking on a completed application, but when I clicked "application detail UI" [image: Inline image 1] It happened to me. I do not know why. I also checked the specific log folder

Determine Topic MetaData Spark Streaming Job

2016-01-25 Thread Ashish Soni
Hi All, What is the best way to tell a Spark Streaming job the number of partitions for a given topic - should that be provided as a parameter or command line argument, or should we connect to Kafka in the driver program and query it? Map fromOffsets = new HashMap(); fromOffsets.put(new TopicAndPa

bug for large textfiles on windows

2016-01-25 Thread Christopher Bourez
Dears, I would like to re-open a case for a potential bug (current status is resolved but it seems it is not): https://issues.apache.org/jira/browse/SPARK-12261 I believe there is something wrong with the memory management under Windows. It has

Sharing HiveContext in Spark JobServer / getOrCreate

2016-01-25 Thread Deenar Toraskar
Hi I am using a shared sparkContext for all of my Spark jobs. Some of the jobs use HiveContext, but there isn't a getOrCreate method on HiveContext which will allow reuse of an existing HiveContext. Such a method exists on SQLContext only (def getOrCreate(sparkContext: SparkContext): SQLContext).
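Since HiveContext has no getOrCreate, one possible workaround (an assumption about the job-server setup, not an established Spark API) is a lazily created singleton keyed off the shared SparkContext:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext

    object SharedHiveContext {
      @transient private var instance: HiveContext = _
      def getOrCreate(sc: SparkContext): HiveContext = synchronized {
        if (instance == null) instance = new HiveContext(sc)
        instance
      }
    }

    // every job then does:
    val hiveContext = SharedHiveContext.getOrCreate(sc)   // temp tables registered here are visible to all jobs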

hivethriftserver2 problems on upgrade to 1.6.0

2016-01-25 Thread james.gre...@baesystems.com
On upgrading from 1.5.0 to 1.6.0 I have a problem with HiveThriftServer2. I have this code: val hiveContext = new HiveContext(SparkContext.getOrCreate(conf)); val thing = hiveContext.read.parquet("hdfs://dkclusterm1.imp.net:8020/user/jegreen1/ex208") thing.registerTempTable("thing") HiveTh

streaming textFileStream problem - got only ONE line

2016-01-25 Thread patcharee
Hi, My streaming application is receiving data from the file system and just prints the input count every 1 sec interval, as in the code below: val sparkConf = new SparkConf() val ssc = new StreamingContext(sparkConf, Milliseconds(interval_ms)) val lines = ssc.textFileStream(args(0)) lines.count().pr

Re: 10hrs of Scheduler Delay

2016-01-25 Thread Darren Govoni
Yeah. I have screenshots and stack traces. I will post them to the ticket. Nothing informative. I should also mention I'm using pyspark but I think the deadlock is inside the Java scheduler code. Sent from my Verizon Wireless 4G LTE smartphone Original message From: "S

Re: How to discretize Continuous Variable with Spark DataFrames

2016-01-25 Thread Jeff Zhang
It's very straightforward, please refer to the document here: http://spark.apache.org/docs/latest/ml-features.html#bucketizer On Mon, Jan 25, 2016 at 10:09 PM, Eli Super wrote: > Thanks Joshua , > > I can't understand what algorithm behind Bucketizer , how discretization > done ? > > Best Regards

Re: 10hrs of Scheduler Delay

2016-01-25 Thread Ted Yu
Yes, thread dump plus log would be helpful for debugging. Thanks > On Jan 25, 2016, at 5:59 AM, Sanders, Isaac B > wrote: > > Is the thread dump the stack trace you are talking about? If so, I will see > if I can capture the few different stages I have seen it in. > > Thanks for the help, I

Re: How to discretize Continuous Variable with Spark DataFrames

2016-01-25 Thread Eli Super
Thanks Joshua, I can't understand what algorithm is behind Bucketizer; how is the discretization done? Best Regards On Mon, Jan 25, 2016 at 3:40 PM, Joshua TAYLOR wrote: > It sounds like you may want the Bucketizer in SparkML. The overview docs > [1] include, "Bucketizer transforms a column of con

Re: 10hrs of Scheduler Delay

2016-01-25 Thread Sanders, Isaac B
Is the thread dump the stack trace you are talking about? If so, I will see if I can capture the few different stages I have seen it in. Thanks for the help, I was able to do it for 0.1% of my data. I will create the JIRA. Thanks, Isaac On Jan 25, 2016, at 8:51 AM, Ted Yu mailto:yuzhih...@gma

Re: 10hrs of Scheduler Delay

2016-01-25 Thread Ted Yu
Opening a JIRA is fine. See if you can capture stack trace during the hung stage and attach to JIRA so that we have more clue. Thanks > On Jan 25, 2016, at 4:25 AM, Darren Govoni wrote: > > Probably we should open a ticket for this. > There's definitely a deadlock situation occurring in spa

Re: How to discretize Continuous Variable with Spark DataFrames

2016-01-25 Thread Joshua TAYLOR
It sounds like you may want the Bucketizer in SparkML. The overview docs [1] include, "Bucketizer transforms a column of continuous features to a column of feature buckets, where the buckets are specified by users." [1]: http://spark.apache.org/docs/latest/ml-features.html#bucketizer On Mon, Jan
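For reference, a minimal Bucketizer sketch based on the linked page, assuming Spark 1.6 and a DataFrame df with a numeric column raw_value; there is no discretization algorithm inside Bucketizer itself, it only bins by the user-supplied splits, so equal-frequency or k-means boundaries would have to be computed separately and passed in (split values below are illustrative):

    import org.apache.spark.ml.feature.Bucketizer

    val splits = Array(Double.NegativeInfinity, -10.0, 0.0, 10.0, Double.PositiveInfinity)
    val bucketizer = new Bucketizer()
      .setInputCol("raw_value")
      .setOutputCol("bucketed_value")
      .setSplits(splits)

    val bucketed = bucketizer.transform(df)   // adds a column holding the bucket index for each row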

Re: Launching EC2 instances with Spark compiled for Scala 2.11

2016-01-25 Thread Darren Govoni
Why not deploy it, then build a custom distribution with Scala 2.11 and just overlay it? Sent from my Verizon Wireless 4G LTE smartphone Original message From: Nuno Santos Date: 01/25/2016 7:38 AM (GMT-05:00) To: user@spark.apache.org Subject: Re: Launching EC2 ins

Re: Launching EC2 instances with Spark compiled for Scala 2.11

2016-01-25 Thread Nuno Santos
Hello, Any updates on this question? I'm also very interested in a solution, as I'm trying to use Spark on EC2 but need Scala 2.11 support. The scripts in the ec2 directory of the Spark distribution use Scala 2.10 by default and I can't see any obvious option to change to Scala 2.11. Re

[Spark] Reading avro file in Spark 1.3.0

2016-01-25 Thread diplomatic Guru
Hello guys, I've been trying to read an avro file using Spark's DataFrame API but it's throwing this error: java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext.read()Lorg/apache/spark/sql/DataFrameReader; This is what I've done so far: I've added the dependency to pom.xml: co

Re: 10hrs of Scheduler Delay

2016-01-25 Thread Darren Govoni
Probably we should open a ticket for this. There's definitely a deadlock situation occurring in Spark under certain conditions. The only clue I have is it always happens on the last stage. And it does seem sensitive to scale. If my job has 300mb of data I'll see the deadlock. But if I only

Getting top distinct strings from arraylist

2016-01-25 Thread Patrick Plaatje
Hi, I’m quite new to Spark and MR, but have a requirement to get all distinct values with their respective counts from a transactional file. Let’s assume the following file format: 0 1 2 3 4 5 6 7 1 3 4 5 8 9 9 10 11 12 13 14 15 16 17 18 1 4 7 11 12 13 19 20 3 4 7 11 15 20 21 22 23 1 2 5 9 11
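One common sketch for this requirement, assuming whitespace-separated items per transaction line and an illustrative top-10 (the file name is hypothetical):

    val counts = sc.textFile("transactions.txt")
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(item => (item, 1L))
      .reduceByKey(_ + _)                              // (distinct value, count)

    val top10 = counts.top(10)(Ordering.by(_._2))      // highest counts first
    top10.foreach(println)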

Undefined job output-path error in Spark on hive

2016-01-25 Thread Akhilesh Pathodia
Hi, I am getting following exception in Spark while writing to hive partitioned table in parquet format: 16/01/25 03:56:40 ERROR executor.Executor: Exception in task 0.2 in stage 1.0 (TID 3) java.io.IOException: Undefined job output-path at org.apache.hadoop.mapred.FileOutputFormat.getTa

Re: [Streaming-Kafka] How to start from topic offset when streamcontext is using checkpoint

2016-01-25 Thread Yash Sharma
For specific offsets you can directly pass the offset ranges and use the KafkaUtils. createRDD to get the events those were missed in the Dstream. - Thanks, via mobile, excuse brevity. On Jan 25, 2016 3:33 PM, "Raju Bairishetti" wrote: > Hi Yash, >Basically, my question is how to avoid stor
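A sketch of that createRDD call, assuming the Spark 1.6 Kafka API; the topic, partition and offsets are illustrative, and kafkaParams is the same broker config used by the stream:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val offsetRanges = Array(OffsetRange("my_topic", 0, 100L, 200L))   // topic, partition, fromOffset, untilOffset
    val missed = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)                                   // RDD of (key, value) for exactly that range
    println(missed.count())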

SparkSQL return all null fields when FIELDS TERMINATED BY '\t' and have a partition.

2016-01-25 Thread Liu Yiding
Hi, all I am using CDH 5.5 (spark 1.5 and hive 1.1). I encountered a strange problem. In hive: hive> create table `tmp.test_d`(`id` int, `name` string) PARTITIONED BY (`dt` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; hive> load data local inpath '/var/lib/hive/dataimport/mendian/target/te

Re: NA value handling in sparkR

2016-01-25 Thread Devesh Raj Singh
Hi, Yes you are right. I think the problem is with the reading of csv files. read.df is not considering NAs in the CSV file. So what would be a workable solution for dealing with NAs in csv files? On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel wrote: > Hi Devesh, > > I'm not certain why that's h

Re: SparkSQL : "select non null values from column"

2016-01-25 Thread Deng Ching-Mallete
Hi, Have you tried using IS NOT NULL for the where condition? Thanks, Deng On Mon, Jan 25, 2016 at 7:00 PM, Eli Super wrote: > Hi > > I try to select all values but not NULL values from column contains NULL > values > > with > > sqlContext.sql("select my_column from my_table where my_column <>
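For reference, the working SQL form (table and column names as in the question), plus the equivalent filter on a hypothetical DataFrame df:

    sqlContext.sql("SELECT my_column FROM my_table WHERE my_column IS NOT NULL").show(15)

    // or directly on the DataFrame:
    df.filter(df("my_column").isNotNull).select("my_column").show(15)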

How to discretize Continuous Variable with Spark DataFrames

2016-01-25 Thread Eli Super
Hi What is the best way to discretize a continuous variable within Spark DataFrames? I want to discretize some variable 1) by equal frequency 2) by k-means I usually use R for this purpose: http://www.inside-r.org/packages/cran/arules/docs/discretize R code for example: ### equal frequency

SparkSQL : "select non null values from column"

2016-01-25 Thread Eli Super
Hi I try to select all values but not NULL values from column contains NULL values with sqlContext.sql("select my_column from my_table where my_column <> null ").show(15) or sqlContext.sql("select my_column from my_table where my_column != null ").show(15) I get empty result Thanks !

RangePartitioning skewed data

2016-01-25 Thread jluan
Let's say I have a dataset of (K,V) where the keys are really skewed: myDataRDD = [(8, 1), (8, 13), (1,1), (2,4)] [(8, 12), (8, 15), (8, 7), (8, 6), (8, 4), (8, 3), (8, 4), (10,2)] If I applied a RangePartitioner to this set of data, say val rangePart = new RangePartitioner(4, myDataRDD) and then
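A small sketch for seeing how RangePartitioner spreads the skewed keys, assuming myDataRDD is an RDD of (Int, Int) pairs as in the example:

    import org.apache.spark.RangePartitioner

    val rangePart = new RangePartitioner(4, myDataRDD)
    val partitioned = myDataRDD.partitionBy(rangePart)

    // number of records per partition, to see where key 8 ends up
    partitioned.mapPartitionsWithIndex { (i, it) => Iterator((i, it.size)) }
      .collect()
      .foreach(println)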

how to build spark with out hive

2016-01-25 Thread kevin
Hi all, I need to test Hive on Spark, to use Spark as Hive's execution engine. I downloaded the Spark source 1.5.2 from the Apache web-site. I have installed maven 3.3.9 and scala 2.10.6, so I changed the make-distribution.sh script to point to my mvn location where I installed it. Then I run