Re: How could I do this algorithm in Spark?

2016-02-25 Thread Guillermo Ortiz
I don't see that sorting the data helps. The answer has to be all the associations. In this case the answer has to be: a , b --> it was an error in the question, sorry. b , d c , d x , y y , y I feel like all the data which is associated should be in the same executor. In this case, if I order the in

Re: Restricting number of cores not resulting in reduction in parallelism

2016-02-25 Thread ankursaxena86
Following are the configuration changes I made to spark-env.sh: export SPARK_EXECUTOR_INSTANCES=1 export SPARK_EXECUTOR_CORES=1 export SPARK_WORKER_CORES=1 export SPARK_WORKER_INSTANCES=1 And yes, I mean to execute only one task per node at a time.

Re: Restricting number of cores not resulting in reduction in parallelism

2016-02-25 Thread ankursaxena86
Here are the log messages I'm getting when calling this command: /usr/local/spark/bin/spark-submit --verbose --driver-memory 14G --num-executors 1 --driver-cores 1 --executor-memory 10G --class org.apache.spark.mllib.feature.myclass --driver-java-options -Djava.library.path=/usr/local/lib --mas

Re: How to Exploding a Map[String,Int] column in a DataFrame (Scala)

2016-02-25 Thread Anthony Brew
Thanks guys, both of those answers are really helpful. Really appreciate this. On 25 Feb 2016 12:05 a.m., "Michał Zieliński" wrote: > Hi Anthony, > > Hopefully that's what you wanted :) > > case class Example(id:String, myMap: Map[String,Int]) >> val myDF = sqlContext.createDataFrame( >> Seq( >>

Re: What is the point of alpha value in Collaborative Filtering in MLlib ?

2016-02-25 Thread Sean Owen
This isn't specific to Spark; it's from the original paper. alpha doesn't do a whole lot, and it is a global hyperparam. It controls the relative weight of observed versus unobserved user-product interactions in the factorization. Higher alpha means it's much more important to faithfully reproduce

Re: What is the point of alpha value in Collaborative Filtering in MLlib ?

2016-02-25 Thread Hiroyuki Yamada
Hello Sean, Thank you very much for the quick response. That helps me a lot to understand it better! Best regards, Hiro On Thu, Feb 25, 2016 at 6:59 PM, Sean Owen wrote: > This isn't specific to Spark; it's from the original paper. > > alpha doesn't do a whole lot, and it is a global hyperpar

Number partitions after a join

2016-02-25 Thread Guillermo Ortiz
When you do a join in Spark, how many partitions result? Is it a default number if you don't specify the number of partitions?

Re: How could I do this algorithm in Spark?

2016-02-25 Thread Guillermo Ortiz
Oh, the letters were just an example, it could be: a , t b, o t, k k, c So.. a -> t -> k -> c and the result is: a,c; t,c; k,c and b,o. I don't know if you were thinking about sortBy because in the other example the letters were consecutive. 2016-02-25 9:42 GMT+01:00 Guillermo Ortiz : > I don't

which is a more appropriate form of ratings ?

2016-02-25 Thread Hiroyuki Yamada
Hello. I just started working on CF in MLlib. I am using trainImplicit because I only have implicit ratings like page views. I am wondering which is a more appropriate form of ratings. Let's assume that the view count is regarded as a rating and user 1 sees page 1 three times and page 2 twice and so

select * from mytable where column1 in (select max(column1) from mytable)

2016-02-25 Thread Ashok Kumar
Hi, What is the equivalent of this in Spark please select * from mytable where column1 in (select max(column1) from mytable) Thanks

Re: which is a more appropriate form of ratings ?

2016-02-25 Thread Sabarish Sasidharan
I believe the ALS algo expects the ratings to be aggregated (A). I don't see why you have to use decimals for rating. Regards Sab On Thu, Feb 25, 2016 at 4:50 PM, Hiroyuki Yamada wrote: > Hello. > > I just started working on CF in MLlib. > I am using trainImplicit because I only have implicit r

Re: How could I do this algorithm in Spark?

2016-02-25 Thread Robin East
The structures you are describing look like edges of a graph and you want to follow the graph to a terminal vertex and then propagate that value back up the path. On this assumption it would be simple to create the structures as graphs in GraphX and use Pregel for the algorithm implementation. -

Re: which is a more appropriate form of ratings ?

2016-02-25 Thread Nick Pentreath
Yes, ALS requires the aggregated version (A). You can use decimal or whole numbers for the rating, depending on your application, as for implicit data they are not "ratings" but rather "weights". A common approach is to apply different weightings to different user events (such as 1.0 for a page vi
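A minimal sketch of this aggregation with the MLlib RDD-based API, assuming the raw implicit events are an RDD of (user, product, weight) triples; the hyperparameter values passed to trainImplicit below are placeholders, not recommendations:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Assumed input: one element per implicit event, already weighted (e.g. 1.0 per page view).
    val events = sc.parallelize(Seq(
      (1, 1, 1.0), (1, 1, 1.0), (1, 1, 1.0),   // user 1 viewed product 1 three times
      (1, 2, 1.0), (1, 2, 1.0)                 // user 1 viewed product 2 twice
    ))

    // Aggregate to one Rating per (user, product) pair, as ALS expects (form A).
    val ratings = events
      .map { case (user, product, weight) => ((user, product), weight) }
      .reduceByKey(_ + _)
      .map { case ((user, product), weight) => Rating(user, product, weight) }

    // trainImplicit(ratings, rank, iterations, lambda, alpha)
    val model = ALS.trainImplicit(ratings, 10, 10, 0.01, 40.0)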

Running executors missing in sparkUI

2016-02-25 Thread Jan Štěrba
Hello, I have quite a weird behaviour that I can't quite wrap my head around. I am running Spark on a Hadoop YARN cluster. I have Spark configured in such a way that it utilizes all free vcores in the cluster (setting max vcores per executor and number of executors to use all vcores in cluster).

Re: which is a more appropriate form of ratings ?

2016-02-25 Thread Hiroyuki Yamada
Thanks very much, Nick and Sabarish. That helps me a lot. Regards, Hiro On Thu, Feb 25, 2016 at 8:52 PM, Nick Pentreath wrote: > Yes, ALS requires the aggregated version (A). You can use decimal or whole > numbers for the rating, depending on your application, as for implicit data > they are

Re: How could I do this algorithm in Spark?

2016-02-25 Thread Guillermo Ortiz
I'm taking a look at Pregel. It seems it's a good way to do it. The only negative thing I see is that it's not a really complex graph with a lot of edges between the vertices.. they are more like a lot of isolated small graphs 2016-02-25 12:32 GMT+01:00 Robin East : > The structures you are describin

Re: How could I do this algorithm in Spark?

2016-02-25 Thread Sabarish Sasidharan
Like Robin said, pls explore Pregel. You could do it without Pregel but it might be laborious. I have a simple outline below. You will need more iterations if the number of levels is higher. a-b b-c c-d b-e e-f f-c flatmaptopair a -> (a-b) b -> (a-b) b -> (b-c) c -> (b-c) c -> (c-d) d -> (c-d) b
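For reference, a non-Pregel sketch of the iterative idea, using the a/t/k/c example from earlier in the thread and assuming the chains are acyclic with a known upper bound on their length; this is an illustration, not the outline above:

    // Assumed input: edge pairs a->t, t->k, k->c, b->o.
    val edges = sc.parallelize(Seq("a" -> "t", "t" -> "k", "k" -> "c", "b" -> "o")).cache()

    val maxLevels = 10          // assumption: upper bound on chain length; deeper chains need more iterations
    var resolved = edges
    for (_ <- 1 to maxLevels) {
      resolved = resolved
        .map { case (src, dst) => (dst, src) }   // key by the current target
        .leftOuterJoin(edges)                    // does the target itself point somewhere?
        .map { case (dst, (src, next)) => (src, next.getOrElse(dst)) }
    }
    resolved.collect().foreach(println)          // (a,c), (t,c), (k,c), (b,o)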

Multiple user operations in spark.

2016-02-25 Thread Udbhav Agarwal
Hi, I am using GraphX. I am adding a batch of vertices to a graph with around 100,000 vertices and few edges. Adding around 400 vertices is taking 7 seconds with one machine of 8 cores and 8g RAM. My trouble is that when this process of addition is happening with the graph (named inputGraph) I am not a

Re: Number partitions after a join

2016-02-25 Thread Takeshi Yamamuro
Hi, The number depends on `spark.sql.shuffle.partitions`. See: http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options On Thu, Feb 25, 2016 at 7:42 PM, Guillermo Ortiz wrote: > When you do a join in Spark, how many partitions are as result? is it a > default n
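For example (a small sketch; the value 64 is arbitrary):

    // Read and change the number of partitions produced by DataFrame/SQL joins.
    sqlContext.getConf("spark.sql.shuffle.partitions")        // "200" by default
    sqlContext.setConf("spark.sql.shuffle.partitions", "64")  // subsequent joins shuffle into 64 partitions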

Re: Reindexing in graphx

2016-02-25 Thread Robin East
So first up, GraphX is not really designed for real-time graph mutation situations. That’s not to say it can’t be done, but you may be butting up against some of the design limitations in that area. As a first point of interrogation you should look at the WebUI to see what particular tasks/st

Re: Number partitions after a join

2016-02-25 Thread Guillermo Ortiz
thank you, I didn't see that option. 2016-02-25 14:51 GMT+01:00 Takeshi Yamamuro : > Hi, > > The number depends on `spark.sql.shuffle.partitions`. > See: > http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options > > On Thu, Feb 25, 2016 at 7:42 PM, Guillermo Ort

Implementing random walk in spark

2016-02-25 Thread naveenkumarmarri
Hi, I'm new to Spark. I'm trying to compute similarity between users/products. I have a huge table which I can't self-join with the cluster I have. I'm trying to implement the self join using a random walk methodology which will give approximate results. The table is a bipartite graph with

RE: Number partitions after a join

2016-02-25 Thread JOAQUIN GUANTER GONZALBEZ
Actually that only applies to Spark SQL. I believe that with plain RDDs, the resulting join will have as many partitions as the RDD with the most partitions. Cheers, Ximo From: Guillermo Ortiz [mailto:konstt2...@gmail.com] Sent: Thursday, 25 February 2016 15:19 To: Takeshi Yamamuro CC: use
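A quick sketch of checking this for plain RDDs (the partition counts here are arbitrary):

    val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b"), 4)
    val right = sc.parallelize(Seq(1 -> "x", 3 -> "y"), 8)
    // With no partitioner set on either side, the join picks the larger
    // partition count (8 here), unless spark.default.parallelism is configured.
    println(left.join(right).partitions.length)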

Re: Number partitions after a join

2016-02-25 Thread Guillermo Ortiz
Good to know, thanks everybody! 2016-02-25 15:29 GMT+01:00 JOAQUIN GUANTER GONZALBEZ < joaquin.guantergonzal...@telefonica.com>: > Actually that only applies to Spark SQL. I believe that in plain RDD, the > resulting join will have as many partitions as the RDD with the most > partition. > > > >

RE: Reindexing in graphx

2016-02-25 Thread Udbhav Agarwal
That’s a good thing you pointed out. Let me check that. Thanks. Another thing I was struggling with is that while this process of addition of vertices is happening with the graph (named inputGraph), I am not able to access it or perform a query over it. Currently when I am querying the graph during the

Re: How could I do this algorithm in Spark?

2016-02-25 Thread Guillermo Ortiz
Thank you! I'm trying to do it with Pregel; it's hard because I have never used GraphX or Pregel before. 2016-02-25 14:00 GMT+01:00 Sabarish Sasidharan : > Like Robin said, pls explore Pregel. You could do it without Pregel but it > might be laborious. I have a simple outline below. You

RE: How could I do this algorithm in Spark?

2016-02-25 Thread Darren Govoni
This might be hard to do. One generalization of this problem is https://en.m.wikipedia.org/wiki/Longest_path_problem Given a node (e.g. A), find the longest path. All interior relations are transitive and can be inferred. But finding a distributed Spark way of doing it in P time would be intere

Re: Running executors missing in sparkUI

2016-02-25 Thread Yin Yang
Which Spark / hadoop release are you running ? Thanks On Thu, Feb 25, 2016 at 4:28 AM, Jan Štěrba wrote: > Hello, > > I have quite a weird behaviour that I can't quite wrap my head around. > I am running Spark on a Hadoop YARN cluster. I have Spark configured > in such a way that it utilizes al

Re: Reindexing in graphx

2016-02-25 Thread Karl Higley
For real time graph mutations and queries, you might consider a graph database like Neo4j or TitanDB. Titan can be backed by HBase, which you're already using, so that's probably worth a look. On Thu, Feb 25, 2016, 9:55 AM Udbhav Agarwal wrote: > That’s a good thing you pointed out. Let me check

Spark SQL partitioned tables - check for partition

2016-02-25 Thread Deenar Toraskar
Hi How does one check for the presence of a partition in a Spark SQL partitioned table (saved using dataframe.write.partitionBy("partCol"), not Hive-compatible tables), other than physically checking the directory on HDFS or doing a count(*) with the partition cols in the where clause? Regards

Re: Spark SQL partitioned tables - check for partition

2016-02-25 Thread Kevin Mellott
Once you have loaded information into a DataFrame, you can use the mapPartitions or foreachPartition operations to both identify the partitions and operate against them. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame On Thu, Feb 25, 2016 at 9:24 AM, De

Re: How could I do this algorithm in Spark?

2016-02-25 Thread Guillermo Ortiz
I'm going to try to do it with Pregel.. if there are other ideas... great! What do you call P time? I think that it's O(number of vertices * N) 2016-02-25 16:17 GMT+01:00 Darren Govoni : > This might be hard to do. One generalization of this problem is > https://en.m.wikipedia.org/wiki/Longest_path

Re: Multiple user operations in spark.

2016-02-25 Thread Sabarish Sasidharan
I don't have a proper answer to this. But to circumvent it, if you have 2 independent Spark jobs, you could update one while the other is serving reads. But it's still not scalable for incessant updates. Regards Sab On 25-Feb-2016 7:19 pm, "Udbhav Agarwal" wrote: > Hi, > > I am using graphx. I am add

d.filter("id in max(id)")

2016-02-25 Thread Ashok Kumar
Hi, How can I make that work? val d = HiveContext.table("table") select * from table where ID = MAX(ID) from table Thanks

Spark Streaming - processing/transforming DStreams using a custom Receiver

2016-02-25 Thread Dominik Safaric
Recently, I've implemented the following Receiver and custom Spark Streaming InputDStream using Scala: /** * The GitHubUtils object declares an interface consisting of overloaded createStream * functions. The createStream function takes as arguments the ctx : StreamingContext * passed by the dr

Re: Spark Streaming - processing/transforming DStreams using a custom Receiver

2016-02-25 Thread Bryan Cutler
Using flatMap on a string will treat it as a sequence, which is why you are getting an RDD of Char. I think you just want to do a map instead, like this: val timestamps = stream.map(event => event.getCreatedAt.toString) On Feb 25, 2016 8:27 AM, "Dominik Safaric" wrote: > Recently, I've implemen
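The same gotcha can be reproduced on a plain RDD of strings (a minimal sketch):

    val rdd = sc.parallelize(Seq("2016-02-25", "2016-02-26"))
    rdd.flatMap(s => s).collect()   // Array[Char]: flatMap treats each String as a sequence of characters
    rdd.map(s => s).collect()       // Array[String]: map keeps one String per element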

Re: Running executors missing in sparkUI

2016-02-25 Thread Jan Štěrba
I am running spark1.3 on Cloudera hadoop 5.4 -- Jan Sterba https://twitter.com/honzasterba | http://flickr.com/honzasterba | http://500px.com/honzasterba On Thu, Feb 25, 2016 at 4:22 PM, Yin Yang wrote: > Which Spark / hadoop release are you running ? > > Thanks > > On Thu, Feb 25, 2016 at 4:28

Spark 1.6.0 running jobs in yarn shows negative no of tasks in executor

2016-02-25 Thread unk1102
Hi, I have a Spark job which I run on YARN and sometimes it behaves in a weird manner: it shows a negative number of tasks in a few executors and I keep on losing executors. I also see more executors than I requested. My job is highly tuned, not getting OOM or any problem. It is just that YARN behaves in a w

Re: Spark SQL partitioned tables - check for partition

2016-02-25 Thread Deenar Toraskar
Kevin, I meant the partitions on disk/HDFS, not the in-memory RDD/DataFrame partitions. If I am right, mapPartitions or foreachPartition would identify and operate on the in-memory partitions. Deenar On 25 February 2016 at 15:28, Kevin Mellott wrote: > Once you have loaded information into a Data

Re: Spark 1.6.0 running jobs in yarn shows negative no of tasks in executor

2016-02-25 Thread Yin Yang
Which release of hadoop are you using ? Can you share a bit about the logic of your job ? Pastebinning portion of relevant logs would give us more clue. Thanks On Thu, Feb 25, 2016 at 8:54 AM, unk1102 wrote: > Hi I have spark job which I run on yarn and sometimes it behaves in weird > manner

Re: Spark SQL partitioned tables - check for partition

2016-02-25 Thread Kevin Mellott
If you want to see which partitions exist on disk (without manually checking), you could write code against the Hadoop FileSystem library to check. Is that what you are asking? https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/fs/package-summary.html On Thu, Feb 25, 2016 at 10:54 AM, D
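A minimal sketch of that check, assuming the table was written with df.write.partitionBy("partCol") to a made-up path:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Partition directories are laid out as <table path>/partCol=<value>.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val partitionDir = new Path("/data/mytable/partCol=2016-02-25")   // path and value are placeholders
    val partitionExists = fs.exists(partitionDir)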

Re: Spark 1.6.0 running jobs in yarn shows negative no of tasks in executor

2016-02-25 Thread Umesh Kacha
Hi, I am using Hadoop 2.4.0. It is not frequent; it happens sometimes. I don't think my Spark logic has any problem; if the logic were wrong it would be failing every day. I mostly see YARN-killed executors, so I see executor lost in my driver logs. On Thu, Feb 25, 2016 at 10:30 PM, Yin Yang wrote

Access fields by name/index from Avro data read from Kafka through Spark Streaming

2016-02-25 Thread Mohammad Tariq
Hi group, I have just started working with confluent platform and spark streaming, and was wondering if it is possible to access individual fields from an Avro object read from a kafka topic through spark streaming. As per its default behaviour *KafkaUtils.createDirectStream[Object, Object, KafkaA

Bug in DiskBlockManager subDirs logic?

2016-02-25 Thread Zee Chen
Hi, I am debugging a situation where SortShuffleWriter sometimes fails to create a file, with the following stack trace: 16/02/23 11:48:46 ERROR Executor: Exception in task 13.0 in stage 47827.0 (TID 1367089) java.io.FileNotFoundException: /tmp/spark-9dd8dca9-6803-4c6c-bb6a-0e9c0111837c/executor-1

Please help: installation of spark 1.6.0 on ubuntu fails

2016-02-25 Thread Marco Mistroni
Hello all, could anyone help? I have tried to install Spark 1.6.0 on Ubuntu, but the installation failed. Here are my steps 1. download spark (successful) 31 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0.tgz 33 tar -zxf spark-1.6.0.tgz 2. cd spark-1.6.0 2.1 sbt assembly error] /h

Re: Please help: installation of spark 1.6.0 on ubuntu fails

2016-02-25 Thread Shixiong(Ryan) Zhu
Please use Java 7 instead. On Thu, Feb 25, 2016 at 1:54 PM, Marco Mistroni wrote: > Hello all > could anyone help? > i have tried to install spark 1.6.0 on ubuntu, but the installation failed > Here are my steps > > 1. download spark (successful) > > 31 wget http://d3kbcqa49mib13.cloudfront.ne

Spark SQL support for sub-queries

2016-02-25 Thread Mich Talebzadeh
Hi, I guess the following confirms that Spark does not support sub-queries: val d = HiveContext.table("test.dummy") d.registerTempTable("tmp") HiveContext.sql("select * from tmp where id IN (select max(id) from tmp)") It crashes. The SQL works OK in Hive itself on the underlying table!

Re: LDA topic Modeling spark + python

2016-02-25 Thread Bryan Cutler
I'm not exactly sure how you would like to setup your LDA model, but I noticed there was no Python example for LDA in Spark. I created this issue to add it https://issues.apache.org/jira/browse/SPARK-13500. Keep an eye on this if it could be of help. bryan On Wed, Feb 24, 2016 at 8:34 PM, Mishr

Re: Spark SQL support for sub-queries

2016-02-25 Thread Mohammad Tariq
AFAIK, this isn't supported yet. A ticket is in progress though. Tariq, Mohammad about.me/mti On Fri, Feb 26, 2016 at 4:16 AM, Mich Talebzadeh < mich.talebza...@cloudtechnologypartners.c

Re: DirectFileOutputCommiter

2016-02-25 Thread Teng Qiu
Interested in this topic as well: why is the DirectFileOutputCommitter not included? We added it in our fork, under core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala Moreover, this DirectFileOutputCommitter is not working for the insert operations in HiveContext, since the Com

Re: DirectFileOutputCommiter

2016-02-25 Thread Yin Yang
The header of DirectOutputCommitter.scala says Databricks. Did you get it from Databricks ? On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu wrote: > interesting in this topic as well, why the DirectFileOutputCommitter not > included? > > we added it in our fork, under > core/src/main/scala/org/apache

Re: DirectFileOutputCommiter

2016-02-25 Thread Teng Qiu
yes, should be this one https://gist.github.com/aarondav/c513916e72101bbe14ec then need to set it in spark-defaults.conf : https://github.com/zalando/spark/commit/3473f3f1ef27830813c1e0b3686e96a55f49269c#diff-f7a46208be9e80252614369be6617d65R13 Am Freitag, 26. Februar 2016 schrieb Yin Yang : > Th
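For reference, a sketch of setting it programmatically instead of in spark-defaults.conf, assuming the class from the gist is compiled under the package path mentioned above; the spark.hadoop.* prefix forwards the setting to the Hadoop job configuration:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.hadoop.mapred.output.committer.class",
           "org.apache.spark.mapred.DirectOutputCommitter")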

Re: Access fields by name/index from Avro data read from Kafka through Spark Streaming

2016-02-25 Thread Shixiong(Ryan) Zhu
You can use `DStream.map` to transform objects to anything you want. On Thu, Feb 25, 2016 at 11:06 AM, Mohammad Tariq wrote: > Hi group, > > I have just started working with confluent platform and spark streaming, > and was wondering if it is possible to access individual fields from an > Avro o

Re: Access fields by name/index from Avro data read from Kafka through Spark Streaming

2016-02-25 Thread Mohammad Tariq
I got it working by using jsonRDD. This is what I had to do in order to make it work : val messages = KafkaUtils.createDirectStream[Object, Object, KafkaAvroDecoder, KafkaAvroDecoder](ssc, kafkaParams, topicsSet) val lines = messages.map(_._2.toString) lines.foreachRDD(jsonRDD =>

How to overwrite data dynamically to specific partitions in Spark SQL

2016-02-25 Thread SRK
Hi, I need to overwrite data dynamically to specific partitions depending on filters. How can that be done in sqlContext? Thanks, Swetha

ALS trainImplicit performance

2016-02-25 Thread Roberto Pagliari
Does anyone know about the maximum number of ratings ALS was tested successfully? For example, is 1 billion ratings (nonzero entries) too much for it to work properly? Thank you,

Re: select * from mytable where column1 in (select max(column1) from mytable)

2016-02-25 Thread ayan guha
Why is this not working for you? Are you trying it on a DataFrame? What error are you getting? On Thu, Feb 25, 2016 at 10:23 PM, Ashok Kumar wrote: > Hi, > > What is the equivalent of this in Spark please > > select * from mytable where column1 in (select max(column1) from mytable) > > Thanks > --

Re: Access fields by name/index from Avro data read from Kafka through Spark Streaming

2016-02-25 Thread Harsh J
You should be able to cast the object type to the real underlying type (GenericRecord (if generic, which is so by default), or the actual type class (if specific)). The underlying implementation of KafkaAvroDecoder seems to use either one of those depending on a config switch: https://github.com/co
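A minimal sketch of that cast, assuming `messages` is the DStream returned by createDirectStream and that the decoder yields generic records; the field name is a placeholder:

    import org.apache.avro.generic.GenericRecord

    val fields = messages.map { case (_, value) =>
      val record = value.asInstanceOf[GenericRecord]   // generic decoding is assumed
      record.get("someField").toString                 // "someField" is a placeholder field name
    }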

Saving and Loading Dataframes

2016-02-25 Thread raj.kumar
Hi, I am using mllib. I use the ml vectorization tools to create the vectorized input dataframe for the ml/mllib machine-learning models with schema: root |-- label: double (nullable = true) |-- features: vector (nullable = true) To avoid repeated vectorization, I am trying to save and load t

[Help]: DataframeNAfunction fill method throwing exception

2016-02-25 Thread Divya Gehlot
Hi, I have a dataset which looks like below (name, age): alice 35, bob null, peter 24. I need to replace the null values of the columns with 0, so I referred to the Spark API DataFrameNaFunctions.scala I

Re: spark-xml data source (com.databricks.spark.xml) not working with spark 1.6

2016-02-25 Thread Hyukjin Kwon
Hi, it looks like you forgot to specify the "rowTag" option, which is "book" for the case of the sample data. Thanks 2016-01-29 8:16 GMT+09:00 Andrés Ivaldi : > Hi, could you get it to work? Tomorrow I'll be using the xml parser also, on > Windows 7. I'll let you know the results. > > Regards, > > > >
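A minimal sketch of that option, with an assumed input path:

    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")        // the element that marks one row in the sample data
      .load("books.xml")               // placeholder path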

Re: ALS trainImplicit performance

2016-02-25 Thread Sabarish Sasidharan
I have tested up to 3 billion. ALS scales; you just need to scale your cluster accordingly. More than building the model, it's getting the final recommendations that won't scale as nicely, especially when the number of products is huge. This is the case when you are generating recommendations in a batch

merge join already sorted data?

2016-02-25 Thread Ken Geis
I am loading data from two different databases and joining it in Spark. The data is indexed in the database, so it is efficient to retrieve the data ordered by a key. Can I tell Spark that my data is coming in ordered on that key so that when I join the data sets, they will be joined with little sh

Re: select * from mytable where column1 in (select max(column1) from mytable)

2016-02-25 Thread Mohammad Tariq
Spark doesn't support subqueries in the WHERE clause, IIRC. It supports subqueries only in the FROM clause as of now. See this ticket for more on this. Tariq, Mohammad about.me/mti On Fri, F
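Since FROM-clause subqueries do work, one possible workaround for the original query is a join against the aggregated maximum (a sketch, assuming the table is registered with the SQLContext/HiveContext):

    val result = sqlContext.sql(
      """select t.*
        |from mytable t
        |join (select max(column1) as max_col from mytable) m
        |  on t.column1 = m.max_col""".stripMargin)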

Re: Saving and Loading Dataframes

2016-02-25 Thread Yanbo Liang
Hi Raj, Could you share your code which can help others to diagnose this issue? Which version did you use? I can not reproduce this problem in my environment. Thanks Yanbo 2016-02-26 10:49 GMT+08:00 raj.kumar : > Hi, > > I am using mllib. I use the ml vectorization tools to create the vectorize

Re: Calculation of histogram bins and frequency in Apache spark 1.6

2016-02-25 Thread Yanbo Liang
Actually Spark SQL `groupBy` with `count` can get frequency in each bin. You can also try with DataFrameStatFunctions.freqItems() to get the frequent items for columns. Thanks Yanbo 2016-02-24 1:21 GMT+08:00 Burak Yavuz : > You could use the Bucketizer transformer in Spark ML. > > Best, > Burak
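A minimal sketch combining both suggestions, assuming `df` is the input DataFrame with a numeric column named "value"; the bin edges are arbitrary:

    import org.apache.spark.ml.feature.Bucketizer

    val splits = Array(Double.NegativeInfinity, 10.0, 20.0, 30.0, Double.PositiveInfinity)
    val bucketizer = new Bucketizer()
      .setInputCol("value")
      .setOutputCol("bin")
      .setSplits(splits)

    // Frequency per bin = the histogram counts.
    bucketizer.transform(df).groupBy("bin").count().show()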

Re: DirectFileOutputCommiter

2016-02-25 Thread Takeshi Yamamuro
Hi, Great work! What is the concrete performance gain of the committer on S3? I'd like to know. I think there is no direct committer for files because these kinds of committers have a risk of losing data (see SPARK-10063). Until this is resolved, ISTM files cannot support direct commits. thanks, On

Survival Curves using AFT implementation in Spark

2016-02-25 Thread Stuti Awasthi
Hi All, I wanted to apply Survival Analysis using the Spark AFT algorithm implementation. Currently I do the same in R using a coxph model and passing the model to the survfit() function to generate survival curves. Then I can visualize the survival curve on validation data to understand how good my model f

Re: Bug in DiskBlockManager subDirs logic?

2016-02-25 Thread Takeshi Yamamuro
Hi, Could you write simple code to reproduce the issue? I'm not exactly sure why shuffle data in the temp dir is wrongly deleted. thanks, On Fri, Feb 26, 2016 at 6:00 AM, Zee Chen wrote: > Hi, > > I am debugging a situation where SortShuffleWriter sometimes fails to > create a file, with the fo

Re: Use maxmind geoip lib to process ip on Spark/Spark Streaming

2016-02-25 Thread Zhun Shen
Hi, thanks for your advice. I tried your method; I use Gradle to manage my Scala code. 'com.snowplowanalytics:scala-maxmind-iplookups:0.2.0' was imported in Gradle. spark version: 1.6.0 scala: 2.10.4 scala-maxmind-iplookups: 0.2.0 I ran my test and got the error below: java.lang.NoClassDefFound

Re: merge join already sorted data?

2016-02-25 Thread Takeshi Yamamuro
Hi, Spark SQL internally can place ordering assumptions on columns (OrderedDistribution); however, the JDBC datasource does not support this, so Spark is not sure how columns loaded from databases are ordered. Also, there is no way to let Spark know about this order. thanks, On Fri, Feb 26, 2016 at 2:22 PM, Ken Geis

Re: [Help]: DataframeNAfunction fill method throwing exception

2016-02-25 Thread Jan Štěrba
just use coalesce function df.selectExpr("name", "coalesce(age, 0) as age") -- Jan Sterba https://twitter.com/honzasterba | http://flickr.com/honzasterba | http://500px.com/honzasterba On Fri, Feb 26, 2016 at 5:27 AM, Divya Gehlot wrote: > Hi, > I have dataset which looks like below > name age
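An alternative sketch using DataFrameNaFunctions directly, assuming `df` is the DataFrame from the question:

    // Replace nulls in the "age" column with 0; other columns are left untouched.
    val cleaned = df.na.fill(0.0, Seq("age"))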

When I merge some datas,can't go on...

2016-02-25 Thread Bonsen
I have a file, like 1.txt: 1 2 1 3 1 4 1 5 1 6 1 7 2 4 2 5 2 7 2 9 I want to merge them; the result should look like map(1->List(2,3,4,5,6,7), 2->List(4,5,7,9)). What should I do? val file1=sc.textFile("1.txt") val q1=file1.flatMap(_.split(' '))??? Maybe I should change RDD[Int] to RDD[(Int, Int)]?
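One possible sketch (a suggestion, not from the thread): parse each line into a pair, then group the values per key:

    val file1 = sc.textFile("1.txt")
    val merged = file1
      .map { line =>
        val Array(k, v) = line.split(' ')   // each line holds two integers separated by a space
        (k.toInt, v.toInt)
      }
      .groupByKey()
      .mapValues(_.toList)
      .collectAsMap()                       // Map(1 -> List(2,3,4,5,6,7), 2 -> List(4,5,7,9))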

RE: Spark Streaming - graceful shutdown when stream has no more data

2016-02-25 Thread Mao, Wei
I would argue against making it configurable unless there is a real production use case. If it's just for testing, there are a bunch of ways to achieve it. For example, you can globally mark when the test stream is finished, and stop the ssc on another thread when the status of that mark changes. Back to origin
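A rough sketch of that test-only approach, assuming `ssc` is the StreamingContext and that the test driver flips the flag once all test data has been emitted:

    @volatile var testInputFinished = false   // set to true by the test when the stream has no more data

    new Thread(new Runnable {
      override def run(): Unit = {
        while (!testInputFinished) Thread.sleep(1000)
        ssc.stop(stopSparkContext = true, stopGracefully = true)
      }
    }).start()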

Re: [Help]: DataframeNAfunction fill method throwing exception

2016-02-25 Thread Divya Gehlot
Hi Jan, Thanks for the help. Alas, your suggestion also didn't work. scala> import org.apache.spark.sql.types.{StructType, StructField, > StringType,IntegerType,LongType,DoubleType, FloatType}; > import org.apache.spark.sql.types.{StructType, StructField, StringType, > IntegerType, LongType, DoubleType

Re: java.io.IOException: java.lang.reflect.InvocationTargetException on new spark machines

2016-02-25 Thread Abhishek Anand
On changing the default compression codec from snappy to lzf, the errors are gone!! How can I fix this using snappy as the codec? Is there any downside to using lzf, as snappy is the default codec that ships with Spark? Thanks!!! Abhi On Mon, Feb 22, 2016 at 7:42 PM, Abhishek Anand wrote