Re: create hive context in spark application

2016-03-15 Thread Akhil Das
Did you try submitting your application with spark-submit? You can also try opening a spark-shell and see if it picks up your hive-site.xml. Thanks Best Regards On Tue, Mar 15, 2016 at 11:58 AM, antoniosi wrote: > Hi, > > I am tr
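A minimal sketch of creating the context directly (not Antonio's code; hive-site.xml must be on the driver classpath for the metastore settings to be picked up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // hive-site.xml on the classpath tells the HiveContext where the metastore lives
    val sc = new SparkContext(new SparkConf().setAppName("hive-demo"))
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SHOW TABLES").show()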

Re: Compare a column in two different tables/find the distance between column data

2016-03-15 Thread Akhil Das
You can achieve this with the normal RDD way. Have one extra stage in the pipeline where you will properly standardize all the values (like replacing doc with doctor) for all the columns before the join. Thanks Best Regards On Tue, Mar 15, 2016 at 9:16 AM, Suniti Singh wrote: > Hi All, > > I ha
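A sketch of that extra standardization stage (rdd1, rdd2 and the synonym map are assumptions, not Suniti's data):

    // hypothetical synonym table mapping variants to a canonical form
    val canonical = Map("doc" -> "doctor", "d.o.c" -> "doctor")

    def normalize(title: String): String = {
      val t = title.trim.toLowerCase
      canonical.getOrElse(t, t)
    }

    // normalize the join column in both pair RDDs before joining on it
    val left   = rdd1.map { case (title, rest) => (normalize(title), rest) }
    val right  = rdd2.map { case (title, rest) => (normalize(title), rest) }
    val joined = left.join(right)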

unsubscribe

2016-03-15 Thread satish chandra j
unsubscribe

RE: sparkR issues ?

2016-03-15 Thread Sun, Rui
It seems as.data.frame() defined in SparkR covers the versions in the R base package. We can try to see if we can change the implementation of as.data.frame() in SparkR to avoid such covering. From: Alex Kozlov [mailto:ale...@gmail.com] Sent: Tuesday, March 15, 2016 2:59 PM To: roni Cc: user@spark

Re: [MLlib - ALS] Merging two Models?

2016-03-15 Thread Nick Pentreath
By the way, I created a JIRA for supporting initial model for warm start ALS here: https://issues.apache.org/jira/browse/SPARK-13856 On Fri, 11 Mar 2016 at 09:14, Nick Pentreath wrote: > Sean's old Myrrix slides contain an overview of the fold-in math: > http://www.slideshare.net/srowen/big-prac

unsubscribe

2016-03-15 Thread Netwaver
unsubscribe

Re: unsubscribe

2016-03-15 Thread Akhil Das
Send an email to user-unsubscr...@spark.apache.org for unsubscribing. Read more over here http://spark.apache.org/community.html Thanks Best Regards On Tue, Mar 15, 2016 at 12:56 PM, satish chandra j wrote: > unsubscribe >

Re: unsubscribe

2016-03-15 Thread Akhil Das
Send an email to user-unsubscr...@spark.apache.org for unsubscribing. Read more over here http://spark.apache.org/community.html Thanks Best Regards On Tue, Mar 15, 2016 at 1:28 PM, Netwaver wrote: > unsubscribe > > > >

is there any way to make WEB UI auto-refresh?

2016-03-15 Thread charles li
Every time I can only get the latest info by refreshing the page, and that's a little boring. So is there any way to make the WEB UI auto-refresh? great thanks -- *--* a spark lover, a quant, a developer and a good man. http://github.com/litaotao

Re: is there any way to make WEB UI auto-refresh?

2016-03-15 Thread Nick Pentreath
You may want to check out https://github.com/hammerlab/spree On Tue, 15 Mar 2016 at 10:43 charles li wrote: > every time I can only get the latest info by refreshing the page, that's a > little boring. > > so is there any way to make the WEB UI auto-refreshing ? > > > great thanks > > > > -- > *

Re: Launch Spark shell using differnt python version

2016-03-15 Thread Prabhu Joseph
Hi Stuti, You can try local mode, but not spark master or yarn mode, if python-2.7 is not installed on all Spark Worker / NodeManager machines. To run in master mode: 1. Check whether the user is able to access python2.7. 2. Check if you have installed python-2.7 on all NodeManager machines

RE: Launch Spark shell using differnt python version

2016-03-15 Thread Stuti Awasthi
Thanks Prabhu, I tried starting in local mode but it still picks up Python 2.6 only. I have exported “DEFAULT_PYTHON” in my session variables and also included it in PATH. Export: export DEFAULT_PYTHON="/home/stuti/Python/bin/python2.7" export PATH="/home/stuti/Python/bin/python2.7:$PATH $ pyspark --mas

Re: How to distribute dependent files (.so , jar ) across spark worker nodes

2016-03-15 Thread David Gomez Saavedra
If you are using sbt, I personally use sbt-pack to pack all dependencies under a certain folder and then I set those jars in the spark config // just for demo I load this through config file overridden by environment variables val sparkJars = Seq ("/ROOT_OF_YOUR_PROJECT/target/pack/lib/YOUR_JAR_DE
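Presumably those jars are then handed to the SparkConf along these lines (a sketch; sparkJars is the sequence built above):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("myApp")
      .setJars(sparkJars) // each listed jar is shipped to the workers when the context starts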

Spark streaming - update configuration while retaining write ahead log data?

2016-03-15 Thread Ewan Leith
Has anyone seen a way of updating the Spark streaming job configuration while retaining the existing data in the write ahead log? e.g. if you've launched a job without enough executors and a backlog has built up in the WAL, can you increase the number of executors without losing the WAL data?

Re: Spark streaming - update configuration while retaining write ahead log data?

2016-03-15 Thread Saisai Shao
Currently configuration is a part of checkpoint data, and when recovering from failure, Spark Streaming will fetch the configuration from checkpoint data, so even if you change the configuration file, recovered Spark Streaming application will not use it. So from my understanding currently there's
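A sketch of why that is: on restart, StreamingContext.getOrCreate rebuilds the context from the checkpoint and never invokes the creating function, so any changed settings inside it are ignored:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = StreamingContext.getOrCreate(checkpointDir, () => {
      // runs only when no checkpoint exists; a recovering application
      // takes its configuration from the checkpoint, not from here
      val conf = new SparkConf().set("spark.executor.instances", "10")
      new StreamingContext(conf, Seconds(10))
    })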

RE: Spark streaming - update configuration while retaining write ahead log data?

2016-03-15 Thread Ewan Leith
That’s what I thought, it’s a shame! Thanks Saisai, Ewan From: Saisai Shao [mailto:sai.sai.s...@gmail.com] Sent: 15 March 2016 09:22 To: Ewan Leith Cc: user Subject: Re: Spark streaming - update configuration while retaining write ahead log data? Currently configuration is a part of checkpoi

Re: Failing MiMa tests

2016-03-15 Thread Sean Owen
That PR has passed tests including MiMa checks, but it looks like there are still review comments to address On Tue, Mar 15, 2016 at 2:32 AM, Gayathri Murali wrote: > Here is the PR : https://github.com/apache/spark/pull/11544 > > > > On Mon, Mar 14, 2016 at 7:26 PM, Ted Yu wrote: >> >> Please r

the "DAG Visualiztion" in 1.6 not works fine here

2016-03-15 Thread charles li
Sometimes it just shows several *black dots*, and sometimes it cannot show the entire graph. Did anyone meet this before, and how did you fix it? -- *--* a spark lover, a quant, a developer and a good man. http://github.com/litaotao

reading file from S3

2016-03-15 Thread Yasemin Kaya
Hi, I am using Spark 1.6.0 standalone and I want to read a txt file from S3 bucket named yasemindeneme and my file name is deneme.txt. But I am getting this error. Here is the simple code Exception in thread "main" java.lang.IllegalArgumentE

Building Spark packages with SBTor Maven

2016-03-15 Thread Mich Talebzadeh
Hi, I build my Spark/Scala packages using SBT, which works fine. I have created generic shell scripts to build and submit them. Yesterday I noticed that some use Maven and a POM for this purpose. Which approach is recommended? Thanks, Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profi

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Chandeep Singh
For Scala, SBT is recommended. > On Mar 15, 2016, at 10:42 AM, Mich Talebzadeh > wrote: > > Hi, > > I build my Spark/Scala packages using SBT that works fine. I have created > generic shell scripts to build and submit it. > > Yesterday I noticed that some use Maven and Pom for this purpose.

Re: reading file from S3

2016-03-15 Thread Şafak Serdar Kapçı
Hello Yasemin, Maybe your key id or access key has special chars like backslash or something. You need to change it. Best Regards, Safak. 2016-03-15 12:33 GMT+02:00 Yasemin Kaya : > Hi, > > I am using Spark 1.6.0 standalone and I want to read a txt file from S3 > bucket named yasemindeneme and my

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Ted Yu
There're build jobs for both on Jenkins: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/ https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/ You can choose either one. I use mvn. On Tue, Mar 15, 2016 at 3:42 AM, Mich Talebzadeh wrote: >

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Chandeep Singh
Pulled this from Stack Overflow: We're using Maven to build Scala projects at work because it integrates well with our CI server. We could just run a shell script to kick off a build, of course, but we've got a bunch of other information coming out of Maven that we want to go into CI. That's abo

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Sean Owen
FWIW, I strongly prefer Maven over SBT even for Scala projects. The Spark build of reference is Maven. On Tue, Mar 15, 2016 at 10:45 AM, Chandeep Singh wrote: > For Scala, SBT is recommended. > > On Mar 15, 2016, at 10:42 AM, Mich Talebzadeh > wrote: > > Hi, > > I build my Spark/Scala packages u

Compress individual RDD

2016-03-15 Thread Nirav Patel
Hi, I see that there's the following Spark config to compress an RDD. My guess is it will compress all RDDs of a given SparkContext, right? If so, is there a way to instruct the Spark context to only compress some RDDs and leave others uncompressed? Thanks spark.rdd.compress false Whether to compress

Re: Spark streaming - update configuration while retaining write ahead log data?

2016-03-15 Thread Ted Yu
I did a quick search but haven't found JIRA in this regard. If configuration is separate from checkpoint data, more use cases can be accommodated. > On Mar 15, 2016, at 2:21 AM, Saisai Shao wrote: > > Currently configuration is a part of checkpoint data, and when recovering > from failure,

Re: Compress individual RDD

2016-03-15 Thread Ted Yu
Looks like there is no such capability yet. How would you specify which RDDs to compress? Thanks > On Mar 15, 2016, at 4:03 AM, Nirav Patel wrote: > > Hi, > > I see that there's the following Spark config to compress an RDD. My guess is it > will compress all RDDs of a given SparkContext, ri

Re: reading file from S3

2016-03-15 Thread Yasemin Kaya
Hi Safak, I changed the Keys but there is no change. Best, yasemin 2016-03-15 12:46 GMT+02:00 Şafak Serdar Kapçı : > Hello Yasemin, > Maybe your key id or access key has special chars like backslash or > something. You need to change it. > Best Regards, > Safak. > > 2016-03-15 12:33 GMT+02:00

Re: Hive Query on Spark fails with OOM

2016-03-15 Thread Sabarish Sasidharan
Yes, I suggested increasing shuffle partitions to address this problem. The other suggestion to increase shuffle fraction was not for this but makes sense given that you are reserving all that memory and doing nothing with it. By diverting more of it for shuffles you can help improve your shuffle p
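For reference, the two knobs under discussion would be set roughly like this (the values are illustrative only):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.sql.shuffle.partitions", "400")  // more, smaller shuffle partitions
      .set("spark.shuffle.memoryFraction", "0.4")  // divert more heap to shuffles (pre-1.6 memory model)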

Re: Compress individual RDD

2016-03-15 Thread Sabarish Sasidharan
It will compress only RDDs with serialization enabled in the persistence mode. So you could skip _SER modes for your other RDDs. Not perfect, but something. On 15-Mar-2016 4:33 pm, "Nirav Patel" wrote: > Hi, > > I see that there's the following Spark config to compress an RDD. My guess is > it will c
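Under that reading, per-RDD control falls out of the persistence level rather than a dedicated flag — a sketch with hypothetical RDD names:

    import org.apache.spark.storage.StorageLevel

    // with spark.rdd.compress=true, only serialized storage is compressed
    rddToCompress.persist(StorageLevel.MEMORY_ONLY_SER) // compressed
    rddToLeaveAlone.persist(StorageLevel.MEMORY_ONLY)   // plain objects, not compressed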

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Mich Talebzadeh
Ok. Sounds like opinion is divided :) I will try to build a Scala app with Maven. When I build with SBT I follow this directory structure: the high-level directory is the package name, like ImportCSV; under ImportCSV I have a directory src and the sbt file ImportCSV.sbt; in directory src I have main an

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Chandeep Singh
Btw, just to add to the confusion ;) I use Maven as well, since I moved from Java to Scala, but everyone I talk to has been recommending SBT for Scala. I use the Eclipse Scala IDE to build. http://scala-ide.org/ Here is my sample PoM. You can add dependencies based on you

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Mich Talebzadeh
Great Chandeep. I also have the Eclipse Scala IDE below: Scala IDE build of Eclipse SDK, Build id: 4.3.0-vfinal-2015-12-01T15:55:22Z-Typesafe. I am no expert on Eclipse, so if I create a project called ImportCSV, where do I need to put the pom file, or how do I reference it please? My Eclipse runs on a Linux

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Chandeep Singh
Do you have the Eclipse Maven plugin set up? http://www.eclipse.org/m2e/ Once you have it set up: File -> New -> Other -> Maven Project -> Next / Finish. You’ll see a default POM.xml which you can modify / replace. Here is some documentation that should help: http

Spark work distribution among execs

2016-03-15 Thread Borislav Kapukaranov
Hi, I'm running a Spark 1.6.0 on YARN on a Hadoop 2.6.0 cluster. I observe a very strange issue. I run a simple job that reads about 1TB of json logs from a remote HDFS cluster and converts them to parquet, then saves them to the local HDFS of the Hadoop cluster. I run it with 25 executors with s

Re: Compress individual RDD

2016-03-15 Thread Nirav Patel
Thanks Sabarish, I thought of the same; will try that. Hi Ted, good question. I guess one way is to have an API like `rdd.persist(storageLevel, compress)` where 'compress' can be true or false. On Tue, Mar 15, 2016 at 5:18 PM, Sabarish Sasidharan wrote: > It will compress only rdds with serializati

Improving the cube with Fast Cubing in Apache Kylin

2016-03-15 Thread 李承霖
HI, I tried to build a cube on a 100 million record data set. When I set 9 fields to build the cube with 10 cores, it nearly cost me a whole day to finish the job. At the same time, it generated almost 1 TB of data in the "/tmp" folder. Could we refer to the "fast cube" algorithm in Apache Kylin to make

mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread manasdebashiskar
Hi, I have a streaming application that takes data from a kafka topic and uses mapwithstate. After couple of hours of smooth running of the application I see a problem that seems to have stalled my application. The batch seems to have been stuck after the following error popped up. Has anyone se

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Mich Talebzadeh
Thanks again. Is there any way one can set this up without Eclipse, much like what I did with sbt? I need to know the directory structure for an MVN project. Cheers Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Installing Spark on Mac

2016-03-15 Thread Aida Tefera
Hi Jakob, sorry for my late reply. I tried to run the below; it came back with "netstat: lunt: unknown or uninstrumented protocol". I also tried uninstalling version 1.6.0 and installing version 1.5.2 with Java 7 and Scala version 2.10.6; got the same error messages. Do you think it would be worth me

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Mich Talebzadeh
Sounds like the layout is basically the same as the sbt layout; the sbt file is replaced by pom.xml? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Chandeep Singh
You can build using maven from the command line as well. This layout should give you an idea and here are some resources - http://www.scala-lang.org/old/node/345

    project/
      pom.xml   - Defines the project
      src/
        main/
          java/ - Contains a

Re: Can we use spark inside a web service?

2016-03-15 Thread Andrés Ivaldi
Thanks Evan for the points. I had supposed what you said, but as I don't have enough experience maybe I was missing something, thanks for the answer!! On Mon, Mar 14, 2016 at 7:22 PM, Evan Chan wrote: > Andres, > > A couple points: > > 1) If you look at my post, you can see that you could use Sp

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Chandeep Singh
Yes, sbt uses the same structure as maven for source files. > On Mar 15, 2016, at 1:53 PM, Mich Talebzadeh > wrote: > > Thanks the maven structure is identical to sbt. just sbt file I will have to > replace with pom.xml > > I will use your pom.xml to start with it. > > Cheers > > Dr Mich Ta

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Mich Talebzadeh
Thanks, the Maven structure is identical to sbt's; just the sbt file I will have to replace with pom.xml. I will use your pom.xml to start with. Cheers Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: reading file from S3

2016-03-15 Thread Gourav Sengupta
Hi, Try starting your clusters with roles, and you will not have to configure, hard code anything at all. Let me know in case you need any help with this. Regards, Gourav Sengupta On Tue, Mar 15, 2016 at 11:32 AM, Yasemin Kaya wrote: > Hi Safak, > > I changed the Keys but there is no change.

Re: Spark work distribution among execs

2016-03-15 Thread Chitturi Padma
By default spark uses 2 executors with one core each; have you allocated more executors using the command line args, e.g. --num-executors 25 --executor-cores x? What do you mean by "the difference between the nodes is huge"? Regards, Padma Ch On Tue, Mar 15, 2016 at 6:57 PM, bkapukaranov [via

Re: reading file from S3

2016-03-15 Thread Sabarish Sasidharan
You have a slash before the bucket name. It should be @. Regards Sab On 15-Mar-2016 4:03 pm, "Yasemin Kaya" wrote: > Hi, > > I am using Spark 1.6.0 standalone and I want to read a txt file from S3 > bucket named yasemindeneme and my file name is deneme.txt. But I am getting > this error. Here is
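One way to sidestep the URI-escaping problem entirely is to pass the credentials through the Hadoop configuration rather than the URI — a sketch with placeholder keys:

    // placeholder credentials — never hard-code real keys
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")

    val lines = sc.textFile("s3n://yasemindeneme/deneme.txt")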

Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread Ted Yu
Which version of Spark are you using ? Can you show the code snippet w.r.t. broadcast variable ? Thanks On Tue, Mar 15, 2016 at 6:04 AM, manasdebashiskar wrote: > Hi, > I have a streaming application that takes data from a kafka topic and uses > mapwithstate. > After couple of hours of smoot

Re: Spark work distribution among execs

2016-03-15 Thread bkapukaranov
Hi, Yes, I'm running the executors with 8 cores each. I have also properly configured executor memory, driver memory, num execs and so on in the submit cmd. I'm a long-time Spark user; please let's skip the basic cmd configuration stuff and dive into the interesting stuff :) Another strange thing I've n

Re: Streaming app consume multiple kafka topics

2016-03-15 Thread Cody Koeninger
The direct stream gives you access to the topic. The offset range for each partition contains the topic. That way you can create a single stream, and the first thing you do with it is mapPartitions with a switch on topic. Of course, it may make more sense to separate topics into different jobs,
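A sketch of that pattern against the 1.x direct stream API (ssc and kafkaParams are assumed to exist; parseA/parseB are hypothetical per-topic handlers):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    // hypothetical per-topic parsers over (key, value) records
    def parseA(rec: (String, String)): String = rec._2
    def parseB(rec: (String, String)): String = rec._2.toUpperCase

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("topicA", "topicB"))

    stream.transform { rdd =>
      // for the direct stream, partition i of the batch RDD maps to offsetRanges(i)
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.mapPartitionsWithIndex { (i, iter) =>
        ranges(i).topic match {
          case "topicA" => iter.map(parseA)
          case "topicB" => iter.map(parseB)
          case other    => throw new IllegalStateException(s"unexpected topic $other")
        }
      }
    }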

Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread manas kar
I am using spark 1.6. I am not using any broadcast variable. This broadcast variable is probably used by the state management of mapwithState ...Manas On Tue, Mar 15, 2016 at 10:40 AM, Ted Yu wrote: > Which version of Spark are you using ? > > Can you show the code snippet w.r.t. broadcast vari

Re: reading file from S3

2016-03-15 Thread Gourav Sengupta
Oh!!! What the hell. Please never use the URI s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY. That is a major cause of pain, security issues, code maintenance issues and of course something that Amazon strongly suggests that we do not use. Please use roles and you will not have to worry about s

Re: how to set log level of spark executor on YARN(using yarn-cluster mode)

2016-03-15 Thread jkukul
Hi Eric (or rather: anyone who's experiencing a similar situation), I think your problem was that the --files parameter was provided after the application jar. Your command should have looked like this instead: ./bin/spark-submit --class edu.bjut.spark.SparkPageRank --master yarn-cluster

Re: reading file from S3

2016-03-15 Thread Sabarish Sasidharan
There are many solutions to a problem. Also understand that sometimes your situation might be such. For example, what if you are accessing S3 from your Spark job running in your continuous integration server sitting in your data center, or maybe a box under your desk. And sometimes you are just trying s

Re: Spark work distribution among execs

2016-03-15 Thread manasdebashiskar
Your input is skewed in terms of the default hash partitioner that is used. Your options are to use a custom partitioner that can re-distribute the data evenly among your executors. I think you will see the same behaviour when you use more executors. It is just that the data skew appears to be les
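A skeleton of such a partitioner (the hot-key handling is a placeholder for logic specific to your data; pairRdd is assumed):

    import org.apache.spark.Partitioner

    class BalancedPartitioner(n: Int) extends Partitioner {
      override def numPartitions: Int = n
      override def getPartition(key: Any): Int =
        // replace with logic that spreads known hot keys over several
        // partitions, e.g. by hashing (key, salt) for the dominant keys
        ((key.hashCode % n) + n) % n
    }

    val repartitioned = pairRdd.partitionBy(new BalancedPartitioner(200))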

Re: reading file from S3

2016-03-15 Thread Gourav Sengupta
Once again, please use roles, there is no way that you have to specify the access keys in the URI under any situation. Please read Amazon documentation and they will say the same. The only situation when you use the access keys in URI is when you have not read the Amazon documentation :) Regards,

Re: sparkR issues ?

2016-03-15 Thread roni
Alex, No, I have not defined the "dataframe"; it's the Spark default DataFrame. That line is just casting the factor as a dataframe to send to the function. Thanks -R On Mon, Mar 14, 2016 at 11:58 PM, Alex Kozlov wrote: > This seems to be a very unfortunate name collision. SparkR defines its > own DataF

Re: sparkR issues ?

2016-03-15 Thread roni
Hi , Is there a work around for this? Do i need to file a bug for this? Thanks -R On Tue, Mar 15, 2016 at 12:28 AM, Sun, Rui wrote: > It seems as.data.frame() defined in SparkR convers the versions in R base > package. > > We can try to see if we can change the implementation of as.data.frame(

Release Announcement: XGBoost4J - Portable Distributed XGBoost in Spark, Flink and Dataflow

2016-03-15 Thread Nan Zhu
Dear Spark Users and Developers, We (Distributed (Deep) Machine Learning Community (http://dmlc.ml/)) are happy to announce the release of XGBoost4J (http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html), a Portable Distributed XGBoost in Spark, Fli

Re: Spark work distribution among execs

2016-03-15 Thread bkapukaranov
Hi, This is an interesting point of view. I thought the HashPartitioner works completely differently. Here's my understanding - the HashPartitioner defines how keys are distributed within a dataset between the different partitions, but plays no role in assigning each partition for processing by exe

Re: sparkR issues ?

2016-03-15 Thread Alex Kozlov
Hi Roni, you can probably rename the as.data.frame in $SPARK_HOME/R/pkg/R/DataFrame.R and re-install SparkR by running install-dev.sh On Tue, Mar 15, 2016 at 8:46 AM, roni wrote: > Hi , > Is there a work around for this? > Do i need to file a bug for this? > Thanks > -R > > On Tue, Mar 15, 201

Re: create hive context in spark application

2016-03-15 Thread Antonio Si
Thanks Akhil. Yes, spark-shell works fine. In my app, I have a Restful service and from the Restful service, I am calling the spark-api to do some hiveql. That's why I am not using spark-submit. Thanks. Antonio. On Tue, Mar 15, 2016 at 12:02 AM, Akhil Das wrote: > Did you try submitting your

Questions about Spark On Mesos

2016-03-15 Thread Shuai Lin
Hi list, We (scrapinghub) are planning to deploy spark in a 10+ node cluster, mainly for processing data in HDFS and kafka streaming. We are thinking of using mesos instead of yarn as the cluster resource manager so we can use docker container as the executor and makes deployment easier. But there

Parition RDD by key to create DataFrames

2016-03-15 Thread Mohamed Nadjib MAMI
Hi, I have a pair RDD of the form: (mykey, (value1, value2)) How can I create a DataFrame having the schema [V1 String, V2 String] to store [value1, value2] and save it into a Parquet table named "mykey"? The createDataFrame() method takes an RDD and a schema (StructType) as parameters. The sc

Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread Sea
Hi manas, maybe you can look at this bug: https://issues.apache.org/jira/browse/SPARK-13566

Re: Compare a column in two different tables/find the distance between column data

2016-03-15 Thread Suniti Singh
Is it always the case that one title is a substring of another ? -- Not always. One title can have values like D.O.C, doctor_{areacode}, doc_{dep,areacode} On Mon, Mar 14, 2016 at 10:39 PM, Wail Alkowaileet wrote: > I think you need some sort of fuzzy join ? > Is it always the case that one titl

bug? using withColumn with colName with dot can't replace column

2016-03-15 Thread Emmanuel
In Spark 1.6 if I do (the column name has a dot in it, but is not a nested column): df = df.withColumn("raw.hourOfDay", df.col("`raw.hourOfDay`")) scala> df = df.withColumn("raw.hourOfDay", df.col("`raw.hourOfDay`")) org.apache.spark.sql.AnalysisException: cannot resolve 'raw.minOfDay' given input colu

Re: Parition RDD by key to create DataFrames

2016-03-15 Thread Davies Liu
I think you could create a DataFrame with schema (mykey, value1, value2), then partition it by mykey when saving as parquet. r2 = rdd.map((k, v) => Row(k, v._1, v._2)) df = sqlContext.createDataFrame(r2, schema) df.write.partitionBy("myKey").parquet(path) On Tue, Mar 15, 2016 at 10:33 AM, Moham
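Spelled out as compilable Scala (the thread's snippet mixes Scala and Python spellings; the column names and input pair RDD are assumptions):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val schema = StructType(Seq(
      StructField("mykey", StringType),
      StructField("v1", StringType),
      StructField("v2", StringType)))

    // rdd: RDD[(String, (String, String))]
    val rows = rdd.map { case (k, (v1, v2)) => Row(k, v1, v2) }
    val df = sqlContext.createDataFrame(rows, schema)
    df.write.partitionBy("mykey").parquet(path)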

Best way to process values for key in sorted order

2016-03-15 Thread James Hammerton
Hi, I need to process some events in a specific order based on a timestamp, for each user in my data. I had implemented this by using the dataframe sort method to sort by user id and then sort by the timestamp secondarily, then do a groupBy().mapValues() to process the events for each user. Howe
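One common alternative to the sort-then-groupBy approach is repartitionAndSortWithinPartitions with a partitioner on the user id alone — a sketch, with the Event shape below as an assumption:

    import org.apache.spark.Partitioner

    case class Event(userId: String, timestamp: Long, payload: String) // assumed shape

    // partition on userId only, so all of one user's events land in one partition
    class UserPartitioner(n: Int) extends Partitioner {
      override def numPartitions: Int = n
      override def getPartition(key: Any): Int = key match {
        case (userId: String, _) => ((userId.hashCode % n) + n) % n
      }
    }

    // composite key (userId, timestamp): the implicit tuple ordering then
    // sorts each partition by user and, within a user, by time
    val sorted = events
      .map(e => ((e.userId, e.timestamp), e))
      .repartitionAndSortWithinPartitions(new UserPartitioner(200))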

RE: [MARKETING] Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread Iain Cundy
Hi Manas I saw a very similar problem while using mapWithState. Timeout on BlockManager remove leading to a stall. In my case it only occurred when there was a big backlog of micro-batches, combined with a shortage of memory. The adding and removing of blocks between new and old tasks was inte

Re: Compare a column in two different tables/find the distance between column data

2016-03-15 Thread Suniti Singh
The data in the title is different, so correcting the data in the column requires finding out what the correct data is and then replacing it. Finding the correct data could be tedious, but if some mechanism is in place which can help group the partially matched data, then it might help to do the furt

newbie HDFS S3 best practices

2016-03-15 Thread Andy Davidson
We use the spark-ec2 script to create AWS clusters as needed (we do not use AWS EMR) 1. will we get better performance if we copy data to HDFS before we run instead of reading directly from S3? 2. What is a good way to move results from HDFS to S3? It seems like there are many ways to bulk copy

Re: bug? using withColumn with colName with dot can't replace column

2016-03-15 Thread Jan Štěrba
First off, I would advise against having dots in column names, that's just playing with fire. Second, the exception is really strange since spark is complaining about a completely unrelated column. I would like to see the df schema before the exception was thrown. -- Jan Sterba https://twitter.com/h

Re: newbie HDFS S3 best practices

2016-03-15 Thread Frank Austin Nothaft
Hard to say with #1 without knowing your application’s characteristics; for #2, we use conductor with IAM roles, .boto/.aws/credentials files. Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 > On Mar 15, 2016, at 11:

How to select from table name using IF(condition, tableA, tableB)?

2016-03-15 Thread Rex X
I want to do a query based on a logic condition to choose between two tables. select * from if(A>B, tableA, tableB) But the "if" function in Hive cannot work within FROM as above. Any idea how?
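One workaround is to evaluate the condition on the driver and splice the chosen table name into the query text — a sketch, with a and b standing for whatever the condition compares:

    // hypothetical: a and b computed beforehand (e.g. by two small queries)
    val table = if (a > b) "tableA" else "tableB"
    val df = sqlContext.sql(s"SELECT * FROM $table")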

Microsoft SQL dialect issues

2016-03-15 Thread Andrés Ivaldi
Hello, I'm trying to use MSSQL, storing data on MSSQL, but I'm having dialect problems. I found this https://mail-archives.apache.org/mod_mbox/spark-issues/201510.mbox/%3cjira.12901078.1443461051000.34556.1444123886...@atlassian.jira%3E That is what is happening to me. Is it possible to define the di

Re: Microsoft SQL dialect issues

2016-03-15 Thread Mich Talebzadeh
Hi, Can you please clarify what you are trying to achieve and I guess you mean Transact_SQL for MSSQL? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread manas kar
You are quite right. I am getting this error while profiling my module to see what are the minimum resources I can use to achieve my SLA. My point is that if a resource constraint creates this problem, then this issue is just waiting to happen in a larger scenario (though the probability of happening w

Re: Docker configuration for akka spark streaming

2016-03-15 Thread David Gomez Saavedra
The issue is related to this https://issues.apache.org/jira/browse/SPARK-13906 .set("spark.rpc.netty.dispatcher.numThreads","2") seem to fix the problem On Tue, Mar 15, 2016 at 6:45 AM, David Gomez Saavedra wrote: > I have updated the config since I realized the actor system was listening > on

How to convert Parquet file to a text file.

2016-03-15 Thread Shishir Anshuman
I need to convert the parquet file generated by the spark to a text (csv preferably) file. I want to use the data model outside spark. Any suggestion on how to proceed?

Re: Spark and KafkaUtils

2016-03-15 Thread Vinti Maheshwari
Hi Cody, I wanted to share my updated build.sbt, which works with kafka without giving any error; it may help other users if they face a similar issue. name := "NetworkStreaming" version := "1.0" scalaVersion:= "2.10.5" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-streaming-kafka" %

Re: newbie HDFS S3 best practices

2016-03-15 Thread Andy Davidson
Hi Frank, We have thousands of small files. Each file is between 6K and maybe 100K. Conductor looks interesting. Andy From: Frank Austin Nothaft Date: Tuesday, March 15, 2016 at 11:59 AM To: Andrew Davidson Cc: "user @spark" Subject: Re: newbie HDFS S3 best practices > Hard to say with #

Re: How to convert Parquet file to a text file.

2016-03-15 Thread Kevin Mellott
I'd recommend reading the parquet file into a DataFrame object, and then using spark-csv to write to a CSV file. On Tue, Mar 15, 2016 at 3:34 PM, Shishir Anshuman wrote: > I need to convert the parquet file generated by the spark to a text (csv > prefera
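A minimal sketch of that suggestion against the 1.x spark-csv package (paths are placeholders):

    // read the parquet data back as a DataFrame, then write it out as CSV
    val df = sqlContext.read.parquet("/path/to/parquet")
    df.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("/path/to/csv")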

Spark streaming with akka association with remote system failure

2016-03-15 Thread David Gomez Saavedra
Hi there, I'm trying to set up a simple spark streaming app using akka actors as receivers. I followed the example provided and created two apps, one creating an actor system and another one subscribing to it. I can see the subscription message, but a few seconds later I get an error [info] 20:37:40

Re: Installing Spark on Mac

2016-03-15 Thread Jakob Odersky
Hi, what do you get running just 'sudo netstat'? Also, what's the output of 'jps -mlv' when running your spark application? Can you post the contents of the files in $SPARK_HOME/conf ? Are there any special firewall rules in place, forbidding connections on localhost? Regarding the IP address chan

Re: Microsoft SQL dialect issues

2016-03-15 Thread Suresh Thalamati
You should be able to register your own dialect if the default mappings are not working for your scenario. import org.apache.spark.sql.jdbc JdbcDialects.registerDialect(MyDialect) Please refer to the JdbcDialects to find an example of an existing default dialect for your database or another databa
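A sketch of such a registration (the single type mapping shown is illustrative, not a complete MSSQL dialect):

    import java.sql.Types
    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
    import org.apache.spark.sql.types.{BooleanType, DataType}

    case object MsSqlDialect extends JdbcDialect {
      override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")
      override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
        case BooleanType => Some(JdbcType("BIT", Types.BIT)) // illustrative mapping
        case _           => None                             // fall back to the defaults
      }
    }

    JdbcDialects.registerDialect(MsSqlDialect)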

Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
Hi, I normally use sbt and using this sbt file works fine for me cat ImportCSV.sbt name := "ImportCSV" version := "1.0" scalaVersion := "2.10.4" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1" libraryDependenc

Re: Error building spark app with Maven

2016-03-15 Thread Ted Yu
Please suffix _2.10 to the artifact name. See: http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10 On Tue, Mar 15, 2016 at 3:08 PM, Mich Talebzadeh wrote: > Hi, > > I normally use sbt and using this sbt file works fine for me > > cat ImportCSV.sbt > name := "ImportCSV" > version := "1

Re: Error building spark app with Maven

2016-03-15 Thread Jakob Odersky
Hi Mich, probably unrelated to the current error you're seeing, however the following dependencies will bite you later: spark-hive_2.10 spark-csv_2.11 the problem here is that you're using libraries built for different Scala binary versions (the numbers after the underscore). The simple fix here is
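In sbt the %% operator avoids this class of mistake: it appends the project's Scala binary version to each artifact name, so all dependencies stay on one binary version — e.g.:

    // build.sbt — %% turns "spark-core" into "spark-core_2.10" under a 2.10.x scalaVersion
    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.5.1",
      "org.apache.spark" %% "spark-sql"  % "1.5.1",
      "org.apache.spark" %% "spark-hive" % "1.5.1"
    )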

Re: Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
Many thanks Ted, and thanks for the heads-up Jakob. Just these two changes to dependencies: org.apache.spark spark-core_2.10 1.5.1 org.apache.spark spark-sql_2.10 1.5.1 [DEBUG] endProcessChildren: artifact=spark:scala:jar:1.0 [INFO] -

spark.ml : eval model outside sparkContext

2016-03-15 Thread Emmanuel
Hello, In MLLib with Spark 1.4, I was able to eval a model by loading it and using `predict` on a vector of features. I would train on Spark but use my model in my workflow. In `spark.ml` it seems like the only way to eval is to use `transform`, which only takes a DataFrame. To build a DataFrame
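One way to score a single feature vector through spark.ml is to wrap it in a one-row DataFrame — a sketch, assuming a fitted model is in scope:

    import org.apache.spark.mllib.linalg.Vectors

    // wrap one feature vector in a one-row DataFrame with a "features" column
    val single = sqlContext.createDataFrame(
      Seq(Tuple1(Vectors.dense(0.1, 2.3, 4.5)))
    ).toDF("features")

    val scored = model.transform(single) // model: a fitted spark.ml model or pipeline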

Re: Get output of the ALS algorithm.

2016-03-15 Thread Bryan Cutler
Jacek is correct for using org.apache.spark.ml.recommendation.ALSModel If you are trying to save org.apache.spark.mllib.recommendation.MatrixFactorizationModel, then it is similar, but just a little different, see the example here https://github.com/apache/spark/blob/master/examples/src/main/scala

Re: Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
An observation. Once compiled with MVN, the job submit works as follows: + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master spark:// 50.140.197.217:7077 --executor-memory=12G --executor-cores=12 --num-executors=2 target/s

Re: Error building spark app with Maven

2016-03-15 Thread Ted Yu
<version>1.0</version> ... <artifactId>scala</artifactId> On Tue, Mar 15, 2016 at 4:14 PM, Mich Talebzadeh wrote: > An observation > > Once compiled with MVN the job submit works as follows: > > + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages > com.databricks:spark-csv_2.11:1.3.0 --class ImportCSV --master spark:// > 50.1

Re: Error building spark app with Maven

2016-03-15 Thread Mich Talebzadeh
ok Ted. In sbt I have name := "ImportCSV" version := "1.0" scalaVersion := "2.10.4" which ends up in importcsv_2.10-1.0.jar as part of target/scala-2.10/importcsv_2.10-1.0.jar. In mvn I have <version>1.0</version> and <artifactId>scala</artifactId>. Does it matter? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/vie

what is the best practice to read configure file in spark streaming

2016-03-15 Thread yaoxiaohua
Hi guys, I'm using kafka + spark streaming to do log analysis. Now my requirement is that the log alarm rules may change sometimes. Rules may be like this: App=Hadoop,keywords=oom|Exception|error,threshold=10 The thresho
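A common pattern is to re-read the rules on the driver inside transform, so every micro-batch picks up the latest version — a sketch, with the Rule type and loadRules() as hypothetical stand-ins for real rule parsing:

    // hypothetical rule type and loader; replace with real file parsing
    case class Rule(keyword: String) {
      def matches(line: String): Boolean = line.contains(keyword)
    }
    def loadRules(): Seq[Rule] =
      scala.io.Source.fromFile("/path/to/rules.conf").getLines().map(Rule(_)).toSeq

    val matched = logStream.transform { rdd =>
      // runs on the driver at each batch boundary, so rule changes take effect
      // on the next batch; the parsed rules ship with the closure (serializable)
      val rules = loadRules()
      rdd.filter(line => rules.exists(rule => rule.matches(line)))
    }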
