Re: SparkR Supported Types - Please add "bigint"

2015-08-07 Thread Davies Liu
They are actually the same thing, LongType. `long` is friendly for developers, `bigint` is friendly for database folks and maybe data scientists. On Thu, Jul 23, 2015 at 11:33 PM, Sun, Rui wrote: > printSchema calls StructField.buildFormattedString() to output schema > information. buildFormattedStri
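A minimal Scala sketch of the aliasing (the column name is made up):

    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    val schema = StructType(Seq(StructField("id", LongType, nullable = true)))
    // printSchema renders LongType as "long", while the SQL-style
    // simpleString/DDL notation spells the same 64-bit type "bigint"
    println(schema.simpleString)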

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-07 Thread Cheng Lian
Hi Philip, Thanks for providing the log file. It seems that most of the time is spent on partition discovery. The code snippet you provided actually issues two jobs. The first one is for listing the input directories to find out all leaf directories (and this actually requires listing all le

JavaSparkContext causes hadoop.ipc.RemoteException error

2015-08-07 Thread junliu6
Hi, I'm a new Spark user. Nowadays I meet a weird error in our cluster. I deployed spark-1.3.1 and CDH5 on my cluster; weeks ago I deployed NameNode HA on it. After that, my Spark jobs meet this error when I use the Java API, like this:

Why use spark.history.fs.logDirectory instead of spark.eventLog.dir

2015-08-07 Thread canan chen
Is there any reason that the history server uses another property for the event log dir? Thanks

DataFrame column structure change

2015-08-07 Thread Rishabh Bhardwaj
Hi all, I want to have some nesting structure from the existing columns of the dataframe. For that, I am trying to transform a DF in the following way, but couldn't do it. scala> df.printSchema root |-- a: string (nullable = true) |-- b: string (nullable = true) |-- c: string (nullable = true)

miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Gerald Loeffler
hi, if new LinearRegressionWithSGD() uses a miniBatchFraction of 1.0, doesn’t that make it a deterministic/classical gradient descent rather than an SGD? Specifically, miniBatchFraction=1.0 means the entire data set, i.e. all rows. In the spirit of SGD, shouldn’t the default be the fraction that r
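For reference, a sketch of the API being discussed (MLlib 1.x; trainingData is an assumed RDD[LabeledPoint]):

    import org.apache.spark.mllib.regression.LinearRegressionWithSGD

    // train(input, numIterations, stepSize, miniBatchFraction):
    // miniBatchFraction = 1.0 uses every row in each iteration (classical
    // gradient descent); values < 1.0 subsample rows, the stochastic part
    val model = LinearRegressionWithSGD.train(trainingData, 100, 1.0, 0.1)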

Spark on YARN

2015-08-07 Thread Jem Tucker
Hi, I am running Spark on YARN on the CDH 5.3.2 stack. I have created a new user to own and run a testing environment; however, when using this user, applications I submit to YARN never begin to run, even if they are the exact same applications that succeed under another user. Has anyone seen an

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-07 Thread Cheng Lian
However, it's weird that the partition discovery job only spawns 2 tasks. It should use the default parallelism, which is probably 8 according to the logs of the next Parquet reading job. Partition discovery is already done in a distributed manner via a Spark job. But the parallelism is mysteri

SparkR -Graphx Connected components

2015-08-07 Thread smagadi
Hi, I was trying to use stronglyConnectedComponents(). Given a DAG as the graph, I was supposed to get back a list of strongly connected components. def main(args: Array[String]) { val vertexArray = Array( (1L, ("Alice", 28)), (2L, ("Bob", 27)), (3L, ("Charlie", 65)), (4L, ("David", 42)), (5L,

How to distribute non-serializable object in transform task or broadcast ?

2015-08-07 Thread Hao Ren
Is there any workaround to distribute a non-serializable object for an RDD transformation or a broadcast variable? Say I have an object of class C which is not serializable. Class C is in a jar package; I have no control over it. Now I need to distribute it either by RDD transformation or by broadcast. I

Re: DataFrame column structure change

2015-08-07 Thread Rishabh Bhardwaj
I am doing it by creating a new data frame out of the fields to be nested and then joining with the original DF. Looking for some optimized solution here. On Fri, Aug 7, 2015 at 2:06 PM, Rishabh Bhardwaj wrote: > Hi all, > > I want to have some nesting structure from the existing columns of > the d
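A minimal sketch of doing this without a join, assuming Spark 1.4+ where org.apache.spark.sql.functions.struct is available:

    import org.apache.spark.sql.functions.{col, struct}

    // pack columns a and b into a nested struct column "ab", keep c as-is
    val nested = df.select(struct(col("a"), col("b")).as("ab"), col("c"))
    nested.printSchema()
    // root
    //  |-- ab: struct
    //  |    |-- a: string
    //  |    |-- b: string
    //  |-- c: string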

Re: StringIndexer + VectorAssembler equivalent to HashingTF?

2015-08-07 Thread Peter Rudenko
No, here's an example: COL1 COL2 a one b two a two c three StringIndexer.setInputCol(COL1).setOutputCol(SI1) -> (0-> a, 1-> b, 2-> c) SI1 0 1 0 2 StringIndexer.setInputCol(COL2).setOutputCol(SI2) -> (0-> one, 1-> two, 2-> three) SI2 0 1 1 2 VectorAssembler.setInpu
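A runnable sketch of that pipeline (spark.ml in Spark 1.4; df is an assumed DataFrame with string columns COL1 and COL2):

    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

    val si1 = new StringIndexer().setInputCol("COL1").setOutputCol("SI1")
    val si2 = new StringIndexer().setInputCol("COL2").setOutputCol("SI2")
    val indexed1 = si1.fit(df).transform(df)
    val indexed2 = si2.fit(indexed1).transform(indexed1)
    // assemble the two index columns into a single feature vector
    val assembled = new VectorAssembler()
      .setInputCols(Array("SI1", "SI2"))
      .setOutputCol("features")
      .transform(indexed2)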

automatically determine cluster number

2015-08-07 Thread Ziqi Zhang
Hi, I am new to Spark and I need to use the clustering functionality to process a large dataset. There are between 50k and 1mil objects to cluster. However, the problem is that the optimal number of clusters is unknown. We cannot even estimate a range, except we know there are N objects. Previo

Spark streaming and session windows

2015-08-07 Thread Ankur Chauhan
Hi all, I am trying to figure out how to perform the equivalent of "Session windows" (as mentioned in https://cloud.google.com/dataflow/model/windowing) using Spark Streaming. Is it even possible (i.e., possible to do efficiently at scale)? Just to expand on the definition: taken from the Google da

RE: Specifying the role when launching an AWS spark cluster using spark_ec2

2015-08-07 Thread Ewan Leith
You'll have a lot less hassle using the AWS EMR instances with Spark 1.4.1 for now, until the spark_ec2.py scripts move to Hadoop 2.7.1; at the moment I'm pretty sure they're only using Hadoop 2.4. The EMR setup with Spark lets you use s3:// URIs with IAM roles. Ewan -Original Message- From

Re: How to binarize data in spark

2015-08-07 Thread Adamantios Corais
I have ended up with the following piece of code but it turns out to be really slow... Any other ideas provided that I can only use MLlib 1.2? val data = test11.map(x=> ((x(0) , x(1)) , x(2))).groupByKey().map(x=> (x._1 , x._2.toArray)).map{x=> var lt : Array[Double] = new Array[Double](test12.s

Insert operation in Dataframe

2015-08-07 Thread guoqing0...@yahoo.com.hk
Hi all, Does the DataFrame support the insert operation, like sqlContext.sql("insert into table1 xxx select xxx from table2")? guoqing0...@yahoo.com.hk

Re: SparkR -Graphx Connected components

2015-08-07 Thread Robineast
Hi The graph returned by SCC (strong_graphs in your code) has vertex data where each vertex in a component is assigned the lowest vertex id of the component. So if you have 6 vertices (1 to 6) and 2 strongly connected components (1 and 3, and 2,4,5 and 6) then the strongly connected components are
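A small sketch of those semantics (GraphX; the vertex attributes and numIter value are arbitrary):

    import org.apache.spark.graphx.{Edge, Graph}

    val vertices = sc.parallelize((1L to 6L).map(id => (id, id)))
    val edges = sc.parallelize(Seq(
      Edge(1L, 3L, 1), Edge(3L, 1L, 1), // component {1, 3}
      Edge(2L, 4L, 1), Edge(4L, 5L, 1), Edge(5L, 6L, 1), Edge(6L, 2L, 1))) // component {2, 4, 5, 6}
    val scc = Graph(vertices, edges).stronglyConnectedComponents(numIter = 5)
    // each vertex attribute is now the lowest vertex id in its component:
    // 1 and 3 map to 1; 2, 4, 5 and 6 map to 2
    scc.vertices.collect().foreach(println)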

Re: Time series forecasting

2015-08-07 Thread ploffay
I'm interested in machine learning on time series. In our environment we have a lot of metric data continuously coming from agents. The data are stored in Cassandra. Is it possible to set up Spark so that it applies machine learning to previous data and new incoming data?

Issues with Phoenix 4.5

2015-08-07 Thread Nicola Ferraro
Hi all, I am getting an exception when trying to execute a Spark job that is using the new Phoenix 4.5 Spark connector. The application works very well on my local machine, but fails to run in a cluster environment on top of YARN. The cluster is a Cloudera CDH 5.4.4 with HBase 1.0.0 and Phoenix 4.

Re: log4j custom appender ClassNotFoundException with spark 1.4.1

2015-08-07 Thread mlemay
Looking at the call stack and the diffs between 1.3.1 and 1.4.1-rc4, I see something that could be relevant to the issue. 1) The call stack tells us that the log4j manager gets initialized and uses the default Java context class loader. This context class loader should probably be MutableURLClassLoader from Spark but

Re: log4j custom appender ClassNotFoundException with spark 1.4.1

2015-08-07 Thread mlemay
That starts to smell... When analyzing SparkSubmit.scala, we can see that one of the first things it does is parse arguments. This uses the Utils object and triggers initialization of member variables. One such variable is ShutdownHookManager (which didn't exist in Spark 1.3) with the later log4j

Possible bug: JDBC with Speculative mode launches orphan queries

2015-08-07 Thread Saif.A.Ellafi
Hello, When enabling speculation, my first job is to launch a partitioned JDBC DataFrame query, in which some partitions take longer than others to respond. This causes speculation and launches the query again on new nodes. When one of those nodes finishes the query, the speculative one remains f

Amazon DynamoDB & Spark

2015-08-07 Thread Yasemin Kaya
Hi, Is there a way to use DynamoDB in a Spark application? I have to persist my results to DynamoDB. Thanx, yasemin -- hiç ender hiç

Estimate size of Dataframe programmatically

2015-08-07 Thread Srikanth
Hello, Is there a way to estimate the approximate size of a dataframe? I know we can cache and look at the size in the UI, but I'm trying to do this programmatically. With an RDD, I can sample and sum up size using SizeEstimator, then extrapolate it to the entire RDD. That will give me the approx size of the RDD.
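A hedged sketch of that sample-and-extrapolate idea (SizeEstimator is a developer API, public from around Spark 1.5; the fraction and the collect() assume the sample fits on the driver):

    import org.apache.spark.util.SizeEstimator

    val fraction = 0.01
    // measure a small sample on the driver, then extrapolate linearly
    val sample = df.rdd.sample(withReplacement = false, fraction).collect()
    val approxBytes = (SizeEstimator.estimate(sample) / fraction).toLong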

Re: log4j custom appender ClassNotFoundException with spark 1.4.1

2015-08-07 Thread mlemay
One possible solution is to spark-submit with --driver-class-path and list all recursive dependencies. This is fragile and error prone. Non-working alternatives (used in SparkSubmit.scala AFTER arguments parser is initialized): spark-submit --packages ... spark-submit --jars ... spark-defaults.c

Re: log4j custom appender ClassNotFoundException with spark 1.4.1

2015-08-07 Thread mlemay
Offending commit is : [SPARK-6014] [core] Revamp Spark shutdown hooks, fix shutdown races. https://github.com/apache/spark/commit/e72c16e30d85cdc394d318b5551698885cfda9b8

Issue when rebroadcasting a variable outside of the definition scope

2015-08-07 Thread simone.robutti
Hello everyone, this is my first message ever to a mailing list so please pardon me if for some reason I'm violating the etiquette. I have a problem with rebroadcasting a variable. How it should work is not well documented, so I could find only a few simple examples to understand how it should

Re: How to distribute non-serializable object in transform task or broadcast ?

2015-08-07 Thread Sujit Pal
Hi Hao, I think sc.broadcast will allow you to broadcast non-serializable objects. According to the scaladocs the Broadcast class itself is Serializable and it wraps your object, allowing you to get it from the Broadcast object using value(). Not 100% sure though since I haven't tried broadcastin

Re: log4j.xml bundled in jar vs log4.properties in spark/conf

2015-08-07 Thread mlemay
See this post for a detailed explanation of your problem: http://apache-spark-user-list.1001560.n3.nabble.com/log4j-custom-appender-ClassNotFoundException-with-spark-1-4-1-tt24159.html

distributing large matrices

2015-08-07 Thread iceback
Is this the sort of problem spark can accommodate? I need to compare 10,000 matrices with each other (10^10 comparison). The matrices are 100x10 (10^7 int values). I have 10 machines with 2 to 8 cores (8-32 "processors"). All machines have to - contribute to matrices generation (a simulati

Re: How to distribute non-serializable object in transform task or broadcast ?

2015-08-07 Thread Philip Weaver
If the object cannot be serialized, then I don't think broadcast will make it magically serializable. You can't transfer data structures between nodes without serializing them somehow. On Fri, Aug 7, 2015 at 7:31 AM, Sujit Pal wrote: > Hi Hao, > > I think sc.broadcast will allow you to broadcast

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-07 Thread Philip Weaver
Thanks, I also confirmed that the partition discovery is slow by writing a non-Spark application that uses the parquet library directly to load those partitions. It's so slow that my colleague's Python application can read the entire contents of all the parquet data files faster than my application

Re: How to distribute non-serializable object in transform task or broadcast ?

2015-08-07 Thread Han JU
If the object is something like a utility object (say a DB connection handler), I often use: @transient lazy val someObj = MyFactory.getObj(...) So basically `@transient` tells the closure cleaner not to serialize this, and the `lazy val` allows it to be initialized on each executor upon its firs
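A self-contained sketch of the pattern (Heavy and Holder are made-up names standing in for the third-party class and its wrapper):

    class Heavy { def process(s: String): String = s.toUpperCase } // pretend this comes from a jar and is not serializable

    object Holder extends Serializable {
      // skipped during closure serialization; built once per executor JVM on first use
      @transient lazy val heavy = new Heavy
    }

    val out = rdd.map(line => Holder.heavy.process(line))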

Spark job workflow engine recommendations

2015-08-07 Thread Vikram Kone
Hi, I'm looking for open source workflow tools/engines that allow us to schedule Spark jobs on a Datastax Cassandra cluster. Since there are tonnes of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc., I wanted to check with people here to see what they are using today. Some of the r

Re: Spark job workflow engine recommendations

2015-08-07 Thread Hien Luu
Looks like Oozie can satisfy most of your requirements. On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone wrote: > Hi, > I'm looking for open source workflow tools/engines that allow us to > schedule spark jobs on a datastax cassandra cluster. Since there are tonnes > of alternatives out there like

Re: How to distribute non-serializable object in transform task or broadcast ?

2015-08-07 Thread Eugene Morozov
Hao, I’d say there are a few possible ways to achieve that: 1. Use KryoSerializer. The flaw of KryoSerializer is that the current version (2.21) has an issue with internal state and it might not work for some objects. Spark gets the kryo dependency transitively through chill, and it’ll not be resolved q
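For point 1, a short sketch of switching to Kryo (SomeThirdPartyClass is a placeholder for the class you don't control):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // registering classes is optional but makes the serialized form smaller
    conf.registerKryoClasses(Array(classOf[SomeThirdPartyClass]))
    val sc = new SparkContext(conf)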

RE: Issue when rebroadcasting a variable outside of the definition scope

2015-08-07 Thread Ganelin, Ilya
Simone, here are some thoughts. Please check out the "understanding closures" section of the Spark Programming Guide. Secondly, broadcast variables do not propagate updates to the underlying data. You must either create a new broadcast variable or alternately if you simply wish to accumulate res
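A minimal sketch of the "create a new broadcast variable" approach (loadLookupTable is a hypothetical loader):

    var lookup = sc.broadcast(loadLookupTable())

    // ... later, when the underlying data has changed:
    lookup.unpersist() // drop the stale copies cached on the executors
    lookup = sc.broadcast(loadLookupTable()) // jobs submitted after this see the new value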

Re: Spark job workflow engine recommendations

2015-08-07 Thread Vikram Kone
Thanks for the suggestion Hien. I'm curious why not Azkaban from LinkedIn. From what I read online, Oozie was very cumbersome to set up and use compared to Azkaban. Since you are from LinkedIn, I wanted to get some perspective on what it lacks compared to Oozie. Ease of use is very important more than

RE: distributing large matrices

2015-08-07 Thread Koen Vantomme
Sent from my Sony Xperia™ smartphone. iceback wrote: >Is this the sort of problem spark can accommodate? > >I need to compare 10,000 matrices with each other (10^10 comparison). The >matrices are 100x10 (10^7 int values). >I have 10 machines with 2 to 8 cores (8-32 "pro

Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Feynman Liang
Sounds reasonable to me, feel free to create a JIRA (and PR if you're up for it) so we can see what others think! On Fri, Aug 7, 2015 at 1:45 AM, Gerald Loeffler < gerald.loeff...@googlemail.com> wrote: > hi, > > if new LinearRegressionWithSGD() uses a miniBatchFraction of 1.0, > doesn’t that mak

Newbie question: what makes Spark run faster than MapReduce

2015-08-07 Thread Muler
Consider the classic word count application over a 4-node cluster with sizable working data. What makes Spark run faster than MapReduce, considering that Spark also has to write to disk during shuffle?

Re: Amazon DynamoDB & Spark

2015-08-07 Thread Jay Vyas
In general the simplest way is that you can use the Dynamo Java API as is and call it inside a map(), and use the asynchronous put() Dynamo API call. > On Aug 7, 2015, at 9:08 AM, Yasemin Kaya wrote: > > Hi, > > Is there a way using DynamoDB in spark application? I have to persist my > res
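A hedged sketch of that approach with the AWS Java SDK v1, written with foreachPartition instead of map() so the client is built once per partition (table name, key names and the RDD's element type are assumptions; credentials come from the default provider chain):

    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient
    import com.amazonaws.services.dynamodbv2.model.{AttributeValue, PutItemRequest}
    import scala.collection.JavaConverters._

    results.foreachPartition { rows => // results: RDD[(String, Double)]
      val client = new AmazonDynamoDBClient() // one client per partition, not per row
      rows.foreach { case (id, score) =>
        val item = Map(
          "id"    -> new AttributeValue(id),
          "score" -> new AttributeValue().withN(score.toString)).asJava
        client.putItem(new PutItemRequest("results_table", item))
      }
    }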

Re: Newbie question: what makes Spark run faster than MapReduce

2015-08-07 Thread Hien Luu
This blog outlines a few things that make Spark faster than MapReduce - https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html On Fri, Aug 7, 2015 at 9:13 AM, Muler wrote: > Consider the classic word count application over a 4 node cluster with a > sizable working data. What makes Spark

Re: Newbie question: what makes Spark run faster than MapReduce

2015-08-07 Thread Corey Nolet
1) Spark only needs to shuffle when data needs to be partitioned around the workers in an all-to-all fashion. 2) Multi-stage jobs that would normally require several MapReduce jobs (causing data to be dumped to disk between the jobs) can instead keep intermediate data cached in memory.
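A tiny sketch of point 2 (the input path is a placeholder):

    val words = sc.textFile("hdfs:///tmp/input.txt").flatMap(_.split("\\s+"))
    words.cache() // both actions below reuse the in-memory data instead of re-reading the file

    val counts = words.map((_, 1)).reduceByKey(_ + _).collect()
    val total = words.count()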

Spark is in-memory processing, how then can Tachyon make Spark faster?

2015-08-07 Thread Muler
Spark is an in-memory engine and attempts to do computation in-memory. Tachyon is memory-centric distributed storage, OK, but how would that help run Spark faster?

Re: Spark MLib v/s SparkR

2015-08-07 Thread Feynman Liang
SparkR and MLlib are becoming more integrated (we recently added R formula support) but the integration is still quite small. If you learn R and SparkR, you will not be able to leverage most of the distributed algorithms in MLlib (e.g. all the algorithms you cited). However, you could use the equiv

How to run start-thrift-server in debug mode?

2015-08-07 Thread Benjamin Ross
Hi, I'm trying to run the hive thrift server in debug mode. I've tried to simply pass -Xdebug -Xrunjdwp:transport=dt_socket,address=127.0.0.1:,server=y,suspend=n to start-thriftserver.sh as a driver option, but it doesn't seem to host a server. I've then tried to edit the various shell sc
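One guess worth trying: the thrift server is started as a daemon, so the debug flags may need to go through SPARK_DAEMON_JAVA_OPTS in spark-env.sh rather than driver options (port 5005 below is an arbitrary placeholder, since the original port was elided):

    export SPARK_DAEMON_JAVA_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,address=5005,server=y,suspend=n"
    sbin/start-thriftserver.sh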

tachyon

2015-08-07 Thread Abhishek R. Singh
Do people use Tachyon in production, or is it experimental grade still? Regards, Abhishek

Re: Spark job workflow engine recommendations

2015-08-07 Thread Jörn Franke
Also check Falcon in combination with Oozie. On Fri, Aug 7, 2015 at 17:51, Hien Luu wrote: > Looks like Oozie can satisfy most of your requirements. > > > > On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone wrote: > >> Hi, >> I'm looking for open source workflow tools/engines that allow us to >> sch

Re: tachyon

2015-08-07 Thread Ted Yu
Looks like you would get better response on Tachyon's mailing list: https://groups.google.com/forum/?fromgroups#!forum/tachyon-users Cheers On Fri, Aug 7, 2015 at 9:56 AM, Abhishek R. Singh < abhis...@tetrationanalytics.com> wrote: > Do people use Tachyon in production, or is it experimental gr

Re: Spark job workflow engine recommendations

2015-08-07 Thread Nick Pentreath
Hi Vikram, We use Azkaban (2.5.0) in our production workflow scheduling. We just use local mode deployment and it is fairly easy to set up. It is pretty easy to use and has a nice scheduling and logging interface, as well as SLAs (like kill job and notify if it doesn't complete in 3 hours or whate

Re: Spark job workflow engine recommendations

2015-08-07 Thread Ted Yu
From what I heard (an ex-coworker who is an Oozie committer), Azkaban is being phased out at LinkedIn because of scalability issues (though UI-wise, Azkaban seems better). Vikram: I suggest you do more research in related projects (maybe using their mailing lists). Disclaimer: I don't work for Link

SparkSQL: remove jar added by "add jar " command from dependencies

2015-08-07 Thread Wu, James C.
Hi, I am using Spark SQL to run some queries on a set of avro data. Somehow I am getting this error 0: jdbc:hive2://n7-z01-0a2a1453> select count(*) from flume_test; Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 26.0 failed 4 times, most recent failu

Re: How to distribute non-serializable object in transform task or broadcast ?

2015-08-07 Thread Eugene Morozov
Would like to add something, inlined. On 07 Aug 2015, at 18:51, Eugene Morozov wrote: > Hao, > > I’d say there are few possible ways to achieve that: > 1. Use KryoSerializer. > The flaw of KryoSerializer is that current version (2.21) has an issue with > internal state and it might not work for

RE: All masters are unresponsive! Giving up.

2015-08-07 Thread Jeff Jones
Thanks. Added this to both the client and the master but still not getting any more information. I confirmed the flag with ps. jjones53222 2.7 0.1 19399412 549656 pts/3 Sl 17:17 0:44 /opt/jdk1.8/bin/java -cp /home/jjones/bin/spark-1.4.1-bin-hadoop2.6/sbin/../conf/:/home/jjones/bin/spa

[Spark Streaming] Session based windowing like in google dataflow

2015-08-07 Thread Ankur Chauhan
Hi all, I am trying to figure out how to perform the equivalent of "Session windows" (as mentioned in https://cloud.google.com/dataflow/model/windowing) using Spark Streaming. Is it even possible (i.e., possible to do efficiently at scale)? Just to expand on the definition: taken from the Google da

Re: Spark job workflow engine recommendations

2015-08-07 Thread Vikram Kone
Oh ok. That's a good enough reason against azkaban then. So looks like Oozie is the best choice here. On Friday, August 7, 2015, Ted Yu wrote: > From what I heard (an ex-coworker who is Oozie committer), Azkaban is > being phased out at LinkedIn because of scalability issues (though UI-wise, > A

Re: All masters are unresponsive! Giving up.

2015-08-07 Thread Ted Yu
Spark 1.4.1 depends on akka 2.3.4-spark. Is it possible that your standalone cluster has another version of akka? Cheers On Fri, Aug 7, 2015 at 10:48 AM, Jeff Jones wrote: > Thanks. Added this to both the client and the master but still not getting > any more information. I confirmed the flag

Re: All masters are unresponsive! Giving up.

2015-08-07 Thread Igor Berman
check on which ip/port master listens netstat -a -t --numeric-ports On 7 August 2015 at 20:48, Jeff Jones wrote: > Thanks. Added this to both the client and the master but still not getting > any more information. I confirmed the flag with ps. > > > > jjones53222 2.7 0.1 19399412 549656 p

Re: tachyon

2015-08-07 Thread Calvin Jia
Hi Abhishek, Here's a production use case that may interest you: http://www.meetup.com/Tachyon/events/222485713/ Baidu is using Tachyon

Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Meihua Wu
I think in the SGD algorithm, the mini batch sample is done without replacement, so with fraction=1 all the rows will be sampled exactly once to form the miniBatch, resulting in the deterministic/classical case. On Fri, Aug 7, 2015 at 9:05 AM, Feynman Liang wrote: > Sounds reasonable to me,

Fwd: [Spark + Hive + EMR + S3] Issue when reading from Hive external table backed on S3 with large amount of small files

2015-08-07 Thread Roberto Coluccio
Please community, I'd really appreciate your opinion on this topic. Best regards, Roberto -- Forwarded message -- From: Roberto Coluccio Date: Sat, Jul 25, 2015 at 6:28 PM Subject: [Spark + Hive + EMR + S3] Issue when reading from Hive external table backed on S3 with large amou

Re: Amazon DynamoDB & Spark

2015-08-07 Thread Yasemin Kaya
Thanx Jay. 2015-08-07 19:25 GMT+03:00 Jay Vyas : > In general the simplest way is that you can use the Dynamo Java API as is > and call it inside a map(), and use the asynchronous put() Dynamo api call > . > > > > On Aug 7, 2015, at 9:08 AM, Yasemin Kaya wrote: > > > > Hi, > > > > Is there a wa

Re: Spark job workflow engine recommendations

2015-08-07 Thread Hien Luu
Scalability is a known issue due to the current architecture. However, this only becomes applicable if you run more than 20K jobs per day. On Fri, Aug 7, 2015 at 10:30 AM, Ted Yu wrote: > From what I heard (an ex-coworker who is Oozie committer), Azkaban is > being phased out at LinkedIn because of scala

Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Feynman Liang
Yep, I think that's what Gerald is saying and they are proposing to default miniBatchFraction = (1 / numInstances). Is that correct? On Fri, Aug 7, 2015 at 11:16 AM, Meihua Wu wrote: > I think in the SGD algorithm, the mini batch sample is done without > replacement. So with fraction=1, then all

Re: Spark job workflow engine recommendations

2015-08-07 Thread Ted Yu
In my opinion, choosing some particular project among its peers should leave enough room for future growth (which may come faster than you initially think). Cheers On Fri, Aug 7, 2015 at 11:23 AM, Hien Luu wrote: > Scalability is a known issue due the the current architecture. However > this w

Spark SQL query AVRO file

2015-08-07 Thread java8964
Hi, Spark users: We currently are using Spark 1.2.2 + Hive 0.12 + Hadoop 2.2.0 on our production cluster, which has 42 data/task nodes. There is one dataset stored as Avro files about 3T. Our business has a complex query running for the dataset, which is stored in nest structure with Array of St

Get bucket details created in shuffle phase

2015-08-07 Thread cheez
Hey all. I was trying to understand Spark internals by looking into (and hacking) the code. I was trying to explore the buckets which are generated when we partition the output of each map task and then let the reduce side fetch them on the basis of partitionId. I went into the write() method of

Re: Spark SQL query AVRO file

2015-08-07 Thread Michael Armbrust
Have you considered trying Spark SQL's native support for avro data? https://github.com/databricks/spark-avro On Fri, Aug 7, 2015 at 11:30 AM, java8964 wrote: > Hi, Spark users: > > We currently are using Spark 1.2.2 + Hive 0.12 + Hadoop 2.2.0 on our > production cluster, which has 42 data/task

Re: Spark is in-memory processing, how then can Tachyon make Spark faster?

2015-08-07 Thread Calvin Jia
Hi, Tachyon manages memory off heap which can help prevent long GC pauses. Also, using Tachyon will allow the data to be shared between Spark jobs if they use the same dataset. Here's a production use case where Baidu

RE: Spark SQL query AVRO file

2015-08-07 Thread java8964
Hi, Michael: I am not sure how spark-avro can help in this case. My understanding is that to use spark-avro, I have to translate all the logic from this big Hive query into Spark code, right? If I have this big Hive query, how can I use spark-avro to run the query? Thanks Yong From: mich...@data

Re: Spark SQL query AVRO file

2015-08-07 Thread Michael Armbrust
You can register your data as a table using this library and then query it using HiveQL CREATE TEMPORARY TABLE episodes USING com.databricks.spark.avro OPTIONS (path "src/test/resources/episodes.avro") On Fri, Aug 7, 2015 at 11:42 AM, java8964 wrote: > Hi, Michael: > > I am not sure how spark-
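The same thing via the programmatic API, as a sketch (Spark 1.4+ DataFrameReader; the path is the one from the example above):

    val episodes = sqlContext.read
      .format("com.databricks.spark.avro")
      .load("src/test/resources/episodes.avro")
    episodes.registerTempTable("episodes")
    sqlContext.sql("SELECT count(*) FROM episodes").show()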

RE: Spark SQL query AVRO file

2015-08-07 Thread java8964
Good to know that. Let me research it and give it a try. Thanks Yong From: mich...@databricks.com Date: Fri, 7 Aug 2015 11:44:48 -0700 Subject: Re: Spark SQL query AVRO file To: java8...@hotmail.com CC: user@spark.apache.org You can register your data as a table using this library and then query

Re: Estimate size of Dataframe programatically

2015-08-07 Thread Ted Yu
Have you tried calling SizeEstimator.estimate() on a DataFrame ? I did the following in REPL: scala> SizeEstimator.estimate(df) res1: Long = 17769680 FYI On Fri, Aug 7, 2015 at 6:48 AM, Srikanth wrote: > Hello, > > Is there a way to estimate the approximate size of a dataframe? I know we > ca

Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread Saif.A.Ellafi
Hi, A silly question here. The driver web UI dies when the spark-submit program finishes. I would like some time to analyze after the program ends, as the page does not refresh itself; when I hit F5 I lose all the info. Thanks, Saif

Re: Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread François Pelletier
Hi, all Spark applications are saved by the Spark History Server; look at your host on port 18080 instead of 4040. François On 2015-08-07 15:26, saif.a.ell...@wellsfargo.com wrote: > Hi, > > A silly question here. The Driver Web UI dies when the spark-submit > program finish. I would like some t

SparkSQL: "add jar" blocks all queries

2015-08-07 Thread Wu, James C.
Hi, I got into a situation where a prior "add jar" command causes Spark SQL to stop working for all users. Does anyone know how to fix the issue? Regards, james From: Wu, James C. <james.c...@disney.com> Date: Friday, August 7, 2015 at 10:29 AM To: "user@spark.apache.org

Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Meihua Wu
Feynman, thanks for clarifying. If we default miniBatchFraction = (1 / numInstances), then we will only hit one row for every iteration of SGD regardless the number of partitions and executors. In other words the parallelism provided by the RDD is lost in this approach. I think this is something w

RE: Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread Saif.A.Ellafi
Hello, thank you, but that port is unreachable for me. Can you please share where I can find that port equivalent in my environment? Thank you Saif From: François Pelletier [mailto:newslett...@francoispelletier.org] Sent: Friday, August 07, 2015 4:38 PM To: user@spark.apache.org Subject: Re: Spa

RE: Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread Koen Vantomme
Sent from my Sony Xperia™ smartphone. saif.a.ell...@wellsfargo.com wrote: > > >Hello, thank you, but that port is unreachable for me. Can you please share >where I can find that port equivalent in my environment? > > > >Thank you > >Saif > > > >From: François Pelletier [ma

Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Koen Vantomme
Sent from my Sony Xperia™ smartphone. Meihua Wu wrote: >Feynman, thanks for clarifying. > >If we default miniBatchFraction = (1 / numInstances), then we will >only hit one row for every iteration of SGD regardless the number of >partitions and executors. In other words the par

Re: Spark master driver UI: How to keep it after process finished?

2015-08-07 Thread François Pelletier
Look at spark.history.ui.port if you use standalone, or spark.yarn.historyServer.address if you use YARN, in your Spark config file. Mine is located at /etc/spark/conf/spark-defaults.conf. If you use Apache Ambari you can find these settings in the Spark / Configs / Advanced spark-defaults tab. Franç
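For reference, a sketch of the relevant spark-defaults.conf entries (the HDFS path is a placeholder): applications write events to spark.eventLog.dir and the history server reads them from spark.history.fs.logDirectory, which is why both properties exist and are usually pointed at the same location.

    spark.eventLog.enabled           true
    spark.eventLog.dir               hdfs:///user/spark/applicationHistory
    spark.history.fs.logDirectory    hdfs:///user/spark/applicationHistory
    spark.history.ui.port            18080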

Re: Spark is in-memory processing, how then can Tachyon make Spark faster?

2015-08-07 Thread andy petrella
Exactly! The sharing part is used in the Spark Notebook so we can share stuff between notebooks which are different SparkContexts (in diff JVMs). OTOH, we have a project that creates micro services

Re: miniBatchFraction for LinearRegressionWithSGD

2015-08-07 Thread Feynman Liang
Good point; I agree that defaulting to online SGD (single example per iteration) would be a poor UX due to performance. On Fri, Aug 7, 2015 at 12:44 PM, Meihua Wu wrote: > Feynman, thanks for clarifying. > > If we default miniBatchFraction = (1 / numInstances), then we will > only hit one row fo

Re: SparkSQL: "add jar" blocks all queries

2015-08-07 Thread Wu, James C.
Hi, The issue only seems to happen when trying to access spark via the SparkSQL Thrift Server interface. Does anyone know a fix? james From: Wu, James C. <james.c...@disney.com> Date: Friday, August 7, 2015 at 12:40 PM To: "user@spark.apache.org"

Problems getting expected results from hbase_inputformat.py

2015-08-07 Thread Eric Bless
I’m having some difficulty getting the desired results from the Spark Python example hbase_inputformat.py. I’m running with CDH 5.4, HBase version 1.0.0, Spark v 1.3.0, using Python version 2.6.6. I followed the example to create a test HBase table. Here’s the data from the table I created – hbase(m

Fwd: spark config

2015-08-07 Thread Bryce Lobdell
I recently downloaded Spark package 1.4.0: A build of Spark with "sbt/sbt clean assembly" failed with the message "Error: Invalid or corrupt jarfile build/sbt-launch-0.13.7.jar". Upon investigation I figured out that "sbt-launch-0.13.7.jar" is downloaded at build time and that it contained the fol

Accessing S3 files with s3n://

2015-08-07 Thread Akshat Aranya
Hi, I've been trying to track down some problems with Spark reads being very slow with s3n:// URIs (NativeS3FileSystem). After some digging around, I realized that this file system implementation fetches the entire file, which isn't really a Spark problem, but it really slows things down when try

How to get total CPU consumption for Spark job

2015-08-07 Thread Xiao JIANG
Hi all, I was running some Hive/spark job on hadoop cluster. I want to see how spark helps improve not only the elapsed time but also the total CPU consumption. For Hive, I can get the 'Total MapReduce CPU Time Spent' from the log when the job finishes. But I didn't find any CPU stats for Spark

Re: tachyon

2015-08-07 Thread Abhishek R. Singh
Thanks Calvin - much appreciated ! -Abhishek- On Aug 7, 2015, at 11:11 AM, Calvin Jia wrote: > Hi Abhishek, > > Here's a production use case that may interest you: > http://www.meetup.com/Tachyon/events/222485713/ > > Baidu is using Tachyon to manage more than 100 nodes in production resulti

Re: spark config

2015-08-07 Thread Ted Yu
In master branch, build/sbt-launch-lib.bash has the following: URL1= https://dl.bintray.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/${SBT_VERSION}/sbt-launch.jar I verified that the following exists: https://dl.bintray.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.7/#sbt-launc

Spark failed while trying to read parquet files

2015-08-07 Thread Jerrick Hoang
Hi all, I have a partitioned parquet table (very small table with only 2 partitions). The version of spark is 1.4.1, parquet version is 1.7.0. I applied this patch to spark [SPARK-7743] so I assume that spark can read parquet files normally, however, I'm getting this when trying to do a simple `se

Re: spark config

2015-08-07 Thread Ted Yu
Looks like Sean fixed it: [SPARK-9633] [BUILD] SBT download locations outdated; need an update Cheers On Fri, Aug 7, 2015 at 3:22 PM, Dean Wampler wrote: > That's the correct URL. Recent change? The last time I looked, earlier > this week, it still had the obsolete artifactory URL for URL1 ;)

Re: spark config

2015-08-07 Thread Dean Wampler
That's the correct URL. Recent change? The last time I looked, earlier this week, it still had the obsolete artifactory URL for URL1 ;) Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampl

Re: Spark failed while trying to read parquet files

2015-08-07 Thread Philip Weaver
Yes, NullPointerExceptions are pretty common in Spark (or, rather, I seem to encounter them a lot!) but can occur for a few different reasons. Could you add some more detail, like what the schema is for the data, or the code you're using to read it? On Fri, Aug 7, 2015 at 3:20 PM, Jerrick Hoang w

Re: Spark job workflow engine recommendations

2015-08-07 Thread Vikram Kone
Hien, Is Azkaban being phased out at linkedin as rumored? If so, what's linkedin going to use for workflow scheduling? Is there something else that's going to replace Azkaban? On Fri, Aug 7, 2015 at 11:25 AM, Ted Yu wrote: > In my opinion, choosing some particular project among its peers should

does dstream.transform() run on the driver node?

2015-08-07 Thread lookfwd
Hello, here's a simple program that demonstrates my problem: Is "keyavg = rdd.values().reduce(sum) / rdd.count()" inside stats calculated once per partition or just once? I guess another way to ask the same question is: is DStream.transform() called on the driver node or not? What would
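For what it's worth, a Scala sketch of the semantics in question (stream is an assumed DStream[(String, Double)]): the function passed to transform runs on the driver once per batch, but the RDD operations inside it still execute as distributed jobs.

    val normalized = stream.transform { rdd =>
      // driver-side per batch, but reduce and count each launch a distributed job
      val keyavg = rdd.values.reduce(_ + _) / rdd.count()
      rdd.mapValues(_ / keyavg) // runs on the executors
    }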

Re: Spark failed while trying to read parquet files

2015-08-07 Thread Cheng Lian
It doesn't seem to be Parquet 1.7.0 since the package name isn't under "org.apache.parquet" (1.7.0 is the first official Apache release of Parquet). The version you were using is probably Parquet 1.6.0rc3 according to the line number information: https://github.com/apache/parquet-mr/blob/parque

Re: Spark failed while trying to read parquet files

2015-08-07 Thread Jerrick Hoang
Yes! I was being dumb, should have caught that earlier, thank you Cheng Lian On Fri, Aug 7, 2015 at 4:25 PM, Cheng Lian wrote: > It doesn't seem to be Parquet 1.7.0 since the package name isn't under > "org.apache.parquet" (1.7.0 is the first official Apache release of > Parquet). The version yo
