Re: S3 SubFolder Write Issues

2015-03-11 Thread Calum Leslie
You want the s3n:// ("native") protocol rather than s3://. s3:// is a block filesystem based on S3 that doesn't respect paths. More information on the Hadoop site: https://wiki.apache.org/hadoop/AmazonS3 Calum. On Wed, 11 Mar 2015 04:47 cpalm3 wrote: > Hi All, > > I am hoping someone has seen
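A minimal sketch of a native-protocol write (bucket, folder, and credentials below are placeholders, not taken from the thread):

    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
    // Writes one part file per partition under the given prefix
    rdd.saveAsTextFile("s3n://my-bucket/sub-folder/output")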

Re: SocketTextStream not working from messages sent from other host

2015-03-11 Thread Akhil Das
Maybe you can use this code for your purpose: https://gist.github.com/akhld/4286df9ab0677a555087 It basically sends the content of a given file through a socket (both IO and NIO); I used it for a benchmark between IO and NIO. Thanks Best Regards On Wed, Mar 11, 2015 at 11:36 AM, Cui Lin wrote: >
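For the receiving side, a minimal socketTextStream sketch (host and port are placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))
    // Connects to the sender and treats each line of text as one record
    val lines = ssc.socketTextStream("sender-host", 9999)
    lines.print()
    ssc.start()
    ssc.awaitTermination()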

Re: Spark fpg large basket

2015-03-11 Thread Akhil Das
You need to set spark.cores.max to a number, say 16, so that the tasks get distributed evenly across all 4 machines. Another thing would be to set spark.default.parallelism, if you haven't tried that already. Thanks Best Regards On Wed, Mar 11, 2015 at 12:27 PM, Sean Barzilay wrote: > I am running on
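For example, a minimal SparkConf sketch (the app name and parallelism value are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("fpg")                         // placeholder name
      .set("spark.cores.max", "16")              // total cores across the cluster
      .set("spark.default.parallelism", "64")    // default number of tasks per stage
    val sc = new SparkContext(conf)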

How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
Hi, I need to set per-user spark.local.dir, how can I do that? I tried both /x/home/${user.name}/spark/tmp and /x/home/${USER}/spark/tmp And neither worked. Looks like it has to be a constant setting in spark-defaults.conf. Right? Any ideas how to do that? Thanks, -- Jianshi Huang Linke

Re: S3 SubFolder Write Issues

2015-03-11 Thread Akhil Das
Does it write anything in BUCKET/SUB_FOLDER/output? Thanks Best Regards On Wed, Mar 11, 2015 at 10:15 AM, cpalm3 wrote: > Hi All, > > I am hoping someone has seen this issue before with S3, as I haven't been > able to find a solution for this problem. > > When I try to save as Text file to s3 i

Re: Temp directory used by spark-submit

2015-03-11 Thread Akhil Das
After setting SPARK_LOCAL_DIRS/SPARK_WORKER_DIR you need to restart your Spark instances (stop-all.sh and start-all.sh). You can also try setting java.io.tmpdir while creating the SparkContext. Thanks Best Regards On Wed, Mar 11, 2015 at 1:47 AM, Justin Yip wrote: > Hello, > > I notice that whe
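A minimal sketch of setting these programmatically (paths are placeholders; spark.local.dir added for completeness):

    import org.apache.spark.{SparkConf, SparkContext}

    // Set the JVM temp dir before the SparkContext is created
    System.setProperty("java.io.tmpdir", "/data/tmp")
    val conf = new SparkConf().set("spark.local.dir", "/data/spark-local")
    val sc = new SparkContext(conf)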

Re: How to set per-user spark.local.dir?

2015-03-11 Thread Patrick Wendell
We don't support expressions or wildcards in that configuration. For each application, the local directories need to be constant. If you have users submitting different Spark applications, those can each set spark.local.dir. - Patrick On Wed, Mar 11, 2015 at 12:14 AM, Jianshi Huang wrote: > Hi,
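Each application can still derive a per-user directory itself when building its SparkConf; a minimal sketch (the path mirrors the one from the thread, resolved in application code rather than in spark-defaults.conf):

    import org.apache.spark.{SparkConf, SparkContext}

    val user = sys.props.getOrElse("user.name", "unknown")
    val conf = new SparkConf().set("spark.local.dir", s"/x/home/$user/spark/tmp")
    val sc = new SparkContext(conf)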

Re: How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
Unfortunately the /tmp mount is really small in our environment. I need to provide a per-user setting as the default value. I hacked bin/spark-class for a similar effect, and spark-defaults.conf can override it. :) Jianshi On Wed, Mar 11, 2015 at 3:28 PM, Patrick Wendell wrote: > We don't suppor

Re: How to set per-user spark.local.dir?

2015-03-11 Thread Sean Owen
You shouldn't use /tmp, but that doesn't mean you should use user home directories either. Typically, as in YARN, you would have a number of directories (on different disks) mounted and configured for local storage for jobs. On Wed, Mar 11, 2015 at 7:42 AM, Jianshi Huang wrote: > Unfortunately /tmp mo

Re: Spark fpg large basket

2015-03-11 Thread Sean Owen
I don't think there is enough information here. Where is the program spending its time? where does it "stop"? how many partitions are there? On Wed, Mar 11, 2015 at 7:10 AM, Akhil Das wrote: > You need to set spark.cores.max to a number say 16, so that on all 4 > machines the tasks will get distr

Re: How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
Thanks Sean. I'll ask our Hadoop admin. Actually I didn't find hadoop.tmp.dir in the Hadoop settings...using user home is suggested by other users. Jianshi On Wed, Mar 11, 2015 at 3:51 PM, Sean Owen wrote: > You shouldn't use /tmp, but it doesn't mean you should use user home > directories eit

Re: Spark fpg large basket

2015-03-11 Thread Sean Barzilay
The program spends its time when I am writing the output to a text file and I am using 70 partitions On Wed, 11 Mar 2015 9:55 am Sean Owen wrote: > I don't think there is enough information here. Where is the program > spending its time? where does it "stop"? how many partitions are > there? > >

Example of partitionBy in pyspark

2015-03-11 Thread Stephen Boesch
I am finding that partitionBy is hanging - and it is not clear whether the custom partitioner is even being invoked (I put an exception in there and cannot see it in the worker logs). The structure is similar to the following: inputPairedRdd = sc.parallelize([{0:"Entry1",1,"Entry2"}]) def ident

skewed outer join with spark 1.2.0 - memory consumption

2015-03-11 Thread Marcin Cylke
Hi I'm trying to do a join of two datasets: 800GB with ~50MB. My code looks like this: private def parseClickEventLine(line: String, jsonFormatBC: Broadcast[LazyJsonFormat]): ClickEvent = { val json = line.parseJson.asJsObject val eventJson = if (json.fields.contains("recommendationId

Re: Example of partitionBy in pyspark

2015-03-11 Thread Ted Yu
Should the comma after 1 be a colon? > On Mar 11, 2015, at 1:41 AM, Stephen Boesch wrote: > > > I am finding that partitionBy is hanging - and it is not clear whether the > custom partitioner is even being invoked (i put an exception in there and can > not see it in the worker logs). > > Th

"Timed out while stopping the job generator" plus subsequent failures

2015-03-11 Thread Tobias Pfeiffer
Hi, it seems like I am unable to shut down my StreamingContext properly, both in local[n] and yarn-cluster mode. In addition, (only) in yarn-cluster mode, subsequent use of a new StreamingContext will raise an InvalidActorNameException. I use logger.info("stoppingStreamingContext") staticStr

hbase sql query

2015-03-11 Thread Udbhav Agarwal
Hi, How can we simply cache an HBase table and run SQL queries on it via the Java API in Spark? Thanks, Udbhav Agarwal

Running Spark from Scala source files other than main file

2015-03-11 Thread Aung Kyaw Htet
Hi Everyone, I am developing a Scala app in which the main object does not create the SparkContext; another object defined in the same package creates it, runs the Spark operations, and closes it. The jar file is built successfully in Maven, but when I call spark-submit with this jar, that spark do

Re: Spark fpg large basket

2015-03-11 Thread Sean Owen
Have you looked at how big your output is? for example, if your min support is very low, you will output a massive volume of frequent item sets. If that's the case, then it may be expected that it's taking ages to write terabytes of data. On Wed, Mar 11, 2015 at 8:34 AM, Sean Barzilay wrote: > Th

Unable to saveToCassandra while cassandraTable works fine

2015-03-11 Thread Tiwari, Tarun
Hi, I have been stuck on this for 3 days now. I am using the spark-cassandra-connector with Spark, and I am able to make RDDs with the sc.cassandraTable function, which means Spark is able to communicate with Cassandra properly. But somehow saveToCassandra is not working. Below are the steps I am doing.
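For comparison, a minimal saveToCassandra sketch with the connector (keyspace, table, and column names are placeholders; spark.cassandra.connection.host is assumed to be set in the SparkConf):

    import com.datastax.spark.connector._

    // Each tuple maps positionally onto the listed columns
    val data = sc.parallelize(Seq(("spark", 10), ("cassandra", 5)))
    data.saveToCassandra("ks", "words", SomeColumns("word", "count"))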

Re: Joining data using Latitude, Longitude

2015-03-11 Thread Manas Kar
There are a few techniques currently available. GeoMesa, which uses GeoHash, can also prove useful (https://github.com/locationtech/geomesa). Another potential candidate is https://github.com/Esri/gis-tools-for-hadoop, especially https://github.com/Esri/geometry-api-java for inner customization. If

Re: Spark fpg large basket

2015-03-11 Thread Sean Barzilay
My min support is low, and after filling up all my space I am applying a filter on the results to only get the item sets that interest me On Wed, 11 Mar 2015 1:58 pm Sean Owen wrote: > Have you looked at how big your output is? for example, if your min > support is very low, you will output a massiv

Re: skewed outer join with spark 1.2.0 - memory consumption

2015-03-11 Thread Marcin Cylke
On Wed, 11 Mar 2015 11:19:56 +0100 Marcin Cylke wrote: > Hi > > I'm trying to do a join of two datasets: 800GB with ~50MB. The job finishes if I set spark.yarn.executor.memoryOverhead to 2048MB. If it is around 1000MB it fails with "executor lost" errors. My spark settings are: - executor cor
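For reference, the overhead can be set in the application's SparkConf (the value mirrors the one that worked in this thread):

    import org.apache.spark.SparkConf

    val conf = new SparkConf().set("spark.yarn.executor.memoryOverhead", "2048")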

PairRDD serialization exception

2015-03-11 Thread manasdebashiskar
(This is a repost. Maybe a simpler subject will fetch more attention among experts.) Hi, I have a CDH 5.3.2 (Spark 1.2) cluster. I am getting a local class incompatible exception for my Spark application during an action. All my classes are case classes (to the best of my knowledge). Appreciate any hel

Re: SQL with Spark Streaming

2015-03-11 Thread Jason Dai
Yes, a previous prototype is available https://github.com/Intel-bigdata/spark-streamsql, and a talk is given at last year's Spark Summit ( http://spark-summit.org/2014/talk/streamsql-on-spark-manipulating-streams-by-sql-using-spark ) We are currently porting the prototype to use the latest DataFra

Re: Writing wide parquet file in Spark SQL

2015-03-11 Thread Ravindra
I am also keen to learn the answer to this, but as an alternative you can use Hive to create a table "stored as parquet" and then use it in Spark. On Wed, Mar 11, 2015 at 1:44 AM kpeng1 wrote: > Hi All, > > I am currently trying to write a very wide file into parquet using spark > sql. I have 100K
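A rough sketch of that workaround via HiveContext (table and column names are placeholders):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql("CREATE TABLE wide_table (c0 STRING, c1 STRING) STORED AS PARQUET")
    hiveContext.sql("INSERT OVERWRITE TABLE wide_table SELECT c0, c1 FROM source_table")
    val result = hiveContext.sql("SELECT * FROM wide_table")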

Re: PairRDD serialization exception

2015-03-11 Thread Sean Owen
This usually means you are mixing different versions of code. Here it is complaining about a Spark class. Are you sure you built vs the exact same Spark binaries, and are not including them in your app? On Wed, Mar 11, 2015 at 1:40 PM, manasdebashiskar wrote: > (This is a repost. May be a simpler

Define exception handling on lazy elements?

2015-03-11 Thread Michal Klos
Hi Spark Community, We would like to define exception handling behavior on RDD instantiation / build. Since the RDD is lazily evaluated, it seems like we are forced to put all exception handling in the first action call? This is an example of something that would be nice: def myRDD = { Try { val

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-11 Thread Cesar Flores
Hi: Thanks for both answers. One final question: isn't registerTempTable an extra step that the SQL queries need, which may decrease performance compared to the language-integrated method calls? The thing is that I am planning to use them in the current version of the ML Pipeline transform

RE: Is it possible to use windows service to start and stop spark standalone cluster

2015-03-11 Thread Wang, Ningjun (LNG-NPV)
Thanks for the suggestion. I will try that. Ningjun From: Silvio Fiorito [mailto:silvio.fior...@granturing.com] Sent: Wednesday, March 11, 2015 12:40 AM To: Wang, Ningjun (LNG-NPV); user@spark.apache.org Subject: Re: Is it possible to use windows service to start and stop spark standalone clust

Re: SQL with Spark Streaming

2015-03-11 Thread Irfan Ahmad
Got a 404 on that link: https://github.com/Intel-bigdata/spark-streamsql *Irfan Ahmad* CTO | Co-Founder | *CloudPhysics* On Wed, Mar 11, 2015 at 6:41 AM, J

Re: Define exception handling on lazy elements?

2015-03-11 Thread Michal Klos
Is there a way to have the exception handling go lazily along with the definition? e.g... we define it on the RDD but then our exception handling code gets triggered on that first action... without us having to define it on the first action? (e.g. that RDD code is boilerplate and we want to just h

Re: GraphX Snapshot Partitioning

2015-03-11 Thread Matthew Bucci
Hi, Thanks for the response! That answered some questions I had, but the last one I was wondering is what happens if you run a partition strategy and one of the partitions ends up being too large? For example, let's say partitions can hold 64MB (actually knowing the maximum possible size of a part

Re: SQL with Spark Streaming

2015-03-11 Thread Jason Dai
Sorry typo; should be https://github.com/intel-spark/stream-sql Thanks, -Jason On Wed, Mar 11, 2015 at 10:19 PM, Irfan Ahmad wrote: > Got a 404 on that link: https://github.com/Intel-bigdata/spark-streamsql > > > *Irfan Ahmad* > CTO | Co-Founder | *CloudPhysics* >

Re: Define exception handling on lazy elements?

2015-03-11 Thread Michal Klos
Well I'm thinking that this RDD would fail to build in a specific way... different from the subsequent code (e.g. s3 access denied or timeout on connecting to a database) So for example, define the RDD failure handling on the RDD, define the action failure handling on the action? Does this make s

Re: Define exception handling on lazy elements?

2015-03-11 Thread Sean Owen
Hm, but you already only have to define it in one place, rather than on each transformation. I thought you wanted exception handling at each transformation? Or do you want it once for all actions? You can enclose all actions in a try-catch block, I suppose, to write exception handling code once. Y
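A minimal sketch of the enclose-the-action approach (path and computation are placeholders):

    import scala.util.{Try, Success, Failure}

    // Transformations are lazy, so a Try around the single action below also
    // catches failures raised while evaluating the whole lineage (e.g. S3 access denied).
    val rdd = sc.textFile("s3n://bucket/path").map(_.length)
    Try(rdd.count()) match {
      case Success(n)  => println(s"processed $n records")
      case Failure(ex) => println(s"job failed: ${ex.getMessage}")
    }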

SVM questions (data splitting, SVM parameters)

2015-03-11 Thread Natalia Connolly
Hello, I am new to Spark and I am evaluating its suitability for my machine learning tasks. I am using Spark v. 1.2.1. I would really appreciate if someone could provide any insight about the following two issues. 1. I'd like to try a "leave one out" approach for training my SVM, meaning t

Re: SVM questions (data splitting, SVM parameters)

2015-03-11 Thread Sean Owen
Jackknife / leave-one-out is just a special case of bootstrapping. In general I don't think you'd ever do it this way with large data, as you are creating N-1 models, which could be billions. It's suitable when you have very small data. So I suppose I'd say the general bootstrapping you see in thi

bad symbolic reference. A signature in SparkContext.class refers to term conf in value org.apache.hadoop which is not available

2015-03-11 Thread Patcharee Thongtra
Hi, I have built Spark version 1.3 and tried to use it in my Spark Scala application. When I tried to compile and build the application with SBT, I got the error: bad symbolic reference. A signature in SparkContext.class refers to term conf in value org.apache.hadoop which is not available It se

Re: Joining data using Latitude, Longitude

2015-03-11 Thread Ankur Srivastava
Thank you everyone!! I have started implementing the join using the geohash and using the first 4 alphabets of the HASH as the key. Can I assign a Confidence factor in terms of distance based on number of characters matching in the HASH code? I will also look at the other options listed here. Th

Read parquet folders recursively

2015-03-11 Thread Masf
Hi all Is it possible to read folders recursively in order to read Parquet files? Thanks. -- Saludos. Miguel Ángel

Re: PairRDD serialization exception

2015-03-11 Thread Manas Kar
Hi Sean, Below are the sbt dependencies that I am using. I gave it another try by removing the "provided" keyword, which failed with the same error. What confuses me is that the stack trace appears after a few of the stages have already run completely. object V { val spark = "1.2.0-cdh5.3.0" v
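For reference, a minimal build.sbt sketch of the "provided" pattern (version and repository URL are illustrative; match them to your cluster):

    val sparkVersion = "1.2.0-cdh5.3.0"

    resolvers += "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

    // "provided" keeps the cluster's own Spark jars out of the assembly,
    // which avoids the version mixing mentioned earlier in the thread.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion % "provided"
    )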

Architecture Documentation

2015-03-11 Thread Mohit Anchlia
Is there a good architecture doc that gives a sufficient overview of high level and low level details of spark with some good diagrams?

Scaling problem in RandomForest?

2015-03-11 Thread insperatum
Hi, the Random Forest implementation (1.2.1) is repeatably crashing when I increase the depth to 20. I generate random synthetic data (36 workers, 1,000,000 examples per worker, 30 features per example) as follows: val data = sc.parallelize(1 to 36, 36).mapPartitionsWithIndex((i, _) => {

Spark Streaming recover from Checkpoint with Spark SQL

2015-03-11 Thread Marius Soutier
Hi, I’ve written a Spark Streaming job that inserts into a Parquet table, using stream.foreachRDD(_.insertInto(“table”, overwrite = true)). Now I’ve added checkpointing; everything works fine when starting from scratch. When starting from a checkpoint however, the job doesn’t work and produces the foll

Re: Architecture Documentation

2015-03-11 Thread Vijay Saraswat
I've looked at the Zaharia PhD thesis to get an idea of the underlying system. http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.html On 3/11/15 12:48 PM, Mohit Anchlia wrote: Is there a good architecture doc that gives a sufficient overview of high level and low level details of sp

Re: [SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-11 Thread Michael Armbrust
That val is not really your problem. In general, there is a lot of global state throughout the Hive codebase that makes it unsafe to try to connect to more than one Hive installation from the same JVM. On Tue, Mar 10, 2015 at 11:36 PM, Haopu Wang wrote: > Hao, thanks for the response. > > > >

Re: How to read from hdfs using spark-shell in Intel hadoop?

2015-03-11 Thread Arush Kharbanda
You can add resolvers in SBT using resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots" On Thu, Feb 26, 2015 at 4:09 PM, MEETHU MATHEW wrote: > Hi, > > I am not able to read from HDFS(Intel distribution hadoop,Hadoop version > is 1.0.3) from spa

Re: Spark Streaming recover from Checkpoint with Spark SQL

2015-03-11 Thread Marius Soutier
Forgot to mention, it works when using .foreachRDD(_.saveAsTextFile(“”)). > On 11.03.2015, at 18:35, Marius Soutier wrote: > > Hi, > > I’ve written a Spark Streaming Job that inserts into a Parquet, using > stream.foreachRDD(_insertInto(“table”, overwrite = true). Now I’ve added > checkpointi

Writing to a single file from multiple executors

2015-03-11 Thread SamyaMaiti
Hi Experts, I have a scenario wherein I want to write to an Avro file from a streaming job that reads data from Kafka. But the issue is that, as there are multiple executors and all of them try to write to a given file, I get a concurrent exception. A way to mitigate the issue is to repartition & have a
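A minimal sketch of the repartition approach (using saveAsTextFile in place of the Avro writer for brevity; "stream" stands for the Kafka DStream and the output path is a placeholder):

    // Collapse each batch to a single partition so only one task writes,
    // at the cost of funnelling all of the batch's data through that task.
    stream.foreachRDD { (rdd, time) =>
      rdd.repartition(1).saveAsTextFile(s"hdfs:///output/batch-${time.milliseconds}")
    }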

Which strategy is used for broadcast variables?

2015-03-11 Thread Tom
In "Performance and Scalability of Broadcast in Spark" by Mosharaf Chowdhury I read that Spark uses HDFS for its broadcast variables. This seems highly inefficient. In the same paper alternatives are proposed, among which "Bittorent Broadcast (BTB)". While studying "Learning Spark," page 105, secon

RE: Which strategy is used for broadcast variables?

2015-03-11 Thread Mosharaf Chowdhury
Hi Tom, That's an outdated document from 4-5 years ago. Spark currently uses a BitTorrent-like mechanism that's been tuned for datacenter environments. Mosharaf -Original Message- From: "Tom" Sent: 3/11/2015 4:58 PM To: "user@spark.apache.org" Subject: Which strategy is used for

Spark SQL using Hive metastore

2015-03-11 Thread Grandl Robert
Hi guys, I am a newbie in running Spark SQL / Spark. My goal is to run some TPC-H queries atop Spark SQL using Hive metastore.  It looks like spark 1.2.1 release has Spark SQL / Hive support. However, I am not able to fully connect all the dots. I did the following: 1. Copied hive-site.xml fro

SVD transform of large matrix with MLlib

2015-03-11 Thread sergunok
Has anybody used SVD from MLlib for a very large (like 10^6 x 10^7) sparse matrix? How long did it take? What implementation of SVD is used in MLlib? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SVD-transform-of-large-matrix-with-MLlib-tp22005.html Sent

can spark take advantage of ordered data?

2015-03-11 Thread Jonathan Coveney
Hello all, I am wondering if spark already has support for optimizations on sorted data and/or if such support could be added (I am comfortable dropping to a lower level if necessary to implement this, but I'm not sure if it is possible at all). Context: we have a number of data sets which are es

PySpark: Python 2.7 cluster installation script (with Numpy, IPython, etc)

2015-03-11 Thread Sebastián Ramírez
Many times, when I'm setting up a cluster, I have to use an operating system (such as RedHat or CentOS 6.5) which has an old version of Python (Python 2.6), for example when using a Hadoop distribution that only supports those operating systems (such as Hortonworks' HDP or Cloudera). And that also makes i

Re: Is it possible to use windows service to start and stop spark standalone cluster

2015-03-11 Thread Yana Kadiyska
You might also want to see if the Windows Task Scheduler helps with that. I have not used it with Windows 2008 R2, but it generally does allow you to schedule a bat file to run on startup On Wed, Mar 11, 2015 at 10:16 AM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexis.com> wrote: > > Thanks for the suggestio

Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Tom Hubregtsen
Thanks Mosharaf, for the quick response! Can you maybe give me some pointers to an explanation of this strategy? Or elaborate a bit more on it? Which parts are involved in which way? Where are the time penalties and how scalable is this implementation? Thanks again, Tom On 11 March 2015 at 16:01

Re: Writing to a single file from multiple executors

2015-03-11 Thread Tathagata Das
Why do you have to write a single file? On Wed, Mar 11, 2015 at 1:00 PM, SamyaMaiti wrote: > Hi Experts, > > I have a scenario, where in I want to write to a avro file from a streaming > job that reads data from kafka. > > But the issue is, as there are multiple executors and when all try to w

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-11 Thread Tobias Pfeiffer
Hi, On Wed, Mar 11, 2015 at 11:05 PM, Cesar Flores wrote: > > Thanks for both answers. One final question. *This registerTempTable is > not an extra process that the SQL queries need to do that may decrease > performance over the language integrated method calls? * > As far as I know, registerTe

Re: SQL with Spark Streaming

2015-03-11 Thread Tobias Pfeiffer
Hi, On Thu, Mar 12, 2015 at 12:08 AM, Huang, Jie wrote: > > According to my understanding, your approach is to register a series of > tables by using transformWith, right? And then, you can get a new Dstream > (i.e., SchemaDstream), which consists of lots of SchemaRDDs. > Yep, it's basically the

Re: How to preserve/preset partition information when load time series data?

2015-03-11 Thread Imran Rashid
It should be *possible* to do what you want ... but if I understand you right, there isn't really any very easy way to do it. I think you would need to write your own subclass of RDD, which has its own logic on how the input files get divided among partitions. You can probably subclass Hadoop

JavaSparkContext - jarOfClass or jarOfObject dont work

2015-03-11 Thread Nirav Patel
Hi I am trying to run my Spark service against a cluster. As it turns out I have to do setJars and set my application jar in there. If I do it using a physical path like the following it works `conf.setJars(new String[]{"/path/to/jar/Sample.jar"});` but if I try to use JavaSparkContext (or SparkContext) a

Fwd: Numbering RDD members Sequentially

2015-03-11 Thread Steve Lewis
-- Forwarded message -- From: Steve Lewis Date: Wed, Mar 11, 2015 at 9:13 AM Subject: Re: Numbering RDD members Sequentially To: "Daniel, Ronald (ELS-SDG)" Perfect - exactly what I was looking for. Not quite sure why it is called zipWithIndex since zipping is not involved my cod
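For reference, a minimal zipWithIndex sketch: each element is paired with its position, as if zipped with the implicit sequence 0, 1, 2, ...

    val rdd = sc.parallelize(Seq("a", "b", "c"), 2)
    val numbered = rdd.zipWithIndex()   // RDD[(String, Long)]
    numbered.collect()                  // Array((a,0), (b,1), (c,2))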

Re: Top, takeOrdered, sortByKey

2015-03-11 Thread Imran Rashid
I am not entirely sure I understand your question -- are you saying: * scoring a sample of 50k events is fast * taking the top N scores of 77M events is slow, no matter what N is ? if so, this shouldn't come as a huge surprise. You can't find the top scoring elements (no matter how small N is)

RE: Spark SQL using Hive metastore

2015-03-11 Thread java8964
You need to include the Hadoop native library in your spark-shell/spark-sql, assuming your Hadoop native library includes the native Snappy library: spark-sql --driver-library-path point_to_your_hadoop_native_library In spark-sql, you can use any command just as you would in the Hive CLI. Yong Date: Wed,

Re: Process time series RDD after sortByKey

2015-03-11 Thread Imran Rashid
this is a very interesting use case. First of all, its worth pointing out that if you really need to process the data sequentially, fundamentally you are limiting the parallelism you can get. Eg., if you need to process the entire data set sequentially, then you can't get any parallelism. If you

Re: Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-03-11 Thread Imran Rashid
I'm not sure what you mean. Are you asking how you can recompile all of spark and deploy it, instead of using one of the pre-built versions? https://spark.apache.org/docs/latest/building-spark.html Or are you seeing compile problems specifically w/ HighlyCompressedMapStatus? The code compiles

RE: can spark take advantage of ordered data?

2015-03-11 Thread java8964
RangePartitioner? At least for join, you can implement your own partitioner, to utilize the sorted data. Just my 2 cents. Date: Wed, 11 Mar 2015 17:38:04 -0400 Subject: can spark take advantage of ordered data? From: jcove...@gmail.com To: User@spark.apache.org Hello all, I am wondering if spark

Re: How to use more executors

2015-03-11 Thread Du Li
Is it being merged in the next release? It's indeed a critical patch! Du On Wednesday, January 21, 2015 3:59 PM, Nan Zhu wrote: …not sure when will it be reviewed… but for now you can work around by allowing multiple worker instances on a single machine  http://spark.apache.org/docs

Re: saveAsTextFile extremely slow near finish

2015-03-11 Thread Imran Rashid
Is your data skewed? Could it be that there are a few keys with a huge number of records? You might consider outputting (recordA, count) (recordB, count) instead of recordA recordA recordA ... You could do this with: input = sc.textFile pairsCounts = input.map{x => (x,1)}.reduceByKey{_ + _}

Re: How to use more executors

2015-03-11 Thread Nan Zhu
At least 1.4, I think. For now, using YARN or allowing multiple worker instances is just fine Best, -- Nan Zhu http://codingcat.me On Wednesday, March 11, 2015 at 8:42 PM, Du Li wrote: > Is it being merged in the next release? It's indeed a critical patch! > > Du > > > On Wednesday, J

RE: Spark SQL using Hive metastore

2015-03-11 Thread Cheng, Hao
Hi, Robert Spark SQL currently only supports Hive 0.12.0 (you need to re-compile the package) and 0.13.1 (by default); I am not so sure if it supports the Hive 0.14 metastore service as a backend. Another way you can try is to configure $SPARK_HOME/conf/hive-site.xml to access the remote metastore dat
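For reference, pointing at a remote metastore is typically done through the hive.metastore.uris property in $SPARK_HOME/conf/hive-site.xml (host and port below are placeholders):

    <configuration>
      <property>
        <name>hive.metastore.uris</name>
        <value>thrift://metastore-host:9083</value>
      </property>
    </configuration>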

Re: "Timed out while stopping the job generator" plus subsequent failures

2015-03-11 Thread Tobias Pfeiffer
Sean, On Wed, Mar 11, 2015 at 7:43 PM, Tobias Pfeiffer wrote: > > it seems like I am unable to shut down my StreamingContext properly, both > in local[n] and yarn-cluster mode. In addition, (only) in yarn-cluster > mode, subsequent use of a new StreamingContext will raise > an InvalidActorNameExc

Re: How to use more executors

2015-03-11 Thread Du Li
Is it possible to extend this PR further (or create another PR) to allow for per-node configuration of workers? There are many discussions about heterogeneous Spark clusters. Currently the configuration on the master will override those on the workers. Many Spark users have the need for having machines

Re: Numbering RDD members Sequentially

2015-03-11 Thread Mark Hamstra
> > not quite sure why it is called zipWithIndex since zipping is not involved > It isn't? http://stackoverflow.com/questions/1115563/what-is-zip-functional-programming On Wed, Mar 11, 2015 at 5:18 PM, Steve Lewis wrote: > > -- Forwarded message -- > From: Steve Lewis > Date: W

Re: How to use more executors

2015-03-11 Thread Nan Zhu
I think this should go into another PR. Can you create a JIRA for that? Best, -- Nan Zhu http://codingcat.me On Wednesday, March 11, 2015 at 8:50 PM, Du Li wrote: > Is it possible to extend this PR further (or create another PR) to allow for > per-node configuration of workers? > > There

Re: Running Spark from Scala source files other than main file

2015-03-11 Thread Imran Rashid
did you forget to specify the main class w/ "--class Main"? though if that was it, you should at least see *some* error message, so I'm confused myself ... On Wed, Mar 11, 2015 at 6:53 AM, Aung Kyaw Htet wrote: > Hi Everyone, > > I am developing a scala app, in which the main object does not ca

Re: can spark take advantage of ordered data?

2015-03-11 Thread Imran Rashid
Hi Jonathan, you might be interested in https://issues.apache.org/jira/browse/SPARK-3655 (not yet available) and https://github.com/tresata/spark-sorted (not part of spark, but it is available right now). Hopefully thats what you are looking for. To the best of my knowledge that covers what is a

StreamingListener

2015-03-11 Thread Corey Nolet
Given the following scenario: dstream.map(...).filter(...).window(...).foreachrdd() When would the onBatchCompleted fire?

Re: Spark Streaming recover from Checkpoint with Spark SQL

2015-03-11 Thread Tathagata Das
Can you show us the code that you are using? This might help. This is the updated streaming programming guide for 1.3, soon to be up, this is a quick preview. http://people.apache.org/~tdas/spark-1.3.0-temp-docs/streaming-programming-guide.html#dataframe-and-sql-operations TD On Wed, Mar 11, 201

Re: Top, takeOrdered, sortByKey

2015-03-11 Thread Saba Sehrish
Let me clarify - taking 1 elements of 5 elements using top or takeOrdered is taking about 25-50s which seems to be slow. I also try to use sortByKey to sort the elements to get a time estimate and I get numbers in the same range. I'm running this application on a cluster with 5 nodes and

Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Mosharaf Chowdhury
The current broadcast algorithm in Spark approximates the one described in Section 5 of this paper. It is expected to scale sub-linearly; i.e., O(log N), where N is the number of machines in your cluster. We evaluated up to 10

Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Tom Hubregtsen
Those results look very good for the larger workloads (100MB and 1GB). Were you also able to run experiments for smaller amounts of data? For instance broadcasting a single variable to the entire cluster? In the paper you state that HDFS-based mechanisms performed well only for small amounts of dat

Re: SVD transform of large matrix with MLlib

2015-03-11 Thread Reza Zadeh
Answers: databricks.com/blog/2014/07/21/distributing-the-singular-value-decomposition-with-spark.html Reza On Wed, Mar 11, 2015 at 2:33 PM, sergunok wrote: > Does somebody used SVD from MLlib for very large (like 10^6 x 10^7) sparse > matrix? > What time did it take? > What implementation of SVD

Re: "Timed out while stopping the job generator" plus subsequent failures

2015-03-11 Thread Tobias Pfeiffer
Hi, I discovered what caused my issue when running on YARN and was able to work around it. On Wed, Mar 11, 2015 at 7:43 PM, Tobias Pfeiffer wrote: > The processing itself is complete, i.e., the batch currently processed at > the time of stop() is finished and no further batches are processed. >

Re: sc.textFile() on windows cannot access UNC path

2015-03-11 Thread Akhil Das
Sounds like the way of doing it. Could you try accessing a file from the UNC path with native Java NIO code and make sure it is able to access it with the URI file:10.196.119.230/folder1/abc.txt? Thanks Best Regards On Wed, Mar 11, 2015 at 7:45 PM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexis.c

Re: Partitioning Dataset and Using Reduce in Apache Spark

2015-03-11 Thread raghav0110.cs
Thank you very much for your detailed response, it was very informative and cleared up some of my misconceptions. After your explanation, I understand that the distribution of the data and parallelism is all meant to be an abstraction to the developer. In your response you say “When you ca

Re: hbase sql query

2015-03-11 Thread Akhil Das
Like this? val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result]).cache() Here's a complete example
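Building on that, a rough sketch of the SQL part on Spark 1.2 might look like the following (table name, column family "cf", and qualifier "c1" are placeholders, not from the thread):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.sql.SQLContext

    case class HRow(rowkey: String, c1: String)

    val hConf = HBaseConfiguration.create()
    hConf.set(TableInputFormat.INPUT_TABLE, "my_table")

    val hBaseRDD = sc.newAPIHadoopRDD(hConf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result]).cache()

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicitly turns an RDD of case classes into a SchemaRDD

    // Extract the row key and one column into the case class
    val rows = hBaseRDD.map { case (key, result) =>
      HRow(Bytes.toString(key.get()),
           Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("c1"))))
    }
    rows.registerTempTable("my_table")
    sqlContext.sql("SELECT rowkey, c1 FROM my_table WHERE c1 = 'foo'").collect()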

Re: bad symbolic reference. A signature in SparkContext.class refers to term conf in value org.apache.hadoop which is not available

2015-03-11 Thread Akhil Das
Spark 1.3.0 is not officially out yet, so I don't think sbt will download the Hadoop dependencies for your Spark build by itself. You could try manually adding the Hadoop dependencies yourself (hadoop-core, hadoop-common, hadoop-client). Thanks Best Regards On Wed, Mar 11, 2015 at 9:07 PM, Patcharee Tho
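For example, in build.sbt (versions are illustrative; match them to your cluster):

    libraryDependencies ++= Seq(
      "org.apache.spark"  %% "spark-core"    % "1.3.0",
      "org.apache.hadoop" %  "hadoop-client" % "2.4.0",
      "org.apache.hadoop" %  "hadoop-common" % "2.4.0"
    )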

Re: StreamingListener

2015-03-11 Thread Akhil Das
At the end of foreachRDD, I believe. Thanks Best Regards On Thu, Mar 12, 2015 at 6:48 AM, Corey Nolet wrote: > Given the following scenario: > > dstream.map(...).filter(...).window(...).foreachrdd() > > When would the onBatchCompleted fire? > >

Re: Read parquet folders recursively

2015-03-11 Thread Akhil Das
Hi We have a custom build to read directories recursively. Currently we use it with fileStream like: val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/datadumps/", (t: Path) => true, true, *true*) Setting the 4th argument to true makes it read recursively. You could give it a try h