Re: training recsys model

2014-08-14 Thread Xiangrui Meng
Try many combinations of parameters on a small dataset, find the best, and then try to map them to a big dataset. You can also reduce the search region iteratively based on the best combination in the current iteration. -Xiangrui On Wed, Aug 13, 2014 at 1:13 AM, Hoai-Thu Vuong wrote: > Thank you
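A minimal sketch of that kind of coarse search on a sample, assuming `ratings` is an RDD[Rating] for an ALS recommender; the parameter grid, sample fraction, and rmse helper are illustrative, not from the original message:
~~~
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// Simple RMSE of a factorization model on a validation set.
def rmse(model: MatrixFactorizationModel, data: RDD[Rating]): Double = {
  val predictions = model.predict(data.map(r => (r.user, r.product)))
                         .map(r => ((r.user, r.product), r.rating))
  val actual = data.map(r => ((r.user, r.product), r.rating))
  math.sqrt(actual.join(predictions).values.map { case (a, p) => (a - p) * (a - p) }.mean())
}

// Coarse grid on a small sample; the winning (rank, lambda) is then reused on the full data.
val sample = ratings.sample(false, 0.01, 42L)
val Array(train, validation) = sample.randomSplit(Array(0.8, 0.2), seed = 42L)
val best = (for (rank <- Seq(10, 20, 50); lambda <- Seq(0.01, 0.1, 1.0))
              yield ((rank, lambda), rmse(ALS.train(train, rank, 10, lambda), validation)))
           .minBy(_._2)
~~~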

Re: Spark Akka/actor failures.

2014-08-14 Thread Xiangrui Meng
Could you try to map it to row-major format first? Your approach may generate multiple copies of the data. The code should look like this: ~~~ val rows = rdd.map { case (j, values) => values.view.zipWithIndex.map { case (v, i) => (i, (j, v)) } }.groupByKey().map { case (i, entries) => Vectors
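A hedged completion of that idea, assuming the input is an RDD of (columnIndex, columnValues) pairs and that the total number of columns is known; note the flatMap so all (rowIndex, (colIndex, value)) entries land in one flat pair RDD:
~~~
import org.apache.spark.mllib.linalg.Vectors

// colRdd: RDD[(Int, Array[Double])] -- one record per column (assumed input)
// numCols: total number of columns in the matrix (assumed known)
val rows = colRdd.flatMap { case (j, values) =>
  values.zipWithIndex.map { case (v, i) => (i, (j, v)) }   // (rowIndex, (colIndex, value))
}.groupByKey().map { case (i, entries) =>
  (i, Vectors.sparse(numCols, entries.toSeq))              // one sparse row vector per row
}
~~~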

Re: Job aborted due to stage failure: TID x failed for unknown reasons

2014-08-14 Thread jerryye
bump. same problem here. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Job-aborted-due-to-stage-failure-TID-x-failed-for-unknown-reasons-tp10187p12095.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

how to use the method saveAsTextFile of a RDD like javaRDD

2014-08-14 Thread Gefei Li
Hello, I wrote a class named BooleanPair: public static class BooleanPairet implements Serializable{ public Boolean elementBool1; public Boolean elementBool2; BooleanPair(Boolean bool1, Boolean bool2){elementBool1 = bool1; elementBool2 = bool2;} public String to

Re: how to use the method saveAsTextFile of a RDD like javaRDD

2014-08-14 Thread Tathagata Das
FlatMap the JavaRDD to JavaRDD. Then it should work. TD On Thu, Aug 14, 2014 at 1:23 AM, Gefei Li wrote: > Hello, > I wrote a class named BooleanPair: > > public static class BooleanPairet implements Serializable{ > public Boolean elementBool1; > public Boolean elementBool2
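Since saveAsTextFile just writes each element's toString, a minimal Scala sketch of the same idea -- flatten the arrays and give the pair class a readable toString (the output path is hypothetical):
~~~
case class BooleanPair(b1: Boolean, b2: Boolean) {
  override def toString = s"$b1,$b2"   // this is what saveAsTextFile writes per line
}

val pairArrays = sc.parallelize(Seq(
  Array(BooleanPair(true, false), BooleanPair(false, true)),
  Array(BooleanPair(true, true))
))
pairArrays.flatMap(arr => arr).saveAsTextFile("/tmp/boolean-pairs")
~~~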

read performance issue

2014-08-14 Thread Gurvinder Singh
Hi, I am running Spark built directly from git. I recently compiled the newer Aug 13 version and it has a performance drop of 2-3x when reading from HDFS compared to the Aug 1 git version. So I am wondering which commit could have caused such an issue in read performance. The performance is almost sam

Re: how to use the method saveAsTextFile of a RDD like javaRDD

2014-08-14 Thread Gefei Li
Thank you! It works so well for me! Regards, Gefei On Thu, Aug 14, 2014 at 4:25 PM, Tathagata Das wrote: > FlatMap the JavaRDD to JavaRDD. Then it should > work. > > TD > > > On Thu, Aug 14, 2014 at 1:23 AM, Gefei Li wrote: > >> Hello, >> I wrote a class named BooleanPair: >> >> public st

Re: how to use the method saveAsTextFile of a RDD like javaRDD

2014-08-14 Thread Hoai-Thu Vuong
I've found the method saveAsObjectFile in RDD (or JavaRDD). I think we can save this array to a file and load it back as objects when reading the file. However, I think I know the way to load it back and cast the RDD to the specific object type, but need time to try. On Thu, Aug 14, 2014 at 3:48 PM, Gefei Li wrote: > Thank you
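A small sketch of that save/load round trip (the path and element type are illustrative):
~~~
// Save any RDD of serializable elements as a sequence of serialized objects...
val pairs = sc.parallelize(Seq((1, "a"), (2, "b")))
pairs.saveAsObjectFile("/tmp/pairs-obj")

// ...and load it back; sc.objectFile takes the element type as a type parameter.
val reloaded = sc.objectFile[(Int, String)]("/tmp/pairs-obj")
~~~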

Re: Viewing web UI after fact

2014-08-14 Thread Grzegorz Białek
Hi, Thank you both for your answers. Browsing using the Master UI works fine. Unfortunately the History Server shows "No Completed Applications Found" even though logs exist under the given directory, but using the Master UI is enough for me. Best regards, Grzegorz On Wed, Aug 13, 2014 at 8:09 PM, Andrew Or wrot

Re: How to direct insert values into SparkSQL tables?

2014-08-14 Thread chutium
Oh, right, I meant within SQLContext alone: a SchemaRDD from a text file with a case class -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-direct-insert-vaules-into-SparkSQL-tables-tp11851p12100.html Sent from the Apache Spark User List mailing list archi

Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread Hoai-Thu Vuong
Someone in this community gave me a video: https://www.youtube.com/watch?v=sPhyePwo7FA. I asked the same question in this community and others helped me solve this problem. I'm trying to load a MatrixFactorizationModel from an object file, but the compiler said that I can not create the object because the

Re: how to use the method saveAsTextFile of a RDD like javaRDD

2014-08-14 Thread Gefei Li
It is interesting to save an RDD to disk or HDFS or somewhere else as a set of objects, but I think it's more useful to save it as a text file for debugging or just as an output file. If we want to reuse an RDD, a text file also works, but perhaps a set of object files will bring a decrease in execu

Re: Script to deploy spark to Google compute engine

2014-08-14 Thread Mayur Rustagi
We have a version that is submitted as a PR: https://github.com/sigmoidanalytics/spark_gce/tree/for_spark We are working on a more generic implementation based on lib_cloud... would love to collaborate if you are interested. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_ru

Re: Script to deploy spark to Google compute engine

2014-08-14 Thread Michael Hausenblas
Did you check out http://www.spark-stack.org/spark-cluster-on-google-compute/ already? Cheers, Michael -- Michael Hausenblas Ireland, Europe http://mhausenblas.info/ On 14 Aug 2014, at 05:17, Soumya Simanta wrote: > > Before I start doing something on my own I wanted to chec

Should the memory of worker nodes be constrained to the size of the master node?

2014-08-14 Thread Darin McBeath
I started up a cluster on EC2 (using the provided scripts) and specified a different instance type for the master and the worker nodes. The cluster started fine, but when I looked at the cluster (via port 8080), it showed that the amount of memory available to the worker nodes did not match

Re: Should the memory of worker nodes be constrained to the size of the master node?

2014-08-14 Thread Akhil Das
Hi Darin, This is the piece of code doing the actual work (setting the memory). As you can see, it leaves 15GB of RAM for the OS on a >100GB machine... 2GB of RAM on a 10-20GB machine, etc. You can always set SPARK_WORKER_MEMORY/SPARK_EXECU

Re: Python + Spark unable to connect to S3 bucket .... "Invalid hostname in URI"

2014-08-14 Thread Miroslaw
I have tried that already but still get the same error. To be honest, I feel as though I am missing something obvious with my configuration; I just can't find what it may be. Miroslaw Horbal On Wed, Aug 13, 2014 at 10:38 PM, jerryye [via Apache Spark User List] < ml-node+s1001560n12082...@n3.

Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread Christopher Nguyen
Hi Hoai-Thu, the issue of a private default constructor is unlikely to be the cause here, since Lance was already able to load/deserialize the model object. And on that side topic, I wish all serdes libraries would just use constructor.setAccessible(true) by default :-) Most of the time that privacy is no

Re: spark streaming : what is the best way to make a driver highly available

2014-08-14 Thread Matt Narrell
I’d suggest something like Apache YARN, or Apache Mesos with Marathon or something similar to allow for management, in particular restart on failure. mn On Aug 13, 2014, at 7:15 PM, Tobias Pfeiffer wrote: > Hi, > > On Thu, Aug 14, 2014 at 5:49 AM, salemi wrote: > what is the best way to make

Re: Down-scaling Spark on EC2 cluster

2014-08-14 Thread Shubhabrata
What about down-scaling when I use Mesos, does that really deteriorate the performance ? Otherwise we would probably go for spark on mesos on ec2 :) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Down-scaling-Spark-on-EC2-cluster-tp10494p12109.html Sent fro

Using Spark Streaming to listen to HDFS directory and handle different files by file name

2014-08-14 Thread ZhangYi
As we know, in Spark, SparkContext provides the wholeTextFiles() method to read all files in a given directory and generate an RDD of (fileName, content) pairs: scala> val lines = sc.wholeTextFiles("/Users/workspace/scala101/data") 14/08/14 22:43:02 INFO MemoryStore: ensureFreeSpace(35896) called with
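A minimal sketch of routing files by name from that (fileName, content) RDD (the directory and file-name suffixes are hypothetical):
~~~
val files = sc.wholeTextFiles("hdfs:///incoming/data")   // RDD[(fileName, content)]

// Branch on the file name and give each kind of file its own parsing logic.
val ordersRaw = files.filter { case (name, _) => name.endsWith("orders.csv") }
val usersRaw  = files.filter { case (name, _) => name.endsWith("users.csv") }

val orderLines = ordersRaw.flatMap { case (_, content) => content.split("\n") }
~~~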

Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread Shixiong Zhu
I think I can reproduce this error. The following code does not work and reports that "Foo" cannot be serialized (log in gist https://gist.github.com/zsxwing/4f9f17201d4378fe3e16): class Foo { def foo() = Array(1.0) } val t = new Foo val m = t.foo val r1 = sc.parallelize(List(1, 2, 3)) val r2 = r1.map(_

Re: spark streaming : what is the best way to make a driver highly available

2014-08-14 Thread Silvio Fiorito
You also need to ensure you're using checkpointing and support recreating the context on driver failure as described in the docs here: http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-the-driver-node From: Matt Narrell mailto:matt.narr...@gmail.com>> Date: Thursday
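A condensed sketch of the checkpoint-based recovery pattern from that section of the guide (the checkpoint directory and batch interval are hypothetical):
~~~
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///spark/checkpoints/my-streaming-app"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("ha-streaming-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // define the input DStreams and transformations here
  ssc
}

// On a clean start this builds a fresh context; after a driver failure it
// reconstructs the context and DStream lineage from the checkpoint instead.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
~~~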

Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread lancezhange
The following code works, too: class Foo1 extends Serializable { def foo() = Array(1.0) } val t1 = new Foo1 val m1 = t1.foo val r11 = sc.parallelize(List(1, 2, 3)) val r22 = r11.map(_ + m1(0)) r22.toArray On Thu, Aug 14, 2014 at 10:55 PM, Shixiong Zhu [via Apache Spark User List] wrote: > I th

Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread Shixiong Zhu
I think in the following case: class Foo { def foo() = Array(1.0) } val t = new Foo val m = t.foo val r1 = sc.parallelize(List(1, 2, 3)) val r2 = r1.map(_ + m(0)) r2.toArray Spark should not serialize "t", but it looks like it will. Best Regards, Shixiong Zhu 2014-08-14 23:22 GMT+08:00 lancezhange :
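Besides marking the class Serializable (as in the previous reply), the programming guide's pattern of copying just the needed field into a local val also keeps the closure from referencing the enclosing instance; a hedged sketch:
~~~
class Foo2 {
  val weights = Array(1.0)

  def addTo(rdd: org.apache.spark.rdd.RDD[Int]): org.apache.spark.rdd.RDD[Double] = {
    // Copy the field into a local val so the closure captures only the Array
    // (which is serializable), not `this`.
    val w = weights
    rdd.map(_ + w(0))
  }
}
~~~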

Re: Using Hadoop InputFormat in Python

2014-08-14 Thread Kan Zhang
Good timing! I encountered that same issue recently and to address it, I changed the default Class.forName call to Utils.classForName. See my patch at https://github.com/apache/spark/pull/1916. After that change, my bin/pyspark --jars worked. On Wed, Aug 13, 2014 at 11:47 PM, Tassilo Klein wrote

SPARK_DRIVER_MEMORY

2014-08-14 Thread Brad Miller
Hi All, I have a Spark job for which I need to increase the amount of memory allocated to the driver to collect a large-ish (>200M) data structure. Formerly, I accomplished this by setting SPARK_MEM before invoking my job (which effectively set memory on the driver) and then setting spark.executor

Re: Ways to partition the RDD

2014-08-14 Thread ssb61
You can try something like this: val kvRdd = sc.textFile("rawdata/").map( m => { val pfUser = m.split("\t", 2); (pfUser(0) -> pfUser(1)) })

RE: java.lang.UnknownError: no bin was found for continuous variable.

2014-08-14 Thread Sameer Tilak
Hi Yanbo, I think it was happening because some of the rows did not have all the columns. We are cleaning up the data and will let you know once we confirm this. Date: Thu, 14 Aug 2014 22:50:58 +0800 Subject: Re: java.lang.UnknownError: no bin was found for continuous variable. From: yanboha...@gm

Re: Support for ORC Table in Shark/Spark

2014-08-14 Thread Zhan Zhang
I tried a simple spark-hive select and insert, and it works. But to directly manipulate the ORC file through an RDD, Spark has to be upgraded to support Hive 0.13 first, because some of the ORC API is not exposed until Hive 0.13. Thanks. Zhan Zhang On Aug 11, 2014, at 10:23 PM, vinay.kash...@socialin

MLlib model: viewing and saving

2014-08-14 Thread Sameer Tilak
I have an MLlib model: val model = DecisionTree.train(parsedData, Regression, Variance, maxDepth) I see the model has the following methods: algo, asInstanceOf, isInstanceOf, predict, toString, topNode. model.topNode outputs: org.apache.spark.mllib.tree.model.Node = id = 0, isLeaf = fal
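A small sketch of inspecting and using such a model, assuming parsedData is an RDD[LabeledPoint] as above:
~~~
// Predict on the training features and pair predictions with labels for inspection.
val predictionsAndLabels = parsedData.map { point =>
  (model.predict(point.features), point.label)
}

// topNode is the root of the tree; its toString shows the split/prediction info,
// and leftNode/rightNode (Options) let you walk the rest of the tree.
println(model.topNode)
model.topNode.leftNode.foreach(println)
~~~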

SPARK_LOCAL_DIRS

2014-08-14 Thread Brad Miller
Hi All, I'm having some trouble setting the disk spill directory for spark. The following approaches set "spark.local.dir" (according to the "Environment" tab of the web UI) but produce the indicated warnings: *In spark-env.sh:* export SPARK_JAVA_OPTS=-Dspark.local.dir=/spark/spill *Associated w

Re: SPARK_LOCAL_DIRS

2014-08-14 Thread Debasish Das
Actually I faced it yesterday... I had to put it in spark-env.sh and take it out of spark-defaults.conf on 1.0.1... Note that this setting should be visible on all workers. After that I validated that SPARK_LOCAL_DIRS was indeed getting used for shuffling... On Thu, Aug 14, 2014 at 10:27 AM,

Re: Using Hadoop InputFormat in Python

2014-08-14 Thread TJ Klein
Yes, thanks great. This seems to be the issue. At least running with spark-submit works as well. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-Hadoop-InputFormat-in-Python-tp12067p12126.html Sent from the Apache Spark User List mailing list archive

Re: Ways to partition the RDD

2014-08-14 Thread bdev
Thanks, will give that a try. I see the number of partitions requested is 8 (through HashPartitioner(8)). If I have a 40-node cluster, what's the recommended number of partitions? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Ways-to-partition-the-RDD-tp1

Re: Subscribing to news releases

2014-08-14 Thread Nicholas Chammas
I've created an issue to track this: SPARK-3044: Create RSS feed for Spark News On Fri, May 30, 2014 at 11:07 AM, Nick Chammas wrote: > Is there a way to subscribe to news releases > ? That would be swel

Re: Support for ORC Table in Shark/Spark

2014-08-14 Thread Zhan Zhang
Yes, you are right, but I tried the old hadoopFile API for OrcInputFormat. In Hive 0.12, OrcStruct does not expose its API, so Spark cannot access it. With Hive 0.13, an RDD can read from an ORC file. Btw, I didn't see ORCNewOutputFormat in Hive 0.13. Direct RDD manipulation (Hive 0.13): val inputRead = sc.hadoopFile

How to transform large local files into Parquet format and write into HDFS?

2014-08-14 Thread Parthus
Hi there, I have several large files (500GB per file) to transform into Parquet format and write to HDFS. The problems I encountered can be described as follows: 1) At first, I tried to load all the records in a file and then used "sc.parallelize(data)" to generate an RDD and finally used "saveAsNew
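One hedged sketch of avoiding sc.parallelize on the driver: copy the file into HDFS (or somewhere readable from every worker) first, read it as an RDD, and write Parquet through Spark SQL. The paths and the Record schema here are hypothetical:
~~~
import org.apache.spark.sql.SQLContext

case class Record(id: Long, value: String)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit RDD[case class] -> SchemaRDD

val records = sc.textFile("hdfs:///staging/bigfile.txt").map { line =>
  val fields = line.split(",")
  Record(fields(0).toLong, fields(1))
}
records.saveAsParquetFile("hdfs:///warehouse/bigfile.parquet")
~~~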

Re: Ways to partition the RDD

2014-08-14 Thread Daniel Siegmann
First, I think you might have a misconception about partitioning. ALL RDDs are partitioned (even if they are a single partition). When reading from HDFS the number of partitions depends on how the data is stored in HDFS. After data is shuffled (generally caused by things like reduceByKey), the numb

Documentation to start with

2014-08-14 Thread Abhilash K Challa
Hi, Does anyone have specific documentation for integrating Spark with a Hadoop distribution that does not already include Spark? Thanks, Abhilash

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-14 Thread Arpan Ghosh
Hi Davies, I tried the second option and launched my ec2 cluster with master on all the slaves by providing the latest commit hash of master as the "--spark-version" option to the spark-ec2 script. However, I am getting the same errors as before. I am running the job with the original spark-defaul

Re: groupByKey() completes 99% on Spark + EC2 + S3 but then throws java.net.SocketException: Connection reset

2014-08-14 Thread Arpan Ghosh
The errors are occurring at the exact same point in the job as well... right at the end of the groupByKey() when 5 tasks are left. On Thu, Aug 14, 2014 at 12:59 PM, Arpan Ghosh wrote: > Hi Davies, > > I tried the second option and launched my ec2 cluster with master on all > the slaves by prov

Re: Ways to partition the RDD

2014-08-14 Thread bdev
Thanks Daniel for the detailed information. Since the RDD is already partitioned, there is no need to worry about repartitioning. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Ways-to-partition-the-RDD-tp12083p12136.html Sent from the Apache Spark User Li

Re: Support for ORC Table in Shark/Spark

2014-08-14 Thread Zhan Zhang
I agree. We need support similar to the Parquet file support for end users. That's the purpose of SPARK-2883. Thanks. Zhan Zhang On Aug 14, 2014, at 11:42 AM, Yin Huai wrote: > I feel that using hadoopFile and saveAsHadoopFile to read and write ORCFile > are more towards developers because read/write

Spark on HDP

2014-08-14 Thread Padmanabh
Hi, I was reading the documentation at http://hortonworks.com/labs/spark/ and it seems to say that Spark is not ready for enterprise, which I think is not quite right. What I think they wanted to say is Spark on HDP is not ready for enterprise. I was wondering if someone here is using Spark on HDP

Re: java.lang.UnknownError: no bin was found for continuous variable.

2014-08-14 Thread Joseph Bradley
I have run into that issue too, but only when the data were not pre-processed correctly. E.g., if a categorical feature is binary with values in {-1, +1} instead of {0,1}. Will be very interested to learn if it can occur elsewhere! On Thu, Aug 14, 2014 at 10:16 AM, Sameer Tilak wrote: > > Hi
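A small sketch of the kind of pre-processing this implies: remap a binary categorical feature from {-1, +1} to {0, 1} before training (the column index and input RDD are hypothetical):
~~~
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val catCol = 3   // index of the binary categorical feature (assumed)
val cleaned = labeledPoints.map { lp =>
  val values = lp.features.toArray.clone()
  values(catCol) = if (values(catCol) > 0) 1.0 else 0.0
  LabeledPoint(lp.label, Vectors.dense(values))
}
~~~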

Re: Spark Akka/actor failures.

2014-08-14 Thread ldmtwo
The reason we are not using MLlib and Breeze is the lack of control over the data and performance. After computing the covariance matrix, there isn't much we can do with it. Many of the methods are private. For now, we need the max value and the corresponding pair of columns. Later, we may do

Seattle Spark Meetup: Spark at eBay - Troubleshooting the everyday issues Slides

2014-08-14 Thread Denny Lee
For those who were not able to attend the Seattle Spark Meetup - Spark at eBay - Troubleshooting the Everyday Issues, the slides have now been posted at: http://files.meetup.com/12063092/SparkMeetupAugust2014Public.pdf. Enjoy! Denny

spark streaming - lambda architecture

2014-08-14 Thread salemi
Hi, How would you implement the batch layer of the lambda architecture with Spark/Spark Streaming? Thanks, Ali -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-lamda-architecture-tp12142.html Sent from the Apache Spark User List mailing list arch

Re: Spark webUI - application details page

2014-08-14 Thread SK
Hi, I am using Spark 1.0.1. But I am still not able to see the stats for completed apps on port 4040 - only for running apps. Is this feature supported or is there a way to log this info to some file? I am interested in stats about the total # of executors, total runtime, and total memory used by

Performance hit for using sc.setCheckPointDir

2014-08-14 Thread Debasish Das
Hi, For our large ALS runs, we are considering using sc.setCheckPointDir so that the intermediate factors are written to HDFS and the lineage is broken... Is there a comparison which shows the performance degradation due to these options ? If not I will be happy to add experiments with it... Tha

Dealing with Idle shells

2014-08-14 Thread Gary Malouf
We have our quantitative team using Spark as part of their daily work. One of the more common problems we run into is that people unintentionally leave their shells open throughout the day. This eats up memory in the cluster and causes others to have limited resources to run their jobs. With som

Compiling SNAPSHOT

2014-08-14 Thread Jim Blomo
Hi, I'm having trouble compiling a snapshot, any advice would be appreciated. I get the error below when compiling either master or branch-1.1. The key error is, I believe, "[ERROR] File name too long" but I don't understand what it is referring to. Thanks! ./make-distribution.sh --tgz --skip-

Re: Ways to partition the RDD

2014-08-14 Thread Daniel Siegmann
There may be cases where you want to adjust the number of partitions or explicitly call RDD.repartition or RDD.coalesce. However, I would start with the defaults and then adjust if necessary to improve performance (for example, if cores are idling because there aren't enough tasks you may want more
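A minimal sketch of those knobs (the partition counts are illustrative):
~~~
val raw = sc.textFile("hdfs:///rawdata", minPartitions = 80)   // hint at read time

val fewer = raw.coalesce(40)       // shrink the partition count without a full shuffle
val more  = raw.repartition(160)   // grow or rebalance with a full shuffle

// Shuffle operations also accept the target partition count directly:
val counts = raw.map(line => (line.take(1), 1)).reduceByKey(_ + _, 40)
~~~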

Re: Spark webUI - application details page

2014-08-14 Thread durin
If I understand you correctly, setting event logging in SPARK_JAVA_OPTS should achieve what you want. I'm logging to HDFS, but according to the config page a folder should be possible as well. Example with all other settings rem

SparkR: split, apply, combine strategy for dataframes?

2014-08-14 Thread Carlos J. Gil Bellosta
Hello, I am having problems trying to apply the split-apply-combine strategy for dataframes using SparkR. I have a largish dataframe and I would like to achieve something similar to what ddply(df, .(id), foo) would do, only using SparkR as the computing engine. My df has a few million records

Re: Spark webUI - application details page

2014-08-14 Thread Andrew Or
Hi all, As Simon explained, you need to set "spark.eventLog.enabled" to true. I'd like to add that the usage of SPARK_JAVA_OPTS to set spark configurations is deprecated. I'm sure many of you have noticed this from the scary warning message we print out. :) The recommended and supported way of se

Re: Compiling SNAPSHOT

2014-08-14 Thread Jim Blomo
Tracked this down to an incompatibility between Scala and ecryptfs. Resolved by compiling in a directory not mounted with encryption (e.g. /tmp). On Thu, Aug 14, 2014 at 3:25 PM, Jim Blomo wrote: > Hi, I'm having trouble compiling a snapshot, any advice would be > appreciated. I get the error below whe

Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread lancezhange
I finally solved the problem with the following code: var m: org.apache.spark.mllib.classification.LogisticRegressionModel = null m = newModel // newModel is the loaded one, see my post above val labelsAndPredsOnGoodData = goodDataPoints.map { point => val prediction = m.predict(point.feature

Getting hadoop distcp to work on ephemeral-hdfs in spark-ec2 cluster

2014-08-14 Thread Arpan Ghosh
Hi, I have launched an AWS Spark cluster using the spark-ec2 script (--hadoop-major-version=1). The ephemeral-HDFS is setup correctly and I can see the name node at :50070. When I try to copy files from S3 into ephemeral-HDFS using distcp using the following command: ephemeral-hdfs/bin/hadoop dis

Re: Spark webUI - application details page

2014-08-14 Thread SK
I set "spark.eventLog.enabled" to true in $SPARK_HOME/conf/spark-defaults.conf and also configured the logging to a file as well as console in log4j.properties. But I am not able to get the log of the statistics in a file. On the console there is a lot of log messages along with the stats - so ha

Re: Spark webUI - application details page

2014-08-14 Thread SK
More specifically, as indicated by Patrick above, in 1.0+, apps will have persistent state so that the UI can be reloaded. Is there a way to enable this feature in 1.0.1? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-webUI-application-details

Spark working directories

2014-08-14 Thread Yana Kadiyska
Hi all, trying to change defaults of where stuff gets written. I've set "-Dspark.local.dir=/spark/tmp" and I can see that the setting is used when the executor is started. I do indeed see directories like spark-local-20140815004454-bb3f in this desired location but I also see undesired stuff unde

Re: spark streaming - lambda architecture

2014-08-14 Thread Tathagata Das
Can you be a bit more specific about what you mean by lambda architecture? On Thu, Aug 14, 2014 at 2:27 PM, salemi wrote: > Hi, > > How would you implement the batch layer of lambda architecture with > spark/spark streaming? > > Thanks, > Ali > > > > -- > View this message in context: > http://a

Re: SparkR: split, apply, combine strategy for dataframes?

2014-08-14 Thread Shivaram Venkataraman
Could you try increasing the number of slices with the large data set? SparkR assumes that each slice (or partition in Spark terminology) can fit in the memory of a single machine. Also, is the error happening when you do the map function or when you combine the results? Thanks Shivar

Re: Spark webUI - application details page

2014-08-14 Thread Andrew Or
Hi SK, Not sure if I understand you correctly, but here is how the user normally uses the event logging functionality: After setting "spark.eventLog.enabled" and optionally "spark.eventLog.dir", the user runs his/her Spark application and calls sc.stop() at the end of it. Then he/she goes to the
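A minimal sketch of that flow, setting the keys programmatically (they can equally go in conf/spark-defaults.conf; the log directory is hypothetical):
~~~
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///spark-events")

val sc = new SparkContext(conf)
// ... run the job ...
sc.stop()   // after this, the completed app appears in the Master UI / history server
~~~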

Re: spark streaming - lambda architecture

2014-08-14 Thread salemi
Below is what I understand by lambda architecture. The batch layer provides the historical data and the speed layer provides the real-time view! All data entering the system is dispatched to both the batch layer and the speed layer for processing. The batch layer has two functions: (i)

Re: Spark working directories

2014-08-14 Thread Calvin
I've had this issue too running Spark 1.0.0 on YARN with HDFS: it defaults to a working directory located in hdfs:///user/$USERNAME and it's not clear how to set the working directory. In the case where HDFS has a non-standard directory structure (i.e., home directories located in hdfs:///users/)

Re: spark streaming - lambda architecture

2014-08-14 Thread Michael Hausenblas
>>> How would you implement the batch layer of lambda architecture with >>> spark/spark streaming? I assume you're familiar with resources such as https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark and are after more detailed advice? Cheers, Michael -- M

RE: spark streaming - lambda architecture

2014-08-14 Thread Shao, Saisai
Hi Ali, Maybe you can take a look at Twitter's Summingbird project (https://github.com/twitter/summingbird), which is currently one of the few open source implementations of the lambda architecture. There's an ongoing sub-project called summingbird-spark that might be the one you want; maybe this can

None in RDD

2014-08-14 Thread guoxu1231
Hi Guys, I have a serious problem regarding the 'None' in RDD (pyspark). Take an example of a transformation that produces 'None': leftOuterJoin(self, other, numPartitions=None) Perform a left outer join of self and other. (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all p
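For comparison, in the Scala API the missing side of a left outer join comes back as an Option, and a common pattern is to fill in a default right after the join (names here are illustrative):
~~~
val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
val right = sc.parallelize(Seq(("a", 10)))

val joined = left.leftOuterJoin(right)                    // RDD[(String, (Int, Option[Int]))]
val filled = joined.mapValues { case (v, w) => (v, w.getOrElse(0)) }
~~~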

RE: Spark SQL Stackoverflow error

2014-08-14 Thread Cheng, Hao
I couldn’t reproduce the exception, probably it’s solved in the latest code. From: Vishal Vibhandik [mailto:vishal.vibhan...@gmail.com] Sent: Thursday, August 14, 2014 11:17 AM To: user@spark.apache.org Subject: Spark SQL Stackoverflow error Hi, I tried running the sample sql code JavaSparkSQL bu

Re: Python + Spark unable to connect to S3 bucket .... "Invalid hostname in URI"

2014-08-14 Thread Miroslaw
So after doing some more research I found the root cause of the problem. The bucket name we were using contained an underscore '_'. This goes against the new requirements for naming buckets. Using a bucket that is not named with an underscore solved the issue. If anyone else runs into this problem