Re: out of memory error with Parquet

2015-11-13 Thread Josh Rosen
Tip: jump straight to 1.5.2; it has some key bug fixes. Sent from my phone > On Nov 13, 2015, at 10:02 PM, AlexG wrote: > > Never mind; when I switched to Spark 1.5.0, my code works as written and is > pretty fast! Looking at some Parquet related Spark jiras, it seems that > Parquet is known to

Re: out of memory error with Parquet

2015-11-13 Thread AlexG
Never mind; when I switched to Spark 1.5.0, my code works as written and is pretty fast! Looking at some Parquet related Spark jiras, it seems that Parquet is known to have some memory issues with buffering and writing, and that at least some were resolved in Spark 1.5.0. -- View this messag

RE: Connecting SparkR through Yarn

2015-11-13 Thread Sun, Rui
I guess this is not related to SparkR. It seems that Spark can’t pick up the hostname/IP address of the RM. Make sure you have correctly set the YARN_CONF_DIR env var and have configured the address of the RM in yarn-site.xml. From: Amit Behera [mailto:amit.bd...@gmail.com] Sent: Friday, November 13, 2015 9:38 PM To:

Re: large, dense matrix multiplication

2015-11-13 Thread Burak Yavuz
Hi, The BlockMatrix multiplication should be much more efficient on the current master (and will be available with Spark 1.6). Could you please give that a try if you have the chance? Thanks, Burak On Fri, Nov 13, 2015 at 10:11 AM, Sabarish Sasidharan < sabarish.sasidha...@manthan.com> wrote: >
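
For reference, a minimal sketch of what a BlockMatrix multiplication looks like (rowsA and rowsB are hypothetical RDD[IndexedRow]s, and the 1024x1024 block size is illustrative):

    import org.apache.spark.mllib.linalg.distributed.{IndexedRowMatrix, BlockMatrix}

    // Build block matrices from indexed-row matrices; block sizes are illustrative.
    val blockA: BlockMatrix = new IndexedRowMatrix(rowsA).toBlockMatrix(1024, 1024).cache()
    val blockB: BlockMatrix = new IndexedRowMatrix(rowsB).toBlockMatrix(1024, 1024).cache()

    blockA.validate()                                   // optional sanity check of block dimensions
    val product: BlockMatrix = blockA.multiply(blockB)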

out of memory error with Parquet

2015-11-13 Thread AlexG
I'm using Spark to read in data from many files and write it back out in Parquet format for ease of use later on. Currently, I'm using this code: val fnamesRDD = sc.parallelize(fnames, ceil(fnames.length.toFloat/numfilesperpartition).toInt) val results = fnamesRDD.mapPartitionsWithIndex((index
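
For context, a minimal sketch of the DataFrame-based route for this kind of conversion (the input format, paths, and partition count are placeholders, not the poster's actual code):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // Read the source files into a DataFrame, then write Parquet from the executors.
    val df = sqlContext.read.json("hdfs:///input/*.json")
    df.repartition(200).write.parquet("hdfs:///output/data.parquet")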

Re: does spark ML have some thing like createDataPartition() in R caret package ?

2015-11-13 Thread Sonal Goyal
The RDD has a takeSample method where you can supply the flag for replacement or not as well as the fraction to sample. On Nov 14, 2015 2:51 AM, "Andy Davidson" wrote: > In R, its easy to split a data set into training, crossValidation, and > test set. Is there something like this in spark.ml? I
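
For illustration, a minimal sketch of the usual split-and-sample pattern (`data` is a hypothetical RDD or DataFrame; the weights, fraction, and seed are illustrative):

    // Split into training / cross-validation / test sets.
    val Array(training, crossValidation, test) =
      data.randomSplit(Array(0.6, 0.2, 0.2), seed = 42L)

    // Or pull a small sample for initial exploration, without replacement.
    val exploration = data.sample(withReplacement = false, fraction = 0.01, seed = 42L)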

Re: Save GraphX to disk

2015-11-13 Thread SLiZn Liu
Hi Gaurav, Your graph can be saved to graph databases like Neo4j or Titan through their drivers, which eventually save it to disk. BR, Todd Gaurav Kumar gauravkuma...@gmail.com> wrote on Fri, Nov 13, 2015 at 22:08: > Hi, > > I was wondering how to save a graph to disk and load it back again. I know > how to

Spark file streaming issue

2015-11-13 Thread ravi.gawai
Hi, I am trying a simple file streaming example using Spark Streaming (spark-streaming_2.10, version 1.5.1) public class DStreamExample { public static void main(final String[] args) { final SparkConf sparkConf = new SparkConf(); sparkConf.setAppName("SparkJob"); sparkC

send transformed RDD to s3 from slaves

2015-11-13 Thread Walrus theCat
Hi, I have an RDD which crashes the driver when being collected. I want to send the data on its partitions out to S3 without bringing it back to the driver. I try calling rdd.foreachPartition, but the data that gets sent has not gone through the chain of transformations that I need. It's the dat
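
For reference, a hedged sketch of writing the fully transformed RDD from the executors rather than through the driver (the transformation chain, bucket, and path are placeholders):

    // Apply the full chain of transformations first, then write from the workers.
    val transformed = rdd.map(transformA).filter(predicateB)   // hypothetical chain

    // saveAsTextFile writes each partition on the executor that holds it,
    // so nothing is collected back to the driver.
    transformed.saveAsTextFile("s3n://my-bucket/output/")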

Re: a way to allow spark job to continue despite task failures?

2015-11-13 Thread Ted Yu
I searched the code base and looked at: https://spark.apache.org/docs/latest/running-on-yarn.html I didn't find mapred.max.map.failures.percent or its counterpart. FYI On Fri, Nov 13, 2015 at 9:05 AM, Nicolae Marasoiu < nicolae.maras...@adswizz.com> wrote: > Hi, > > > I know a task can fail 2 t
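
Not from the thread, but one common workaround (under the assumption that failing records can simply be dropped) is to absorb exceptions inside the task so that a bad record does not fail the task, and hence the job:

    import scala.util.Try

    // `process` is a hypothetical per-record function that may throw.
    val results = inputRdd.flatMap { record =>
      Try(process(record)).toOption    // Some(value) on success, None (dropped) on failure
    }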

Re: very slow parquet file write

2015-11-13 Thread Rok Roskar
I'm not sure what you mean? I didn't do anything specifically to partition the columns On Nov 14, 2015 00:38, "Davies Liu" wrote: > Do you have partitioned columns? > > On Thu, Nov 5, 2015 at 2:08 AM, Rok Roskar wrote: > > I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions > i

Re: bin/pyspark SparkContext is missing?

2015-11-13 Thread Davies Liu
You forgot to create a SparkContext instance: sc = SparkContext() On Tue, Nov 3, 2015 at 9:59 AM, Andy Davidson wrote: > I am having a heck of a time getting IPython notebooks to work on my 1.5.1 > AWS cluster I created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 > > I have read the instructio

Re: Distributing Python code packaged as tar balls

2015-11-13 Thread Davies Liu
Python does not support libraries as tarballs, so PySpark may not support that either. On Wed, Nov 4, 2015 at 5:40 AM, Praveen Chundi wrote: > Hi, > > Pyspark/spark-submit offers a --py-files handle to distribute python code > for execution. Currently (version 1.5) only zip files seem to be supported

Re: very slow parquet file write

2015-11-13 Thread Davies Liu
Do you have partitioned columns? On Thu, Nov 5, 2015 at 2:08 AM, Rok Roskar wrote: > I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a > parquet file on HDFS. I've got a few hundred nodes in the cluster, so for > the size of file this is way over-provisioned (I've tried

pyspark sql: number of partitions and partition by size?

2015-11-13 Thread Wei Chen
Hey Friends, I am trying to use sqlContext.write.parquet() to write dataframe to parquet files. I have the following questions. 1. number of partitions The default number of partition seems to be 200. Is there any way other than using df.repartition(n) to change this number? I was told repartitio
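
For reference, a hedged sketch of the two knobs usually involved (the setting name is standard Spark configuration, but the values and paths are placeholders, and the effect depends on whether a shuffle precedes the write):

    // Partition count used after shuffles (joins, aggregations); the default is 200.
    sqlContext.setConf("spark.sql.shuffle.partitions", "64")

    // If no shuffle precedes the write, the output file count follows the
    // DataFrame's own partitioning, which coalesce/repartition can change.
    df.coalesce(32).write.parquet("hdfs:///output/table.parquet")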

does spark ML have some thing like createDataPartition() in R caret package ?

2015-11-13 Thread Andy Davidson
In R, it's easy to split a data set into training, crossValidation, and test sets. Is there something like this in spark.ml? I am using Python as of now. My real problem is I want to randomly select a relatively small data set to do some initial data exploration. It's not clear to me how using Spark I c

Join and HashPartitioner question

2015-11-13 Thread Alexander Pivovarov
Hi Everyone Is there any difference in performance between the following two joins? val r1: RDD[(String, String)] = ??? val r2: RDD[(String, String)] = ??? val partNum = 80 val partitioner = new HashPartitioner(partNum) // Join 1 val res1 = r1.partitionBy(partitioner).join(r2.partitionBy(partition
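
The message is truncated above; a hedged reconstruction of the two variants typically compared looks like this:

    // Join 1: explicitly pre-partition both sides, then join.
    val res1 = r1.partitionBy(partitioner).join(r2.partitionBy(partitioner))

    // Join 2: pass the partitioner to join and let it shuffle both sides itself.
    val res2 = r1.join(r2, partitioner)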

SparkException: Could not read until the end sequence number of the range

2015-11-13 Thread Alan Dipert
Hi all, We're running Spark 1.5.0 on EMR 4.1.0 in AWS and consuming from Kinesis. We saw the following exception today - it killed the Spark "step": org.apache.spark.SparkException: Could not read until the end sequence number of the range We guessed it was because our Kinesis stream didn't have

Columnar Statistics

2015-11-13 Thread sara mustafa
Hi, I am using Spark 1.5.2 and I notice the existence of the class org.apache.spark.sql.columnar.ColumnStatisticsSchema, How can I use it to calculate column statistics of a DataFrame? Thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Columnar-Statisi
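
Not an answer from the thread: that class appears to belong to Spark's internal in-memory columnar caching rather than the public API. For basic per-column statistics on a DataFrame, the public describe() method is a hedged alternative (column names are placeholders):

    // count, mean, stddev, min, max for the named numeric columns.
    val stats = df.describe("colA", "colB")
    stats.show()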

Re: Joining HDFS and JDBC data sources - benchmarks

2015-11-13 Thread Sabarish Sasidharan
If it is metadata why would we not cache it before we perform the join? Regards Sab On 13-Nov-2015 10:27 pm, "Eran Medan" wrote: > Hi > I'm looking for some benchmarks on joining data frames where most of the > data is in HDFS (e.g. in parquet) and some "reference" or "metadata" is > still in RD

RE: Save GraphX to disk

2015-11-13 Thread Buttler, David
A graph is vertices and edges. What else are you expecting to save/load? You could save/load the triplets, but that is actually more work to reconstruct the graph from than the vertices and edges separately. Dave From: Gaurav Kumar [mailto:gauravkuma...@gmail.com] Sent: Friday, November 13, 2015
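
Along those lines, a minimal sketch of saving and rebuilding a graph (paths are placeholders; VD and ED stand for the hypothetical vertex and edge attribute types):

    import org.apache.spark.graphx.{Edge, Graph, VertexId}

    // Save the two pieces separately.
    graph.vertices.saveAsObjectFile("hdfs:///graphs/g1/vertices")
    graph.edges.saveAsObjectFile("hdfs:///graphs/g1/edges")

    // Reload and reconstruct.
    val vertices = sc.objectFile[(VertexId, VD)]("hdfs:///graphs/g1/vertices")
    val edges = sc.objectFile[Edge[ED]]("hdfs:///graphs/g1/edges")
    val reloaded: Graph[VD, ED] = Graph(vertices, edges)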

Re: very slow parquet file write

2015-11-13 Thread Davies Liu
Have you used any partitioned columns when writing as JSON or Parquet? On Fri, Nov 6, 2015 at 6:53 AM, Rok Roskar wrote: > yes I was expecting that too because of all the metadata generation and > compression. But I have not seen performance this bad for other parquet > files I’ve written and was wo

SequenceFile and object reuse

2015-11-13 Thread jeff saremi
So we tried reading a sequencefile in Spark and realized that all our records have ended up becoming the same. Then one of us found this: Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD or directly passing it to an ag
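
A hedged sketch of the usual fix — copy the contents out of the reused Writables before caching — assuming LongWritable/Text keys and values and a placeholder path:

    import org.apache.hadoop.io.{LongWritable, Text}

    val records = sc.sequenceFile("hdfs:///data/input.seq", classOf[LongWritable], classOf[Text])
      .map { case (k, v) => (k.get, v.toString) }   // materialize plain Scala values
      .cache()                                      // now safe to cache or collect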

Re: No suitable drivers found for postgresql

2015-11-13 Thread James Nowell
The --jars flag should be available for PySpark as well (I could be wrong, I've only used Spark 1.4 onward). Take, for example, the command I'm using to start a PySpark shell for a Jupyter Notebook: "--jars hdfs://{our namenode}/tmp/postgresql-9.4-1204.jdbc42.jar --driver-class-path /usr/local/sha

hang correlated to number of shards Re: Checkpointing with Kinesis hangs with socket timeouts when driver is relaunched while transforming on a 0 event batch

2015-11-13 Thread Hster Geguri
Just an update that the Kinesis checkpointing works well with orderly and kill -9 driver shutdowns when there are fewer than 4 shards. We use 20+. I created a case with Amazon support since it is the AWS Kinesis getRecords API which is hanging. Regards, Heji On Thu, Nov 12, 2015 at 10:37 AM, Hste

Re: large, dense matrix multiplication

2015-11-13 Thread Sabarish Sasidharan
Hi Eilidh Because you are multiplying with the transpose you don't have to necessarily build the right side of the matrix. I hope you see that. You can broadcast blocks of the indexed row matrix to itself and achieve the multiplication. But for similarity computation you might want to use some a

Re: What is difference btw reduce & fold?

2015-11-13 Thread firemonk9
This is very well explained. Thank you -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-is-difference-btw-reduce-fold-tp22653p25376.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Please add us to the Powered by Spark page

2015-11-13 Thread Sujit Pal
Hello, We have been using Spark at Elsevier Labs for a while now. Would love to be added to the “Powered By Spark” page. Organization Name: Elsevier Labs URL: http://labs.elsevier.com Spark components: Spark Core, Spark SQL, MLLib, GraphX. Use Case: Building Machine Reading Pipeline, Knowledge Gr

Re: No suitable drivers found for postgresql

2015-11-13 Thread Krishna Sangeeth KS
Hi, I have been trying to do this today at work with Impala as the data source. I have been getting the same error as well. I am using PySpark APIs with Spark 1.3 and I was wondering if there is any workaround for PySpark. I don't think we can use the --jars option in PySpark. Cheer

a way to allow spark job to continue despite task failures?

2015-11-13 Thread Nicolae Marasoiu
Hi, I know a task can fail 2 times and only the 3rd failure breaks the entire job. I am fine with this number of attempts. What I would like is that after a task has been tried 3 times, the job continues with the other tasks. The job can be marked "failed", but I want all tasks to run. Please see my use case. I read a hadoop in

Joining HDFS and JDBC data sources - benchmarks

2015-11-13 Thread Eran Medan
Hi I'm looking for some benchmarks on joining data frames where most of the data is in HDFS (e.g. in parquet) and some "reference" or "metadata" is still in RDBMS. I am only looking at the very first join before any caching happens, and I assume there will be loss of parallelization because JDBCRDD

Re: large, dense matrix multiplication

2015-11-13 Thread Eilidh Troup
Hi Sab, Thanks for your response. We’re thinking of trying a bigger cluster, because we just started with 2 nodes. What we really want to know is whether the code will scale up with larger matrices and more nodes. I’d be interested to hear how large a matrix multiplication you managed to do? I

Re: No suitable drivers found for postgresql

2015-11-13 Thread James Nowell
I recently had this same issue. Though I didn't find the cause, I was able to work around it by loading the JAR into hdfs. Once in HDFS, I used the --jars flag with the full hdfs path: --jars hdfs://{our namenode}/tmp/postgresql-9.4-1204-jdbc42.jar James On Fri, Nov 13, 2015 at 10:14 AM satish ch

Re: Kafka Offsets after application is restarted using Spark Streaming Checkpointing

2015-11-13 Thread Cody Koeninger
Unless you change maxRatePerPartition, a batch is going to contain all of the offsets from the last known processed to the highest available. Offsets are not time-based, and Kafka's time-based api currently has very poor granularity (it's based on filesystem timestamp of the log segment). There's
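
For reference, a hedged example of the setting mentioned above, which caps how many records each batch pulls per Kafka partition (the value is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")  // max records/sec/partition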

No suitable drivers found for postgresql

2015-11-13 Thread satish chandra j
Hi All, Currently using Spark 1.4.1, my Spark job has to fetch data from a PostgreSQL database using JdbcRDD. I am submitting my Spark job using --jars to pass the PostgreSQL JDBC driver, but I am still getting the error mentioned below: "java.sql.SQLException: No suitable driver found for PostgreSQL JDBC" wh
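
Not from the thread: besides shipping the jar, a hedged sketch of forcing driver registration inside the JdbcRDD connection factory on the executors (the URL, credentials, query, and bounds are placeholders):

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val rows = new JdbcRDD(
      sc,
      () => {
        Class.forName("org.postgresql.Driver")   // make sure the driver is registered
        DriverManager.getConnection("jdbc:postgresql://host:5432/db", "user", "pass")
      },
      "SELECT id, name FROM items WHERE id >= ? AND id <= ?",
      1L, 100000L, 10,                           // lowerBound, upperBound, numPartitions
      rs => (rs.getLong(1), rs.getString(2)))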

Re: spark 1.4 GC issue

2015-11-13 Thread Gaurav Kumar
Please have a look at http://spark.apache.org/docs/1.4.0/tuning.html You may also want to use the latest build of JDK 7/8 and use G1GC instead. I saw considerable reductions in GC time just by doing that. Rest of the tuning parameters are better explained in the link above. Best Regards, Gaurav
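
A hedged example of the G1GC switch mentioned above, set through the executor JVM options (the flags are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("MyJob")
      .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:+PrintGCDetails")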

Spark Streaming + SparkSQL, time based windowing queries

2015-11-13 Thread Saiph Kappa
Hi, Does SparkSQL support time based windowing queries over streams like the following one (from Intel/StreamingSQL)? « sql("""SELECT t.word, COUNT(t.word) FROM (SELECT * FROM test) OVER (WINDOW '9' SECONDS, SLIDE '3' SECONDS) AS t GROUP BY t.word""") » What are my
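
Not an answer from the thread, but a common workaround is to window the DStream itself and run plain Spark SQL on each windowed RDD. A hedged sketch, assuming `words` is a DStream[String] and `sqlContext` is in scope:

    import org.apache.spark.streaming.Seconds

    words.window(Seconds(9), Seconds(3)).foreachRDD { rdd =>
      val df = sqlContext.createDataFrame(rdd.map(Tuple1(_))).toDF("word")
      df.registerTempTable("test")
      sqlContext.sql("SELECT word, COUNT(word) FROM test GROUP BY word").show()
    }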

Re: Stack Overflow Question

2015-11-13 Thread Sabarish Sasidharan
The reserved cores are to prevent starvation, so that user B can run jobs when user A's job is already running and using almost all of the cluster. You can change your scheduler configuration to use more cores. Regards Sab On 13-Nov-2015 6:56 pm, "Parin Choganwala" wrote: > EMR 4.1.0 + Spark 1.5.

spark 1.4 GC issue

2015-11-13 Thread Renu Yadav
I am using Spark 1.4 and my application is spending much of its time in GC, around 60-70% of the time for each task. I am using the parallel GC. Please, somebody help as soon as possible. Thanks, Renu

Save GraphX to disk

2015-11-13 Thread Gaurav Kumar
Hi, I was wondering how to save a graph to disk and load it back again. I know how to save vertices and edges to disk and construct the graph from them, not sure if there's any method to save the graph itself to disk. Best Regards, Gaurav Kumar Big Data • Data Science • Photography • Music +91 99

Stack Overflow Question

2015-11-13 Thread Parin Choganwala
EMR 4.1.0 + Spark 1.5.0 + YARN Resource Allocation http://stackoverflow.com/q/33488869/1366507?sem=2

How is the predict() working in LogisticRegressionModel?

2015-11-13 Thread MEETHU MATHEW
Hi all, Can somebody point me to the implementation of predict() in LogisticRegressionModel of Spark MLlib? I could find a predictPoint() in the class LogisticRegressionModel, but where is predict()? Thanks & Regards, Meethu M
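
For reference (not from the thread): predict() is inherited from the parent class GeneralizedLinearModel, which delegates to the model's predictPoint(). A hedged usage sketch, with hypothetical training and test RDDs of LabeledPoint:

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

    val model = LogisticRegressionWithSGD.train(trainingData, numIterations = 100)
    val predictions = testData.map(point => model.predict(point.features))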

Traing data sets storage requirement

2015-11-13 Thread Veluru, Aruna
Hi All, Just started understanding / getting hands on with Spark, Streaming and MLLIb. We are in the design phase and need suggestions on the training data storage requirement. Batch Layer: Our core systems generate data which we will be using as batch data, currently SQL server

Spark Executors off-heap memory usage keeps increasing

2015-11-13 Thread Balthasar Schopman
Hi, The off-heap memory usage of the 3 Spark executor processes keeps increasing constantly until the boundaries of the physical RAM are hit. This happened two weeks ago, at which point the system came to a grinding halt because it was unable to spawn new processes. At such a moment restarting

Spark and Spring Integrations

2015-11-13 Thread Netai Biswas
Hi, I am facing an issue while integrating Spark with Spring. I am getting "java.lang.IllegalStateException: Cannot deserialize BeanFactory with id" errors for all beans. I have tried a few solutions available on the web. Please help me out to solve this issue. Few details: Java : 8 Spark : 1.5.1 Sprin

Re: Spark 1.5 UDAF ArrayType

2015-11-13 Thread Deenar Toraskar
Michael, it took me a while to get back, but I am pleased to report that the issue has been fixed in Spark 1.5.1. It's great not having to write verbose Hive UDAFs in Java any longer. But I noticed an issue: with Spark 1.5.1 the Hive UDAF has stopped working. I noticed this when trying to do a performance c

Kafka Offsets after application is restarted using Spark Streaming Checkpointing

2015-11-13 Thread kundan kumar
Hi, I am using spark streaming check-pointing mechanism and reading the data from kafka. The window duration for my application is 2 hrs with a sliding interval of 15 minutes. So, my batches run at following intervals... 09:45 10:00 10:15 10:30 and so on Suppose, my running batch dies at 09:55

Re: HiveServer2 Thrift OOM

2015-11-13 Thread Steve Loughran
looks suspiciously like some thrift transport unmarshalling problem, THRIFT-2660 Spark 1.5 uses hive 1.2.1; it should have the relevant thrift JAR too. Otherwise, you could play with thrift JAR versions yourself —maybe it will work, maybe not... On 13 Nov 2015, at 00:29, Yana Kadiyska mailto:y

Re: how to run unit test for specific component only

2015-11-13 Thread Steve Loughran
try: mvn test -pl sql -DwildcardSuites=org.apache.spark.sql -Dtest=none On 12 Nov 2015, at 03:13, weoccc mailto:weo...@gmail.com>> wrote: Hi, I am wondering how to run unit test for specific spark component only. mvn test -DwildcardSuites="org.apache.spark.sql.*" -Dtest=none The above co

Re: large, dense matrix multiplication

2015-11-13 Thread Sabarish Sasidharan
We have done this by blocking but without using BlockMatrix. We used our own blocking mechanism because BlockMatrix didn't exist in Spark 1.2. What is the size of your block? How much memory are you giving to the executors? I assume you are running on YARN, if so you would want to make sure your ya

Re: thought experiment: use spark ML to real time prediction

2015-11-13 Thread Sabarish Sasidharan
That may not be an issue if the app using the models runs by itself (not bundled into an existing app), which may actually be the right way to design it considering separation of concerns. Regards Sab On Fri, Nov 13, 2015 at 9:59 AM, DB Tsai wrote: > This will bring the whole dependencies of sp