Re: Optimizing large collect operations

2015-11-26 Thread Jeff Zhang
For such large output, I would suggest doing the processing in the cluster rather than in the driver (use the RDD API for that). If you really want to pull it to the driver, you can first save it to HDFS and then read it back with the HDFS API, to avoid the Akka issue. On Fri, Nov 27, 2015 at 2:41 PM,
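
A minimal Scala sketch of the save-then-read pattern suggested above; the HDFS path is hypothetical, and the SparkContext is built inline:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.hadoop.fs.{FileSystem, Path}

    val sc = new SparkContext(new SparkConf().setAppName("save-then-read"))
    val result = sc.parallelize(1 to 1000000).map(i => (i.toString, i))

    // Write from the executors instead of pulling everything to the driver.
    result.saveAsTextFile("hdfs:///tmp/large-result")

    // If the driver really needs the data, read it back with the HDFS API
    // in manageable pieces rather than via collect()/collectAsMap().
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val parts = fs.listStatus(new Path("hdfs:///tmp/large-result"))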

Re: Spark on yarn vs spark standalone

2015-11-26 Thread Jeff Zhang
If your cluster is a dedicated Spark cluster (only running Spark jobs, no other jobs like Hive/Pig/MR), then Spark standalone would be fine. Otherwise I think YARN would be a better option. On Fri, Nov 27, 2015 at 3:36 PM, cs user wrote: > Hi All, > > Apologies if this question has been asked bef

Spark on yarn vs spark standalone

2015-11-26 Thread cs user
Hi All, Apologies if this question has been asked before. I'd like to know if there are any downsides to running Spark over YARN with the --master yarn-cluster option vs having a separate Spark standalone cluster to execute jobs? We're looking at installing an HDFS/Hadoop cluster with Ambari and s

Re: error while creating HiveContext

2015-11-26 Thread fightf...@163.com
Hi, I think you just need to put hive-site.xml in the spark/conf directory and Spark will load it onto its classpath. Best, Sun. fightf...@163.com From: Chandra Mohan, Ananda Vel Murugan Date: 2015-11-27 15:04 To: user Subject: error while creating HiveContext Hi, I am building a sp

error while creating HiveContext

2015-11-26 Thread Chandra Mohan, Ananda Vel Murugan
Hi, I am building a spark-sql application in Java. I created a Maven project in Eclipse and added all dependencies, including spark-core and spark-sql. I am creating a HiveContext in my Spark program and then trying to run SQL queries against my Hive table. When I submit this job in Spark, for some r

Re: GraphX - How to make a directed graph an undirected graph?

2015-11-26 Thread Robineast
1. GraphX doesn't have a concept of undirected graphs; edges are always specified with a srcId and a dstId. However, there is nothing to stop you adding edges that point in the other direction, i.e. if you have an edge srcId -> dstId you can add an edge dstId -> srcId. 2. In general APIs will r
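
A small Scala sketch of the reverse-edge trick from point 1, assuming an existing SparkContext sc; the toy edges are hypothetical:

    import org.apache.spark.graphx.{Edge, Graph}

    // Toy directed graph: 1 -> 2, 2 -> 3.
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
    val g = Graph.fromEdges(edges, defaultValue = 0)

    // Mirror every edge so traversal works in both directions.
    val mirrored = g.edges.map(e => Edge(e.dstId, e.srcId, e.attr))
    val undirected = Graph(g.vertices, g.edges.union(mirrored))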

Optimizing large collect operations

2015-11-26 Thread Gylfi
Hi. I am doing very large collectAsMap() operations, about 10,000,000 records, and I am getting "org.apache.spark.SparkException: Error communicating with MapOutputTracker" errors. Details: "org.apache.spark.SparkException: Error communicating with MapOutputTracker at org.apache.spa
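
This error often points at oversized or timed-out driver/executor messages on Spark 1.x. A hedged configuration sketch to try, assuming (not confirmed from the thread) that message size is the culprit; the values are guesses to tune:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("large-collect")
      .set("spark.akka.frameSize", "256")  // max driver<->executor message size, in MB
      .set("spark.akka.askTimeout", "120") // seconds to wait on tracker replies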

Millions of entities in custom Hadoop InputFormat and broadcast variable

2015-11-26 Thread Anfernee Xu
Hi Spark experts, First of all, happy Thanksgiving! Then comes my question: I have implemented a custom Hadoop InputFormat to load millions of entities from my data source into Spark (as a JavaRDD, transformed to a DataFrame). The approach I took in implementing the custom Hadoop RDD is loading all ID'

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-26 Thread Gylfi
HDFS has a default replication factor of 3.
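
If the extra copies are not needed, replication can be lowered per file with the Hadoop FileSystem API; a hedged sketch (the path is hypothetical, and this assumes the HDFS config is on the classpath):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    // setReplication applies per file, not recursively to a directory.
    fs.setReplication(new Path("/data/dataset/part-00000"), 2.toShort)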

Re: Stop Spark yarn-client job

2015-11-26 Thread Jeff Zhang
Could you attach the YARN AM log? On Fri, Nov 27, 2015 at 8:10 AM, Jagat Singh wrote: > Hi, > > What is the correct way to fully stop a Spark job which is running as > yarn-client using spark-submit? > > We are using sc.stop in the code and can see the job still running (in > yarn resource ma

Stop Spark yarn-client job

2015-11-26 Thread Jagat Singh
Hi, What is the correct way to fully stop a Spark job which is running as yarn-client using spark-submit? We are using sc.stop in the code and can see the job still running (in the YARN resource manager) after the final Hive insert is complete. The code flow is: start context, do some work, insert to hiv
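
A sketch of the usual shutdown pattern, with sc.stop() in a finally block so the YARN application is released even when the work fails; the app name and the work itself are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object MyJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("my-job"))
        try {
          // do some work, insert to Hive, etc.
          sc.parallelize(1 to 100).count()
        } finally {
          sc.stop() // without this the app can linger in the resource manager
        }
      }
    }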

Unable to use "Batch Start Time" on worker nodes.

2015-11-26 Thread Abhishek Anand
Hi, I need to use the batch start time in my Spark Streaming job. I need the value of the batch start time inside one of the functions that is called within a flatMap function in Java. Please suggest how this can be done. I tried to use the StreamingListener class and set the value of a variable in
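
One way to get the batch time without a listener is the DStream.transform overload that receives the batch Time; a Scala sketch (the Java API has the same overload; the host, port and interval are hypothetical, and an existing SparkContext sc is assumed):

    import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    // transform exposes the batch time, which can be threaded into flatMap logic.
    val tagged = lines.transform { (rdd, batchTime: Time) =>
      val startMs = batchTime.milliseconds
      rdd.flatMap(line => line.split(" ").map(word => (startMs, word)))
    }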

Grid search with Random Forest

2015-11-26 Thread Ndjido Ardo Bar
Hi folks, Does anyone know whether the grid search capability is enabled since issue SPARK-9011 in version 1.4.0? I'm getting a "rawPredictionCol column doesn't exist" error when trying to perform a grid search with Spark 1.4.0. Cheers, Ardo
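
For reference, a minimal spark.ml grid-search sketch around a random forest; the table and column names are hypothetical, sqlContext is assumed to exist, and this only illustrates the code path where the rawPredictionCol error was reported, not a confirmed fix:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    // Hypothetical table with "features" (vector) and "label" (double) columns.
    val training = sqlContext.table("training_data")

    val rf = new RandomForestClassifier()
    val grid = new ParamGridBuilder()
      .addGrid(rf.numTrees, Array(20, 50))
      .addGrid(rf.maxDepth, Array(5, 10))
      .build()

    val cv = new CrossValidator()
      .setEstimator(new Pipeline().setStages(Array(rf)))
      .setEvaluator(new BinaryClassificationEvaluator()) // reads rawPredictionCol
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)

    val model = cv.fit(training)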

Re: Help with Couchbase connector error

2015-11-26 Thread Shixiong Zhu
Hey Eyal, I just checked the Couchbase Spark connector jar. The target version of some of its classes is Java 8 (52.0). You can create a ticket at https://issues.couchbase.com/projects/SPARKC Best Regards, Shixiong Zhu 2015-11-26 9:03 GMT-08:00 Ted Yu : > StoreMode is from Couchbase connector. > >

Re: UDF with 2 arguments

2015-11-26 Thread Daniel Lopes
Thanks Davies and Nathan, I found my error. I was using *ArrayType()* but I needed to pass the element type contained in the array, i.e. *ArrayType(IntegerType())*. Thanks :) On Wed, Nov 25, 2015 at 7:46 PM, Davies Liu wrote: > It works in master (1.6), what's the version of Spark you
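
For comparison, a two-argument UDF in Scala, where the array element type is inferred rather than declared as in the Python ArrayType(IntegerType()) case; the data and column names are made up, and sqlContext is assumed to exist:

    import org.apache.spark.sql.functions.udf
    import sqlContext.implicits._

    val df = Seq((Seq(1, 2, 3), 10), (Seq(4, 5), 20)).toDF("values", "offset")

    // Add an integer column to every element of an array column.
    val addToAll = udf { (xs: Seq[Int], delta: Int) => xs.map(_ + delta) }
    val result = df.withColumn("shifted", addToAll(df("values"), df("offset")))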

possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-26 Thread Andy Davidson
I am using spark-1.5.1-bin-hadoop2.6. I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured spark-env to use python3. I get an exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'. Is there any reason rdd.py should not just set PYTHONHASHSEED?

Re: question about combining small parquet files

2015-11-26 Thread Ruslan Dautkhanov
An interesting compaction approach for small files was discussed recently: http://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/ AFAIK Spark supports views too. -- Ruslan Dautkhanov On Thu, Nov 26, 2015 at 10:43 AM, Nezih Yigitbasi < nyigitb...@netflix

question about combining small parquet files

2015-11-26 Thread Nezih Yigitbasi
Hi Spark people, I have a Hive table that has a lot of small parquet files and I am creating a data frame out of it to do some processing, but since I have a large number of splits/files my job creates a lot of tasks, which I don't want. Basically what I want is the same functionality that Hive pro
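
A common workaround, sketched below: read the table and coalesce to a smaller partition count before processing. The path and count are hypothetical, and sqlContext is assumed to exist:

    val df = sqlContext.read.parquet("/user/hive/warehouse/my_table")

    // Fewer partitions means fewer tasks; coalesce avoids a full shuffle.
    val compacted = df.coalesce(64)
    compacted.write.mode("overwrite").parquet("/tmp/my_table_compacted")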

Re: Help with Couchbase connector error

2015-11-26 Thread Ted Yu
StoreMode is from the Couchbase connector. Where did you obtain the connector? See also http://stackoverflow.com/questions/1096148/how-to-check-the-jdk-version-used-to-compile-a-class-file On Thu, Nov 26, 2015 at 8:55 AM, Eyal Sharon wrote: > Hi , > Great , that gave some directions. But can you

Re: Help with Couchbase connector error

2015-11-26 Thread Eyal Sharon
Hi, Great, that gave some direction. But can you elaborate more, or share a post? I am currently running JDK 7, and my Couchbase too. Thanks! On Thu, Nov 26, 2015 at 6:02 PM, Ted Yu wrote: > This implies version mismatch between the JDK used to build your jar and > the one at runtime. >

building spark from 1.3 release without Hive

2015-11-26 Thread Mich Talebzadeh
Hi, I am not having much luck making Hive run on Spark! I tried to build Spark 1.5.2 without the Hive jars. It worked, but I could not run Hive SQL on Spark. I saw in this link: http://stackoverflow.com/questions/33233431/hive-on-spark-java-lang-noclassdeffounderror-org-apache-hive-spark-cli

Re: MySQLSyntaxErrorException when connect hive to sparksql

2015-11-26 Thread Ted Yu
Have you seen this thread? http://search-hadoop.com/m/q3RTtCoKmv14Hd1H1&subj=Re+Spark+Hive+max+key+length+is+767+bytes On Thu, Nov 26, 2015 at 5:26 AM, wrote: > hi guys, > > when I am trying to connect hive with spark-sql,I got a problem like > below: > > > [root@master spark]# bin/spark-

Re: Help with Couchbase connector error

2015-11-26 Thread Ted Yu
This implies a version mismatch between the JDK used to build your jar and the one at runtime. When building, target JDK 1.7. There are plenty of posts on the web for dealing with such an error. Cheers On Thu, Nov 26, 2015 at 7:31 AM, Eyal Sharon wrote: > Hi, > > I am trying to set a connection to C

Help with Couchbase connector error

2015-11-26 Thread Eyal Sharon
Hi, I am trying to set up a connection to Couchbase. I am at the very beginning, and I got stuck on this exception: Exception in thread "main" java.lang.UnsupportedClassVersionError: com/couchbase/spark/StoreMode : Unsupported major.minor version 52.0 Here is the simple code fragment: val sc =

Re: java.io.FileNotFoundException: Job aborted due to stage failure

2015-11-26 Thread Ted Yu
bq. (Permission denied) Have you checked the permissions for /mnt/md0/var/lib/spark/... ? Cheers On Thu, Nov 26, 2015 at 3:03 AM, Sahil Sareen wrote: > I'm using Spark 1.4.2 with Hadoop 2.7, I tried increasing > spark.shuffle.io.maxRetries to 10 but it didn't help. > > Any ideas on what could be ca

Re: custom inputformat recordreader

2015-11-26 Thread Ted Yu
Please take a look at python/pyspark/tests.py; there are examples using sc.hadoopFile() and sc.newAPIHadoopRDD(). Cheers On Thu, Nov 26, 2015 at 4:50 AM, Patcharee Thongtra < patcharee.thong...@uni.no> wrote: > Hi, > > In python how to use inputformat/custom recordreader? > > Thanks, > Patcharee

Re: Adding new column to Dataframe

2015-11-26 Thread Ted Yu
Forgot to include this line which was at the beginning of the sample: sqlContext = HiveContext(SparkContext()) FYI On Wed, Nov 25, 2015 at 7:57 PM, Vishnu Viswanath < vishnu.viswanat...@gmail.com> wrote: > Thanks Ted, > > It looks like I cannot use row_number then. I tried to run a sample w
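
For context, a Scala sketch of the row_number usage that needs a HiveContext on Spark 1.5 (the function is rowNumber() in 1.4/1.5); the table and column names are hypothetical, and sc is assumed to exist:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.rowNumber
    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc) // window functions need a HiveContext pre-1.6
    val df = sqlContext.table("scores")  // hypothetical table with "group" and "score"

    val w = Window.partitionBy("group").orderBy("score")
    val ranked = df.withColumn("rn", rowNumber().over(w))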

MySQLSyntaxErrorException when connect hive to sparksql

2015-11-26 Thread luohui20001
hi guys, when I am trying to connect Hive with spark-sql, I get a problem like below: [root@master spark]# bin/spark-shell --master local[4] log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory). log4j:WARN Please initialize the log4j system p

custom inputformat recordreader

2015-11-26 Thread Patcharee Thongtra
Hi, In Python, how do I use an InputFormat / custom RecordReader? Thanks, Patcharee

[no subject]

2015-11-26 Thread Dmitry Tolpeko

java.io.FileNotFoundException: Job aborted due to stage failure

2015-11-26 Thread Sahil Sareen
I'm using Spark 1.4.2 with Hadoop 2.7. I tried increasing spark.shuffle.io.maxRetries to 10, but it didn't help. Any ideas on what could be causing this? This is the exception that I am getting: [MySparkApplication] WARN : Failed to execute SQL statement select * from TableS s join TableC c on s.pro

Re: ClassNotFoundException with a uber jar.

2015-11-26 Thread Ali Tajeldin EDU
I'm not 100% sure, but I don't think a jar within a jar will work without a custom class loader. You can perhaps try to use "maven-assembly-plugin" or "maven-shade-plugin" to build your uber/fat jar. Both of these will build a flattened single jar. -- Ali On Nov 26, 2015, at 2:49 AM, Marc de

ClassNotFoundException with a uber jar.

2015-11-26 Thread Marc de Palol
Hi all, I have an uber jar made with Maven; the contents are: my.org.my.classes.Class ... lib/lib1.jar // 3rd party libs lib/lib2.jar I'm using this kind of jar for Hadoop applications and all works fine. I added the Spark libs, Scala and everything needed by Spark, but when I submit this jar to

Re: RE: Spark checkpoint problem

2015-11-26 Thread eric wong
I don't think it is a deliberate design. You may need to call checkpoint() on the RDD before running any action on it, if you want to explicitly checkpoint the RDD. 2015-11-26 13:23 GMT+08:00 wyphao.2007 : > Spark 1.5.2. > > On 2015-11-26 13:19:39, "张志强(旺轩)" wrote: > > What's your spark version? > > *From:* wyphao.
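
A sketch of that ordering, assuming an existing SparkContext sc and a hypothetical checkpoint directory:

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    val rdd = sc.parallelize(1 to 1000).map(_ * 2)
    rdd.checkpoint() // must be called before any job runs on this RDD
    rdd.count()      // the first action materializes the checkpoint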

starting start-master.sh throws "java.lang.ClassNotFoundException: org.slf4j.Logger" error

2015-11-26 Thread Mich Talebzadeh
Hi, I just built Spark without the Hive jars, and when trying to run start-master.sh I get this error in the log. Sounds like it cannot find org.slf4j.Logger: java.lang.ClassNotFoundException: org.slf4j.Logger Spark Command: /usr/java/latest/bin/java -cp /usr/lib/spark/sbin/../conf/:/usr/lib/spark/lib/spark-as

controlling parquet file sizes for faster transfer to S3 from HDFS

2015-11-26 Thread AlexG
Is there a way to control how large the part- files are for a parquet dataset? I'm currently using e.g. results.toDF.coalesce(60).write.mode("append").parquet(outputdir) to manually reduce the number of parts, but this doesn't map linearly to fewer parts: I noticed that coalescing to 30 actually
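
One thing worth trying (a guess, not a confirmed remedy): repartition instead of coalesce. It costs a full shuffle, but tends to produce evenly sized part- files, whereas coalesce only merges existing partitions; results and outputdir are the names from the snippet above:

    // Same write as above, with repartition(30) in place of coalesce(30).
    results.toDF.repartition(30).write.mode("append").parquet(outputdir)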