Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-26 Thread Marco Mistroni
Hi Raymond, run this command and it should work, provided you have Kafka set up as well on localhost at port 2181: spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.1 kafka_wordcount.py localhost:2181 test But I suggest, if you are a beginner, to use Spark examples' wor

Re: Spark runs out of memory with small file

2017-02-26 Thread Yan Facai
Hi, Tremblay. Your file is in .gz format, which is not splittable for Hadoop. Perhaps the file is loaded by only one executor. How many executors do you start? Perhaps the repartition method could solve it, I guess. On Sun, Feb 26, 2017 at 3:33 AM, Henry Tremblay wrote: > I am reading in a single smal
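
A minimal PySpark sketch of the suggestion above, with a hypothetical path and partition count (a gzipped text file is read by a single task, so repartitioning after the read spreads later work across executors):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gz-repartition").getOrCreate()

    # A .gz text file is not splittable, so textFile reads it with a single task.
    lines = spark.sparkContext.textFile("/data/input.txt.gz")

    # Repartition so parsing and downstream work run on more executors.
    lines = lines.repartition(16)
    print(lines.getNumPartitions())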

attempting to map Dataset[Row]

2017-02-26 Thread Stephen Fletcher
I'm attempting to perform a map on a Dataset[Row] but getting an error on decode when attempting to pass a custom encoder. My code looks similar to the following: val source = spark.read.format("parquet").load("/emrdata/sources/very_large_ds") source.map{ row => { val key = row(0) }

Re: attempting to map Dataset[Row]

2017-02-26 Thread Stephen Fletcher
Sorry, here's the whole code: val source = spark.read.format("parquet").load("/emrdata/sources/very_large_ds") implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[(Any,ArrayBuffer[Row])] source.map{ row => { val key = row(0) val buff = new ArrayBuffer[Row]() buff += row

Saving Structured Streaming DF to Hive Partitioned table

2017-02-26 Thread nimrodo
Hi, I want to load a stream of CSV files to a partitioned Hive table called myTable. I tried using Spark 2 Structured Streaming to do that: val spark = SparkSession .builder .appName("TrueCallLoade") .enableHiveSupport() .config("hive.exec.dynamic.partition.mode", "non-str
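
The original code is Scala; as a hedged PySpark sketch of one common workaround, Structured Streaming in Spark 2.x can write partitioned Parquet under the path an external Hive table points at. The schema, paths, and partition column below are hypothetical, and new partitions would still have to be registered in the metastore (for example with MSCK REPAIR TABLE) before myTable sees them:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = (SparkSession.builder.appName("csv-stream")
             .enableHiveSupport().getOrCreate())

    # Hypothetical schema for the incoming CSV files.
    schema = StructType([StructField("dt", StringType()),
                         StructField("value", StringType())])

    stream = (spark.readStream.schema(schema)
              .option("header", "true").csv("/data/incoming/"))

    query = (stream.writeStream
             .format("parquet")
             .partitionBy("dt")                           # hypothetical partition column
             .option("path", "/warehouse/mytable/")       # location of the external table
             .option("checkpointLocation", "/tmp/cp/mytable")
             .start())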

Re: Kafka Streaming and partitioning

2017-02-26 Thread tonyye
Hi Dave, I had the same question and was wondering if you had found a way to do the join without causing a shuffle?

Re: Disable Spark SQL Optimizations for unit tests

2017-02-26 Thread Stefan Ackermann
I found some ways to get faster unit tests. In the meantime they had gone up to about an hour. Apparently defining columns in a for loop makes Catalyst very slow, as it blows up the logical plan with many projections: final def castInts(dfIn: DataFrame, castToInts: String*): DataFrame = { va
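
A hedged PySpark rendering of the same idea (the thread's helper is Scala): building every cast into one select keeps the plan at a single projection instead of one per column. The DataFrame and column list here are made up:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cast-in-one-select").getOrCreate()
    df = spark.createDataFrame([("1", "2", "x")], ["a", "b", "c"])  # made-up data

    cast_to_int = {"a", "b"}                                        # made-up column list
    projection = [F.col(c).cast("int").alias(c) if c in cast_to_int else F.col(c)
                  for c in df.columns]
    df = df.select(*projection)
    df.printSchema()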

Re: RDD blocks on Spark Driver

2017-02-26 Thread Prithish
Thanks for the responses. I am running this on Amazon EMR, which runs the YARN cluster manager. On Sat, Feb 25, 2017 at 4:45 PM, liangyhg...@gmail.com wrote: > Hi, > I think you are using the local mode of Spark. There > are mainly four modes, which are local, standalon

Custom log4j.properties on AWS EMR

2017-02-26 Thread Prithish
Hoping someone can answer this. I am unable to override and use a custom log4j.properties on Amazon EMR. I am running Spark on EMR (YARN) and have tried all the below combinations in spark-submit to try to use the custom log4j. In client mode: --driver-java-options "-Dlog4j.configuration=hdfs

Re: Spark runs out of memory with small file

2017-02-26 Thread Henry Tremblay
The file is so small that a standalone Python script, independent of Spark, can process the file in under a second. Also, the following fails: 1. Read the whole file in with wholeFiles 2. use flatMap to get 50,000 rows that look like: Row(id="path", line="line") 3. Save the results as CSV
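
For reference, a rough PySpark sketch of the three steps as described, assuming "wholeFiles" refers to wholeTextFiles; the paths and the per-line parse are placeholders:

    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.appName("wholefile-flatmap").getOrCreate()

    def parse_file(path_and_content):
        path, content = path_and_content
        # Placeholder parse: one Row per line; the real logic depends on the format.
        return [Row(id=path, line=line) for line in content.splitlines()]

    rows = spark.sparkContext.wholeTextFiles("/data/input/").flatMap(parse_file)
    spark.createDataFrame(rows).write.csv("/data/output/", mode="overwrite")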

Are we still dependent on Guava jar in Spark 2.1.0 as well?

2017-02-26 Thread kant kodali
Are we still dependent on Guava jar in Spark 2.1.0 as well (Given Guava jar incompatibility issues)?

Re: Custom log4j.properties on AWS EMR

2017-02-26 Thread Steve Loughran
try giving a resource of a file in the JAR, e.g. add a file "log4j-debugging.properties" into the jar, and give a config option of -Dlog4j.configuration=/log4j-debugging.properties (maybe also try without the "/"). On 26 Feb 2017, at 16:31, Prithish wrote: Hoping s

Re: Spark runs out of memory with small file

2017-02-26 Thread Gourav Sengupta
Hi Henry, Those guys in Databricks training are nuts and still use Spark 1.x for their exams. Learning SPARK is a VERY VERY VERY old way of solving problems using SPARK. The core engine of SPARK, which even I understand, has gone through several fundamental changes. Just try reading the file usi

Re: In Spark streaming, will saved kafka offsets become invalid if I change the number of partitions in a kafka topic?

2017-02-26 Thread shyla deshpande
Please help! On Sat, Feb 25, 2017 at 11:10 PM, shyla deshpande wrote: > I am committing offsets to Kafka after my output has been stored, using the > commitAsync API. > > My question is if I increase/decrease the number of kafka partitions, will > the saved offsets become invalid. > > Thanks

Re: Spark runs out of memory with small file

2017-02-26 Thread Henry Tremblay
I am actually using Spark 2.1 and trying to solve a real-life problem. Unfortunately, some of the discussion of my problem went offline, and then I started a new thread. Here is my problem. I am parsing crawl data which exists in a flat file format. It looks like this: u'WARC/1.0', u'WARC-

Re: Spark runs out of memory with small file

2017-02-26 Thread Koert Kuipers
Using wholeFiles to process formats that cannot be split per line is not "old", and there are plenty of problems for which RDD is still better suited than Dataset or DataFrame currently (this might change in the near future when Dataset gets some crucial optimizations fixed). On Sun, Feb 26, 2017 at

Re: Spark runs out of memory with small file

2017-02-26 Thread ayan guha
Hi We are doing similar stuff, but with large number of small-ish files. What we do is write a function to parse a complete file, similar to your parse-file function. But we use yield instead of return, and flatMap on top of it. Can you give it a try and let us know if it works? On Mon, Feb 27, 2017 at 9:
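
A short, self-contained sketch of that suggestion, with hypothetical paths and row shape: the generator function is passed to flatMap, which consumes it lazily instead of materializing one big list per file.

    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.appName("yield-flatmap").getOrCreate()

    def parse_file(path_and_content):
        path, content = path_and_content
        for line in content.splitlines():
            yield Row(id=path, line=line)   # yielded one at a time, not returned as a list

    rows = spark.sparkContext.wholeTextFiles("/data/input/").flatMap(parse_file)
    print(rows.take(1))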

SPark - YARN Cluster Mode

2017-02-26 Thread ayan guha
Hi, I am facing an issue with cluster mode with pyspark. Here is my code: conf = SparkConf() conf.setAppName("Spark Ingestion") conf.set("spark.yarn.queue","root.Applications") conf.set("spark.executor.instances","50") conf.set("spark.executor.memory","22g"

Re: SPark - YARN Cluster Mode

2017-02-26 Thread ayan guha
Also, I wanted to add that if I specify the conf on the command line, it seems to work. For example, if I use spark-submit --master yarn --deploy-mode cluster --conf spark.yarn.queue=root.Application ayan_test.py 10 then it goes to the correct queue. Any help would be great. Best Ayan On Mon,

Re: Spark SQL table authority control?

2017-02-26 Thread yuyong . zhai
https://issues.apache.org/jira/browse/SPARK-8321 翟玉勇, Data Architecture, ELEME Inc. Email: yuyong.z...@ele.me | Mobile: 15221559674 http://ele.me Original message From: 李斌松 To: user Sent: Feb 26, 2017 (Sunday) 11:50 Subject: Spark SQL table authority control? Through the JDBC connection s

java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext

2017-02-26 Thread lk_spark
Hi all, I want to extract some info from Kafka using Spark Streaming. My code looks like: val keyword = "" val system = "dmp" val datetime_idx = 0 val datetime_length = 23 val logLevelBeginIdx = datetime_length + 2 - 1 val logLevelMaxLenght = 5 val lines =

Re: Spark runs out of memory with small file

2017-02-26 Thread Henry Tremblay
Not sure where you want me to put yield. My first try caused an error in Spark that it could not pickle generator objects. On 02/26/2017 03:25 PM, ayan guha wrote: Hi We are doing similar stuff, but with large number of small-ish files. What we do is write a function to parse a complete file

Getting unrecoverable exception: java.lang.NullPointerException when trying to find wordcount in kafka topic

2017-02-26 Thread Mina Aslani
Hi, I am trying to submit a job to Spark to count the number of words in a specific Kafka topic, but I get the below exception when I check the log: . failed with unrecoverable exception: java.lang.NullPointerException The command that I run follows: ./scripts/dm-spark-submit.sh --class org.apache

Re: Custom log4j.properties on AWS EMR

2017-02-26 Thread Prithish
Steve, I tried that, but it didn't work. Any other ideas? On Mon, Feb 27, 2017 at 1:42 AM, Steve Loughran wrote: > try giving a resource of a file in the JAR, e.g add a file > "log4j-debugging.properties into the jar, and give a config option of > -Dlog4j.configuration=/log4j-debugging.properties

Thrift server does not respect hive.server2.enable.doAs=true

2017-02-26 Thread yuyong . zhai

Re: Spark runs out of memory with small file

2017-02-26 Thread Pavel Plotnikov
Hi Henry, In the first example the dict d always contains only one value because the_Id is the same; in the second case the dict grows very quickly. So I suggest first applying a map function to split your file string into rows, then repartitioning, and then applying the custom logic. Example: def splitf(
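
A loose PySpark sketch of that ordering, with hypothetical paths and a stand-in for the custom logic; the body of splitf is a guess, since the message is cut off:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("split-then-repartition").getOrCreate()

    def splitf(path_and_content):
        _, content = path_and_content
        return content.splitlines()          # one record per line

    def parse_line(line):
        return line.upper()                  # stand-in for the real custom logic

    rows = (spark.sparkContext
            .wholeTextFiles("/data/input/")
            .flatMap(splitf)                 # split first
            .repartition(32)                 # then spread rows across executors
            .map(parse_line))                # heavy per-row work now runs in parallel
    print(rows.count())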

How to do multiple join in pyspark

2017-02-26 Thread lovemoon
This is my code: cfg = SparkConf().setAppName('MyApp') spark = SparkSession.builder.config(conf=cfg).getOrCreate() rdd1 = spark.createDataFrame([(1, 'a'), (2, 'b'), (4, 'c')], ['idx', 'val']) rdd1.registerTempTable('rdd1') rdd2 = spark.createDataFrame([(1, 2, 100), (1, 3, 200), (2, 3,
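
The message is cut off before the joins themselves, but chaining DataFrame joins in PySpark looks roughly like this; the second and third DataFrames and their columns are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("multi-join").getOrCreate()

    df1 = spark.createDataFrame([(1, 'a'), (2, 'b'), (4, 'c')], ['idx', 'val'])
    df2 = spark.createDataFrame([(1, 100), (2, 200)], ['idx', 'amount'])   # hypothetical
    df3 = spark.createDataFrame([(1, 'x'), (4, 'y')], ['idx', 'flag'])     # hypothetical

    # Each join returns a new DataFrame, so joins can simply be chained.
    result = df1.join(df2, 'idx', 'inner').join(df3, 'idx', 'left')
    result.show()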