Re: sbt "org.apache.spark#spark-streaming-kafka_2.11;2.0.0: not found"

2016-12-12 Thread Mattz
I use this in my SBT and it works on 2.0.1: "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.1" On Tue, Dec 13, 2016 at 1:00 PM, Luke Adolph wrote: > Hi all, > My project uses the spark-streaming-kafka module. When I migrate Spark from > 1.6.0 to 2.0.0 and rebuild the project, I run into bel
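For reference, here is a minimal build.sbt sketch along the lines of Mattz's suggestion; it assumes Scala 2.11, Spark 2.0.1, and the Kafka 0.8 connector (in Spark 2.x the old spark-streaming-kafka artifact is split into spark-streaming-kafka-0-8 and spark-streaming-kafka-0-10, which is why the 2.x artifact name in the original error is not found):

```scala
// build.sbt -- minimal sketch, assuming Scala 2.11 and the Kafka 0.8 connector
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"                % "2.0.1" % "provided",
  "org.apache.spark" %% "spark-streaming"           % "2.0.1" % "provided",
  // The un-suffixed spark-streaming-kafka artifact no longer exists for Spark 2.x
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.1"
)
```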

sbt "org.apache.spark#spark-streaming-kafka_2.11;2.0.0: not found"

2016-12-12 Thread Luke Adolph
Hi all, My project uses the spark-streaming-kafka module. When I migrate Spark from 1.6.0 to 2.0.0 and rebuild the project, I run into the error below: [warn] module not found: org.apache.spark#spark-streaming-kafka_2.11;2.0.0 [warn] local: tried [warn] /home/linker/.ivy2/local/org.apache.spark

Re: Optimization for Processing a million of HTML files

2016-12-12 Thread Jörn Franke
In Hadoop you should not have many small files. Put them into a HAR. > On 13 Dec 2016, at 05:42, Jakob Odersky wrote: > > Assuming the bottleneck is IO, you could try saving your files to > HDFS. This will distribute your data and allow for better concurrent > reads. > >> On Mon, Dec 12, 2016 a

Graphx triplet comparison

2016-12-12 Thread balaji9058
Hi, I would like to know how to compare GraphX triplets in Scala. For example, there are two triplet sets: val triplet1 = mainGraph.triplets.filter(condition1) val triplet2 = mainGraph.triplets.filter(condition2) Now I want to compare triplet1 and triplet2 with condition3.
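One way to compare two filtered triplet RDDs is to key each set (for example by source vertex id) and join them, then apply the third condition to each pair. A hedged sketch, assuming a tiny graph with Int vertex attributes and Double edge weights and placeholder conditions (sc is an existing SparkContext; only mainGraph comes from the original post):

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical graph: Int vertex attributes, Double edge weights
val vertices  = sc.parallelize(Seq((1L, 5), (2L, 15), (3L, 25)))
val edges     = sc.parallelize(Seq(Edge(1L, 2L, 0.5), Edge(2L, 3L, 2.0)))
val mainGraph = Graph(vertices, edges)

val triplets1 = mainGraph.triplets.filter(_.attr > 1.0)    // condition1 (placeholder)
val triplets2 = mainGraph.triplets.filter(_.dstAttr > 10)  // condition2 (placeholder)

// Key both sets by source vertex id, join them, then apply condition3 to each pair
val matched = triplets1.map(t => (t.srcId, t))
  .join(triplets2.map(t => (t.srcId, t)))
  .filter { case (_, (t1, t2)) => t1.attr >= t2.attr }     // condition3 (placeholder)

matched.collect().foreach(println)
```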

Dynamic spark sql

2016-12-12 Thread geoHeil
Hi, I am curious how to dynamically generate Spark SQL in the Scala API. http://stackoverflow.com/q/41102347/2587904 From this list val columnsFactor = Seq("bar", "baz") I want to generate multiple withColumn statements: dfWithNewLabels.withColumn("replace", lit(null: String)) .withColu
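A common way to turn a list of column names into a chain of withColumn calls is to foldLeft over the DataFrame. A small sketch under the assumptions of the question (spark is an existing SparkSession; the toy data and the replacement logic are placeholders):

```scala
import org.apache.spark.sql.functions.{col, lit, when}
import spark.implicits._

val dfWithNewLabels = Seq(("a", "x"), (null, "y")).toDF("bar", "baz")
val columnsFactor   = Seq("bar", "baz")

// One withColumn per entry in the list, accumulated with foldLeft
val result = columnsFactor.foldLeft(dfWithNewLabels) { (df, c) =>
  df.withColumn(s"${c}_replaced",
    when(col(c).isNull, lit("unknown")).otherwise(col(c)))  // placeholder logic
}
result.show()
```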

Where is yarn-shuffle.jar in maven?

2016-12-12 Thread Neal Yin
Hi, For the dynamic allocation feature, I need spark-xxx-yarn-shuffle.jar. In my local Spark build, I can see it, but I can't find it in Maven Central. My build script pulls all jars from Maven Central. Is the only option to check this jar into git? Thanks, -Neal

Re: Query in SparkSQL

2016-12-12 Thread vaquar khan
Hi Neeraj, As per my understanding, Spark SQL doesn't support UPDATE statements. Why do you need an update command in Spark SQL? You can run the command in Hive. Regards, Vaquar khan On Mon, Dec 12, 2016 at 10:21 PM, Niraj Kumar wrote: > Hi > > > > I am working on SparkSQL using hiveContext (version 1.

Re: Optimization for Processing a million of HTML files

2016-12-12 Thread Jakob Odersky
Assuming the bottleneck is IO, you could try saving your files to HDFS. This will distribute your data and allow for better concurrent reads. On Mon, Dec 12, 2016 at 3:06 PM, Reth RM wrote: > Hi, > > I have millions of HTML files in a directory, using the "wholeTextFiles" API to > load them and proce

Query in SparkSQL

2016-12-12 Thread Niraj Kumar
Hi, I am working on SparkSQL using hiveContext (version 1.6.2). Can I run the following queries directly in SparkSQL, and if yes, how? update calls set sample = 'Y' where accnt_call_id in (select accnt_call_id from samples); insert into details (accnt_call_id, prdct_cd, prdct_id, dtl_pstn) select accnt_c
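Since Spark SQL at this version has no UPDATE, the usual workaround is to rewrite the data: join against the subquery's ids, derive the new column value, and write the result back out. A hedged sketch, reusing the column and table names from the query and the hiveContext from the post (the target table name "calls" is only inferred from the statement):

```scala
import org.apache.spark.sql.functions.{col, lit, when}

// Rows whose accnt_call_id appears in samples should get sample = 'Y'
val sampleIds = hiveContext.table("samples")
  .select("accnt_call_id").distinct()
  .withColumn("in_samples", lit(true))

val updatedCalls = hiveContext.table("calls")
  .join(sampleIds, Seq("accnt_call_id"), "left_outer")
  .withColumn("sample", when(col("in_samples"), lit("Y")).otherwise(col("sample")))
  .drop("in_samples")

// Spark SQL cannot update a table in place, so the result is written out instead
updatedCalls.write.mode("overwrite").saveAsTable("calls_updated")
```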

Re: wholeTextFiles()

2016-12-12 Thread Jakob Odersky
Also, in case the issue was not due to the string length (that limit is still real and may bite you later), it may be due to some other indexing issues which are currently being worked on here: https://issues.apache.org/jira/browse/SPARK-6235 On Mon, Dec 12, 2016 at 8:18 PM, Jakob Odersky w

Re: wholeTextFiles()

2016-12-12 Thread Jakob Odersky
Hi Pradeep, I'm afraid you're running into a hard Java issue. Strings are indexed with signed integers and can therefore not be longer than approximately 2 billion characters. Could you use `textFile` as a workaround? It will give you an RDD of the files' lines instead. In general, this guide htt
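A minimal sketch of the suggested workaround: textFile yields one record per line, so no single String has to hold an entire multi-gigabyte file (the path and partition count are placeholders; sc is the SparkContext):

```scala
// Each record is one line, so a 2 GB file no longer needs a single 2-billion-character String
val lines = sc.textFile("hdfs:///data/big-file.txt", 16)

println(s"line count: ${lines.count()}")
```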

wholeTextFiles()

2016-12-12 Thread Pradeep
Hi, Why is there a restriction on the maximum file size that can be read by the wholeTextFiles() method? I can read a 1.5 GB file but get an out-of-memory error for a 2 GB file. Also, how can I raise this as a defect in the Spark JIRA? Can someone please guide me? Thanks, Pradeep

Optimization for Processing a million of HTML files

2016-12-12 Thread Reth RM
Hi, I have millions of HTML files in a directory and use the "wholeTextFiles" API to load and process them. Right now, testing it with 40k records, it waits a minimum of 8-9 minutes just to load the files (wholeTextFiles). What are some recommended optimizations? Should I consider any
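Building on the replies (put the files on HDFS and raise read parallelism), a hedged sketch; the path, partition count, and the processing step are placeholders:

```scala
// Load many small HTML files from HDFS with an explicit minimum number of partitions,
// so the read is spread over more tasks than the default would give
val pages = sc.wholeTextFiles("hdfs:///data/html", minPartitions = 200)

val parsed = pages.map { case (path, html) =>
  (path, html.length)  // placeholder for the real HTML processing
}
parsed.take(5).foreach(println)
```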

RE: get corrupted rows using columnNameOfCorruptRecord

2016-12-12 Thread Yehuda Finkelstein
OK, got it. The destination column must already exist in the data frame; I thought it would create a new column in the data frame. Thank you for your help. Yehuda *From:* Hyukjin Kwon [mailto:gurwls...@gmail.com] *Sent:* Wednesday, December 07, 2016 12:19 PM *To:* Yehuda Finkelstein *C
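For anyone hitting the same thing: when an explicit schema is supplied, the corrupt-record column has to be declared in it before columnNameOfCorruptRecord will populate it. A hedged Spark 2.x sketch with hypothetical field names and path:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().getOrCreate()

// The destination column (_corrupt_record here) must be part of the schema
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("_corrupt_record", StringType, nullable = true)
))

val df = spark.read
  .schema(schema)
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("/path/to/data.json")

val badRows = df.filter("_corrupt_record is not null")
```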

Re: Few questions on reliability of accumulators value.

2016-12-12 Thread Daniel Siegmann
Accumulators are generally unreliable and should not be used. The answer to (2) and (4) is yes. The answer to (3) is both. Here's a more in-depth explanation: http://imranrashid.com/posts/Spark-Accumulators/ On Sun, Dec 11, 2016 at 11:27 AM, Sudev A C wrote: > Please help. > Anyone, any thought
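The crux of the linked post is that accumulator updates made inside transformations can be applied more than once when tasks are retried or a stage is recomputed, whereas updates made inside actions are counted only for the first successful attempt of each task. A small sketch of the distinction, using the Spark 2.x accumulator API:

```scala
val inTransformation = sc.longAccumulator("counted-in-map")
val inAction         = sc.longAccumulator("counted-in-foreach")

val data = sc.parallelize(1 to 1000)

// Update inside a transformation: can over-count if the stage is recomputed
// (e.g. the RDD is not cached and several actions re-run the map)
val doubled = data.map { x => inTransformation.add(1L); x * 2 }

// Update inside an action: only the first successful attempt of each task is counted
doubled.foreach { _ => inAction.add(1L) }

println(s"in map: ${inTransformation.value}, in foreach: ${inAction.value}")
```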

Re: Tensor Flow

2016-12-12 Thread tog
Tensorframes is a project from Databricks (https://github.com/databricks/tensorframes). No commits for a couple of months, though. Does anyone have insight into the status of the project? On Mon, 12 Dec 2016 at 19:31 Meeraj Kunnumpurath wrote: > Apologies. okay, I will have a look at Tensor Fra

unsubscribe

2016-12-12 Thread Chris Harvey
unsubscribe

Re: [Spark Core] - Spark dynamoDB integration

2016-12-12 Thread Neil Jonkers
Hello, Good examples on how to interface with DynamoDB from Spark here: https://aws.amazon.com/blogs/big-data/using-spark-sql-for-etl/ https://aws.amazon.com/blogs/big-data/analyze-your-data-on-amazon-dynamodb-with-apache-spark/ Thanks On Mon, Dec 12, 2016 at 7:56 PM, Marco Mistroni wrote: >

Tensor Flow

2016-12-12 Thread Meeraj Kunnumpurath
Hello, Is there anything available in Spark similar to Tensor Flow? I am looking at a mechanism for performing nearest neighbour search on vectorized image data. Regards -- Meeraj Kunnumpurath, Director and Executive Principal, Service Symphony Ltd, 00 44 7702 693597, 00 971 50 409 0169, mee...

Livy VS Spark Job Server

2016-12-12 Thread shyla deshpande
It will be helpful if someone can compare Livy and Spark Job Server. Thanks

Re: [Spark Core] - Spark dynamoDB integration

2016-12-12 Thread Marco Mistroni
Hi, if it can help: 1. Check the Java docs for when that method was introduced. 2. Are you building a fat jar? Check which libraries have been included... some other dependency might have forced an old copy to be included. 3. If you take the code outside Spark, does it work successfully? 4. Send a short sample.

Re: Spark Streaming with Kafka

2016-12-12 Thread Anton Okolnychyi
Thanks for all your replies, now I have a complete picture. 2016-12-12 16:49 GMT+01:00 Cody Koeninger : > http://spark.apache.org/docs/latest/streaming-kafka-0-10- > integration.html#creating-a-direct-stream > > Use a separate group id for each stream, like the docs say. > > If you're doing mul

Re: [Spark log4j] Turning off log4j while scala program runs on spark-submit

2016-12-12 Thread Irving Duran
Hi - I have a question about log4j while running on spark-submit. I would like to have Spark only show errors when I am running spark-submit, and I would like to accomplish this without having to edit the log4j config file in $SPARK_HOME. Is there a way to do this? I found this and it only works on spar
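One way to do this without touching the log4j.properties in $SPARK_HOME is to set the level from the application itself; a hedged sketch (the app name and the chosen logger names are just examples):

```scala
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("quiet-app").getOrCreate()

// Option 1: Spark's own helper, sets the log level for this SparkContext
spark.sparkContext.setLogLevel("ERROR")

// Option 2: silence specific noisy loggers directly through log4j
Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)
```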

Re: Spark 2 or Spark 1.6.x?

2016-12-12 Thread Cody Koeninger
You certainly can use stable versions of Kafka brokers with Spark 2.0.2; why would you think otherwise? On Mon, Dec 12, 2016 at 8:53 AM, Amir Rahnama wrote: > Hi, > > You need to describe more. > > For example, in Spark 2.0.2, you can't use stable versions of Apache Kafka. > > In general, I would

Re: Spark Streaming with Kafka

2016-12-12 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#creating-a-direct-stream Use a separate group id for each stream, like the docs say. If you're doing multiple output operations and aren't caching, Spark is going to read from Kafka again each time, and if some of those re
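To make the advice concrete, a hedged sketch of the 0-10 direct stream from the linked guide, with a distinct group.id and the records mapped to key/value pairs and cached so several output operations do not each re-read Kafka (brokers, topic, group id, and batch interval are placeholders):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("kafka-direct-example")
val ssc  = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "stream-a",          // use a distinct group id per stream
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Array("my-topic"), kafkaParams)
)

// Map away from ConsumerRecord and cache, so multiple outputs reuse the same batch
val pairs = stream.map(record => (record.key, record.value))
pairs.cache()
pairs.foreachRDD(rdd => println(s"output 1: ${rdd.count()} records"))
pairs.foreachRDD(rdd => println(s"output 2: ${rdd.count()} records"))

ssc.start()
ssc.awaitTermination()
```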

Re: Spark 2 or Spark 1.6.x?

2016-12-12 Thread Amir Rahnama
Hi, You need to describe more. For example, in Spark 2.0.2, you can't use stable versions of Apache Kafka. In general, I would say start with 2.0.2- On Mon, Dec 12, 2016 at 7:34 AM, Lohith Samaga M wrote: > Hi, > > I am new to Spark. I would like to learn Spark. > >

Re: using replace function

2016-12-12 Thread Bas Harenslak
There is no replace() function. You could use regexp_replace() for this: SELECT regexp_replace(column_name, 'pattern', 'replacement') FROM table_name Documentation: https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/functions.html#regexp_replace(org.apache.spark.sql.Column,%20java
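The DataFrame API exposes the same function in org.apache.spark.sql.functions; a small sketch with placeholder column, pattern, and replacement names (df is assumed to be an existing DataFrame):

```scala
import org.apache.spark.sql.functions.{col, regexp_replace}

// Replace every match of the (placeholder) pattern in column_name
val cleaned = df.withColumn("column_name",
  regexp_replace(col("column_name"), "pattern", "replacement"))
```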

Re: About transformations

2016-12-12 Thread Rishikesh Teke
Hi, Spark is very efficient in SQL because of https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html You can see all the metrics of all your transforma

Re: WARN util.NativeCodeLoader

2016-12-12 Thread Steve Loughran
> On 8 Dec 2016, at 06:38, baipeng wrote: > > Hi ALL > > I’m new to Spark. When I execute spark-shell, the first line is as > follows: WARN util.NativeCodeLoader: Unable to load native-hadoop library for your > platform... using builtin-java classes where applicable. > Can someone tell

[Spark Core] - Spark dynamoDB integration

2016-12-12 Thread Pratyaksh Sharma
Hey, I am using Apache Spark for a streaming application. I am trying to store the processed data into DynamoDB using the Java SDK. Getting the following exception - 16/12/08 23:23:43 WARN TaskSetManager: Lost task 0.0 in stage 1.0: java.lang.NoSuchMethodError: com.amazonaws.SDKGlobalConfiguration.is

Graphx triplet loops causing null pointer exception

2016-12-12 Thread balaji9058
Hi, I am getting a null pointer exception when I execute a triplet loop inside another triplet loop. This works fine on its own: for (mainTriplet <- mainGraph.triplets) { println(mainTriplet.dstAttr.name) } And this works fine on its own: for (subTriplet <- subGraph.triplets) { println(subTriplet.dstA
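The null pointer is the usual symptom of nesting one RDD operation inside another: the inner graph's triplets RDD cannot be used from inside a task that is already iterating the outer triplets. A hedged sketch of one workaround, collecting and broadcasting the smaller triplet set first (mainGraph, subGraph, and the name attribute are taken from the post; sc is the SparkContext, and the collected set is assumed to fit in driver memory):

```scala
// Collect the smaller triplet set to the driver and broadcast it, so the outer
// foreach never touches another RDD from inside a task (which is what triggers the NPE)
val subTriplets = sc.broadcast(subGraph.triplets.collect())

mainGraph.triplets.foreach { mainTriplet =>
  subTriplets.value.foreach { subTriplet =>
    if (mainTriplet.dstAttr.name == subTriplet.dstAttr.name) {
      println(s"match on ${mainTriplet.dstAttr.name}")
    }
  }
}
```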