[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile

2015-08-27 Thread Jacek Laskowski
Hi, Is this a known issue of building Spark from today's sources using Scala 2.11? I did the following: ➜ spark git:(master) ✗ git rev-parse HEAD de0278286cf6db8df53b0b68918ea114f2c77f1f ➜ spark git:(master) ✗ ./dev/change-scala-version.sh 2.11 ➜ spark git:(master) ✗ ./build/mvn -Pyarn -Phadoo

Re: [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile

2015-08-27 Thread Sean Owen
Hm, if anything that would be related to my change at https://issues.apache.org/jira/browse/SPARK-9613. Let me investigate. There's some chance there is some Scala 2.11-specific problem here. On Thu, Aug 27, 2015 at 8:51 AM, Jacek Laskowski wrote: > Hi, > > Is this a known issue of building Spark

Re: [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile

2015-08-27 Thread Jacek Laskowski
Hi, I'm trying to nail it down myself, too. Is there anything relevant to help on my side? Pozdrawiam, Jacek -- Jacek Laskowski | http://blog.japila.pl | http://blog.jaceklaskowski.pl Follow me at https://twitter.com/jaceklaskowski Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski

Re: [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile

2015-08-27 Thread Jacek Laskowski
Hi, Sean helped me offline and I sent https://github.com/apache/spark/pull/8479 for review. That's the only breaking place for the build I could find. Tested with 2.10 and 2.11. Pozdrawiam, Jacek -- Jacek Laskowski | http://blog.japila.pl | http://blog.jaceklaskowski.pl Follow me at https://twit

Selecting different levels of nested data records during one select?

2015-08-27 Thread Ewan Leith
Hello, I'm trying to query a nested data record of the form:
root
 |-- userid: string (nullable = true)
 |-- datarecords: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- system: boolean (nullable = true)
 |    |    |--

Re: suggest configuration for debugging spark streaming, kafka

2015-08-27 Thread Jacek Laskowski
On Wed, Aug 26, 2015 at 11:02 PM, Joanne Contact wrote: > Hi, I have an Ubuntu box with 4GB memory and dual cores. Do you think it > won't be enough to run spark streaming and kafka? I'm trying to install > standalone-mode Spark and Kafka so I can debug them in an IDE. Do I need to > install hadoop? Hi, It sho

Re: query avro hive table in spark sql

2015-08-27 Thread ponkin
Can you select something from this table using Hive? And also could you post your spark code which leads to this exception. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/query-avro-hive-table-in-spark-sql-tp24462p24468.html Sent from the Apache Spark User

Re: error accessing vertexRDD

2015-08-27 Thread ponkin
Check permissions for the user which runs spark-shell. "Permission denied" means that you do not have permissions to /tmp. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/error-accessing-vertexRDD-tp24466p24469.html Sent from the Apache Spark User List mailing li

Re: SQLContext load. Filtering files

2015-08-27 Thread Akhil Das
Have a look at Spark Streaming. You can make use of ssc.fileStream. E.g.: val avroStream = ssc.fileStream[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](input) You can also specify a filter function
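A minimal Scala sketch of the fileStream overload that takes a path filter, assuming an HDFS input directory, Avro generic records, and a placeholder filter predicate (directory name and filter logic are not from the thread):

  import org.apache.avro.generic.GenericRecord
  import org.apache.avro.mapred.AvroKey
  import org.apache.avro.mapreduce.AvroKeyInputFormat
  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.io.NullWritable
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(30))

  // Only pick up completed .avro files; skip temporary/partial files.
  val avroOnly = (path: Path) => path.getName.endsWith(".avro") && !path.getName.startsWith("_")

  val avroStream = ssc.fileStream[AvroKey[GenericRecord], NullWritable,
    AvroKeyInputFormat[GenericRecord]]("/data/input", avroOnly, newFilesOnly = true)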

Re: How to set the number of executors and tasks in a Spark Streaming job in Mesos

2015-08-27 Thread Akhil Das
How many Mesos slaves do you have, and how many cores do you have in total? sparkConf.set("spark.mesos.coarse", "true") sparkConf.set("spark.cores.max", "128") These two configurations are sufficient. Now regarding the active tasks, how many partitions are you seeing for that jo

Re: creating data warehouse with Spark and running query with Hive

2015-08-27 Thread Akhil Das
Can you paste the stack trace? Is it complaining that the directory already exists? Thanks Best Regards On Thu, Aug 20, 2015 at 11:23 AM, Jeetendra Gangele wrote: > HI All, > > I have data in an HDFS partition with Year/month/data/event_type. And I am > creating a hive table with this data, this

Re: load NULL Values in RDD

2015-08-27 Thread Akhil Das
Are you reading the column data from a SQL database? If so, have a look at the JdbcRDD. Thanks Best Regards On Fri, Aug 21, 2015 at 1:25 AM, SAHA, DEBOBROTA wrote: > Hi , > > > > Can anyone help me in loading a column that may or may not have NULL
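A minimal Scala sketch of loading a possibly-NULL column with JdbcRDD; the connection string, table, and column names are hypothetical, and the nullable column is wrapped in Option so NULLs become None:

  import java.sql.DriverManager
  import org.apache.spark.rdd.JdbcRDD

  val rows = new JdbcRDD(
    sc,
    () => DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/SVC", "user", "pass"),
    "SELECT id, maybe_null_col FROM my_table WHERE id >= ? AND id <= ?",
    1L, 1000000L, 10,                                 // lower bound, upper bound, partitions
    rs => (rs.getLong(1), Option(rs.getString(2))))   // Option(null) becomes None for NULL values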

Re: org.apache.hadoop.security.AccessControlException: Permission denied when access S3

2015-08-27 Thread Akhil Das
Did you try with s3a? Also make sure your key does not have any wildcard chars in it. Thanks Best Regards On Fri, Aug 21, 2015 at 2:03 AM, Shuai Zheng wrote: > Hi All, > > > > I try to access S3 file from S3 in Hadoop file format: > > > > Below is my code: > > > > Configura
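A sketch of switching to the s3a filesystem with explicit credentials; the bucket, path, and credential values are placeholders, and the fs.s3a.* property names are the standard Hadoop 2.6+ ones:

  sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
  sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
  val lines = sc.textFile("s3a://my-bucket/path/to/data")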

Re: How frequently should full gc we expect

2015-08-27 Thread Akhil Das
I used to hit "GC overhead limit exceeded" and sometimes a GC heap error too. Thanks Best Regards On Sat, Aug 22, 2015 at 2:44 AM, java8964 wrote: > In the test job I am running in Spark 1.3.1 in our stage cluster, I can > see the following information on the application stage information: > > MetricMin

Re: SQLContext load. Filtering files

2015-08-27 Thread Masf
Thanks Akhil, I will have a look. I have a doubt regarding Spark Streaming and fileStream. If Spark Streaming crashes, and while Spark was down new files are created in the input folder, how can I process these files when Spark Streaming is launched again? Thanks. Regards. Miguel. On Thu, Aug 27,

Re: sparkStreaming how to work with partitions,how tp create partition

2015-08-27 Thread Akhil Das
You can use the driver UI and click on Jobs -> Stages to see the number of partitions being created for that job. If you want to increase the partitions, then you could do a .repartition too. With the directStream API I guess the # of partitions in Spark will be equal to the number of partitions in th
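A minimal Scala sketch of repartitioning a direct Kafka stream; the broker list, topic name, and partition count are placeholders, and ssc is assumed to be an existing StreamingContext:

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils

  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("events"))

  // The direct stream starts with one Spark partition per Kafka partition;
  // repartition spreads the records wider before heavier downstream work.
  val widened = stream.repartition(64)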

RE: Selecting different levels of nested data records during one select?

2015-08-27 Thread Ewan Leith
I've just come across https://forums.databricks.com/questions/893/how-do-i-explode-a-dataframe-column-containing-a-c.html Which appears to get us started using explode on nested datasets as arrays correctly, thanks. Ewan From: Ewan Leith [mailto:ewan.le...@realitymine.com] Sent: 27 August 2015
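A minimal Scala sketch of that explode approach against the schema from the earlier message (the file path and the "rec" alias are placeholders):

  import org.apache.spark.sql.functions.explode

  val df = sqlContext.read.json("/data/records.json")
  // One output row per element of the datarecords array, paired with its userid.
  val exploded = df.select(df("userid"), explode(df("datarecords")).as("rec"))
  exploded.select("userid", "rec.name", "rec.system").show()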

Re: SQLContext load. Filtering files

2015-08-27 Thread Akhil Das
If you have enabled checkpointing, Spark will handle that for you. Thanks Best Regards On Thu, Aug 27, 2015 at 4:21 PM, Masf wrote: > Thanks Akhil, I will have a look. > > I have a doubt regarding Spark Streaming and fileStream. If Spark > Streaming crashes and while Spark was down new file
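A minimal Scala sketch of the checkpoint-and-recover pattern, assuming a placeholder checkpoint path and app name; on restart, getOrCreate rebuilds the context (including which input files were already seen) from the checkpoint:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val checkpointDir = "hdfs:///checkpoints/file-ingest"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("file-ingest")
    val ssc = new StreamingContext(conf, Seconds(60))
    ssc.checkpoint(checkpointDir)
    // define the fileStream and transformations here
    ssc
  }

  val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
  ssc.start()
  ssc.awaitTermination()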

Driver running out of memory - caused by many tasks?

2015-08-27 Thread andrew.rowson
I have a spark v.1.4.1 on YARN job where the first stage has ~149,000 tasks (it’s reading a few TB of data). The job itself is fairly simple - it’s just getting a list of distinct values: val days = spark .sequenceFile(inputDir, classOf[KeyClass], classOf[ValueClass]) .sample(wit

RE: Driver running out of memory - caused by many tasks?

2015-08-27 Thread Ewan Leith
Are you using the Kryo serializer? If not, have a look at it, it can save a lot of memory during shuffles https://spark.apache.org/docs/latest/tuning.html I did a similar task and had various issues with the volume of data being parsed in one go, but that helped a lot. It looks like the main di
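A short sketch of enabling Kryo and registering the classes from this job; KeyClass and ValueClass are the poster's own types, used here only as stand-ins:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Registering avoids writing the full class name alongside every serialized record.
    .registerKryoClasses(Array(classOf[KeyClass], classOf[ValueClass]))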

Re: Driver running out of memory - caused by many tasks?

2015-08-27 Thread andrew.rowson
I should have mentioned: yes I am using Kryo and have registered KeyClass and ValueClass. I guess it’s not clear to me what is actually taking up space on the driver heap - I can’t see how it can be data with the code that I have. On 27/08/2015 12:09, "Ewan Leith" wrote: >Are you using the

RE: query avro hive table in spark sql

2015-08-27 Thread java8964
What version of Hive are you using? And did you compile against the right version of Hive when you compiled Spark? BTW, spark-avro works great in our experience, but still, some non-tech people just want to use it as a SQL shell in Spark, like the Hive CLI. Yong From: mich...@databricks.com Date: Wed, 2

sbt error -- before Terasort compilation

2015-08-27 Thread Shreeharsha G Neelakantachar
Hi, I am getting the sbt error below when trying to compile a TeraSort sbt file with the following contents. Please suggest what can be tried. (TeraSort is written in Scala for our benchmark.) root@:~# more teraSort/tera.sbt name := "IBM ARL TeraSort" version := "1

spark-submit issue

2015-08-27 Thread pranay
I have a Java program that does this (using Spark 1.3.1): it creates a command string that uses "spark-submit" in it (with my class file etc.), and I store this string in a temp file somewhere as a shell script. Using Runtime.exec, I execute this script and wait for its completion, using process.wait

RE: Help! Stuck using withColumn

2015-08-27 Thread Saif.A.Ellafi
Hello, thank you for the response. I found a blog where a guy explains that it is not possible to join columns from different data frames. I was trying to modify one column’s information by selecting it and then trying to replace the original dataframe column. I found another way. Thanks, Saif
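One common way to replace a column within the same DataFrame is to rebuild the projection, swapping in the transformed column; a minimal Scala sketch, where the column name "name" and the upper() transformation are only illustrative (the thread does not say what transformation was actually needed):

  import org.apache.spark.sql.functions.upper

  val updated = df.select(df.columns.map {
    case "name" => upper(df("name")).as("name")   // the column being modified
    case other  => df(other)                      // everything else passes through unchanged
  }: _*)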

Re: spark streaming 1.3 kafka buffer size

2015-08-27 Thread Cody Koeninger
As it stands currently, no. If you're already overriding the dstream, it would be pretty straightforward to change the kafka parameters used when creating the rdd for the next batch though On Wed, Aug 26, 2015 at 11:41 PM, Shushant Arora wrote: > Can I change this param fetch.message.max.bytes

Re: spark streaming 1.3 kafka topic error

2015-08-27 Thread Cody Koeninger
Your kafka broker died or you otherwise had a rebalance. Normally spark retries take care of that. Is there something going on with your kafka installation, that rebalance is taking especially long? Yes, increasing backoff / max number of retries will "help", but it's better to figure out what's

Re: spark streaming 1.3 kafka topic error

2015-08-27 Thread Ahmed Nawar
Dears, I need to commit the DB transaction for each partition, not for each row. The code below didn't work for me. rdd.mapPartitions(partitionOfRecords => { DBConnectionInit() val results = partitionOfRecords.map(..) DBConnection.commit() }) Best regards, Ahmed Atef Nawwar Data Management

commit DB Transaction for each partition

2015-08-27 Thread Ahmed Nawar
Dears, I need to commit the DB transaction for each partition, not for each row. The code below didn't work for me. rdd.mapPartitions(partitionOfRecords => { DBConnectionInit() val results = partitionOfRecords.map(..) DBConnection.commit() results }) Best regards, Ahmed Atef Nawwar Data Man

Re: spark streaming 1.3 kafka topic error

2015-08-27 Thread Cody Koeninger
Map is lazy. You need an actual action, or nothing will happen. Use foreachPartition, or do an empty foreach after the map. On Thu, Aug 27, 2015 at 8:53 AM, Ahmed Nawar wrote: > Dears, > > I needs to commit DB Transaction for each partition,Not for each row. > below didn't work for me. > >
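A minimal Scala sketch of the foreachPartition approach; DBConnectionInit() is the poster's own helper and is assumed here to return a JDBC-style connection:

  rdd.foreachPartition { partitionOfRecords =>
    val conn = DBConnectionInit()
    partitionOfRecords.foreach { record =>
      // insert/update `record` through conn
    }
    conn.commit()   // exactly one commit per partition
  }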

Adding Kafka topics to a running streaming context

2015-08-27 Thread yael aharon
Hello, My streaming application needs to allow consuming new Kafka topics at arbitrary times. I know I can stop and start the streaming context when I need to introduce a new stream, but that seems quite disruptive. I am wondering if other people have this situation and if there is a more elegant s

Re: Driver running out of memory - caused by many tasks?

2015-08-27 Thread Ashish Rangole
I suggest taking a heap dump of driver process using jmap. Then open that dump in a tool like Visual VM to see which object(s) are taking up heap space. It is easy to do. We did this and found out that in our case it was the data structure that stores info about stages, jobs and tasks. There can be

Re: Adding Kafka topics to a running streaming context

2015-08-27 Thread Sudarshan Kadambi (BLOOMBERG/ 731 LEX)
This is something we have been needing for a while too. We are restarting the streaming context to handle new topic subscriptions & unsubscriptions which affects latency of update handling. I think this is something that needs to be addressed in core Spark Streaming (I can't think of any fundame

Re: Adding Kafka topics to a running streaming context

2015-08-27 Thread Cody Koeninger
If you all can say a little more about what your requirements are, maybe we can get a jira together. I think the easiest way to deal with this currently is to start a new job before stopping the old one, which should prevent latency problems. On Thu, Aug 27, 2015 at 9:24 AM, Sudarshan Kadambi (BL

B2i Healthcare "Powered by Spark" addition

2015-08-27 Thread Brandon Ulrich
Another addition to the Powered by Spark page: B2i Healthcare (http://b2i.sg) uses Spark in healthcare analytics with medical ontologies like SNOMED CT. Our Snow Owl MQ ( http://b2i.sg/snow-owl-mq) product relies on the Spark ecosystem to analyze ~1 billion health records with over 70 healthcare t

Best way to filter null on "any" column?

2015-08-27 Thread Saif.A.Ellafi
Hi all, What would be a good way to filter rows in a dataframe where any value in the row is null? I wouldn't want to go through each column manually. Thanks, Saif
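A short Scala sketch covering both readings of the question, assuming df is the DataFrame in question:

  import org.apache.spark.sql.functions.col

  // Drop every row that has a null in any column:
  val noNulls = df.na.drop()

  // Or keep only the rows that contain at least one null, without listing columns by hand:
  val withNulls = df.filter(df.columns.map(c => col(c).isNull).reduce(_ || _))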

Writing test case for spark streaming checkpointing

2015-08-27 Thread Hafiz Mujadid
Hi! I have enabled checkpointing in Spark Streaming with Kafka. I can see that Spark Streaming is checkpointing to the mentioned directory on HDFS. How can I test that it works fine and recovers with no data loss? Thanks -- View this message in context: http://apache-spark-user-list.1001560

Re: Writing test case for spark streaming checkpointing

2015-08-27 Thread Cody Koeninger
Kill the job in the middle of a batch, look at the worker logs to see which offsets were being processed, verify the messages for those offsets are read when you start the job back up On Thu, Aug 27, 2015 at 10:14 AM, Hafiz Mujadid wrote: > Hi! > > I have enables check pointing in spark streamin

Re: sbt error -- before Terasort compilation

2015-08-27 Thread Jacek Laskowski
On Thu, Aug 27, 2015 at 2:59 PM, Shreeharsha G Neelakantachar wrote: > root@:~# sbt package > Getting org.scala-sbt sbt 0.13.9 ... > > :: problems summary :: > WARNINGS > module not found: org.scala-sbt#sbt;0.13.9 ... > ERRORS > Server access Error: java.lang.Run

Is there a way to store RDD and load it with its original format?

2015-08-27 Thread Saif.A.Ellafi
Hi, Is there any way to store/load RDDs keeping their original object type instead of strings? I am having trouble with Parquet (there is always some error at class conversion), and I don't use Hadoop. Looking for alternatives. Thanks in advance Saif

Spark driver locality

2015-08-27 Thread Swapnil Shinde
Hello, I am new to the Spark world and started to explore recently in standalone mode. It would be great if I could get clarifications on the doubts below: 1. Driver locality - It is mentioned in the documentation that "client" deploy mode is not good if the machine running "spark-submit" is not co-located with the worker m

Re: Driver running out of memory - caused by many tasks?

2015-08-27 Thread andrew.rowson
Thanks for this tip. I ran it in yarn-client mode with driver-memory = 4G and took a dump once the heap got close to 4G.
 num     #instances         #bytes  class name
----------------------------------------------
   1:        446169     3661137256  [J
   2:       2032795      222636720

Re: query avro hive table in spark sql

2015-08-27 Thread Giri P
We are using Hive 1.1. I was able to fix the error below when I used the right version of Spark: 15/08/26 17:51:12 WARN avro.AvroSerdeUtils: Encountered AvroSerdeException determining schema. Returning signal schema to indicate problem org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Neither avro.schem

Re: query avro hive table in spark sql

2015-08-27 Thread Giri P
Can we run Hive queries using spark-avro? In our case it's not just reading the Avro file; we have a view in Hive which is based on multiple tables. On Thu, Aug 27, 2015 at 9:41 AM, Giri P wrote: > we are using hive1.1 . > > I was able to fix below error when I used right version spark > > 15/08/

Getting number of physical machines in Spark

2015-08-27 Thread Young, Matthew T
What's the canonical way to find out the number of physical machines in a cluster at runtime in Spark? I believe SparkContext.defaultParallelism will give me the number of cores, but I'm interested in the number of NICs. I'm writing a Spark streaming application to ingest from Kafka with the Re

RE: query avro hive table in spark sql

2015-08-27 Thread java8964
You can run a Hive query with spark-avro, but you cannot query a Hive view with spark-avro, as the view is stored in the Hive metadata. What do you mean by the right version of Spark? Is the "can't determine table schema" problem fixed with it? I faced this problem before, and my guess is the Hive lib

Re: Adding Kafka topics to a running streaming context

2015-08-27 Thread Sudarshan Kadambi (BLOOMBERG/ 731 LEX)
I put up https://issues.apache.org/jira/browse/SPARK-10320. Let's continue this conversation there. From: c...@koeninger.org At: Aug 27 2015 10:30:12 To: Sudarshan Kadambi (BLOOMBERG/ 731 LEX) Cc: user@spark.apache.org Subject: Re: Adding Kafka topics to a running streaming context If you all ca

Spark Streaming Listener to Kill Stages?

2015-08-27 Thread suchenzang
Hello, I was wondering if there's a way to kill stages in spark streaming via a StreamingListener. Sometimes stages will hang, and simply killing the stage via the UI is enough to let the streaming job

Re: query avro hive table in spark sql

2015-08-27 Thread Giri P
I was using a different build of Spark compiled against a different version of Hive before. The error which I see now: org.apache.hadoop.hive.serde2.avro.BadSchemaException at org.apache.hadoop.hive.serde2.avro.AvroSerDe.deserialize(AvroSerDe.java:195) at org.apache.spark.sql.hive.HadoopTabl

Re: application logs for long lived job on YARN

2015-08-27 Thread Chen Song
Does anyone have a similar problem or thoughts on this? On Wed, Aug 26, 2015 at 10:37 AM, Chen Song wrote: > When running a long-lived job on YARN like Spark Streaming, I found that > container logs are gone after days on executor nodes, although the job itself > is still running. > > > I am using cdh5.4.0 an

Commit DB Transaction for each partition

2015-08-27 Thread Ahmed Nawar
Thanks for the foreach idea. But once I used it I got an empty RDD, I think because "results" is an iterator. Yes, I know map is lazy, but I expected there to be a solution to force the action. I cannot use foreachPartition because I need to reuse the new RDD after some maps. On Thu, Aug 27, 2015 at 5:11 PM, Co

Re: sbt error -- before Terasort compilation

2015-08-27 Thread Jacek Laskowski
On Thu, Aug 27, 2015 at 5:40 PM, Jacek Laskowski wrote: >> Server access Error: java.lang.RuntimeException: Unexpected error: >> java.security.InvalidAlgorithmParameterException: the trustAnchors parameter >> must be non-empty >> url=https://jcenter.bintray.com/org/scala-sbt/sbt/0.13.9/sb

Data Frame support CSV or excel format ?

2015-08-27 Thread spark user
Hi all, Can we create a data frame from an Excel sheet or CSV file? In the example below it seems they support only JSON. DataFrame df = sqlContext.read().json("examples/src/main/resources/people.json");

Re: Spark driver locality

2015-08-27 Thread Rishitesh Mishra
Hi Swapnil, Let me try to answer some of the questions. Answers inline. Hope it helps. On Thursday, August 27, 2015, Swapnil Shinde wrote: > Hello > I am new to spark world and started to explore recently in standalone > mode. It would be great if I get clarifications on below doubts- > > 1. Dri

Re: query avro hive table in spark sql

2015-08-27 Thread Michael Armbrust
> > BTY, spark-avro works great for our experience, but still, some non-tech > people just want to use as a SQL shell in spark, like HIVE-CLI. > To clarify: you can still use the spark-avro library with pure SQL. Just use the CREATE TABLE ... USING com.databricks.spark.avro OPTIONS (path '...') s
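A minimal sketch of that pure-SQL route issued from Scala; the table name and Avro path are placeholders, and spark-avro is assumed to be on the classpath:

  sqlContext.sql("""
    CREATE TEMPORARY TABLE episodes
    USING com.databricks.spark.avro
    OPTIONS (path "/data/episodes.avro")
  """)
  sqlContext.sql("SELECT * FROM episodes LIMIT 10").show()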

Re: Commit DB Transaction for each partition

2015-08-27 Thread Cody Koeninger
You need to return an iterator from the closure you provide to mapPartitions On Thu, Aug 27, 2015 at 1:42 PM, Ahmed Nawar wrote: > Thanks for foreach idea. But once i used it i got empty rdd. I think > because "results" is an iterator. > > Yes i know "Map is lazy" but i expected there is solutio

Re: Commit DB Transaction for each partition

2015-08-27 Thread Ahmed Nawar
Yes, of course, I am doing that. But once I added results.foreach(row => {}) I got an empty RDD. rdd.mapPartitions(partitionOfRecords => { DBConnectionInit() val results = partitionOfRecords.map(..) DBConnection.commit() results.foreach(row=> {}) results }) On Thu, Aug 27, 2015 at 10:

Re: Commit DB Transaction for each partition

2015-08-27 Thread Cody Koeninger
This job contains a spark output action, and is what I originally meant: rdd.mapPartitions { result }.foreach { } This job is just a transformation, and won't do anything unless you have another output action. Not to mention, it will exhaust the iterator, as you noticed: rdd.mapPartitions {

Re: Data Frame support CSV or excel format ?

2015-08-27 Thread Michael Armbrust
Check out spark-csv: http://spark-packages.org/package/databricks/spark-csv On Thu, Aug 27, 2015 at 11:48 AM, spark user wrote: > Hi all , > > Can we create data frame from excels sheet or csv file , in below example > It seems they support only json ? > > > > DataFrame df = > sqlContext.read()
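A minimal Scala sketch of reading a CSV file with spark-csv; the file path is a placeholder and the package version shown is only illustrative:

  // Requires the spark-csv package on the classpath, e.g.
  //   spark-shell --packages com.databricks:spark-csv_2.10:1.2.0
  val people = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")        // first line holds column names
    .option("inferSchema", "true")   // sample the data to guess column types
    .load("/data/people.csv")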

Re: Spark driver locality

2015-08-27 Thread Swapnil Shinde
Thanks Rishitesh!! 1. I get that the driver doesn't need to be on the master, but there is a lot of communication between the driver and the cluster; that's why a co-located gateway was recommended. How big is the impact of the driver not being co-located with the cluster? 4. How does an HDFS split get assigned to a worker node

Re: Commit DB Transaction for each partition

2015-08-27 Thread Ahmed Nawar
Thanks a lot for your support. It is working now. I wrote it like below val newRDD = rdd.mapPartitions { partition => { val result = partition.map(.) result } } newRDD.foreach { } On Thu, Aug 27, 2015 at 10:34 PM, Cody Koeninger wrote: > This job contains a spark output action, an
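A fuller Scala sketch of the idea discussed in this thread, keeping one commit per partition while still returning an RDD for later stages; DBConnectionInit() and the per-record work are placeholders from the thread:

  val newRDD = rdd.mapPartitions { partition =>
    val conn = DBConnectionInit()
    // Materialize before committing, otherwise the lazy map would only run after the commit.
    val results = partition.map { record =>
      // write `record` through conn, return what later stages need
      record
    }.toList
    conn.commit()
    results.iterator        // mapPartitions must hand back an iterator
  }
  newRDD.foreach(_ => ())   // an output action is still needed to trigger execution

If newRDD feeds several later actions, it is worth caching it first, otherwise the partition work (including the DB writes) runs again for each action.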

Porting a multit-hreaded compute intensive job to spark

2015-08-27 Thread Utkarsh Sengar
I am working on code which uses executor service to parallelize tasks (think machine learning computations done over small dataset over and over again). My goal is to execute some code as fast as possible, multiple times and store the result somewhere (total executions will be on the order of 100M

types allowed for saveasobjectfile?

2015-08-27 Thread Arun Luthra
What types of RDD can saveAsObjectFile(path) handle? I tried a naive test with an RDD[Array[String]], but when I tried to read back the result with sc.objectFile(path).take(5).foreach(println), I got a non-promising output looking like: [Ljava.lang.String;@46123a [Ljava.lang.String;@76123b [Ljava.

Re: types allowed for saveasobjectfile?

2015-08-27 Thread Holden Karau
So println of any array of strings will look like that. The java.util.Arrays class has some options to print arrays nicely. On Thu, Aug 27, 2015 at 2:08 PM, Arun Luthra wrote: > What types of RDD can saveAsObjectFile(path) handle? I tried a naive test > with an RDD[Array[String]], but when I tri

Re: types allowed for saveasobjectfile?

2015-08-27 Thread Jonathan Coveney
Array[String] doesn't pretty-print by default. Use .mkString(",") for example. On Thursday, August 27, 2015, Arun Luthra wrote: > What types of RDD can saveAsObjectFile(path) handle? I tried a naive test > with an RDD[Array[String]], but when I tried to read back the result with > sc.object
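A tiny Scala sketch of the round trip, with a placeholder output path; rdd is assumed to be an RDD[Array[String]]:

  rdd.saveAsObjectFile("/tmp/string-arrays")
  val back = sc.objectFile[Array[String]]("/tmp/string-arrays")
  back.take(5).foreach(a => println(a.mkString(",")))   // prints the contents, not Array@hashcode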

Re: types allowed for saveasobjectfile?

2015-08-27 Thread Arun Luthra
Ah, yes, that did the trick. So more generally, can this handle any serializable object? On Thu, Aug 27, 2015 at 2:11 PM, Jonathan Coveney wrote: > Array[String] doesn't pretty-print by default. Use .mkString(",") for > example > > > On Thursday, August 27, 2015, Arun Luthra > wrote: > >

Any quick method to sample rdd based on one filed?

2015-08-27 Thread Gavin Yue
Hey, I have an RDD[(String, Boolean)]. I want to keep all rows where the Boolean is true and randomly keep some rows where it is false, so that in the final result the negative ones are about 10 times more numerous than the positive ones. What would be the most efficient way to do this? Thanks,
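One possible approach, a minimal Scala sketch that counts each class and then samples the negatives down (or up to all of them) so they land at roughly 10x the positives; the function name and the 10x target are taken from the question:

  import org.apache.spark.rdd.RDD

  def keepAllPositivesSampleNegatives(rdd: RDD[(String, Boolean)]): RDD[(String, Boolean)] = {
    val positives = rdd.filter(_._2)
    val negatives = rdd.filter(!_._2)
    val posCount = positives.count()
    val negCount = negatives.count()
    // Fraction that leaves roughly 10 negatives per positive, capped at keeping them all.
    val fraction = if (negCount == 0) 0.0 else math.min(1.0, 10.0 * posCount / negCount)
    positives.union(negatives.sample(false, fraction))   // sample without replacement
  }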

Re: types allowed for saveasobjectfile?

2015-08-27 Thread Holden Karau
Yes, any Java-serializable object. It's important to note that since it's saving serialized objects, it is as brittle as Java serialization when it comes to version changes, so if you can make your data fit in something like sequence files, parquet, avro, or similar it can be not only more space effic

tweet transformation ideas

2015-08-27 Thread Jesse F Chen
This is a question on general usage/best practice/best transformation method to use for a sentiment analysis on tweets... Input: Tweets (e.g., "@xyz, sorry but this movie is poorly scripted http://t.co/uyser876") - large data set, i.e. 1 billion tweets Sentiment dictionary (e.g., "

TimeoutException on start-slave spark 1.4.0

2015-08-27 Thread Alexander Pivovarov
I see the following error time to time when try to start slaves on spark 1.4.0 [hadoop@ip-10-0-27-240 apps]$ pwd /mnt/var/log/apps [hadoop@ip-10-0-27-240 apps]$ cat spark-hadoop-org.apache.spark.deploy.worker.Worker-1-ip-10-0-27-240.ec2.internal.out Spark Command: /usr/java/latest/bin/java -cp /

Spark Taking too long on K-means clustering

2015-08-27 Thread masoom alam
Hi everyone, I am trying to run the KDD data set - basically chapter 5 of the Advanced Analytics with Spark book. The data set is 789MB, but Spark is taking some 3 to 4 hours. Is this normal behaviour, or is some tuning required? The server RAM is 32 GB, but we can only give 4 GB RAM on 64-bit Ub

Array column stored as “.bag” in parquet file instead of “REPEATED INT64"

2015-08-27 Thread Jim Green
Hi Team, Say I have a test.json file: {"c1":[1,2,3]} I can create a parquet file like: var df = sqlContext.load("/tmp/test.json","json") var df_c = df.repartition(1) df_c.select("*").save("/tmp/testjson_spark","parquet") The output parquet file’s schema is like: c1: OPTIONAL F:1 .bag:

How to avoid shuffle errors for a large join ?

2015-08-27 Thread Thomas Dudziak
I'm getting errors like "Removing executor with no recent heartbeats" & "Missing an output location for shuffle" errors for a large SparkSql join (1bn rows/2.5TB joined with 1bn rows/30GB) and I'm not sure how to configure the job to avoid them. The initial stage completes fine with some 30k tasks
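Two settings that often come up for "no recent heartbeats" and lost shuffle output on very large joins; this is only a sketch of where to start tuning, not a guaranteed fix, and the values shown are illustrative:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.sql.shuffle.partitions", "2000")  // more, smaller shuffle blocks for the SQL join
    .set("spark.network.timeout", "600s")         // tolerate long GC pauses before executors are declared lost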

How to increase the Json parsing speed

2015-08-27 Thread Gavin Yue
Hey, I am using the Json4s-Jackson parser that comes with Spark to parse roughly 80M records with a total size of 900MB, but the speed is slow. It took my 50 nodes (16-core CPU, 100GB memory) roughly 30 minutes to parse the JSON for use with Spark SQL. Jackson's benchmarks say parsing should be at the millisecond level.

Re: How to increase the Json parsing speed

2015-08-27 Thread Sabarish Sasidharan
For your jsons, can you tell us what is your benchmark when running on a single machine using just plain Java (without Spark and Spark sql)? Regards Sab On 28-Aug-2015 7:29 am, "Gavin Yue" wrote: > Hey > > I am using the Json4s-Jackson parser coming with spark and parsing roughly > 80m records w

Re: Array column stored as “.bag” in parquet file instead of “REPEATED INT64"

2015-08-27 Thread Cheng Lian
Hi Jim, Unfortunately this is neither possible in Spark nor a standard practice for Parquet. In your case, actually repeated int64 c1 doesn't catch the full semantics. Because it represents a *required* array of long values containing zero or more *non-null* elements. However, when inferring sche

Re: tweet transformation ideas

2015-08-27 Thread Holden Karau
It seems like this might be better suited to a broadcasted hash map since 200k entries isn't that big. You can then map over the tweets and lookup each word in the broadcasted map. On Thursday, August 27, 2015, Jesse F Chen wrote: > This is a question on general usage/best practice/best transfor
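A minimal Scala sketch of that broadcast lookup; tweets is assumed to be an RDD[String] and sentimentDict a driver-side Map[String, Double] with roughly 200k word -> score entries (both names are placeholders):

  val dict = sc.broadcast(sentimentDict)

  val scored = tweets.map { tweet =>
    val words = tweet.toLowerCase.split("\\s+")
    // Sum the scores of the words that appear in the dictionary.
    val score = words.flatMap(w => dict.value.get(w)).sum
    (tweet, score)
  }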

Re: How to increase the Json parsing speed

2015-08-27 Thread Gavin Yue
Just did some tests. I have 6000 files, each has 14K records with 900Mb file size. In spark sql, it would take one task roughly 1 min to parse. On the local machine, using the same Jackson lib inside Spark lib. Just parse it. FileInputStream fstream = new FileInputStream("testfile")

Re: Feedback: Feature request

2015-08-27 Thread Manish Amde
Hi James, It's a good idea. A JSON format is more convenient for visualization though a little inconvenient to read. How about toJson() method? It might make the mllib api inconsistent across models though. You should probably create a JIRA for this. CC: dev list -Manish > On Aug 26, 2015,

Re: spark 1.4.1 - LZFException

2015-08-27 Thread Akhil Das
Is it filling up your disk space? Can you look a bit more at the executor logs to see what's going on? Thanks Best Regards On Sun, Aug 23, 2015 at 1:27 AM, Yadid Ayzenberg wrote: > > > Hi All, > > We have a spark standalone cluster running 1.4.1 and we are setting > spark.io.compression.codec to

Re: how to migrate from spark 0.9 to spark 1.4

2015-08-27 Thread Akhil Das
Edit the Spark version in your pom file and see what all breaks in the .py file, and fix it according to the API docs. Thanks Best Regards On Sun, Aug 23, 2015 at 10:50 AM, sai rakesh wrote: > currently i am using spark 0.9 on my data i wrote code in java for > sparksql.now i want to us

Re: tweet transformation ideas

2015-08-27 Thread Josh Rosen
What about using a subquery / exists query and the new 1.5 broadcast hint on the dictionary data frame? On Thu, Aug 27, 2015 at 8:44 PM, Holden Karau wrote: > It seems like this might be better suited to a broadcasted hash map since > 200k entries isn't that big. You can then map over the tweets

Re: How to remove worker node but let it finish first?

2015-08-27 Thread Akhil Das
You can create a custom mesos framework for your requirement, to get you started you can check this out http://mesos.apache.org/documentation/latest/app-framework-development-guide/ Thanks Best Regards On Mon, Aug 24, 2015 at 12:11 PM, Romi Kuntsman wrote: > Hi, > I have a spark standalone clus

Graphx CompactBuffer help

2015-08-27 Thread smagadi
After running the code below: val coonn=graph.connectedComponents.vertices.map(_.swap).groupByKey.map(_._2).collect() coonn.foreach(println) I get an array of CompactBuffers: CompactBuffer(13) CompactBuffer(14) CompactBuffer(4, 11, 1, 6, 3, 7, 9, 8, 10, 5, 2) CompactBuffer(15, 12) Now I want to g

Re: How to increase the Json parsing speed

2015-08-27 Thread Sabarish Sasidharan
I see that you are not reusing the same mapper instance in the Scala snippet. Regards Sab On Fri, Aug 28, 2015 at 9:38 AM, Gavin Yue wrote: > Just did some tests. > > I have 6000 files, each has 14K records with 900Mb file size. In spark > sql, it would take one task roughly 1 min to parse. >
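A short Scala sketch of reusing a single Jackson ObjectMapper per JVM instead of creating one per record; lines is assumed to be an RDD[String] of one JSON document per line, and "userid" is a placeholder field name:

  import com.fasterxml.jackson.databind.ObjectMapper

  object Json {
    // Constructed once per JVM; ObjectMapper creation is the expensive part.
    val mapper = new ObjectMapper()
  }

  val userIds = lines.mapPartitions { iter =>
    val m = Json.mapper                                        // reused across all records
    iter.map(line => m.readTree(line).path("userid").asText()) // pull one field out as a String
  }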

Re: Is there any way to connect cassandra without spark-cassandra connector?

2015-08-27 Thread Hafiz Mujadid
What Maven dependencies are you using? I tried the same but got the following exception: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/cassandra/cql/jdbc/AbstractJdbcType at org.apache.cassandra.cql.jdbc.CassandraConnection.<init>(CassandraConnection.java:146) at org.

Re: How to increase the Json parsing speed

2015-08-27 Thread Sabarish Sasidharan
How many executors are you using when using Spark SQL? On Fri, Aug 28, 2015 at 12:12 PM, Sabarish Sasidharan < sabarish.sasidha...@manthan.com> wrote: > I see that you are not reusing the same mapper instance in the Scala > snippet. > > Regards > Sab > > On Fri, Aug 28, 2015 at 9:38 AM, Gavin Yue