Re: Not able to run FP-growth Example

2015-06-14 Thread masoom alam
This is not working: org.apache.spark.mlib spark-mlib provided On Sat, Jun 13, 2015 at 11:56 PM, masoom alam wrote: > These two imports are missing and thus FP-growth is not compiling... > > import org.apache.spark.*mllib.fpm.FPGrow
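For reference, the Maven artifact name is spark-mllib_2.10 (double "l"), not spark-mlib, and the class lives under org.apache.spark.mllib.fpm. A minimal Scala sketch of the FP-growth call, mirroring the MLlib example that ships with the Spark 1.4 distribution (data/mllib/sample_fpgrowth.txt is included in it):

    import org.apache.spark.mllib.fpm.FPGrowth
    import org.apache.spark.rdd.RDD

    // Each input line is one transaction of space-separated items.
    val transactions: RDD[Array[String]] =
      sc.textFile("data/mllib/sample_fpgrowth.txt").map(_.trim.split(' '))

    // Mine itemsets appearing in at least 20% of the transactions.
    val model = new FPGrowth().setMinSupport(0.2).setNumPartitions(10).run(transactions)
    model.freqItemsets.collect().foreach { itemset =>
      println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
    }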

Re: How to read avro in SparkR

2015-06-14 Thread Shing Hing Man
Burak's suggestion works: read.df(sqlContext, "file:///home/matmsh/myfile.avro", "com.databricks.spark.avro"). Maybe the above should be added to the SparkR (R on Spark) - Spark 1.4.0 Documentation.

Re: lower&upperBound not working/spark 1.3

2015-06-14 Thread Sujeevan
I also thought that it was an issue. After investigating it further, I found this: https://issues.apache.org/jira/browse/SPARK-6800 Here is the updated documentation of the *org.apache.spark.sql.jdbc.JDBCRelation#columnPartition* method: "Notice that lowerBound and upperBound are just used to decide
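In other words, per SPARK-6800 the bounds only set the stride of the generated WHERE clauses; rows outside [lowerBound, upperBound] are not filtered out, they just land in the first and last partitions. A hedged sketch of how the options are passed on Spark 1.3, with a hypothetical URL, table, and column:

    val df = sqlContext.load("jdbc", Map(
      "url"             -> "jdbc:postgresql://dbhost/sales",  // hypothetical
      "dbtable"         -> "orders",
      "partitionColumn" -> "order_id",  // must be an integral column
      "lowerBound"      -> "1",         // stride start, not a row filter
      "upperBound"      -> "1000000",   // stride end, not a row filter
      "numPartitions"   -> "10"))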

Re: What is most efficient to do a large union and remove duplicates?

2015-06-14 Thread ayan guha
Can you do the dedupe process locally for each file first and then globally? Also, I did not fully get the logic of the part inside reduceByKey. Can you kindly explain? On 14 Jun 2015 13:58, "Gavin Yue" wrote: > I have 10 folders, each with 6000 files. Each folder is roughly 500GB. So > totally 5TB da

Re: Job marked as killed in spark 1.4

2015-06-14 Thread Nizan Grauer
Hi, an update regarding this; hope it will get me some answers... When I look into one of the workers' logs (for one of its tasks), I can see the following exception: Exception in thread "main" akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@172.31.0.186:38560/), Path(/

Re: Job marked as killed in spark 1.4

2015-06-14 Thread Nizan Grauer
And a last update for this: the job itself seems to be working and generates output on S3; it just reports itself as KILLED, and the history server can't find the logs. On Sun, Jun 14, 2015 at 3:55 PM, Nizan Grauer wrote: > hi > > update regarding that, hope it will get me some answers... > > When I

Re: lower&upperBound not working/spark 1.3

2015-06-14 Thread Sathish Kumaran Vairavelu
Hi, I am also facing the same issue. Is it possible to view the actual query passed to the database? Has anyone tried that? Also, what if we don't give the upper and lower partition bounds? Would we end up with data skew? Thanks Sathish On Sun, Jun 14, 2015 at 5:02 AM Sujeevan wrote: > I also thought th

creation of RDD from a Tree

2015-06-14 Thread lisp
Hi there, I have a large number of objects, which I have to partition into chunks with the help of a binary tree: after each object has been run through the tree, the leaves of that tree contain the chunks. Next I have to process each of those chunks in the same way with a function f(chunk). So I
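A hedged sketch of one way to express this in Spark: compute the leaf id for each object, key by it, and group, so each leaf's chunk becomes one record that f can process in parallel. leafId and f below are hypothetical stand-ins for the poster's tree and processing function:

    import org.apache.spark.rdd.RDD

    // Hypothetical routing: a depth-2 binary tree over [0, 1).
    def leafId(x: Double): Int =
      if (x < 0.5) { if (x < 0.25) 0 else 1 } else { if (x < 0.75) 2 else 3 }

    // Hypothetical per-chunk processing.
    def f(chunk: Iterable[Double]): Double = chunk.sum

    val objects: RDD[Double] = sc.parallelize(Seq(0.1, 0.3, 0.6, 0.9))
    val results = objects.map(x => (leafId(x), x)).groupByKey().mapValues(f)
    results.collect().foreach(println)  // one (leafId, f(chunk)) pair per leaf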

RE: How to silence Parquet logging?

2015-06-14 Thread Chris Freeman
Hi Cheng, I'm using the official SparkR in the official 1.4 release, loading parquet files with read.df, like so: txnsRaw <- read.df(sqlCtx, "Customer_Transactions.parquet") And here's a sample of the logging info that gets printed to the console: Jun 14, 2015 9:49:52 AM INFO: parquet.hadoop.
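The "Jun 14, 2015 9:49:52 AM INFO:" format suggests these lines come through java.util.logging rather than log4j, which is why log4j settings alone may not silence them. A hedged sketch of raising the JUL level on the JVM side; this is an assumption about the log source, and it does not address SparkR's R-side console:

    import java.util.logging.{Level, Logger}

    // Keep a strong reference so the logger's level setting is not lost to GC.
    val parquetLogger = Logger.getLogger("parquet")
    parquetLogger.setLevel(Level.SEVERE)  // drop INFO/WARNING from parquet.*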

Re: creation of RDD from a Tree

2015-06-14 Thread Will Briggs
If you are working on large structures, you probably want to look at the GraphX extension to Spark: https://spark.apache.org/docs/latest/graphx-programming-guide.html On June 14, 2015, at 10:50 AM, lisp wrote: Hi there, I have a large amount of objects, which I have to partition into chunks w

Re: Job marked as killed in spark 1.4

2015-06-14 Thread nizang
hi, A simple way to recreate the problem - I have two server installations, one with spark 1.3.1 and one with spark 1.4.0. I ran the following on both servers: [root@ip-172-31-6-108 ~]$ spark/bin/spark-shell --total-executor-cores 1 scala> val text = sc.textFile("hdfs:///some-file.txt"); s

Re: What is most efficient to do a large union and remove duplicates?

2015-06-14 Thread Josh Rosen
If your job is dying due to out of memory errors in the post-shuffle stage, I'd consider the following approach for implementing de-duplication / distinct(): - Use sortByKey() to perform a full sort of your dataset. - Use mapPartitions() to iterate through each partition of the sorted dataset, con
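A hedged Scala sketch of that sort-then-scan distinct, assuming (key, value) records. sortByKey's range partitioning keeps equal keys inside one partition, so comparing each record with its predecessor within a partition suffices:

    import org.apache.spark.rdd.RDD

    val data: RDD[(String, String)] =
      sc.parallelize(Seq("a" -> "x", "a" -> "xx", "b" -> "y"))

    val deduped = data.sortByKey().mapPartitions { iter =>
      var prevKey: String = null
      iter.filter { case (k, _) =>
        val isNew = k != prevKey  // keep only the first record of each key run
        prevKey = k
        isNew
      }
    }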

Re: What is most efficient to do a large union and remove duplicates?

2015-06-14 Thread Gavin Yue
Each folder should have no dups. Dups only exist among different folders. The logic inside is to take only the longest string value for each key. The current problem is exceeding the largest frame size when trying to write to HDFS: the frame is 500m while the setting is 80m. Sent from my iPhone
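If this really is the Akka frame limit, the relevant 1.x setting is spark.akka.frameSize (in MB). A hedged sketch; whether a larger value is the right fix depends on why a single ~500m frame is being sent at all:

    // Raise the Akka frame limit; the value is in MB.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.akka.frameSize", "512")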

What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-14 Thread Rex X
For clustering analysis, we need a way to measure distances when the data contains different levels of measurement: *binary / categorical (nominal), counts (ordinal), and ratio (scale)*. To be concrete, consider working with the attributes *city, zip, satisfaction_level, price*. In the meanwh

Spark SQL JDBC Source Join Error

2015-06-14 Thread Sathish Kumaran Vairavelu
Hello Everyone, I pulled 2 different tables from the JDBC source and then joined them using the cust_id *decimal* column. A simple join like the one below. This simple join works perfectly in the database but not in Spark SQL. I am importing the 2 tables as data frames / registerTempTable and firing SQL on
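For reference, a hedged sketch of the failing pattern; the URL, table names, and columns are hypothetical, and (as noted in the replies) the decimal join key trips SPARK-5456 on releases before 1.4:

    // Hypothetical JDBC source and tables.
    val url = "jdbc:db2://dbhost:50000/sales"
    val customers = sqlContext.load("jdbc", Map("url" -> url, "dbtable" -> "customers"))
    val orders    = sqlContext.load("jdbc", Map("url" -> url, "dbtable" -> "orders"))
    customers.registerTempTable("customers")
    orders.registerTempTable("orders")
    val joined = sqlContext.sql(
      "SELECT c.cust_id, o.total FROM customers c JOIN orders o ON c.cust_id = o.cust_id")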

Re: Spark SQL JDBC Source Join Error

2015-06-14 Thread Michael Armbrust
Sounds like SPARK-5456, which is fixed in Spark 1.4. On Sun, Jun 14, 2015 at 11:57 AM, Sathish Kumaran Vairavelu < vsathishkuma...@gmail.com> wrote: > Hello Everyone, > > I pulled 2 different tables from the JDBC source and then joined them > usi

DataFrame and JDBC regression?

2015-06-14 Thread Peter Haumer
Hello. I have an ETL app that appends new results found at each run to a JDBC table. In 1.3.1 I did this: testResultsDF.insertIntoJDBC(CONNECTION_URL + ";user=" + USER + ";password=" + PASSWORD, TABLE_NAME, false); When I do this now in 1.4 it complains that the "object" 'TABLE
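In 1.4 the insertIntoJDBC call is superseded by the DataFrameWriter API. A hedged sketch of the equivalent append, showing only the call shape (not a claim that it sidesteps the regression reported here):

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", USER)
    props.setProperty("password", PASSWORD)
    // mode("append") plays the role of the old overwrite=false argument.
    testResultsDF.write.mode("append").jdbc(CONNECTION_URL, TABLE_NAME, props)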

Re: DataFrame and JDBC regression?

2015-06-14 Thread Michael Armbrust
Can you please file a JIRA? On Sun, Jun 14, 2015 at 2:20 PM, Peter Haumer wrote: > Hello. > I have an ETL app that appends to a JDBC table new results found at each > run. In 1.3.1 I did this: > > testResultsDF.insertIntoJDBC(CONNECTION_URL + ";user=" + USER + > ";password=" + PASSWORD, TABLE_

Re: --packages & Failed to load class for data source v1.4

2015-06-14 Thread Don Drake
I looked at this again, and when I use the Scala spark-shell and load a CSV using the same package it works just fine, so this seems specific to pyspark. I've created the following JIRA: https://issues.apache.org/jira/browse/SPARK-8365 -Don On Sat, Jun 13, 2015 at 11:46 AM, Don Drake wrote: >

Re: Spark SQL JDBC Source Join Error

2015-06-14 Thread Sathish Kumaran Vairavelu
Thank you, it works in Spark 1.4. On Sun, Jun 14, 2015 at 3:51 PM Michael Armbrust wrote: > Sounds like SPARK-5456, which is fixed in Spark 1.4. > > On Sun, Jun 14, 2015 at 11:57 AM, Sathish Kumaran Vairavelu < > vsathishkuma...@gmail.com> wr

Re: How to split log data into different files according to severity

2015-06-14 Thread Hao Wang
Thanks for the link. I’m still running 1.3.1 but will give it a try :) Hao > On Jun 13, 2015, at 9:38 AM, Will Briggs wrote: > > Check out this recent post by Cheng Lian regarding dynamic partitioning in > Spark 1.4: https://www.mail-archive.com/user@spark.apache.org/msg30204.html >

Spark SQL - Complex query pushdown

2015-06-14 Thread Sathish Kumaran Vairavelu
Hello, Is there a way in Spark where I define the data source (say a JDBC source) and the list of tables to be used on that data source? Like a JDBC connection, where we define the connection once and then execute statements against it. In the current external table implementation, each
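One commonly used workaround, though not a full answer to the connection-reuse question, is to push an arbitrary query to the source by wrapping it as a derived table in the dbtable option. A hedged sketch with a hypothetical URL and tables:

    val pushed = sqlContext.read.format("jdbc").options(Map(
      "url" -> "jdbc:postgresql://dbhost/sales",  // hypothetical
      // The database runs the whole subquery; Spark sees a single table "q".
      "dbtable" -> "(SELECT c.cust_id, SUM(o.total) AS total FROM customers c JOIN orders o ON c.cust_id = o.cust_id GROUP BY c.cust_id) q"
    )).load()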

Re: --packages & Failed to load class for data source v1.4

2015-06-14 Thread Burak Yavuz
Hi Don, This seems related to a known issue, where the classpath on the driver is missing the related classes. This is a bug in py4j as py4j uses the System Classloader rather than Spark's Context Classloader. However, this problem existed in 1.3.0 as well, therefore I'm curious whether it's the sa

RE: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-14 Thread Haopu Wang
Hi, can someone help to confirm the behavior? Thank you! -Original Message- From: Haopu Wang Sent: Friday, June 12, 2015 4:57 PM To: user Subject: If not stop StreamingContext gracefully, will checkpoint data be consistent? This is a quick question about Checkpoint. The question is: if t
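For reference, the graceful stop the question contrasts with a hard stop looks like this in Scala (a sketch; the question itself, what a non-graceful stop means for checkpoint consistency, is left to the thread):

    // Graceful stop: finish processing data already received before shutting
    // down, so the final checkpoint reflects completed batches.
    ssc.stop(stopSparkContext = true, stopGracefully = true)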

Re: [Spark-1.4.0]jackson-databind conflict?

2015-06-14 Thread Earthson Lu
I’ve recompiled spark-1.4.0 with fasterxml-2.5.x, it works fine now:) --  Earthson Lu On June 12, 2015 at 23:24:32, Sean Owen (so...@cloudera.com) wrote: I see the same thing in an app that uses Jackson 2.5. Downgrading to 2.4 made it work. I meant to go back and figure out if there's somet
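An alternative to recompiling Spark, if the application builds with sbt, is pinning jackson-databind to a single version across the dependency graph. A hedged sketch for build.sbt; the exact version to pin (2.4.x here) is illustrative:

    // Force one jackson-databind version over transitive choices.
    dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4"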

Re: Not able to run FP-growth Example

2015-06-14 Thread masoom alam
*Getting the following error:* [INFO] [INFO] [INFO] Building example 0.0.1 [INFO] Downloading: http://repo.maven.apache.org/maven2/org/apache/spark/spa

Re: Join between DStream and Periodically-Changing-RDD

2015-06-14 Thread Ilove Data
@Akhil Das Joining two DStreams might not be an option since I want to join the stream with historical data in an HDFS folder. @Tathagata Das & @Evo Eftimov The batch RDD to be reloaded is considerably huge compared to the DStream data, since it is historical data. To be more specific, most of the joins from the rdd stream to h
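A hedged sketch of the transform-based join under discussion: keep a reference to the historical RDD that a background reload can swap, and join inside transform so each batch picks up the current reference. The path, schema, and pairStream are hypothetical:

    import org.apache.spark.rdd.RDD

    def loadHistorical(): RDD[(String, String)] =
      sc.textFile("hdfs:///data/history/*").map { line =>
        val Array(k, v) = line.split("\t", 2)
        (k, v)
      }

    @volatile var historical = loadHistorical()  // swapped periodically elsewhere
    // transform's closure runs once per batch, so a newly swapped-in
    // historical RDD is used from the next interval onward.
    val joined = pairStream.transform(rdd => rdd.join(historical))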

Union of two Diff. types of DStreams

2015-06-14 Thread anshu shukla
How to take the union of a JavaPairDStream and a JavaDStream? *>a.union(b) is working only with DStreams of the same type.* -- Thanks & Regards, Anshu Shukla

Re: Union of two Diff. types of DStreams

2015-06-14 Thread ayan guha
That's by design. You can make up the value in the 2nd RDD by putting in a default Int value. On 15 Jun 2015 15:30, "anshu shukla" wrote: > How to take union of JavaPairDStream and > JavaDStream . > > *>a.union(b) is working only with Dstreams of same type.* > > > -- > Thanks & Regards, > Ans
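A hedged Scala sketch of that suggestion: lift the plain stream into pairs with a default value so both sides share one element type (pairs and plain are hypothetical streams):

    // pairs: DStream[(String, Int)], plain: DStream[String]
    val unioned = pairs.union(plain.map(s => (s, 0)))  // 0 as the default Int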

Spark DataFrame Reduce Job Took 40s for 6000 Rows

2015-06-14 Thread Proust GZ Feng
Hi, Spark Experts I have played with Spark for several weeks. After some testing, a reduce operation on a DataFrame costs 40s on a cluster with 5 datanode executors, and the backing row count is about 6,000. Is this a normal case? Such performance looks too bad, because in Java a loop over 6,000 rows ca

Re: Not able to run FP-growth Example

2015-06-14 Thread masoom alam
even the following POM is not working: <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"> spark-parent_2.10 org.apache.spark