Re: Not able to run FP-growth Example

2015-06-14 Thread masoom alam
This is not working: org.apache.spark.mlib spark-mlib provided On Sat, Jun 13, 2015 at 11:56 PM, masoom alam wrote: > These two imports are missing and thus FP-growth is not compiling... > > import org.apache.spark.*mllib.fpm.FPGrow
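For reference, the Maven artifact name is spark-mllib_2.10 (double "l"), not spark-mlib, and the class lives under org.apache.spark.mllib.fpm. A minimal Scala sketch of the FP-growth call, mirroring the MLlib example that ships with the Spark 1.4 distribution (data/mllib/sample_fpgrowth.txt is included in it):

    import org.apache.spark.mllib.fpm.FPGrowth
    import org.apache.spark.rdd.RDD

    // Each input line is one transaction of space-separated items.
    val transactions: RDD[Array[String]] =
      sc.textFile("data/mllib/sample_fpgrowth.txt").map(_.trim.split(' '))

    // Mine itemsets appearing in at least 20% of the transactions.
    val model = new FPGrowth().setMinSupport(0.2).setNumPartitions(10).run(transactions)
    model.freqItemsets.collect().foreach { itemset =>
      println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
    }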

Re: How to read avro in SparkR

2015-06-14 Thread Shing Hing Man
Burak's suggestion works: read.df(sqlContext, "file:///home/matmsh/myfile.avro", "com.databricks.spark.avro"). Maybe the above should be added to the SparkR (R on Spark) - Spark 1.4.0 Documentation.

Re: lower&upperBound not working/spark 1.3

2015-06-14 Thread Sujeevan
I also thought that it was an issue. After investigating it further, I found this: https://issues.apache.org/jira/browse/SPARK-6800 Here is the updated documentation of the *org.apache.spark.sql.jdbc.JDBCRelation#columnPartition* method: "Notice that lowerBound and upperBound are just used to decide
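In other words, per SPARK-6800 the bounds only set the stride of the generated WHERE clauses; rows outside [lowerBound, upperBound] are not filtered out, they just land in the first and last partitions. A hedged sketch of how the options are passed on Spark 1.3, with a hypothetical URL, table, and column:

    val df = sqlContext.load("jdbc", Map(
      "url"             -> "jdbc:postgresql://dbhost/sales",  // hypothetical
      "dbtable"         -> "orders",
      "partitionColumn" -> "order_id",  // must be an integral column
      "lowerBound"      -> "1",         // stride start, not a row filter
      "upperBound"      -> "1000000",   // stride end, not a row filter
      "numPartitions"   -> "10"))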

Re: What is most efficient to do a large union and remove duplicates?

2015-06-14 Thread ayan guha
Can you do the dedupe process locally for each file first and then globally? Also, I did not fully get the logic of the part inside reduceByKey. Can you kindly explain? On 14 Jun 2015 13:58, "Gavin Yue" wrote: > I have 10 folders, each with 6000 files. Each folder is roughly 500GB. So > totally 5TB da

Re: Job marked as killed in spark 1.4

2015-06-14 Thread Nizan Grauer
Hi, an update regarding this; hope it will get me some answers... When I look into one of the workers' logs (for one of its tasks), I can see the following exception: Exception in thread "main" akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@172.31.0.186:38560/), Path(/

Re: Job marked as killed in spark 1.4

2015-06-14 Thread Nizan Grauer
And a last update for this: the job itself seems to be working and generates output on S3; it just reports itself as KILLED, and the history server can't find the logs. On Sun, Jun 14, 2015 at 3:55 PM, Nizan Grauer wrote: > hi > > update regarding that, hope it will get me some answers... > > When I

Re: lower&upperBound not working/spark 1.3

2015-06-14 Thread Sathish Kumaran Vairavelu
Hi, I am also facing the same issue. Is it possible to view the actual query passed to the database? Has anyone tried that? Also, what if we don't give the upper and lower partition bounds? Would we end up with data skew? Thanks Sathish On Sun, Jun 14, 2015 at 5:02 AM Sujeevan wrote: > I also thought th

creation of RDD from a Tree

2015-06-14 Thread lisp
Hi there, I have a large number of objects, which I have to partition into chunks with the help of a binary tree: after each object has been run through the tree, the leaves of that tree contain the chunks. Next I have to process each of those chunks in the same way with a function f(chunk). So I
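A hedged sketch of one way to express this in Spark: compute the leaf id for each object, key by it, and group, so each leaf's chunk becomes one record that f can process in parallel. leafId and f below are hypothetical stand-ins for the poster's tree and processing function:

    import org.apache.spark.rdd.RDD

    // Hypothetical routing: a depth-2 binary tree over [0, 1).
    def leafId(x: Double): Int =
      if (x < 0.5) { if (x < 0.25) 0 else 1 } else { if (x < 0.75) 2 else 3 }

    // Hypothetical per-chunk processing.
    def f(chunk: Iterable[Double]): Double = chunk.sum

    val objects: RDD[Double] = sc.parallelize(Seq(0.1, 0.3, 0.6, 0.9))
    val results = objects.map(x => (leafId(x), x)).groupByKey().mapValues(f)
    results.collect().foreach(println)  // one (leafId, f(chunk)) pair per leaf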

RE: How to silence Parquet logging?

2015-06-14 Thread Chris Freeman
Hi Cheng, I'm using the official SparkR in the official 1.4 release, loading parquet files with read.df, like so: txnsRaw <- read.df(sqlCtx, "Customer_Transactions.parquet") And here's a sample of the logging info that gets printed to the console: Jun 14, 2015 9:49:52 AM INFO: parquet.hadoop.
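The "Jun 14, 2015 9:49:52 AM INFO:" format suggests these lines come through java.util.logging rather than log4j, which is why log4j settings alone may not silence them. A hedged sketch of raising the JUL level on the JVM side; this is an assumption about the log source, and it does not address SparkR's R-side console:

    import java.util.logging.{Level, Logger}

    // Keep a strong reference so the logger's level setting is not lost to GC.
    val parquetLogger = Logger.getLogger("parquet")
    parquetLogger.setLevel(Level.SEVERE)  // drop INFO/WARNING from parquet.*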

Re: creation of RDD from a Tree

2015-06-14 Thread Will Briggs
If you are working on large structures, you probably want to look at the GraphX extension to Spark: https://spark.apache.org/docs/latest/graphx-programming-guide.html On June 14, 2015, at 10:50 AM, lisp wrote: Hi there, I have a large amount of objects, which I have to partition into chunks w

Re: Job marked as killed in spark 1.4

2015-06-14 Thread nizang
hi, A simple way to recreate the problem - I have two server installations, one with spark 1.3.1 and one with spark 1.4.0. I ran the following on both servers: [root@ip-172-31-6-108 ~]$ spark/bin/spark-shell --total-executor-cores 1 scala> val text = sc.textFile("hdfs:///some-file.txt"); s

Re: What is most efficient to do a large union and remove duplicates?

2015-06-14 Thread Josh Rosen
If your job is dying due to out of memory errors in the post-shuffle stage, I'd consider the following approach for implementing de-duplication / distinct(): - Use sortByKey() to perform a full sort of your dataset. - Use mapPartitions() to iterate through each partition of the sorted dataset, con
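A hedged Scala sketch of that sort-then-scan distinct, assuming (key, value) records. sortByKey's range partitioning keeps equal keys inside one partition, so comparing each record with its predecessor within a partition suffices:

    import org.apache.spark.rdd.RDD

    val data: RDD[(String, String)] =
      sc.parallelize(Seq("a" -> "x", "a" -> "xx", "b" -> "y"))

    val deduped = data.sortByKey().mapPartitions { iter =>
      var prevKey: String = null
      iter.filter { case (k, _) =>
        val isNew = k != prevKey  // keep only the first record of each key run
        prevKey = k
        isNew
      }
    }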

Re: What is most efficient to do a large union and remove duplicates?

2015-06-14 Thread Gavin Yue
Each folder should have no dups. Dups only exist among different folders. The logic inside is to take only the longest string value for each key. The current problem is exceeding the largest frame size when trying to write to HDFS: the frame is 500m while the setting is 80m. Sent from my iPhone
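If this really is the Akka frame limit, the relevant 1.x setting is spark.akka.frameSize (in MB). A hedged sketch; whether a larger value is the right fix depends on why a single ~500m frame is being sent at all:

    // Raise the Akka frame limit; the value is in MB.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.akka.frameSize", "512")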

What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-14 Thread Rex X
For clustering analysis, we need a way to measure distances when the data contains different levels of measurement: *binary / categorical (nominal), counts (ordinal), and ratio (scale)*. To be concrete, consider working with the attributes *city, zip, satisfaction_level, price*. In the meanwh

Spark SQL JDBC Source Join Error

2015-06-14 Thread Sathish Kumaran Vairavelu
Hello Everyone, I pulled 2 different tables from the JDBC source and then joined them using the cust_id *decimal* column. A simple join like the one below. This simple join works perfectly in the database but not in Spark SQL. I am importing the 2 tables as data frames / registerTempTable and firing SQL on
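For reference, a hedged sketch of the failing pattern; the URL, table names, and columns are hypothetical, and (as noted in the replies) the decimal join key trips SPARK-5456 on releases before 1.4:

    // Hypothetical JDBC source and tables.
    val url = "jdbc:db2://dbhost:50000/sales"
    val customers = sqlContext.load("jdbc", Map("url" -> url, "dbtable" -> "customers"))
    val orders    = sqlContext.load("jdbc", Map("url" -> url, "dbtable" -> "orders"))
    customers.registerTempTable("customers")
    orders.registerTempTable("orders")
    val joined = sqlContext.sql(
      "SELECT c.cust_id, o.total FROM customers c JOIN orders o ON c.cust_id = o.cust_id")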

Re: Spark SQL JDBC Source Join Error

2015-06-14 Thread Michael Armbrust
Sounds like SPARK-5456, which is fixed in Spark 1.4. On Sun, Jun 14, 2015 at 11:57 AM, Sathish Kumaran Vairavelu < vsathishkuma...@gmail.com> wrote: > Hello Everyone, > > I pulled 2 different tables from the JDBC source and then joined them > usi

DataFrame and JDBC regression?

2015-06-14 Thread Peter Haumer
Hello. I have an ETL app that appends new results found at each run to a JDBC table. In 1.3.1 I did this: testResultsDF.insertIntoJDBC(CONNECTION_URL + ";user=" + USER + ";password=" + PASSWORD, TABLE_NAME, false); When I do this now in 1.4 it complains that the "object" 'TABLE
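In 1.4 the insertIntoJDBC call is superseded by the DataFrameWriter API. A hedged sketch of the equivalent append, showing only the call shape (not a claim that it sidesteps the regression reported here):

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", USER)
    props.setProperty("password", PASSWORD)
    // mode("append") plays the role of the old overwrite=false argument.
    testResultsDF.write.mode("append").jdbc(CONNECTION_URL, TABLE_NAME, props)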

Re: DataFrame and JDBC regression?

2015-06-14 Thread Michael Armbrust
Can you please file a JIRA? On Sun, Jun 14, 2015 at 2:20 PM, Peter Haumer wrote: > Hello. > I have an ETL app that appends to a JDBC table new results found at each > run. In 1.3.1 I did this: > > testResultsDF.insertIntoJDBC(CONNECTION_URL + ";user=" + USER + > ";password=" + PASSWORD, TABLE_

Re: --packages & Failed to load class for data source v1.4

2015-06-14 Thread Don Drake
I looked at this again, and when I use the Scala spark-shell and load a CSV using the same package it works just fine, so this seems specific to pyspark. I've created the following JIRA: https://issues.apache.org/jira/browse/SPARK-8365 -Don On Sat, Jun 13, 2015 at 11:46 AM, Don Drake wrote: >

Re: Spark SQL JDBC Source Join Error

2015-06-14 Thread Sathish Kumaran Vairavelu
Thank you, it works in Spark 1.4. On Sun, Jun 14, 2015 at 3:51 PM Michael Armbrust wrote: > Sounds like SPARK-5456, which is fixed in Spark 1.4. > > On Sun, Jun 14, 2015 at 11:57 AM, Sathish Kumaran Vairavelu < > vsathishkuma...@gmail.com> wr

Re: How to split log data into different files according to severity

2015-06-14 Thread Hao Wang
Thanks for the link. I’m still running 1.3.1 but will give it a try :) Hao > On Jun 13, 2015, at 9:38 AM, Will Briggs wrote: > > Check out this recent post by Cheng Lian regarding dynamic partitioning in > Spark 1.4: https://www.mail-archive.com/user@spark.apache.org/msg30204.html >

Spark SQL - Complex query pushdown

2015-06-14 Thread Sathish Kumaran Vairavelu
Hello, Is there a way in Spark where I define the data source (say a JDBC source) and the list of tables to be used on that data source? Like a JDBC connection, where we define the connection once and then execute statements against it. In the current external table implementation, each
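One commonly used workaround, though not a full answer to the connection-reuse question, is to push an arbitrary query to the source by wrapping it as a derived table in the dbtable option. A hedged sketch with a hypothetical URL and tables:

    val pushed = sqlContext.read.format("jdbc").options(Map(
      "url" -> "jdbc:postgresql://dbhost/sales",  // hypothetical
      // The database runs the whole subquery; Spark sees a single table "q".
      "dbtable" -> "(SELECT c.cust_id, SUM(o.total) AS total FROM customers c JOIN orders o ON c.cust_id = o.cust_id GROUP BY c.cust_id) q"
    )).load()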

Re: --packages & Failed to load class for data source v1.4

2015-06-14 Thread Burak Yavuz
Hi Don, This seems related to a known issue, where the classpath on the driver is missing the related classes. This is a bug in py4j as py4j uses the System Classloader rather than Spark's Context Classloader. However, this problem existed in 1.3.0 as well, therefore I'm curious whether it's the sa

RE: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-14 Thread Haopu Wang
Hi, can someone help to confirm the behavior? Thank you! -Original Message- From: Haopu Wang Sent: Friday, June 12, 2015 4:57 PM To: user Subject: If not stop StreamingContext gracefully, will checkpoint data be consistent? This is a quick question about Checkpoint. The question is: if t
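For reference, the graceful stop the question contrasts with a hard stop looks like this in Scala (a sketch; the question itself, what a non-graceful stop means for checkpoint consistency, is left to the thread):

    // Graceful stop: finish processing data already received before shutting
    // down, so the final checkpoint reflects completed batches.
    ssc.stop(stopSparkContext = true, stopGracefully = true)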

Re: [Spark-1.4.0]jackson-databind conflict?

2015-06-14 Thread Earthson Lu
I’ve recompiled spark-1.4.0 with fasterxml-2.5.x, it works fine now:) --  Earthson Lu On June 12, 2015 at 23:24:32, Sean Owen (so...@cloudera.com) wrote: I see the same thing in an app that uses Jackson 2.5. Downgrading to 2.4 made it work. I meant to go back and figure out if there's somet
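An alternative to recompiling Spark, if the application builds with sbt, is pinning jackson-databind to a single version across the dependency graph. A hedged sketch for build.sbt; the exact version to pin (2.4.x here) is illustrative:

    // Force one jackson-databind version over transitive choices.
    dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4"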

Re: Not able to run FP-growth Example

2015-06-14 Thread masoom alam
*Getting the following error:* [INFO] [INFO] [INFO] Building example 0.0.1 [INFO] Downloading: http://repo.maven.apache.org/maven2/org/apache/spark/spa

Re: Join between DStream and Periodically-Changing-RDD

2015-06-14 Thread Ilove Data
@Akhil Das Joining two DStreams might not be an option since I want to join the stream with historical data in an HDFS folder. @Tathagata Das & @Evo Eftimov The batch RDD to be reloaded is considerably huge compared to the DStream data, since it is historical data. To be more specific, most of the joins from the rdd stream to h
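A hedged sketch of the transform-based join under discussion: keep a reference to the historical RDD that a background reload can swap, and join inside transform so each batch picks up the current reference. The path, schema, and pairStream are hypothetical:

    import org.apache.spark.rdd.RDD

    def loadHistorical(): RDD[(String, String)] =
      sc.textFile("hdfs:///data/history/*").map { line =>
        val Array(k, v) = line.split("\t", 2)
        (k, v)
      }

    @volatile var historical = loadHistorical()  // swapped periodically elsewhere
    // transform's closure runs once per batch, so a newly swapped-in
    // historical RDD is used from the next interval onward.
    val joined = pairStream.transform(rdd => rdd.join(historical))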

Union of two Diff. types of DStreams

2015-06-14 Thread anshu shukla
How to take the union of a JavaPairDStream and a JavaDStream? *>a.union(b) is working only with DStreams of the same type.* -- Thanks & Regards, Anshu Shukla

Re: Union of two Diff. types of DStreams

2015-06-14 Thread ayan guha
That's by design. You can make up the value in the 2nd RDD by putting in a default Int value. On 15 Jun 2015 15:30, "anshu shukla" wrote: > How to take union of JavaPairDStream and > JavaDStream . > > *>a.union(b) is working only with Dstreams of same type.* > > > -- > Thanks & Regards, > Ans
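A hedged Scala sketch of that suggestion: lift the plain stream into pairs with a default value so both sides share one element type (pairs and plain are hypothetical streams):

    // pairs: DStream[(String, Int)], plain: DStream[String]
    val unioned = pairs.union(plain.map(s => (s, 0)))  // 0 as the default Int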

Spark DataFrame Reduce Job Took 40s for 6000 Rows

2015-06-14 Thread Proust GZ Feng
Hi, Spark Experts I have played with Spark for several weeks. After some testing, a reduce operation on a DataFrame costs 40s on a cluster with 5 datanode executors, and the backing row count is about 6,000. Is this a normal case? Such performance looks too bad, because in Java a loop over 6,000 rows ca

Re: Not able to run FP-growth Example

2015-06-14 Thread masoom alam
even the following POM is not working: <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"> spark-parent_2.10 org.apache.spark