Re: How to read LZO file in Spark?

2017-09-28 Thread Vida Ha
https://docs.databricks.com/spark/latest/data-sources/read-lzo.html On Wed, Sep 27, 2017 at 6:36 AM 孫澤恩 wrote: > Hi All, > > Currently, I follow this blog > http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/ > that > I could use hdfs dfs -text to read the
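The linked Databricks doc covers reading LZO-compressed files via the Hadoop input format from the hadoop-lzo project mentioned in the Cloudera post. A minimal sketch of that approach, assuming the hadoop-lzo jar and native codec are installed on the cluster and that `sc` is a spark-shell SparkContext (the HDFS path is made up):

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import com.hadoop.mapreduce.LzoTextInputFormat  // from the hadoop-lzo packaging

// Read a (splittable, indexed) .lzo file as an RDD of text lines.
val lines = sc.newAPIHadoopFile(
    "hdfs:///data/input.lzo",        // hypothetical path
    classOf[LzoTextInputFormat],
    classOf[LongWritable],
    classOf[Text])
  .map { case (_, text) => text.toString }

lines.take(5).foreach(println)
```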

Re: How to join two PairRDD together?

2014-08-25 Thread Vida Ha
Can you paste the code? It's unclear to me how/when the out-of-memory error is occurring without seeing the code. On Sun, Aug 24, 2014 at 11:37 PM, Gefei Li wrote: > Hello everyone, > I am transplanting a clustering algorithm to the Spark platform, and I > meet a problem confusing me for a long ti
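For reference, the thread subject asks how to join two PairRDDs; a minimal sketch of the standard `join` on RDDs keyed the same way (the data here is made up), which is also where memory typically blows up if one key is heavily skewed on both sides:

```scala
// Two pair RDDs keyed by the same id (example data, not from the thread).
val profiles = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val scores   = sc.parallelize(Seq((1, 0.9), (2, 0.7), (2, 0.4)))

// Inner join on the key: values for matching keys are paired up.
val joined = profiles.join(scores)   // RDD[(Int, (String, Double))]
joined.collect().foreach(println)
```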

Re: java.io.NotSerializableException: org.scalatest.Assertions$AssertionsHelper

2014-08-20 Thread Vida Ha
Hi Chris, We have a knowledge base article to explain what's happening here: https://github.com/databricks/spark-knowledgebase/blob/master/troubleshooting/javaionotserializableexception.md Let me know if the article is not clear enough - I would be happy to edit and improve it. -Vida On Wed,
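The linked knowledge base article describes the usual cause: a closure that references a field of a non-serializable enclosing object (here, the ScalaTest suite with its AssertionsHelper), so the whole outer instance gets pulled into the task. A common fix, sketched below with a hypothetical class rather than the suite from the thread, is to copy what the closure needs into a local val:

```scala
// Sketch of the general pattern. The outer object is not serializable
// (like a test suite, or anything holding a SparkContext).
class Pipeline(sc: org.apache.spark.SparkContext) {
  val factor = 3

  def scaleBad(nums: Seq[Int]) =
    sc.parallelize(nums).map(_ * factor)   // `factor` drags `this` into the closure -> fails

  def scaleGood(nums: Seq[Int]) = {
    val f = factor                         // local copy; only `f` is shipped to executors
    sc.parallelize(nums).map(_ * f)
  }
}
```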

Re: Got NotSerializableException when access broadcast variable

2014-08-20 Thread Vida Ha
Hi, I doubt that the broadcast variable is your problem, since you are seeing: org.apache.spark.SparkException: Task not serializable Caused by: java.io.NotSerializableException: org.apache.spark.sql.hive.HiveContext$$anon$3 We have a knowledge base article that explains why this happens - it's a
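The stack trace points at a HiveContext anonymous class rather than the broadcast variable: contexts (SparkContext, HiveContext/SQLContext) live on the driver and cannot be serialized into closures that run on executors. A sketch of the distinction, with a made-up table name:

```scala
// BAD (sketch): referencing the HiveContext inside a transformation forces
// it into the task closure and fails with NotSerializableException.
// ids.map(id => hiveContext.sql(s"SELECT name FROM users WHERE id = $id").count())

// OK: issue the query once on the driver, then transform/count the result.
val users = hiveContext.sql("SELECT id, name FROM users")
println(users.count())
```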

Re: Writing to RabbitMQ

2014-08-18 Thread Vida Ha
Mon, Aug 18, 2014 at 4:25 PM, Vida Ha wrote: > Hi John, > > It seems like the original problem you had was that you were initializing the > RabbitMQ connection on the driver, but then calling the code to write to > RabbitMQ on the workers (I'm guessing, but I don't know since

Re: Writing to RabbitMQ

2014-08-18 Thread Vida Ha
Hi John, It seems like the original problem you had was that you were initializing the RabbitMQ connection on the driver, but then calling the code to write to RabbitMQ on the workers (I'm guessing, but I don't know since I didn't see your code). That's definitely a problem because the connection can
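The usual way around this, sketched here with the standard Java RabbitMQ client and made-up host/queue names, is to open the connection inside foreachPartition so it is created on each executor instead of being serialized from the driver (one connection per partition, not per record); `messages` is assumed to be an RDD[String]:

```scala
import com.rabbitmq.client.ConnectionFactory

messages.foreachPartition { records =>
  // Created on the executor, never shipped from the driver.
  val factory = new ConnectionFactory()
  factory.setHost("rabbitmq.example.com")   // hypothetical host
  val connection = factory.newConnection()
  val channel = connection.createChannel()
  try {
    records.foreach { msg =>
      channel.basicPublish("", "spark-output", null, msg.getBytes("UTF-8"))
    }
  } finally {
    channel.close()
    connection.close()
  }
}
```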

Re: Save an RDD to a SQL Database

2014-08-07 Thread Vida Ha
and then load >> it from there into Redshift. This is not as slow as you think, because Spark >> can write the output in parallel to S3, and Redshift, too, can load data >> from multiple files in parallel >> <http://docs.aws.amazon.com/redshift/latest/dg/c_best-pra
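A sketch of that two-step approach, with made-up bucket, table, and column names (the S3 scheme and credentials setup depend on your Hadoop version and AWS configuration): Spark writes one part file per partition under a single prefix, and Redshift's COPY then loads all of the part files in parallel.

```scala
// Step 1: write the results out to S3 in parallel as CSV lines.
results
  .map { case (id, total) => s"$id,$total" }
  .saveAsTextFile("s3a://my-bucket/exports/totals/")   // hypothetical bucket/prefix

// Step 2 (run against Redshift from a SQL client, not Spark):
//   COPY totals FROM 's3://my-bucket/exports/totals/'
//   IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
//   CSV;
```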

Re: Save an RDD to a SQL Database

2014-08-07 Thread Vida Ha
The use case I was thinking of was outputting calculations made in Spark into a SQL database for the presentation layer to access. So in other words, having a Spark backend in Java that writes to a SQL database and then having a Rails front-end that can display the data nicely. On Thu, Aug 7, 20

Save an RDD to a SQL Database

2014-08-05 Thread Vida Ha
Hi, I would like to save an RDD to a SQL database. It seems like this would be a common enough use case. Are there any built-in libraries to do it? Otherwise, I'm just planning on mapping my RDD, and having that call a method to write to the database. Given that a lot of records are going to
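Newer Spark versions added a DataFrame JDBC writer, but at the RDD level the usual pattern is the one described in this thread: open one connection per partition with foreachPartition and batch the inserts rather than opening a connection per record. A minimal sketch, with a hypothetical table, JDBC URL, and credentials (the JDBC driver is assumed to be on the executor classpath); `records` is assumed to be an RDD[(Long, Double)]:

```scala
import java.sql.DriverManager

records.foreachPartition { rows =>
  // One connection per partition, created on the executor.
  val conn = DriverManager.getConnection(
    "jdbc:postgresql://db.example.com/analytics", "user", "secret")
  val stmt = conn.prepareStatement("INSERT INTO results (id, score) VALUES (?, ?)")
  try {
    rows.foreach { case (id, score) =>
      stmt.setLong(1, id)
      stmt.setDouble(2, score)
      stmt.addBatch()          // batch inserts to cut round trips
    }
    stmt.executeBatch()
  } finally {
    stmt.close()
    conn.close()
  }
}
```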