partitions, SQL tables, and Parquet I/O

2015-02-26 Thread Daniel, Ronald (ELS-SDG)
Short story: I want to write out two sets of Parquet files so that both are pre-partitioned by the same key. Then, when I read them back in, joining the two tables on that key should be about as fast as possible. Can I do that, and if so, how? I don't see how to control the partitioning of a SQL table.
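A minimal PySpark sketch of the intended round trip, with made-up paths and a hypothetical join column named "key"; whether the on-disk partitioning can be controlled and preserved through Parquet is exactly what the question asks:

    # Hypothetical sketch: both tables were written as Parquet elsewhere;
    # read them back and join on the shared key through Spark SQL.
    a = sqlContext.parquetFile('s3n://bucket/table_a')
    b = sqlContext.parquetFile('s3n://bucket/table_b')
    a.registerTempTable('a')
    b.registerTempTable('b')
    joined = sqlContext.sql('SELECT * FROM a JOIN b ON a.key = b.key')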

sparse vector operations in Python

2015-03-09 Thread Daniel, Ronald (ELS-SDG)
Hi, Sorry to ask this, but how do I compute the sum of 2 (or more) mllib SparseVectors in Python? Thanks, Ron

RE: How to do sparse vector product in Spark?

2015-03-13 Thread Daniel, Ronald (ELS-SDG)
> Any convenient tool to do this [sparse vector product] in Spark? Unfortunately, it seems that there are very few operations defined for sparse vectors. I needed to add some, and ended up converting them to (dense) numpy vectors and doing the addition on those. Best regards, Ron
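For illustration, a small sketch of that workaround, assuming pyspark.mllib SparseVectors; toArray() densifies them into numpy arrays, after which ordinary element-wise arithmetic applies:

    from pyspark.mllib.linalg import SparseVector

    v1 = SparseVector(5, {0: 1.0, 3: 2.0})
    v2 = SparseVector(5, {1: 4.0, 3: 5.0})

    # Densify and let numpy do the element-wise operations MLlib doesn't expose.
    total   = v1.toArray() + v2.toArray()   # element-wise sum
    product = v1.toArray() * v2.toArray()   # element-wise product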

RE: topic modeling using LDA in MLLib

2015-03-18 Thread Daniel, Ronald (ELS-SDG)
Wordcount is a very common example, so you can find it in several places in the Spark documentation and tutorials. Beware! They typically tokenize the text by splitting on whitespace. That will leave you with tokens that have trailing commas, periods, and other things attached. Also, you probably want to lowercase the tokens.
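As an illustration of that advice, a hedged sketch that tokenizes with a regular expression instead of a bare whitespace split, dropping attached punctuation and folding case before counting:

    import re

    def tokenize(line):
        # Lowercase, then keep only runs of letters/digits, so trailing commas
        # and periods are not glued onto the tokens.
        return re.findall(r"[a-z0-9]+", line.lower())

    tokens = sc.textFile('some_text_file.txt').flatMap(tokenize)   # path is made up
    counts = tokens.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)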

Hash Partitioning and Dataframes

2015-05-08 Thread Daniel, Ronald (ELS-SDG)
Hi, How can I ensure that a batch of DataFrames I make are all partitioned based on the value of one column common to them all? For RDDs I would use partitionBy with a HashPartitioner, but I don't see that in the DataFrame API. If I partition the RDDs that way, then do a toDF(), will the partitioning be preserved?
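A sketch of the RDD-side approach described in the question, with made-up RDDs of Row objects keyed on their first column; whether that co-partitioning remains visible after building the DataFrames is the open part of the question:

    # rdd_a and rdd_b are assumed to be RDDs of Row objects with the join key
    # in position 0 (names are hypothetical).
    NUM_PARTS = 64
    keyed_a = rdd_a.map(lambda row: (row[0], row)).partitionBy(NUM_PARTS)
    keyed_b = rdd_b.map(lambda row: (row[0], row)).partitionBy(NUM_PARTS)

    df_a = sqlContext.createDataFrame(keyed_a.values())
    df_b = sqlContext.createDataFrame(keyed_b.values())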

RE: Hash Partitioning and Dataframes

2015-05-08 Thread Daniel, Ronald (ELS-SDG)
What are you trying to accomplish? Internally Spark SQL will add Exchange operators to make sure that data is partitioned correctly for joins and aggregations. If you are
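One hedged way to see that behavior is to print the physical plan of a join; the Exchange operators mark where Spark SQL repartitions the data (table and column names here are made up):

    joined = sqlContext.sql('SELECT * FROM a JOIN b ON a.key = b.key')
    joined.explain()   # the physical plan lists any Exchange (shuffle) steps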

RE: creating a distributed index

2014-08-04 Thread Daniel, Ronald (ELS-SDG)
At the Spark Summit, the Typesafe people had a toy implementation of a full-text index that you could use as a starting point. The bare code is available on GitHub at https://github.com/deanwampler/spark-workshop/blob/eb077a734aad166235de85494def8fe3d4d2ca66/src/main/scala/spark/InvertedIndex5b.

RE: Regarding tooling/performance vs RedShift

2014-08-06 Thread Daniel, Ronald (ELS-SDG)
Just to point out that the benchmark you point to has Redshift running on HDD machines instead of SSD, and it is still faster than Shark in all but one case. Like Gary, I'm also interested in replacing something we have on Redshift with Spark SQL, as it will give me much greater capability to pr

RE: Regarding tooling/performance vs RedShift

2014-08-06 Thread Daniel, Ronald (ELS-SDG)
Ron From: Gary Malouf Subject: Re: Regarding tooling/performance vs RedShift Also, regarding something like redshift not having MLlib built in, mu

Column width limits?

2014-08-06 Thread Daniel, Ronald (ELS-SDG)
Assume I want to make a PairRDD whose keys are S3 URLs and whose values are Strings holding the contents of those (UTF-8) files, but NOT split into lines. Are there length limits on those files/Strings? 1 MB? 16 MB? 4 GB? 1 TB? Similarly, can such a thing be registered as a table so that I can us
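For the first part, a hedged sketch of SparkContext.wholeTextFiles, which yields exactly that shape of PairRDD: (file URL, entire UTF-8 contents as one string), not split into lines (the path is made up):

    files = sc.wholeTextFiles('s3n://bucket/articles/')
    first_url, first_text = files.first()   # one whole file per element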

Accessing neighboring elements in an RDD

2014-09-03 Thread Daniel, Ronald (ELS-SDG)
Hi all, Assume I have read the lines of a text file into an RDD: textFile = sc.textFile("SomeArticle.txt") Also assume that the sentence breaks in SomeArticle.txt were done by machine and have some errors, such as the break at "Fig." in the sample text below (a small Index / Text table, truncated).
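One common sketch for reaching a neighboring line is to index every line and join each index with index + 1; the variable names are illustrative:

    lines = sc.textFile("SomeArticle.txt")
    indexed = lines.zipWithIndex().map(lambda ti: (ti[1], ti[0]))   # (index, text)

    # Shift indices by one so each line is paired with the line that follows it.
    shifted = indexed.map(lambda it: (it[0] - 1, it[1]))
    neighbors = indexed.join(shifted)   # (i, (text_i, text_i_plus_1))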

RE: Accessing neighboring elements in an RDD

2014-09-03 Thread Daniel, Ronald (ELS-SDG)
ement-in-a-sorted-RDD-td12621.html#a12664 On Wed, Sep 3, 2014 at 10:33 AM, Daniel, Ronald (ELS-SDG) wrote:

RE: Accessing neighboring elements in an RDD

2014-09-03 Thread Daniel, Ronald (ELS-SDG)
Thanks Xiangrui, that looks very helpful. Best regards, Ron

filtering a SchemaRDD

2014-11-14 Thread Daniel, Ronald (ELS-SDG)
Hi all, I have a SchemaRDD that is loaded from a file. Each Row contains 7 fields, one of which holds the text for a sentence from a document. # Load sentence data table sentenceRDD = sqlContext.parquetFile('s3n://some/path/thing') sentenceRDD.take(3) Out[20]: [Row(annotID=118, annotSet=u'

RE: filtering a SchemaRDD

2014-11-16 Thread Daniel, Ronald (ELS-SDG)
Indeed it did. Thanks! Ron From: Michael Armbrust Subject: Re: filtering a SchemaRDD If I use row[6] instead of row["text"] I get what I am looking fo
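For reference, a hedged sketch of the positional-access workaround discussed here, using a made-up substring test on field 6 (the sentence text):

    matches = sentenceRDD.filter(lambda row: 'some phrase' in row[6])
    matches.take(3)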

Using TF-IDF from MLlib

2014-11-20 Thread Daniel, Ronald (ELS-SDG)
Hi all, I want to try the TF-IDF functionality in MLlib. I can feed it words and generate the tf and idf RDD[Vector]s, using the code below. But how do I get this back to words and their counts and tf-idf values for presentation? val sentsTmp = sqlContext.sql("SELECT text FROM sentenceTable")
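A Python sketch of one way to get per-word tf-idf values back out under the standard HashingTF / IDF pipeline; because HashingTF keeps only hash indices, you re-hash a word with indexOf to find its slot (the example word and the 'sentences' RDD of raw sentence strings are assumptions):

    from pyspark.mllib.feature import HashingTF, IDF

    docs = sentences.map(lambda s: s.split())
    htf = HashingTF()
    tf = htf.transform(docs)
    tf.cache()
    tfidf = IDF().fit(tf).transform(tf)

    # Look up a chosen word's tf-idf value in the first document by re-hashing it.
    idx = htf.indexOf('genome')            # example word, made up
    print(tfidf.first().toArray()[idx])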

RE: Using TF-IDF from MLlib

2014-11-21 Thread Daniel, Ronald (ELS-SDG)
Thanks for the info Andy. A big help. One thing - I think you can figure out which document is responsible for which vector without checking in more code. Start with a PairRDD of [doc_id, doc_string] for each document and split that into one RDD for each column. The values in the doc_string RDD
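A hedged Python sketch of that alignment: split the (doc_id, doc_string) PairRDD into two column RDDs, run the text side through HashingTF / IDF, and zip the ids back onto the vectors; this relies on both sides coming from the same parent through map-only transformations, so partitioning and order line up:

    from pyspark.mllib.feature import HashingTF, IDF

    # pairs is assumed to be a PairRDD of (doc_id, doc_string).
    doc_ids = pairs.map(lambda p: p[0])
    texts   = pairs.map(lambda p: p[1].split())

    tf = HashingTF().transform(texts)
    tf.cache()
    tfidf = IDF().fit(tf).transform(tf)

    # Same parent and same ordering, so zip re-associates each id with its vector.
    id_vectors = doc_ids.zip(tfidf)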