Short story: I want to write some parquet files so they are pre-partitioned by
the same key. Then, when I read them
back in, joining the two tables on that key should be about as fast as things
can be done.
Can I do that, and if so, how? I don't see how to control the partitioning of a
SQL table
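Here is a sketch of the kind of thing I mean (table names, paths, and the "doc_id" key are invented; this uses the newer DataFrame reader/writer API, repartitioning by a column may need a newer Spark release than some of us run, and whether the join can really reuse that layout after a round trip through Parquet is exactly the part I'm unsure about):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)   # assumes an existing SparkContext `sc`

# Two hypothetical tables that will later be joined on "doc_id".
left = sqlContext.read.parquet("s3n://bucket/left_raw")
right = sqlContext.read.parquet("s3n://bucket/right_raw")

# Shuffle both sides onto the same hash layout before writing, so the Parquet
# files for each table are grouped by the same join key.
left.repartition(200, "doc_id").write.parquet("s3n://bucket/left_by_key")
right.repartition(200, "doc_id").write.parquet("s3n://bucket/right_by_key")

# Later: read both back and join on the shared key.
joined = (sqlContext.read.parquet("s3n://bucket/left_by_key")
          .join(sqlContext.read.parquet("s3n://bucket/right_by_key"), "doc_id"))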
Hi,
Sorry to ask this, but how do I compute the sum of 2 (or more) mllib
SparseVectors in Python?
Thanks,
Ron
> Any convenient tool to do this [sparse vector product] in Spark?
Unfortunately, it seems that there are very few operations defined for sparse
vectors. I needed to add some, and ended up converting them to (dense) numpy
vectors and doing the addition on those.
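In case it saves someone the trouble, a minimal sketch of that workaround (the helper name is mine; only toArray(), Vectors.sparse(), and Vectors.dense() are MLlib calls):

import numpy as np
from pyspark.mllib.linalg import Vectors

def add_vectors(*vs):
    # Sum MLlib vectors by going through dense numpy arrays.
    total = vs[0].toArray().copy()
    for v in vs[1:]:
        total += v.toArray()   # works for both SparseVector and DenseVector
    return Vectors.dense(total)

a = Vectors.sparse(5, {0: 1.0, 3: 2.0})
b = Vectors.sparse(5, {3: 1.0, 4: 4.0})
print(add_vectors(a, b))   # DenseVector [1.0, 0.0, 0.0, 3.0, 4.0]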
Best regards,
Ron
From: Xi She
Wordcount is a very common example, so you can find it in several places in the Spark documentation and tutorials. Beware! They typically tokenize the text by splitting on whitespace. That will leave you with tokens that have trailing commas, periods, and other punctuation.
Also, you probably want to lowercase the tokens.
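Something along these lines, for example (the regex is just one reasonable choice, and the input path is a placeholder):

import re

lines = sc.textFile("s3n://bucket/some-text.txt")   # assumes an existing SparkContext `sc`

counts = (lines
          .flatMap(lambda line: re.findall(r"[a-z]+(?:'[a-z]+)?", line.lower()))
          .map(lambda w: (w, 1))
          .reduceByKey(lambda a, b: a + b))

counts.takeOrdered(20, key=lambda kv: -kv[1])   # 20 most frequent words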
Hi,
How can I ensure that a batch of DataFrames I make are all partitioned based on
the value of one column common to them all?
For RDDs I would partitionBy a HashPartitioner, but I don't see that in the
DataFrame API.
If I partition the RDDs that way, then do a toDF(), will the partitioning be preserved?
Sent: Friday, May 08, 2015 3:15 PM
To: Daniel, Ronald (ELS-SDG)
Cc: user@spark.apache.org
Subject: Re: Hash Partitioning and Dataframes
What are you trying to accomplish? Internally Spark SQL will add Exchange
operators to make sure that data is partitioned correctly for joins and
aggregations. If you are
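The reply is truncated here, but both halves of the exchange are easy to sketch (names and sizes below are invented). The first part hash-partitions the underlying RDDs before building DataFrames, as the question asks; the explain() call at the end shows the Exchange operators Spark SQL inserts on its own for the join, which is the point being made in the reply:

from pyspark.sql import Row, SQLContext

sqlContext = SQLContext(sc)   # assumes an existing SparkContext `sc`

# Hash-partition two pair RDDs on the same key, with the same partition count.
left_rdd = sc.parallelize([(i, "L%d" % i) for i in range(1000)]).partitionBy(16)
right_rdd = sc.parallelize([(i, "R%d" % i) for i in range(1000)]).partitionBy(16)

# Build DataFrames from them. Note that the map to Row objects already drops
# the RDD's partitioner, so the hash layout is not carried into the DataFrame.
left = sqlContext.createDataFrame(left_rdd.map(lambda kv: Row(key=kv[0], lval=kv[1])))
right = sqlContext.createDataFrame(right_rdd.map(lambda kv: Row(key=kv[0], rval=kv[1])))

# The physical plan shows where Spark SQL adds Exchange (shuffle) operators
# for the join, regardless of how the input RDDs were partitioned.
left.join(right, left.key == right.key).explain()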
At the Spark Summit, the Typesafe people had a toy implementation of a
full-text index that you could use as a starting point. The bare code is
available on GitHub at
https://github.com/deanwampler/spark-workshop/blob/eb077a734aad166235de85494def8fe3d4d2ca66/src/main/scala/spark/InvertedIndex5b.
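In case the link moves, the basic shape of an inverted index in Spark is small enough to sketch. This is my own PySpark paraphrase of the idea, not the Typesafe code (which is Scala), and the corpus path is a placeholder:

import re

# (path, contents) pairs, one record per document
docs = sc.wholeTextFiles("s3n://bucket/corpus/")   # assumes an existing SparkContext `sc`

# (word, doc) counts, then a postings list per word
inverted = (docs
            .flatMap(lambda pc: [((w, pc[0]), 1)
                                 for w in re.findall(r"[a-z]+", pc[1].lower())])
            .reduceByKey(lambda a, b: a + b)                   # ((word, doc), count)
            .map(lambda kv: (kv[0][0], [(kv[0][1], kv[1])]))   # (word, [(doc, count)])
            .reduceByKey(lambda a, b: a + b))                  # (word, full postings list)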
Just to point out that the benchmark you point to has Redshift running on HDD
machines instead of SSD, and it is still faster than Shark in all but one case.
Like Gary, I'm also interested in replacing something we have on Redshift with
Spark SQL, as it will give me much greater capability to pr
Ron
From: Gary Malouf [mailto:malouf.g...@gmail.com]
Sent: Wednesday, August 06, 2014 1:17 PM
To: Nicholas Chammas
Cc: Daniel, Ronald (ELS-SDG); user@spark.apache.org
Subject: Re: Regarding tooling/performance vs RedShift
Also, regarding something like redshift not having MLlib built in, mu
Assume I want to make a PairRDD whose keys are S3 URLs and whose values are
Strings holding the contents of those (UTF-8) files, but NOT split into lines.
Are there length limits on those files/Strings? 1 MB? 16 MB? 4 GB? 1 TB?
Similarly, can such a thing be registered as a table so that I can us
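The message is cut off, but for the first part sc.wholeTextFiles is the usual answer, and the pairs can then be registered as a table. A sketch (bucket path and table name invented; I can't speak to hard size limits beyond each file having to fit comfortably in one executor's memory):

from pyspark.sql import Row, SQLContext

sqlContext = SQLContext(sc)   # assumes an existing SparkContext `sc`

# (S3 URL, whole-file contents) pairs, one record per file, NOT split into lines.
files = sc.wholeTextFiles("s3n://my-bucket/articles/")

# Register them as a table so the contents can be queried with SQL.
table = sqlContext.createDataFrame(files.map(lambda kv: Row(url=kv[0], text=kv[1])))
table.registerTempTable("documents")
sqlContext.sql("SELECT url FROM documents").take(5)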
Hi all,
Assume I have read the lines of a text file into an RDD:
textFile = sc.textFile("SomeArticle.txt")
Also assume that the sentence breaks in SomeArticle.txt were done by machine
and have some errors, such as the break at Fig. in the sample text below.
Index Text
N...as show
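The example and the question are truncated here, but on the underlying problem (a sentence wrongly broken after something like "Fig."), one sketch is to index the lines and pair each one with its successor, so suspicious breaks can be inspected and mended downstream; the endswith test is just an illustrative heuristic:

# assumes `textFile` from above
indexed = textFile.zipWithIndex().map(lambda li: (li[1], li[0]))   # (index, line)
shifted = indexed.map(lambda il: (il[0] - 1, il[1]))               # (index, following line)

# (index, (line, next line or None)), in order
pairs = indexed.leftOuterJoin(shifted).sortByKey()

# Breaks that look spurious, e.g. a "sentence" ending in "Fig."
suspect = pairs.filter(lambda kv: kv[1][0].rstrip().endswith("Fig."))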
ement-in-a-sorted-RDD-td12621.html#a12664
On Wed, Sep 3, 2014 at 10:33 AM, Daniel, Ronald (ELS-SDG) <r.dan...@elsevier.com> wrote:
> Hi all,
> Assume I have read the lines of a text file into an RDD:
> textFile = sc.textFile("SomeArticle.txt")
> Also assume that the sentence
Thanks Xiangrui, that looks very helpful.
Best regards,
Ron
> -----Original Message-----
> From: Xiangrui Meng [mailto:men...@gmail.com]
> Sent: Wednesday, September 03, 2014 1:19 PM
> To: Daniel, Ronald (ELS-SDG)
> Cc: Victor Tso-Guillen; user@spark.apache.org
> Subj
Hi all,
I have a SchemaRDD that is loaded from a file. Each Row contains 7 fields, one
of which holds the text for a sentence from a document.
# Load sentence data table
sentenceRDD = sqlContext.parquetFile('s3n://some/path/thing')
sentenceRDD.take(3)
Out[20]: [Row(annotID=118, annotSet=u'
Indeed it did. Thanks!
Ron
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, November 14, 2014 9:53 PM
To: Daniel, Ronald (ELS-SDG)
Cc: user@spark.apache.org
Subject: Re: filtering a SchemaRDD
If I use row[6] instead of row["text"] I get what I am looking for.
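For anyone who hits the same thing, a sketch of the two access styles with the old SchemaRDD API (a SchemaRDD is still an RDD of Rows, so a plain lambda filter works; the search term is arbitrary, and the assumption that field 6 holds the text comes from the messages above):

# Positional access, as in the workaround above:
hits = sentenceRDD.filter(lambda row: "Spark" in row[6])

# Attribute-style access by field name is the other option:
hits_by_name = sentenceRDD.filter(lambda row: "Spark" in row.text)

hits.count()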
Hi all,
I want to try the TF-IDF functionality in MLlib.
I can feed it words and generate the tf and idf RDD[Vector]s, using the code
below.
But how do I map this back to words and their counts and tf-idf values for presentation?
val sentsTmp = sqlContext.sql("SELECT text FROM sentenceTable")
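A sketch of one way to close that loop, in PySpark rather than the Scala above, using the SchemaRDD-era API where the SQL result is still an RDD. HashingTF only keeps hashed indices, so the word-to-index mapping has to be kept on the side (via indexOf), and hash collisions can make two words share a slot; everything other than the MLlib calls is made up for illustration:

from pyspark.mllib.feature import HashingTF, IDF

sentences = sqlContext.sql("SELECT text FROM sentenceTable").map(lambda row: row[0])
tokenized = sentences.map(lambda s: s.lower().split())

hashingTF = HashingTF()
tf = hashingTF.transform(tokenized)
tf.cache()
tfidf = IDF().fit(tf).transform(tf)

# Keep the word -> hashed index mapping so the vectors can be reported as words.
vocab = tokenized.flatMap(lambda ws: ws).distinct().collect()
index_to_word = dict((hashingTF.indexOf(w), w) for w in vocab)

first = tfidf.first()
for i, weight in zip(first.indices, first.values):
    print("%s\t%.4f" % (index_to_word.get(int(i), "?"), weight))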
Thanks for the info Andy. A big help.
One thing - I think you can figure out which document is responsible for which
vector without checking in more code.
Start with a PairRDD of [doc_id, doc_string] for each document and split that
into one RDD for each column.
The values in the doc_string RDD
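The explanation is cut off, but here is a sketch of that split-and-rejoin idea as I read it, on a toy corpus. The caveat is that zip relies on the two sides keeping the same partitioning and element order, i.e. nothing gets shuffled between the split and the zip:

from pyspark.mllib.feature import HashingTF, IDF

# A toy PairRDD of (doc_id, doc_string); assumes an existing SparkContext `sc`.
pairs = sc.parallelize([("doc1", "spark makes rdds"),
                        ("doc2", "rdds hold data")])

ids  = pairs.keys()                                     # one RDD per "column"
docs = pairs.values().map(lambda s: s.lower().split())

tf = HashingTF().transform(docs)
tfidf = IDF().fit(tf).transform(tf)

# Zip the doc_id column back onto the vectors.
id_to_vector = ids.zip(tfidf)
id_to_vector.collect()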