Seemingly wasteful memory duplication in LDAModel getTopicDistributionMethod()

2019-02-22 Thread Andrew Mathis
In my usage of MLlib's LDA, I have noticed that repeated invocations of LDAModel.transform() result in the duplication of a matrix derived from the model's topic matrix. Because this derived matrix can be quite large (imagine hundreds of topics and a vocabulary size in the tens or hundreds of thousands), …
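One hedged workaround sketch, not a fix for the duplication itself: since each transform() call rebuilds the derived matrix, batching documents into a single DataFrame and transforming once avoids repeating that work. The `ldaModel` and `batches` names below are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.clustering.LDAModel

// Hypothetical sketch: `ldaModel` is a fitted LDAModel and `batches` is a
// Seq[DataFrame] of documents that would otherwise be transformed one at a
// time. Unioning first means the derived matrix is built once per
// transform() call instead of once per batch.
def transformOnce(ldaModel: LDAModel, batches: Seq[DataFrame]): DataFrame = {
  val all = batches.reduce(_ union _)
  ldaModel.transform(all) // topicDistribution computed in a single pass
}
```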

How can I parse an "unnamed" JSON array present in a column?

2019-02-22 Thread Yeikel
I have an "unnamed" JSON array stored in a *column*. The format is the following: column name: news. Data: [ { "source": "source1", "name": "News site1" }, { "source": "source2", "name": "News site2" } ]. Ideally, I'd like to parse it as: news ARRAY<STRUCT<source: STRING, name: STRING>>. I've tried …
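A minimal sketch of one common approach, using from_json with an explicit array-of-struct schema. The column name and schema are assumptions taken from the sample data in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("parse-news").getOrCreate()
import spark.implicits._

// Schema inferred from the sample: an array of structs with two string fields.
val newsSchema = ArrayType(StructType(Seq(
  StructField("source", StringType),
  StructField("name", StringType)
)))

val df = Seq(
  """[{"source": "source1", "name": "News site1"},
      {"source": "source2", "name": "News site2"}]"""
).toDF("news")

// Parse the JSON string column into ARRAY<STRUCT<source: STRING, name: STRING>>.
val parsed = df.withColumn("news", from_json(col("news"), newsSchema))
parsed.printSchema()
```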

Detect data from textFile RDD

2019-02-22 Thread swastik mittal
Hey, I am working with the Spark source code. I am printing logs within the code to understand how HadoopRDD works. I want to print a timestamp when an executor first reads the textFile RDD (input source (URL) in the form of HDFS). I tried to print some logs in Executor.scala, but they do not display on the ru…
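One way to see a per-partition read timestamp without patching Spark internals is to log from user code running on the executor. A hedged sketch (the HDFS path is a placeholder, and note the output lands in each executor's stderr log, not on the driver console):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("timed-read").getOrCreate()
val sc = spark.sparkContext

// Placeholder path: replace with the real HDFS URL.
val rdd = sc.textFile("hdfs://namenode:8020/path/to/input")

val timed = rdd.mapPartitionsWithIndex { (idx, iter) =>
  // Runs on the executor when the partition is first computed; this line
  // appears in that executor's stderr log, not in the driver's output.
  System.err.println(s"partition $idx first read at ${System.currentTimeMillis()} ms")
  iter
}
timed.count() // force evaluation
```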

Re: Standardized Join Types for DataFrames

2019-02-22 Thread Jules Damji
Also, Holden Karau conducts pull-request reviews and shows how you can contribute to this communal project. Attend one of her live PR sessions. Cheers Jules Sent from my iPhone Pardon the dumb thumb typos :) > On Feb 22, 2019, at 7:16 AM, Pooja Agrawal wrote: > > Hi, > > I am new to spark …

Standardized Join Types for DataFrames

2019-02-22 Thread Pooja Agrawal
Hi, I am new to Spark and want to start contributing to Apache Spark to learn more about it. I found this JIRA for "Standardized Join Types for DataFrames", which I feel could be a good starter task for me. I wanted to confirm whether this is a relevant/actionable task and whether I can start working on …
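For context on what the JIRA is about: today the DataFrame API accepts the join type as a free-form string, which is what a standardized (typed) representation would improve. A minimal sketch with made-up data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-types").getOrCreate()
import spark.implicits._

val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
val right = Seq((2, "x"), (3, "y")).toDF("id", "r")

// The join type is a plain string ("inner", "left_outer", "full_outer",
// "left_semi", "left_anti", ...); a typo is only caught at analysis time.
left.join(right, Seq("id"), "left_outer").show()
```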

Occasional broadcast timeout when dynamic allocation is on

2019-02-22 Thread Artem P
Hi! We have dynamic allocation enabled for our regular jobs, and sometimes they fail with java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]. It seems the Spark driver starts the broadcast just before the job has received any executors from YARN, and if it takes more than …
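Two commonly suggested mitigations, offered as a sketch rather than a confirmed fix for this case: raise the broadcast timeout (the 300-second default matches the error above), or keep a minimum number of executors alive so the broadcast does not race executor allocation:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("broadcast-timeout-mitigation")
  // Default is 300s; raise it if executors can take minutes to arrive.
  .config("spark.sql.broadcastTimeout", "1200")
  // Keep a few executors alive from the start under dynamic allocation.
  .config("spark.dynamicAllocation.minExecutors", "2")
  .getOrCreate()
```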