Please vote on releasing the following candidate as Apache Spark version
1.6.2!
The vote is open until Sunday, June 19, 2016 at 22:00 PDT and passes if a
majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.6.2
[ ] -1 Do not release this package because ...
Hi,
The way the log instance inside the Logger trait is currently being initialized
doesn't seem to be thread safe [1]. The current implementation only guarantees
that initializeLogIfNecessary() is initialized in a lazy + thread-safe way.
Is there a reason why it can't be just: [2]
@transient private lazy val log_ :
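A minimal sketch of what that declaration could look like (assuming slf4j, which Spark's logging is built on; the surrounding trait is simplified here):

import org.slf4j.{Logger, LoggerFactory}

trait Logging {
  // A Scala lazy val is initialized under the instance's monitor on first
  // access, so the logger is created exactly once even with concurrent callers.
  @transient private lazy val log_ : Logger =
    LoggerFactory.getLogger(this.getClass.getName.stripSuffix("$"))

  protected def log: Logger = log_
}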
Hey there:
The frequent items function in the DataFrame stat package does not seem accurate.
The documentation does mention that it can return false positives, but the results
still seem incorrect.
Wondering if this is a known problem or not?
Here is a quick example showing the problem.
val sqlContext = new SQLContext(sc)
i
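The example above is cut off; a hypothetical sketch of the kind of call involved (made-up column names and data, assuming import sqlContext.implicits._) might be:

import org.apache.spark.sql.SQLContext

// freqItems uses a single-pass approximate algorithm, so the returned
// arrays can contain false positives, i.e. values that are not actually frequent.
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(1 to 1000).map(i => (i % 7, i % 13)).toDF("a", "b")
df.stat.freqItems(Seq("a", "b"), 0.3).show()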
It's not throwing away any information from the point of view of the SQL
optimizer. The schema preserves all the type information that Catalyst
uses. The type parameter T in Dataset[T] is only used at the API level to
ensure compile-time type checks of the user program.
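A small sketch of that distinction (the case class and values are made up; assumes Spark 1.6+ with import sqlContext.implicits._ in scope):

import org.apache.spark.sql.{DataFrame, Dataset}

case class Person(name: String, age: Int)

// The type parameter only matters to the compiler: field access on the
// typed view is checked at compile time against Person.
val ds: Dataset[Person] = Seq(Person("alice", 30)).toDS()
ds.map(_.age + 1)

// Dropping to the untyped view erases T to Row, but the schema the
// optimizer works with is unchanged.
val df: DataFrame = ds.toDF()
df.printSchema()   // still: name string, age int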
On Thu, Jun 16, 20
I'm clear on what a type alias is. My question is more that moving
from e.g. Dataset[T] to Dataset[Row] involves throwing away
information. Reading through code that uses the DataFrame alias, it's
a little hard for me to know when that's intentional or not.
On Thu, Jun 16, 2016 at 2:50 PM, Tath
There are different ways to view this. If it's confusing to think of the
Source API as returning DataFrames, it's equivalent to thinking that you are
returning a Dataset[Row], and DataFrame is just a shorthand.
And DataFrame/Dataset[Row] is to Dataset[String] what Java
Array[Object] is to Array[String
Yeah, WRT Options, maybe I'm thinking about it incorrectly or misrepresenting
it as relating to Encoders or to a pure Option encoder. The semantics I'm
thinking of are around the deserialization of a type T and lifting it into
Option[T] via the Option.apply function, which converts null to None. Tying
ba
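A plain-Scala sketch of the lifting described above (nothing Spark-specific):

// Option.apply turns a possibly-null reference into an Option:
// null becomes None, anything else becomes Some(value).
val a: Option[String] = Option(null: String)   // None
val b: Option[String] = Option("x")            // Some("x")

// The open question is whether an Encoder[Option[T]] should apply the
// same rule when a deserialized column value is null.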
Is this really an internal / external distinction?
For a concrete example, Source.getBatch seems to be a public
interface, but returns DataFrame.
On Thu, Jun 16, 2016 at 1:42 PM, Tathagata Das
wrote:
> DataFrame is a type alias of Dataset[Row], so externally it seems like
> Dataset is the main t
There is no public API for writing encoders at the moment, though we are
hoping to open this up in Spark 2.1.
What is not working about encoders for options? Which version of Spark are
you running? This is working as I would expect:
https://databricks-prod-cloudfront.cloud.databricks.com/public
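For reference, a sketch of the kind of round trip that does work with the built-in encoders (case class and values are made up; assumes import sqlContext.implicits._):

case class Rec(id: Int, label: Option[String])

// An Option field in a case class is encoded as a nullable column;
// nulls come back as None on deserialization.
val ds = Seq(Rec(1, Some("a")), Rec(2, None)).toDS()
ds.collect()   // Array(Rec(1,Some(a)), Rec(2,None))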
Hi,
I have a TPCDS query that fails in stage 80, which is a ResultStage
(SparkSQL).
Ideally I would like to ‘checkpoint’ a previous stage that executed
successfully and replay only the failed stage for debugging purposes.
Has anyone managed to do something similar who could share some hints?
Maybe s
DataFrame is a type alias of Dataset[Row], so externally it seems like
Dataset is the main type and DataFrame is a derivative type.
However, internally, since everything is processed as Rows, everything uses
DataFrames. The typed classes used in a Dataset are internally converted to rows
for processing.
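A small sketch of what the alias means for a caller (the rowCount helper is made up):

import org.apache.spark.sql.{DataFrame, Dataset, Row}

// DataFrame and Dataset[Row] are the same type, so a value of one can be
// passed wherever the other is expected, with no conversion involved.
def rowCount(df: DataFrame): Long = df.count()

val batch: Dataset[Row] = ???   // e.g. what Source.getBatch returns
rowCount(batch)                 // compiles: DataFrame is Dataset[Row]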
Sorry, meant DataFrame vs Dataset
On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger wrote:
> Is there a principled reason why sql.streaming.* and
> sql.execution.streaming.* are making extensive use of DataFrame
> instead of Datasource?
>
> Or is that just a holdover from code written before the m
Is there a principled reason why sql.streaming.* and
sql.execution.streaming.* are making extensive use of DataFrame
instead of Datasource?
Or is that just a holdover from code written before the move / type alias?
-
To unsubscri
You should be fine in 1.6 onward. Count distinct doesn't require data to
fit in memory there.
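A sketch of one way to express the filtered count distinct asked about in the quoted question below (the column names and filter value are guesses from that example, and df is assumed to already hold the data):

import org.apache.spark.sql.functions.countDistinct

// Distinct users among rows matching the filter; the aggregation does not
// require the distinct values to fit in memory on a single node.
val result = df
  .filter(df("restaurantType") === "Pizza")
  .agg(countDistinct(df("userId")))
result.show()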
On Thu, Jun 16, 2016 at 1:57 AM, Avshalom wrote:
> Hi all,
>
> We would like to perform a count distinct query based on a certain filter.
> e.g. our data is of the form:
>
> userId, Name, Restaurant na
Dear All,
Looking for guidance.
I am interested in contributing to Spark MLlib. Could you please take a few
minutes to guide me as to what you would consider an ideal path / the skills an
individual should possess?
I know R / Python / Java / C and C++
I have a firm understanding of algorithms
Are there any user or dev guides for writing Encoders? I'm trying to read
through the source code to figure out how to write a proper Option[T]
encoder, but it's not straightforward without deep spark-sql source
knowledge. Is it unexpected for users to need to write their own Encoders
with the avai
+1
Tom
On Wednesday, June 15, 2016 2:01 PM, Reynold Xin
wrote:
It's been a while and we have accumulated quite a few bug fixes in branch-1.6.
I'm thinking about cutting a 1.6.2 RC this week. Are there any patches somebody
wants to get in at the last minute?
On a related note, I'm thinking about cutting 2
That'd be awesome to have another 2.0 RC! I know many people who'd
consider it a call to action to play with 2.0.
+1000
Pozdrawiam,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskows
Hi all,
We would like to perform a count distinct query based on a certain filter.
e.g. our data is of the form:
userId, Name, Restaurant name, Restaurant Type
===
100,John, Pizza Hut,Pizza
100,John, Del Pepe, Pasta
100,John,