Hi all,
We would like to perform a count distinct query based on a certain filter.
e.g. our data is of the form:
userId, Name, Restaurant name, Restaurant Type
===
100,John, Pizza Hut,Pizza
100,John, Del Pepe, Pasta
100,John,
That'd be awesome to have another 2.0 RC! I know many people who'd
consider it a call to action to play with 2.0.
+1000
Regards,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskows
+1
Tom
On Wednesday, June 15, 2016 2:01 PM, Reynold Xin wrote:
It's been a while and we have accumulated quite a few bug fixes in branch-1.6.
I'm thinking about cutting a 1.6.2 RC this week. Are there any patches somebody
wants to get in at the last minute?
On a related note, I'm thinking about cutting 2
Are there any user or dev guides for writing Encoders? I'm trying to read
through the source code to figure out how to write a proper Option[T]
encoder, but it's not straightforward without deep spark-sql source
knowledge. Is it unexpected for users to need to write their own Encoders
with the avai
Dear All,
Looking for guidance.
I am interested in contributing to Spark MLlib. Could you please take a few
minutes to guide me on what you would consider the ideal path / skills an
individual should possess.
I know R / Python / Java / C and C++
I have a firm understanding of algorithms
You should be fine in 1.6 onward. Count distinct doesn't require data to
fit in memory there.
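A minimal sketch of one way to express such a query with the DataFrame API, assuming an existing sqlContext; the column names and filter value are made up for illustration:

import org.apache.spark.sql.functions.countDistinct
import sqlContext.implicits._

// Toy data mirroring the schema from the original post.
val visits = Seq(
  (100, "John", "Pizza Hut", "Pizza"),
  (100, "John", "Del Pepe", "Pasta"),
  (200, "Mary", "Pizza Hut", "Pizza")
).toDF("userId", "name", "restaurant", "restaurantType")

// Count distinct users among the rows that match the filter.
visits
  .filter($"restaurantType" === "Pizza")
  .agg(countDistinct($"userId").as("distinctUsers"))
  .show()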
On Thu, Jun 16, 2016 at 1:57 AM, Avshalom wrote:
> Hi all,
>
> We would like to perform a count distinct query based on a certain filter.
> e.g. our data is of the form:
>
> userId, Name, Restaurant na
Is there a principled reason why sql.streaming.* and
sql.execution.streaming.* are making extensive use of DataFrame
instead of Datasource?
Or is that just a holdover from code written before the move / type alias?
Sorry, meant DataFrame vs Dataset
On Thu, Jun 16, 2016 at 12:53 PM, Cody Koeninger wrote:
> Is there a principled reason why sql.streaming.* and
> sql.execution.streaming.* are making extensive use of DataFrame
> instead of Datasource?
>
> Or is that just a holdover from code written before the m
DataFrame is a type alias of Dataset[Row], so externally it seems like
Dataset is the main type and DataFrame is a derivative type.
However, internally, since everything is processed as Rows, everything uses
DataFrames. The typed objects in a Dataset are internally converted to rows
for processing.
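For reference, a minimal sketch of that relationship, assuming a 2.0 SparkSession named spark; the case class is made up for illustration:

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

// The alias is declared (roughly) as: type DataFrame = Dataset[Row]

case class Restaurant(name: String, kind: String)

val ds: Dataset[Restaurant] = Seq(Restaurant("Del Pepe", "Pasta")).toDS()

// Going to a DataFrame drops the compile-time element type ...
val df: DataFrame = ds.toDF()

// ... and .as[T] reattaches it, provided the schema still matches.
val back: Dataset[Restaurant] = df.as[Restaurant]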
Hi,
I have a TPC-DS query that fails in stage 80, which is a ResultStage
(Spark SQL).
Ideally I would like to ‘checkpoint’ a previous stage which executed
successfully and replay the failed stage for debugging purposes.
Has anyone managed to do something similar and could share some hints?
Maybe s
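Not something I can confirm works for your exact case, but a rough sketch of one common workaround: materialize the intermediate result so the earlier, successful stages are not recomputed while you debug (expensiveIntermediateDf and the checkpoint path are hypothetical names):

// Materialize the output of the earlier stages once.
sc.setCheckpointDir("/tmp/spark-checkpoints")

val intermediate = expensiveIntermediateDf.persist()  // or .cache()
intermediate.count()                                  // force evaluation once

// Optionally write it to stable storage so it survives executor loss,
// then re-run only the failing logic against the materialized data.
val rdd = intermediate.rdd
rdd.checkpoint()
rdd.count()   // the action triggers the actual checkpoint write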
There is no public API for writing encoders at the moment, though we are
hoping to open this up in Spark 2.1.
What is not working about encoders for options? Which version of Spark are
you running? This is working as I would expect:
https://databricks-prod-cloudfront.cloud.databricks.com/public
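For reference, a minimal sketch of the case the built-in encoders already handle in 2.0, an Option-typed field inside a case class, assuming a SparkSession named spark (names are illustrative):

import spark.implicits._

case class User(name: String, age: Option[Int])

// The derived product encoder handles the Option-typed field:
// a null in the underlying row comes back as None.
val ds = Seq(User("ann", Some(34)), User("bob", None)).toDS()

ds.filter(_.age.isEmpty).show()   // bob, null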
Is this really an internal / external distinction?
For a concrete example, Source.getBatch seems to be a public
interface, but returns DataFrame.
On Thu, Jun 16, 2016 at 1:42 PM, Tathagata Das
wrote:
> DataFrame is a type alias of Dataset[Row], so externally it seems like
> Dataset is the main t
Yeah, WRT Options maybe I'm thinking about it incorrectly, or misrepresenting
it as relating to Encoders or to a pure Option encoder. The semantics I'm
thinking of are around deserializing a type T and lifting it into
Option[T] via the Option.apply function, which converts null to None. Tying
ba
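In plain Scala, the Option.apply behaviour being referred to:

// Option.apply lifts a possibly-null value into Option,
// converting null to None and anything else to Some(value).
val fromNull: Option[String]  = Option(null: String)   // None
val fromValue: Option[String] = Option("pizza")        // Some("pizza")

assert(fromNull.isEmpty && fromValue.contains("pizza"))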
There are different ways to view this. If it's confusing to think of the
Source API returning DataFrames, it's equivalent to thinking that you are
returning a Dataset[Row], and DataFrame is just a shorthand.
And DataFrame/Dataset[Row] is to Dataset[String] what Java's
Array[Object] is to Array[String
I'm clear on what a type alias is. My question is more that moving
from e.g. Dataset[T] to Dataset[Row] involves throwing away
information. Reading through code that uses the Dataframe alias, it's
a little hard for me to know when that's intentional or not.
On Thu, Jun 16, 2016 at 2:50 PM, Tath
It's not throwing away any information from the point of view of the SQL
optimizer. The schema preserves all the type information that Catalyst
uses. The type information T in Dataset[T] is only used at the API level to
ensure compile-time type checks of the user program.
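A small sketch of that split, assuming a SparkSession named spark and a made-up case class:

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

case class Visit(userId: Long, restaurant: String)

val ds: Dataset[Visit] = Seq(Visit(100L, "Del Pepe")).toDS()
val df: DataFrame = ds.toDF()

// Typed API: a misspelled field is a compile-time error.
ds.map(_.restaurant)       // compiles
// ds.map(_.restuarant)    // does not compile

// Untyped API: the same mistake only surfaces at analysis time.
df.select("restaurant")    // fine
// df.select("restuarant") // AnalysisException when the query is analyzed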
On Thu, Jun 16, 20
Hey there:
The frequent items function in the DataFrame stat package seems inaccurate.
The documentation does mention that it can return false positives, but the
result still seems incorrect.
Wondering if this is a known problem or not?
Here is a quick example showing the problem.
val sqlContext = new SQLContext(sc)
i
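The snippet above is cut off, so here is a minimal self-contained sketch of the kind of freqItems call in question (toy data, support value chosen for illustration):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// freqItems uses an approximate single-pass algorithm, so the result
// may contain false positives (items that are not actually frequent).
val df = sc.parallelize(Seq(1, 1, 1, 2, 2, 3, 4, 5)).toDF("value")
val frequent = df.stat.freqItems(Seq("value"), support = 0.4)

frequent.show()   // a single array column; may include extra, non-frequent items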
Hi,
The way the log instance inside the Logging trait is currently being initialized
doesn't seem to be thread safe [1]. The current implementation only guarantees
that initialization via initializeLogIfNecessary() happens in a lazy + thread-safe way.
Is there a reason why it can't be just: [2]
@transient private lazy val log_ :
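For context, a rough sketch of the pattern being proposed; a plain Scala lazy val is initialized under the instance's monitor, so first access is thread safe (trait and logger names are illustrative):

import org.slf4j.{Logger, LoggerFactory}

trait Logging {
  // A plain lazy val: Scala guards its initialization with the instance's
  // lock, so concurrent first accesses see one fully built Logger.
  @transient private lazy val log_ : Logger =
    LoggerFactory.getLogger(getClass.getName.stripSuffix("$"))

  protected def log: Logger = log_
}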
Please vote on releasing the following candidate as Apache Spark version
1.6.2!
The vote is open until Sunday, June 19, 2016 at 22:00 PDT and passes if a
majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.6.2
[ ] -1 Do not release this package because ...