celestehorgan commented on code in PR #54428:
URL: https://github.com/apache/spark/pull/54428#discussion_r2848881324
##########
docs/quick-start.md:
##########
@@ -51,13 +51,19 @@ Or if PySpark is installed with pip in your current environment:
pyspark
-Spark's primary abstraction is a distributed collection of items called a
Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files)
or by transforming other Datasets. Due to Python's dynamic nature, we don't
need the Dataset to be strongly-typed in Python. As a result, all Datasets in
Python are Dataset[Row], and we call it `DataFrame` to be consistent with the
data frame concept in Pandas and R. Let's make a new DataFrame from the text of
the README file in the Spark source directory:
+Spark's primary abstraction is called a **Dataset**. A Dataset is a structured
collection of data. You can create Datasets from Hadoop InputFormats (such as HDFS
+files) or by transforming other Datasets.
+
Datasets behave somewhat differently across languages. Because Python is
dynamically typed, Datasets in Python don't need to be strongly typed; at the
implementation level they are all `Dataset[Row]`.
+This leads to another key Spark concept: a `DataFrame`, a Dataset organized
into named columns. If you know DataFrames from pandas or R, Spark's DataFrames
will feel familiar. In other languages, such as Java, the distinction between a
Dataset and a DataFrame matters more, but for now let's proceed
with Python.
+
+Let's make a new DataFrame from the `README.md` file in the Spark source
directory, using the PySpark shell:
{% highlight python %}
>>> textFile = spark.read.text("README.md")
{% endhighlight %}
-You can get values from DataFrame directly, by calling some actions, or
transform the DataFrame to get a new one. For more details, please read the
_[API doc](api/python/index.html#pyspark.sql.DataFrame)_.
+Once you've created a DataFrame, you can call actions on it to retrieve
values, or transform it to produce a new DataFrame. For more
details, see the [API doc](api/python/index.html#pyspark.sql.DataFrame).
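The "Dataset with named columns" idea described above can be illustrated with pandas itself, which the new text uses as a touchstone. A minimal sketch — the two sample lines below are invented stand-ins for the contents of `README.md`, not actual Spark output:

```python
import pandas as pd

# Hypothetical stand-in for two lines of README.md.
lines = [
    "# Apache Spark",
    "Spark is a unified analytics engine for large-scale data processing.",
]

# A DataFrame is a table with named columns; here, one column named "line".
df = pd.DataFrame({"line": lines})

print(len(df))           # number of rows: 2
print(list(df.columns))  # the named columns: ['line']
```

In Spark, `spark.read.text(...)` similarly yields a DataFrame with a single string column (named `value`), on which actions like `count()` and `first()` can then be called.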
Review Comment:
Is this not addressed by the next section in L#76?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]