celestehorgan commented on code in PR #54428:
URL: https://github.com/apache/spark/pull/54428#discussion_r2848881324
##########
docs/quick-start.md:
##########
@@ -51,13 +51,19 @@ Or if PySpark is installed with pip in your current environment:
pyspark
-Spark's primary abstraction is a distributed collection of items called a
Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files)
or by transforming other Datasets. Due to Python's dynamic nature, we don't
need the Dataset to be strongly-typed in Python. As a result, all Datasets in
Python are Dataset[Row], and we call it `DataFrame` to be consistent with the
data frame concept in Pandas and R. Let's make a new DataFrame from the text of
the README file in the Spark source directory:
+Spark's primary abstraction is called a **Dataset**. A Dataset is a structured
collection of data. You can create Datasets from Hadoop InputFormats (such as HDFS
+files) or by transforming other Datasets.
+
Datasets behave somewhat differently across languages. Because Python is
dynamically typed, Datasets in Python don't need to be strongly typed; at the
implementation level they are all `Dataset[Row]`.
+This leads to another key Spark concept: a `DataFrame`, a Dataset organized
into named columns. If you know DataFrames from pandas or R, Spark's DataFrames
will feel familiar. In other languages, such as Java, the distinction between a
Dataset and a DataFrame matters more, but for now let's proceed
with Python.
+
+Let's make a new DataFrame from the `README.md` file in the Spark source
directory, using the PySpark shell:
{% highlight python %}
>>> textFile = spark.read.text("README.md")
{% endhighlight %}
-You can get values from DataFrame directly, by calling some actions, or
transform the DataFrame to get a new one. For more details, please read the
_[API doc](api/python/index.html#pyspark.sql.DataFrame)_.
+Once you've created a DataFrame, you can call actions on it to retrieve
values, or transform it to produce a new DataFrame. For more
details, see the [API doc](api/python/index.html#pyspark.sql.DataFrame).
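The "Dataset with named columns" idea described above can be illustrated with pandas itself, which the new text uses as a touchstone. A minimal sketch — the two sample lines below are invented stand-ins for the contents of `README.md`, not actual Spark output:

```python
import pandas as pd

# Hypothetical stand-in for two lines of README.md.
lines = [
    "# Apache Spark",
    "Spark is a unified analytics engine for large-scale data processing.",
]

# A DataFrame is a table with named columns; here, one column named "line".
df = pd.DataFrame({"line": lines})

print(len(df))           # number of rows: 2
print(list(df.columns))  # the named columns: ['line']
```

In Spark, `spark.read.text(...)` similarly yields a DataFrame with a single string column (named `value`), on which actions like `count()` and `first()` can then be called.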
Review Comment:
Is this not addressed by the next section in L#76?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]