This is an automated email from the ASF dual-hosted git repository.

jiayu pushed a commit to branch docs-mar-07
in repository https://gitbox.apache.org/repos/asf/sedona.git

commit 2b78648254c05925c6ac57a5a8f6aebe099131fa
Author: Jia Yu <[email protected]>
AuthorDate: Mon Mar 10 00:26:24 2025 -0700

    Fix typos
---
 docs/api/sql/Function.md                       |   2 +-
 docs/api/sql/Raster-map-algebra.md             |   4 +-
 docs/api/sql/Spider.md                         |   2 +-
 docs/setup/release-notes.md                    |   2 +-
 docs/tutorial/files/geoparquet-sedona-spark.md |  10 +-
 docs/tutorial/files/shapefiles-sedona-spark.md |   8 +-
 docs/tutorial/python-vector-osm.md             | 159 -------------------------
 docs/tutorial/raster.md                        |  83 +------------
 docs/tutorial/rdd.md                           |   4 +-
 docs/tutorial/snowflake/sql.md                 |   2 +-
 docs/tutorial/sql.md                           |   4 +-
 11 files changed, 20 insertions(+), 260 deletions(-)

diff --git a/docs/api/sql/Function.md b/docs/api/sql/Function.md
index 51f63f3a13..2dc28f21ce 100644
--- a/docs/api/sql/Function.md
+++ b/docs/api/sql/Function.md
@@ -460,7 +460,7 @@ POINT ZM(1 1 1 1)
 ## ST_AsGeoJSON
 
 !!!note
-    This method is not recommended. Please use [Sedona GeoJSON data source](../../tutorial/sql.md#save-as-geojson) to write GeoJSON files.
+    This method is not recommended. Please use [Sedona GeoJSON data source](../../tutorial/sql.md#save-geojson) to write GeoJSON files.
 
 Introduction: Return the [GeoJSON](https://geojson.org/) string representation of a geometry
 
diff --git a/docs/api/sql/Raster-map-algebra.md b/docs/api/sql/Raster-map-algebra.md
index 5b22280ee6..70b7e52eef 100644
--- a/docs/api/sql/Raster-map-algebra.md
+++ b/docs/api/sql/Raster-map-algebra.md
@@ -34,7 +34,7 @@ RS_MapAlgebra(rast: Raster, pixelType: String, script: String, [noDataValue: Dou
 
 * `rast`: The raster to apply the map algebra expression to.
 * `pixelType`: The data type of the output raster. This can be one of `D` (double), `F` (float), `I` (integer), `S` (short), `US` (unsigned short) or `B` (byte). If specified `NULL`, the output raster will have the same data type as the input raster.
-* `script`: The map algebra script. [Refer here for more details on the format.](#:~:text=The Jiffle script is,current output pixel value)
+* `script`: The map algebra script. [Refer here for more details on the format.](https://github.com/geosolutions-it/jai-ext/wiki/Jiffle)
 * `noDataValue`: (Optional) The nodata value of the output raster.
 
 As of version `v1.5.1`, the `RS_MapAlgebra` function allows two raster column inputs, with multi-band rasters supported. The function accepts 5 parameters:
@@ -46,7 +46,7 @@ RS_MapAlgebra(rast0: Raster, rast1: Raster, pixelType: String, script: String, n
 * `rast0`: The first raster to apply the map algebra expression to.
 * `rast1`: The second raster to apply the map algebra expression to.
 * `pixelType`: The data type of the output raster. This can be one of `D` (double), `F` (float), `I` (integer), `S` (short), `US` (unsigned short) or `B` (byte). If specified `NULL`, the output raster will have the same data type as the input raster.
-* `script`: The map algebra script. [Refer here for more details on the format.](#:~:text=The Jiffle script is,current output pixel value)
+* `script`: The map algebra script. [Refer here for more details on the format.](https://github.com/geosolutions-it/jai-ext/wiki/Jiffle)
 * `noDataValue`: (Not optional) The nodata value of the output raster, `null` is allowed.
 
 Spark SQL Example for two raster input `RS_MapAlgebra`:
diff --git a/docs/api/sql/Spider.md b/docs/api/sql/Spider.md
index 5dfa6569fe..207259d10e 100644
--- a/docs/api/sql/Spider.md
+++ b/docs/api/sql/Spider.md
@@ -21,7 +21,7 @@ Sedona offers a spatial data generator called Spider.
It is a data source that g ## Quick Start -Once you have your [`SedonaContext` object created](../Overview#quick-start), you can create a DataFrame with the `spider` data source. +Once you have your [`SedonaContext` object created](Overview.md#quick-start), you can create a DataFrame with the `spider` data source. ```python df_random_points = sedona.read.format("spider").load(n=1000, distribution="uniform") diff --git a/docs/setup/release-notes.md b/docs/setup/release-notes.md index 7a5666846a..36c620beac 100644 --- a/docs/setup/release-notes.md +++ b/docs/setup/release-notes.md @@ -1164,7 +1164,7 @@ Sedona 1.4.0 is compiled against, Spark 3.3 / Flink 1.12, Java 8. * [X] **Sedona Spark & Flink** Serialize and deserialize geometries 3 - 7X faster * [X] **Sedona Spark & Flink** Google S2 based spatial join for fast approximate point-in-polygon join. See [Join query in Spark](../api/sql/Optimizer.md#google-s2-based-approximate-equi-join) and [Join query in Flink](../tutorial/flink/sql.md#join-query) -* [X] **Sedona Spark** Pushdown spatial predicate on GeoParquet to reduce memory consumption by 10X: see [explanation](../api/sql/Optimizer.md#Push-spatial-predicates-to-GeoParquet) +* [X] **Sedona Spark** Pushdown spatial predicate on GeoParquet to reduce memory consumption by 10X: see [explanation](../api/sql/Optimizer.md#push-spatial-predicates-to-geoparquet) * [X] **Sedona Spark** Automatically use broadcast index spatial join for small datasets * [X] **Sedona Spark** New RasterUDT added to Sedona GeoTiff reader. * [X] **Sedona Spark** A number of bug fixes and improvement to the Sedona R module. diff --git a/docs/tutorial/files/geoparquet-sedona-spark.md b/docs/tutorial/files/geoparquet-sedona-spark.md index 28da219474..11d95d9c6d 100644 --- a/docs/tutorial/files/geoparquet-sedona-spark.md +++ b/docs/tutorial/files/geoparquet-sedona-spark.md @@ -76,7 +76,7 @@ df.show(truncate=False) Here are the results: ``` -+---+---------------------+ ++---+---------------------+ |id |geometry | +---+---------------------+ |a |LINESTRING (2 5, 6 1)| @@ -199,10 +199,10 @@ The value of `geoparquet.crs` and `geoparquet.crs.<column_name>` can be one of t * `""` (empty string): Omit the `crs` field. This implies that the CRS is [OGC:CRS84](https://www.opengis.net/def/crs/OGC/1.3/CRS84) for CRS-aware implementations. * `"{...}"` (PROJJSON string): The `crs` field will be set as the PROJJSON object representing the Coordinate Reference System (CRS) of the geometry. You can find the PROJJSON string of a specific CRS from here: https://epsg.io/ (click the JSON option at the bottom of the page). You can also customize your PROJJSON string as needed. -Please note that Sedona currently cannot set/get a projjson string to/from a CRS. Its geoparquet reader will ignore the projjson metadata and you will have to set your CRS via [`ST_SetSRID`](../api/sql/Function.md#st_setsrid) after reading the file. +Please note that Sedona currently cannot set/get a projjson string to/from a CRS. Its geoparquet reader will ignore the projjson metadata and you will have to set your CRS via [`ST_SetSRID`](../../api/sql/Function.md#st_setsrid) after reading the file. Its geoparquet writer will not leverage the SRID field of a geometry so you will have to always set the `geoparquet.crs` option manually when writing the file, if you want to write a meaningful CRS field. 
-Due to the same reason, Sedona geoparquet reader and writer do NOT check the axis order (lon/lat or lat/lon) and assume they are handled by the users themselves when writing / reading the files. You can always use [`ST_FlipCoordinates`](../api/sql/Function.md#st_flipcoordinates) to swap the axis order of your geometries. +Due to the same reason, Sedona geoparquet reader and writer do NOT check the axis order (lon/lat or lat/lon) and assume they are handled by the users themselves when writing / reading the files. You can always use [`ST_FlipCoordinates`](../../api/sql/Function.md#st_flipcoordinates) to swap the axis order of your geometries. ## Save GeoParquet with Covering Metadata @@ -231,7 +231,7 @@ df_bbox.write.format("geoparquet").option("geoparquet.covering.geometry", "bbox" ## Sort then Save GeoParquet -To maximize the performance of Sedona GeoParquet filter pushdown, we suggest that you sort the data by their geohash values (see [ST_GeoHash](../api/sql/Function.md#st_geohash)) and then save as a GeoParquet file. An example is as follows: +To maximize the performance of Sedona GeoParquet filter pushdown, we suggest that you sort the data by their geohash values (see [ST_GeoHash](../../api/sql/Function.md#st_geohash)) and then save as a GeoParquet file. An example is as follows: ``` SELECT col1, col2, geom, ST_GeoHash(geom, 5) as geohash @@ -253,7 +253,7 @@ Let’s look at an example of a dataset with points and three bounding boxes. Now, let’s apply a spatial filter to read points within a particular area: - + Here is the query: diff --git a/docs/tutorial/files/shapefiles-sedona-spark.md b/docs/tutorial/files/shapefiles-sedona-spark.md index 3b24349b68..a7df23c521 100644 --- a/docs/tutorial/files/shapefiles-sedona-spark.md +++ b/docs/tutorial/files/shapefiles-sedona-spark.md @@ -196,11 +196,11 @@ Due to these limitations, other options are worth investigating. There are a variety of other file formats that are good for geometric data: * Iceberg -* [GeoParquet](../geoparquet-sedona-spark) +* [GeoParquet](geoparquet-sedona-spark.md) * FlatGeoBuf -* [GeoPackage](../geopackage-sedona-spark) -* [GeoJSON](../geojson-sedona-spark) -* [CSV](../csv-geometry-sedona-spark) +* [GeoPackage](geopackage-sedona-spark.md) +* [GeoJSON](geojson-sedona-spark.md) +* [CSV](csv-geometry-sedona-spark.md) * GeoTIFF ## Why Sedona does not support Shapefile writes diff --git a/docs/tutorial/python-vector-osm.md b/docs/tutorial/python-vector-osm.md deleted file mode 100644 index 00f19f4322..0000000000 --- a/docs/tutorial/python-vector-osm.md +++ /dev/null @@ -1,159 +0,0 @@ -<!-- - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. 
- --> - -# Example of spark + sedona + hdfs with slave nodes and OSM vector data consults - -``` -from IPython.display import display, HTML -from pyspark.sql import SparkSession -from pyspark import StorageLevel -import pandas as pd -from pyspark.sql.types import StructType, StructField,StringType, LongType, IntegerType, DoubleType, ArrayType -from pyspark.sql.functions import regexp_replace -from sedona.register import SedonaRegistrator -from sedona.utils import SedonaKryoRegistrator, KryoSerializer -from pyspark.sql.functions import col, split, expr -from pyspark.sql.functions import udf, lit -from sedona.utils import SedonaKryoRegistrator, KryoSerializer -from pyspark.sql.functions import col, split, expr -from pyspark.sql.functions import udf, lit, flatten -from pywebhdfs.webhdfs import PyWebHdfsClient -from datetime import date -from pyspark.sql.functions import monotonically_increasing_id -import json -``` - -## Registering spark session, adding node executor configurations and sedona registrator - -``` -spark = SparkSession.\ - builder.\ - appName("Overpass-API").\ - enableHiveSupport().\ - master("local[*]").\ - master("spark://spark-master:7077").\ - config("spark.executor.memory", "15G").\ - config("spark.driver.maxResultSize", "135G").\ - config("spark.sql.shuffle.partitions", "500").\ - config(' spark.sql.adaptive.coalescePartitions.enabled', True).\ - config('spark.sql.adaptive.enabled', True).\ - config('spark.sql.adaptive.coalescePartitions.initialPartitionNum', 125).\ - config("spark.sql.execution.arrow.pyspark.enabled", True).\ - config("spark.sql.execution.arrow.fallback.enabled", True).\ - config('spark.kryoserializer.buffer.max', 2047).\ - config("spark.serializer", KryoSerializer.getName).\ - config("spark.kryo.registrator", SedonaKryoRegistrator.getName).\ - config("spark.jars.packages", "org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.4.0,org.datasyslab:geotools-wrapper:1.4.0-28.2") .\ - enableHiveSupport().\ - getOrCreate() - -SedonaRegistrator.registerAll(spark) -sc = spark.sparkContext -``` - -## Connecting to Overpass API to search and downloading data for saving into HDFS - -``` -import requests -import json - -overpass_url = "http://overpass-api.de/api/interpreter" -overpass_query = """ -[out:json]; -area[name = "Foz do Iguaçu"]; -way(area)["highway"~""]; -out geom; ->; -out skel qt; -""" - -response = requests.get(overpass_url, - params={'data': overpass_query}) -data = response.json() -hdfs = PyWebHdfsClient(host='179.106.229.159',port='50070', user_name='root') -file_name = "foz_roads_osm.json" -hdfs.delete_file_dir(file_name) -hdfs.create_file(file_name, json.dumps(data)) - -``` - -## Connecting spark sedona with saved hdfs file - -``` -path = "hdfs://776faf4d6a1e:8020/"+file_name -df = spark.read.json(path, multiLine = "true") -``` - -## Consulting and organizing data for analysis - -``` -from pyspark.sql.functions import explode, arrays_zip - -df.createOrReplaceTempView("df") -tb = spark.sql("select *, size(elements) total_nodes from df") -tb.show(5) - -isolate_total_nodes = tb.select("total_nodes").toPandas() -total_nodes = isolate_total_nodes["total_nodes"].iloc[0] -print(total_nodes) - -isolate_ids = tb.select("elements.id").toPandas() -ids = pd.DataFrame(isolate_ids["id"].iloc[0]).drop_duplicates() -print(ids[0].iloc[1]) - -formatted_df = tb\ -.withColumn("id", explode("elements.id")) - -formatted_df.show(5) - -formatted_df = tb\ -.withColumn("new", arrays_zip("elements.id", "elements.geometry", "elements.nodes", "elements.tags"))\ 
-.withColumn("new", explode("new")) - -formatted_df.show(5) - -# formatted_df.printSchema() - -formatted_df = formatted_df.select("new.0","new.1","new.2","new.3.maxspeed","new.3.incline","new.3.surface", "new.3.name", "total_nodes") -formatted_df = formatted_df.withColumnRenamed("0","id").withColumnRenamed("1","geom").withColumnRenamed("2","nodes").withColumnRenamed("3","tags") -formatted_df.createOrReplaceTempView("formatted_df") -formatted_df.show(5) -# TODO atualizar daqui para baixo para considerar a linha inteira na lógica -points_tb = spark.sql("select geom, id from formatted_df where geom IS NOT NULL") -points_tb = points_tb\ -.withColumn("new", arrays_zip("geom.lat", "geom.lon"))\ -.withColumn("new", explode("new")) - -points_tb = points_tb.select("new.0","new.1", "id") - -points_tb = points_tb.withColumnRenamed("0","lat").withColumnRenamed("1","lon") -points_tb.printSchema() - -points_tb.createOrReplaceTempView("points_tb") - -points_tb.show(5) - -coordinates_tb = spark.sql("select (select collect_list(CONCAT(p1.lat,',',p1.lon)) from points_tb p1 where p1.id = p2.id group by p1.id) as coordinates, p2.id, p2.maxspeed, p2.incline, p2.surface, p2.name, p2.nodes, p2.total_nodes from formatted_df p2") -coordinates_tb.createOrReplaceTempView("coordinates_tb") -coordinates_tb.show(5) - -roads_tb = spark.sql("SELECT ST_LineStringFromText(REPLACE(REPLACE(CAST(coordinates as string),'[',''),']',''), ',') as geom, id, maxspeed, incline, surface, name, nodes, total_nodes FROM coordinates_tb WHERE coordinates IS NOT NULL") -roads_tb.createOrReplaceTempView("roads_tb") -roads_tb.show(5) -``` diff --git a/docs/tutorial/raster.md b/docs/tutorial/raster.md index 641b441428..65541ced6c 100644 --- a/docs/tutorial/raster.md +++ b/docs/tutorial/raster.md @@ -21,7 +21,7 @@ Sedona uses 1-based indexing for all raster functions except [map algebra function](../api/sql/Raster-map-algebra.md), which uses 0-based indexing. !!!note - Since v`1.5.0`, Sedona assumes geographic coordinates to be in longitude/latitude order. If your data is lat/lon order, please use `ST_FlipCoordinates` to swap X and Y. + Sedona assumes geographic coordinates to be in longitude/latitude order. If your data is lat/lon order, please use `ST_FlipCoordinates` to swap X and Y. Starting from `v1.1.0`, Sedona SQL supports raster data sources and raster operators in DataFrame and SQL. Raster support is available in all Sedona language bindings including ==Scala, Java, Python, and R==. @@ -67,8 +67,6 @@ Detailed SedonaSQL APIs are available here: [SedonaSQL API](../api/sql/Overview. Use the following code to create your Sedona config at the beginning. If you already have a SparkSession (usually named `spark`) created by Wherobots/AWS EMR/Databricks, please skip this step and use `spark` directly. -==Sedona >= 1.4.1== - You can add additional Spark runtime config to the config builder. For example, `SedonaContext.builder().config("spark.sql.autoBroadcastJoinThreshold", "10485760")` === "Scala" @@ -114,65 +112,10 @@ You can add additional Spark runtime config to the config builder. For example, ``` Please replace the `3.3` in the package name of sedona-spark-shaded with the corresponding major.minor version of Spark, such as `sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}`. -==Sedona < 1.4.1== - -The following method has been deprecated since Sedona 1.4.1. Please use the method above to create your Sedona config. 
- -=== "Scala" - - ```scala - var sparkSession = SparkSession.builder() - .master("local[*]") // Delete this if run in cluster mode - .appName("readTestScala") // Change this to a proper name - // Enable Sedona custom Kryo serializer - .config("spark.serializer", classOf[KryoSerializer].getName) // org.apache.spark.serializer.KryoSerializer - .config("spark.kryo.registrator", classOf[SedonaKryoRegistrator].getName) - .getOrCreate() // org.apache.sedona.core.serde.SedonaKryoRegistrator - ``` - If you use SedonaViz together with SedonaSQL, please use the following two lines to enable Sedona Kryo serializer instead: - ```scala - .config("spark.serializer", classOf[KryoSerializer].getName) // org.apache.spark.serializer.KryoSerializer - .config("spark.kryo.registrator", classOf[SedonaVizKryoRegistrator].getName) // org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator - ``` - -=== "Java" - - ```java - SparkSession sparkSession = SparkSession.builder() - .master("local[*]") // Delete this if run in cluster mode - .appName("readTestScala") // Change this to a proper name - // Enable Sedona custom Kryo serializer - .config("spark.serializer", KryoSerializer.class.getName) // org.apache.spark.serializer.KryoSerializer - .config("spark.kryo.registrator", SedonaKryoRegistrator.class.getName) - .getOrCreate() // org.apache.sedona.core.serde.SedonaKryoRegistrator - ``` - If you use SedonaViz together with SedonaSQL, please use the following two lines to enable Sedona Kryo serializer instead: - ```scala - .config("spark.serializer", KryoSerializer.class.getName) // org.apache.spark.serializer.KryoSerializer - .config("spark.kryo.registrator", SedonaVizKryoRegistrator.class.getName) // org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator - ``` - -=== "Python" - - ```python - sparkSession = SparkSession. \ - builder. \ - appName('appName'). \ - config("spark.serializer", KryoSerializer.getName). \ - config("spark.kryo.registrator", SedonaKryoRegistrator.getName). \ - config('spark.jars.packages', - 'org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{ sedona.current_version }},' - 'org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}'). \ - getOrCreate() - ``` - Please replace the `3.3` in the package name of sedona-spark-shaded with the corresponding major.minor version of Spark, such as `sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}`. - ## Initiate SedonaContext Add the following line after creating the Sedona config. If you already have a SparkSession (usually named `spark`) created by Wherobots/AWS EMR/Databricks, please call `SedonaContext.create(spark)` instead. -==Sedona >= 1.4.1== - === "Scala" ```scala @@ -197,30 +140,6 @@ Add the following line after creating the Sedona config. If you already have a S sedona = SedonaContext.create(config) ``` -==Sedona < 1.4.1== - -The following method has been deprecated since Sedona 1.4.1. Please use the method above to create your SedonaContext. - -=== "Scala" - - ```scala - SedonaSQLRegistrator.registerAll(sparkSession) - ``` - -=== "Java" - - ```java - SedonaSQLRegistrator.registerAll(sparkSession) - ``` - -=== "Python" - - ```python - from sedona.register import SedonaRegistrator - - SedonaRegistrator.registerAll(spark) - ``` - You can also register everything by passing `--conf spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to `spark-submit` or `spark-shell`. 
## Load data from files diff --git a/docs/tutorial/rdd.md b/docs/tutorial/rdd.md index fd5a1fcfde..d3b27896f7 100644 --- a/docs/tutorial/rdd.md +++ b/docs/tutorial/rdd.md @@ -765,9 +765,9 @@ Distance join can only accept `COVERED_BY` and `INTERSECTS` as spatial predicate The details of spatial partitioning in join query is [here](#use-spatial-partitioning). -The details of using spatial indexes in join query is [here](#use-spatial-indexes-2). +The details of using spatial indexes in join query is [here](#use-spatial-indexes_2). -The output format of the distance join query is [here](#output-format-2). +The output format of the distance join query is [here](#output-format_2). !!!note Distance join query is equal to the following query in Spatial SQL: diff --git a/docs/tutorial/snowflake/sql.md b/docs/tutorial/snowflake/sql.md index 02ef7c7e01..ba42f23138 100644 --- a/docs/tutorial/snowflake/sql.md +++ b/docs/tutorial/snowflake/sql.md @@ -302,7 +302,7 @@ Please use the following steps: ### 1. Generate S2 ids for both tables -Use [ST_S2CellIds](../../api/snowflake/vector-data/Function.md#ST_S2CellIDs) to generate cell IDs. Each geometry may produce one or more IDs. +Use [ST_S2CellIds](../../api/snowflake/vector-data/Function.md#st_s2cellids) to generate cell IDs. Each geometry may produce one or more IDs. ```sql SELECT * FROM lefts, TABLE(FLATTEN(ST_S2CellIDs(lefts.geom, 15))) s1 diff --git a/docs/tutorial/sql.md b/docs/tutorial/sql.md index 2564e39c25..bd5327f675 100644 --- a/docs/tutorial/sql.md +++ b/docs/tutorial/sql.md @@ -299,7 +299,7 @@ Since v`1.7.0`, Sedona supports loading Shapefile as a DataFrame. The input path can be a directory containing one or multiple shapefiles, or path to a `.shp` file. -See [this page](files/shapefile-sedona-spark.md) for more information on loading Shapefiles. +See [this page](files/shapefiles-sedona-spark.md) for more information on loading Shapefiles. ## Load GeoParquet @@ -641,7 +641,7 @@ The output will look like this: +----------------+---+------+-------+ ``` -See [this page](../concepts/clustering-algorithms) for more information on the DBSCAN algorithm. +See [this page](concepts/clustering-algorithms.md) for more information on the DBSCAN algorithm. ## Calculate the Local Outlier Factor (LOF)
