第16章：空间连接（Spatial Join）

空间连接是地理空间分析中最常用的操作之一。它根据两组地理要素之间的空间关系（如相交、包含、邻近等），将一个 GeoDataFrame 的属性信息关联到另一个 GeoDataFrame 上。本章将全面介绍 GeoPandas 中 sjoin() 和 sjoin_nearest() 的使用方法、参数配置与性能优化策略。

16.1 空间连接概述

16.1.1 什么是空间连接

空间连接（Spatial Join）是指根据两个数据集中要素之间的空间关系，将一方的属性附加到另一方的过程。与 pandas 中基于列值的 merge() 类似，空间连接基于 几何对象的空间关系 进行匹配。

例如：

将每个 POI（兴趣点）关联到它所在的行政区
将建筑物关联到其所在的洪水风险区域
将交通事故关联到最近的道路段

16.1.2 空间连接与属性连接的区别

特征	属性连接（merge）	空间连接（sjoin）
连接依据	列值匹配	空间关系匹配
输入数据	DataFrame / GeoDataFrame	两个 GeoDataFrame
匹配方式	精确值匹配或键值匹配	几何关系判定
典型场景	合并统计表	点落入多边形分析
性能影响	取决于数据量	取决于几何复杂度与空间索引

16.1.3 空间连接的工作流程

空间连接通常经过以下步骤：

空间索引过滤：利用 R-tree 快速筛选可能匹配的候选要素对
精确谓词判定：对候选对执行精确的空间关系判断（如 intersects、within）
属性合并：将匹配要素的属性按指定方式合并到结果中

import geopandas as gpd
from shapely.geometry import Point, Polygon

# 创建点数据 - POI
pois = gpd.GeoDataFrame({
    'poi_name': ['商场A', '学校B', '医院C', '公园D', '餐厅E'],
    'category': ['商业', '教育', '医疗', '休闲', '餐饮'],
    'geometry': [
        Point(116.40, 39.92), Point(116.42, 39.94),
        Point(116.38, 39.90), Point(116.44, 39.96),
        Point(116.41, 39.93)
    ]
}, crs="EPSG:4326")

# 创建面数据 - 行政区
districts = gpd.GeoDataFrame({
    'district': ['朝阳区', '海淀区'],
    'geometry': [
        Polygon([(116.39, 39.91), (116.45, 39.91),
                 (116.45, 39.97), (116.39, 39.97)]),
        Polygon([(116.36, 39.88), (116.41, 39.88),
                 (116.41, 39.93), (116.36, 39.93)])
    ]
}, crs="EPSG:4326")

print("POI 数据:")
print(pois)
print("\n行政区数据:")
print(districts)

16.2 sjoin() 基础用法

16.2.1 基本语法

sjoin() 是 GeoPandas 提供的空间连接函数，基本语法如下：

geopandas.sjoin(left_df, right_df, how='inner', predicate='intersects')

也可以使用 GeoDataFrame 的方法形式：

left_df.sjoin(right_df, how='inner', predicate='intersects')

16.2.2 最简单的空间连接

# 将 POI 与行政区进行空间连接
result = gpd.sjoin(pois, districts, how='inner', predicate='intersects')
print(result)

输出结果将包含每个 POI 的原始属性，加上其所在行政区的属性。

16.2.3 how 参数 - 连接方式

how 参数控制连接方式，与 SQL 的 JOIN 类型类似：

how 值	说明	保留的行
`'inner'`	内连接（默认）	只保留两边都有匹配的行
`'left'`	左连接	保留左表所有行，右表无匹配则为 NaN
`'right'`	右连接	保留右表所有行，左表无匹配则为 NaN

# 内连接 - 只保留有匹配的 POI
inner_result = gpd.sjoin(pois, districts, how='inner')
print(f"内连接结果数: {len(inner_result)}")

# 左连接 - 保留所有 POI，没有落在任何区的 POI 属性为 NaN
left_result = gpd.sjoin(pois, districts, how='left')
print(f"左连接结果数: {len(left_result)}")

# 右连接 - 保留所有行政区
right_result = gpd.sjoin(pois, districts, how='right')
print(f"右连接结果数: {len(right_result)}")

16.2.4 predicate 参数 - 空间谓词

predicate 参数指定空间关系的判定方式：

# 使用 intersects（默认）- 相交
result_intersects = gpd.sjoin(pois, districts, predicate='intersects')

# 使用 within - 在...之内
result_within = gpd.sjoin(pois, districts, predicate='within')

# 使用 contains - 包含
result_contains = gpd.sjoin(districts, pois, predicate='contains')

16.3 sjoin() 参数详解

16.3.1 lsuffix 和 rsuffix

当两个 GeoDataFrame 存在同名列时，lsuffix 和 rsuffix 用于区分来源：

# 两个数据集都有 'name' 列
gdf1 = gpd.GeoDataFrame({
    'name': ['点A', '点B'],
    'value': [10, 20],
    'geometry': [Point(0, 0), Point(1, 1)]
})

gdf2 = gpd.GeoDataFrame({
    'name': ['区域1'],
    'area_km2': [100],
    'geometry': [Polygon([(-.5, -.5), (1.5, -.5), (1.5, 1.5), (-.5, 1.5)])]
})

# 默认后缀为 'left' 和 'right'
result = gpd.sjoin(gdf1, gdf2, how='inner')
print(result.columns.tolist())
# 输出: ['name_left', 'value', 'geometry', 'index_right', 'name_right', 'area_km2']

# 自定义后缀
result = gpd.sjoin(gdf1, gdf2, how='inner', lsuffix='poi', rsuffix='zone')
print(result.columns.tolist())

16.3.2 结果的索引

空间连接的结果保留左表的索引，并额外添加一个 index_right 列（或 index_left 列），记录匹配到的右表（或左表）行的原始索引：

result = gpd.sjoin(pois, districts)
print(result.index)        # 保留 pois 的索引
print(result['index_right'])  # 记录匹配到的 districts 索引

16.3.3 一对多匹配

当一个要素与多个要素匹配时，结果中会出现重复的索引：

from shapely.geometry import Point, Polygon

# 一个点同时落在两个重叠的多边形中
points = gpd.GeoDataFrame({
    'name': ['中心点'],
    'geometry': [Point(0.5, 0.5)]
})

polygons = gpd.GeoDataFrame({
    'zone': ['区域A', '区域B'],
    'geometry': [
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(0.3, 0.3), (1.3, 0.3), (1.3, 1.3), (0.3, 1.3)])
    ]
})

result = gpd.sjoin(points, polygons)
print(f"输入点数: {len(points)}, 输出行数: {len(result)}")
print(result)
# 中心点出现两次，分别匹配区域A和区域B

16.4 常用空间谓词

16.4.1 谓词概览

谓词	说明	典型场景
`intersects`	几何对象有任何交集	通用连接（默认）
`within`	左表几何完全在右表几何内部	点在多边形内
`contains`	左表几何完全包含右表几何	多边形包含点
`crosses`	几何对象交叉（部分重叠）	道路穿越区域
`touches`	几何对象仅边界接触	相邻区域判定
`overlaps`	几何对象部分重叠	区域重叠分析
`covers`	左表几何覆盖右表几何	覆盖关系分析
`covered_by`	左表几何被右表几何覆盖	被覆盖分析

16.4.2 intersects - 相交

intersects 是最宽泛的谓词，只要两个几何对象有任何共享的空间部分（包括边界接触），就认为匹配：

import geopandas as gpd
from shapely.geometry import Point, LineString, Polygon

# 多边形
poly = gpd.GeoDataFrame({
    'name': ['区域'],
    'geometry': [Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])]
})

# 测试不同几何与多边形的 intersects 关系
tests = gpd.GeoDataFrame({
    'desc': ['内部点', '边界点', '外部点', '穿越线', '外部线'],
    'geometry': [
        Point(1, 1),           # 在内部
        Point(2, 2),           # 在边界上
        Point(3, 3),           # 在外部
        LineString([(1, -1), (1, 3)]),  # 穿越
        LineString([(3, 0), (4, 0)])    # 在外部
    ]
})

result = gpd.sjoin(tests, poly, predicate='intersects')
print("intersects 匹配的要素:")
print(result['desc'].tolist())
# 输出: ['内部点', '边界点', '穿越线']

16.4.3 within 与 contains

within 和 contains 是互逆关系：

# within: 左表在右表内
result_within = gpd.sjoin(pois, districts, predicate='within')
print("within 结果:", len(result_within))

# contains: 左表包含右表（注意左右表顺序颠倒）
result_contains = gpd.sjoin(districts, pois, predicate='contains')
print("contains 结果:", len(result_contains))

# 两者结果本质相同，只是视角不同

16.4.4 crosses 与 touches

from shapely.geometry import LineString, Polygon

# 道路数据
roads = gpd.GeoDataFrame({
    'road': ['主干道', '环路', '支路'],
    'geometry': [
        LineString([(0, 1), (3, 1)]),   # 穿越区域
        LineString([(0, 0), (2, 0)]),   # 接触边界
        LineString([(3, 0), (4, 1)])    # 完全在外部
    ]
})

# 区域数据
zones = gpd.GeoDataFrame({
    'zone': ['核心区'],
    'geometry': [Polygon([(0.5, 0), (2.5, 0), (2.5, 2), (0.5, 2)])]
})

# crosses - 穿越（几何有交集但不完全包含）
result_crosses = gpd.sjoin(roads, zones, predicate='crosses')
print("穿越核心区的道路:", result_crosses['road'].tolist())

# touches - 仅边界接触
result_touches = gpd.sjoin(roads, zones, predicate='touches')
print("与核心区边界接触的道路:", result_touches['road'].tolist())

16.4.5 选择合适的谓词

分析需求	推荐谓词	说明
点落入哪个区域	`within`	最精确，排除边界上的点
哪些要素与区域有关	`intersects`	最宽泛，包含所有相关要素
区域包含哪些要素	`contains`	注意左右表顺序
道路穿过哪些区域	`crosses`	仅匹配穿越的情况
寻找相邻区域	`touches`	仅边界相邻

16.5 sjoin_nearest() 最近邻连接

16.5.1 基本概念

sjoin_nearest() 将左表的每个要素与右表中 距离最近 的要素进行连接，即使它们没有空间交集。这在处理不完全重叠的数据时非常有用。

geopandas.sjoin_nearest(left_df, right_df, how='inner',
                        max_distance=None, distance_col=None)

16.5.2 基础用法

import geopandas as gpd
from shapely.geometry import Point

# 事故发生地点
accidents = gpd.GeoDataFrame({
    'accident_id': [1, 2, 3],
    'severity': ['严重', '轻微', '一般'],
    'geometry': [Point(116.40, 39.91), Point(116.43, 39.95), Point(116.37, 39.89)]
}, crs="EPSG:4326")

# 医院位置
hospitals = gpd.GeoDataFrame({
    'hospital': ['协和医院', '人民医院', '中日友好医院'],
    'beds': [2000, 1500, 1800],
    'geometry': [Point(116.41, 39.92), Point(116.38, 39.90), Point(116.44, 39.96)]
}, crs="EPSG:4326")

# 为每个事故找到最近的医院
result = gpd.sjoin_nearest(accidents, hospitals)
print(result[['accident_id', 'severity', 'hospital', 'beds']])

16.5.3 max_distance 参数

max_distance 限制最近邻搜索的最大距离。超过此距离的匹配将被排除：

# 投影到平面坐标系以使用米为单位
accidents_proj = accidents.to_crs(epsg=32650)
hospitals_proj = hospitals.to_crs(epsg=32650)

# 只查找 3000 米范围内的最近医院
result = gpd.sjoin_nearest(
    accidents_proj, hospitals_proj,
    max_distance=3000  # 单位: 米（取决于 CRS）
)
print(f"在 3000 米内找到匹配的事故数: {len(result)}")

注意：max_distance 的单位取决于 GeoDataFrame 的 CRS。使用地理坐标系（如 EPSG:4326）时单位为度，使用投影坐标系时单位为米。

16.5.4 distance_col 参数

distance_col 指定一个列名，用于在结果中存储实际距离值：

# 计算并保存距离
result = gpd.sjoin_nearest(
    accidents_proj, hospitals_proj,
    distance_col='dist_to_hospital'
)
print(result[['accident_id', 'hospital', 'dist_to_hospital']])

输出示例：

accident_id	hospital	dist_to_hospital
1	协和医院	1432.5
2	中日友好医院	1587.3
3	人民医院	1245.8

16.5.5 处理等距情况

当左表的一个要素与右表的多个要素距离相同时，所有等距的要素都会被返回：

# 一个点与两个点等距
center = gpd.GeoDataFrame({
    'name': ['中心'],
    'geometry': [Point(0, 0)]
})

targets = gpd.GeoDataFrame({
    'target': ['东', '西'],
    'geometry': [Point(1, 0), Point(-1, 0)]
})

result = gpd.sjoin_nearest(center, targets, distance_col='dist')
print(result)
# 可能返回两行，因为东和西距离相等

16.6 性能优化

16.6.1 空间索引的自动使用

sjoin() 在执行时会自动利用右表的空间索引进行加速：

import geopandas as gpd
from shapely.geometry import Point, Polygon
import numpy as np
import time

np.random.seed(42)

# 创建大规模数据
n_points = 100000
points = gpd.GeoDataFrame({
    'id': range(n_points),
    'geometry': [Point(x, y) for x, y in zip(
        np.random.uniform(0, 100, n_points),
        np.random.uniform(0, 100, n_points)
    )]
})

# 创建网格多边形
polygons = []
names = []
for i in range(10):
    for j in range(10):
        polygons.append(Polygon([
            (i*10, j*10), ((i+1)*10, j*10),
            ((i+1)*10, (j+1)*10), (i*10, (j+1)*10)
        ]))
        names.append(f"网格_{i}_{j}")

grid = gpd.GeoDataFrame({'name': names, 'geometry': polygons})

# 空间连接自动利用空间索引
start = time.time()
result = gpd.sjoin(points, grid, predicate='within')
elapsed = time.time() - start
print(f"空间连接 {n_points} 个点与 {len(grid)} 个网格: {elapsed:.3f} 秒")
print(f"匹配结果数: {len(result)}")

16.6.2 坐标参考系统的一致性

空间连接要求两个 GeoDataFrame 使用相同的 CRS，否则会引发警告或错误：

# 确保 CRS 一致
if pois.crs != districts.crs:
    districts = districts.to_crs(pois.crs)

result = gpd.sjoin(pois, districts)

16.6.3 数据预处理优化

# 1. 先用边界框粗筛，减少参与连接的数据量
bbox = districts.total_bounds  # [minx, miny, maxx, maxy]
pois_filtered = pois.cx[bbox[0]:bbox[2], bbox[1]:bbox[3]]

# 2. 投影到合适的平面坐标系
pois_proj = pois_filtered.to_crs(epsg=32650)
districts_proj = districts.to_crs(epsg=32650)

# 3. 执行空间连接
result = gpd.sjoin(pois_proj, districts_proj, predicate='within')

16.6.4 分块处理大数据集

import pandas as pd

def sjoin_chunked(left, right, chunk_size=10000, **kwargs):
    """分块执行空间连接，降低内存峰值"""
    results = []
    n_chunks = (len(left) + chunk_size - 1) // chunk_size

    for i in range(n_chunks):
        start_idx = i * chunk_size
        end_idx = min((i + 1) * chunk_size, len(left))
        chunk = left.iloc[start_idx:end_idx]
        chunk_result = gpd.sjoin(chunk, right, **kwargs)
        results.append(chunk_result)

    return pd.concat(results, ignore_index=False)

# 使用分块处理
result = sjoin_chunked(points, grid, chunk_size=20000, predicate='within')
print(f"分块连接结果数: {len(result)}")

16.6.5 性能对比

优化策略	适用场景	效果
自动空间索引	所有场景	默认启用，显著加速
CRS 一致性	多源数据	避免错误结果
边界框预筛选	数据分布不均匀	减少计算量
分块处理	内存不足	降低内存峰值
投影坐标系	需要距离计算	提高精度

16.7 实际应用案例

16.7.1 案例一：点在多边形分析 - 统计各区 POI 数量

import geopandas as gpd
from shapely.geometry import Point, Polygon
import numpy as np

np.random.seed(42)

# 模拟 POI 数据
n_pois = 5000
poi_data = gpd.GeoDataFrame({
    'poi_type': np.random.choice(['餐饮', '商业', '教育', '医疗'], n_pois),
    'geometry': [Point(x, y) for x, y in zip(
        np.random.uniform(116.2, 116.6, n_pois),
        np.random.uniform(39.8, 40.1, n_pois)
    )]
}, crs="EPSG:4326")

# 模拟行政区数据
districts_data = gpd.GeoDataFrame({
    'district': ['A区', 'B区', 'C区', 'D区'],
    'geometry': [
        Polygon([(116.2, 39.8), (116.4, 39.8), (116.4, 39.95), (116.2, 39.95)]),
        Polygon([(116.4, 39.8), (116.6, 39.8), (116.6, 39.95), (116.4, 39.95)]),
        Polygon([(116.2, 39.95), (116.4, 39.95), (116.4, 40.1), (116.2, 40.1)]),
        Polygon([(116.4, 39.95), (116.6, 39.95), (116.6, 40.1), (116.4, 40.1)])
    ]
}, crs="EPSG:4326")

# 空间连接
joined = gpd.sjoin(poi_data, districts_data, predicate='within')

# 按区和 POI 类型统计数量
poi_stats = joined.groupby(['district', 'poi_type']).size().unstack(fill_value=0)
print("各区 POI 数量统计:")
print(poi_stats)

# 计算各区 POI 总数
poi_total = joined.groupby('district').size()
print("\n各区 POI 总数:")
print(poi_total)

16.7.2 案例二：最近设施分析

import geopandas as gpd
from shapely.geometry import Point
import numpy as np

np.random.seed(42)

# 居民点
n_residents = 1000
residents = gpd.GeoDataFrame({
    'resident_id': range(n_residents),
    'geometry': [Point(x, y) for x, y in zip(
        np.random.uniform(0, 10000, n_residents),
        np.random.uniform(0, 10000, n_residents)
    )]
}, crs="EPSG:32650")

# 消防站
stations = gpd.GeoDataFrame({
    'station': ['消防站A', '消防站B', '消防站C', '消防站D', '消防站E'],
    'geometry': [
        Point(2000, 2000), Point(8000, 2000),
        Point(5000, 5000), Point(2000, 8000), Point(8000, 8000)
    ]
}, crs="EPSG:32650")

# 最近邻连接：找到每个居民点最近的消防站及距离
result = gpd.sjoin_nearest(
    residents, stations,
    distance_col='distance_m'
)

# 统计各消防站的服务人数
service_stats = result.groupby('station').agg(
    服务人数=('resident_id', 'count'),
    平均距离=('distance_m', 'mean'),
    最大距离=('distance_m', 'max')
).round(1)

print("消防站服务统计:")
print(service_stats)

# 标记超过 3000 米的居民点为"服务薄弱区"
result['risk'] = result['distance_m'] > 3000
print(f"\n距离最近消防站超过 3000 米的居民点: {result['risk'].sum()} 个")
print(f"占比: {result['risk'].mean():.1%}")

16.7.3 案例三：将属性从面传递到线

import geopandas as gpd
from shapely.geometry import LineString, Polygon

# 道路数据
roads = gpd.GeoDataFrame({
    'road_name': ['长安街', '三环路', '四环路段'],
    'length_km': [12.5, 48.3, 65.2],
    'geometry': [
        LineString([(116.30, 39.91), (116.50, 39.91)]),
        LineString([(116.32, 39.88), (116.48, 39.88), (116.48, 39.98)]),
        LineString([(116.28, 39.85), (116.52, 39.85)])
    ]
}, crs="EPSG:4326")

# 行政区
admin_zones = gpd.GeoDataFrame({
    'zone': ['城区', '郊区'],
    'zone_type': ['核心区', '发展区'],
    'geometry': [
        Polygon([(116.30, 39.87), (116.50, 39.87),
                 (116.50, 39.95), (116.30, 39.95)]),
        Polygon([(116.25, 39.83), (116.55, 39.83),
                 (116.55, 39.87), (116.25, 39.87)])
    ]
}, crs="EPSG:4326")

# 空间连接 - 找出道路穿过的区域
road_zones = gpd.sjoin(roads, admin_zones, predicate='intersects')
print("道路与区域的关联:")
print(road_zones[['road_name', 'zone', 'zone_type']])

16.8 本章小结

本章全面介绍了 GeoPandas 中的空间连接操作。主要内容回顾：

核心函数

函数	说明
`gpd.sjoin()`	基于空间谓词的空间连接
`gpd.sjoin_nearest()`	基于最近距离的空间连接

关键参数

参数	说明	默认值
`how`	连接方式（inner/left/right）	`'inner'`
`predicate`	空间谓词（intersects/within/contains 等）	`'intersects'`
`lsuffix`/`rsuffix`	同名列的后缀	`'left'`/`'right'`
`max_distance`	最近邻搜索最大距离	`None`
`distance_col`	存储距离值的列名	`None`

使用建议

选择合适的谓词：根据分析需求选择最精确的空间谓词，避免不必要的匹配
确保 CRS 一致：连接前务必检查并统一坐标参考系统
使用投影坐标系：涉及距离计算时，先投影到合适的平面坐标系
注意一对多：一个要素可能匹配多个要素，导致结果行数大于输入行数
利用空间索引：sjoin() 自动使用空间索引，无需手动创建

在下一章中，我们将学习几何叠加分析（Overlay），它是空间连接的延伸，不仅关联属性，还对几何对象本身进行交集、并集等操作。