K-Nearest Neighbour Clustering Of Massacres For The Identification Of Australian Wars
(c) Bill Pascoe and Kaine Usher, 2025
This notebook applies the k-nearest neighbour clustering method to data from the Colonial Frontier Massacres in Australia, 1788-1930 project (Ryan et al., 2025) to help identify Australian Wars.
For important information on how to understand this notebook, see the Introduction (AWR_Introduction.html).
Parameter Selection
The most informative clusters of massacres emerge when k is set somewhere between 2 and 6. You can change the value of k here, e.g. set k = 3, then run the notebook again by pressing the button with two small triangles above.
# Enter file path of dataset:
file_path = "CMassacres_TLCM_20250314.csv"
# Enter number of nearest neighbours:
k = 2
STKNN Clustering/Aggregation Code
The block below contains the code needed to cluster and aggregate the data with STKNN, using the k parameter you assigned. You do not need to change anything; simply run it as is.
import pandas as pd
df_initial = pd.read_csv(file_path)
df = df_initial.filter(["ghap_id", "title", "description", "latitude", "longitude", "datestart", "dateend", "linkback", "Victims", "VictimsDead", "Attackers", "AttackersDead", "MassacreGroup"], axis=1)
df["ghap_id"] = df["ghap_id"].astype(str)
from geojikuu.preprocessing.projection import MGA2020Projector
mga_2020_projector = MGA2020Projector("wgs84")
results = mga_2020_projector.project(list(zip(df["latitude"], df["longitude"])))
df["mga_2020"] = results["mga2020_coordinates"]
unit_conversion = results["unit_conversion"]
from geojikuu.preprocessing.conversion_tools import DateConvertor
date_convertor = DateConvertor(date_format_in="%Y-%m-%d", date_format_out="%Y-%m-%d")
df['date_converted'] = df['datestart'].apply(date_convertor.date_to_days)
from geojikuu.aggregation.point_aggregators import STKNearestNeighbours
st_knn = STKNearestNeighbours(data=df, coordinate_label="mga_2020", time_label="date_converted")
results = st_knn.aggregate(k=k, aggregate_type="mean")
results[["earliest_date", "latest_date"]] = results["temporal_extent"].str.replace('[()]', '', regex=True).str.split(',', expand=True).astype(int)
results["earliest_date"] = results['earliest_date'].apply(date_convertor.days_to_date)
results["latest_date"] = results['latest_date'].apply(date_convertor.days_to_date)
results["temporal_midpoint"] = results['date_converted'].apply(date_convertor.days_to_date)
Aggregated 438 points into 42 clusters.
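As a rough intuition for what the aggregation step does, the sketch below links each point to its k nearest neighbours in scaled space-time and treats every connected group as one cluster. It is a simplified illustration in plain Python and not GeoJikuu's implementation: the function name `st_knn_labels`, the min-max scaling, and the toy coordinates are all assumptions made for this example.

```python
import math

def st_knn_labels(points, k):
    """Label (x, y, t) points by linking each one to its k nearest neighbours.

    Illustrative only: this is not the GeoJikuu STKNearestNeighbours algorithm.
    """
    n, dims = len(points), len(points[0])
    # Min-max scale every dimension so space and time are comparable (an assumption).
    lows = [min(p[d] for p in points) for d in range(dims)]
    spans = [(max(p[d] for p in points) - lows[d]) or 1.0 for d in range(dims)]
    scaled = [[(p[d] - lows[d]) / spans[d] for d in range(dims)] for p in points]

    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        # Rank every other point by its scaled distance from point i.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(scaled[i], scaled[j]))
        for j in others[:k]:
            parent[find(i)] = find(j)  # merge the two groups

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]

# Hypothetical toy data: two tight groups that are far apart in space and time.
toy_points = [(0, 0, 0), (1, 0, 1), (0, 1, 2),
              (100, 100, 100), (101, 100, 101), (100, 101, 102)]
labels_k2 = st_knn_labels(toy_points, k=2)
labels_k3 = st_knn_labels(toy_points, k=3)
```

With these toy points, k = 2 keeps the two groups separate, while k = 3 forces each point to reach across to the other group and merges everything into one cluster. This mirrors why larger values of k give fewer, broader clusters.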
results["spatial_midpoint"] = mga_2020_projector.inverse_project(results["midpoint"])
results[["lat_mid", "lon_mid"]] = results["spatial_midpoint"].astype(str).str.replace('[()]', '', regex=True).str.split(',', expand=True).astype(float)
results["mbr"] = results['mbr'] * unit_conversion
results = results.drop(["latitude", "longitude", "date_converted", "midpoint", "temporal_extent"], axis=1)
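The date handling above converts each `datestart` to a day count before clustering and converts the cluster results back to dates afterwards. The standard-library sketch below shows the same kind of round trip. The helper names and the use of Python's date ordinal as the day count are assumptions; GeoJikuu's `DateConvertor` may count from a different reference date.

```python
from datetime import date, datetime

DATE_FORMAT = "%Y-%m-%d"  # the same format string the notebook passes to DateConvertor

def date_to_days(text):
    """Convert a date string to an integer day count (ordinal; an assumption)."""
    return datetime.strptime(text, DATE_FORMAT).date().toordinal()

def days_to_date(days):
    """Convert an integer day count back to a date string."""
    return date.fromordinal(int(days)).strftime(DATE_FORMAT)

# Start and end dates of the first site shown in the notebook's output.
start = date_to_days("1839-02-01")
end = date_to_days("1839-02-28")
round_trip = days_to_date(start)  # back to the original string
span_in_days = end - start        # day counts can be subtracted directly
```

Working in day counts is what lets time be treated as one more numeric coordinate alongside the projected x and y values.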
Output
The results are written to CSV files for download and further processing. The output files are saved in the same directory as this notebook, with the value of k appended to each file name. The first few rows of the data are shown on screen.
stknn_clusters_<k>.csv output
results.to_csv('stknn_clusters_' + str(k) + '.csv')
results.head()
index | ghap_id | title | description | datestart | dateend | linkback | Victims | VictimsDead | Attackers | AttackersDead | MassacreGroup | scaled_st_coordinates | count | mbr | earliest_date | latest_date | temporal_midpoint | spatial_midpoint | lat_mid | lon_mid |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | td0c77, td0cb4, td0cd3, td0d01, td0e1c, td0c7a... | Maiden Hills, Beveridge Island, Moira Swamp, J... | In April 1839, Assistant Protector Charles Sie... | 1839-02-01, 1848-06-01, 1843-12-15, 1846-01-01... | 1839-02-28, 1848-06-30, 1843-12-15, 1846-12-31... | https://c21ch.newcastle.edu.au/colonialmassacr... | Aboriginal or Torres Strait Islander People, A... | 28.937500 | Colonists, Colonists, Colonists, Colonists, Ab... | 0.062500 | nan, nan, nan, nan, nan, nan, nan, nan, nan, n... | (0.15940323731188752, 0.790640055117676, 0.331... | 16 | 230.979682 | 1836-01-01 | 1854-01-01 | 1841-01-06 | (-36.33312887014948, 144.9114467132039) | -36.333129 | 144.911447 |
1 | td0c7e, td0c88, td6286, td0c7c, td0c7d, td0e05... | Fighting Waterholes, Connell's Ford, Wootong V... | As reported in Clark (1995, 152), after the ma... | 1840-04-01, 1840-11-01, 1840-01-01, 1840-01-01... | 1840-04-01, 1840-11-30, 1840-12-31, 1840-02-28... | https://c21ch.newcastle.edu.au/colonialmassacr... | Aboriginal or Torres Strait Islander People, A... | 19.000000 | Colonists, Colonists, Colonists, Colonists, Co... | 0.000000 | nan, nan, nan, nan, nan, nan, nan | (0.15277451224048197, 0.7447994396540668, 0.33... | 7 | 66.220085 | 1833-03-01 | 1840-11-01 | 1839-01-12 | (-37.673028801624035, 141.57996532663594) | -37.673029 | 141.579965 |
2 | td0c7f, td0c84, td0c93, td0c95, td0ca3, td0ca4... | Mount Rouse, Victoria Valley, Tarrone Station,... | On 19 May 1840, overseer Patrick Codd was kill... | 1840-06-11, 1840-08-12, 1842-10-01, 1842-02-24... | 1840-06-11, 1840-08-20, 1842-10-28, 1842-02-24... | https://c21ch.newcastle.edu.au/colonialmassacr... | Aboriginal or Torres Strait Islander People, A... | 13.000000 | Colonists, Colonists, Colonists, Colonists, Co... | 0.064516 | nan, nan, nan, nan, nan, nan, nan, nan, nan, n... | (0.1421715655226905, 0.7616516835225238, 0.341... | 31 | 316.679656 | 1836-05-27 | 1854-11-01 | 1842-09-27 | (-36.57339880261106, 141.79090049863606) | -36.573399 | 141.790900 |
3 | td0c8b, td0cf6, td0e10, td628b, td628c, td0c85... | Laverton (1), York (2), Mount Ida, De Grey Sta... | Described in newspaper articles as 'tribal fig... | 1908-11-07, 1837-06-01, 1908-12-01, 1864-01-01... | 1908-11-08, 1837-11-16, 1908-12-31, 1864-08-31... | https://c21ch.newcastle.edu.au/colonialmassacr... | Aboriginal or Torres Strait Islander People, A... | 17.956522 | Aboriginal or Torres Strait Islander People, C... | 0.000000 | nan, nan, nan, nan, nan, nan, nan, nan, nan, n... | (0.33183357086151666, 0.2484757288852052, 0.85... | 23 | 1336.419626 | 1829-01-01 | 1910-09-11 | 1860-11-10 | (-28.82204349417086, 117.91368986697903) | -28.822043 | 117.913690 |
4 | td0c8c, td0c86, td6291, td0c99, td0c9c, td0c9f... | Butchers Creek, Gippsland, Boney Point, Gippsl... | According to Gippsland historian Peter Gardner... | 1841-01-01, 1840-10-01, 1843-07-01, 1842-12-01... | 1841-12-31, 1840-10-31, 1843-07-31, 1842-12-31... | https://c21ch.newcastle.edu.au/colonialmassacr... | Aboriginal or Torres Strait Islander People, A... | 26.600000 | Colonists, Colonists, Colonists, Colonists, Co... | 0.000000 | nan, nan, 1843: Warrigal Creek, Gippsland, PPD... | (0.1538484949792444, 0.8786890940032043, 0.345... | 10 | 99.566145 | 1840-01-01 | 1843-07-15 | 1842-06-03 | (-38.21941387333123, 147.164592936843) | -38.219414 | 147.164593 |
import geopandas
def getConvexHull(id, polygononly):
    ## query df_initial for assigned_cluster = id, and make into list, and make into convex hull and add to summary
    cluster = df_initial[df_initial["assigned_cluster"] == id]
    # temporarily use geopandas to create a 'geometry' from the coordinates in this cluster so we can call the convexhull method on it
    gdf = geopandas.GeoDataFrame(
        cluster, geometry=geopandas.points_from_xy(cluster.longitude, cluster.latitude), crs="EPSG:4326"
    )
    # print ("Convex Hull")
    chull = gdf.geometry.union_all().convex_hull
    #display(chull)
    if len(cluster.index) > 2 and polygononly:
        print("Cluster " + str(id) + " has " + str(len(cluster.index)) + " sites.")
        return chull
    else:
        return None
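To illustrate what a convex hull is, and why the function above only returns a polygon for clusters with more than two sites, here is a small pure-Python sketch using Andrew's monotone chain algorithm. It is for illustration only: the notebook relies on geopandas and shapely for the real computation, and the function name and sample coordinates below are hypothetical.

```python
def convex_hull(points):
    """Return hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts  # a single point or a line segment, not a polygon

    def cross(o, a, b):
        # Positive when the turn o -> a -> b is counter-clockwise.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]

# Hypothetical (lon, lat) pairs: four corners plus one interior point.
sites = [(141.0, -38.0), (143.0, -38.0), (143.0, -36.0), (141.0, -36.0), (142.0, -37.0)]
hull = convex_hull(sites)
two_site_hull = convex_hull(sites[:2])
```

The interior point is dropped and only the four corners remain, while two sites give just a line segment. Three sites that lie on a straight line also collapse to a segment, so a site count above two does not by itself guarantee a polygon.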
output
def find_index(id):
    for idx, ids in results['ghap_id'].items():
        id_list = ids.split(', ')
        if str(id) in id_list:
            return idx
    return None
df_initial['assigned_cluster'] = df_initial['ghap_id'].apply(find_index)
# preparing cluster summary and polygon
polygononly = True
clusterSummary = results.filter(["ghap_id", "title", "datestart", "dateend", "linkback", "Victims", "VictimsDead", "Attackers", "AttackersDead", "count", "mbr", "earliest_date", "latest_date", "temporal_midpoint", "spatial_midpoint", "lat_mid", "lon_mid"], axis=1)
clusterSummary['cluster_id'] = clusterSummary.index
clusterSummary['convex_hull'] = clusterSummary['cluster_id'].apply(getConvexHull, args = (polygononly,))
clusterSummary = clusterSummary[clusterSummary['convex_hull'].notnull()]
df_initial.to_csv('colfront_stknn_labelled_' + str(k) + '.csv')
df_initial.head()
Cluster 0 has 16 sites.
Cluster 1 has 7 sites.
Cluster 2 has 31 sites.
Cluster 3 has 23 sites.
Cluster 4 has 10 sites.
Cluster 5 has 24 sites.
Cluster 6 has 3 sites.
Cluster 7 has 17 sites.
Cluster 8 has 19 sites.
Cluster 9 has 20 sites.
Cluster 10 has 13 sites.
Cluster 11 has 22 sites.
Cluster 12 has 15 sites.
Cluster 13 has 3 sites.
Cluster 14 has 5 sites.
Cluster 15 has 9 sites.
Cluster 16 has 14 sites.
Cluster 17 has 6 sites.
Cluster 18 has 4 sites.
Cluster 19 has 17 sites.
Cluster 20 has 19 sites.
Cluster 21 has 3 sites.
Cluster 22 has 3 sites.
Cluster 23 has 10 sites.
Cluster 24 has 3 sites.
Cluster 25 has 5 sites.
Cluster 26 has 11 sites.
Cluster 27 has 3 sites.
Cluster 28 has 25 sites.
Cluster 29 has 8 sites.
Cluster 30 has 9 sites.
Cluster 31 has 6 sites.
Cluster 32 has 12 sites.
Cluster 33 has 3 sites.
Cluster 34 has 7 sites.
Cluster 35 has 3 sites.
Cluster 36 has 6 sites.
Cluster 37 has 3 sites.
Cluster 38 has 11 sites.
Cluster 39 has 4 sites.
Cluster 40 has 3 sites.
Cluster 41 has 3 sites.
index | ghap_id | layer_id | title | record_type | description | latitude | longitude | datestart | dateend | source | ... | Motive | WeaponsUsed | CorroborationRating | RetaliationForDeaths | VictimNotes | AboriginalPlaceName | AttackerNotes | MassacreGroup | AttackerNames | assigned_cluster |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | td0c77 | 1336 | Maiden Hills | Site | In April 1839, Assistant Protector Charles Sie... | -37.446 | 143.735 | 1839-02-01 | 1839-02-28 | Orton Journal, 1840-42, 12 Jan 1841, cited in ... | ... | Opportunity | Firearm(s) | *** | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
1 | td0c7e | 1336 | Fighting Waterholes | Site | As reported in Clark (1995, 152), after the ma... | -37.470 | 141.568 | 1840-04-01 | 1840-04-01 | Clark 1995, pp 152-155; Palmer, 1973, p 72; Tr... | ... | Opportunity | Firearm(s) | ** | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
2 | td0c7f | 1336 | Mount Rouse | Site | On 19 May 1840, overseer Patrick Codd was kill... | -37.885 | 142.303 | 1840-06-11 | 1840-06-11 | Critchett, 1990, p 160; HRA, I, xxi, p 242; Cl... | ... | Reprisal | Firearm(s), Musket(s), Pistol(s) | *** | Patrick Codd | NaN | NaN | NaN | NaN | NaN | 2 |
3 | td0c84 | 1336 | Victoria Valley | Site | Following an earlier massacre in the Grampians... | -37.558 | 142.284 | 1840-08-12 | 1840-08-20 | Bride, 1899, p 163 <a href="https://ia601608.u... | ... | Reprisal | Firearm(s) | *** | NaN | NaN | NaN | NaN | NaN | NaN | 2 |
4 | td0c88 | 1336 | Connell's Ford | Site | In November 1840, squatter Augustine Barton re... | -37.608 | 141.423 | 1840-11-01 | 1840-11-30 | G A Robinson Papers, Vol. 54 ML A7052; Trangma... | ... | Reprisal | Poison, Arsenic | *** | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
5 rows × 37 columns
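The `find_index` helper above rescans every cluster's `ghap_id` string for each site. On larger datasets the same mapping could be built once as a dictionary. The sketch below assumes, as in the results table, that each cluster stores its member ids as a single string separated by ", ". The function name and the sample ids are hypothetical.

```python
def build_cluster_lookup(cluster_ids):
    """Map each site id to the index of the cluster that lists it.

    cluster_ids: dict of {cluster_index: "id1, id2, ..."}.
    """
    lookup = {}
    for idx, ids in cluster_ids.items():
        for site_id in ids.split(", "):
            # Keep the first cluster found, matching find_index's behaviour.
            lookup.setdefault(site_id, idx)
    return lookup

# Hypothetical sample standing in for results['ghap_id'].
sample = {0: "s001, s002, s003", 1: "s004, s005"}
lookup = build_cluster_lookup(sample)
assigned = [lookup.get(site_id) for site_id in ["s002", "s005", "s999"]]
```

Each site is then resolved with a single dictionary lookup, and an id that appears in no cluster comes back as `None`, as it does with `find_index`.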
Visualisation
import random
import folium
def flipLatLng(ll):
    return (ll[1], ll[0])
map_center = [df_initial['latitude'].mean(), df_initial['longitude'].mean()]
mapc = folium.Map(location=map_center, zoom_start=4)
folium.TileLayer(
    tiles = 'https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{z}/{y}/{x}',
    attr = 'Esri',
    name = 'Esri Satellite',
    overlay = False,
    control = True
).add_to(mapc)
def random_color():
    return "#" + ''.join([random.choice('0123456789ABCDEF') for _ in range(6)])
cluster_colors = {cluster: random_color() for cluster in df_initial['assigned_cluster'].unique()}
# Add polygons
fillpolygon = False
if fillpolygon:
    popacity = 0.4
else:
    popacity = 0
for _, row in clusterSummary.iterrows():
    # geopandas, shapely etc. return coordinates as (lon, lat), the opposite order to what folium and Leaflet assume, so we have to flip the coordinates
    locpoly = list(map(flipLatLng, list(row["convex_hull"].exterior.coords)))
    folium.Polygon(
        locations=locpoly,
        color=cluster_colors[row['cluster_id']],
        weight=12,
        opacity=0.2,
        line_join='round',
        fill_color=cluster_colors[row['cluster_id']],
        fill_opacity=popacity,
        fill=True,
        popup=f"<b>Cluster:</b> {row['cluster_id']}<br><br>"
              f"<b>Count:</b> {row['count']}<br><br>"
              f"<b>MBR:</b> {row['mbr']}<br><br>"
              f"<b>Earliest massacre:</b> {row['earliest_date']}<br><br>"
              f"<b>Latest massacre:</b> {row['latest_date']}<br><br>"
              f"<b>Temporal Midpoint:</b> {row['temporal_midpoint']}<br><br>"
              f"<b>Spatial Midpoint:</b> {row['spatial_midpoint']}<br><br>",
        tooltip="Cluster details",
    ).add_to(mapc)
# add points
for _, row in df_initial.iterrows():
    folium.CircleMarker(
        location=(row['latitude'], row['longitude']),
        radius=5,
        color=cluster_colors[row['assigned_cluster']],
        fill=True,
        fill_color=cluster_colors[row['assigned_cluster']],
        # folium's keyword is fill_opacity (snake_case), not fillOpacity
        fill_opacity=1,
        popup=f"<b>Site:</b> {row['title']}<br><br>"
              f"<b>Lat:</b> {row['latitude']}<br><br>"
              f"<b>Lon:</b> {row['longitude']}<br><br>"
              f"<b>Date:</b> {row['datestart']}<br><br>"
              f"<b>Victims Dead:</b> {row['VictimsDead']}<br><br>"
              f"<b>Attackers Dead:</b> {row['AttackersDead']}<br><br>"
              f"<b>Assigned Cluster:</b> {row['assigned_cluster']}<br>"
              f"<b>Link:</b> <a href='{row['linkback']}' target='_blank'>{row['linkback']}</a><br>"
    ).add_to(mapc)
mapc
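Because `random_color` draws fresh colours on every run, a cluster's colour changes each time the notebook is executed. If stable colours are wanted, for example to compare maps produced with different values of k, one option is a seeded generator. This is a sketch rather than part of the original notebook; `seeded_colors` and the seed value are hypothetical.

```python
import random

def seeded_colors(cluster_ids, seed=0):
    """Assign each cluster a hex colour that is the same on every run."""
    rng = random.Random(seed)  # private generator, leaves the global random state alone
    return {
        cluster: "#" + "".join(rng.choice("0123456789ABCDEF") for _ in range(6))
        # Sort by string form so the assignment does not depend on input order.
        for cluster in sorted(cluster_ids, key=str)
    }

colors_a = seeded_colors([2, 0, 1])
colors_b = seeded_colors([0, 1, 2])
```

The result could be used in place of `cluster_colors`, for instance by passing `df_initial['assigned_cluster'].unique()` as `cluster_ids`.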