K-Nearest Neighbour Clustering Of Massacres For The Identification Of Australian Wars

(c) Bill Pascoe and Kaine Usher, 2025

This notebook applies the k-nearest neighbour clustering method to data from the Colonial Frontier Massacres in Australia, 1788-1930 project (Ryan et al, 2025) to help identify Australian Wars.

For important information on how to understand this notebook, see the Introduction (AWR_Introduction.html).

Parameter Selection

The most informative clusters of massacres emerge when k is set somewhere between 2 and 6. You can change the value of k in the cell below, e.g. k = 3. Then run the notebook again by clicking the button with two small triangles in the toolbar above.

In [1]:
# Enter file path of dataset:
file_path = "CMassacres_TLCM_20250314.csv"

# Enter number of nearest neighbours:
k = 2
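
To give a feel for how k shapes the result, the sketch below links each point to its k nearest neighbours and treats every connected group as one cluster. This is an illustrative assumption about k-nearest-neighbour grouping in general, not the geojikuu implementation; the function name knn_components and the toy points are hypothetical.

```python
import math

def knn_components(points, k):
    """Link each point to its k nearest neighbours; return one cluster label per point."""
    n = len(points)
    parent = list(range(n))  # union-find forest, one tree per group

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, p in enumerate(points):
        # Sort all other points by Euclidean distance to p.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            parent[find(i)] = find(j)  # merge the two groups

    roots = [find(i) for i in range(n)]
    relabel = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]

# Toy (x, y, t) points, assumed to be already scaled to comparable units.
toy = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (0.0, 0.1, 0.0),
       (5.0, 5.0, 5.0), (5.1, 5.0, 5.1), (5.0, 5.1, 5.0)]
labels_k1 = knn_components(toy, k=1)
labels_k3 = knn_components(toy, k=3)
print(labels_k1, labels_k3)
```

In this toy example k = 1 keeps the two groups apart, while k = 3 forces every point to reach into the other group and merges them. That is the kind of trade-off to keep in mind when choosing k.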

STKNN Clustering/Aggregation Code

The block below contains the code that clusters and aggregates the data with STKNN, using the k parameter you assigned. You do not need to change anything: simply run it as is.

In [2]:
import pandas as pd
df_initial = pd.read_csv(file_path)

df = df_initial.filter(["ghap_id", "title", "description", "latitude", "longitude", "datestart", "dateend", "linkback", "Victims", "VictimsDead", "Attackers", "AttackersDead", "MassacreGroup"], axis=1)
df["ghap_id"] = df["ghap_id"].astype(str)

from geojikuu.preprocessing.projection import MGA2020Projector
mga_2020_projector = MGA2020Projector("wgs84")
results = mga_2020_projector.project(list(zip(df["latitude"], df["longitude"])))
df["mga_2020"] = results["mga2020_coordinates"]
unit_conversion = results["unit_conversion"]

from geojikuu.preprocessing.conversion_tools import DateConvertor

date_convertor = DateConvertor(date_format_in="%Y-%m-%d", date_format_out="%Y-%m-%d")
df['date_converted'] = df['datestart'].apply(date_convertor.date_to_days)

from geojikuu.aggregation.point_aggregators import STKNearestNeighbours
st_knn = STKNearestNeighbours(data=df, coordinate_label="mga_2020", time_label="date_converted")
results = st_knn.aggregate(k=k, aggregate_type="mean")

results[["earliest_date", "latest_date"]] = results["temporal_extent"].str.replace('[()]', '', regex=True).str.split(',', expand=True).astype(int)
results["earliest_date"] = results['earliest_date'].apply(date_convertor.days_to_date)
results["latest_date"] = results['latest_date'].apply(date_convertor.days_to_date)
results["temporal_midpoint"] = results['date_converted'].apply(date_convertor.days_to_date)
Aggregated 438 points into 42 clusters.
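
The step that fills earliest_date and latest_date relies on temporal_extent being a string of the form "(a, b)". A minimal sketch of that parsing step on made-up values follows; the toy column contents and the names earliest_day and latest_day are assumptions, not real output.

```python
import pandas as pd

# Hypothetical extents as "(earliest_day, latest_day)" strings.
toy = pd.DataFrame({"temporal_extent": ["(10, 25)", "(3, 3)"]})

# Strip the brackets, split on the comma (and any following spaces),
# then cast both parts to integers.
parts = (
    toy["temporal_extent"]
    .str.replace("[()]", "", regex=True)
    .str.split(r",\s*", expand=True, regex=True)
    .astype(int)
)
toy[["earliest_day", "latest_day"]] = parts
print(toy)
```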
In [3]:
results["spatial_midpoint"] = mga_2020_projector.inverse_project(results["midpoint"])
results[["lat_mid", "lon_mid"]] = results["spatial_midpoint"].astype(str).str.replace('[()]', '', regex=True).str.split(',', expand=True).astype(float)
results["mbr"] = results['mbr'] * unit_conversion

results = results.drop(["latitude", "longitude", "date_converted", "midpoint", "temporal_extent"], axis=1)

Output

The results can be output to a file for download and further processing. The output files are written to the same directory as this notebook. The first few lines of the data are shown on screen.

stknn_clusters_<k>.csv output

In [4]:
results.to_csv('stknn_clusters_' + str(k) + '.csv')
results.head()
Out[4]:
ghap_id title description datestart dateend linkback Victims VictimsDead Attackers AttackersDead MassacreGroup scaled_st_coordinates count mbr earliest_date latest_date temporal_midpoint spatial_midpoint lat_mid lon_mid
0 td0c77, td0cb4, td0cd3, td0d01, td0e1c, td0c7a... Maiden Hills, Beveridge Island, Moira Swamp, J... In April 1839, Assistant Protector Charles Sie... 1839-02-01, 1848-06-01, 1843-12-15, 1846-01-01... 1839-02-28, 1848-06-30, 1843-12-15, 1846-12-31... https://c21ch.newcastle.edu.au/colonialmassacr... Aboriginal or Torres Strait Islander People, A... 28.937500 Colonists, Colonists, Colonists, Colonists, Ab... 0.062500 nan, nan, nan, nan, nan, nan, nan, nan, nan, n... (0.15940323731188752, 0.790640055117676, 0.331... 16 230.979682 1836-01-01 1854-01-01 1841-01-06 (-36.33312887014948, 144.9114467132039) -36.333129 144.911447
1 td0c7e, td0c88, td6286, td0c7c, td0c7d, td0e05... Fighting Waterholes, Connell's Ford, Wootong V... As reported in Clark (1995, 152), after the ma... 1840-04-01, 1840-11-01, 1840-01-01, 1840-01-01... 1840-04-01, 1840-11-30, 1840-12-31, 1840-02-28... https://c21ch.newcastle.edu.au/colonialmassacr... Aboriginal or Torres Strait Islander People, A... 19.000000 Colonists, Colonists, Colonists, Colonists, Co... 0.000000 nan, nan, nan, nan, nan, nan, nan (0.15277451224048197, 0.7447994396540668, 0.33... 7 66.220085 1833-03-01 1840-11-01 1839-01-12 (-37.673028801624035, 141.57996532663594) -37.673029 141.579965
2 td0c7f, td0c84, td0c93, td0c95, td0ca3, td0ca4... Mount Rouse, Victoria Valley, Tarrone Station,... On 19 May 1840, overseer Patrick Codd was kill... 1840-06-11, 1840-08-12, 1842-10-01, 1842-02-24... 1840-06-11, 1840-08-20, 1842-10-28, 1842-02-24... https://c21ch.newcastle.edu.au/colonialmassacr... Aboriginal or Torres Strait Islander People, A... 13.000000 Colonists, Colonists, Colonists, Colonists, Co... 0.064516 nan, nan, nan, nan, nan, nan, nan, nan, nan, n... (0.1421715655226905, 0.7616516835225238, 0.341... 31 316.679656 1836-05-27 1854-11-01 1842-09-27 (-36.57339880261106, 141.79090049863606) -36.573399 141.790900
3 td0c8b, td0cf6, td0e10, td628b, td628c, td0c85... Laverton (1), York (2), Mount Ida, De Grey Sta... Described in newspaper articles as 'tribal fig... 1908-11-07, 1837-06-01, 1908-12-01, 1864-01-01... 1908-11-08, 1837-11-16, 1908-12-31, 1864-08-31... https://c21ch.newcastle.edu.au/colonialmassacr... Aboriginal or Torres Strait Islander People, A... 17.956522 Aboriginal or Torres Strait Islander People, C... 0.000000 nan, nan, nan, nan, nan, nan, nan, nan, nan, n... (0.33183357086151666, 0.2484757288852052, 0.85... 23 1336.419626 1829-01-01 1910-09-11 1860-11-10 (-28.82204349417086, 117.91368986697903) -28.822043 117.913690
4 td0c8c, td0c86, td6291, td0c99, td0c9c, td0c9f... Butchers Creek, Gippsland, Boney Point, Gippsl... According to Gippsland historian Peter Gardner... 1841-01-01, 1840-10-01, 1843-07-01, 1842-12-01... 1841-12-31, 1840-10-31, 1843-07-31, 1842-12-31... https://c21ch.newcastle.edu.au/colonialmassacr... Aboriginal or Torres Strait Islander People, A... 26.600000 Colonists, Colonists, Colonists, Colonists, Co... 0.000000 nan, nan, 1843: Warrigal Creek, Gippsland, PPD... (0.1538484949792444, 0.8786890940032043, 0.345... 10 99.566145 1840-01-01 1843-07-15 1842-06-03 (-38.21941387333123, 147.164592936843) -38.219414 147.164593
In [5]:
import geopandas

def getConvexHull(id, polygononly):
    # Select the sites in df_initial whose assigned_cluster equals this cluster id
    cluster = df_initial[df_initial["assigned_cluster"] == id]

    # Temporarily use geopandas to build point geometries from the cluster's
    # coordinates so that the convex hull can be computed from them
    gdf = geopandas.GeoDataFrame(
        cluster, geometry=geopandas.points_from_xy(cluster.longitude, cluster.latitude), crs="EPSG:4326"
    )
    chull = gdf.geometry.union_all().convex_hull

    # Only return a hull for clusters of more than two sites, and only when polygons are requested
    if len(cluster.index) > 2 and polygononly:
        print("Cluster " + str(id) + " has " + str(len(cluster.index)) + " sites.")
        return chull
    else:
        return None
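
The check for more than two sites matters because a convex hull only becomes a polygon once there are at least three points that are not on one line. The sketch below shows this with shapely, which geopandas uses for its geometries. The coordinates are hypothetical (longitude, latitude) pairs, not taken from the dataset.

```python
from shapely.geometry import MultiPoint

# Hypothetical (longitude, latitude) pairs.
one_site = MultiPoint([(144.0, -37.0)])
two_sites = MultiPoint([(144.0, -37.0), (145.0, -36.0)])
three_sites = MultiPoint([(144.0, -37.0), (145.0, -36.0), (144.0, -36.0)])
collinear = MultiPoint([(144.0, -37.0), (145.0, -36.0), (146.0, -35.0)])

# Record which geometry type each convex hull turns out to be.
hull_types = {
    "one_site": one_site.convex_hull.geom_type,
    "two_sites": two_sites.convex_hull.geom_type,
    "three_sites": three_sites.convex_hull.geom_type,
    "collinear": collinear.convex_hull.geom_type,
}
print(hull_types)
```

Three or more sites that happen to lie exactly on one line also produce a line rather than a polygon. A line has no exterior ring, so a stricter check on the hull's geometry type may be worth considering if such a cluster ever appears.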

colfront_stknn_labelled_<k>.csv output

In [6]:
def find_index(id):
    for idx, ids in results['ghap_id'].items():
        id_list = ids.split(', ')
        if str(id) in id_list:
            return idx
    return None

df_initial['assigned_cluster'] = df_initial['ghap_id'].apply(find_index)


# preparing cluster summary and polygon
polygononly = True
clusterSummary = results.filter(["ghap_id", "title", "datestart", "dateend", "linkback", "Victims", "VictimsDead", "Attackers", "AttackersDead", "count", "mbr", "earliest_date", "latest_date", "temporal_midpoint", "spatial_midpoint", "lat_mid", "lon_mid"], axis=1)
clusterSummary['cluster_id'] = clusterSummary.index
clusterSummary['convex_hull'] = clusterSummary['cluster_id'].apply(getConvexHull, args = (polygononly,))

clusterSummary = clusterSummary[clusterSummary['convex_hull'].notnull()]

df_initial.to_csv('colfront_stknn_labelled_' + str(k) + '.csv')
df_initial.head()
Cluster 0 has 16 sites.
Cluster 1 has 7 sites.
Cluster 2 has 31 sites.
Cluster 3 has 23 sites.
Cluster 4 has 10 sites.
Cluster 5 has 24 sites.
Cluster 6 has 3 sites.
Cluster 7 has 17 sites.
Cluster 8 has 19 sites.
Cluster 9 has 20 sites.
Cluster 10 has 13 sites.
Cluster 11 has 22 sites.
Cluster 12 has 15 sites.
Cluster 13 has 3 sites.
Cluster 14 has 5 sites.
Cluster 15 has 9 sites.
Cluster 16 has 14 sites.
Cluster 17 has 6 sites.
Cluster 18 has 4 sites.
Cluster 19 has 17 sites.
Cluster 20 has 19 sites.
Cluster 21 has 3 sites.
Cluster 22 has 3 sites.
Cluster 23 has 10 sites.
Cluster 24 has 3 sites.
Cluster 25 has 5 sites.
Cluster 26 has 11 sites.
Cluster 27 has 3 sites.
Cluster 28 has 25 sites.
Cluster 29 has 8 sites.
Cluster 30 has 9 sites.
Cluster 31 has 6 sites.
Cluster 32 has 12 sites.
Cluster 33 has 3 sites.
Cluster 34 has 7 sites.
Cluster 35 has 3 sites.
Cluster 36 has 6 sites.
Cluster 37 has 3 sites.
Cluster 38 has 11 sites.
Cluster 39 has 4 sites.
Cluster 40 has 3 sites.
Cluster 41 has 3 sites.
Out[6]:
ghap_id layer_id title record_type description latitude longitude datestart dateend source ... Motive WeaponsUsed CorroborationRating RetaliationForDeaths VictimNotes AboriginalPlaceName AttackerNotes MassacreGroup AttackerNames assigned_cluster
0 td0c77 1336 Maiden Hills Site In April 1839, Assistant Protector Charles Sie... -37.446 143.735 1839-02-01 1839-02-28 Orton Journal, 1840-42, 12 Jan 1841, cited in ... ... Opportunity Firearm(s) *** NaN NaN NaN NaN NaN NaN 0
1 td0c7e 1336 Fighting Waterholes Site As reported in Clark (1995, 152), after the ma... -37.470 141.568 1840-04-01 1840-04-01 Clark 1995, pp 152-155; Palmer, 1973, p 72; Tr... ... Opportunity Firearm(s) ** NaN NaN NaN NaN NaN NaN 1
2 td0c7f 1336 Mount Rouse Site On 19 May 1840, overseer Patrick Codd was kill... -37.885 142.303 1840-06-11 1840-06-11 Critchett, 1990, p 160; HRA, I, xxi, p 242; Cl... ... Reprisal Firearm(s), Musket(s), Pistol(s) *** Patrick Codd NaN NaN NaN NaN NaN 2
3 td0c84 1336 Victoria Valley Site Following an earlier massacre in the Grampians... -37.558 142.284 1840-08-12 1840-08-20 Bride, 1899, p 163 <a href="https://ia601608.u... ... Reprisal Firearm(s) *** NaN NaN NaN NaN NaN NaN 2
4 td0c88 1336 Connell's Ford Site In November 1840, squatter Augustine Barton re... -37.608 141.423 1840-11-01 1840-11-30 G A Robinson Papers, Vol. 54 ML A7052; Trangma... ... Reprisal Poison, Arsenic *** NaN NaN NaN NaN NaN NaN 1

5 rows × 37 columns
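
The find_index function scans every cluster row for each site. On larger datasets the same labelling could be done with a single lookup table. The sketch below is one possible alternative on toy data; toy_results, toy_sites and the ids in them are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for `results` and `df_initial`; the ids are hypothetical.
toy_results = pd.DataFrame({"ghap_id": ["a1, a2, a3", "b1, b2"]})
toy_sites = pd.DataFrame({"ghap_id": ["b2", "a1", "a3", "zz"]})

# Split the joined id strings, giving one row per (cluster index, site id).
exploded = toy_results["ghap_id"].str.split(", ").explode()

# Invert into a site id -> cluster index lookup.
lookup = pd.Series(exploded.index, index=exploded.values)

# Sites that belong to no cluster end up as NaN.
toy_sites["assigned_cluster"] = toy_sites["ghap_id"].map(lookup)
print(toy_sites)
```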

Visualisation

In [7]:
import random
import folium

def flipLatLng(ll) :
    return (ll[1],ll[0])

map_center = [df_initial['latitude'].mean(), df_initial['longitude'].mean()]
mapc = folium.Map(location=map_center, zoom_start=4)

folium.TileLayer(
    tiles = 'https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{z}/{y}/{x}',
    attr = 'Esri',
    name = 'Esri Satellite',
    overlay = False,
    control = True
    ).add_to(mapc)

def random_color():
    return "#" + ''.join([random.choice('0123456789ABCDEF') for _ in range(6)])

cluster_colors = {cluster: random_color() for cluster in df_initial['assigned_cluster'].unique()}


# Add polygons
fillpolygon = False
if fillpolygon:
    popacity = 0.4
else:
    popacity = 0

for _, row in clusterSummary.iterrows():
    
    # geopandas, shapely etc. return coordinates as (lon, lat), the opposite order to what folium and Leaflet expect, so we have to flip them
    locpoly = list(map(flipLatLng, list(row["convex_hull"].exterior.coords)))
    
    folium.Polygon(
        locations=locpoly,
        color=cluster_colors[row['cluster_id']],
        weight=12,
        opacity=0.2,
        line_join='round',
        fill_color=cluster_colors[row['cluster_id']],
        fill_opacity=popacity,
        fill=True,
        popup=f"<b>Cluster:</b> {row['cluster_id']}<br><br>"
              f"<b>Count:</b> {row['count']}<br><br>"
              f"<b>MBR:</b> {row['mbr']}<br><br>"
              f"<b>Earliest massacre:</b> {row['earliest_date']}<br><br>"
              f"<b>Latest massacre:</b> {row['latest_date']}<br><br>"
              f"<b>Temporal Midpoint:</b> {row['temporal_midpoint']}<br><br>"
              f"<b>Spatial Midpoint:</b> {row['spatial_midpoint']}<br><br>",
        tooltip="Cluster details",
    ).add_to(mapc)

# add points
for _, row in df_initial.iterrows():
    folium.CircleMarker(
        location=(row['latitude'], row['longitude']),
        radius=5,
        color=cluster_colors[row['assigned_cluster']],
        fill=True,
        fill_color=cluster_colors[row['assigned_cluster']],
        fill_opacity=1,
        popup=f"<b>Site:</b> {row['title']}<br><br>"
                  f"<b>Lat:</b> {row['latitude']}<br><br>"
                  f"<b>Lon:</b> {row['longitude']}<br><br>"
                  f"<b>Date:</b> {row['datestart']}<br><br>"
                  f"<b>Victims Dead:</b> {row['VictimsDead']}<br><br>"
                  f"<b>Attackers Dead:</b> {row['AttackersDead']}<br><br>"
                  f"<b>Assigned Cluster:</b> {row['assigned_cluster']}<br>"
                  f"<b>Link:</b> <a href='{row['linkback']}' target='_blank'>{row['linkback']}</a><br>"
        ).add_to(mapc)
mapc
Out[7]:
[Interactive folium map: cluster convex hulls and massacre sites, coloured by assigned cluster]
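
The coordinate flip in the polygon loop is needed because shapely stores coordinates as (x, y), that is (longitude, latitude), while folium expects (latitude, longitude). A small dependency-free sketch of the same idea follows; the ring and the helper name flip_lon_lat are hypothetical.

```python
def flip_lon_lat(coord):
    """Turn an (x=longitude, y=latitude) pair into (latitude, longitude)."""
    return (coord[1], coord[0])

# Hypothetical closed ring in (longitude, latitude) order.
ring_lon_lat = [(144.0, -37.0), (145.0, -37.0), (145.0, -36.0), (144.0, -37.0)]

# Same ring in the (latitude, longitude) order that folium expects.
ring_lat_lon = [flip_lon_lat(c) for c in ring_lon_lat]
print(ring_lat_lon)
```

If the flip is skipped, Australian sites would be plotted with latitudes above 90 degrees, so a quick range check on the first value of each pair is an easy way to catch the mistake.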