# SocialRedistrict.ing: Reproducing the Optimization & Mapping Pipeline

Readers and researchers can reproduce the full pipeline (dataset compilation, redistricting optimization, and web asset packaging) with a single command once the environment and input data described below are in place.

---

## 1. Prerequisites & Environment Setup

This project uses the high-performance Python package manager **`uv`** (by Astral), which provides reproducible virtual environments and lightning-fast dependency downloads.

### Installing `uv`
To install `uv` on macOS or Linux:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

### Python Dependencies
The pipeline requires Python 3.10+ and uses several geospatial and network analysis packages. 

Create a virtual environment and install the required dependencies:
```bash
# Create virtual environment
uv venv

# Install required packages
uv pip install geopandas pandas networkx gerrychain shapely numpy
```

#### Apple Silicon Metal GPU Acceleration (Optional)
If running on Apple Silicon (M1/M2/M3/M4/M5 Mac), you can optionally install Apple's machine learning library **`mlx`** to leverage hardware-accelerated, parallelized cut-flow calculations on the GPU using Metal:
```bash
uv pip install mlx
```
Our pipeline will automatically detect `mlx` and use GPU acceleration for simulated annealing step evaluations, reducing runtime by over 70%. If `mlx` is not installed, it falls back seamlessly to standard NumPy vectorization.
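
The detect-and-fall-back behavior described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `cut_flow` is a hypothetical helper that sums the weights of edges whose endpoints sit in different districts, and the array layout (one weight and two district labels per edge) is an assumption.

```python
import numpy as np

# Try the optional Apple `mlx` backend; fall back to NumPy if it is absent.
try:
    import mlx.core as mx
    HAS_MLX = True
except ImportError:
    mx = None
    HAS_MLX = False


def cut_flow(weights, src_labels, dst_labels):
    """Sum the weights of edges whose endpoints lie in different districts.

    Hypothetical helper: `weights[k]` is the weight of edge k, and
    `src_labels[k]` / `dst_labels[k]` are the district labels of its endpoints.
    """
    if HAS_MLX:
        w = mx.array(weights)
        crossing = mx.array(src_labels) != mx.array(dst_labels)
        return float(mx.sum(mx.where(crossing, w, 0.0)).item())
    w = np.asarray(weights, dtype=np.float64)
    crossing = np.asarray(src_labels) != np.asarray(dst_labels)
    return float(w[crossing].sum())
```

Because both branches expose the same function signature, the calling code does not need to know which backend is active.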

### Census API Key (Required for Data Download)
To download the Citizen Voting Age Population (CVAP) data directly from the U.S. Census Bureau API, the download script requires a Census API key (available for free from the [Census API Key Request Page](https://api.census.gov/data/key_signup.html)).

Before running the download script, you can set the key as an environment variable:
```bash
export CENSUS_API_KEY="your_census_api_key_here"
```
If the environment variable is not set, the script will prompt you to enter it in the terminal during execution.
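
A minimal sketch of this lookup-then-prompt behavior, assuming a hypothetical helper named `get_census_api_key` (the download script's real function names are not shown here):

```python
import os
from getpass import getpass


def get_census_api_key(env_var="CENSUS_API_KEY"):
    """Return the Census API key from the environment, or prompt for it."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # No environment variable set: ask interactively without echoing.
        key = getpass("Enter your Census API key: ").strip()
    if not key:
        raise ValueError("A Census API key is required to download CVAP data.")
    return key
```

Using `getpass` rather than `input` keeps the key out of the terminal scrollback; that is a design choice of this sketch, not a documented property of the script.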

---

## 2. Input Data & Integrity Verification

To keep the repository lightweight, raw datasets should be downloaded from their official sources rather than stored in the repository. To guarantee replication integrity, we provide the **BLAKE3** cryptographic hashes of the exact file versions used in our pipeline.

### Geographic Delineation & Vintage Verification
We have verified that the Meta Social Connectedness Index (SCI) datasets map onto the Census 2020 ZCTA boundary files. The pipeline matches ZCTA codes directly and achieves a match rate above 99.8% for populated ZCTAs, which indicates that the Facebook-derived social ties and the U.S. Census Bureau shapefiles and population tables use compatible geographic delineations.
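
One way to compute such a match rate is sketched below. It assumes the rate is defined as the share of populated Census ZCTAs that also appear in the SCI data; the function name `zcta_match_rate` and the input shapes are hypothetical.

```python
def zcta_match_rate(sci_zctas, census_population):
    """Return (match rate, unmatched codes) for populated Census ZCTAs.

    `sci_zctas` is an iterable of ZCTA codes found in the SCI data;
    `census_population` maps ZCTA code -> total population.
    Codes are normalized to zero-padded 5-character strings before comparison.
    """
    sci = {str(z).zfill(5) for z in sci_zctas}
    populated = {
        str(z).zfill(5) for z, pop in census_population.items() if pop > 0
    }
    if not populated:
        raise ValueError("No populated ZCTAs to compare against.")
    matched = populated & sci
    return len(matched) / len(populated), sorted(populated - sci)
```

Normalizing to zero-padded strings guards against leading zeros being lost when a CSV column is parsed as an integer.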

### Raw Data Sources & Retrieval
1. **Meta Social Connectedness Index (SCI):** Aggregated friendship flows between ZCTAs. Retrieve the national adjacency list and state-level matrices from [Meta Data for Good / Humanitarian Data Exchange (HDX)](https://data.humdata.org/).
2. **U.S. Census Bureau ACS 5-Year Estimates:** Extract total population and Citizen Voting Age Population (CVAP) at the ZCTA level from the [Census Data Portal](https://data.census.gov/).
3. **U.S. Census Cartographic Boundaries:** Download the 2020 5-digit ZCTA shapefiles (500k scale) from the [Census Cartographic Boundary Portal](https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html).
4. **ZCTA-to-District Relationship Files:** Download the Congressional District relationship files from the [Census Geographies Relationship Portal](https://www.census.gov/geographies/reference-files/time-series/geo/relationship-files.html).

### Data Layout & BLAKE3 Verification Hashes
Ensure the following files are placed in the `data_crunching/` directory before running the pipeline:

* **`zcta_metadata.json`**: Structural JSON table listing ZCTA centroids, state FIPS, and baseline assignments.
  * *BLAKE3:* `7945ade390d2c0c219e8dc8c6e32c73dae4f7a1bf183c636b01ff8eb89a2529f`
* **`zcta_population.csv`**: Total population counts per ZCTA.
  * *BLAKE3:* `340f1aacff97cc0247ee43ae40f23a100f39dfbaae9d809f7b8d07fcc4dd80ae`
* **`zcta_adjacency_sci.csv`**: Rook adjacency graph whose edges are annotated with Meta friendship SCI values.
  * *BLAKE3:* `1bcabef4107627b30a00d5f543fb01b4d4a9df8eb42cf33306d4c434b43bba44`
* **`zcta_cvap.csv`**: Citizen Voting Age Population counts.
  * *BLAKE3:* `9307c240291cf68e92f263a54c35821879ad81a8f30afacadf104296eb3b5c9c`
* **`tab20_cd11920_zcta520_natl.txt`**: Census relationship crosswalk file.
  * *BLAKE3:* `0b85adddf79a4021f85c2da7d006e193fa92e66ef3eebf28916aa0b955f71fb1`
* **`cb_2020_us_zcta520_500k/`**: Unzipped directory containing the official US Census ZCTA shapefiles.
* **`state_sci/state_[FIPS]_sci.csv`**: Complete state-level friendship flow matrices between all ZCTAs.
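
The hashes listed above could be checked with a short script like the one below. It is a sketch under two assumptions: the third-party `blake3` package (not part of the dependency list above) is installed, for example with `uv pip install blake3`, and the helper names `file_digest` and `verify` are hypothetical.

```python
from pathlib import Path

try:
    from blake3 import blake3  # assumption: third-party package is installed
except ImportError:
    blake3 = None

# Expected BLAKE3 digests, copied from the list above.
EXPECTED = {
    "zcta_metadata.json": "7945ade390d2c0c219e8dc8c6e32c73dae4f7a1bf183c636b01ff8eb89a2529f",
    "zcta_population.csv": "340f1aacff97cc0247ee43ae40f23a100f39dfbaae9d809f7b8d07fcc4dd80ae",
    "zcta_adjacency_sci.csv": "1bcabef4107627b30a00d5f543fb01b4d4a9df8eb42cf33306d4c434b43bba44",
    "zcta_cvap.csv": "9307c240291cf68e92f263a54c35821879ad81a8f30afacadf104296eb3b5c9c",
    "tab20_cd11920_zcta520_natl.txt": "0b85adddf79a4021f85c2da7d006e193fa92e66ef3eebf28916aa0b955f71fb1",
}


def file_digest(path, new_hasher, chunk_size=1 << 20):
    """Stream a file through a hashlib-style hasher and return its hex digest."""
    hasher = new_hasher()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify(data_dir="data_crunching", expected=EXPECTED, new_hasher=blake3):
    """Return the names of files that are missing or whose digest differs."""
    if new_hasher is None:
        raise RuntimeError("Install the `blake3` package to verify hashes.")
    mismatches = []
    for name, want in expected.items():
        path = Path(data_dir) / name
        if not path.is_file() or file_digest(path, new_hasher) != want:
            mismatches.append(name)
    return mismatches
```

Calling `verify()` from the repository root returns an empty list when every listed file matches. Reading in chunks keeps memory use flat even for large input files.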

---

## 3. Running the Pipeline

Run the entire pipeline (simulated annealing, SVG generation, state-level summaries, and the ZCTA assignment CSV) by passing the target population deviation tolerance as a float (for example, `0.08` for 8%) to the unified pipeline script:

```bash
# Run with the default 8% population deviation constraint
uv run python data_crunching/social_redistrict_pipeline.py 0.08

# Run with a stricter 5% population deviation constraint
uv run python data_crunching/social_redistrict_pipeline.py 0.05

# Run with a relaxed 10% population deviation constraint
uv run python data_crunching/social_redistrict_pipeline.py 0.10
```
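
To make the tolerance argument concrete, the sketch below shows one way a script might parse it and test a plan against it. Both functions are hypothetical, and the deviation definition (each district within `tolerance` of the mean district population) is an assumption rather than the pipeline's documented rule.

```python
import argparse


def parse_tolerance(argv=None):
    """Parse the population deviation tolerance from the command line."""
    parser = argparse.ArgumentParser(
        description="Run the redistricting pipeline (illustrative sketch)."
    )
    parser.add_argument(
        "tolerance",
        type=float,
        help="allowed population deviation as a fraction, e.g. 0.08 for 8%%",
    )
    args = parser.parse_args(argv)
    if not 0.0 < args.tolerance < 1.0:
        parser.error("tolerance must be a fraction between 0 and 1, e.g. 0.08")
    return args.tolerance


def within_tolerance(district_pops, tolerance):
    """Check that every district is within `tolerance` of the ideal population."""
    ideal = sum(district_pops) / len(district_pops)
    return all(abs(pop - ideal) <= tolerance * ideal for pop in district_pops)
```

For example, three districts of 100, 105, and 95 people have an ideal population of 100, so each deviates by at most 5% and the plan passes an 8% tolerance.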

### What the Pipeline Does
1. **Optimizes Districts**: Performs Simulated Annealing using MGGG GerryChain ReCom proposals. The objective function optimizes for the *raw number of social ties* rather than relative connectedness. We renormalize the edge weights by multiplying the relative SCI values by the populations of both connected ZCTAs:
   $$W(e_{ij}) = SCI_{ij} \times Pop_i \times Pop_j$$
   Same-state ZCTA pairs that are missing from the SCI data default to a weight of 0.0 (they are not imputed as 1.0). Spanning tree edge weights are biased as `1 / (1 + log(max(W, 1.0)))` so that highly connected friendship groups tend to stay in the same district, while the annealing temperature schedule keeps population deviations within the target constraint.
2. **Dissolves Shapefiles**: Converts the optimized ZCTA-level assignments into consolidated district geometries, running a hole-plugging GIS routine to fill in gaps and shorelines.
3. **Generates SVGs**: Compiles simplified vector paths of both the actual (baseline) districts and the optimized districts for every state, caching them into `website/state_svgs_[Dev].json`.
4. **Calculates Summaries**: Computes total within-state social connection flow, baseline cuts, optimal cuts, and social ties survival rates, outputting Markdown and CSV summary tables in the workspace.
5. **Exports Assignments CSV**: Compiles the final ZCTA-to-district assignments table for public download (`website/zcta_district_assignments.csv`).
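
The edge-weight formulas from step 1 and the cut totals from step 4 can be illustrated with a few lines of plain Python. This is a sketch, not the pipeline's code: the function names are hypothetical, `log` is assumed to be the natural logarithm, and the survival rate is assumed to mean the share of total tie weight that is not cut by district boundaries.

```python
import math


def edge_weight(sci, pop_i, pop_j):
    """W(e_ij) = SCI_ij * Pop_i * Pop_j; missing SCI values count as 0.0."""
    if sci is None:
        return 0.0
    return sci * pop_i * pop_j


def tree_weight(w):
    """Spanning tree bias 1 / (1 + log(max(W, 1.0))), natural log assumed."""
    return 1.0 / (1.0 + math.log(max(w, 1.0)))


def tie_summary(edges, assignment):
    """Return (total, cut, survival) for a list of (i, j, W) edges.

    `assignment` maps each ZCTA to its district. An edge is cut when its
    endpoints fall in different districts. Survival is 1 - cut / total
    (an assumed definition).
    """
    total = sum(w for _, _, w in edges)
    cut = sum(w for i, j, w in edges if assignment[i] != assignment[j])
    survival = 1.0 - cut / total if total else float("nan")
    return total, cut, survival
```

Note that `tree_weight` maps every `W` at or below 1.0 to exactly 1.0 and shrinks toward 0 as `W` grows, so strongly connected pairs receive the smallest spanning tree weights.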
