Downloads

Bulk downloads of the SSPsyGene Knowledge Base data for offline analysis. Gene identifiers have been resolved to gene symbols (HGNC for human, MGI for mouse) where mappings exist (how does this work?).

Full database

All tables (TSV ZIP)
SQLite database
Ensembl ID ↔ symbol map (TSV)
Manifest (TSV)
README

The TSV ZIP contains one tab-separated file per dataset, plus per-table metadata YAMLs and a manifest. The SQLite database is the same file the website queries; it includes the central gene table and many-to-many link tables for advanced users.
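To get oriented in the SQLite file, it can help to list its tables and their row counts first. The sketch below uses only Python's standard sqlite3 module. The filename sspsygene.db and the table names in the stand-in database are placeholders chosen for illustration, not the real download name or schema.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all tables in an open SQLite connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

def row_count(conn, table):
    """Count rows in one table; the name is quoted because it comes from the schema."""
    quoted = '"' + table.replace('"', '""') + '"'
    return conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]

# Stand-in database so the sketch runs on its own; table names are made up.
# With the real download, connect to the file instead, e.g.
# sqlite3.connect("sspsygene.db")  (placeholder filename).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (id INTEGER PRIMARY KEY, symbol TEXT)")
conn.execute("CREATE TABLE demo_dataset (gene_id INTEGER, value REAL)")
conn.execute("INSERT INTO genes (symbol) VALUES ('MATR3'), ('CCN3')")

for table in list_tables(conn):
    print(table, row_count(conn, table))
```

Listing the schema first avoids guessing table names; once you know them, pandas.read_sql_query can pull any table into a data frame.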

Sample loading code

# In R
manifest <- read.delim("manifest.tsv", stringsAsFactors = FALSE)
tbl      <- read.delim("tables/SCZ_Risk_Arrayed_RNAseq_supp_1.tsv",
                       stringsAsFactors = FALSE)
head(tbl)

# In Python (pandas)
import pandas as pd
manifest = pd.read_csv("manifest.tsv", sep="\t")
tbl      = pd.read_csv("tables/SCZ_Risk_Arrayed_RNAseq_supp_1.tsv", sep="\t")
print(tbl.head())  # preview the first rows, like head(tbl) in R
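If you would rather not unpack the archive, the tables can also be read straight from the ZIP. This is a minimal sketch, assuming the archive keeps its per-dataset files under a tables/ folder as the paths above suggest. The function name load_tables_from_zip and the in-memory stand-in archive are illustrative, not part of the download.

```python
import io
import zipfile

import pandas as pd

def load_tables_from_zip(zip_source):
    """Read every .tsv under tables/ in the archive into a dict of DataFrames."""
    tables = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.startswith("tables/") and name.endswith(".tsv"):
                key = name[len("tables/"):-len(".tsv")]
                with archive.open(name) as handle:
                    tables[key] = pd.read_csv(handle, sep="\t")
    return tables

# Stand-in archive built in memory so the sketch runs without the real download.
# With the real file, pass its path instead of the buffer.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("manifest.tsv", "table\nexample_table\n")
    archive.writestr("tables/example_table.tsv",
                     "gene\tvalue\nMATR3\t1.5\nCCN3\t-0.2\n")
buffer.seek(0)

tables = load_tables_from_zip(buffer)
print({name: df.shape for name, df in tables.items()})
```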

Per-dataset downloads

Click Data (TSV) for the full table, Metadata (YAML) for column descriptions, citation, and source links, or — when present — Preprocessing (YAML) for the per-step record of how the raw data was cleaned before loading.

About preprocessing provenance

Each dataset's Preprocessing (YAML) file lists every action the data wrangler's preprocess.py script applied to the raw data — gene-symbol rescues, dropped rows, renamed columns, custom transforms — in the order they executed. Read it to audit how a published table was turned into the table you can search and download here.

Common fields you'll see:

step: clean_gene_column — gene-symbol resolution for one column. counts.passed_through = rows whose original symbol resolved directly; counts.rescued_excel = rows where Excel-mangled values like 9-Sep were repaired to SEPTIN9; counts.rescued_make_unique = R make.unique suffixes (MATR3.1 → MATR3) stripped; counts.rescued_manual_alias = wrangler-curated successor map hits (NOV → CCN3); counts.rescued_ensembl_map = ENSG/ENSMUSG IDs resolved to symbols; counts.unresolved = rows the cleaner could not resolve (kept as-is). The first ~10 unresolved values appear in sample_unresolved for inspection. See the gene-parser doc for what each rescue step does.
step: dropna / step: filter_rows — rows removed by a predicate. rows_before / rows_after / dropped tell you the exact counts.
step: rename / step: drop_columns / step: reorder — schema reshape.
step: transform_column — a one-off custom string fixup; the description field explains what was done.
step: read_csv / step: write_csv — bookends recording the source filename and the final column list.
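The fields above can be tallied programmatically. The sketch below assumes the Preprocessing YAML is a top-level list of step records, each with a step key; that layout, the source and column keys, and every value in the example are assumptions made up for illustration, so check a real file before relying on them. It uses PyYAML (a third-party package).

```python
import yaml  # PyYAML, a third-party package

# Made-up example in an ASSUMED layout: a top-level list of step records.
EXAMPLE_YAML = """
- step: read_csv
  source: example_raw.csv
- step: clean_gene_column
  column: gene
  counts:
    passed_through: 95
    rescued_excel: 2
    rescued_make_unique: 1
    unresolved: 2
  sample_unresolved: [FOO1, BAR2]
- step: dropna
  rows_before: 100
  rows_after: 98
  dropped: 2
"""

def summarize_steps(steps):
    """Return (gene-cleaning counts per step, total rows dropped by filters)."""
    gene_counts = []
    dropped_total = 0
    for record in steps:
        kind = record.get("step")
        if kind == "clean_gene_column":
            gene_counts.append(dict(record.get("counts", {})))
        elif kind in ("dropna", "filter_rows"):
            dropped_total += record.get("dropped", 0)
    return gene_counts, dropped_total

steps = yaml.safe_load(EXAMPLE_YAML)
gene_counts, dropped_total = summarize_steps(steps)
print(gene_counts, dropped_total)
```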

Each cleaned table also keeps two extra columns for row-level provenance: <gene_col>_raw (the original value before cleaning) and _<gene_col>_resolution (the per-row tag: passed_through, rescued_excel, unresolved, etc.). Cross-reference those with the YAML to investigate any specific row. Full walkthrough: how the gene parser works.
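One way to use the two provenance columns is to pull out every row the cleaner touched or could not resolve. This is a sketch on made-up rows, with gene standing in as a hypothetical gene column name; substitute the actual column name from the table you downloaded.

```python
import pandas as pd

def rows_changed_by_cleaning(table, gene_col):
    """Return rows whose resolution tag is anything other than passed_through."""
    raw_col = f"{gene_col}_raw"          # original value before cleaning
    tag_col = f"_{gene_col}_resolution"  # per-row resolution tag
    changed = table[table[tag_col] != "passed_through"]
    return changed[[raw_col, gene_col, tag_col]]

# Made-up rows; "gene" is a hypothetical gene column name.
table = pd.DataFrame({
    "gene": ["MATR3", "SEPTIN9", "CCN3", "FOO1"],
    "gene_raw": ["MATR3", "9-Sep", "NOV", "FOO1"],
    "_gene_resolution": ["passed_through", "rescued_excel",
                         "rescued_manual_alias", "unresolved"],
})

changed = rows_changed_by_cleaning(table, "gene")
print(changed)

# Per-tag totals, which can be compared against the counts block in the YAML.
print(table["_gene_resolution"].value_counts())
```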
