How the Gene-Symbol Parser Works
What the SSPsyGene loader does to the gene-name column of every published table — and what the _raw and _resolution columns next to it mean.
Why this is hard
Most published gene tables contain at least one column of “gene names” that nobody fully cleaned. By the time a spreadsheet has travelled from the lab through Excel, R, a Python notebook, a supplementary PDF, and our import script, the same gene can show up as BRCA1, brca1, HGNC:1100, ENSG00000012048.18, BRCA1.1, or even — for SEPTIN9 — the literal string 9-Sep.
A short, incomplete list of the things that go wrong:
- Excel date coercion. A symbol like
SEPT9opens in Excel as the date “September 9” and gets re-saved as9-Sepor2023-09-09. Affects roughly one in five published gene-list papers despite HGNC's SEPTIN renames. - HGNC alias churn. Symbols are renamed (e.g.
NOV → CCN3,QARS → QARS1), and old papers keep using the retired form. - R's
make.unique. When a data frame has duplicate row names, R appends.1,.2, etc., turningMATR3intoMATR3.1. - Mixed Ensembl IDs. Some tables interleave symbols with raw
ENSG…/ENSMUSG…identifiers, sometimes with version suffixes (ENSG00000012048.18) and sometimes without. - Legacy GENCODE clone names. Older annotations (GENCODE pre-v22) used BAC/PAC/cosmid clone labels like
RP11-783K16.5. Many of these now have HGNC symbols; some only have a current Ensembl gene ID; some only a GenBank accession. - Non-gene rows mixed in. RNA-family labels (
Y_RNA,U6,SNORA74,MIR5096), assembly contig accessions (AC012345.6), and bare GenBank accessions (KC877982) all show up where a gene symbol is expected.
We're not aware of an off-the-shelf tool that handles every one of these in a single pass. The SSPsyGene loader does, and keeps a per-row audit trail of which rule fired so you can verify any rescue after the fact. The rest of this page walks through what each rule does.
What we run, in order
For every value in a gene column, the parser tries the following rules in order. The first rule that produces a current approved symbol wins; the parser then tags the row with the rule's name and moves on. The tag names match the values you'll see in the _resolution column described in the next section.
1. Direct lookup — passed_through
The most common case. The raw value is already a current HGNC symbol (human) or MGI symbol (mouse), or an unambiguous alias / previous symbol. We map it to the current canonical form and move on. Empty / NaN cells are also tagged passed_through.
2. HGNC ID rescue — rescued_hgnc_id
If the value is a literal HGNC identifier like HGNC:1100, we look it up in HGNC and substitute the current approved symbol.
3. Excel demangle — rescued_excel
Repairs the two date-coercion forms Excel produces — classic (9-Sep, 1-Mar) and ISO (2023-09-04) — back to the intended symbol (SEPTIN9, MARCHF1, …). Each candidate is verified against the live HGNC/MGI table before substitution, so we never invent a symbol the org doesn't recognise.
4. make.unique suffix strip — rescued_make_unique
Strips the trailing .N R make.unique() appends to disambiguate duplicate row names (e.g. MATR3.1 → MATR3). Only fires when the unsuffixed form resolves and the original does not, so we don't accidentally clobber GENCODE clone names like RP11-783K16.5 that legitimately end in .5.
5. Symbol/ENSG split — rescued_symbol_ensg
Some tools concatenate SYMBOL_ENSG… into a single column value. We split on _ENSG and resolve the symbol portion.
6. Manual aliases — rescued_manual_alias
A small wrangler-curated successor map for retired symbols whose current name HGNC's alias table doesn't resolve automatically. Cross-dataset entries today: NOV → CCN3, MUM1 → PWWP3A, QARS → QARS1, SARS → SARS1, TAZ → TAFAZZIN. The target is verified through HGNC before substitution, so a typo in the successor name fails loudly rather than silently corrupting the data.
7. Ensembl gene ID map — rescued_ensembl_map
Resolves bare ENSG… / ENSMUSG… identifiers to their current symbol via HGNC's hgnc_complete_set.txt (human) and Alliance HGNC_AllianceHomology.rpt (mouse). Versioned IDs like ENSG00000012048.18 are handled by stripping the version suffix before lookup. If an Ensembl ID has no symbol mapping, it falls through to the silencer below (it's kept as a stable identifier, just not promoted to a gene symbol).
8. GENCODE clone-name resolver — rescued_gencode_clone_*
Looks up legacy GENCODE/HAVANA clone identifiers (RP11-…, CTD-…, KB-…, XXbac-…, etc.) in a prebuilt cross-reference table assembled from GENCODE v38 (Ensembl 104, May 2021). Each clone resolves to one of three things, in preference order:
- its current HGNC symbol, if HGNC has assigned one (tag
rescued_gencode_clone_hgnc_symbol); - otherwise its current Ensembl gene ID (tag
rescued_gencode_clone_current_ensg); - otherwise its current AC/AL/AP accession (tag
rescued_gencode_clone_current_ac_accession).
The verbose clone label is replaced with the resolved value, so downstream queries find the locus under its modern identifier. Clones absent from the table fall through to the silencer.
9. Non-symbol silencer — non_symbol_*
Catches values that recognisably aren't gene symbols, so the loader stops warning about them. Six categories:
| Category | Matches | Examples |
|---|---|---|
ensembl_human | ENSG\d+(\.\d+)? | ENSG00000123456, ENSG00000123456.5 |
ensembl_mouse | ENSMUSG\d+(\.\d+)? | ENSMUSG00000071265 |
contig | Sanger / WGS contig accessions | AC012345.6, AUXG01000058.1 |
gencode_clone | BAC / PAC / cosmid clone names | RP11-783K16.5, CTD-2331H12.4 |
genbank_accession | [A-Z]{1,2}\d{5,6}(\.\d+)? | KC877982, L29074.1 |
rna_family | RNA-family labels (not loci) | Y_RNA, U6, SNORA74, MIR5096 |
Silenced values are kept in the table as-is (with the original value preserved in _raw), but they don't produce a “not in gene maps” warning at load time, and they aren't inserted into the central gene table as if they were genuine gene records.
10. Unresolved fallback — unresolved
Anything that fell through every rule above. The original value is kept in the table (and in _raw), but it's the only tag that still triggers a warning at load time. Values landing here are usually dataset-specific noise (lab nicknames, sample IDs misfiled into the gene column) or genuinely retired symbols whose successor isn't in HGNC's automatic alias table; the latter become candidates for a manual-alias entry.
The _raw and _resolution columns
Every cleaned table keeps two extra columns next to each gene-name column:
<col>_raw— the original value before any cleaning. Always populated, even for rows that passed through unchanged._<col>_resolution— the per-row tag identifying which rule fired (or that the value is still unresolved).
Together they let you audit any row in a published table without cross-referencing the source. Example:
target_gene target_gene_raw _target_gene_resolution
BRCA1 BRCA1 passed_through
SEPTIN9 9-Sep rescued_excel
MATR3 MATR3.1 rescued_make_unique
CCN3 NOV rescued_manual_alias
ENSG00000… ENSG00000… non_symbol_ensembl_human
NOTAGENE NOTAGENE unresolvedThe full set of resolution tags:
| Tag | Meaning |
|---|---|
passed_through | Resolved via the live HGNC/MGI table (or the value was empty/NaN). |
rescued_hgnc_id | Literal HGNC:NNNNN identifier resolved. |
rescued_excel | Excel-mangled date repaired to its original symbol. |
rescued_make_unique | R make.unique .N suffix stripped. |
rescued_symbol_ensg | <symbol>_ENSG… composite split. |
rescued_manual_alias | Wrangler-curated retired-symbol successor used. |
rescued_ensembl_map | Bare ENSG/ENSMUSG identifier resolved to a current symbol. |
rescued_gencode_clone_hgnc_symbol | GENCODE clone resolved to its current HGNC symbol. |
rescued_gencode_clone_current_ensg | GENCODE clone resolved to a stable Ensembl gene ID (no HGNC symbol assigned). |
rescued_gencode_clone_current_ac_accession | GENCODE clone resolved to an AC/AL/AP accession (no HGNC or Ensembl available). |
non_symbol_ensembl_human / _ensembl_mouse / _contig / _gencode_clone / _genbank_accession / _rna_family | Recognisably not a gene symbol — kept as-is, silenced, classified by which pattern matched. |
unresolved | Genuinely unknown; kept as raw text and warned at load time. |
_raw is your audit trail — it preserves exactly what the wrangler's preprocessing script saw before any of these rules fired.Resolution preference order
When a single locus could plausibly resolve to multiple stable identifiers — typically a GENCODE clone that has both a current ENSG and an AC accession, or an ENSG that has both a symbol and a raw ID — we always prefer:
- HGNC symbol — the canonical, human-curated identifier; the only form most biologists recognise on sight.
- Ensembl gene ID (
ENSG…/ENSMUSG…) — stable across releases, machine-readable, supported by every downstream lookup we use. - AC / AL / AP accession — last-resort but unambiguous; used only for legacy clones that have no symbol and no current Ensembl gene ID.
- Silenced as non-symbol— for values that recognisably aren't loci at all (RNA families, contigs, GenBank accessions).
The same ordering governs the GENCODE clone resolver (rescued_gencode_clone_*), the Ensembl-map rescue (rescued_ensembl_map), and the silencer's classification — pick the most-curated form that still uniquely identifies the locus.
Species-specific notes
Human (HGNC). The parser uses HGNC's hgnc_complete_set.txt: approved symbols, their alias and previous symbols (with ambiguous many-to-one aliases dropped to avoid false rescues), and the literal HGNC:NNNNN identifiers. ENSG → symbol mappings come from the same file.
Mouse (MGI / Ensembl). The parser uses MGI_EntrezGene.rpt: approved MGI symbols, withdrawn-to-current mappings, synonyms, plus a case-insensitive fallback (Slc30A3 → Slc30a3). ENSMUSG → MGI symbol mappings come from the Alliance HGNC_AllianceHomology.rpt file, which also gives us mouse → human ortholog links used elsewhere in the database.
Zebrafish. Only one dataset (zebraAsd) uses zebrafish. There's no full ZFIN normalizer; the wrangler's preprocessing script up-cases the gene token and applies a manual paralog mapping (SCN1LAB → SCN1A) before the symbol reaches the central gene table.
What still slips through
Rows tagged unresolved are rare but not zero — the most recent full rebuild left ~75 across the entire database. They're kept in the table as raw text; the parser doesn't guess.
If you want to look at them, every dataset's Preprocessing (YAML) file (downloadable from the Downloads page) lists the first ~10 unresolved values per gene column under sample_unresolved, alongside the counts of every rescue rule that fired. That's the place to start when you suspect a particular paper's gene column wasn't cleanly imported.
Implementation history
Two GitHub issues track the bulk of the technical history if you'd like the development backstory:
- psypheno #119 — moving ENSG → symbol resolution from runtime into the preprocessing step, so cleaned tables ship with a real symbol already in place and an audit trail beside it.
- psypheno #139 — building the GENCODE clone-name resolver and pinning the clone table to GENCODE v38.
For statistical methods used elsewhere on the site (Fisher, Cauchy, harmonic-mean p-value combination), see Meta-analysis methods.
