SSPsyGene Logo

How the Gene-Symbol Parser Works

What the SSPsyGene loader does to the gene-name column of every published table — and what the _raw and _resolution columns next to it mean.

Why this is hard

Most published gene tables contain at least one column of “gene names” that nobody fully cleaned. By the time a spreadsheet has travelled from the lab through Excel, R, a Python notebook, a supplementary PDF, and our import script, the same gene can show up as BRCA1, brca1, HGNC:1100, ENSG00000012048.18, BRCA1.1, or even — for SEPTIN9 — the literal string 9-Sep.

A short, incomplete list of the things that go wrong:

  • Excel date coercion. A symbol like SEPT9 opens in Excel as the date “September 9” and gets re-saved as 9-Sep or 2023-09-09. Affects roughly one in five published gene-list papers despite HGNC's SEPTIN renames.
  • HGNC alias churn. Symbols are renamed (e.g. NOV → CCN3, QARS → QARS1), and old papers keep using the retired form.
  • R's make.unique. When a data frame has duplicate row names, R appends .1, .2, etc., turning MATR3 into MATR3.1.
  • Mixed Ensembl IDs. Some tables interleave symbols with raw ENSG…/ ENSMUSG… identifiers, sometimes with version suffixes (ENSG00000012048.18) and sometimes without.
  • Legacy GENCODE clone names. Older annotations (GENCODE pre-v22) used BAC/PAC/cosmid clone labels like RP11-783K16.5. Many of these now have HGNC symbols; some only have a current Ensembl gene ID; some only a GenBank accession.
  • Non-gene rows mixed in. RNA-family labels (Y_RNA, U6, SNORA74, MIR5096), assembly contig accessions (AC012345.6), and bare GenBank accessions (KC877982) all show up where a gene symbol is expected.

We're not aware of an off-the-shelf tool that handles every one of these in a single pass. The SSPsyGene loader does, and keeps a per-row audit trail of which rule fired so you can verify any rescue after the fact. The rest of this page walks through what each rule does.

What we run, in order

For every value in a gene column, the parser tries the following rules in order. The first rule that produces a current approved symbol wins; the parser then tags the row with the rule's name and moves on. The tag names match the values you'll see in the _resolution column described in the next section.

1. Direct lookup — passed_through

The most common case. The raw value is already a current HGNC symbol (human) or MGI symbol (mouse), or an unambiguous alias / previous symbol. We map it to the current canonical form and move on. Empty / NaN cells are also tagged passed_through.

2. HGNC ID rescue — rescued_hgnc_id

If the value is a literal HGNC identifier like HGNC:1100, we look it up in HGNC and substitute the current approved symbol.

3. Excel demangle — rescued_excel

Repairs the two date-coercion forms Excel produces — classic (9-Sep, 1-Mar) and ISO (2023-09-04) — back to the intended symbol (SEPTIN9, MARCHF1, …). Each candidate is verified against the live HGNC/MGI table before substitution, so we never invent a symbol the org doesn't recognise.

4. make.unique suffix strip — rescued_make_unique

Strips the trailing .N R make.unique() appends to disambiguate duplicate row names (e.g. MATR3.1 → MATR3). Only fires when the unsuffixed form resolves and the original does not, so we don't accidentally clobber GENCODE clone names like RP11-783K16.5 that legitimately end in .5.

5. Symbol/ENSG split — rescued_symbol_ensg

Some tools concatenate SYMBOL_ENSG… into a single column value. We split on _ENSG and resolve the symbol portion.

6. Manual aliases — rescued_manual_alias

A small wrangler-curated successor map for retired symbols whose current name HGNC's alias table doesn't resolve automatically. Cross-dataset entries today: NOV → CCN3, MUM1 → PWWP3A, QARS → QARS1, SARS → SARS1, TAZ → TAFAZZIN. The target is verified through HGNC before substitution, so a typo in the successor name fails loudly rather than silently corrupting the data.

7. Ensembl gene ID map — rescued_ensembl_map

Resolves bare ENSG… / ENSMUSG… identifiers to their current symbol via HGNC's hgnc_complete_set.txt (human) and Alliance HGNC_AllianceHomology.rpt (mouse). Versioned IDs like ENSG00000012048.18 are handled by stripping the version suffix before lookup. If an Ensembl ID has no symbol mapping, it falls through to the silencer below (it's kept as a stable identifier, just not promoted to a gene symbol).

8. GENCODE clone-name resolver — rescued_gencode_clone_*

Looks up legacy GENCODE/HAVANA clone identifiers (RP11-…, CTD-…, KB-…, XXbac-…, etc.) in a prebuilt cross-reference table assembled from GENCODE v38 (Ensembl 104, May 2021). Each clone resolves to one of three things, in preference order:

  • its current HGNC symbol, if HGNC has assigned one (tag rescued_gencode_clone_hgnc_symbol);
  • otherwise its current Ensembl gene ID (tag rescued_gencode_clone_current_ensg);
  • otherwise its current AC/AL/AP accession (tag rescued_gencode_clone_current_ac_accession).

The verbose clone label is replaced with the resolved value, so downstream queries find the locus under its modern identifier. Clones absent from the table fall through to the silencer.

9. Non-symbol silencer — non_symbol_*

Catches values that recognisably aren't gene symbols, so the loader stops warning about them. Six categories:

CategoryMatchesExamples
ensembl_humanENSG\d+(\.\d+)?ENSG00000123456, ENSG00000123456.5
ensembl_mouseENSMUSG\d+(\.\d+)?ENSMUSG00000071265
contigSanger / WGS contig accessionsAC012345.6, AUXG01000058.1
gencode_cloneBAC / PAC / cosmid clone namesRP11-783K16.5, CTD-2331H12.4
genbank_accession[A-Z]{1,2}\d{5,6}(\.\d+)?KC877982, L29074.1
rna_familyRNA-family labels (not loci)Y_RNA, U6, SNORA74, MIR5096

Silenced values are kept in the table as-is (with the original value preserved in _raw), but they don't produce a “not in gene maps” warning at load time, and they aren't inserted into the central gene table as if they were genuine gene records.

10. Unresolved fallback — unresolved

Anything that fell through every rule above. The original value is kept in the table (and in _raw), but it's the only tag that still triggers a warning at load time. Values landing here are usually dataset-specific noise (lab nicknames, sample IDs misfiled into the gene column) or genuinely retired symbols whose successor isn't in HGNC's automatic alias table; the latter become candidates for a manual-alias entry.

The _raw and _resolution columns

Every cleaned table keeps two extra columns next to each gene-name column:

  • <col>_raw — the original value before any cleaning. Always populated, even for rows that passed through unchanged.
  • _<col>_resolution — the per-row tag identifying which rule fired (or that the value is still unresolved).

Together they let you audit any row in a published table without cross-referencing the source. Example:

target_gene  target_gene_raw   _target_gene_resolution
BRCA1        BRCA1             passed_through
SEPTIN9      9-Sep             rescued_excel
MATR3        MATR3.1           rescued_make_unique
CCN3         NOV               rescued_manual_alias
ENSG00000…   ENSG00000…        non_symbol_ensembl_human
NOTAGENE     NOTAGENE          unresolved

The full set of resolution tags:

TagMeaning
passed_throughResolved via the live HGNC/MGI table (or the value was empty/NaN).
rescued_hgnc_idLiteral HGNC:NNNNN identifier resolved.
rescued_excelExcel-mangled date repaired to its original symbol.
rescued_make_uniqueR make.unique .N suffix stripped.
rescued_symbol_ensg<symbol>_ENSG… composite split.
rescued_manual_aliasWrangler-curated retired-symbol successor used.
rescued_ensembl_mapBare ENSG/ENSMUSG identifier resolved to a current symbol.
rescued_gencode_clone_hgnc_symbolGENCODE clone resolved to its current HGNC symbol.
rescued_gencode_clone_current_ensgGENCODE clone resolved to a stable Ensembl gene ID (no HGNC symbol assigned).
rescued_gencode_clone_current_ac_accessionGENCODE clone resolved to an AC/AL/AP accession (no HGNC or Ensembl available).
non_symbol_ensembl_human / _ensembl_mouse / _contig / _gencode_clone / _genbank_accession / _rna_familyRecognisably not a gene symbol — kept as-is, silenced, classified by which pattern matched.
unresolvedGenuinely unknown; kept as raw text and warned at load time.
Tip: if you suspect a rescue has gone wrong, _raw is your audit trail — it preserves exactly what the wrangler's preprocessing script saw before any of these rules fired.

Resolution preference order

When a single locus could plausibly resolve to multiple stable identifiers — typically a GENCODE clone that has both a current ENSG and an AC accession, or an ENSG that has both a symbol and a raw ID — we always prefer:

  1. HGNC symbol — the canonical, human-curated identifier; the only form most biologists recognise on sight.
  2. Ensembl gene ID (ENSG… / ENSMUSG…) — stable across releases, machine-readable, supported by every downstream lookup we use.
  3. AC / AL / AP accession — last-resort but unambiguous; used only for legacy clones that have no symbol and no current Ensembl gene ID.
  4. Silenced as non-symbol— for values that recognisably aren't loci at all (RNA families, contigs, GenBank accessions).

The same ordering governs the GENCODE clone resolver (rescued_gencode_clone_*), the Ensembl-map rescue (rescued_ensembl_map), and the silencer's classification — pick the most-curated form that still uniquely identifies the locus.

Species-specific notes

Human (HGNC). The parser uses HGNC's hgnc_complete_set.txt: approved symbols, their alias and previous symbols (with ambiguous many-to-one aliases dropped to avoid false rescues), and the literal HGNC:NNNNN identifiers. ENSG → symbol mappings come from the same file.

Mouse (MGI / Ensembl). The parser uses MGI_EntrezGene.rpt: approved MGI symbols, withdrawn-to-current mappings, synonyms, plus a case-insensitive fallback (Slc30A3 Slc30a3). ENSMUSG → MGI symbol mappings come from the Alliance HGNC_AllianceHomology.rpt file, which also gives us mouse → human ortholog links used elsewhere in the database.

Zebrafish. Only one dataset (zebraAsd) uses zebrafish. There's no full ZFIN normalizer; the wrangler's preprocessing script up-cases the gene token and applies a manual paralog mapping (SCN1LAB → SCN1A) before the symbol reaches the central gene table.

What still slips through

Rows tagged unresolved are rare but not zero — the most recent full rebuild left ~75 across the entire database. They're kept in the table as raw text; the parser doesn't guess.

If you want to look at them, every dataset's Preprocessing (YAML) file (downloadable from the Downloads page) lists the first ~10 unresolved values per gene column under sample_unresolved, alongside the counts of every rescue rule that fired. That's the place to start when you suspect a particular paper's gene column wasn't cleanly imported.

Implementation history

Two GitHub issues track the bulk of the technical history if you'd like the development backstory:

  • psypheno #119 — moving ENSG → symbol resolution from runtime into the preprocessing step, so cleaned tables ship with a real symbol already in place and an audit trail beside it.
  • psypheno #139 — building the GENCODE clone-name resolver and pinning the clone table to GENCODE v38.

For statistical methods used elsewhere on the site (Fisher, Cauchy, harmonic-mean p-value combination), see Meta-analysis methods.