Genomics & Expression Databases

GTEx

The Genotype-Tissue Expression (GTEx) project provides RNA-seq expression data across 54 human tissues from ~1,000 donors. The median_tissue_tpm dataset gives the median TPM (transcripts per million) per gene per tissue and is the primary source for the unified_expression Athena view.

| Dataset | Key Columns | Description |
| --- | --- | --- |
| `median_tissue_tpm` | `gene_id`, `gene_name`, `tissue` columns (54 tissues) | Median TPM per gene per tissue |
| `gene_tpm` | `gene_id`, `sample_id`, `tpm` | Per-sample TPM (large) |

Identifiers: Ensembl gene ID with version (ENSG00000146648.12), gene symbol

Join key: strip the version suffix from gene_id (ENSG00000146648.12 becomes ENSG00000146648), then match protein_map.ensembl_gene_id

```bash
uv run bioingest download gtex --all
```

```sql
-- Athena: tissue expression for EGFR
SELECT * FROM bioingest.unified_expression WHERE gene_name = 'EGFR' ORDER BY tpm DESC;
```
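
The version-stripping join key can be sketched in Python, together with a reshaping step from per-tissue columns to one row per gene and tissue. This is a minimal sketch, not bioingest's actual code: the function names `strip_ensembl_version` and `melt_tissue_row` are hypothetical, and it assumes each `median_tissue_tpm` row arrives as a dict with `gene_id`, `gene_name`, and one key per tissue. The TPM values are made up.

```python
def strip_ensembl_version(gene_id: str) -> str:
    """Drop the trailing version from an Ensembl ID, e.g. 'ENSG00000146648.12'."""
    # partition() returns the whole string unchanged when no '.' is present.
    return gene_id.partition(".")[0]


def melt_tissue_row(row: dict) -> list[dict]:
    """Turn one wide row (one key per tissue) into long (gene, tissue, tpm) records."""
    id_cols = {"gene_id", "gene_name"}  # assumed identifier columns
    return [
        {
            "ensembl_gene_id": strip_ensembl_version(row["gene_id"]),
            "gene_name": row["gene_name"],
            "tissue": tissue,
            "tpm": float(tpm),
        }
        for tissue, tpm in row.items()
        if tissue not in id_cols
    ]


# Illustrative values only, not real GTEx measurements.
wide = {"gene_id": "ENSG00000146648.12", "gene_name": "EGFR", "Lung": 30.5, "Liver": 12.0}
long_rows = sorted(melt_tissue_row(wide), key=lambda r: r["tpm"], reverse=True)
```

Here `long_rows` holds one record per tissue, highest TPM first, each carrying the unversioned ID that would be matched against protein_map.ensembl_gene_id.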

Ensembl

EMBL-EBI's genome annotation system providing gene models, transcript structures, and cross-references for all human genes. The bioingest datasets extract gene metadata and ID cross-references in flat-file format.

| Dataset | Key Columns | Description |
| --- | --- | --- |
| `gene_info` | `gene_id`, `gene_name`, `biotype`, `chromosome`, `start`, `end`, `strand`, `description` | Gene catalog (~60K entries) |
| `xref` | `gene_id`, `db_name`, `xref_id` | Cross-references (HGNC, EntrezGene, UniProt, etc.) |

Identifiers: Ensembl gene ID (ENSG00000146648)

Join key: protein_map.ensembl_gene_id

```bash
uv run bioingest download ensembl --all
```

```sql
-- Athena
SELECT gene_id, gene_name, biotype, chromosome FROM bioingest.ensembl__gene_info
WHERE gene_name = 'EGFR';
```
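
Because `xref` stores one row per (gene, database, identifier), lookups are easier after grouping the rows by gene. The following is a minimal sketch of that grouping. The helper name `index_xrefs` is hypothetical, and the exact `db_name` strings are assumed to match the names listed in the table.

```python
from collections import defaultdict


def index_xrefs(rows):
    """Group xref rows as {gene_id: {db_name: [xref_id, ...]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for row in rows:
        index[row["gene_id"]][row["db_name"]].append(row["xref_id"])
    # Convert to plain dicts so a missing key raises instead of silently appearing.
    return {gene: dict(dbs) for gene, dbs in index.items()}


# Sample rows: EGFR identifiers in HGNC, EntrezGene and UniProt.
xref_rows = [
    {"gene_id": "ENSG00000146648", "db_name": "HGNC", "xref_id": "HGNC:3236"},
    {"gene_id": "ENSG00000146648", "db_name": "EntrezGene", "xref_id": "1956"},
    {"gene_id": "ENSG00000146648", "db_name": "UniProt", "xref_id": "P00533"},
]
xrefs = index_xrefs(xref_rows)
```

Each database maps to a list because one gene can carry several identifiers in the same external database.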

gnomAD

The Genome Aggregation Database (gnomAD) provides population-level variant frequencies and gene constraint metrics from >125K exomes and >76K genomes. The constraint_metrics dataset is key for identifying loss-of-function-intolerant genes, which show a high pLI and a low LOEUF.

| Dataset | Key Columns | Description |
| --- | --- | --- |
| `constraint_metrics` | `gene`, `transcript`, `pLI`, `oe_lof_upper` (LOEUF), `mis_z`, `syn_z` | Gene constraint scores |
| `exomes_sites_vcf` | VCF columns (`CHROM`, `POS`, `REF`, `ALT`, `AF`, `AC`, `AN`) | Variant-level frequencies (very large) |

Identifiers: Gene symbol, Ensembl transcript ID

Join key: constraint_metrics.gene → protein_map.gene_name

```bash
uv run bioingest download gnomad --datasets constraint_metrics
```

```sql
-- Athena: most constrained genes
SELECT gene, pLI, oe_lof_upper FROM bioingest.gnomad__constraint_metrics
WHERE pLI > 0.9 ORDER BY oe_lof_upper ASC LIMIT 50;
```
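
The same selection can be applied to rows already loaded in memory: keep genes with pLI above a cutoff, then rank by ascending LOEUF. This is a minimal sketch. The function name `most_constrained` is hypothetical, rows are assumed to be dicts keyed by the column names in the table, and the scores are made up. Unlike the query, it also drops genes whose LOEUF is missing.

```python
def most_constrained(rows, pli_min=0.9, limit=50):
    """Keep genes with pLI above pli_min, ordered by ascending LOEUF."""
    kept = [
        r for r in rows
        # Skip genes with missing scores before comparing.
        if r["pLI"] is not None and r["oe_lof_upper"] is not None and r["pLI"] > pli_min
    ]
    kept.sort(key=lambda r: r["oe_lof_upper"])
    return kept[:limit]


# Made-up scores for illustration; not real gnomAD values.
rows = [
    {"gene": "GENE_A", "pLI": 0.99, "oe_lof_upper": 0.12},
    {"gene": "GENE_B", "pLI": 0.10, "oe_lof_upper": 1.40},
    {"gene": "GENE_C", "pLI": 0.95, "oe_lof_upper": 0.30},
    {"gene": "GENE_D", "pLI": None, "oe_lof_upper": None},
]
top = most_constrained(rows)
```

A lower LOEUF means fewer loss-of-function variants were observed than expected, so sorting ascending puts the most intolerant genes first.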

dbSNP

NCBI's database of single nucleotide polymorphisms and other short variants. Provides rs-numbers (the universal variant identifier), population allele frequencies, and clinical significance annotations.

| Dataset | Key Columns | Description |
| --- | --- | --- |
| `variant_freq` | `rs_id`, `chromosome`, `position`, `ref`, `alt`, `af_total`, `af_eur`, `af_eas`, `af_afr` | Population frequencies per variant |
| `refsnp_merged` | `rs_id_old`, `rs_id_new` | Merged RS number mappings |

Identifiers: RS number (rs1050171)

Join key: Positional overlap with clinvar__variant_summary or annotation pipelines

```bash
uv run bioingest download dbsnp --datasets variant_freq
```

```sql
-- Athena
SELECT rs_id, af_total, af_eur FROM bioingest.dbsnp__variant_freq WHERE rs_id = 'rs1050171';
```
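
An RS number taken from an older source may have been merged into a newer one, so a lookup in `variant_freq` can miss unless the ID is first resolved through `refsnp_merged`. This is a minimal sketch of that resolution. The function name `resolve_rsid` is hypothetical, the mapping is assumed to be loaded as a dict from `rs_id_old` to `rs_id_new`, and the merge chain shown is invented.

```python
def resolve_rsid(rs_id, merged):
    """Follow old -> new merge records until reaching an ID that was not merged away."""
    seen = {rs_id}
    while rs_id in merged:
        rs_id = merged[rs_id]
        if rs_id in seen:  # guard against a malformed, cyclic mapping
            raise ValueError(f"cycle in merge records at {rs_id}")
        seen.add(rs_id)
    return rs_id


# Hypothetical merge chain: rs111 was merged into rs222, which was merged into rs333.
merged = {"rs111": "rs222", "rs222": "rs333"}
current = resolve_rsid("rs111", merged)
```

IDs that never appear as `rs_id_old` are returned unchanged, so the function is safe to apply to every incoming RS number.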

UK Biobank (Pan-UKB)

Pan-ancestry GWAS summary statistics from ~500K UK Biobank participants across ~7K phenotypes. Provides genome-wide association p-values and effect sizes for common diseases and quantitative traits.

| Dataset | Key Columns | Description |
| --- | --- | --- |
| `pan_ukb` | `phenotype`, `gene`, `variant`, `pval`, `beta`, `se`, `af` | GWAS summary statistics |

Identifiers: Phenotype codes (ICD-10, custom UKB), variant positions

Join key: Gene symbol → protein_map.gene_name; phenotype ICD codes → disease_map.icd10_code

```bash
uv run bioingest download ukb_disease_assoc
```

```sql
-- Athena
SELECT gene, phenotype, pval, beta FROM bioingest.ukb_disease_assoc__pan_ukb
WHERE gene = 'APOE' AND pval < 5e-8 ORDER BY pval;
```
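
The `5e-8` cutoff in the query is the conventional genome-wide significance threshold, roughly a 0.05 error rate corrected for about one million independent tests. The same filter can be sketched over in-memory rows. The function name `significant_hits` is hypothetical, rows are assumed to be dicts keyed by the column names in the table, and the statistics are made up.

```python
GENOME_WIDE_P = 5e-8  # conventional threshold, roughly 0.05 / 1,000,000 tests


def significant_hits(rows, gene, p_max=GENOME_WIDE_P):
    """Return a gene's associations with pval below p_max, strongest first."""
    hits = [r for r in rows if r["gene"] == gene and r["pval"] < p_max]
    return sorted(hits, key=lambda r: r["pval"])


# Made-up summary statistics for illustration; not real Pan-UKB results.
rows = [
    {"gene": "APOE", "phenotype": "PHENO_1", "pval": 3e-12, "beta": 0.21},
    {"gene": "APOE", "phenotype": "PHENO_2", "pval": 2e-4, "beta": 0.03},
    {"gene": "APOE", "phenotype": "PHENO_3", "pval": 1e-30, "beta": -0.40},
    {"gene": "EGFR", "phenotype": "PHENO_1", "pval": 1e-9, "beta": 0.05},
]
apoe_hits = significant_hits(rows, "APOE")
```

Sorting by `pval` ranks by strength of evidence only; the sign and size of `beta` still need to be read separately to interpret the direction of the effect.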