Genomics & Expression Databases

GTEx

The Genotype-Tissue Expression (GTEx) project provides RNA-seq expression data across 54 human tissues from ~1000 donors. The median_tissue_tpm dataset gives the median TPM per gene per tissue and is the primary source for the unified_expression Athena view.

Dataset | Key Columns | Description
median_tissue_tpm | gene_id, gene_name, tissue columns (54 tissues) | Median TPM per gene per tissue
gene_tpm | gene_id, sample_id, tpm | Per-sample TPM (large)

Identifiers: Ensembl gene ID with version (ENSG00000146648.12), gene symbol
Join key: Strip version from gene_id → protein_map.ensembl_gene_id

uv run bioingest download gtex --all
# Athena — tissue expression for EGFR
SELECT * FROM bioingest.unified_expression WHERE gene_name = 'EGFR' ORDER BY tpm DESC;
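
The join key above depends on removing the version suffix from gene_id before matching against protein_map.ensembl_gene_id. A minimal Python sketch of that step follows; the function name strip_ensembl_version is hypothetical and not part of bioingest, and it assumes the version is everything after the first dot.

```python
def strip_ensembl_version(gene_id: str) -> str:
    """Return an Ensembl gene ID without its version suffix.

    Assumption: the version is whatever follows the first "." in the ID.
    IDs that carry no version are returned unchanged.
    """
    return gene_id.split(".", 1)[0]


# EGFR as it appears in GTEx (versioned) and in protein_map (unversioned).
versioned = "ENSG00000146648.12"
unversioned = strip_ensembl_version(versioned)
print(unversioned)
```

Applying the same normalization on both sides of the join keeps the match stable when GTEx and Ensembl releases carry different version numbers for the same gene.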

Ensembl

EMBL-EBI's genome annotation system providing gene models, transcript structures, and cross-references for all human genes. The bioingest datasets extract gene metadata and ID cross-references in flat-file format.

Dataset | Key Columns | Description
gene_info | gene_id, gene_name, biotype, chromosome, start, end, strand, description | Gene catalog (~60K entries)
xref | gene_id, db_name, xref_id | Cross-references (HGNC, EntrezGene, UniProt, etc.)

Identifiers: Ensembl gene ID (ENSG00000146648)
Join key: protein_map.ensembl_gene_id

uv run bioingest download ensembl --all
# Athena
SELECT gene_id, gene_name, biotype, chromosome FROM bioingest.ensembl__gene_info
WHERE gene_name = 'EGFR';
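
The xref dataset stores one row per (gene_id, db_name, xref_id), so looking up all external IDs for a gene usually means grouping rows first. The sketch below shows one way to do that in Python. The helper index_xrefs is hypothetical, and the exact db_name strings and ID formatting in the sample rows are assumptions based on the description above; check them against the actual ensembl__xref table.

```python
from collections import defaultdict


def index_xrefs(rows):
    """Group (gene_id, db_name, xref_id) rows as gene_id -> db_name -> [xref_id, ...]."""
    index = defaultdict(lambda: defaultdict(list))
    for gene_id, db_name, xref_id in rows:
        index[gene_id][db_name].append(xref_id)
    # Convert nested defaultdicts to plain dicts so missing keys raise KeyError.
    return {gene: dict(dbs) for gene, dbs in index.items()}


# Sample rows for EGFR; db_name spellings are assumed, not taken from the table.
rows = [
    ("ENSG00000146648", "HGNC", "HGNC:3236"),
    ("ENSG00000146648", "EntrezGene", "1956"),
    ("ENSG00000146648", "UniProt", "P00533"),
]

xrefs = index_xrefs(rows)
print(xrefs["ENSG00000146648"]["UniProt"])
```

A list is kept per database because a single gene can map to more than one external ID in the same source.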

gnomAD

The Genome Aggregation Database (gnomAD) provides population-level variant frequencies and gene constraint metrics from >125K exomes and >76K genomes. The constraint_metrics dataset is key for identifying loss-of-function-intolerant genes.

Dataset | Key Columns | Description
constraint_metrics | gene, transcript, pLI, oe_lof_upper (LOEUF), mis_z, syn_z | Gene constraint scores
exomes_sites_vcf | VCF columns (CHROM, POS, REF, ALT, AF, AC, AN) | Variant-level frequencies (very large)

Identifiers: Gene symbol, Ensembl transcript ID
Join key: constraint_metrics.gene → protein_map.gene_name

uv run bioingest download gnomad --datasets constraint_metrics
# Athena — most constrained genes
SELECT gene, pLI, oe_lof_upper FROM bioingest.gnomad__constraint_metrics
WHERE pLI > 0.9 ORDER BY oe_lof_upper ASC LIMIT 50;

dbSNP

NCBI's database of single nucleotide polymorphisms and other short variants. Provides rs-numbers (the universal variant identifier), population allele frequencies, and clinical significance annotations.

Dataset | Key Columns | Description
variant_freq | rs_id, chromosome, position, ref, alt, af_total, af_eur, af_eas, af_afr | Population frequencies per variant
refsnp_merged | rs_id_old, rs_id_new | Merged RS number mappings

Identifiers: RS number (rs1050171)
Join key: Positional overlap with clinvar__variant_summary or annotation pipelines

uv run bioingest download dbsnp --datasets variant_freq
# Athena
SELECT rs_id, af_total, af_eur FROM bioingest.dbsnp__variant_freq WHERE rs_id = 'rs1050171';
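
Because refsnp_merged maps retired RS numbers to their replacements, an rs_id taken from an older source may need to be resolved before it matches a row in variant_freq. The following Python sketch follows rs_id_old → rs_id_new links until it reaches an ID that has not been merged further. The function resolve_rs_id is hypothetical, the IDs in the sample mapping are placeholders rather than real RS numbers, and the possibility of multi-step merge chains is an assumption about the data.

```python
def resolve_rs_id(rs_id, merged):
    """Follow rs_id_old -> rs_id_new links until reaching an unmerged ID.

    `merged` is a dict built from refsnp_merged rows: {rs_id_old: rs_id_new}.
    Raises ValueError if the mapping contains a cycle.
    """
    seen = {rs_id}
    current = rs_id
    while current in merged:
        current = merged[current]
        if current in seen:
            raise ValueError(f"cycle in merge mapping at {current}")
        seen.add(current)
    return current


# Placeholder IDs for illustration only; these are not real RS numbers.
merged = {"rs_old_a": "rs_old_b", "rs_old_b": "rs_current"}

print(resolve_rs_id("rs_old_a", merged))
print(resolve_rs_id("rs_never_merged", merged))
```

IDs that do not appear as rs_id_old are returned unchanged, so the function can be applied to every incoming identifier without a separate lookup.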

UK Biobank (Pan-UKB)

Pan-ancestry GWAS summary statistics from ~500K UK Biobank participants across ~7K phenotypes. Provides genome-wide association p-values and effect sizes for common diseases and quantitative traits.

Dataset | Key Columns | Description
pan_ukb | phenotype, gene, variant, pval, beta, se, af | GWAS summary statistics

Identifiers: Phenotype codes (ICD-10, custom UKB), variant positions
Join key: Gene symbol → protein_map.gene_name; phenotype ICD codes → disease_map.icd10_code

uv run bioingest download ukb_disease_assoc
# Athena
SELECT gene, phenotype, pval, beta FROM bioingest.ukb_disease_assoc__pan_ukb
WHERE gene = 'APOE' AND pval < 5e-8 ORDER BY pval;
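
The 5e-8 cutoff in the query above is the conventional genome-wide significance threshold, which corresponds roughly to a Bonferroni correction of 0.05 across about one million independent tests. The Python sketch below shows the arithmetic and the same filter-and-sort applied to rows already pulled from Athena. Both function names are hypothetical, and the sample rows use placeholder variant labels and made-up p-values purely to illustrate the filter.

```python
def bonferroni_threshold(alpha=0.05, n_tests=1_000_000):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def significant_hits(rows, threshold=5e-8):
    """Keep rows with pval below the threshold, most significant first."""
    hits = [row for row in rows if row["pval"] < threshold]
    return sorted(hits, key=lambda row: row["pval"])


# Placeholder rows: variant labels and p-values are invented for illustration.
rows = [
    {"variant": "variant_a", "pval": 3e-12},
    {"variant": "variant_b", "pval": 1e-5},
    {"variant": "variant_c", "pval": 4e-9},
]

hits = significant_hits(rows)
print([row["variant"] for row in hits])
```

Filtering in the SQL WHERE clause is normally preferable for a table this size; a client-side filter like this is mainly useful when a different threshold needs to be applied to results that have already been downloaded.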