Genomics & Expression Databases
GTEx
The Genotype-Tissue Expression (GTEx) project provides RNA-seq expression data across 54 human tissues from ~1,000 donors. The median_tissue_tpm dataset gives the median TPM (transcripts per million) per gene per tissue and is the primary source for the unified_expression Athena view.
| Dataset | Key Columns | Description |
|---|---|---|
| median_tissue_tpm | gene_id, gene_name, tissue columns (54 tissues) | Median TPM per gene per tissue |
| gene_tpm | gene_id, sample_id, tpm | Per-sample TPM (large) |
Identifiers: Ensembl gene ID with version (ENSG00000146648.12), gene symbol
Join key: Strip the version suffix from gene_id (ENSG00000146648.12 becomes ENSG00000146648) → protein_map.ensembl_gene_id
uv run bioingest download gtex --all
# Athena — tissue expression for EGFR
SELECT * FROM bioingest.unified_expression WHERE gene_name = 'EGFR' ORDER BY tpm DESC;
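The version stripping described in the join key can be sketched in Python as follows. This is a minimal illustration, not bioingest's own code: the helper name `strip_ensembl_version` is hypothetical, and it assumes the version is everything after the first dot in the ID.

```python
def strip_ensembl_version(gene_id: str) -> str:
    """Return an Ensembl gene ID without its version suffix.

    Assumption: the version is everything after the first dot, so
    "ENSG00000146648.12" becomes "ENSG00000146648". IDs that carry no
    version are returned unchanged.
    """
    return gene_id.strip().split(".", 1)[0]


print(strip_ensembl_version("ENSG00000146648.12"))  # ENSG00000146648
print(strip_ensembl_version("ENSG00000146648"))     # ENSG00000146648
```

Splitting on the first dot rather than matching a purely numeric suffix also drops any trailing annotation that follows the version number.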
Ensembl
EMBL-EBI's genome annotation system providing gene models, transcript structures, and cross-references for all human genes. The bioingest datasets extract gene metadata and ID cross-references in flat-file format.
| Dataset | Key Columns | Description |
|---|---|---|
| gene_info | gene_id, gene_name, biotype, chromosome, start, end, strand, description | Gene catalog (~60K entries) |
| xref | gene_id, db_name, xref_id | Cross-references (HGNC, EntrezGene, UniProt, etc.) |
Identifiers: Ensembl gene ID (ENSG00000146648)
Join key: protein_map.ensembl_gene_id
uv run bioingest download ensembl --all
# Athena
SELECT gene_id, gene_name, biotype, chromosome FROM bioingest.ensembl__gene_info
WHERE gene_name = 'EGFR';
gnomAD
The Genome Aggregation Database (gnomAD) provides population-level variant frequencies and gene constraint metrics from >125K exomes and >76K genomes. The constraint_metrics dataset is key for identifying loss-of-function-intolerant genes.
| Dataset | Key Columns | Description |
|---|---|---|
| constraint_metrics | gene, transcript, pLI, oe_lof_upper (LOEUF), mis_z, syn_z | Gene constraint scores |
| exomes_sites_vcf | VCF columns (CHROM, POS, REF, ALT, AF, AC, AN) | Variant-level frequencies (very large) |
Identifiers: Gene symbol, Ensembl transcript ID
Join key: constraint_metrics.gene → protein_map.gene_name
uv run bioingest download gnomad --datasets constraint_metrics
# Athena — most constrained genes
SELECT gene, pLI, oe_lof_upper FROM bioingest.gnomad__constraint_metrics
WHERE pLI > 0.9 ORDER BY oe_lof_upper ASC LIMIT 50;
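For rows already pulled out of Athena, the same selection can be sketched in Python. This is an illustration under assumptions: the function name `most_constrained` is hypothetical, rows are assumed to be dicts keyed by the column names in the table above, and missing scores are assumed to arrive as `None`. The sample rows are made-up placeholders, not gnomAD values.

```python
def most_constrained(rows, pli_min=0.9, limit=50):
    """Rows with pLI > pli_min, sorted by LOEUF ascending (most constrained first).

    Rows with a missing pLI or oe_lof_upper are skipped, mirroring how the
    SQL WHERE clause drops NULLs.
    """
    kept = [
        r for r in rows
        if r["pLI"] is not None
        and r["oe_lof_upper"] is not None
        and r["pLI"] > pli_min
    ]
    kept.sort(key=lambda r: r["oe_lof_upper"])
    return kept[:limit]


# Made-up placeholder rows for illustration only.
rows = [
    {"gene": "GENE_A", "pLI": 0.99, "oe_lof_upper": 0.20},
    {"gene": "GENE_B", "pLI": 0.95, "oe_lof_upper": 0.10},
    {"gene": "GENE_C", "pLI": 0.10, "oe_lof_upper": 1.20},
    {"gene": "GENE_D", "pLI": None, "oe_lof_upper": None},
]
print([r["gene"] for r in most_constrained(rows)])  # ['GENE_B', 'GENE_A']
```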
dbSNP
NCBI's database of single nucleotide polymorphisms and other short variants. It provides rs numbers (the most widely used variant identifiers), population allele frequencies, and clinical significance annotations.
| Dataset | Key Columns | Description |
|---|---|---|
| variant_freq | rs_id, chromosome, position, ref, alt, af_total, af_eur, af_eas, af_afr | Population frequencies per variant |
| refsnp_merged | rs_id_old, rs_id_new | Merged RS number mappings |
Identifiers: RS number (rs1050171)
Join key: Positional overlap with clinvar__variant_summary or annotation pipelines
uv run bioingest download dbsnp --datasets variant_freq
# Athena
SELECT rs_id, af_total, af_eur FROM bioingest.dbsnp__variant_freq WHERE rs_id = 'rs1050171';
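Because refsnp_merged maps old RS numbers to new ones, a retired ID may need to be resolved before it is looked up in variant_freq. The sketch below assumes the table has been loaded into a dict from rs_id_old to rs_id_new and that merges can chain, so links are followed until an unmerged ID is reached. The function name `resolve_rs_id` and the sample IDs are hypothetical placeholders.

```python
from __future__ import annotations


def resolve_rs_id(rs_id: str, merged: dict[str, str]) -> str:
    """Follow old -> new merge links until reaching an ID that has no further merge."""
    seen = {rs_id}
    while rs_id in merged:
        rs_id = merged[rs_id]
        if rs_id in seen:  # guard against a cycle in malformed input
            raise ValueError(f"merge cycle detected at {rs_id}")
        seen.add(rs_id)
    return rs_id


# Placeholder IDs for illustration only, not real dbSNP merge records.
merged = {"rs_old_1": "rs_old_2", "rs_old_2": "rs_current"}
print(resolve_rs_id("rs_old_1", merged))    # rs_current
print(resolve_rs_id("rs_current", merged))  # rs_current
```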
UK Biobank (Pan-UKB)
Pan-ancestry GWAS summary statistics from ~500K UK Biobank participants across ~7K phenotypes. Provides genome-wide association p-values and effect sizes for common diseases and quantitative traits.
| Dataset | Key Columns | Description |
|---|---|---|
| pan_ukb | phenotype, gene, variant, pval, beta, se, af | GWAS summary statistics |
Identifiers: Phenotype codes (ICD-10, custom UKB), variant positions
Join key: Gene symbol → protein_map.gene_name; phenotype ICD codes → disease_map.icd10_code
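As a rough illustration of working with pan_ukb rows after they have been fetched, the sketch below keeps the associations for one gene that pass the conventional genome-wide significance threshold of 5e-8. It assumes rows are dicts keyed by the column names in the table above and that pval holds raw p-values rather than a log-transformed value. The function name `significant_hits` and the sample rows are hypothetical placeholders, not Pan-UKB results.

```python
GENOME_WIDE_P = 5e-8  # conventional genome-wide significance threshold


def significant_hits(rows, gene, p_max=GENOME_WIDE_P):
    """Associations for `gene` with pval below p_max, smallest p-value first."""
    hits = [r for r in rows if r["gene"] == gene and r["pval"] < p_max]
    return sorted(hits, key=lambda r: r["pval"])


# Made-up placeholder rows for illustration only.
rows = [
    {"phenotype": "PHENO_1", "gene": "GENE_A", "variant": "var_1", "pval": 3e-12, "beta": 0.08},
    {"phenotype": "PHENO_2", "gene": "GENE_A", "variant": "var_2", "pval": 1e-3, "beta": 0.01},
    {"phenotype": "PHENO_3", "gene": "GENE_B", "variant": "var_3", "pval": 1e-20, "beta": -0.15},
]
print([r["phenotype"] for r in significant_hits(rows, "GENE_A")])  # ['PHENO_1']
```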