Publishing to S3 & Athena

Convert local data to Parquet and upload to S3 for SQL querying via AWS Athena.

Usage

bioingest publish --crawl                    # all sources + trigger Glue crawler
bioingest publish --source uniprot           # just one source
bioingest publish --dry-run                  # preview without uploading

Format Handling

Input Format    Action
TSV/CSV/TXT     Convert to Parquet
Parquet         Upload directly
ZIP             Unzip → convert TSVs to Parquet
BGZ             Decompress → convert to Parquet
OBO             Parse terms → tabular Parquet
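
The table above amounts to a dispatch on input format. The following is a minimal Python sketch of that idea, not bioingest's actual code. It assumes the format is identified by the file's final suffix, which the source does not state; `ACTIONS` and `action_for` are hypothetical names chosen for illustration.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the action listed in the table.
# Assumption: formats are recognised by suffix; bioingest may do otherwise.
ACTIONS = {
    ".tsv": "convert to Parquet",
    ".csv": "convert to Parquet",
    ".txt": "convert to Parquet",
    ".parquet": "upload directly",
    ".zip": "unzip, then convert TSVs to Parquet",
    ".bgz": "decompress, then convert to Parquet",
    ".obo": "parse terms into tabular Parquet",
}


def action_for(path: str) -> str:
    """Return the publish action for a file, keyed on its final suffix."""
    suffix = Path(path).suffix.lower()  # "variants.tsv.bgz" -> ".bgz"
    try:
        return ACTIONS[suffix]
    except KeyError:
        raise ValueError(f"unsupported input format: {path!r}") from None
```

Keying on the final suffix means a compressed file such as `variants.tsv.bgz` is treated as BGZ first, matching the "decompress, then convert" row.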

Infrastructure

  • S3: bioingest-datalake-{account} bucket
  • Glue Catalog: bioingest database, auto-detected schemas
  • Glue Crawler: bioingest-crawler discovers tables
  • Athena: serverless SQL across all sources

Configuration

AWS_PROFILE=dsinternal
AWS_REGION=eu-north-1
# Or: --bucket, --profile, --region flags
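
One plausible way to combine the environment variables and flags above is to let an explicit flag win and fall back to the environment otherwise. The sketch below illustrates that rule in Python. The precedence is an assumption (the source only says flags are an alternative), and `resolve_setting` is a hypothetical helper, not a bioingest function.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    flag_value: Optional[str],
    env_name: str,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Prefer an explicit flag value, else fall back to the environment.

    Assumption: flags override environment variables. The source does not
    confirm this ordering.
    """
    if flag_value is not None:
        return flag_value
    return env.get(env_name)
```

For example, `resolve_setting(None, "AWS_REGION")` would return whatever `AWS_REGION` is set to in the current shell, while passing a value as the first argument models a `--region` flag taking priority.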

Querying

SELECT entry, gene_names FROM bioingest.uniprot__swissprot_tsv WHERE gene_names LIKE '%EGFR%';

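Table naming: {source_id}__{dataset_name} (two underscores). For example, uniprot__swissprot_tsv in the query above is dataset swissprot_tsv from source uniprot.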
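
The naming pattern can be sketched in Python as a pair of helpers for building and splitting table names. Both function names are hypothetical, and the split assumes that `source_id` itself never contains a double underscore, which the source does not guarantee.

```python
from typing import Tuple


def table_name(source_id: str, dataset_name: str) -> str:
    """Join source and dataset with a double underscore."""
    return f"{source_id}__{dataset_name}"


def split_table_name(name: str) -> Tuple[str, str]:
    """Split on the first double underscore.

    Assumption: source_id contains no "__", so the first occurrence is
    the separator.
    """
    source_id, sep, dataset_name = name.partition("__")
    if not sep or not source_id or not dataset_name:
        raise ValueError(f"not a {{source_id}}__{{dataset_name}} table: {name!r}")
    return source_id, dataset_name
```

With these, `table_name("uniprot", "swissprot_tsv")` produces the table name used in the query above, and `split_table_name` recovers the two parts from it.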