Publishing to S3 & Athena
Convert local data to Parquet and upload to S3 for SQL querying via AWS Athena.
Usage
bioingest publish --crawl # all sources + trigger Glue crawler
bioingest publish --source uniprot # just one source
bioingest publish --dry-run # preview without uploading
Format Handling
| Input Format | Action |
|---|---|
| TSV/CSV/TXT | Convert to Parquet |
| Parquet | Upload directly |
| ZIP | Unzip → convert TSVs to Parquet |
| BGZ | Decompress → convert to Parquet |
| OBO | Parse terms → tabular Parquet |
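The table above amounts to a lookup from file extension to publish action. A minimal sketch of that dispatch follows; the function name `plan_action` and the action labels are hypothetical, chosen for illustration, and are not taken from bioingest's actual code.

```python
from pathlib import Path

# Hypothetical action labels mirroring the table; bioingest's internal
# names are not documented here.
ACTIONS = {
    ".tsv": "convert_to_parquet",
    ".csv": "convert_to_parquet",
    ".txt": "convert_to_parquet",
    ".parquet": "upload_directly",
    ".zip": "unzip_then_convert",
    ".bgz": "decompress_then_convert",
    ".obo": "parse_terms_then_convert",
}


def plan_action(path: str) -> str:
    """Return the publish action for a file, keyed on its final extension."""
    suffix = Path(path).suffix.lower()  # "variants.tsv.bgz" -> ".bgz"
    try:
        return ACTIONS[suffix]
    except KeyError:
        raise ValueError(f"unsupported input format: {path}") from None
```

Keying on the final extension means a compressed file such as `variants.tsv.bgz` is routed to decompression first; whether bioingest inspects the inner extension afterwards is not stated in this page.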
Infrastructure
- S3: bioingest-datalake-{account} bucket
- Glue Catalog: bioingest database, auto-detected schemas
- Glue Crawler: bioingest-crawler discovers tables
- Athena: serverless SQL across all sources
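The resource names listed above can be derived from a single value. The sketch below assumes that {account} stands for the AWS account ID, which this page does not confirm; the helper `resource_names` is hypothetical.

```python
def resource_names(account: str) -> dict[str, str]:
    """Build the AWS resource names used by the publish step.

    Assumption: `account` is the AWS account ID that fills the
    {account} placeholder in the bucket name.
    """
    return {
        "bucket": f"bioingest-datalake-{account}",
        "glue_database": "bioingest",
        "glue_crawler": "bioingest-crawler",
    }


names = resource_names("123456789012")  # placeholder account ID
print(names["bucket"])
```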
Configuration
Querying
Table naming: {source_id}__{dataset_name}
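As an illustration of the naming rule, the sketch below builds a table name and a preview query against the bioingest Glue database. The helper names and the `proteins` dataset are hypothetical; only the double-underscore pattern and the database name come from this page.

```python
DATABASE = "bioingest"  # Glue database name from the Infrastructure section


def table_name(source_id: str, dataset_name: str) -> str:
    """Join source and dataset with a double underscore."""
    return f"{source_id}__{dataset_name}"


def preview_query(source_id: str, dataset_name: str, limit: int = 10) -> str:
    """Return a SELECT statement that samples rows from one published table."""
    table = table_name(source_id, dataset_name)
    # Athena SELECT queries quote identifiers with double quotes.
    return f'SELECT * FROM "{DATABASE}"."{table}" LIMIT {limit}'


# "proteins" is a made-up dataset name used only for this example.
print(preview_query("uniprot", "proteins"))
```

The resulting string can be pasted into the Athena query editor once the Glue crawler has registered the table.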