Merged
59 commits
6813a25
Merge pull request #61 from kbase/rebased_on_main
ialarmedalien Dec 11, 2025
4c3411a
update GHA
ialarmedalien Dec 19, 2025
dd5420b
Merge pull request #64 from kbase/update_run_tests
ialarmedalien Dec 19, 2025
f90470f
Adding new spark_delta utils and logger module.
ialarmedalien Dec 23, 2025
5da7ad6
Disabling broken tests.
ialarmedalien Dec 23, 2025
47fb069
Fix formatting and remove unneeded imports
ialarmedalien Dec 23, 2025
b3d8269
Remove unnecessary lambdas
ialarmedalien Dec 23, 2025
dd0c28e
Merge pull request #65 from kbase/fix_failing_tests
ialarmedalien Dec 23, 2025
3317f59
update the refseq docs
alinakbase Dec 24, 2025
7cd75da
Adding in downloader code
ialarmedalien Jan 21, 2026
be70e26
adding in gzip utils
ialarmedalien Jan 21, 2026
43e57c5
adding in files to get tests working
ialarmedalien Jan 21, 2026
0b58a13
Fixing up silly errors
ialarmedalien Jan 21, 2026
70b3d7b
Adding missing files for testing
ialarmedalien Jan 21, 2026
7031200
Ruff ruff!
ialarmedalien Jan 21, 2026
ac86512
Merge pull request #69 from kbase/add_in_downloaders
ialarmedalien Jan 21, 2026
c2ec0d6
add audit and validator components
ialarmedalien Jan 21, 2026
3b8ffbb
fix failfast not failing fast enough
ialarmedalien Jan 22, 2026
839930c
spark sucks
ialarmedalien Jan 22, 2026
77cbede
spark sucks
ialarmedalien Jan 22, 2026
5636d49
Adding in tsv/csv parse test
ialarmedalien Jan 22, 2026
4b318aa
Merge pull request #70 from kbase/add_audit
ialarmedalien Jan 22, 2026
f887ab3
Merge branch 'develop' into refseq-docs-walkthrough
ialarmedalien Jan 22, 2026
b54dbea
Merge pull request #66 from kbase/refseq-docs-walkthrough
ialarmedalien Jan 22, 2026
1b10016
Bump python-multipart in the uv group across 1 directory
dependabot[bot] Jan 27, 2026
3c7a560
Merge pull request #73 from kbase/dependabot/uv/uv-51f24ed7e3
ialarmedalien Jan 27, 2026
2db2ff5
minor checkm parser fix
ialarmedalien Jan 27, 2026
24cc1aa
Merge pull request #74 from kbase/checkm_fix
ialarmedalien Jan 27, 2026
37c9268
First draft of idmapping.py
ialarmedalien Jan 23, 2026
bc996f1
fixing text element retrieval
ialarmedalien Jan 23, 2026
b8ee622
resolving uniprot file name issue
ialarmedalien Jan 23, 2026
9ed9312
fix output ordering
ialarmedalien Jan 23, 2026
11d3667
Various fixes, including for the spark fixture
ialarmedalien Jan 28, 2026
b9900c2
fixing spark session pollution issue... or trying to
ialarmedalien Jan 28, 2026
4cf9ebb
Merge pull request #72 from kbase/idmapping
ialarmedalien Jan 28, 2026
938386e
Adding CLI to idmapping script
ialarmedalien Jan 29, 2026
eb0d786
Merge pull request #75 from kbase/add_script
ialarmedalien Jan 29, 2026
c8f1cb4
Create logger singleton variable; remove rdds
ialarmedalien Jan 29, 2026
6c20de7
Adding very important missing constant
ialarmedalien Jan 29, 2026
2fbd088
Merge pull request #76 from kbase/fix_logger_handlers
ialarmedalien Jan 29, 2026
59a0ec3
A couple of corrections to the pyproject file: moving berdl-notebook-…
ialarmedalien Jan 29, 2026
f6fab46
Moving dependencies; using venv in test scripts
ialarmedalien Jan 29, 2026
8f9a3d4
Fixing test calls
ialarmedalien Jan 29, 2026
0301f1b
Merge pull request #77 from kbase/pyproj_edits
ialarmedalien Jan 29, 2026
990bef2
Spark bugs workarounds
ialarmedalien Feb 4, 2026
a8dbd77
fixes
ialarmedalien Feb 4, 2026
780d606
Merge pull request #81 from kbase/spark_bugs_workaround
ialarmedalien Feb 4, 2026
8e0b62d
docs update for assert df equal
ialarmedalien Feb 4, 2026
e6b3b2f
Merge branch 'develop' into spark_bugs_workaround
ialarmedalien Feb 4, 2026
5fe75e8
Merge pull request #82 from kbase/spark_bugs_workaround
ialarmedalien Feb 4, 2026
6c54858
Update path for genome loader in README
crockettz Feb 20, 2026
41685da
Merge pull request #85 from crockettz/develop
ialarmedalien Feb 20, 2026
187f0a9
First pass UniProtKB pipeline push
ialarmedalien Feb 3, 2026
834204e
Potential fix for pull request finding 'Variable defined multiple times'
ialarmedalien Feb 22, 2026
78ab6e4
ensuring that boto3[crt] is loaded for minio interactions
ialarmedalien Feb 22, 2026
7e8001c
Merge pull request #86 from kbase/uniprot_pipeline
ialarmedalien Feb 22, 2026
fa6f887
removing accidental inclusion of function
ialarmedalien Feb 23, 2026
0f83d2b
Merge pull request #87 from kbase/rm_extra_function
ialarmedalien Feb 23, 2026
6eb97a6
Merge branch 'main' into develop
ialarmedalien Feb 23, 2026
19 changes: 19 additions & 0 deletions .dlt/config.toml
@@ -0,0 +1,19 @@
[normalize.data_writer]
disable_compression = "false"

[runtime]
http_show_error_body = "true"
log_level = "INFO"
log_format = "JSON"

[extract]
workers = 10

# 'local_fs' destination: save to the `output` directory
[destination.local_fs]
destination_type = "filesystem"
bucket_url = "/output_dir"

[destination.minio]
destination_type = "filesystem"
bucket_url = "s3://cdm-lake/tenant-general-warehouse/kbase/datasets/uniprot/"
59 changes: 59 additions & 0 deletions .github/workflows/docker_build.yaml
@@ -0,0 +1,59 @@
# This is boilerplate for publishing a Docker image via Github Actions.

name: Docker

on:
  workflow_dispatch:
  push:
    branches: ["main", "develop"]
    # Publish semver tags as releases.
    tags:
      - "v[0-9]+.[0-9]+.[0-9]+"
      - "[0-9]+.[0-9]+.[0-9]+"
      - "[0-9]+.[0-9]+.[0-9]+-*"
  pull_request:
    branches: ["main", "develop"]
  release:
    types: [published]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
      - name: Login to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Extract Docker metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}

      - name: Build and push Docker image
        id: build-and-push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
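To get a feel for which tag names the three `tags:` filters in the workflow would accept, here is a minimal sketch that approximates them with Python regular expressions. It assumes that, in GitHub's filter syntax, `.` is a literal dot, `+` means "one or more of the preceding character", and `*` matches any run of characters other than `/`. It handles only the characters that appear in these three patterns and is not a full implementation of the filter syntax; `filter_pattern_to_regex` and `tag_triggers_workflow` are hypothetical names.

```python
import re

# Tag filters copied from the workflow's `on.push.tags` list.
TAG_PATTERNS = [
    "v[0-9]+.[0-9]+.[0-9]+",
    "[0-9]+.[0-9]+.[0-9]+",
    "[0-9]+.[0-9]+.[0-9]+-*",
]


def filter_pattern_to_regex(pattern: str) -> re.Pattern[str]:
    """Approximate a workflow filter pattern as a regex (assumed semantics)."""
    parts = []
    for char in pattern:
        if char == "*":
            parts.append("[^/]*")  # any characters except a slash
        elif char == ".":
            parts.append(r"\.")  # a dot is literal in the filter syntax
        else:
            parts.append(char)  # brackets, ranges, '+', and literals pass through
    return re.compile("".join(parts))


def tag_triggers_workflow(tag: str) -> bool:
    """Return True if the tag fully matches any of the approximated filters."""
    return any(filter_pattern_to_regex(p).fullmatch(tag) for p in TAG_PATTERNS)


for candidate in ["v1.2.3", "1.2.3", "1.2.3-rc1", "v1.2.3-rc1", "latest"]:
    print(candidate, tag_triggers_workflow(candidate))
```

Under this approximation, a `v`-prefixed pre-release tag such as `v1.2.3-rc1` matches none of the three filters, because only the unprefixed form has a `-*` variant. Whether that is intended is a question for the PR author.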
6 changes: 3 additions & 3 deletions .github/workflows/tests.yml
@@ -4,7 +4,7 @@ permissions:

 on:
   push:
-    branches: [main]
+    branches: [main, develop]

   pull_request:
     types:
@@ -87,7 +87,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: ["3.12", "3.13"]
+        python-version: ["3.13", "3.14"]
         os: ["ubuntu-24.04"]

     steps:
@@ -103,7 +103,7 @@ jobs:
           python-version: ${{ matrix.python-version }}

       - name: Install dependencies
-        run: uv sync --dev
+        run: uv sync --dev --group local

       - name: Run local tests
         shell: bash
3 changes: 2 additions & 1 deletion .gitignore
@@ -166,5 +166,6 @@ cython_debug/
 #.idea/


-#Ignore vscode AI rules
+# codacy stuff
 .github/instructions/codacy.instructions.md
+.codacy
3 changes: 2 additions & 1 deletion .vscode/settings.json
@@ -18,5 +18,6 @@
     "python.testing.unittestEnabled": false,
     "python.testing.pytestArgs": [
         "tests"
-    ]
+    ],
+    "editor.formatOnSave": true
 }
44 changes: 44 additions & 0 deletions Dockerfile
@@ -0,0 +1,44 @@
# Use a Python image with uv pre-installed
FROM ghcr.io/astral-sh/uv:python3.13-bookworm-slim

# Install the project into `/app`
WORKDIR /app

# Enable bytecode compilation
ENV UV_COMPILE_BYTECODE=1

# Copy from the cache instead of linking since it's a mounted volume
ENV UV_LINK_MODE=copy

# Omit development dependencies
ENV UV_NO_DEV=1

# Ensure installed tools can be executed out of the box
ENV UV_TOOL_BIN_DIR=/usr/local/bin

# Install the project's dependencies using the lockfile and settings
RUN --mount=type=cache,target=/root/.cache/uv \
    --mount=type=bind,source=uv.lock,target=uv.lock \
    --mount=type=bind,source=pyproject.toml,target=pyproject.toml \
    uv sync --locked --no-install-project

# Then, add the rest of the project source code and install it
# Installing separately from its dependencies allows optimal layer caching
COPY . /app
RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --locked

# Place executables in the environment at the front of the path
ENV PATH="/app/.venv/bin:$PATH"

# Setup a non-root user
RUN groupadd --system --gid 999 nonroot \
    && useradd --system --gid 999 --uid 999 --create-home nonroot

# Use the non-root user to run our application
USER nonroot

# Reset the entrypoint, don't invoke `uv`
ENTRYPOINT []

# CMD ["uv", "run", "python", "--version"]
2 changes: 1 addition & 1 deletion README.md
@@ -98,7 +98,7 @@ The standard python `coverage` package is used and coverage can be generated as

 ## Loading genomes, contigs, and features

-The [genome loader](src/parsers/genome_loader.py) can be used to load and integrate data from related GFF and FASTA files. Currently, the loader requires a GFF file and two FASTA files (one for amino acid seqs, one for nucleic acid seqs) for each genome. The list of files to be processed should be specified in the genome paths file, which has the following format:
+The [genome loader](src/cdm_data_loader_utils/parsers/genome_loader.py) can be used to load and integrate data from related GFF and FASTA files. Currently, the loader requires a GFF file and two FASTA files (one for amino acid seqs, one for nucleic acid seqs) for each genome. The list of files to be processed should be specified in the genome paths file, which has the following format:

```json
{