fix: Prevent Docker multi-arch race condition (fixes #328) (#334)

* fix: Prevent Docker multi-arch race condition (fixes #328)

Resolves race condition where docker-build.yml and release.yml both
push to 'latest' tag simultaneously, causing temporary ARM64-only
manifest that breaks AMD64 users.

Root Cause Analysis:
- During v2.20.0 release, 5 workflows ran concurrently on same commit
- docker-build.yml (triggered by main push + v* tag)
- release.yml (triggered by package.json version change)
- Both workflows pushed to 'latest' tag with no coordination
- Temporal window existed where only ARM64 platform was available

Changes - docker-build.yml:
- Remove v* tag trigger (let release.yml handle versioned releases)
- Add concurrency group to prevent overlapping runs on same branch
- Enable build cache (change no-cache: true -> false)
- Add cache-from/cache-to for consistency with release.yml
- Add multi-arch manifest verification after push

Changes - release.yml:
- Update concurrency group to be ref-specific (release-${{ github.ref }})
- Add multi-arch manifest verification for 'latest' tag
- Add multi-arch manifest verification for version tag
- Add 5s delay before verification to ensure registry processes push

Impact:
 Eliminates race condition between workflows
 Ensures 'latest' tag always has both AMD64 and ARM64
 Faster builds (caching enabled in docker-build.yml)
 Automatic verification catches incomplete pushes
 Clearer separation: docker-build.yml for CI, release.yml for releases

Testing:
- TypeScript compilation passes
- YAML syntax validated
- Will test on feature branch before merge

Closes #328

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Address code review - use shared concurrency group and add retry logic

Critical fixes based on code review feedback:

1. CRITICAL: Fixed concurrency groups to be shared between workflows
   - Changed from workflow-specific groups to shared 'docker-push-${{ github.ref }}'
   - This actually prevents the race condition (previous groups were isolated)
   - Both workflows now serialize Docker pushes to prevent simultaneous updates

2. Added retry logic with exponential backoff
   - Replaced fixed 5s sleep with intelligent retry mechanism
   - Retries up to 5 times with exponential backoff: 2s, 4s, 8s, 16s
   - Accounts for registry propagation delays
   - Fails fast if manifest is still incomplete after all retries

3. Improved Railway build job
   - Added 'needs: build' dependency to ensure sequential execution
   - Enabled caching (no-cache: false) for faster builds
   - Added cache-from/cache-to for consistency

4. Enhanced verification messaging
   - Clarified version tag format (without 'v' prefix)
   - Added attempt counters and wait time indicators
   - Better error messages with full manifest output

Previous Issue:
- docker-build.yml used group: docker-build-${{ github.ref }}
- release.yml used group: release-${{ github.ref }}
- These are DIFFERENT groups, so no serialization occurred

Fixed:
- Both now use group: docker-push-${{ github.ref }}
- Workflows will wait for each other to complete
- Race condition eliminated

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* chore: bump version to 2.20.1 and update CHANGELOG

Version Changes:
- package.json: 2.20.0 → 2.20.1
- package.runtime.json: 2.19.6 → 2.20.1 (sync with main version)

CHANGELOG Updates:
- Added comprehensive v2.20.1 entry documenting Issue #328 fix
- Detailed problem analysis with race condition timeline
- Root cause explanation (separate concurrency groups)
- Complete list of fixes and improvements
- Before/after comparison showing impact
- Technical details on concurrency serialization and retry logic
- References to issue #328, PR #334, and code review

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
This commit is contained in:
Romuald Członkowski
2025-10-18 20:32:20 +02:00
committed by GitHub
parent 5881304ed8
commit 05f68b8ea1
5 changed files with 318 additions and 9 deletions

View File

@@ -5,8 +5,6 @@ on:
push:
branches:
- main
tags:
- 'v*'
paths-ignore:
- '**.md'
- '**.txt'
@@ -38,6 +36,12 @@ on:
- 'CODE_OF_CONDUCT.md'
workflow_dispatch:
# Prevent concurrent Docker pushes across all workflows (shared with release.yml)
# This ensures docker-build.yml and release.yml never push to 'latest' simultaneously
concurrency:
group: docker-push-${{ github.ref }}
cancel-in-progress: false
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
@@ -89,16 +93,54 @@ jobs:
uses: docker/build-push-action@v5
with:
context: .
no-cache: true
no-cache: false
platforms: linux/amd64,linux/arm64
push: ${{ github.event_name != 'pull_request' }}
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha
cache-to: type=gha,mode=max
provenance: false
- name: Verify multi-arch manifest for latest tag
if: github.event_name != 'pull_request' && github.ref == 'refs/heads/main'
run: |
echo "Verifying multi-arch manifest for latest tag..."
# Retry with exponential backoff (registry propagation can take time)
MAX_ATTEMPTS=5
ATTEMPT=1
WAIT_TIME=2
while [ $ATTEMPT -le $MAX_ATTEMPTS ]; do
echo "Attempt $ATTEMPT of $MAX_ATTEMPTS..."
MANIFEST=$(docker buildx imagetools inspect ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest 2>&1 || true)
# Check for both platforms
if echo "$MANIFEST" | grep -q "linux/amd64" && echo "$MANIFEST" | grep -q "linux/arm64"; then
echo "✅ Multi-arch manifest verified: both amd64 and arm64 present"
echo "$MANIFEST"
exit 0
fi
if [ $ATTEMPT -lt $MAX_ATTEMPTS ]; then
echo "⏳ Registry still propagating, waiting ${WAIT_TIME}s before retry..."
sleep $WAIT_TIME
WAIT_TIME=$((WAIT_TIME * 2)) # Exponential backoff: 2s, 4s, 8s, 16s
fi
ATTEMPT=$((ATTEMPT + 1))
done
echo "❌ ERROR: Multi-arch manifest incomplete after $MAX_ATTEMPTS attempts!"
echo "$MANIFEST"
exit 1
build-railway:
name: Build Railway Docker Image
runs-on: ubuntu-latest
needs: build
permissions:
contents: read
packages: write
@@ -143,11 +185,13 @@ jobs:
with:
context: .
file: ./Dockerfile.railway
no-cache: true
no-cache: false
platforms: linux/amd64
push: ${{ github.event_name != 'pull_request' }}
tags: ${{ steps.meta-railway.outputs.tags }}
labels: ${{ steps.meta-railway.outputs.labels }}
cache-from: type=gha
cache-to: type=gha,mode=max
provenance: false
# Nginx build commented out until Phase 2

View File

@@ -13,9 +13,10 @@ permissions:
issues: write
pull-requests: write
# Prevent concurrent releases
# Prevent concurrent Docker pushes across all workflows (shared with docker-build.yml)
# This ensures release.yml and docker-build.yml never push to 'latest' simultaneously
concurrency:
group: release
group: docker-push-${{ github.ref }}
cancel-in-progress: false
env:
@@ -435,7 +436,76 @@ jobs:
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha
cache-to: type=gha,mode=max
- name: Verify multi-arch manifest for latest tag
run: |
echo "Verifying multi-arch manifest for latest tag..."
# Retry with exponential backoff (registry propagation can take time)
MAX_ATTEMPTS=5
ATTEMPT=1
WAIT_TIME=2
while [ $ATTEMPT -le $MAX_ATTEMPTS ]; do
echo "Attempt $ATTEMPT of $MAX_ATTEMPTS..."
MANIFEST=$(docker buildx imagetools inspect ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest 2>&1 || true)
# Check for both platforms
if echo "$MANIFEST" | grep -q "linux/amd64" && echo "$MANIFEST" | grep -q "linux/arm64"; then
echo "✅ Multi-arch manifest verified: both amd64 and arm64 present"
echo "$MANIFEST"
exit 0
fi
if [ $ATTEMPT -lt $MAX_ATTEMPTS ]; then
echo "⏳ Registry still propagating, waiting ${WAIT_TIME}s before retry..."
sleep $WAIT_TIME
WAIT_TIME=$((WAIT_TIME * 2)) # Exponential backoff: 2s, 4s, 8s, 16s
fi
ATTEMPT=$((ATTEMPT + 1))
done
echo "❌ ERROR: Multi-arch manifest incomplete after $MAX_ATTEMPTS attempts!"
echo "$MANIFEST"
exit 1
- name: Verify multi-arch manifest for version tag
run: |
VERSION="${{ needs.detect-version-change.outputs.new-version }}"
echo "Verifying multi-arch manifest for version tag :$VERSION (without 'v' prefix)..."
# Retry with exponential backoff (registry propagation can take time)
MAX_ATTEMPTS=5
ATTEMPT=1
WAIT_TIME=2
while [ $ATTEMPT -le $MAX_ATTEMPTS ]; do
echo "Attempt $ATTEMPT of $MAX_ATTEMPTS..."
MANIFEST=$(docker buildx imagetools inspect ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:$VERSION 2>&1 || true)
# Check for both platforms
if echo "$MANIFEST" | grep -q "linux/amd64" && echo "$MANIFEST" | grep -q "linux/arm64"; then
echo "✅ Multi-arch manifest verified for $VERSION: both amd64 and arm64 present"
echo "$MANIFEST"
exit 0
fi
if [ $ATTEMPT -lt $MAX_ATTEMPTS ]; then
echo "⏳ Registry still propagating, waiting ${WAIT_TIME}s before retry..."
sleep $WAIT_TIME
WAIT_TIME=$((WAIT_TIME * 2)) # Exponential backoff: 2s, 4s, 8s, 16s
fi
ATTEMPT=$((ATTEMPT + 1))
done
echo "❌ ERROR: Multi-arch manifest incomplete for version $VERSION after $MAX_ATTEMPTS attempts!"
echo "$MANIFEST"
exit 1
- name: Extract metadata for Railway image
id: meta-railway
uses: docker/metadata-action@v5