Epic 2: Automated Content Ingestion & Podcast Generation Pipeline
Goal: To implement the complete automated daily workflow within the backend. This includes fetching Hacker News post data, scraping and extracting content from linked external articles, aggregating and formatting text, submitting it to Play.ai, managing job status via polling, and retrieving/storing the final audio file and associated metadata. This epic delivers the core value proposition of generating the daily audio content and making it ready for consumption via an API.
User Stories
Story 2.1: AWS CDK Extension for Epic 2 Resources
- User Story Statement: As a Developer, I need to extend the existing AWS CDK stack within the `bmad-daily-digest-backend` project to define and provision all new AWS resources required for the content ingestion and podcast generation pipeline (including the `BmadDailyDigestEpisodes` DynamoDB table with GSI, the `HackerNewsPostProcessState` DynamoDB table, an S3 bucket for audio storage, and the AWS Step Functions state machine for orchestrating the Play.ai job status polling), so that all backend infrastructure for this epic is managed as code and ready for the application logic.
- Acceptance Criteria (ACs):
  - The existing AWS CDK application (from Epic 1) must be extended.
  - The `BmadDailyDigestEpisodes` DynamoDB table resource must be defined in CDK as specified in the System Architecture Document's "Data Models" section (with `episodeId` PK and key attributes such as `publicationDate`, `episodeNumber`, `podcastGeneratedTitle`, `audioS3Key`, `audioS3Bucket`, `playAiJobId`, `playAiSourceAudioUrl`, the `sourceHNPosts` list, `status`, `createdAt`, `updatedAt`), including a GSI for chronological sorting (e.g., PK `status`, SK `publicationDate`) and PAY_PER_REQUEST billing.
  - The `HackerNewsPostProcessState` DynamoDB table resource must be defined in CDK as specified in the System Architecture Document's "Data Models" section (with `hnPostId` PK and attributes such as `lastCommentFetchTimestamp`, `lastSuccessfullyScrapedTimestamp`, `lastKnownRank`), with PAY_PER_REQUEST billing.
  - An S3 bucket resource (e.g., `bmad-daily-digest-audio-{unique-suffix}`) must be defined via CDK for audio storage, with private access by default.
  - An AWS Step Functions state machine resource must be defined via CDK to manage the Play.ai job status polling workflow (as detailed in Story 2.6).
  - Necessary IAM roles and permissions for this epic's Lambda functions to interact with DynamoDB, S3, Step Functions, and CloudWatch Logs must be defined via CDK, adhering to least privilege.
  - The updated CDK stack must synthesize (`cdk synth`) and deploy (`cdk deploy`) successfully.
  - All new CDK code must adhere to project ESLint/Prettier standards.
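To make the CDK requirements concrete, here is a minimal sketch of how these resources might be declared. The construct IDs, the bundling into a single construct, and the GSI name are illustrative assumptions; the authoritative attribute definitions live in the System Architecture Document.

```typescript
import { RemovalPolicy } from "aws-cdk-lib";
import * as dynamodb from "aws-cdk-lib/aws-dynamodb";
import * as s3 from "aws-cdk-lib/aws-s3";
import { Construct } from "constructs";

// Hypothetical construct bundling the Epic 2 data resources.
export class Epic2DataResources extends Construct {
  public readonly episodesTable: dynamodb.Table;
  public readonly postStateTable: dynamodb.Table;
  public readonly audioBucket: s3.Bucket;

  constructor(scope: Construct, id: string) {
    super(scope, id);

    // Episodes table: episodeId PK, plus a GSI for chronological listing by status.
    this.episodesTable = new dynamodb.Table(this, "BmadDailyDigestEpisodes", {
      partitionKey: { name: "episodeId", type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
    });
    this.episodesTable.addGlobalSecondaryIndex({
      indexName: "status-publicationDate-index",
      partitionKey: { name: "status", type: dynamodb.AttributeType.STRING },
      sortKey: { name: "publicationDate", type: dynamodb.AttributeType.STRING },
    });

    // Per-post processing state, keyed by HN post ID.
    this.postStateTable = new dynamodb.Table(this, "HackerNewsPostProcessState", {
      partitionKey: { name: "hnPostId", type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
    });

    // Private-by-default audio bucket; CDK appends a unique suffix to the generated name.
    this.audioBucket = new s3.Bucket(this, "AudioBucket", {
      blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
      removalPolicy: RemovalPolicy.RETAIN,
    });
  }
}
```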
Story 2.2: Fetch Top Hacker News Posts & Identify Repeats
- User Story Statement: As the System, I need to reliably fetch the top N (configurable, e.g., 10) current Hacker News posts daily using the Algolia HN API, including their essential metadata. I also need to identify whether each fetched post has been processed in a recent digest by checking against the `HackerNewsPostProcessState` table, so that I have an accurate list of stories and their status (new or repeat) to begin generating the daily digest.
- Acceptance Criteria (ACs):
  - A `hackerNewsService.ts` function must fetch the top N HN posts (stories only) via `axios` from the Algolia API (configurable `HN_POSTS_COUNT`).
  - Extracted metadata per post: Title, Article URL, HN Post URL, HN Post ID (`objectID`), Author, Points, Creation timestamp.
  - For each post, the function must query the `HackerNewsPostProcessState` DynamoDB table to determine its `isUpdateStatus` (true if `lastSuccessfullyScrapedTimestamp` and `lastCommentFetchTimestamp` indicate prior full processing) and retrieve `lastCommentFetchTimestamp` and `lastKnownRank` if available.
  - The function must return an array of HN post objects with metadata, `isUpdateStatus`, `lastCommentFetchTimestamp`, and `lastKnownRank`.
  - Error handling for Algolia/DynamoDB calls must be implemented and logged.
  - Unit tests (Jest) must verify API calls, data extraction, repeat identification (mocked DDB), and error handling. All tests must pass.
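A sketch of the fetch-and-flag flow in `hackerNewsService.ts`, assuming Algolia's public `front_page` search tag and an AWS SDK v3 DocumentClient lookup; the `TopPost` shape and the `POST_STATE_TABLE_NAME` env var are hypothetical:

```typescript
import axios from "axios";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Hypothetical shape combining Algolia metadata with repeat-detection state.
export interface TopPost {
  hnPostId: string;
  title: string;
  articleUrl: string | null;
  hnPostUrl: string;
  author: string;
  points: number;
  createdAt: string;
  isUpdateStatus: boolean;
  lastCommentFetchTimestamp?: number;
  lastKnownRank?: number;
}

export async function fetchTopPosts(): Promise<TopPost[]> {
  const count = Number(process.env.HN_POSTS_COUNT ?? "10");
  // Algolia's front_page tag returns the current HN front page (stories only).
  const { data } = await axios.get("https://hn.algolia.com/api/v1/search", {
    params: { tags: "front_page", hitsPerPage: count },
  });

  const posts: TopPost[] = [];
  for (const hit of data.hits) {
    // Look up prior processing state to classify the post as new or repeat.
    const { Item: state } = await ddb.send(
      new GetCommand({
        TableName: process.env.POST_STATE_TABLE_NAME, // assumed env var
        Key: { hnPostId: hit.objectID },
      })
    );
    posts.push({
      hnPostId: hit.objectID,
      title: hit.title,
      articleUrl: hit.url ?? null,
      hnPostUrl: `https://news.ycombinator.com/item?id=${hit.objectID}`,
      author: hit.author,
      points: hit.points,
      createdAt: hit.created_at,
      // A post counts as a repeat only if it was fully processed before.
      isUpdateStatus: Boolean(
        state?.lastSuccessfullyScrapedTimestamp && state?.lastCommentFetchTimestamp
      ),
      lastCommentFetchTimestamp: state?.lastCommentFetchTimestamp,
      lastKnownRank: state?.lastKnownRank,
    });
  }
  return posts;
}
```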
Story 2.3: Article Content Scraping & Extraction (Conditional)
- User Story Statement: As the System, for each Hacker News post identified as new (or for which article scraping previously failed and is being retried), I need to robustly fetch its HTML content from the linked article URL and extract the primary textual content and title using libraries like Cheerio and Mozilla Readability. If scraping fails, a fallback mechanism must be triggered.
- Acceptance Criteria (ACs):
  - An `articleScraperService.ts` function must accept an article URL and `isUpdateStatus`.
  - If `isUpdateStatus` is true, scraping must be skipped. (For MVP, repeat posts are not re-scraped; per user feedback, only their new comments are fetched. Full articles are not stored long term, so this story covers only fresh scrapes and retries of previously failed scrapes.)
  - If a new scrape is needed: use `axios` (timeout, User-Agent) to fetch the HTML.
  - Use `Mozilla Readability` (JS port) and/or `Cheerio` to extract the main article text and title.
  - Return `{ success: true, title: string, content: string }` on success.
  - If scraping fails: log the failure and return `{ success: false, error: string, fallbackNeeded: true }`.
  - No specific "polite" inter-article scraping delays for MVP.
  - Unit tests (Jest) must mock `axios` and test successful extraction, skip logic for non-applicable cases, and failure/fallback scenarios. All tests must pass.
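A minimal sketch of the scraper, assuming the `@mozilla/readability` JS port with `jsdom` for extraction (Cheerio could serve as a secondary extraction path); returning `null` on skip is an assumption, since the ACs only define the success and failure shapes:

```typescript
import axios from "axios";
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";

export type ScrapeResult =
  | { success: true; title: string; content: string }
  | { success: false; error: string; fallbackNeeded: true };

export async function scrapeArticle(
  url: string,
  isUpdateStatus: boolean
): Promise<ScrapeResult | null> {
  // Repeat posts are not re-scraped for MVP; only their comments are refreshed.
  if (isUpdateStatus) return null;
  try {
    const { data: html } = await axios.get<string>(url, {
      timeout: 10_000,
      headers: { "User-Agent": "bmad-daily-digest/1.0 (article fetcher)" },
    });
    // Readability works on a parsed DOM; passing the URL helps resolve relative links.
    const dom = new JSDOM(html, { url });
    const article = new Readability(dom.window.document).parse();
    if (!article?.textContent?.trim()) {
      throw new Error("Readability could not extract article content");
    }
    return { success: true, title: article.title ?? "", content: article.textContent.trim() };
  } catch (err) {
    // Caller triggers the fallback path (e.g., extended comments) on failure.
    return { success: false, error: (err as Error).message, fallbackNeeded: true };
  }
}
```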
Story 2.4: Fetch Hacker News Comments (Conditional Logic)
- User Story Statement: As the System, I need to fetch comments for each selected Hacker News post using the Algolia HN API, adjusting the strategy to fetch up to N comments for new posts, only new comments since last fetch for repeat posts, or up to 3N comments if article scraping failed.
- Acceptance Criteria (ACs):
  - `hackerNewsService.ts` must be extended to fetch comments for an HN Post ID, accepting `isUpdateStatus`, `lastCommentFetchTimestamp` (from `HackerNewsPostProcessState`), and `articleScrapingFailed` flags.
  - Use `axios` to call the Algolia HN API item endpoint.
  - Comment Fetching Logic:
    - If `articleScrapingFailed`: fetch up to 3 * `HN_COMMENTS_COUNT_PER_POST` available comments.
    - If `isUpdateStatus`: fetch all comments, then filter client-side for comments with `created_at_i` > `lastCommentFetchTimestamp`. Select up to `HN_COMMENTS_COUNT_PER_POST` of these new comments.
    - Else (new post, successful scrape): fetch up to `HN_COMMENTS_COUNT_PER_POST`.
  - For selected comments, extract plain text (HTML stripped), author, and creation timestamp.
  - Return an array of comment objects (empty if none). An updated `lastCommentFetchTimestamp` (max `created_at_i` of the fetched comments for this post) must be available for updating `HackerNewsPostProcessState`.
  - Error handling and logging for API calls.
  - Unit tests (Jest) must mock `axios` and verify all conditional fetching logic, comment selection/filtering, data extraction, and error handling. All tests must pass.
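The conditional logic might look like the following sketch, which flattens Algolia's nested `items/:id` response and strips HTML with a naive regex (a real implementation would also decode HTML entities); the return shape is an assumption:

```typescript
import axios from "axios";

export interface HnComment {
  author: string;
  text: string;       // plain text, HTML stripped
  createdAtI: number; // Unix seconds (Algolia created_at_i)
}

// Recursively flatten the Algolia item tree into a list of comments.
function flatten(node: any, out: HnComment[]): HnComment[] {
  for (const child of node.children ?? []) {
    if (child.type === "comment" && child.text && child.author) {
      out.push({
        author: child.author,
        text: child.text.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim(),
        createdAtI: child.created_at_i,
      });
    }
    flatten(child, out);
  }
  return out;
}

export async function fetchComments(
  hnPostId: string,
  opts: {
    isUpdateStatus: boolean;
    lastCommentFetchTimestamp?: number;
    articleScrapingFailed: boolean;
  }
): Promise<{ comments: HnComment[]; newLastCommentFetchTimestamp?: number }> {
  const base = Number(process.env.HN_COMMENTS_COUNT_PER_POST ?? "10");
  const { data } = await axios.get(`https://hn.algolia.com/api/v1/items/${hnPostId}`);
  let comments = flatten(data, []);

  if (opts.articleScrapingFailed) {
    comments = comments.slice(0, 3 * base); // extended set when the article is missing
  } else if (opts.isUpdateStatus && opts.lastCommentFetchTimestamp) {
    comments = comments
      .filter((c) => c.createdAtI > opts.lastCommentFetchTimestamp!)
      .slice(0, base); // only comments newer than the last run
  } else {
    comments = comments.slice(0, base); // new post, successful scrape
  }

  // Carry forward the high-water mark for HackerNewsPostProcessState.
  const newLastCommentFetchTimestamp = comments.length
    ? Math.max(...comments.map((c) => c.createdAtI))
    : opts.lastCommentFetchTimestamp;
  return { comments, newLastCommentFetchTimestamp };
}
```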
Story 2.5: Content Aggregation and Formatting for Play.ai
- User Story Statement: As the System, I need to aggregate the collected Hacker News post data (titles), associated article content (full, truncated, or fallback summary), and comments (new, updated, or extended sets) for all top stories, and format this combined text according to the specified structure for the play.ai PlayNote API, including special phrasing for different post types (new, update, scrape-failed) and configurable article truncation.
- Acceptance Criteria (ACs):
  - A `contentFormatterService.ts` must be implemented.
  - Inputs: an array of processed HN post objects (with metadata, statuses, content, comments).
  - Output: a single string.
  - The string starts: "It's a top 10 countdown for [Current Date]".
  - Posts are sequenced in reverse rank order.
  - Formatting (new post): "Story [Rank] - [Article Title]. [Full/Truncated Article Text]. Comments Section. [Number] comments follow. Comment 1: [Text]..."
  - Formatting (repeat post): "Story [Rank] (previously Rank [OldRank] yesterday) - [Article Title]. We're bringing you new comments on this popular story. Comments Section. [Number] new comments follow. Comment 1: [Text]..."
  - Formatting (scrape-failed post): "Story [Rank] - [Article Title]. We couldn't retrieve the full article, but here's a summary if available and the latest comments. [Optional HN Summary]. Comments Section. [Number] comments follow. Comment 1: [Text]..."
  - Article Truncation: if `MAX_ARTICLE_LENGTH` (env var) is set and the article exceeds it, truncate while aiming to preserve the introduction and conclusion.
  - Graceful handling of missing parts.
  - Unit tests (Jest) must verify all formatting, truncation, data merging, and error handling. All tests must pass.
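A sketch of the formatter, with the `FormattedPost` input shape assumed and truncation done naively by keeping the head and tail of the article text:

```typescript
export interface FormattedPost {
  rank: number;
  title: string;
  articleText?: string;
  isUpdateStatus: boolean;
  oldRank?: number;
  articleScrapingFailed: boolean;
  comments: string[]; // plain-text comment bodies
}

export function formatDigest(posts: FormattedPost[], currentDate: string): string {
  const parts: string[] = [`It's a top 10 countdown for ${currentDate}.`];
  // Reverse rank order: count down from the last-ranked story to number one.
  for (const post of [...posts].sort((a, b) => b.rank - a.rank)) {
    const comments = post.comments.map((text, i) => `Comment ${i + 1}: ${text}`).join(" ");
    if (post.articleScrapingFailed) {
      parts.push(
        `Story ${post.rank} - ${post.title}. We couldn't retrieve the full article, ` +
          `but here's a summary if available and the latest comments. Comments Section. ` +
          `${post.comments.length} comments follow. ${comments}`
      );
    } else if (post.isUpdateStatus) {
      parts.push(
        `Story ${post.rank} (previously Rank ${post.oldRank} yesterday) - ${post.title}. ` +
          `We're bringing you new comments on this popular story. Comments Section. ` +
          `${post.comments.length} new comments follow. ${comments}`
      );
    } else {
      parts.push(
        `Story ${post.rank} - ${post.title}. ${truncate(post.articleText ?? "")} ` +
          `Comments Section. ${post.comments.length} comments follow. ${comments}`
      );
    }
  }
  return parts.join("\n\n");
}

// Naive truncation: keep the head and tail to preserve intro and conclusion.
function truncate(text: string): string {
  const max = Number(process.env.MAX_ARTICLE_LENGTH ?? "0");
  if (!max || text.length <= max) return text;
  const half = Math.floor(max / 2);
  return `${text.slice(0, half)} ... ${text.slice(-half)}`;
}
```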
Story 2.6 (REVISED): Implement Podcast Generation Status Polling via Play.ai API
- User Story Statement: As the System, after submitting a podcast generation job to Play.ai and receiving a `jobId`, I need an AWS Step Function state machine to periodically poll the Play.ai API for the status of this specific job, continuing until the job is reported as "completed" or "failed" (or a configurable max duration/attempts limit is reached), so the system can reliably determine when the podcast audio is ready or if an error occurred.
- Acceptance Criteria (ACs):
  - The AWS Step Function state machine (CDK-defined in Story 2.1) must manage the polling workflow.
  - Input: `jobId`.
  - States: Invoke Poller Lambda (calls the Play.ai status GET endpoint with `axios`), Wait (configurable `POLLING_INTERVAL_MINUTES`), Choice (evaluates status: "processing", "completed", "failed").
  - Loop if "processing". Stop if "completed" or "failed".
  - Max polling duration/attempts (configurable env vars `MAX_POLLING_DURATION_MINUTES`, `MAX_POLLING_ATTEMPTS`) must be enforced, treating expiry as failure.
  - If "completed": extract the `audioUrl` and trigger the next step (the Story 2.8 process) with that data.
  - If "failed"/timeout: log the event, record the failure (e.g., update the episode status in DDB via a Lambda), and terminate.
  - The Poller Lambda must handle API errors gracefully.
  - Unit tests for the Poller Lambda logic; the Step Function definition must be tested (locally if possible, or via AWS console tests). All tests must pass.
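In CDK, the poll/wait/choice loop might be wired roughly as follows. The function names, the 2-minute default interval, and the use of the state machine timeout as the max-duration guard are assumptions; attempt counting would be tracked in the poller's output:

```typescript
import { Duration } from "aws-cdk-lib";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";
import * as tasks from "aws-cdk-lib/aws-stepfunctions-tasks";
import * as lambda from "aws-cdk-lib/aws-lambda";
import { Construct } from "constructs";

// Hypothetical wiring: pollerFn calls the Play.ai status endpoint and returns
// { status, audioUrl?, jobId, episodeId }; storageFn is the Story 2.8 handler;
// failureFn records a failed episode in DynamoDB.
export function buildPollingStateMachine(
  scope: Construct,
  pollerFn: lambda.IFunction,
  storageFn: lambda.IFunction,
  failureFn: lambda.IFunction
): sfn.StateMachine {
  const poll = new tasks.LambdaInvoke(scope, "PollPlayAiStatus", {
    lambdaFunction: pollerFn,
    payloadResponseOnly: true, // state output is the poller's plain JSON result
  });
  const wait = new sfn.Wait(scope, "WaitBetweenPolls", {
    time: sfn.WaitTime.duration(
      Duration.minutes(Number(process.env.POLLING_INTERVAL_MINUTES ?? "2"))
    ),
  });
  const storeAudio = new tasks.LambdaInvoke(scope, "StoreAudioAndMetadata", {
    lambdaFunction: storageFn,
  });
  const recordFailure = new tasks.LambdaInvoke(scope, "RecordFailure", {
    lambdaFunction: failureFn,
  });

  const choice = new sfn.Choice(scope, "EvaluateStatus")
    .when(sfn.Condition.stringEquals("$.status", "completed"), storeAudio)
    .when(sfn.Condition.stringEquals("$.status", "failed"), recordFailure)
    .otherwise(wait.next(poll)); // still "processing": wait, then poll again

  poll.next(choice);

  return new sfn.StateMachine(scope, "PlayAiPollingStateMachine", {
    definitionBody: sfn.DefinitionBody.fromChainable(poll),
    // The execution timeout doubles as the max-polling-duration guard.
    timeout: Duration.minutes(Number(process.env.MAX_POLLING_DURATION_MINUTES ?? "60")),
  });
}
```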
Story 2.7: Submit Content to Play.ai PlayNote API & Initiate Podcast Generation
- User Story Statement: As the System, I need to securely submit the aggregated and formatted text content (using `sourceText`) to the Play.ai PlayNote API via an `application/json` request to initiate the podcast generation process, and I must capture the `jobId` returned by Play.ai, so that this `jobId` can be passed to the status polling mechanism (Step Function).
- Acceptance Criteria (ACs):
  - A `playAiService.ts` function must handle submission.
  - Input: formatted text (from Story 2.5).
  - Use `axios` for the `POST` to the Play.ai endpoint (e.g., `https://api.play.ai/api/v1/playnotes`).
  - Request `Content-Type: application/json`.
  - JSON body: `sourceText`, plus configurable `title`, `voiceId1`, `name1` (default "Angelo"), `voiceId2`, `name2` (default "Deedee"), and `styleGuidance` (default "podcast") from env vars.
  - Headers: `Authorization: Bearer <PLAY_AI_BEARER_TOKEN>`, `X-USER-ID: <PLAY_AI_USER_ID>` (from env vars).
  - No `webHookUrl` is sent.
  - On success: extract the `jobId`, log it, and initiate the polling Step Function (Story 2.6) with the `jobId` and other context (such as the internal `episodeId`).
  - Error handling for API submission; log and flag failures.
  - Unit tests (Jest) must mock `axios` and verify the API call, auth, payload, `jobId` extraction, Step Function initiation (mocked), and error handling. All tests must pass.
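A sketch of the submission path; the exact Play.ai response field carrying the job ID should be confirmed against their docs, and the env var names beyond those named in the ACs are illustrative:

```typescript
import axios from "axios";
import { SFNClient, StartExecutionCommand } from "@aws-sdk/client-sfn";

const sfnClient = new SFNClient({});

// Submits the formatted digest text to the PlayNote API and kicks off
// the polling Step Function (Story 2.6) with the returned jobId.
export async function submitToPlayAi(sourceText: string, episodeId: string): Promise<string> {
  const { data } = await axios.post(
    "https://api.play.ai/api/v1/playnotes", // endpoint per the AC example
    {
      sourceText,
      title: process.env.PLAY_AI_TITLE,
      voiceId1: process.env.PLAY_AI_VOICE_ID_1,
      name1: process.env.PLAY_AI_NAME_1 ?? "Angelo",
      voiceId2: process.env.PLAY_AI_VOICE_ID_2,
      name2: process.env.PLAY_AI_NAME_2 ?? "Deedee",
      styleGuidance: process.env.PLAY_AI_STYLE_GUIDANCE ?? "podcast",
      // No webHookUrl: completion is detected by polling instead.
    },
    {
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.PLAY_AI_BEARER_TOKEN}`,
        "X-USER-ID": process.env.PLAY_AI_USER_ID!,
      },
    }
  );

  // Confirm the actual response field name against Play.ai's documentation.
  const jobId: string = data.id ?? data.jobId;
  console.info("Play.ai job submitted", { jobId, episodeId });

  await sfnClient.send(
    new StartExecutionCommand({
      stateMachineArn: process.env.POLLING_STATE_MACHINE_ARN, // assumed env var
      input: JSON.stringify({ jobId, episodeId, status: "processing" }),
    })
  );
  return jobId;
}
```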
Story 2.8: Retrieve, Store Generated Podcast Audio & Persist Episode Metadata
- User Story Statement: As the System, once the podcast generation status polling (Story 2.6) indicates a Play.ai job is "completed," I need to download the generated audio file from the provided `audioUrl`, store this file in our designated S3 bucket, and then save/update all relevant metadata for the episode (including the S3 audio location, `episodeNumber`, `podcastGeneratedTitle`, `playAiSourceAudioUrl`, and source HN post information including their `lastCommentFetchTimestamp`) in our DynamoDB tables (`BmadDailyDigestEpisodes` and `HackerNewsPostProcessState`), so that the daily digest is fully processed, archived, and ready for access.
- Acceptance Criteria (ACs):
  - A `podcastStorageService.ts` function must be triggered by the Step Function (Story 2.6) on "completed" status, receiving the `audioUrl`, the Play.ai `jobId`, and the original context (such as the internal `episodeId` and the list of source HN posts with their metadata and processing status).
  - Use `axios` to download the audio from `audioUrl`.
  - Upload the audio to the S3 bucket (from Story 2.1), using a key such as `YYYY/MM/DD/episodeId.mp3`.
  - Prepare `Episode` metadata for the `BmadDailyDigestEpisodes` table: `episodeId` (UUID), `publicationDate` (YYYY-MM-DD), `episodeNumber` (sequential logic, TBD), `podcastGeneratedTitle` (from Play.ai or constructed), `audioS3Bucket`, `audioS3Key`, `playAiJobId`, `playAiSourceAudioUrl`, `sourceHNPosts` (array of objects: `{ hnPostId, title, originalArticleUrl, hnLink, isUpdateStatus, oldRank, articleScrapingFailed }`), `status: "Published"`, `createdAt`, `updatedAt`.
  - For each `hnPostId` in `sourceHNPosts`, update its corresponding item in the `HackerNewsPostProcessState` table with the `lastCommentFetchTimestamp` (current time or the max comment time from this run), `lastProcessedDate` (current date), and `lastKnownRank`. If `articleScrapingFailed` was false for this run, also update `lastSuccessfullyScrapedTimestamp`.
  - Save the `Episode` metadata to the `BmadDailyDigestEpisodes` DynamoDB table.
  - Error handling for the download, S3 upload, and DDB writes; failure must result in episode `status: "Failed"`.
  - Unit tests (Jest) must mock `axios` and the AWS SDK (S3, DynamoDB) and verify data handling, storage, metadata construction for both tables, and error cases. All tests must pass.
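The storage step might look like this sketch. The bucket/table env var names are assumptions, and the episode item is shown partially, with the remaining AC fields noted in comments:

```typescript
import axios from "axios";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

const s3 = new S3Client({});
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

export async function storeEpisodeAudio(input: {
  audioUrl: string;
  playAiJobId: string;
  episodeId: string;
  publicationDate: string; // YYYY-MM-DD
}): Promise<{ audioS3Bucket: string; audioS3Key: string }> {
  // Download the finished audio as a binary buffer.
  const { data } = await axios.get<ArrayBuffer>(input.audioUrl, {
    responseType: "arraybuffer",
  });

  // Key layout YYYY/MM/DD/episodeId.mp3, per the AC example.
  const audioS3Bucket = process.env.AUDIO_BUCKET_NAME!;
  const audioS3Key = `${input.publicationDate.replace(/-/g, "/")}/${input.episodeId}.mp3`;
  await s3.send(
    new PutObjectCommand({
      Bucket: audioS3Bucket,
      Key: audioS3Key,
      Body: Buffer.from(data),
      ContentType: "audio/mpeg",
    })
  );

  const now = new Date().toISOString();
  await ddb.send(
    new PutCommand({
      TableName: process.env.EPISODES_TABLE_NAME,
      Item: {
        episodeId: input.episodeId,
        publicationDate: input.publicationDate,
        playAiJobId: input.playAiJobId,
        audioS3Bucket,
        audioS3Key,
        status: "Published",
        createdAt: now,
        updatedAt: now,
        // episodeNumber, podcastGeneratedTitle, playAiSourceAudioUrl, and
        // sourceHNPosts are added here per the ACs; the corresponding
        // HackerNewsPostProcessState items are updated in the same step.
      },
    })
  );
  return { audioS3Bucket, audioS3Key };
}
```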
Story 2.9: Daily Workflow Orchestration & Scheduling
- User Story Statement: As the System Administrator, I need the entire daily backend workflow (Stories 2.2 through 2.8) to be fully orchestrated by the primary AWS Step Function state machine and automatically scheduled to run once per day using Amazon EventBridge Scheduler, ensuring it handles re-runs for the same day by overwriting/starting over (for MVP), so that "BMad Daily Digest" episodes are produced consistently and reliably.
- Acceptance Criteria (ACs):
  - The primary AWS Step Function state machine must orchestrate the sequence: Fetch HN Posts & Identify Repeats (2.2); for each post, conditionally Scrape Article (2.3) and Fetch Comments (2.4); then Aggregate & Format Content (2.5); then Submit to Play.ai and capture the `jobId` (2.7); then initiate/manage Polling (2.6, using the `jobId`); on "completed" polling, trigger Retrieve & Store Audio/Metadata (2.8).
  - The state machine must manage data flow (inputs/outputs) between steps correctly.
  - Overall workflow error handling: a critical step failure marks the state machine execution as "Failed" and logs comprehensively. Steps use retries for transient errors.
  - Idempotency (MVP): re-running for the same `publicationDate` must re-process and effectively overwrite previous data for that date.
  - An Amazon EventBridge Scheduler rule (CDK-defined) must trigger the main Step Function daily at 12:00 UTC (default, configurable via `DAILY_JOB_SCHEDULE_UTC_CRON`).
  - A successful end-to-end run must be demonstrated (e.g., processing sample data through the pipeline).
  - The Step Function execution history must provide a clear audit trail of steps and data.
  - Unit tests for any new orchestrator-specific Lambda functions (if not already covered elsewhere). All tests must pass.
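As a sketch of the scheduling piece, the EventBridge Scheduler resource can be declared with the L1 `CfnSchedule` construct; the role wiring and construct IDs are assumptions, and the cron default matches the AC:

```typescript
import * as scheduler from "aws-cdk-lib/aws-scheduler";
import * as iam from "aws-cdk-lib/aws-iam";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";
import { Construct } from "constructs";

export function scheduleDailyDigest(scope: Construct, stateMachine: sfn.IStateMachine): void {
  // Role that EventBridge Scheduler assumes to start the state machine.
  const role = new iam.Role(scope, "DigestScheduleRole", {
    assumedBy: new iam.ServicePrincipal("scheduler.amazonaws.com"),
  });
  stateMachine.grantStartExecution(role);

  new scheduler.CfnSchedule(scope, "DailyDigestSchedule", {
    flexibleTimeWindow: { mode: "OFF" },
    // Default 12:00 UTC daily; DAILY_JOB_SCHEDULE_UTC_CRON overrides at synth time.
    scheduleExpression: process.env.DAILY_JOB_SCHEDULE_UTC_CRON ?? "cron(0 12 * * ? *)",
    target: {
      arn: stateMachine.stateMachineArn,
      roleArn: role.roleArn,
    },
  });
}
```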