Epic 2: Article Scraping
This document is a granulated shard from the main "BETA-V3/v3-demos/full-stack-app-demo/8-prd-po-updated.md" focusing on "Epic 2: Article Scraping".
-
Goal: Implement the functionality to scrape and store linked articles from HN posts, enriching the data available for summarization and the newsletter. This functionality is triggered by database events, can be tested via API/CLI (if retained), and in turn emits a database event to trigger subsequent processing.
-
Story 2.1: As a system, I want to identify URLs within the top 30 (configurable via environment variable) Hacker News posts, so that I can extract the content of linked articles.
- Acceptance Criteria:
- The system parses the top N (configurable via env var) Hacker News posts to identify URLs.
- The system filters out URLs that are not suitable for article scraping (e.g., links to images, videos, or other non-HTML media); see the filtering sketch below.
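A minimal filtering sketch in TypeScript, assuming the top-N posts have already been fetched. The `HnPost` shape, the `HN_POSTS_LIMIT` variable name, and the function names are illustrative assumptions, not defined by this PRD:

```typescript
// Illustrative sketch: select scrapable article URLs from already-fetched HN posts.
// `HnPost`, `isScrapableUrl`, and `selectArticleUrls` are hypothetical names.
interface HnPost {
  hn_post_id: string;
  title: string;
  url: string | null; // self posts ("Ask HN") carry no external URL
}

// Obvious non-article targets to skip, per the filtering AC.
const SKIPPED_EXTENSIONS = /\.(png|jpe?g|gif|svg|mp4|webm|mp3|pdf|zip)$/i;

function isScrapableUrl(url: string | null): boolean {
  if (!url) return false;
  let parsed: URL;
  try {
    parsed = new URL(url);
  } catch {
    return false; // malformed URL
  }
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") return false;
  return !SKIPPED_EXTENSIONS.test(parsed.pathname);
}

// Top N is configurable via environment variable, as the AC requires
// (inside a Supabase Edge Function this would be Deno.env.get(...)).
const topN = Number(process.env.HN_POSTS_LIMIT ?? 30);

function selectArticleUrls(posts: HnPost[]): Array<{ hn_post_id: string; url: string }> {
  return posts
    .slice(0, topN)
    .filter((post) => isScrapableUrl(post.url))
    .map((post) => ({ hn_post_id: post.hn_post_id, url: post.url as string }));
}
```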
-
Story 2.2: As a system, I want to scrape the content of the identified article URLs using Cheerio, so that I can provide summaries in the newsletter.
- Acceptance Criteria:
- The system scrapes the content from the identified article URLs using Cheerio.
- The system extracts relevant content such as the article title, author, publication date, and main text.
- The system handles potential issues during scraping, such as website errors or changes in website structure, logging errors for review (see the extraction sketch below).
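A sketch of the Cheerio extraction under the assumption that pages expose common metadata tags; the selectors and the `ExtractedArticle` shape are illustrative, since real markup varies per site:

```typescript
import * as cheerio from "cheerio";

// Illustrative extraction sketch. Real pages vary wildly, which is why the AC
// above requires logging scrape failures for later review.
interface ExtractedArticle {
  title: string | null;
  author: string | null;
  publicationDate: string | null;
  mainText: string | null;
}

async function scrapeArticle(url: string): Promise<ExtractedArticle> {
  const res = await fetch(url);
  if (!res.ok) {
    throw new Error(`Fetch failed for ${url}: HTTP ${res.status}`);
  }
  const $ = cheerio.load(await res.text());

  // Prefer structured metadata; fall back to common DOM locations.
  const title =
    $('meta[property="og:title"]').attr("content") ??
    ($("title").text().trim() || null);
  const author =
    $('meta[name="author"]').attr("content") ??
    ($('[rel="author"]').first().text().trim() || null);
  const publicationDate =
    $('meta[property="article:published_time"]').attr("content") ?? null;

  // Crude main-text heuristic: paragraphs inside <article> if present, else all <p>.
  const paragraphs = $("article p").length ? $("article p") : $("p");
  const mainText =
    paragraphs.map((_, el) => $(el).text().trim()).get().join("\n\n") || null;

  return { title, author, publicationDate, mainText };
}
```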
-
Story 2.3: As a system, I want to store the scraped article content in the Supabase database, associated with the corresponding Hacker News post and workflow run, so that it can be used for summarization and newsletter generation.
- Acceptance Criteria:
- Scraped article content is stored in the `scraped_articles` table, linked to the `hn_post_id` and the current `workflow_run_id` (see the persistence sketch after this list).
- The system ensures that the stored data includes all extracted information (title, author, date, text).
- The `scraping_status` and any `error_message` are recorded in the `scraped_articles` table.
- Upon completion of scraping an article (success or failure), the service updates `workflow_runs.details` (e.g., incrementing scraped counts) via `WorkflowTrackerService`.
- A Supabase migration for the `scraped_articles` table (as defined in `architecture.txt`) is created and applied before data operations.
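A hypothetical persistence sketch using `supabase-js`; the column names follow the ACs above, but the authoritative schema is the `architecture.txt` migration, and the `WorkflowTrackerService` call shape is assumed:

```typescript
import { createClient } from "@supabase/supabase-js";

// Hypothetical persistence sketch. Column names follow the ACs above; the
// authoritative schema is the migration defined in architecture.txt.
const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

interface ScrapeResult {
  hnPostId: string;
  workflowRunId: string;
  title: string | null;
  author: string | null;
  publicationDate: string | null;
  mainText: string | null;
  status: "success" | "failure";
  errorMessage?: string;
}

async function storeScrapedArticle(result: ScrapeResult): Promise<void> {
  const { error } = await supabase.from("scraped_articles").insert({
    hn_post_id: result.hnPostId,
    workflow_run_id: result.workflowRunId,
    title: result.title,
    author: result.author,
    publication_date: result.publicationDate,
    main_text: result.mainText,
    scraping_status: result.status,
    error_message: result.errorMessage ?? null,
  });
  if (error) throw error;

  // Per the AC, workflow_runs.details is then updated via WorkflowTrackerService
  // (Epic 1); a hypothetical call shape:
  // await workflowTracker.incrementDetail(result.workflowRunId, "articles_scraped");
}
```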
-
Story 2.4: As a developer, I want to trigger the article scraping process via the API and CLI, so that I can manually initiate it for testing and debugging.
- Architect's Note: This story might become redundant if the main workflow trigger (Story 1.3) handles the entire pipeline initiation and individual service testing is done via direct function invocation or unit/integration tests.
- Acceptance Criteria:
- The API endpoint can trigger the article scraping process.
- The CLI command can trigger the article scraping process locally.
- The system logs the start and completion of the scraping process, including any errors encountered.
- All API requests and CLI command executions are logged, including timestamps and any relevant data.
- The system handles partial execution gracefully (i.e., if triggered before Epic 1 components like `WorkflowTrackerService` are available, it logs a message and exits); see the trigger sketch below.
- If retained for isolated testing, all scraping operations initiated via this trigger must be associated with a valid `workflow_run_id` and update the `workflow_runs` table accordingly via `WorkflowTrackerService`.
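If this story is retained, the manual trigger could look like the following sketch, shaped as a Supabase Edge Function handler; the endpoint, payload, and `runScrapingPipeline` entry point are illustrative assumptions:

```typescript
// Hypothetical manual trigger, shaped as a Supabase Edge Function handler.
// The endpoint name and payload are illustrative, not defined by this PRD.
Deno.serve(async (req: Request) => {
  const body = await req.json().catch(() => ({}));
  const workflowRunId: string | undefined = body.workflow_run_id;

  // Graceful partial-execution guard, per the AC above.
  if (!workflowRunId) {
    console.log("scrape trigger: no workflow_run_id supplied; exiting");
    return new Response("workflow_run_id is required", { status: 400 });
  }

  console.log(`scraping started: workflow_run_id=${workflowRunId} at ${new Date().toISOString()}`);
  try {
    // await runScrapingPipeline(workflowRunId); // hypothetical service entry point
    console.log(`scraping completed: workflow_run_id=${workflowRunId}`);
    return new Response("ok");
  } catch (err) {
    console.error(`scraping failed: workflow_run_id=${workflowRunId}`, err);
    return new Response("scraping failed", { status: 500 });
  }
});
```

A local CLI run can then hit the same endpoint (e.g., with curl), so both trigger paths share one logged code path.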
-
Story 2.5 (New): Implement Database Event/Webhook: `scraped_articles` Success to Summarization Service
- Goal: To ensure that the successful scraping and storage of an article in `scraped_articles` automatically triggers the `SummarizationService`.
- Acceptance Criteria:
- A Supabase database trigger or webhook mechanism is implemented on the `scraped_articles` table (e.g., on INSERT or UPDATE where `scraping_status` is 'success'); see the webhook sketch below.
- The trigger successfully invokes the `SummarizationService` (Supabase Function).
- The invocation passes necessary parameters like `scraped_article_id` and `workflow_run_id` to the `SummarizationService`.
- The mechanism is robust and includes error handling/logging for the trigger/webhook itself.
- Unit/integration tests are created to verify the trigger fires correctly and the service is invoked with correct parameters.
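A sketch of the receiving end, assuming the Supabase Database Webhooks payload shape (`type`, `table`, `schema`, `record`, `old_record`); the `id` primary-key column and `summarizeArticle` are assumptions:

```typescript
// Hypothetical receiving end of the database event. Supabase Database Webhooks
// POST a JSON payload with `type`, `table`, `schema`, `record`, and `old_record`;
// the field names inside `record` follow this epic's ACs.
Deno.serve(async (req: Request) => {
  const payload = await req.json();

  // React only to rows that have just reached a successful scrape.
  if (
    payload.table !== "scraped_articles" ||
    payload.record?.scraping_status !== "success"
  ) {
    return new Response("ignored");
  }

  const scrapedArticleId = payload.record.id; // assumes an `id` primary key
  const workflowRunId = payload.record.workflow_run_id;
  console.log(
    `summarization triggered: scraped_article_id=${scrapedArticleId}, workflow_run_id=${workflowRunId}`
  );

  // await summarizeArticle(scrapedArticleId, workflowRunId); // hypothetical
  return new Response("ok");
});
```

The same event could instead be raised from a Postgres trigger that calls the function over HTTP via the `pg_net` extension; either way, trigger-side failures should be logged so they remain visible, per the error-handling AC.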