# Epic 3: Article Scraping & Persistence
**Goal:** Implement a best-effort article scraping mechanism to fetch and extract plain text content from the external URLs associated with fetched HN stories. Handle failures gracefully and persist successfully scraped text locally. Implement a stage testing utility for scraping.
## Story List
### Story 3.1: Implement Basic Article Scraper Module
- **User Story / Goal:** As a developer, I want a module that attempts to fetch HTML from a URL and extract the main article text using basic methods, handling common failures gracefully, so article content can be prepared for summarization.
- **Detailed Requirements:**
- Create a new module: `src/scraper/articleScraper.ts`.
- Add a suitable HTML parsing/extraction library dependency (e.g., `@extractus/article-extractor` recommended for simplicity, or `cheerio` for more control). Run `npm install @extractus/article-extractor --save-prod` (or chosen alternative).
- Implement an async function `scrapeArticle(url: string): Promise<string | null>` within the module.
- Inside the function (see the sketch after this list):
- Use native `fetch` to retrieve content from the `url`. Set a reasonable timeout (e.g., 10-15 seconds). Include a `User-Agent` header to mimic a browser.
- Handle potential `fetch` errors (network errors, timeouts) using `try...catch`.
- Check the `response.ok` status. If not okay, log an error and return `null`.
- Check the `Content-Type` header of the response. If it does not indicate HTML (e.g., does not include `text/html`), log a warning and return `null`.
- If HTML is received, attempt to extract the main article text using the chosen library (`article-extractor` preferred).
- Wrap the extraction logic in a `try...catch` to handle library-specific errors.
- Return the extracted plain text string if successful. Ensure it's just text, not HTML markup.
- Return `null` if extraction fails or results in empty content.
- Log all significant events, errors, or reasons for returning null (e.g., "Scraping URL...", "Fetch failed:", "Non-HTML content type:", "Extraction failed:", "Successfully extracted text") using the logger utility.
- Define TypeScript types/interfaces as needed.
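The following is a minimal sketch of how `scrapeArticle` could satisfy these requirements, not a definitive implementation. It assumes Node 18+ (for native `fetch` and `AbortSignal.timeout`) and a shared logger at `src/logger.ts` (a hypothetical path; use the logger utility from Epic 1). Since `article-extractor` returns sanitized HTML in `article.content`, a naive tag-strip is shown to meet the plain-text requirement:

```ts
// Sketch of src/scraper/articleScraper.ts (assumes Node 18+ for native fetch
// and AbortSignal.timeout; the logger import path is an assumption).
import { extract } from '@extractus/article-extractor';
import { logger } from '../logger'; // hypothetical shared logger utility

const FETCH_TIMEOUT_MS = 15_000; // within the 10-15s range suggested above
const USER_AGENT = 'Mozilla/5.0 (compatible; ArticleScraper/0.1)';

export async function scrapeArticle(url: string): Promise<string | null> {
  logger.info(`Scraping URL: ${url}`);
  let html: string;
  try {
    const response = await fetch(url, {
      headers: { 'User-Agent': USER_AGENT },
      signal: AbortSignal.timeout(FETCH_TIMEOUT_MS),
    });
    if (!response.ok) {
      logger.error(`Fetch failed: HTTP ${response.status} for ${url}`);
      return null;
    }
    const contentType = response.headers.get('content-type') ?? '';
    if (!contentType.includes('text/html')) {
      logger.warn(`Non-HTML content type: ${contentType || 'unknown'} (${url})`);
      return null;
    }
    html = await response.text();
  } catch (err) {
    logger.error(`Fetch failed: ${url}`, err);
    return null;
  }
  try {
    // extract() also accepts raw HTML; article.content is sanitized HTML,
    // so a naive tag-strip follows to satisfy the plain-text requirement.
    const article = await extract(html);
    const text = article?.content
      ?.replace(/<[^>]+>/g, ' ')
      .replace(/\s+/g, ' ')
      .trim();
    if (!text) {
      logger.warn(`Extraction returned empty content for ${url}`);
      return null;
    }
    logger.info(`Successfully extracted text (${text.length} chars) from ${url}`);
    return text;
  } catch (err) {
    logger.error(`Extraction failed: ${url}`, err);
    return null;
  }
}
```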
- **Acceptance Criteria (ACs):**
- AC1: The `articleScraper.ts` module exists and exports the `scrapeArticle` function.
- AC2: The chosen scraping library (e.g., `@extractus/article-extractor`) is added to `dependencies` in `package.json`.
- AC3: `scrapeArticle` uses native `fetch` with a timeout and User-Agent header.
- AC4: `scrapeArticle` correctly handles fetch errors, non-OK responses, and non-HTML content types by logging and returning `null`.
- AC5: `scrapeArticle` uses the chosen library to attempt text extraction from valid HTML content.
- AC6: `scrapeArticle` returns the extracted plain text on success, and `null` on any failure (fetch, non-HTML, extraction error, empty result).
- AC7: Relevant logs are produced for success, failure modes, and errors encountered during the process.
---
### Story 3.2: Integrate Article Scraping into Main Workflow
- **User Story / Goal:** As a developer, I want to integrate the article scraper into the main workflow (`src/index.ts`), attempting to scrape the article for each HN story that has a valid URL after its data is fetched, so that article content is available in memory for the persistence and summarization stages.
- **Detailed Requirements:**
- Modify the main execution flow in `src/index.ts`.
- Import the `scrapeArticle` function from `src/scraper/articleScraper.ts`.
- Within the main loop iterating through the fetched stories (after comments are fetched in Epic 2):
- Check if `story.url` exists and appears to be a valid HTTP/HTTPS URL. A simple check for starting with `http://` or `https://` is sufficient.
- If the URL is missing or invalid, log a warning ("Skipping scraping for story {storyId}: Missing or invalid URL") and proceed to the next story's processing step.
- If a valid URL exists, log ("Attempting to scrape article for story {storyId} from {story.url}").
- Call `await scrapeArticle(story.url)`.
- Store the result (the extracted text string or `null`) in memory, associated with the story object (e.g., add property `articleContent: string | null`).
- Log the outcome clearly (e.g., "Successfully scraped article for story {storyId}", "Failed to scrape article for story {storyId}"). A sketch of this integration step follows the list.
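A hedged sketch of this loop, assuming it runs inside the async main flow of `src/index.ts`; the `Story` shape and the `stories` array are assumed carry-overs from Epics 1-2, not defined by this story:

```ts
// Inside the async main flow of src/index.ts, after comments are fetched
// (Epic 2). The Story shape below is an assumed carry-over from Epics 1-2.
import { scrapeArticle } from './scraper/articleScraper';
import { logger } from './logger'; // hypothetical shared logger utility

interface Story {
  storyId: string;
  url?: string;
  articleContent?: string | null;
  // ...title, comments, etc. from earlier epics
}

declare const stories: Story[]; // produced by the Epic 1/2 fetch steps

for (const story of stories) {
  const url = story.url;
  // Simple validity check per the requirement above.
  if (!url || !(url.startsWith('http://') || url.startsWith('https://'))) {
    logger.warn(`Skipping scraping for story ${story.storyId}: Missing or invalid URL`);
    story.articleContent = null;
    continue;
  }
  logger.info(`Attempting to scrape article for story ${story.storyId} from ${url}`);
  story.articleContent = await scrapeArticle(url);
  logger.info(
    story.articleContent !== null
      ? `Successfully scraped article for story ${story.storyId}`
      : `Failed to scrape article for story ${story.storyId}`,
  );
}
```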
- **Acceptance Criteria (ACs):**
- AC1: Running `npm run dev` executes Epic 1 & 2 steps, and then attempts article scraping for stories with valid URLs.
- AC2: Stories with missing or invalid URLs are skipped, and a corresponding log message is generated.
- AC3: For stories with valid URLs, the `scrapeArticle` function is called.
- AC4: Logs clearly indicate the start and success/failure outcome of the scraping attempt for each relevant story.
- AC5: Story objects held in memory after this stage contain an `articleContent` property holding the scraped text (string) or `null` if scraping was skipped or failed.
---
### Story 3.3: Persist Scraped Article Text Locally
- **User Story / Goal:** As a developer, I want to save successfully scraped article text to a separate local file for each story, so that the text content is available as input for the summarization stage.
- **Detailed Requirements:**
- Import Node.js `fs` and `path` modules if not already present in `src/index.ts`.
- In the main workflow (`src/index.ts`), immediately after a successful call to `scrapeArticle` for a story (where the result is a non-null string):
- Retrieve the full path to the current date-stamped output directory.
- Construct the filename: `{storyId}_article.txt`.
- Construct the full file path using `path.join()`.
- Get the successfully scraped article text string (`articleContent`).
- Use `fs.writeFileSync(fullPath, articleContent, 'utf-8')` to save the text to the file. Wrap in `try...catch` for file system errors.
- Log the successful saving of the file (e.g., "Saved scraped article text to {filename}") or any file writing errors encountered.
- Ensure *no* `_article.txt` file is created if `scrapeArticle` returned `null` (due to skipping or failure); see the sketch after this list.
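A short sketch of the save step, assuming `outputDir` already holds the resolved date-stamped directory path from Epic 1 and `story.articleContent` was populated in Story 3.2:

```ts
// Save step in src/index.ts, right after a successful scrape. `outputDir` is
// assumed to hold the resolved date-stamped directory path from Epic 1.
import fs from 'node:fs';
import path from 'node:path';

if (story.articleContent) {
  const filename = `${story.storyId}_article.txt`;
  try {
    fs.writeFileSync(path.join(outputDir, filename), story.articleContent, 'utf-8');
    logger.info(`Saved scraped article text to ${filename}`);
  } catch (err) {
    logger.error(`Failed to write ${filename}`, err);
  }
}
// When articleContent is null (skipped or failed), nothing is written,
// so no empty _article.txt files can appear.
```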
- **Acceptance Criteria (ACs):**
- AC1: After running `npm run dev`, the date-stamped output directory contains `_article.txt` files *only* for those stories where `scrapeArticle` succeeded and returned text content.
- AC2: The name of each article text file is `{storyId}_article.txt`.
- AC3: The content of each `_article.txt` file is the plain text string returned by `scrapeArticle`.
- AC4: Logs confirm the successful writing of each `_article.txt` file or report specific file writing errors.
- AC5: No empty `_article.txt` files are created. Files only exist if scraping was successful.
---
### Story 3.4: Implement Stage Testing Utility for Scraping
- **User Story / Goal:** As a developer, I want a separate script/command to test the article scraping logic using HN story data from local files, allowing independent testing and debugging of the scraper.
- **Detailed Requirements:**
- Create a new standalone script file: `src/stages/scrape_articles.ts`.
- Import necessary modules: `fs`, `path`, `logger`, `config`, `scrapeArticle`.
- The script should:
- Initialize the logger.
- Load configuration (to get `OUTPUT_DIR_PATH`).
- Determine the target date-stamped directory path (e.g., `${OUTPUT_DIR_PATH}/YYYY-MM-DD`, using the current date or potentially an optional CLI argument). Ensure this directory exists.
- Read the directory contents and identify all `{storyId}_data.json` files.
- For each `_data.json` file found:
- Read and parse the JSON content.
- Extract the `storyId` and `url`.
- If a valid `url` exists, call `await scrapeArticle(url)`.
- If scraping succeeds (returns text), save the text to `{storyId}_article.txt` in the same directory (using logic from Story 3.3). Overwrite if the file exists.
- Log the progress and outcome (skip/success/fail) for each story processed.
- Add a new script command to `package.json`: `"stage:scrape": "ts-node src/stages/scrape_articles.ts"`. Consider adding argument parsing later if needed to specify a date/directory. A sketch of the script follows this list.
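A minimal sketch of the stage script under the assumptions stated in its comments; `loadConfig`, its `OUTPUT_DIR_PATH` field, and the logger path are hypothetical names standing in for the project's actual config and logger utilities:

```ts
// Sketch of src/stages/scrape_articles.ts. The loadConfig helper, its
// OUTPUT_DIR_PATH field, and the logger path are assumptions.
import fs from 'node:fs';
import path from 'node:path';
import { logger } from '../logger';       // hypothetical logger utility
import { loadConfig } from '../config';   // hypothetical config loader
import { scrapeArticle } from '../scraper/articleScraper';

async function main(): Promise<void> {
  const config = loadConfig();
  // Default to today's date (UTC); an optional CLI argument overrides it.
  const dateStamp = process.argv[2] ?? new Date().toISOString().slice(0, 10);
  const targetDir = path.join(config.OUTPUT_DIR_PATH, dateStamp);
  if (!fs.existsSync(targetDir)) {
    logger.error(`Target directory does not exist: ${targetDir}`);
    process.exit(1);
  }

  const dataFiles = fs.readdirSync(targetDir).filter((f) => f.endsWith('_data.json'));
  logger.info(`Found ${dataFiles.length} _data.json files in ${targetDir}`);

  for (const file of dataFiles) {
    const { storyId, url } = JSON.parse(
      fs.readFileSync(path.join(targetDir, file), 'utf-8'),
    );
    if (!url || !(url.startsWith('http://') || url.startsWith('https://'))) {
      logger.warn(`Skipping ${storyId}: missing or invalid URL`);
      continue;
    }
    const text = await scrapeArticle(url);
    if (text) {
      // Overwrites any existing _article.txt, per the requirement above.
      fs.writeFileSync(path.join(targetDir, `${storyId}_article.txt`), text, 'utf-8');
      logger.info(`Saved article text for story ${storyId}`);
    } else {
      logger.warn(`Scraping failed for story ${storyId}`);
    }
  }
}

main().catch((err) => {
  logger.error('stage:scrape failed', err);
  process.exit(1);
});
```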
- **Acceptance Criteria (ACs):**
- AC1: The file `src/stages/scrape_articles.ts` exists.
- AC2: The script `stage:scrape` is defined in `package.json`.
- AC3: Running `npm run stage:scrape` (assuming a directory with `_data.json` files exists from a previous `stage:fetch` run) reads these files.
- AC4: The script calls `scrapeArticle` for stories with valid URLs found in the JSON files.
- AC5: The script creates/updates `{storyId}_article.txt` files in the target directory corresponding to successfully scraped articles.
- AC6: The script logs its actions (reading files, attempting scraping, saving results) for each story ID processed.
- AC7: The script operates solely based on local `_data.json` files and fetching from external article URLs; it does not call the Algolia HN API.
## Change Log
| Change | Date | Version | Description | Author |
| ------------- | ---------- | ------- | ------------------------- | -------------- |
| Initial Draft | 2025-05-04 | 0.1 | First draft of Epic 3 | 2-pm |