# Epic 3: Article Scraping & Persistence
**Goal:** Implement a best-effort article scraping mechanism to fetch and extract plain text content from the external URLs associated with fetched HN stories. Handle failures gracefully and persist successfully scraped text locally. Implement a stage testing utility for scraping.
## Story List
### Story 3.1: Implement Basic Article Scraper Module
- **User Story / Goal:** As a developer, I want a module that attempts to fetch HTML from a URL and extract the main article text using basic methods, handling common failures gracefully, so that article content can be prepared for summarization.
- **Detailed Requirements:**
  - Create a new module: `src/scraper/articleScraper.ts`.
  - Add a suitable HTML parsing/extraction library dependency (e.g., `@extractus/article-extractor`, recommended for simplicity, or `cheerio` for more control). Run `npm install @extractus/article-extractor --save-prod` (or the chosen alternative).
  - Implement an async function `scrapeArticle(url: string): Promise<string | null>` within the module (a sketch follows this story's ACs).
  - Inside the function:
    - Use native `fetch` to retrieve content from the `url`. Set a reasonable timeout (e.g., 10-15 seconds). Include a `User-Agent` header to mimic a browser.
    - Handle potential `fetch` errors (network errors, timeouts) using `try...catch`.
    - Check the `response.ok` status. If not OK, log an error and return `null`.
    - Check the `Content-Type` header of the response. If it does not indicate HTML (e.g., does not include `text/html`), log a warning and return `null`.
    - If HTML is received, attempt to extract the main article text using the chosen library (`article-extractor` preferred).
    - Wrap the extraction logic in a `try...catch` to handle library-specific errors.
    - Return the extracted plain text string if successful. Ensure it is plain text, not HTML markup.
    - Return `null` if extraction fails or results in empty content.
  - Log all significant events, errors, or reasons for returning `null` (e.g., "Scraping URL...", "Fetch failed:", "Non-HTML content type:", "Extraction failed:", "Successfully extracted text") using the logger utility.
  - Define TypeScript types/interfaces as needed.
- **Acceptance Criteria (ACs):**
  - AC1: The `articleScraper.ts` module exists and exports the `scrapeArticle` function.
  - AC2: The chosen scraping library (e.g., `@extractus/article-extractor`) is added to `dependencies` in `package.json`.
  - AC3: `scrapeArticle` uses native `fetch` with a timeout and a `User-Agent` header.
  - AC4: `scrapeArticle` correctly handles fetch errors, non-OK responses, and non-HTML content types by logging and returning `null`.
  - AC5: `scrapeArticle` uses the chosen library to attempt text extraction from valid HTML content.
  - AC6: `scrapeArticle` returns the extracted plain text on success, and `null` on any failure (fetch error, non-HTML content, extraction error, empty result).
  - AC7: Relevant logs are produced for success, failure modes, and errors encountered during the process.
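A minimal sketch of what `scrapeArticle` could look like, assuming Node 18+ (for native `fetch` and `AbortSignal.timeout`) and `@extractus/article-extractor`. The `User-Agent` string is a placeholder, and `console` stands in for the project's logger utility:

```typescript
// src/scraper/articleScraper.ts -- illustrative sketch, not the final module.
import { extractFromHtml } from '@extractus/article-extractor';

export async function scrapeArticle(url: string): Promise<string | null> {
  try {
    // Abort the request after 15 seconds; mimic a browser via User-Agent.
    const response = await fetch(url, {
      signal: AbortSignal.timeout(15_000),
      headers: { 'User-Agent': 'Mozilla/5.0 (compatible; article-scraper)' },
    });
    if (!response.ok) {
      console.error(`Fetch failed: ${response.status} for ${url}`);
      return null;
    }
    const contentType = response.headers.get('content-type') ?? '';
    if (!contentType.includes('text/html')) {
      console.warn(`Non-HTML content type: ${contentType} for ${url}`);
      return null;
    }
    const html = await response.text();
    // extractFromHtml returns article metadata whose `content` field is HTML,
    // so tags are stripped crudely here to satisfy the plain-text requirement.
    const article = await extractFromHtml(html, url);
    const text = article?.content
      ?.replace(/<[^>]+>/g, ' ')
      .replace(/\s+/g, ' ')
      .trim();
    if (!text) {
      console.warn(`Extraction failed or empty for ${url}`);
      return null;
    }
    console.info(`Successfully extracted text from ${url}`);
    return text;
  } catch (err) {
    console.error(`Scraping error for ${url}:`, err);
    return null;
  }
}
```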
### Story 3.2: Integrate Article Scraping into Main Workflow
- **User Story / Goal:** As a developer, I want to integrate the article scraper into the main workflow (`src/index.ts`), attempting to scrape the article for each HN story that has a valid URL, after fetching its data.
- **Detailed Requirements:**
  - Modify the main execution flow in `src/index.ts`.
  - Import the `scrapeArticle` function from `src/scraper/articleScraper.ts`.
  - Within the main loop iterating through the fetched stories (after comments are fetched in Epic 2; see the sketch after this story's ACs):
    - Check if `story.url` exists and appears to be a valid HTTP/HTTPS URL. A simple check for a `http://` or `https://` prefix is sufficient.
    - If the URL is missing or invalid, log a warning ("Skipping scraping for story {storyId}: Missing or invalid URL") and proceed to the next story's processing step.
    - If a valid URL exists, log ("Attempting to scrape article for story {storyId} from {story.url}").
    - Call `await scrapeArticle(story.url)`.
    - Store the result (the extracted text string or `null`) in memory, associated with the story object (e.g., add a property `articleContent: string | null`).
    - Log the outcome clearly (e.g., "Successfully scraped article for story {storyId}", "Failed to scrape article for story {storyId}").
- **Acceptance Criteria (ACs):**
  - AC1: Running `npm run dev` executes the Epic 1 & 2 steps, and then attempts article scraping for stories with valid URLs.
  - AC2: Stories with missing or invalid URLs are skipped, and a corresponding log message is generated.
  - AC3: For stories with valid URLs, the `scrapeArticle` function is called.
  - AC4: Logs clearly indicate the start and success/failure outcome of the scraping attempt for each relevant story.
  - AC5: Story objects held in memory after this stage contain an `articleContent` property holding the scraped text (string) or `null` if scraping was skipped or failed.
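An illustrative fragment of the integration loop, assuming a story type with `storyId` and `url` fields and a `logger` utility from the earlier epics (these names are hypothetical):

```typescript
// Inside the main workflow in src/index.ts, after comments are fetched.
for (const story of stories) {
  if (!story.url || !/^https?:\/\//.test(story.url)) {
    logger.warn(`Skipping scraping for story ${story.storyId}: Missing or invalid URL`);
    story.articleContent = null;
    continue;
  }
  logger.info(`Attempting to scrape article for story ${story.storyId} from ${story.url}`);
  story.articleContent = await scrapeArticle(story.url);
  if (story.articleContent !== null) {
    logger.info(`Successfully scraped article for story ${story.storyId}`);
  } else {
    logger.warn(`Failed to scrape article for story ${story.storyId}`);
  }
}
```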
### Story 3.3: Persist Scraped Article Text Locally
- **User Story / Goal:** As a developer, I want to save successfully scraped article text to a separate local file for each story, so that the text content is available as input for the summarization stage.
- **Detailed Requirements:**
  - Import the Node.js `fs` and `path` modules if not already present in `src/index.ts`.
  - In the main workflow (`src/index.ts`), immediately after a successful call to `scrapeArticle` for a story (where the result is a non-null string; see the sketch after this story's ACs):
    - Retrieve the full path to the current date-stamped output directory.
    - Construct the filename: `{storyId}_article.txt`.
    - Construct the full file path using `path.join()`.
    - Get the successfully scraped article text string (`articleContent`).
    - Use `fs.writeFileSync(fullPath, articleContent, 'utf-8')` to save the text to the file. Wrap in `try...catch` for file system errors.
    - Log the successful saving of the file (e.g., "Saved scraped article text to {filename}") or any file writing errors encountered.
  - Ensure no `_article.txt` file is created if `scrapeArticle` returned `null` (due to skipping or failure).
- **Acceptance Criteria (ACs):**
  - AC1: After running `npm run dev`, the date-stamped output directory contains `_article.txt` files only for those stories where `scrapeArticle` succeeded and returned text content.
  - AC2: The name of each article text file is `{storyId}_article.txt`.
  - AC3: The content of each `_article.txt` file is the plain text string returned by `scrapeArticle`.
  - AC4: Logs confirm the successful writing of each `_article.txt` file or report specific file writing errors.
  - AC5: No empty `_article.txt` files are created; files only exist if scraping was successful.
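A minimal persistence sketch, assuming `outputDir` (the date-stamped directory path), `story`, and `logger` are in scope from the surrounding loop:

```typescript
import fs from 'fs';
import path from 'path';

// Persist only when scraping succeeded; no file is created otherwise.
if (story.articleContent !== null) {
  const filePath = path.join(outputDir, `${story.storyId}_article.txt`);
  try {
    fs.writeFileSync(filePath, story.articleContent, 'utf-8');
    logger.info(`Saved scraped article text to ${path.basename(filePath)}`);
  } catch (err) {
    logger.error(`Failed to write ${filePath}:`, err);
  }
}
```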
### Story 3.4: Implement Stage Testing Utility for Scraping
- **User Story / Goal:** As a developer, I want a separate script/command to test the article scraping logic using HN story data from local files, allowing independent testing and debugging of the scraper.
- **Detailed Requirements:**
  - Create a new standalone script file: `src/stages/scrape_articles.ts` (a sketch follows this story's ACs).
  - Import the necessary modules: `fs`, `path`, `logger`, `config`, `scrapeArticle`.
  - The script should:
    - Initialize the logger.
    - Load configuration (to get `OUTPUT_DIR_PATH`).
    - Determine the target date-stamped directory path (e.g., `${OUTPUT_DIR_PATH}/YYYY-MM-DD`, using the current date or potentially an optional CLI argument). Ensure this directory exists.
    - Read the directory contents and identify all `{storyId}_data.json` files.
    - For each `_data.json` file found:
      - Read and parse the JSON content.
      - Extract the `storyId` and `url`.
      - If a valid `url` exists, call `await scrapeArticle(url)`.
      - If scraping succeeds (returns text), save the text to `{storyId}_article.txt` in the same directory (using the logic from Story 3.3). Overwrite the file if it exists.
      - Log the progress and outcome (skip/success/fail) for each story processed.
  - Add a new script command to `package.json`: `"stage:scrape": "ts-node src/stages/scrape_articles.ts"`. Consider adding argument parsing later if needed to specify a date/directory.
- **Acceptance Criteria (ACs):**
  - AC1: The file `src/stages/scrape_articles.ts` exists.
  - AC2: The `stage:scrape` script is defined in `package.json`.
  - AC3: Running `npm run stage:scrape` (assuming a directory with `_data.json` files exists from a previous `stage:fetch` run) reads these files.
  - AC4: The script calls `scrapeArticle` for stories with valid URLs found in the JSON files.
  - AC5: The script creates/updates `{storyId}_article.txt` files in the target directory corresponding to successfully scraped articles.
  - AC6: The script logs its actions (reading files, attempting scraping, saving results) for each story ID processed.
  - AC7: The script operates solely on local `_data.json` files and fetches from external article URLs; it does not call the Algolia HN API.
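A sketch of the stage script under simplifying assumptions: configuration is read from an `OUTPUT_DIR_PATH` environment variable rather than the project's `config` module, `console` stands in for the logger, and the `_data.json` files are assumed to contain `storyId` and `url` fields:

```typescript
// src/stages/scrape_articles.ts -- illustrative sketch of the stage runner.
import fs from 'fs';
import path from 'path';
import { scrapeArticle } from '../scraper/articleScraper';

async function main(): Promise<void> {
  const outputDir = process.env.OUTPUT_DIR_PATH ?? './output';
  // Target today's date-stamped directory, e.g. ./output/2025-05-04.
  const dateDir = path.join(outputDir, new Date().toISOString().slice(0, 10));
  const dataFiles = fs.readdirSync(dateDir).filter((f) => f.endsWith('_data.json'));

  for (const file of dataFiles) {
    const { storyId, url } = JSON.parse(
      fs.readFileSync(path.join(dateDir, file), 'utf-8'),
    );
    if (!url || !/^https?:\/\//.test(url)) {
      console.warn(`Skipping story ${storyId}: missing or invalid URL`);
      continue;
    }
    console.info(`Scraping article for story ${storyId} from ${url}`);
    const text = await scrapeArticle(url);
    if (text) {
      // Overwrites any existing _article.txt for this story.
      fs.writeFileSync(path.join(dateDir, `${storyId}_article.txt`), text, 'utf-8');
      console.info(`Saved article text for story ${storyId}`);
    } else {
      console.warn(`Scraping failed for story ${storyId}`);
    }
  }
}

main().catch((err) => {
  console.error('stage:scrape failed:', err);
  process.exit(1);
});
```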
## Change Log
| Change | Date | Version | Description | Author |
|---|---|---|---|---|
| Initial Draft | 2025-05-04 | 0.1 | First draft of Epic 3 | 2-pm |