iPage extraction script

Canonical script now lives in the repo: catalog/packages/scraper/src/ipage_extract.py (run with uv run).

This note no longer holds a copy of the code (two copies just diverge). It records how the extraction works and what changed from the original draft.

Key finding: no browser automation needed

A captured session cookie plus plain HTTP is enough. All three endpoints were confirmed to return 200 with a valid cookie and no JavaScript execution:

IPAGE_COOKIE='JSESSIONID=...; ...; ipageUserLoggedIn=...'   # from a logged-in tab
curl -A "<chrome UA>" -H "Cookie: $IPAGE_COOKIE" \
  "https://ipage.ingramcontent.com/ipage/product/search/savedSearches.action?action=select&searchId=206002"

The Browserbase/browser-session plan in the original architecture is dropped.
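
The curl check above can be reproduced with the Python standard library alone. This is a minimal sketch, not the code in ipage_extract.py: the function names build_request and fetch are hypothetical, and the user-agent string is left for the caller to supply.

```python
import urllib.request


def build_request(url: str, cookie: str, user_agent: str) -> urllib.request.Request:
    """Build a plain GET request that carries the captured session cookie.

    Hypothetical helper: the canonical script may structure this differently.
    """
    return urllib.request.Request(
        url,
        headers={"Cookie": cookie, "User-Agent": user_agent},
    )


def fetch(url: str, cookie: str, user_agent: str, timeout: float = 30.0) -> bytes:
    """Fetch one page over plain HTTP; no browser or JS engine is involved."""
    req = build_request(url, cookie, user_agent)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()
```

Typical use would be to read the cookie with os.environ["IPAGE_COOKIE"] and call fetch() with the saved-search URL shown in the curl command and a Chrome user-agent string.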

Endpoints

  1. savedSearches.action?action=select&searchId=206002 → grid + lastSearch token
  2. productdetail?queryString=<lastSearch>&R=<productId>&dNo=0 → metadata + stock
  3. servlet/ibg.common.titledetail.imageloader?ean=<ean>&size=640 → cover jpg
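
The three calls chain together: the grid yields the lastSearch token, which feeds the detail request, whose EAN feeds the cover request. The sketch below only assembles the query strings. The helper names are hypothetical, and it assumes the lastSearch token is taken from the grid in raw (not yet percent-encoded) form; if the page already supplies it encoded, it should be passed through unchanged instead. Only the saved-search endpoint is given here with a full URL, so the other two helpers return just the query part.

```python
from urllib.parse import urlencode

# Full URL as it appears in the curl check in this note.
SAVED_SEARCH_URL = (
    "https://ipage.ingramcontent.com/ipage/product/search/savedSearches.action"
)


def saved_search_url(search_id: int) -> str:
    """Step 1: URL of the saved-search grid."""
    query = urlencode({"action": "select", "searchId": search_id})
    return f"{SAVED_SEARCH_URL}?{query}"


def detail_query(last_search: str, product_id: str) -> str:
    """Step 2: query string for productdetail.

    Assumes last_search is the raw token; urlencode percent-encodes it.
    """
    return urlencode({"queryString": last_search, "R": product_id, "dNo": 0})


def cover_query(ean: str, size: int = 640) -> str:
    """Step 3: query string for the cover image loader."""
    return urlencode({"ean": ean, "size": size})
```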

Changes from the original draft

  • Cookie source: repo-root .env (IPAGE_COOKIE=), not ~/.hermes/.env. Single-quote the value in .env because the cookie contains the characters &, ; and =.
  • Detail page layout drifted. Metadata (BISAC, Dewey, LCCN, LC call, physical, carton) is NOT in the table containing the “Additional Information” header — that table is now empty. It lives in a separate, richer table. Fix: pick the longest table containing “BISAC Categories”.
  • Stock table is its own table with one cell per line (PA / PRIMARY / 1684 / 1128 / ...). It is parsed as a token stream, not by line-splitting. Forthcoming titles legitimately have no stock table, so a missing table is not an error.
  • EAN cell can hold two codes (9780063415898 0063415895); take the EAN-13. Fallback: EAN is also in the cover imageloader URL.
  • Grid markup is uppercase (<INPUT NAME="addItem">); BeautifulSoup lowercases it, so the parser is unaffected.
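
Two of the fixes above can be sketched with the standard library. This is an illustration under stated assumptions, not the code in ipage_extract.py: all function names are hypothetical, "longest" is assumed to mean longest extracted text, and the check-digit validation is an extra guard that the note itself does not mention.

```python
import re
from typing import Iterable, Optional
from urllib.parse import parse_qs, urlsplit


def is_valid_ean13(code: str) -> bool:
    """True if code is 13 ASCII digits with a correct EAN-13 check digit."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(c) for c in code]
    # Weights alternate 1, 3, 1, 3, ... over the first twelve digits.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]


def pick_ean13(cell_text: str) -> Optional[str]:
    """Return the first EAN-13 in a cell that may also hold an ISBN-10."""
    for token in re.findall(r"[0-9]+", cell_text):
        if is_valid_ean13(token):
            return token
    return None


def ean_from_cover_url(url: str) -> Optional[str]:
    """Fallback: read the ean= parameter from the cover imageloader URL."""
    values = parse_qs(urlsplit(url).query).get("ean", [])
    return values[0] if values and is_valid_ean13(values[0]) else None


def pick_metadata_table(
    table_texts: Iterable[str], marker: str = "BISAC Categories"
) -> Optional[str]:
    """Pick the longest table text that contains the marker.

    Takes already-extracted text (one string per table) so the sketch
    stays independent of the HTML parser in use.
    """
    candidates = [text for text in table_texts if marker in text]
    return max(candidates, key=len, default=None)
```

Selecting by marker plus length avoids depending on the “Additional Information” header, which now sits in an empty table.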

SFF top sellers = saved search id 206002 (~25 titles).