iPage extraction script
Canonical script now lives in the repo:
catalog/packages/scraper/src/ipage_extract.py (run with uv run).
This note no longer holds a copy of the code (two copies just diverge). It records how the extraction works and what changed from the original draft.
Key finding: no browser automation needed
A captured session cookie + plain HTTP does everything. Confirmed all three endpoints return 200 with a valid cookie and no JS:
IPAGE_COOKIE='JSESSIONID=...; ...; ipageUserLoggedIn=...' # from a logged-in tab
curl -A "<chrome UA>" -H "Cookie: $IPAGE_COOKIE" \
  "https://ipage.ingramcontent.com/ipage/product/search/savedSearches.action?action=select&searchId=206002"

The Browserbase/browser-session plan in the original architecture is dropped.
Endpoints
- savedSearches.action?action=select&searchId=206002 → grid + lastSearch token
- productdetail?queryString=<lastSearch>&R=<productId>&dNo=0 → metadata + stock
- servlet/ibg.common.titledetail.imageloader?ean=<ean>&size=640 → cover jpg
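A minimal sketch of how these three paths and the cookie-bearing request could be built with only the standard library. It is not the code in ipage_extract.py. The helper names (saved_search_path, product_detail_path, cover_path, build_request) are hypothetical. The paths are kept relative, exactly as listed above, because only the savedSearches base URL appears in this note. Percent-encoding the lastSearch token through urlencode is an assumption: if the token comes out of the grid page already encoded, it should be passed through untouched instead.

```python
import urllib.request
from urllib.parse import urlencode

# Base taken from the curl example in this note. The bases for productdetail
# and the imageloader servlet are not stated here, so they are left to the caller.
SEARCH_BASE = "https://ipage.ingramcontent.com/ipage/product/search/"


def saved_search_path(search_id: int) -> str:
    """Relative path that selects a saved search (returns the grid page)."""
    return "savedSearches.action?" + urlencode(
        {"action": "select", "searchId": search_id}
    )


def product_detail_path(last_search: str, product_id: str) -> str:
    """Relative path for one title's detail page (metadata + stock).

    Assumption: last_search is the raw token, so urlencode may escape it.
    """
    return "productdetail?" + urlencode(
        {"queryString": last_search, "R": product_id, "dNo": 0}
    )


def cover_path(ean: str, size: int = 640) -> str:
    """Relative path for the cover image of the given EAN."""
    return "servlet/ibg.common.titledetail.imageloader?" + urlencode(
        {"ean": ean, "size": size}
    )


def build_request(url: str, cookie: str, user_agent: str) -> urllib.request.Request:
    """Plain GET request carrying the captured session cookie; no browser involved."""
    return urllib.request.Request(
        url, headers={"Cookie": cookie, "User-Agent": user_agent}
    )
```

build_request only assembles the request. Passing the result to urllib.request.urlopen performs the GET, for example with SEARCH_BASE + saved_search_path(206002) as the URL.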
Changes from the original draft
- Cookie source: repo-root .env (IPAGE_COOKIE=), not ~/.hermes/.env. Single-quote the value in .env; the cookie contains &, ; and =.
- Detail page layout drifted. Metadata (BISAC, Dewey, LCCN, LC call, physical, carton) is NOT in the table containing the “Additional Information” header; that table is now empty. It lives in a separate, richer table. Fix: pick the longest table containing “BISAC Categories”.
- Stock table is its own table with one cell per line (PA / PRIMARY / 1684 / 1128 / ...). Parsed as a token stream, not by line-splitting. Forthcoming titles legitimately have no stock table.
- EAN cell can hold two codes (9780063415898 0063415895); take the EAN-13. Fallback: EAN is also in the cover imageloader URL.
- Grid markup is uppercase (<INPUT NAME="addItem">); BeautifulSoup lowercases it, so the parser is unaffected.
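The parsing fixes above can be sketched roughly as follows, assuming the table and cell text has already been pulled out of the page (for example with BeautifulSoup's get_text()). This is an illustration, not the code in ipage_extract.py. All function names are hypothetical. The EAN-13 check-digit test is an extra safeguard not mentioned in this note. The rule that a two-letter uppercase token starts a new stock record is a guess from the single PA / PRIMARY / 1684 / 1128 example.

```python
import re
from typing import Dict, Iterable, List, Optional


def pick_metadata_table(table_texts: Iterable[str]) -> Optional[str]:
    """Return the longest table text that mentions "BISAC Categories"."""
    candidates = [t for t in table_texts if "BISAC Categories" in t]
    return max(candidates, key=len) if candidates else None


def is_ean13(code: str) -> bool:
    """True for a 13-digit code whose check digit is valid."""
    if not re.fullmatch(r"[0-9]{13}", code):
        return False
    digits = [int(c) for c in code]
    # Weights alternate 1, 3, 1, 3, ... over the first twelve digits.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]


def pick_ean13(cell_text: str) -> Optional[str]:
    """From a cell that may hold two codes, return the first valid EAN-13."""
    for token in cell_text.split():
        if is_ean13(token):
            return token
    return None


def stock_records(cell_texts: Iterable[str]) -> List[Dict[str, object]]:
    """Walk the stock cells as one token stream instead of splitting by line.

    Assumption: a two-letter uppercase token (e.g. "PA") opens a record;
    later tokens are collected as integer counts or as text labels.
    An empty input (forthcoming title, no stock table) yields [].
    """
    records: List[Dict[str, object]] = []
    for tok in (t for cell in cell_texts for t in cell.split()):
        if re.fullmatch(r"[A-Z]{2}", tok):
            records.append({"code": tok, "labels": [], "counts": []})
        elif records:
            number = tok.replace(",", "")
            if number.isdecimal():
                records[-1]["counts"].append(int(number))
            else:
                records[-1]["labels"].append(tok)
    return records
```

The column meanings of the integer counts are deliberately left unnamed, since this note does not say what each number in the stock row represents.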
Saved search
SFF top sellers = saved search id 206002 (~25 titles).