# task-571.3: Crawler Prototype Using Smart Matching

## Scoped Delegation
Chairman Jay has granted Team 2 scoped approval for all Phases. When a Phase completes, mark it `.done` and proceed immediately to the next Phase.

## Reference Documents
1. `memory/research/scrapling-analysis.md`: in-depth Scrapling analysis report
2. `memory/reports/task-571.2.md`: Phase 2 completion report (Scrapling installation + crawl_utils + SKILL.md)
3. `memory/tasks/task-571.1.md`: master plan (all 5 Phases)

## Phase 2 Deliverables (Previous Phase)
- `/home/jay/workspace/scripts/crawl_utils.py`: ProxyRotator, fetch_with_retry, html_to_markdown, clean_html
- `/home/jay/workspace/scripts/tests/test_crawl_utils.py`: 53 tests
- `/home/jay/workspace/skills/advanced-crawling/SKILL.md`: crawling skill guide
- Scrapling 0.4.2 installed (imports of Fetcher, DynamicFetcher, and StealthyFetcher verified)

## Work Items

### 1. Apply Smart Matching (S-1 to S-4)
- Use Scrapling's auto_save/adaptive/find_similar features
- Element fingerprinting → save to SQLite → similarity-based relocation
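The fingerprint → SQLite → similarity flow above can be illustrated with a minimal standard-library sketch. This is not Scrapling's implementation or API: every name here (`open_store`, `fingerprint`, `save_fingerprint`, `load_fingerprint`, `similarity`, `relocate`), the table schema, the equal-weight scoring, and the 0.5 threshold are assumptions chosen only to show the idea.

```python
import json
import sqlite3
from difflib import SequenceMatcher
from typing import Optional


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the fingerprint store. Schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS fingerprints "
        "(selector TEXT PRIMARY KEY, data TEXT)"
    )
    return db


def fingerprint(tag: str, attrs: dict, text: str, path: list) -> dict:
    """Collect the properties used to recognise an element later."""
    return {"tag": tag, "attrs": attrs, "text": text.strip(), "path": path}


def save_fingerprint(db: sqlite3.Connection, selector: str, fp: dict) -> None:
    """Store the fingerprint as JSON, keyed by the selector that found it."""
    db.execute(
        "INSERT OR REPLACE INTO fingerprints VALUES (?, ?)",
        (selector, json.dumps(fp)),
    )
    db.commit()


def load_fingerprint(db: sqlite3.Connection, selector: str) -> Optional[dict]:
    """Return the saved fingerprint for a selector, or None if absent."""
    row = db.execute(
        "SELECT data FROM fingerprints WHERE selector = ?", (selector,)
    ).fetchone()
    return json.loads(row[0]) if row else None


def similarity(a: dict, b: dict) -> float:
    """Average four equally weighted scores: tag, attributes, text, tag path."""
    tag_score = 1.0 if a["tag"] == b["tag"] else 0.0
    attr_score = SequenceMatcher(
        None,
        json.dumps(a["attrs"], sort_keys=True),
        json.dumps(b["attrs"], sort_keys=True),
    ).ratio()
    text_score = SequenceMatcher(None, a["text"], b["text"]).ratio()
    path_score = SequenceMatcher(None, a["path"], b["path"]).ratio()
    return (tag_score + attr_score + text_score + path_score) / 4


def relocate(saved: dict, candidates: list, threshold: float = 0.5) -> Optional[dict]:
    """Return the most similar candidate, or None if it scores below threshold."""
    best = max(candidates, key=lambda c: similarity(saved, c), default=None)
    if best is not None and similarity(saved, best) >= threshold:
        return best
    return None


# Usage: save on the first crawl, relocate after the page layout changes.
db = open_store(":memory:")
save_fingerprint(
    db, "td.rate",
    fingerprint("td", {"class": "rate"}, "3.5%", ["html", "body", "table", "tr"]),
)
saved = load_fingerprint(db, "td.rate")
```

In the prototype itself, this bookkeeping is expected to come from Scrapling's auto_save/adaptive features rather than hand-written code; the sketch only makes the moving parts visible for testing and review.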

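### 2. Insurer Public Data Crawler Prototype
Path: `/home/jay/workspace/scripts/insurance_crawler.py`

Implementation items:
- Extract data from insurers' public disclosure pages (public data only)
- Handle page structure changes with Smart Matching
- CSS selector-based data extraction
- Use ProxyRotator/fetch_with_retry from crawl_utils.py
- Convert to LLM input with html_to_markdown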
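One possible shape for the fetch → extract → Markdown pipeline is sketched below, under loud assumptions. The signatures of `fetch_with_retry` and `html_to_markdown` are not given in this document, so the sketch takes the fetcher as an injected callable and uses the standard library's `html.parser` as a stand-in for CSS-selector extraction and Markdown conversion. `TableRowParser`, `CrawlResult`, `rows_to_records`, `records_to_markdown`, and `crawl` are hypothetical names, and the sample table is invented illustration data.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable


class TableRowParser(HTMLParser):
    """Collect the cell text of every <tr> (stdlib stand-in for CSS selection)."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


@dataclass
class CrawlResult:
    url: str
    records: list
    markdown: str


def rows_to_records(rows: list) -> list:
    """Treat the first row as the header and zip it with each later row."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def records_to_markdown(records: list) -> str:
    """Render records as a Markdown table suitable for LLM input."""
    if not records:
        return ""
    cols = list(records[0])
    lines = [
        "| " + " | ".join(cols) + " |",
        "| " + " | ".join("---" for _ in cols) + " |",
    ]
    lines += ["| " + " | ".join(r.get(c, "") for c in cols) + " |" for r in records]
    return "\n".join(lines)


def crawl(url: str, fetch: Callable[[str], str]) -> CrawlResult:
    """Fetch one page with the injected fetcher and extract its table rows."""
    parser = TableRowParser()
    parser.feed(fetch(url))
    records = rows_to_records(parser.rows)
    return CrawlResult(url, records, records_to_markdown(records))


# Usage with a stub fetcher and invented sample data (no network access).
SAMPLE_HTML = (
    "<table><tr><th>Product</th><th>Rate</th></tr>"
    "<tr><td>Plan A</td><td>3.5%</td></tr></table>"
)
result = crawl("https://example.com/disclosure", lambda _url: SAMPLE_HTML)
```

Injecting the fetcher keeps the extraction logic testable offline; in the real crawler the callable would presumably wrap fetch_with_retry and ProxyRotator from crawl_utils.py.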

### 3. Parsing Feature Integration (P-1 to P-4, P-6)
- Fast lxml-based parsing
- CSS/XPath selectors + TextHandler chaining
- Automatically extract repeated-structure data with find_similar()
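To make the repeated-structure idea concrete, here is a simplified standard-library sketch of what a "find similar" search could look like. It is an illustration only, not Scrapling's `find_similar()`: the local `find_similar` function is a hypothetical stand-in, and the matching rule (same tag path from the root and same attribute names) is an assumption. It uses `xml.etree.ElementTree`, so the input must be well-formed markup.

```python
import xml.etree.ElementTree as ET


def find_similar(root: ET.Element, sample: ET.Element) -> list:
    """Return elements that share the sample's tag path and attribute names."""
    # ElementTree has no parent pointers, so build a child -> parent map once.
    parents = {child: parent for parent in root.iter() for child in parent}

    def tag_path(node):
        # Walk up to the root, collecting tags (leaf first).
        tags = []
        while node is not None:
            tags.append(node.tag)
            node = parents.get(node)
        return tags

    want_path = tag_path(sample)
    want_attrs = set(sample.attrib)
    return [
        el
        for el in root.iter(sample.tag)
        if el is not sample
        and tag_path(el) == want_path
        and set(el.attrib) == want_attrs
    ]


# Usage: pick one list item by hand, then collect the rest automatically.
DOC = """<div><ul>
<li class="item"><span>Plan A</span></li>
<li class="item"><span>Plan B</span></li>
<li class="item"><span>Plan C</span></li>
</ul><p class="item">footer</p></div>"""

root = ET.fromstring(DOC)
first = root.find("./ul/li")
names = [el.find("span").text for el in find_similar(root, first)]
```

The appeal for disclosure pages is that only one row or card needs a hand-written selector; the remaining items follow from structural similarity.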

## Files to Modify/Create
- `/home/jay/workspace/scripts/insurance_crawler.py` (new)
- `/home/jay/workspace/scripts/tests/test_insurance_crawler.py` (new)

## Cautions
- ⚠️ Delete actual crawl test outputs only after Chairman Jay has reviewed them
- Target only lawfully public data
- Respect robots.txt
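The robots.txt rule can be enforced before any request is issued. The sketch below uses the standard library's `urllib.robotparser`; the helper name `build_robots_checker`, the user-agent string, and the sample rules are assumptions for illustration. The robots.txt body is passed in as text so the check can be tested without network access.

```python
from typing import Callable
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str, user_agent: str) -> Callable[[str], bool]:
    """Parse a robots.txt body and return a function that answers 'may I fetch?'."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed


# Usage with invented rules; a real run would first download /robots.txt.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"
allowed = build_robots_checker(SAMPLE_ROBOTS, "insurance-crawler-prototype")
can_crawl_rates = allowed("https://example.com/disclosure/rates")
can_crawl_private = allowed("https://example.com/private/data")
```

Calling the checker at the top of the fetch path, and skipping any URL it rejects, keeps the rule in one place instead of relying on each caller to remember it.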

## Completion Criteria
- insurance_crawler.py plus its tests
- Smart Matching verified to work (auto_save/adaptive)
- Prototype run results (data sample)
- pyright reports 0 errors
- No regressions in existing tests

## Report
`memory/reports/task-571.3.md`