# InsuWiki Refinement Pipeline: Checkpoint Saving

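## Background
In the current pipeline, the results of thread splitting and LLM fine splitting exist only in memory, so if the process is interrupted, everything has to be rerun from the start. "Resume refinement" also skips only batch extraction and repeats stages 1 through 3 on every run.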

## Tasks

### 1. Save a checkpoint of the thread splitting result
File: `/home/jay/projects/insuwiki/scripts/kakao_knowledge/knowledge_extractor_v2.py`

After `_split_threads_v2()` completes:
```python
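# Save checkpoint (needs `import json` and `from pathlib import Path` at module top)
if output_dir:
    checkpoint_path = Path(output_dir) / "checkpoint_threads.json"
    # mode="json" (Pydantic v2) converts non-JSON types such as datetime to JSON-safe values
    checkpoint_data = [t.model_dump(mode="json") for t in threads]  # ThreadV2 -> dict
    # Explicit UTF-8 so the non-ASCII output of ensure_ascii=False does not depend on the locale
    checkpoint_path.write_text(
        json.dumps(checkpoint_data, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    _add_log(f"Thread splitting checkpoint saved: {len(threads)} threads")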
```

### 2. Save a checkpoint of the LLM fine splitting result
After `_llm_refine_thread_splits()` completes:
```python
if output_dir:
    checkpoint_path = Path(output_dir) / "checkpoint_refined_threads.json"
    # mode="json" (Pydantic v2) keeps every field JSON-serializable
    checkpoint_data = [t.model_dump(mode="json") for t in threads]
    checkpoint_path.write_text(
        json.dumps(checkpoint_data, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    _add_log(f"LLM fine splitting checkpoint saved: {len(threads)} threads")
```
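
Both save steps above write the checkpoint file in place, so an interruption during the write could leave a truncated JSON file that later breaks the restore. A minimal sketch of one way to avoid that, writing to a temporary file and renaming it over the target, is shown below. The helper name `write_checkpoint_atomic` is hypothetical and not part of `knowledge_extractor_v2.py`.

```python
import json
import os
import tempfile
from pathlib import Path


def write_checkpoint_atomic(path: Path, data: list) -> None:
    """Hypothetical helper: write JSON so a reader never sees a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False, indent=2)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # replaces any existing checkpoint in one step
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

With this in place, each save step would call `write_checkpoint_atomic(checkpoint_path, checkpoint_data)` instead of `write_text`. Whether the extra safety is worth it depends on how large the checkpoints get and how often runs are interrupted.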

### 3. Use checkpoints when resuming refinement
At the start of `extract_knowledge_v2()`:
```python
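# If a checkpoint exists, skip the stages it already covers
threads = None  # None means nothing was restored
if output_dir:
    refined_cp = Path(output_dir) / "checkpoint_refined_threads.json"
    threads_cp = Path(output_dir) / "checkpoint_threads.json"
    if refined_cp.exists():
        threads = [ThreadV2(**t) for t in json.loads(refined_cp.read_text(encoding="utf-8"))]
        _add_log(f"Checkpoint restored: LLM fine splitting result, {len(threads)} threads")
        # Skip both thread splitting and LLM fine splitting
    elif threads_cp.exists():
        threads = [ThreadV2(**t) for t in json.loads(threads_cp.read_text(encoding="utf-8"))]
        _add_log(f"Checkpoint restored: thread splitting result, {len(threads)} threads")
        # Skip thread splitting; run LLM fine splitting only
    else:
        pass  # No checkpoint: run everything from the beginning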
```
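
The branch above only records which stage was restored in comments, and it assumes the checkpoint file is valid JSON. The sketch below shows one possible way to make the resume decision explicit and to fall back to the earlier checkpoint when the later one is unreadable. The function `load_latest_checkpoint` and the stage labels are hypothetical; it returns raw dicts, leaving the `ThreadV2(**t)` conversion to the caller.

```python
import json
from pathlib import Path
from typing import Optional, Tuple

REFINED_NAME = "checkpoint_refined_threads.json"
THREADS_NAME = "checkpoint_threads.json"


def load_latest_checkpoint(output_dir: Optional[str]) -> Tuple[str, Optional[list]]:
    """Return (stage, records), where stage is "refined", "split", or "none"."""
    if not output_dir:
        return "none", None
    # Try the most advanced stage first, then the earlier one.
    for stage, name in (("refined", REFINED_NAME), ("split", THREADS_NAME)):
        path = Path(output_dir) / name
        if not path.exists():
            continue
        try:
            return stage, json.loads(path.read_text(encoding="utf-8"))
        except (ValueError, OSError):
            # Truncated or unreadable file: fall through to the earlier stage.
            continue
    return "none", None
```

A caller could then run thread splitting only when the stage is `"none"`, and LLM fine splitting only when the stage is not `"refined"`.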

### 4. Checkpoint cleanup
When refinement completes, delete the checkpoint files (or keep them for debugging).
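
Since the delete-or-keep choice is left open, one option is a small helper with a flag, sketched below. The name `cleanup_checkpoints` and the `keep` parameter are assumptions for illustration, not existing code.

```python
from pathlib import Path
from typing import List, Optional

CHECKPOINT_NAMES = ("checkpoint_threads.json", "checkpoint_refined_threads.json")


def cleanup_checkpoints(output_dir: Optional[str], keep: bool = False) -> List[str]:
    """Delete checkpoint files unless keep is True; return the names removed."""
    removed: List[str] = []
    if not output_dir or keep:
        return removed  # nothing to do, or checkpoints kept for debugging
    for name in CHECKPOINT_NAMES:
        path = Path(output_dir) / name
        if path.exists():
            path.unlink()
            removed.append(name)
    return removed
```

Calling it only after the final output has been written keeps the checkpoints available if the last stage fails.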

## Affected files
- `/home/jay/projects/insuwiki/scripts/kakao_knowledge/knowledge_extractor_v2.py`: modified

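## Verification scenarios
1. Run refinement and confirm that `checkpoint_threads.json` is created in output_dir
2. After LLM fine splitting, confirm that `checkpoint_refined_threads.json` is created
3. On "Resume refinement" with a checkpoint present, confirm the log shows thread splitting was skipped
4. With no checkpoint, the run proceeds from the beginning as before
5. No regression in existing refinement behavior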

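## Notes
- knowledge_extractor_v2.py belongs to the insuwiki project
- Verify serialization and deserialization of the ThreadV2 model (model_dump → dict → ThreadV2)
- If output_dir is None, no checkpoint is saved (existing behavior is preserved)
- Commit immediately after making the changes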
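
The serialization note above can be checked with a small round-trip test before relying on the checkpoints. The sketch below is generic: `survives_json_round_trip` and `StandInThread` are hypothetical, and the dataclass only stands in for the real `ThreadV2`, whose fields are not shown in this document.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Callable, Iterable


def survives_json_round_trip(
    items: Iterable[Any],
    dump: Callable[[Any], dict],
    load: Callable[[dict], Any],
) -> bool:
    """Dump each item to a dict, pass it through JSON text, rebuild, and compare."""
    original = list(items)
    payload = json.dumps([dump(x) for x in original], ensure_ascii=False)
    restored = [load(d) for d in json.loads(payload)]
    return restored == original


@dataclass
class StandInThread:  # hypothetical stand-in, NOT the real ThreadV2
    thread_id: int
    messages: list


ok = survives_json_round_trip(
    [StandInThread(1, ["질문", "답변"])],
    dump=asdict,
    load=lambda d: StandInThread(**d),
)
```

For the real model, the same function could be called with `dump=lambda t: t.model_dump(mode="json")` and `load=lambda d: ThreadV2(**d)`, assuming ThreadV2 is a Pydantic v2 model that supports equality comparison.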