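# Task: Fix blog promotional-content detection conditions (3 items)

## Urgency: critical
Promotional-content false positives plus a condition change. This modifies the core logic of Step 5.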

---

## 1. Remove the LLM promotional check (condition 5)

### File: `/home/jay/projects/InfoKeyword/worker/pipeline/analyzer.py`

**Code to delete (lines 128-131)**:
```python
    # 5. LLM promotional check
    llm_result = await judge_promotional(text)
    if llm_result.get("is_promotional"):
        reasons.append("llm_promotional")
```

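Also remove it from the imports at the top of the file:
```python
from worker.analyzer.llm_promotional import judge_promotional  # delete this line
```

**Reason**: Chairman Jay instructed that this condition be dropped. Keep only the remaining four conditions (phone number/address, external links, attachments, image analysis).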

---

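## 2. Exclude Naver blog template and banner-ad links

### Problem
`_extract_external_links()` in `blog_content.py` collects not only links from the post body but also **links that Naver inserts automatically**, which causes false positives.

Observed false positive:
- `http://creativecommons.org/licenses/by-nc-nd/2.0/kr/`: CCL license notice (auto-inserted at the bottom of Naver blog posts)

### File: `/home/jay/projects/InfoKeyword/worker/crawler/blog_content.py`

#### Change 1: Exclude Naver template-area links in `_extract_external_links()`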

```python
# Link domains auto-inserted by the Naver blog template/system (not promotional)
_TEMPLATE_DOMAINS = {
    "creativecommons.org",    # CCL license notice
    "www.creativecommons.org",
}

# CSS selectors for Naver AdPost / banner-ad containers
_AD_CONTAINER_SELECTORS = [
    ".adpost_wrap",          # AdPost wrapper
    ".adpost-container",     # AdPost container
    ".revenue_unit_wrap",    # revenue unit
    "[data-adpost-id]",      # AdPost data attribute
    ".sponsor_area",         # sponsor area
    ".ad_box",               # ad box
]
```

#### Change 2: Exclude links inside ad containers in `_extract_external_links()`

The current code extracts links from every `<a>` tag. Add the logic below (new lines are marked with ★):

```python
def _extract_external_links(soup: BeautifulSoup) -> list[str]:
    urls: list[str] = []
    seen: set[str] = set()

    # ★ Pre-collect URLs found inside ad containers so they can be excluded
    ad_urls: set[str] = set()
    for selector in _AD_CONTAINER_SELECTORS:
        for container in soup.select(selector):
            for a in container.find_all("a", href=True):
                ad_urls.add(a["href"].strip())

    # Method 1: data-linkdata with linkUse=true
    for tag in soup.find_all(attrs={"data-linkdata": True}):
        raw = tag.get("data-linkdata", "")
        try:
            link_data = json.loads(raw)
            if link_data.get("linkUse") is True or link_data.get("linkUse") == "true":
                link_url = link_data.get("link", "")
                if (link_url
                    and not _is_internal_link(link_url)
                    and not _is_template_domain(link_url)  # ★ added
                    and link_url not in ad_urls             # ★ added
                    and link_url not in seen):
                    urls.append(link_url)
                    seen.add(link_url)
        except (json.JSONDecodeError, TypeError, AttributeError):
            # AttributeError: the JSON parsed to a non-dict value (e.g. a list)
            pass

    # Method 2: plain <a> tags
    for anchor in soup.find_all("a", href=True):
        href: str = anchor["href"].strip()
        if (href
            and href.startswith("http")
            and not _is_internal_link(href)
            and not _is_template_domain(href)  # ★ added
            and href not in ad_urls             # ★ added
            and href not in seen):
            urls.append(href)
            seen.add(href)

    return urls


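def _is_template_domain(url: str) -> bool:
    """Return True if the URL is a link auto-inserted by the Naver blog template."""
    try:
        # hostname (unlike netloc) drops any port or userinfo before matching
        host = (urlparse(url).hostname or "").lower()
        return host in _TEMPLATE_DOMAINS or any(host.endswith("." + d) for d in _TEMPLATE_DOMAINS)
    except Exception:
        return False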
```
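
The suffix match in `_is_template_domain()` can be checked in isolation. The sketch below is a standalone variant that uses only the standard library. Apart from the CCL link reported above, the sample URLs are made up for illustration.

```python
from urllib.parse import urlparse

# Standalone copy for illustration; the real constant lives in blog_content.py.
TEMPLATE_DOMAINS = {"creativecommons.org"}


def is_template_domain(url: str) -> bool:
    """Return True if the URL's host is a template domain or a subdomain of one."""
    try:
        host = (urlparse(url).hostname or "").lower()
    except ValueError:
        return False
    return any(host == d or host.endswith("." + d) for d in TEMPLATE_DOMAINS)


if __name__ == "__main__":
    samples = [
        "http://creativecommons.org/licenses/by-nc-nd/2.0/kr/",  # CCL link from the report
        "https://www.creativecommons.org/",                      # subdomain
        "https://notcreativecommons.org/",                       # hypothetical look-alike
        "https://example.com/?ref=creativecommons.org",          # domain only in the query
    ]
    for sample in samples:
        print(is_template_domain(sample), sample)
```

Because the comparison requires either an exact host or a leading dot before the domain, look-alike hosts and URLs that merely mention the domain in a query string are not treated as template links.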

---

## 3. Strengthen Naver Place detection

### Problem
`detect_place()` checks for `place.naver.com` and `map.naver.com` URLs, but in a blog post a Place can be inserted as a **SmartEditor map module** rather than as an `<a>` tag.

### File: `/home/jay/projects/InfoKeyword/worker/crawler/blog_content.py`

The return value of `get_blog_content()` already has a `has_place` field, but `_extract_external_links()` excludes internal naver.com links, so Place links never make it into the external_links list.

**Change**: detect Place/TalkTalk links separately from `_extract_external_links()`.

```python
def _detect_naver_special_links(soup: BeautifulSoup) -> dict:
    """Detect Naver TalkTalk/Place links in the blog body, separately from external links."""
    has_talktalk = False
    has_place = False

    # 1. Check the domain of every <a> tag (regardless of the internal-link filter)
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"].strip().lower()
        if "talk.naver.com" in href or "talkpf.naver.com" in href:
            has_talktalk = True
        if "place.naver.com" in href or "map.naver.com" in href:
            has_place = True

    # 2. Detect the SmartEditor map module
    # SE3 map components: .se-module-map, .se-map
    if soup.select_one(".se-module-map, .se-map, .se_map"):
        has_place = True

    # 3. Detect place/map links in data-linkdata
    for tag in soup.find_all(attrs={"data-linkdata": True}):
        raw = tag.get("data-linkdata", "")
        try:
            link_data = json.loads(raw)
            # "link" may be missing or null, so fall back to an empty string
            link_url = (link_data.get("link") or "").lower()
            if "place.naver.com" in link_url or "map.naver.com" in link_url:
                has_place = True
            if "talk.naver.com" in link_url or "talkpf.naver.com" in link_url:
                has_talktalk = True
        except (json.JSONDecodeError, TypeError, AttributeError):
            # AttributeError: the JSON parsed to a non-dict value (e.g. a list)
            pass

    # 4. Detect maps in iframe src (embedded Naver Map)
    for iframe in soup.find_all("iframe", src=True):
        src = iframe["src"].lower()
        if "map.naver.com" in src or "place.naver.com" in src:
            has_place = True

    return {"has_talktalk": has_talktalk, "has_place": has_place}
```
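
The checks above use substring matching (for example `"map.naver.com" in href`), which would also match a URL that only mentions the domain in its path or query string. If that turns out to cause false positives, a stricter variant could compare the parsed hostname instead. The helper below is a hypothetical sketch (`host_matches` and the two tuple constants are not part of the existing code) using only the standard library.

```python
from urllib.parse import urlparse

# Hypothetical constants; the domains themselves are the ones checked above.
PLACE_DOMAINS = ("place.naver.com", "map.naver.com")
TALKTALK_DOMAINS = ("talk.naver.com", "talkpf.naver.com")


def host_matches(url: str, domains: tuple[str, ...]) -> bool:
    """True if the URL's hostname equals, or is a subdomain of, one of `domains`."""
    try:
        host = (urlparse(url.strip()).hostname or "").lower()
    except ValueError:
        return False
    return any(host == d or host.endswith("." + d) for d in domains)


if __name__ == "__main__":
    print(host_matches("https://m.place.naver.com/restaurant/123", PLACE_DOMAINS))
    print(host_matches("https://example.com/?next=map.naver.com", PLACE_DOMAINS))
```

One trade-off: a URL written without a scheme (such as `map.naver.com/...`) has no parsed hostname and would not match, so the substring check is more lenient in that case.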

**Add the call in `get_blog_content()`**:
Use the function above instead of the existing `_has_talktalk(external_links)` and `_has_place(external_links)`:
```python
# Before (only looks at external_links, so naver.com links are missed)
# talktalk = _has_talktalk(external_links)
# place = _has_place(external_links)

# After (detect directly from the full HTML)
special_links = _detect_naver_special_links(soup)
talktalk = special_links["has_talktalk"]
place = special_links["has_place"]
```

---

## 4. Change the TalkTalk/Place detection path in analyzer.py

### File: `/home/jay/projects/InfoKeyword/worker/pipeline/analyzer.py`

Currently, in `_analyze_single_blog()`:
```python
# Current: detected via the functions in external_links.py
ext_links = detect_external_links(external_links)
has_talktalk = detect_talktalk(external_links)
has_place = detect_place(external_links)
```

Change: `blog_content.py` already returns `has_talktalk` and `has_place`, so use those values directly:
```python
ext_links = detect_external_links(external_links)
has_talktalk = content.get("has_talktalk", False)  # detected directly from HTML in blog_content.py
has_place = content.get("has_place", False)          # detected directly from HTML in blog_content.py
```

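Also split the reason into separate codes:
```python
# Before: external links, TalkTalk, and Place lumped into a single reason
if ext_links or has_talktalk or has_place:
    reasons.append("external_links")

# After: one reason per signal (so the Korean label can show the specific cause)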
if ext_links:
    reasons.append("external_links")
if has_talktalk:
    reasons.append("naver_talktalk")
if has_place:
    reasons.append("naver_place")
```
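
The split can also be isolated as a small pure function, which makes it easy to unit-test without crawling. `link_reasons` is a hypothetical helper name, not something that exists in `analyzer.py`; the reason strings are the ones listed above, and the sample URL is a placeholder.

```python
def link_reasons(ext_links: list[str], has_talktalk: bool, has_place: bool) -> list[str]:
    """Return one reason code per link signal that fired, in a fixed order."""
    reasons: list[str] = []
    if ext_links:
        reasons.append("external_links")
    if has_talktalk:
        reasons.append("naver_talktalk")
    if has_place:
        reasons.append("naver_place")
    return reasons


if __name__ == "__main__":
    print(link_reasons([], False, True))                        # ['naver_place']
    print(link_reasons(["https://example.com/x"], True, False)) # ['external_links', 'naver_talktalk']
```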

---

## 5. Update the frontend reason-to-Korean label mapping

### File: `/home/jay/projects/InfoKeyword/src/app/report/[id]/page.tsx`

Add to and modify the existing mapping:
```tsx
const reasonLabelMap: Record<string, string> = {
  "phone_or_address": "전화번호/주소 감지",
  "external_links": "외부링크 감지",
  "naver_talktalk": "네이버 톡톡 감지",
  "naver_place": "네이버 플레이스 감지",
  "attachment": "첨부파일 감지",
  "image_phone_or_address": "이미지 내 전화번호/주소",
  // "llm_promotional" removed
};
```

---

## 6. Verification

```bash
cd /home/jay/projects/InfoKeyword
source /home/jay/workspace/.env.keys
unset CLAUDECODE
python3 -c "
import asyncio
from worker.crawler.blog_search import search_blogs
from worker.crawler.blog_content import get_blog_content
from worker.analyzer.external_links import detect_external_links

async def test():
    # Test 1: the CCL link false positive is gone
    blogs = await search_blogs('가공육 암')
    for b in blogs[:3]:
        if not b['is_naver_blog']: continue
        content = await get_blog_content(b['url'])
        ext = detect_external_links(content['external_links'])
        print(f'rank={b[\"rank\"]} ext={len(ext)} talk={content[\"has_talktalk\"]} place={content[\"has_place\"]}')
        for u in ext:
            print(f'  {u[:80]}')

    # Test 2: real promotional links are still detected
    blogs2 = await search_blogs('암보험 추천')
    for b in blogs2:
        if b['is_ad'] or not b['is_naver_blog']: continue
        content = await get_blog_content(b['url'])
        ext = detect_external_links(content['external_links'])
        if ext:
            print(f'rank={b[\"rank\"]} ext={len(ext)} (detected as expected)')
            break

asyncio.run(test())
"
```

Expected:
- "가공육 암", ranks 1 and 2: ext=0 (CCL excluded) ✅
- "암보험 추천", promotional post: ext>0 (vo.la and similar links detected as expected) ✅

---

## 7. Restart the worker and frontend

```bash
# Worker
cd /home/jay/projects/InfoKeyword
fuser 8100/tcp 2>/dev/null | xargs -r kill
sleep 1
source /home/jay/workspace/.env.keys
unset CLAUDECODE
nohup python3 -m uvicorn worker.main:app --host 0.0.0.0 --port 8100 > /tmp/infokeyword-worker.log 2>&1 &
sleep 2
curl -s http://localhost:8100/health

# Frontend
npm run build
fuser 3100/tcp 2>/dev/null | xargs -r kill
sleep 1
PORT=3100 nohup npm run start > /tmp/infokeyword-next.log 2>&1 &
sleep 3
curl -s -o /dev/null -w '%{http_code}' http://localhost:3100
```
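
The fixed `sleep` calls above assume each service is up within a few seconds. If startup is slower, polling is more reliable. The sketch below is a hypothetical helper (not part of the project) that retries a probe until it succeeds; the `/health` URL and port 8100 are taken from the commands above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response
        return False


def wait_until(probe: Callable[[], bool], attempts: int = 10, delay: float = 1.0) -> bool:
    """Call `probe` up to `attempts` times, sleeping `delay` seconds between tries."""
    for i in range(attempts):
        if probe():
            return True
        if i < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    ok = wait_until(lambda: http_ok("http://localhost:8100/health"))
    print("worker healthy" if ok else "worker did not come up")
```

Taking the probe as a callable keeps the retry loop testable without a running server.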

---

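## Files to modify
1. `worker/pipeline/analyzer.py`: remove the LLM condition, change the TalkTalk/Place detection path, and split the reasons
2. `worker/crawler/blog_content.py`: exclude template/banner-ad links and detect Place/TalkTalk directly from the HTML
3. `src/app/report/[id]/page.tsx`: update the reason-to-Korean label mapping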