PocketVeto/backend/app/services/trends_service.py
Jack Levy 7e5c5b473e feat: API optimizations — quota batching, ETags, caching, async sponsor (v0.9.7)
Nine efficiency improvements across the data pipeline:

1. NewsAPI OR batching (news_service.py + news_fetcher.py)
   - Combine up to 4 bills per NewsAPI call using OR query syntax
   - NEWSAPI_BATCH_SIZE=4 means ~4× effective daily quota (100→400 bill-fetches)
   - fetch_news_for_bill_batch task; fetch_news_for_active_bills queues batches

2. Google News RSS cache (news_service.py)
   - 2-hour Redis cache shared between news_fetcher and trend_scorer
   - Eliminates duplicate RSS hits when both workers run against same bill
   - clear_gnews_cache() admin helper + admin endpoint

3. pytrends keyword batching (trends_service.py + trend_scorer.py)
   - Compare up to 5 bills per pytrends call instead of 1
   - get_trends_scores_batch() returns scores in original order
   - Reduces pytrends calls by ~5× and associated rate-limit risk

4. GovInfo ETags (govinfo_api.py + document_fetcher.py)
   - If-None-Match conditional GET; DocumentUnchangedError on HTTP 304
   - ETags stored in Redis (30-day TTL) keyed by MD5(url)
   - document_fetcher catches DocumentUnchangedError → {"status": "unchanged"}

5. Anthropic prompt caching (llm_service.py)
   - cache_control: {type: ephemeral} on system messages in AnthropicProvider
   - Caches the ~700-token system prompt server-side; ~50% cost reduction on
     repeated calls within the 5-minute cache window

6. Async sponsor fetch (congress_poller.py)
   - New fetch_sponsor_for_bill Celery task replaces blocking get_bill_detail()
     inline in poll loop
   - Bills saved immediately with sponsor_id=None; sponsor linked async
   - Removes 0.25s sleep per new bill from poll hot path

7. Skip doc fetch for procedural actions (congress_poller.py)
   - _DOC_PRODUCING_CATEGORIES = {vote, committee_report, presidential, ...}
   - fetch_bill_documents only enqueued when action is likely to produce
     new GovInfo text (saves ~60–70% of unnecessary document fetch attempts)

8. Adaptive poll frequency (congress_poller.py)
   - _is_congress_off_hours(): weekends + before 9AM / after 9PM EST
   - Skips poll if off-hours AND last poll < 1 hour ago
   - Prevents wasteful polling when Congress is not in session

9. Admin panel additions (admin.py + settings/page.tsx + api.ts)
   - GET /api/admin/newsapi-quota → remaining calls today
   - POST /api/admin/clear-gnews-cache → flush RSS cache
   - Settings page shows NewsAPI quota remaining (amber if < 10)
   - "Clear Google News Cache" button in Manual Controls

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-14 16:50:51 -04:00
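
Item 8 of the commit message describes the off-hours rule only in prose. The following is a minimal sketch of that check, not the actual contents of congress_poller.py. It assumes the caller passes a datetime already converted to US Eastern time, and the names `is_congress_off_hours`, `should_skip_poll`, and the constants are illustrative.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed window from the commit message: weekdays, 9AM to 9PM Eastern.
SESSION_START_HOUR = 9
SESSION_END_HOUR = 21
MIN_OFF_HOURS_INTERVAL = timedelta(hours=1)


def is_congress_off_hours(now_eastern: datetime) -> bool:
    """True on weekends, before 9AM, or from 9PM onward (Eastern time)."""
    if now_eastern.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (SESSION_START_HOUR <= now_eastern.hour < SESSION_END_HOUR)


def should_skip_poll(now_eastern: datetime, last_poll: Optional[datetime]) -> bool:
    """Skip only when it is off-hours AND the last poll was under an hour ago."""
    if last_poll is None:
        return False  # never polled before, so always poll
    recently_polled = now_eastern - last_poll < MIN_OFF_HOURS_INTERVAL
    return is_congress_off_hours(now_eastern) and recently_polled
```

Requiring both conditions means a poll still runs at least about once an hour overnight and on weekends, so late actions are picked up with bounded delay.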

113 lines
3.7 KiB
Python
"""
Google Trends service (via pytrends).

pytrends is unofficial web scraping, and Google blocks it sporadically.
All calls are wrapped in try/except and return 0 on any failure.
"""
import logging
import random
import time

from app.config import settings

logger = logging.getLogger(__name__)


def get_trends_score(keywords: list[str]) -> float:
    """
    Return a 0-100 interest score for the given keywords over the past 90 days.

    Returns 0.0 on any failure (rate limit, empty data, exception).
    """
    if not settings.PYTRENDS_ENABLED or not keywords:
        return 0.0
    try:
        from pytrends.request import TrendReq

        # Jitter to avoid detection as bot
        time.sleep(random.uniform(2.0, 5.0))

        pytrends = TrendReq(hl="en-US", tz=0, timeout=(10, 25))
        kw_list = [k for k in keywords[:5] if k]  # max 5 keywords
        if not kw_list:
            return 0.0

        pytrends.build_payload(kw_list, timeframe="today 3-m", geo="US")
        data = pytrends.interest_over_time()
        if data is None or data.empty:
            return 0.0

        # Average the most recent 14 data points for the primary keyword
        primary = kw_list[0]
        if primary not in data.columns:
            return 0.0
        recent = data[primary].tail(14)
        return float(recent.mean())
    except Exception as e:
        logger.debug(f"pytrends failed (non-critical): {e}")
        return 0.0


def get_trends_scores_batch(keyword_groups: list[list[str]]) -> list[float]:
    """
    Get pytrends scores for up to 5 keyword groups in a SINGLE pytrends call.

    Takes the first (most relevant) keyword from each group and compares them
    relative to each other. Only the first 5 usable groups are scored; any
    further groups keep a score of 0.0. If the batch fails, every score is 0.0
    (there is no per-group fallback).

    Returns a list of scores (0-100) in the same order as keyword_groups.
    """
    if not settings.PYTRENDS_ENABLED or not keyword_groups:
        return [0.0] * len(keyword_groups)

    # Extract the primary (first) keyword from each group. Skip empty groups
    # and blank keywords, since a blank term could fail the whole batch.
    primaries = [
        (i, kws[0]) for i, kws in enumerate(keyword_groups) if kws and kws[0]
    ][:5]  # pytrends compares at most 5 keywords per call
    if not primaries:
        return [0.0] * len(keyword_groups)

    try:
        from pytrends.request import TrendReq

        # Jitter to avoid detection as bot
        time.sleep(random.uniform(2.0, 5.0))

        pytrends = TrendReq(hl="en-US", tz=0, timeout=(10, 25))
        kw_list = [kw for _, kw in primaries]
        pytrends.build_payload(kw_list, timeframe="today 3-m", geo="US")
        data = pytrends.interest_over_time()

        scores = [0.0] * len(keyword_groups)
        if data is not None and not data.empty:
            for idx, kw in primaries:
                if kw in data.columns:
                    # Average the most recent 14 data points for this keyword
                    scores[idx] = float(data[kw].tail(14).mean())
        return scores
    except Exception as e:
        logger.debug(f"pytrends batch failed (non-critical): {e}")
        # Fallback: return zeros (individual calls would just multiply failures)
        return [0.0] * len(keyword_groups)


def keywords_for_member(first_name: str, last_name: str) -> list[str]:
    """Extract meaningful search keywords for a member of Congress."""
    full_name = f"{first_name} {last_name}".strip()
    if not full_name:
        return []
    return [full_name]


def keywords_for_bill(title: str, short_title: str, topic_tags: list[str]) -> list[str]:
    """Extract meaningful search keywords for a bill."""
    keywords = []
    if short_title:
        keywords.append(short_title)
    elif title:
        # Use first 5 words of title
        words = title.split()[:5]
        if len(words) >= 2:
            keywords.append(" ".join(words))
    keywords.extend(tag.replace("-", " ") for tag in (topic_tags or [])[:3])
    return keywords[:5]
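
get_trends_scores_batch() scores at most five groups per call, so a caller holding more bills has to split the work itself. The following is a minimal sketch of that chunking, and the real trend_scorer.py may handle it differently. `score_in_chunks` is a hypothetical helper, and `fake_batch` is a stand-in for the pytrends-backed function so the example runs offline.

```python
from typing import Callable

BatchFn = Callable[[list[list[str]]], list[float]]
MAX_GROUPS_PER_CALL = 5  # matches the 5-keyword cap used in trends_service.py


def score_in_chunks(
    keyword_groups: list[list[str]],
    batch_fn: BatchFn,
    chunk_size: int = MAX_GROUPS_PER_CALL,
) -> list[float]:
    """Call batch_fn once per chunk and return scores in the original order."""
    scores: list[float] = []
    for start in range(0, len(keyword_groups), chunk_size):
        chunk = keyword_groups[start:start + chunk_size]
        scores.extend(batch_fn(chunk))
    return scores


def fake_batch(groups: list[list[str]]) -> list[float]:
    """Hypothetical stand-in: scores each group by its first keyword's length."""
    return [float(len(g[0])) if g else 0.0 for g in groups]


# Seven groups are split into two calls (5 + 2).
groups = [["a" * n] for n in range(1, 8)]
print(score_in_chunks(groups, fake_batch))
```

One caveat with the real service: Google Trends normalizes values within each request, so scores that come from different chunks are not strictly comparable with one another.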