dotfiles/python/ebook_search/eval/dataset.py
Richie 09d963ba34
fix(ebook-search): skip comment lines in gold query loader and realign tests
load_gold_queries now skips blank and `//` comment lines so the committed
section separator in queries.jsonl no longer breaks dataset/load-test loading.

Update tests left stale by the search refactor (6bc3011):
- pass the now-required rank_constant to reciprocal_rank_fusion
- expect bm25_candidates to receive the full query and drop the removed
  "BM25 query preparation" timing step
- assert reranking is enabled by default
2026-06-21 14:45:31 -04:00
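For context on the `rank_constant` argument mentioned in the commit message, here is a minimal sketch of textbook reciprocal rank fusion. The function name `rrf_sketch`, its signature, the document ids, and the value 60 are illustrative assumptions; the repository's actual `reciprocal_rank_fusion` is not shown here and may differ.

```python
from collections import defaultdict


def rrf_sketch(rankings: list[list[str]], rank_constant: int) -> list[tuple[str, float]]:
    """Hypothetical helper: fuse ranked id lists with score = sum(1 / (rank_constant + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, so the top hit contributes 1 / (rank_constant + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up rankings from two retrievers (ids are placeholders).
bm25_ranking = ["a", "b", "c"]
vector_ranking = ["b", "c", "a"]
fused = rrf_sketch([bm25_ranking, vector_ranking], rank_constant=60)
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)
```

A larger `rank_constant` flattens the difference between top and lower ranks, which is one plausible reason to make callers pass it explicitly rather than rely on a default.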

48 lines
1.7 KiB
Python

"""Shared query set loading for evaluation and load testing.
Each JSONL record has a ``query`` and an optional reference ``answer``. ``answerable``
marks whether the query should be answerable from the library (false for out-of-corpus
"garbage" queries used to test the refusal path). Relevance for retrieval metrics is
labeled at source (book) granularity in ``relevant_sources``; source titles must match
``ebook_source.title`` values for the indexed corpus.
"""
from __future__ import annotations

import json
from dataclasses import dataclass
from pathlib import Path

DEFAULT_QUERIES_PATH = Path(__file__).parent / "data" / "queries.jsonl"


@dataclass(frozen=True)
class GoldQuery:
    """One labeled query shared by the eval and load-test tools."""

    query: str
    answer: str | None
    answerable: bool
    relevant_sources: tuple[str, ...]
    relevant_substrings: tuple[str, ...]


def load_gold_queries(path: Path = DEFAULT_QUERIES_PATH) -> list[GoldQuery]:
    """Load labeled queries from a JSONL file. Blank lines and ``//`` comment lines are skipped."""
    queries: list[GoldQuery] = []
    for line in path.read_text(encoding="utf-8").splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("//"):
            continue
        record = json.loads(stripped)
        queries.append(
            GoldQuery(
                query=str(record["query"]),
                answer=record.get("answer"),
                answerable=bool(record.get("answerable", True)),
                relevant_sources=tuple(record.get("relevant_sources", ())),
                relevant_substrings=tuple(record.get("relevant_substrings", ())),
            )
        )
    return queries
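The skip rule and field defaults above can be exercised on a small sample file. The sketch below is self-contained, so it uses a hypothetical stand-in, `parse_records`, that applies the same filtering and defaults as `load_gold_queries` but returns plain dicts; the sample queries and the book title are made up for illustration and are not taken from the committed `queries.jsonl`.

```python
import json
import tempfile
from pathlib import Path

# Made-up sample content: two comment separators, one blank line, two records.
SAMPLE_JSONL = """\
// --- in-corpus queries ---
{"query": "Who wrote Dune?", "answer": "Frank Herbert", "relevant_sources": ["Dune"]}

// --- out-of-corpus queries ---
{"query": "What is the wifi password?", "answerable": false}
"""


def parse_records(path: Path) -> list[dict]:
    """Hypothetical stand-in for load_gold_queries: same skip rule, plain dicts out."""
    records = []
    for line in path.read_text(encoding="utf-8").splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("//"):
            continue  # blank line or section-separator comment
        record = json.loads(stripped)
        records.append(
            {
                "query": str(record["query"]),
                "answer": record.get("answer"),  # optional, None when absent
                "answerable": bool(record.get("answerable", True)),
                "relevant_sources": tuple(record.get("relevant_sources", ())),
                "relevant_substrings": tuple(record.get("relevant_substrings", ())),
            }
        )
    return records


with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "queries.jsonl"
    sample_path.write_text(SAMPLE_JSONL, encoding="utf-8")
    parsed = parse_records(sample_path)

for record in parsed:
    print(record["query"], record["answerable"], record["relevant_sources"])
```

Only lines that start with `//` after stripping are treated as comments, so a trailing `//` on a record line would still reach `json.loads` and fail; the sample keeps comments on their own lines for that reason.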