jobcurator

PyPI Publish Lint & CI CodeQL Security Scan Docs Deployment PyPI Version Python Versions MIT License

πŸ€— Welcome to the jobcurator library

jobcurator is an open-source Machine Learning library to clean, normalize, structure, compress, and sample large datasets & feeds of job offers.

πŸ’‘ Motivation: Why jobcurator?

This library exists because job feeds in the aggregator world and the programmatic job distribution world are extremely noisy, redundant, low quality, and not normalized. jobcurator was created to take raw job firehose feeds and turn them into high-quality, diversified and deduplicated structured job data β€” before they hit searching, ranking, matching or bidding engines.

πŸ“¬ Contact

For questions, ideas, or coordination around larger changes:

Primary maintainer πŸ“§ mouhidine.seiv@hrflow.ai

✨ Available features:

βš–οΈ Backends comparison

| Feature | default_hash | minhash_hash | sklearn_hash | faiss_hash |
| --- | --- | --- | --- | --- |
| Algorithm | SimHash + LSH (Β± Multi-probe) | MinHash + Jaccard LSH (Β± Multi-probe) | HashingVectorizer + NearestNeighbors (cosine) | FAISS IndexFlatL2 on composite vectors |
| Similarity | Hamming distance on SimHash bits | Jaccard on token/shingle sets | Cosine distance | L2 distance |
| Use case | General-purpose hash+geo dedupe/compression | Robust near-dupe on noisy / reordered text | Text-heavy feeds + sklearn-based experimentation | Huge catalogs, low latency, NN-heavy workloads |
| Dependencies | None | None | scikit-learn + numpy | faiss-cpu + numpy |
| Dataset size | ~1k β†’ ~200k | ~1k β†’ ~200k | ~1k β†’ ~100k | ~50k β†’ 1M+ |
| Speed | Fast on CPU for small–medium datasets | Slower than default_hash (more hashing work) | Moderate, depends on sparse ops & RAM | Very fast for large NN queries once indexed |
| Explicit geo constraint | Yes (3D distance filter in clustering) | Yes (3D distance filter in clustering) | No (only via tokens) | No (geo only affects L2 distance) |
| 3D location use | Hard geo radius (max_cluster_distance_km) | Hard geo radius (max_cluster_distance_km) | Encoded as coarse x/y/z tokens | Normalized (x, y, z) directly in vector |
| Text encoder | SimHash on title + text | Word n-gram shingles on title + text | Text to sparse hashed vector | Encoded via signature bits |
| Categories encoder | In meta-hash (part of signature) | Added as shingles in MinHash set | As extra tokens to HashingVectorizer | As β€œrichness” feature in vector |
| Salary encoder | Bucketed into meta-hash | Bucketed tokens in MinHash set | Via numeric features (quality, etc.) | Indirect (via signature / numeric features) |
| Main threshold | d_sim_threshold (Hamming on SimHash) | jaccard_threshold (min Jaccard) | Internal NN radius (not exposed in API) | d_sim_threshold (max L2 in FAISS space) |
| Multi-probe support | Yes (use_multiprobe, max_multiprobe_flips) | Yes (use_multiprobe, max_multiprobe_flips) | No | No |
| Outlier filter | Optional (use_outlier_filter + IsolationForest) | Optional (same) | Optional (same) | Optional (same) |

No dense text embeddings. Hash-based + classic ML only.

πŸ“‹ TODO


πŸ—‚οΈ Repository structure

jobcurator/
β”œβ”€β”€ pyproject.toml
β”œβ”€β”€ README.md                 # main intro, installation, backends, examples
β”œβ”€β”€ README_ADVANCED.md        # incremental pipelines + data model (advanced doc)
β”œβ”€β”€ LICENSE
β”œβ”€β”€ .gitignore
β”œβ”€β”€ test.py                   # unit tests for JobCurator (single batch)
β”œβ”€β”€ test_incremental.py       # CLI demo for incremental pipeline (local/sql)
β”‚
β”œβ”€β”€ src/
β”‚   └── jobcurator/
β”‚       β”œβ”€β”€ __init__.py        # exports JobCurator, Job, Category, etc.
β”‚       β”œβ”€β”€ models.py          # Job, Category, Location3DField, SalaryField
β”‚       β”œβ”€β”€ curator.py         # JobCurator class (dedupe + compression)
β”‚       β”œβ”€β”€ hash_utils.py      # SimHash, MinHash, LSH, distances, signatures
β”‚       β”œβ”€β”€ cuckoo_filter.py   # CuckooFilter implementation
β”‚       β”œβ”€β”€ sklearn_backend.py # sklearn_hash backend helpers (if used)
β”‚       β”œβ”€β”€ faiss_backend.py   # faiss_hash backend helpers (if used)
β”‚       └── storage/
β”‚           β”œβ”€β”€ __init__.py
β”‚           β”œβ”€β”€ base.py        # LightJob, StoreDB, process_batch, global_reselect
β”‚           β”œβ”€β”€ sql_store.py   # SqlStoreDB (compressed_jobs + dedupe_state tables)
β”‚           └── local_store.py # LocalFileStoreDB (JSONL + pickle)
β”‚
└── tests/
    └── __init__.py

πŸš€ Installation

To install for local development:

git clone https://github.com/<your-username>/jobcurator.git
cd jobcurator
pip install -e .

To reinstall for local development:

pip uninstall -y jobcurator  # ignore error if not installed
pip install -e .

(coming soon) To install the package once published to PyPI:

pip install jobcurator

Optional extras:

pip install scikit-learn faiss-cpu

πŸ§ͺ Testing code

From the repository root, run test.py:

# 1) Default backend (SimHash + LSH + geo), keep ~50%, preview 10 jobs (capped to len(jobs))
python3 test.py                        # defaults: ratio=0.5, n-preview-jobs=10

# 2) Default backend (SimHash + LSH + geo), more aggressive compression: keep ~30%, preview 8 jobs
python3 test.py --backend default_hash --ratio 0.3 --n-preview-jobs 8

# 3) MinHash backend (MinHash + Jaccard LSH + geo), keep ~50%, preview 5
python3 test.py --backend minhash_hash --ratio 0.5 --n-preview-jobs 5

# 4) sklearn backend (HashingVectorizer + NearestNeighbors), keep ~50%, preview 5
#    (requires: pip install scikit-learn)
python3 test.py --backend sklearn_hash --ratio 0.5 --n-preview-jobs 5

# 5) FAISS backend (signature bits + 3D loc + categories), keep ~40%, preview 5
#    (requires: pip install faiss-cpu)
python3 test.py --backend faiss_hash --ratio 0.4 --n-preview-jobs 5

# 6) Short option for preview:
python3 test.py -n 5 --backend default_hash --ratio 0.5
python3 test.py -n 5 --backend minhash_hash  --ratio 0.4

🧩 Public API & Example usage

Import

from jobcurator import JobCurator, Job, Category, SalaryField, Location3DField
from datetime import datetime

Basic JobCurator usage


# 1) Build some jobs

jobs = [
    Job(
        id="job-1",
        title="Senior Backend Engineer",
        text="Full description...",
        categories={
            "job_function": [
                Category(
                    id="backend",
                    label="Backend",
                    level=1,
                    parent_id="eng",
                    level_path=["Engineering", "Software", "Backend"],
                )
            ]
        },
        location=Location3DField(
            lat=48.8566,
            lon=2.3522,
            alt_m=35,
            city="Paris",
            country_code="FR",
        ),
        salary=SalaryField(
            min_value=60000,
            max_value=80000,
            currency="EUR",
            period="year",
        ),
        company="HrFlow.ai",
        contract_type="Full-time",
        source="direct",
        created_at=datetime.utcnow(),
    ),
]

# 2) Choose a backend

# ======================================================
# Option 1: "default_hash"
#      SimHash + LSH (+ optional Multi-probe) + geo distance
#      (no extra dependencies)
# ======================================================

curator_default = JobCurator(
    # Global parameters (used by all backends)
    ratio=0.5,                     # keep ~50% of jobs
    alpha=0.6,                     # quality vs diversity tradeoff
    max_per_cluster_in_pool=3,     # max jobs per cluster entering pool
    backend="default_hash",
    use_outlier_filter=False,      # set True to enable IsolationForest (if sklearn installed)
    outlier_contamination=0.05,    # only used when use_outlier_filter=True

    # Backend-specific: default_hash
    d_sim_threshold=20,            # max Hamming distance on SimHash
    max_cluster_distance_km=50.0,  # max geo distance (km) within a cluster
    # jaccard_threshold is ignored by default_hash

    # Multi-probe LSH (used by default_hash + minhash_hash)
    use_multiprobe=True,
    max_multiprobe_flips=1,        # small value = light extra recall
)


# ======================================================
# Option 2: "minhash_hash"
#      MinHash + Jaccard LSH on shingles (text + cats + coarse loc + salary)
#      + optional Multi-probe + geo distance
#      (no extra dependencies)
# ======================================================

curator_minhash = JobCurator(
    # Global parameters
    ratio=0.5,
    alpha=0.6,
    max_per_cluster_in_pool=3,
    backend="minhash_hash",
    use_outlier_filter=False,
    outlier_contamination=0.05,

    # Backend-specific: minhash_hash
    jaccard_threshold=0.8,         # min Jaccard similarity between jobs in a cluster
    max_cluster_distance_km=50.0,  # geo radius (km) for clusters
    # d_sim_threshold is ignored by minhash_hash

    # Multi-probe LSH for MinHash bands
    use_multiprobe=True,
    max_multiprobe_flips=1,
)


# ======================================================
# Option 3: "sklearn_hash"
#      HashingVectorizer + NearestNeighbors (cosine radius)
#      (requires scikit-learn)
# ======================================================

# pip install scikit-learn
curator_sklearn = JobCurator(
    # Global parameters
    ratio=0.5,
    alpha=0.6,
    max_per_cluster_in_pool=3,
    backend="sklearn_hash",
    use_outlier_filter=True,       # enable IsolationForest pre-filter
    outlier_contamination=0.05,    # proportion of jobs flagged as outliers

    # Backend-specific:
    # d_sim_threshold, max_cluster_distance_km, jaccard_threshold,
    # use_multiprobe, max_multiprobe_flips are all ignored by sklearn_hash
)


# ======================================================
# Option 4: "faiss_hash"
#      FAISS on [signature bits + 3D location + category richness]
#      (requires faiss-cpu)
# ======================================================

# pip install faiss-cpu
curator_faiss = JobCurator(
    # Global parameters
    ratio=0.5,
    alpha=0.6,
    max_per_cluster_in_pool=3,
    backend="faiss_hash",
    use_outlier_filter=False,
    outlier_contamination=0.05,

    # Backend-specific: faiss_hash
    d_sim_threshold=20,            # approx max L2 distance in FAISS space
    # max_cluster_distance_km, jaccard_threshold, use_multiprobe,
    # max_multiprobe_flips are ignored by faiss_hash
)


# 3) Compute the results

import time

t0 = time.perf_counter()
compressed_jobs = curator_default.dedupe_and_compress(jobs)
elapsed_ms = (time.perf_counter() - t0) * 1000

print(f"{len(jobs)} β†’ {len(compressed_jobs)} jobs kept")
for j in compressed_jobs:
    print(j.id, j.title, j.location.city, f"quality={j.quality:.3f}")

curator_default.print_compression_summary(n_preview=10, t_ms=elapsed_ms)
curator_default.print_jobs_summary(compressed_jobs, n_preview=10, label="Selected")

1) print_compression_summary(n_preview: int = 0, t_ms: float = 0.0)

Shows the effective keep ratio, backend, timing, and an ASCII table of length/quality stats for all vs selected.

Example output

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  πŸ”Ž preview: 10 | 🎯 ratio: 0.40 | 🧠 backend: default_hash | ⏱️  time: 82.4 ms β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Total jobs: 12000 | Selected: 4800 (40.0% kept)
+----------+-------+--------+--------+--------+--------+
| Dataset  | Count |  Len ΞΌ |  Len Οƒ | Qual ΞΌ | Qual Οƒ |
+==========+=======+========+========+========+========+
| All jobs | 12000 |  92.14 |  37.21 | 0.644  | 0.112  |
| Selected |  4800 | 106.38 |  29.07 | 0.711  | 0.087  |
+----------+-------+--------+--------+--------+--------+

2) print_jobs_summary(jobs, n_preview=10, label="jobs set")

Previews the top-N by current order (you can pass curator.selected_jobs) with per-row Quality / Diversity / Selection and a short canonical hash.

Columns: ID | Title | City | Quality | Diversity | Selection | Hash

Incremental JobCurator

* SQL store
```python
import psycopg2

from jobcurator import JobCurator
from jobcurator.storage import SqlStoreDB, process_batch, global_reselect_in_store

conn = psycopg2.connect("dbname=… user=… password=… host=…")
store = SqlStoreDB(conn)

curator = JobCurator(backend="default_hash", ratio=0.5, alpha=0.6)

compressed_jobs1 = process_batch(store, jobs1, curator)
compressed_jobs2 = process_batch(store, jobs2, curator)

global_reselect_in_store(store, ratio=0.5, alpha=0.6)
```

* Local store
```python
from jobcurator import JobCurator
from jobcurator.storage import LocalFileStoreDB, process_batch, global_reselect_in_store

store = LocalFileStoreDB()

curator = JobCurator(backend="default_hash", ratio=0.5, alpha=0.6)

compressed_jobs1 = process_batch(store, jobs1, curator)
compressed_jobs2 = process_batch(store, jobs2, curator)

global_reselect_in_store(store, ratio=0.5, alpha=0.6)
```


🧱 Core Concepts

Job schema

A Job is a structured object with:

Category schema

A Category is a hierarchical node:

Multiple dimensions (e.g. job_function, industry, seniority) can coexist in categories:

job.categories = {
    "job_function": [Category(...), ...],
    "industry": [Category(...), ...],
}

Location schema with 3D coordinates

Location3DField:

These 3D coordinates are used to compute actual distances between cities and avoid merging jobs that are geographically too far when clustering.
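As a back-of-the-envelope sketch of the idea (the field names come from Location3DField above, but the library's exact conversion and distance function are not shown here), lat/lon/altitude can be mapped to Earth-centered (x, y, z) coordinates and compared with a straight-line distance:

```python
import math

EARTH_RADIUS_KM = 6371.0

def to_xyz(lat: float, lon: float, alt_m: float = 0.0) -> tuple:
    # Convert lat/lon/altitude to Earth-centered Cartesian coordinates, in km.
    r = EARTH_RADIUS_KM + alt_m / 1000.0
    phi, lam = math.radians(lat), math.radians(lon)
    return (r * math.cos(phi) * math.cos(lam),
            r * math.cos(phi) * math.sin(lam),
            r * math.sin(phi))

def distance_km(a: tuple, b: tuple) -> float:
    # Straight-line (chord) distance; for city-scale radii such as
    # max_cluster_distance_km=50.0 it is nearly identical to the
    # great-circle distance.
    return math.dist(a, b)

paris = to_xyz(48.8566, 2.3522, 35)
london = to_xyz(51.5074, -0.1278, 11)
print(f"Paris–London: {distance_km(paris, london):.0f} km")
```

With a hard radius like max_cluster_distance_km=50.0, Paris and London (roughly 340 km apart) would never land in the same cluster, whatever their text similarity.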

Salary schema

SalaryField:

Salary is used both in completion/quality scoring and in the exact/meta hashes (bucketed).
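For intuition, salary bucketing might look like the following sketch (the bucket width and the midpoint rule are assumptions for illustration, not the library's actual values):

```python
def salary_bucket(min_value: float, max_value: float, bucket_size: int = 10_000) -> int:
    # Map a salary range to a coarse bucket id so that near-equal salaries
    # contribute the same token to the meta-hash. bucket_size is hypothetical.
    midpoint = (min_value + max_value) / 2
    return int(midpoint // bucket_size)

# Two postings with slightly different ranges fall into the same bucket:
assert salary_bucket(60_000, 80_000) == salary_bucket(62_000, 78_000)
```

Bucketing trades precision for robustness: small salary differences no longer break exact/meta-hash matches between otherwise identical postings.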

CuckooFilter (approximate β€œseen before”)

The library includes a simple CuckooFilter:

compressed = curator.dedupe_and_compress(jobs, seen_filter=seen_filter)

Where seen_filter is typically an instance of jobcurator.CuckooFilter.
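To make the data structure concrete, here is a toy cuckoo filter sketch; it illustrates the fingerprint-plus-two-buckets idea but is not jobcurator's CuckooFilter (see src/jobcurator/cuckoo_filter.py for the real one), and all parameter values below are illustrative:

```python
import random

class MiniCuckooFilter:
    """Toy cuckoo filter: approximate set membership in O(1) space per item."""

    def __init__(self, capacity: int = 1024, bucket_size: int = 4, max_kicks: int = 500):
        assert capacity & (capacity - 1) == 0, "capacity must be a power of two"
        self.mask = capacity - 1
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(capacity)]

    def _fingerprint(self, item) -> int:
        return (hash(("fp", item)) & 0xFF) or 1  # 8-bit fingerprint, never 0

    def _alt(self, index: int, fp: int) -> int:
        # XOR trick: from either of an item's two buckets, derive the other.
        return index ^ (hash(("alt", fp)) & self.mask)

    def add(self, item) -> bool:
        fp = self._fingerprint(item)
        i1 = hash(item) & self.mask
        for i in (i1, self._alt(i1, fp)):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        i = random.choice((i1, self._alt(i1, fp)))
        for _ in range(self.max_kicks):  # evict a resident fp and relocate it
            j = random.randrange(self.bucket_size)
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._alt(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is effectively full

    def __contains__(self, item) -> bool:
        fp = self._fingerprint(item)
        i1 = hash(item) & self.mask
        return fp in self.buckets[i1] or fp in self.buckets[self._alt(i1, fp)]
```

Like any cuckoo/Bloom-style filter, it can return false positives (a never-seen job reported as seen) but never false negatives, which is the right trade-off for a β€œskip already-seen jobs” pre-filter.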

JobCurator parameters

JobCurator(
    ratio: float = 1.0,              # default compression ratio
    alpha: float = 0.6,              # quality vs diversity weight
    max_per_cluster_in_pool: int = 3,
    d_sim_threshold: int = 20,       # SimHash Hamming threshold for clustering
    max_cluster_distance_km: float = 50.0,  # max distance between cities in same cluster
)

More params:

| Param | Where | Type | Default | Description |
| --- | --- | --- | --- | --- |
| ratio | JobCurator(...) / dedupe_and_compress() | float ∈ [0,1] | 1.0 | Target keep ratio after dedupe + selection. |
| alpha | JobCurator(...) | float ∈ [0,1] | 0.6 | Trade-off in selection_score. |
| greedy_diversity | dedupe_and_compress() | bool | False | Recompute diversity on the final set with robust scaling (recommended for quality-sensitive runs). |
| max_per_cluster_in_pool | JobCurator(...) | int | 3 | Cap per cluster before global selection. |
| backend | JobCurator(...) | literal | "default_hash" | Hashing/clustering strategy. |
| use_outlier_filter | JobCurator(...) | bool | False | Optional IsolationForest pre-filter. |
| d_sim_threshold | JobCurator(...) | int | 20 | Hamming/L2 threshold (backend-specific). |
| jaccard_threshold | JobCurator(...) | float | 0.8 | MinHash LSH threshold. |


JobCurator Backends

You choose the dedup clustering strategy via:

JobCurator(backend=...)

Available backends:


βš™οΈ How It Works (High Level)

1. Preprocessing & scoring

2. Approximate β€œseen before” filter (CuckooFilter)

3. Exact hash dedup (strict duplicates)
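Strict-duplicate detection amounts to hashing a canonical form of each job and collapsing equal keys; a minimal sketch (the field set and normalization rules here are assumptions, not the library's exact recipe):

```python
import hashlib

def exact_hash(title: str, text: str, company: str) -> str:
    # Normalize (trim + lowercase) then hash, so that trivially
    # reformatted copies of the same posting collapse to one key.
    canonical = "|".join(s.strip().lower() for s in (title, text, company))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Whitespace and casing differences do not produce distinct keys:
assert exact_hash("Engineer", "Great role.", "ACME") == \
       exact_hash("  engineer", "great role.", "Acme")
```

This stage only removes byte-for-byte (after normalization) duplicates; near-duplicates are left for the signature and clustering stages below.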

4. Composite signature (no embeddings)

For each job, build a 128-bit signature:

This signature is used by the different backends.
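For intuition, a 128-bit SimHash-style signature can be sketched like this (whitespace tokenization and MD5 are illustrative choices; the library's encoder, which also folds in categories, location, and salary, differs in detail):

```python
import hashlib

def simhash128(text: str) -> int:
    # Classic SimHash: each token votes +1/-1 per bit position; the
    # signature keeps the sign of each position's total.
    weights = [0] * 128
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest(), "big")
        for i in range(128):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

near_a = simhash128("senior backend engineer python postgres remote")
near_b = simhash128("senior backend engineer python postgresql remote")
far_c = simhash128("marketing intern fashion brand paris office")

# Near-duplicate texts land close in Hamming space; unrelated ones far apart.
assert hamming(near_a, near_b) < hamming(near_a, far_c)
```

This is what makes a threshold like d_sim_threshold=20 meaningful: two jobs cluster together when their signatures differ in at most that many bits.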

5. Clustering (backend-dependent)

Depending on backend:

#### a. backend="default_hash" – SimHash + Multi-probe LSH + geo

#### b. backend="minhash_hash" – MinHash + Jaccard LSH + geo

#### c. backend="sklearn_hash" – HashingVectorizer + NearestNeighbors

#### d. backend="faiss_hash" – FAISS on signature + 3D loc + categories

6. Intra-cluster ranking

7. Global compression with diversity

Result: you keep fewer, higher-quality, and more diverse jobs, while avoiding duplicates (strict + near-duplicates), and optionally skipping already-seen jobs via CuckooFilter.


## πŸ› οΈ Advanced (High Level)

1. Diversity–aware selection

During compression we score each candidate:

selection_score = Ξ± * quality + (1 - Ξ±) * diversity

a. Greedy pass (fast)

While selecting, we compute each job’s min distance to any already-selected item, then robust-scale distances with quantiles (q_lo=0.10, q_hi=0.90) and label smoothing (Ξ΅=0.02) to avoid hard 0/1:

z = clamp01( (d - q10) / (q90 - q10 + 1e-6) )
diversity = Ξ΅ + (1 - 2Ξ΅) * z

Seed item gets diversity_score = 1.0 (helps robust scaling).
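The robust-scaling step above can be sketched as follows (the quantile computation here is a simple sort-and-index approximation, not the library's code):

```python
def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))

def diversity_scores(min_dists, q_lo=0.10, q_hi=0.90, eps=0.02):
    # min_dists[i] = job i's min distance to any already-selected item.
    # Robust-scale with the q10/q90 quantiles, then label-smooth with eps
    # so no score is a hard 0 or 1.
    s = sorted(min_dists)
    q_low = s[int(q_lo * (len(s) - 1))]
    q_high = s[int(q_hi * (len(s) - 1))]
    scores = []
    for d in min_dists:
        z = clamp01((d - q_low) / (q_high - q_low + 1e-6))
        scores.append(eps + (1 - 2 * eps) * z)
    return scores

print(diversity_scores([0.1, 0.5, 0.9, 0.2, 0.7]))
```

Quantile scaling makes the diversity term insensitive to outlier distances, and the eps smoothing keeps every candidate's selection_score responsive to its quality term.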

b. Greedy diversity re-compute (optional, slower, more faithful)

If you pass greedy_diversity=True, we run recompute_diversity_scores() on the final selected set:

Knobs (in recompute_diversity_scores):

selected = curator.dedupe_and_compress(
    jobs,
    ratio=0.4,                 # optional override
    greedy_diversity=True,     # ← new
    seen_filter=my_cuckoo,     # optional Bloom/Cuckoo/set-like
)

# Optional recalibration if you want to run it manually:
curator.recompute_diversity_scores(
    selected_jobs=selected,
    alpha=curator.alpha,
    distance_fn=curator._diversity_distance,
    k_nn=3,
    q_lo=0.10, q_hi=0.90,
    tau=0.15,
    label_eps=0.02,
    use_softmin=False,
)

### Incremental JobCurator Approach

Problem: You often receive batches of jobs over time (jobs1, jobs2, …) and want to:

The solution is:

  1. Use a global CuckooFilter to remember β€œseen” jobs (by exact hash).
  2. Use a pluggable StoreDB to store compressed jobs + CuckooFilter state.
  3. Use:

    • process_batch(StoreDB, jobs, JobCurator) for incremental batches
    • global_reselect_in_store(StoreDB, ratio, alpha) for global rebalancing

Test with local storage:

python3 test_incremental.py \
  --backend default_hash \
  --ratio 0.5 \
  --alpha 0.6 \
  --storage local \
  --dsn "" \
  --batches 3 \
  --n-per-batch 20 \
  --clear-local \
  # --no-global-reselect   # (optional) add this flag if you want to skip final global rebalancing

Test with SQL storage (Postgres):

python3 test_incremental.py \
  --backend default_hash \
  --ratio 0.5 \
  --alpha 0.6 \
  --storage sql \
  --dsn "dbname=mydb user=myuser password=mypass host=localhost port=5432" \
  --batches 3 \
  --n-per-batch 30 \
  # --no-global-reselect   # optional

For more details, see the Advanced documentation.