jobcurator is an open-source Machine Learning library to clean, normalize, structure, compress, and sample large datasets & feeds of job offers.
This library exists because job feeds in the aggregator world and the programmatic job distribution world are extremely noisy, redundant, low quality, and not normalized. jobcurator was created to take raw job firehose feeds and turn them into high-quality, diversified and deduplicated structured job data β before they hit searching, ranking, matching or bidding engines.
For questions, ideas, or coordination around larger changes:
Primary maintainer π§ mouhidine.seiv@hrflow.ai
dedupe_and_compress is a Hash-based job deduplication and compression fuction with quality and diversity preservation. It takes a list of
structured job objects and:
print_compression_summary() β compact, ASCII table with length/quality stats.print_jobs_summary() β preview of top-N selected jobs with Quality / Diversity / Selection.process_batch Supports incremental pipelines (see the Advanced documentation:
| Feature | default_hash |
minhash_hash |
sklearn_hash |
faiss_hash |
|---|---|---|---|---|
| Algorithm | SimHash + LSH (Β± Multi-probe) | MinHash + Jaccard LSH (Β± Multi-probe) | HashingVectorizer + NearestNeighbors (cosine) | FAISS IndexFlatL2 on composite vectors |
| Similarity | Hamming cosine | Jaccard on token/shingle sets | Cosine distance | L2 distance |
| Use case | General-purpose hash+geo dedupe/compression | Robust near-dupe on noisy / reordered text | Text-heavy feeds + sklearn-based experimentation | Huge catalogs, low latency, NN-heavy workloads |
| Dependencies | None | None | scikit-learn+numpy |
faiss-cpu+numpy |
| Dataset size | ~1k β ~200k | ~1k β ~200k | ~1k β ~100k | ~50k β 1M+ |
| Speed | Fast on CPU for smallβmedium datasets | Slower than default_hash (more hashing work) |
Moderate, depends on sparse ops & RAM | Very fast for large NN queries once indexed |
| Explicit geo constraint | Yes (3D distance filter in clustering) | Yes (3D distance filter in clustering) | No (only via tokens) | No (geo only affects L2 distance) |
| 3D location use | Hard geo radius (max_cluster_distance_km) |
Hard geo radius (max_cluster_distance_km) |
Encoded as coarse x/y/z tokens | Normalized (x,y,z) directly in vector |
| Text encoder | SimHash on title + text | Word n-gram shingles on title + text | Text to sparse hashed vector | Encoded via signature bits |
| Categories encoder | In meta-hash (part of signature) | Added as shingles in MinHash set | As extra tokens to HashingVectorizer | As βrichnessβ feature in vector |
| Salary encoder | Bucketed into meta-hash | Bucketed tokens in MinHash set | Via numeric/features (quality, etc.) | Indirect (via signature / numeric features) |
| Main threshold | d_sim_threshold (Hamming on SimHash) |
jaccard_threshold (min Jaccard) |
Internal NN radius (not exposed in API) | d_sim_threshold (max L2 in FAISS space) |
| Multi-probe support | Yes (use_multiprobe, max_multiprobe_flips) |
Yes (use_multiprobe, max_multiprobe_flips) |
No | No |
| Outlier filter | Optional (use_outlier_filter + IsolationForest) |
Optional (same) | Optional (same) | Optional (same) |
No dense text embeddings. Hash-based + classic ML only.
jobcurator/
βββ pyproject.toml
βββ README.md # main intro, installation, backends, examples
βββ README_ADVANCED.md # incremental pipelines + data model (advanced doc)
βββ LICENSE
βββ .gitignore
βββ test.py # unit tests for JobCurator (single batch)
βββ test_incremental.py # CLI demo for incremental pipeline (local/sql)
β
βββ src/
β βββ jobcurator/
β βββ __init__.py # exports JobCurator, Job, Category, etc.
β βββ models.py # Job, Category, Location3DField, SalaryField
β βββ curator.py # JobCurator class (dedupe + compression)
β βββ hash_utils.py # SimHash, MinHash, LSH, distances, signatures
β βββ cuckoo_filter.py # CuckooFilter implementation
β βββ sklearn_backend.py # sklearn_hash backend helpers (if used)
β βββ faiss_backend.py # faiss_hash backend helpers (if used)
β βββ storage/
β βββ __init__.py
β βββ base.py # LightJob, StoreDB, process_batch, global_reselect
β βββ sql_store.py # SqlStoreDB (compressed_jobs + dedupe_state tables)
β βββ local_store.py # LocalFileStoreDB (JSONL + pickle)
β
βββ tests/
βββ __init__.py
To install for local Dev:
git clone https://github.com/<your-username>/jobcurator.git
cd jobcurator
pip install -e .
To reinstall for local Dev:
pip uninstall -y jobcurator # ignore error if not installed
pip install -e .
(coming soon) To install the package once published to PyPI:
pip install jobcurator
Optional extras:
pip install scikit-learn faiss-cpu
Run main folder run test.py
# 1) Default backend (SimHash + LSH + geo), keep ~50%, preview 10 jobs (capped to len(jobs))
python3 test.py # n-preview-jobs=10 (capped to len(jobs)), ratio=0.5
# 2) Default backend (SimHash + LSH + geo), more aggressive compression keep ~30%, preview 8 jobs
python3 test.py --backend default_hash --ratio 0.3 --n-preview-jobs 8
# 3) Default backend (MinHash + Jaccard LSH + geo), keep ~50%, preview 5
python3 test.py --backend minhash_hash --ratio 0.5 --n-preview-jobs 5
# 4) sklearn backend (HashingVectorizer + NearestNeighbors), keep ~50%, preview 5
# (requires: pip install scikit-learn)
python3 test.py --backend sklearn_hash --ratio 0.5 --n-preview-jobs 5
# 5) FAISS backend (signature bits + 3D loc + categories), keep ~40%, preview 5
# (requires: pip install faiss-cpu)
python3 test.py --backend faiss_hash --ratio 0.4 --n-preview-jobs 5
# 6) Short option for preview:
python3 test.py -n 5 --backend default_hash --ratio 0.5
python3 test.py -n 5 --backend minhash_hash --ratio 0.4
from jobcurator import JobCurator, Job, Category, SalaryField, Location3DField
from datetime import datetime
# 1) Build some jobs
jobs = [
Job(
id="job-1",
title="Senior Backend Engineer",
text="Full description...",
categories={
"job_function": [
Category(
id="backend",
label="Backend",
level=1,
parent_id="eng",
level_path=["Engineering", "Software", "Backend"],
)
]
},
location=Location3DField(
lat=48.8566,
lon=2.3522,
alt_m=35,
city="Paris",
country_code="FR",
),
salary=SalaryField(
min_value=60000,
max_value=80000,
currency="EUR",
period="year",
),
company="HrFlow.ai",
contract_type="Full-time",
source="direct",
created_at=datetime.utcnow(),
),
]
# 2) Choose a backend
# ======================================================
# Option 1: "default_hash"
# SimHash + LSH (+ optional Multi-probe) + geo distance
# (no extra dependencies)
# ======================================================
curator_default = JobCurator(
# Global parameters (used by all backends)
ratio=0.5, # keep ~50% of jobs
alpha=0.6, # quality vs diversity tradeoff
max_per_cluster_in_pool=3, # max jobs per cluster entering pool
backend="default_hash",
use_outlier_filter=False, # set True to enable IsolationForest (if sklearn installed)
outlier_contamination=0.05, # only used when use_outlier_filter=True
# Backend-specific: default_hash
d_sim_threshold=20, # max Hamming distance on SimHash
max_cluster_distance_km=50.0, # max geo distance (km) within a cluster
# jaccard_threshold is ignored by default_hash
# Multi-probe LSH (used by default_hash + minhash_hash)
use_multiprobe=True,
max_multiprobe_flips=1, # small value = light extra recall
)
# ======================================================
# Option 2: "minhash_hash"
# MinHash + Jaccard LSH on shingles (text + cats + coarse loc + salary)
# + optional Multi-probe + geo distance
# (no extra dependencies)
# ======================================================
curator_minhash = JobCurator(
# Global parameters
ratio=0.5,
alpha=0.6,
max_per_cluster_in_pool=3,
backend="minhash_hash",
use_outlier_filter=False,
outlier_contamination=0.05,
# Backend-specific: minhash_hash
jaccard_threshold=0.8, # min Jaccard similarity between jobs in a cluster
max_cluster_distance_km=50.0, # geo radius (km) for clusters
# d_sim_threshold is ignored by minhash_hash
# Multi-probe LSH for MinHash bands
use_multiprobe=True,
max_multiprobe_flips=1,
)
# ======================================================
# Option 3: "sklearn_hash"
# HashingVectorizer + NearestNeighbors (cosine radius)
# (requires scikit-learn)
# ======================================================
# pip install scikit-learn
curator_sklearn = JobCurator(
# Global parameters
ratio=0.5,
alpha=0.6,
max_per_cluster_in_pool=3,
backend="sklearn_hash",
use_outlier_filter=True, # enable IsolationForest pre-filter
outlier_contamination=0.05, # proportion of jobs flagged as outliers
# Backend-specific:
# d_sim_threshold, max_cluster_distance_km, jaccard_threshold,
# use_multiprobe, max_multiprobe_flips are all ignored by sklearn_hash
)
# ======================================================
# Option 4: "faiss_hash"
# FAISS on [signature bits + 3D location + category richness]
# (requires faiss-cpu)
# ======================================================
# pip install faiss-cpu
curator_faiss = JobCurator(
# Global parameters
ratio=0.5,
alpha=0.6,
max_per_cluster_in_pool=3,
backend="faiss_hash",
use_outlier_filter=False,
outlier_contamination=0.05,
# Backend-specific: faiss_hash
d_sim_threshold=20, # approx max L2 distance in FAISS space
# max_cluster_distance_km, jaccard_threshold, use_multiprobe,
# max_multiprobe_flips are ignored by faiss_hash
)
# 3) Compute the results
compressed_jobs = curator_default.dedupe_and_compress(jobs)
print(f"{len(jobs)} β {len(compressed_jobs)} jobs kept")
for j in compressed_jobs:
print(j.id, j.title, j.location.city, f"quality={j.quality:.3f}")
curator_default.print_compression_summary(n_preview=10, t_ms=elapsed_ms)
curator_default.print_jobs_summary(selected, n_preview=10, label="Selected")
print_compression_summary(n_preview: int = 0, t_ms: float = 0.0)Shows the effective keep ratio, backend, timing, and an ASCII table of length/quality stats for all vs selected.
Example output
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β π preview: 10 | π― ratio: 0.40 | π§ backend: default_hash | β±οΈ time: 82.4 ms β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Total jobs: 12000 | Selected: 4800 (40.0% kept)
+----------+-------+--------+--------+--------+--------+
| Dataset | Count | Len ΞΌ | Len Ο | Qual ΞΌ | Qual Ο |
+==========+=======+========+========+========+========+
| All jobs | 12000 | 92.14 | 37.21 | 0.644 | 0.112 |
| Selected | 4800 | 106.38 | 29.07 | 0.711 | 0.087 |
+----------+-------+--------+--------+--------+--------+
print_jobs_summary(jobs, num_selected_to_show=10, label="jobs set")Previews the top-N by current order (you can pass curator.selected_jobs) with per-row Quality / Diversity / Selection and a short canonical hash.
Columns: ID | Title | City | Quality | Diversity | Selection | Hash
conn = psycopg2.connect(βdbname=β¦ user=β¦ password=β¦ host=β¦β) store = SqlStoreDB(conn)
curator = JobCurator(backend=βdefault_hashβ, ratio=0.5, alpha=0.6)
compressed_jobs1 = process_batch(store, jobs1, curator) compressed_jobs2 = process_batch(store, jobs2, curator)
global_reselect_in_store(store, ratio=0.5, alpha=0.6)
* Local store
```python
from jobcurator import JobCurator
from jobcurator.storage import LocalFileStoreDB, process_batch, global_reselect_in_store
store = LocalFileStoreDB()
curator = JobCurator(backend="default_hash", ratio=0.5, alpha=0.6)
compressed_jobs1 = process_batch(store, jobs1, curator)
compressed_jobs2 = process_batch(store, jobs2, curator)
global_reselect_in_store(store, ratio=0.5, alpha=0.6)
A Job is a structured object with:
id: unique identifiertitle: job title (string)text: full job description (string)categories: hierarchical taxonomy per dimension (dict[str, list[Category]])location: Location3DField with lat/lon/alt (internally converted to 3D x,y,z)salary: optional SalaryFieldcompany, contract_type, source, created_atJobCurator):
length_tokens, length_scorecompletion_score_valqualityexact_hash (strict dedup key)signature (128-bit composite hash used by backends)A Category is a hierarchical node:
id: unique taxonomy IDlabel: human-readable labellevel: depth in hierarchy (0 = root)parent_id: optional parent category idlevel_path: full path from root (e.g. ["Engineering", "Software", "Backend"])Multiple dimensions (e.g. job_function, industry, seniority) can coexist in categories:
job.categories = {
"job_function": [Category(...), ...],
"industry": [Category(...), ...],
}
Location3DField:
lat, lon: degreesalt_m: altitude in meterscity, country_code: human-readable metadatax, y, z: Earth-centered 3D coordinates (computed internally by compute_xyz())These 3D coordinates are used to compute actual distances between cities and avoid merging jobs that are geographically too far when clustering.
SalaryField:
min_value, max_value: numeric range (optional)currency: e.g. "EUR", "USD"period: "year", "month", "day", "hour"Salary is used both in completion/quality scoring and in the exact/meta hashes (bucketed).
The library includes a simple CuckooFilter:
Works on exact_hash values:
exact_hash is probably present β the job is skipped.add(exact_hash) and the job is processed.seen_filter parameter in:compressed = curator.dedupe_and_compress(jobs, seen_filter=seen_filter)
Where seen_filter is typically an instance of jobcurator.CuckooFilter.
JobCurator(
ratio: float = 1.0, # default compression ratio
alpha: float = 0.6, # quality vs diversity weight
max_per_cluster_in_pool: int = 3,
d_sim_threshold: int = 20, # SimHash Hamming threshold for clustering
max_cluster_distance_km: float = 50.0, # max distance between cities in same cluster
)
ratio = 1.0 β keep all jobsratio = 0.5 β keep ~50% of jobs (highest quality + diversity)alpha closer to 1 β prioritize quality; closer to 0 β prioritize diversityMore params:
| Param | Where | Type | Default | Description |
| ββββββββ- | ββββββββββββββ- | ββββ- | βββββ- | βββββββββββββββββββββββββββββββββ |
| ratio | JobCurator(...) / dedupe_and_compress() | float β [0,1] | 1.0 | Target keep ratio after dedupe + selection. |
| alpha | JobCurator(...) | float β [0,1] | 0.6 | Trade-off in selection_score. |
| greedy_diversity | dedupe_and_compress() | bool | False | Recompute diversity on the final set with robust scaling (recommended for quality-sensitive runs). |
| max_per_cluster_in_pool | JobCurator(...) | int | 3 | Cap per cluster before global selection. |
| backend | JobCurator(...) | literal | "default_hash" | Hashing/clustering strategy. |
| use_outlier_filter | JobCurator(...) | bool | False | Optional IsolationForest pre-filter. |
| d_sim_threshold | JobCurator(...) | int | 20 | Hamming/L2 threshold (backend-specific). |
| jaccard_threshold | JobCurator(...) | float | 0.8 | MinHash LSH threshold. |
You choose the dedup clustering strategy via:
JobCurator(backend=...)
Available backends:
default_hash
minhash_hash
MinHash over shingles built from:
max_cluster_distance_km as default_hash).sklearn_hash
Uses scikit-learn:
HashingVectorizer on text + encoded 3D location + category tokens.NearestNeighbors (cosine radius) to form clusters.IsolationForest for outlier filtering.pip install scikit-learn.faiss_hash
Uses FAISS (IndexFlatL2) on composite vectors:
[ [\text{signature bits} + \text{normalized (x,y,z)} + \text{category richness}] ]
Designed for large-scale catalogs (fast nearest-neighbor search).
Requires: pip install faiss-cpu.
title + text β normalize to length_score β [0,1] using p10/p90 percentiles.completion_score based on presence of key fields: title, text, location, salary, categories, company, contract_type.freshness_score (based on created_at) and source_quality (e.g. direct vs job_board).Combine into a single quality score:
quality(j) = 0.3 * length_score
+ 0.4 * completion_score
+ 0.2 * freshness_score
+ 0.1 * source_quality
exact_hash(j) is probably already in the filter β skip.add(exact_hash(j)) to the filter and keep the job.blake2b into a 64-bit exact_hash.exact_hash (hard dedup).For each job, build a 128-bit signature:
title + text.Concatenate:
signature = (simhash << 64) | meta_bits
This signature is used by the different backends.
Depending on backend:
#### a. backend="default_hash" β SimHash + Multi-probe LSH + geo
max_multiprobe_flips).d_sim_thresholdmax_cluster_distance_km#### b. backend="sklearn_hash" β HashingVectorizer + NearestNeighbors
HashingVectorizer over:
NearestNeighbors (cosine radius) to connect jobs that are close in this hashed feature space.#### c. backend="faiss_hash" β FAISS on signature + 3D loc + categories
For each job, build a numeric vector:
[signature_bits (0/1), normalized (x,y,z), category_richness]
IndexFlatL2).d_sim_threshold are connected.quality (descending).max_per_cluster_in_pool jobs as candidates.cannonical_id based the job id, reference, company .quality (descending).Greedy selection:
Repeatedly pick the job j in the pool that maximizes:
selection_score(j) =
alpha * quality(j)
+ (1 - alpha) * normalized_diversity_distance(j)
where normalized_diversity_distance(j) depends on the backend:
default_hash β normalized Hamming distance on 64-bit composite signatureminhash_hash β 1 - Jaccard estimate from MinHash signaturessklearn_hash β cosine distance on HashingVectorizer vectorfaiss_hash β L2 distance on FAISS composite vectorStop when youβve selected:
K = ceil(ratio * N_original)
Result: you keep fewer, higher-quality, and more diverse jobs, while avoiding duplicates (strict + near-duplicates), and optionally skipping already-seen jobs via CuckooFilter.
## π οΈ Advanced (High Level)
During compression we score each candidate:
selection_score = Ξ± * quality + (1 - Ξ±) * diversity
quality is computed per job (length, completion, etc.).diversity is backend-aware:
default_hash: normalized Hamming on 64-bit SimHash/composite signatureminhash_hash: 1 β Jaccard (from MinHash)sklearn_hash: cosine distancefaiss_hash: cosine distance on composite vectorsWhile selecting, we compute each jobβs min distance to any already-selected item, then robust-scale distances with quantiles (q_lo=0.10, q_hi=0.90) and label smoothing (Ξ΅=0.02) to avoid hard 0/1:
z = clamp01( (d - q10) / (q90 - q10 + 1e-6) )
diversity = Ξ΅ + (1 - 2Ξ΅) * z
Seed item gets
diversity_score = 1.0(helps robust scaling).
If you pass greedy_diversity=True, we run recompute_diversity_scores() on the final selected set:
[0,1].Per item, compute either:
Apply robust quantile scaling (q_lo, q_hi), then a leave-one-out (LOO) contribution:
final_div = 0.5 * local_scaled + 0.5 * loo_scaledselection_score with same Ξ±.Knobs (in recompute_diversity_scores):
k_nn=3, q_lo=0.10, q_hi=0.90, tau=0.15, label_eps=0.02, use_softmin=False.selected = curator.dedupe_and_compress(
jobs,
ratio=0.4, # optional override
greedy_diversity=True, # β new
seen_filter=my_cuckoo, # optional Bloom/Cuckoo/set-like
)
# Optional recalibration if you want to run it manually:
curator.recompute_diversity_scores(
selected_jobs=selected,
alpha=curator.alpha,
distance_fn=curator._diversity_distance,
k_nn=3,
q_lo=0.10, q_hi=0.90,
tau=0.15,
label_eps=0.02,
use_softmin=False,
)
### Incremental Jobcurator Approach
Problem:
You often receive batches of jobs over time (jobs1, jobs2, β¦) and want to:
The solution is:
StoreDB to store compressed jobs + CuckooFilter state.Use:
process_batch(StoreDB, jobs, JobCurator) for incremental batchesglobal_reselect_in_store(StoreDB, ratio, alpha) for global rebalancingTest with local storage:
python3 test_incremental.py \
--backend default_hash \
--ratio 0.5 \
--alpha 0.6 \
--storage local \
--dsn "" \
--batches 3 \
--n-per-batch 20 \
--clear-local \
# --no-global-reselect # (optional) add this flag if you want to skip final global rebalancing
Test with SQL storage (Postgres):
python3 test_incremental.py \
--backend default_hash \
--ratio 0.5 \
--alpha 0.6 \
--storage sql \
--dsn "dbname=mydb user=myuser password=mypass host=localhost port=5432" \
--batches 3 \
--n-per-batch 30 \
# --no-global-reselect # optional
For more details, see the Advanced documentation.