v0.6.0 • Progress Report

🚀

RojuSec

Threat Engine

ML-powered phishing detection with explainable scoring

PHASE 3 COMPLETE • CORE ENGINE LIVE

6

Core Modules

73%

Baseline F1

8.4K

Rows Processed

2

ML Models

Architecture

Core Detection Stack

Production-ready modules wired into a single scoring pipeline

🤖 core.nlp.model

Rule-based text engine with 50+ phishing & spam patterns.

✅ 4 / 4 tests • urgency, credential theft, payment fraud
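
A rule engine of this kind can be pictured as a table of weighted regular expressions. The sketch below is illustrative only: the three patterns, their weights and the `score_text` helper are assumptions made for this example, not the 50+ rules that ship in core.nlp.model.

```python
import re

# Hypothetical rules: (category, compiled pattern, weight). Not the real rule set.
RULES = [
    ("urgency", re.compile(r"\b(urgent|immediately|within 24 hours)\b", re.I), 3),
    ("credential_theft", re.compile(r"\b(verify|confirm) your (password|account)\b", re.I), 5),
    ("payment_fraud", re.compile(r"\b(wire transfer|gift cards?|overdue invoice)\b", re.I), 4),
]

def score_text(text):
    """Return (total score, per-category subscores) for the rules that match."""
    subscores = {}
    for category, pattern, weight in RULES:
        if pattern.search(text):
            subscores[category] = weight
    return sum(subscores.values()), subscores

total, hits = score_text("URGENT: verify your password immediately")
print(total, hits)
```

Returning the per-category hits alongside the total is what makes the score explainable: a caller can see which rule families fired.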

🌐 core.url_analysis

URL extraction and risk scoring for embedded links.

✅ 13 / 13 tests • shorteners, IP-address hosts, suspicious TLDs
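
The three checks named above (shorteners, IP-address hosts, suspicious TLDs) could look roughly like this. The shortener list, the TLD list, the regex and the function names are placeholders chosen for the example; the report does not show the module's real values.

```python
import ipaddress
import re
from urllib.parse import urlparse

# Illustrative lists only; the real module's lists are not shown in this report.
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co"}
RISKY_TLDS = {"zip", "xyz", "top"}
URL_RE = re.compile(r"https?://[^\s<>\"']+")

def extract_urls(text):
    """Pull http(s) links out of free text."""
    return URL_RE.findall(text)

def url_risk(url):
    """Return the list of risk reasons that apply to one URL."""
    host = (urlparse(url).hostname or "").lower()
    reasons = []
    if host in SHORTENERS:
        reasons.append("shortener")
    try:
        ipaddress.ip_address(host)      # raises ValueError for ordinary hostnames
        reasons.append("ip_host")
    except ValueError:
        if host.rsplit(".", 1)[-1] in RISKY_TLDS:
            reasons.append("risky_tld")
    return reasons

for link in extract_urls("Reset here: http://192.168.0.1/login or https://bit.ly/abc"):
    print(link, url_risk(link))
```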

🔐 core.auth

API-key auth and rate-limiting for all sensitive endpoints.

✅ 13 / 13 tests • 60/min + 1,000/hr per key

📊 core.telemetry

Privacy-first telemetry with hashed identifiers and subscores.

✅ 6 / 6 tests • 8,390+ records stored

Phase 3B & 3C

ML Models Online

DeBERTa + MiniLM integrated as "boosters" on top of rules

🧠 core.models.text

✅ Phase 3B

DeBERTa-based classifier providing soft evidence on email content.

DeBERTa-base model

Boost range: −5 to +10

Lazy loading

Thread-safe caching

Multi-heuristic scoring

Explainable subscores
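
As a rough sketch of two items in this card, the code below shows a lazily loaded, lock-protected model cache and one possible way to turn a classifier probability into a boost inside the documented −5 to +10 range. The piecewise-linear mapping around 0.5 and the names `get_model` and `text_boost` are assumptions; the report gives only the range, not the formula.

```python
import threading

_model = None
_model_lock = threading.Lock()

def get_model(loader):
    """Load the model once, on first use; later calls return the cached object."""
    global _model
    if _model is None:
        with _model_lock:
            if _model is None:  # re-check inside the lock
                _model = loader()
    return _model

def text_boost(p_phish):
    """Map a phishing probability in [0, 1] to a boost in [-5, +10] (assumed mapping)."""
    p = min(max(p_phish, 0.0), 1.0)
    if p >= 0.5:
        return (p - 0.5) * 20.0   # 0.5 -> 0, 1.0 -> +10
    return (p - 0.5) * 10.0       # 0.0 -> -5

print(text_boost(0.9), text_boost(0.1))
```

The asymmetric range means the model can add more risk than it can remove, which fits a booster that sits on top of rules instead of overriding them.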

🔍 core.models.behavior

✅ Phase 3C

MiniLM-based behavioral anomaly model for sender behavior patterns.

MiniLM embeddings

Cosine similarity

Online baseline updates

Thread-safe store

Boost range: 0 to +10

Telemetry-backed tuning
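
One way to read this card: each sender has a baseline embedding, new messages are compared to it by cosine similarity, and low similarity becomes a boost between 0 and +10. The sketch below uses plain lists in place of MiniLM embeddings. The running-mean baseline, the linear boost mapping and the `SenderBaselines` class are assumptions for illustration, not the module's actual design.

```python
import math
import threading

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def behavior_boost(similarity):
    """Similarity 1.0 -> 0, similarity <= 0 -> +10 (assumed linear mapping)."""
    return min(max((1.0 - similarity) * 10.0, 0.0), 10.0)

class SenderBaselines:
    """Thread-safe store of a running-mean embedding per sender (assumed design)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # sender -> (mean vector, message count)

    def boost_and_update(self, sender, emb):
        with self._lock:
            mean, n = self._data.get(sender, (None, 0))
            if mean is None:
                boost, new_mean = 0.0, list(emb)  # first message: no baseline yet
            else:
                boost = behavior_boost(cosine(mean, emb))
                new_mean = [(m * n + e) / (n + 1) for m, e in zip(mean, emb)]
            self._data[sender] = (new_mean, n + 1)
            return boost

store = SenderBaselines()
store.boost_and_update("alice", [1.0, 0.0])
print(store.boost_and_update("alice", [0.0, 1.0]))
```

A production version would probably skip the baseline update when the boost is high, so that a single anomalous message does not drag the baseline toward itself.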

Security & Data

Hardening & Telemetry

Built as a security product, not just a toy model

🔐 core.auth

✅ Live

API-key authentication and multi-tier rate limiting.

Key generation & validation

60 requests / minute

1,000 requests / hour

Per-key isolation

Rate-limit status endpoint

Defensive defaults

✅ 13 / 13 tests • happy-path + abuse scenarios
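
The two tiers (60 requests per minute, 1,000 per hour, tracked per key) can be sketched with a sliding window of timestamps. This is a minimal single-process illustration: the `RateLimiter` class and its `allow` method are hypothetical names, and the sketch is not thread-safe as written.

```python
import time
from collections import defaultdict, deque

# (window in seconds, max requests in that window), taken from the limits above
LIMITS = ((60, 60), (3600, 1000))

class RateLimiter:
    def __init__(self, limits=LIMITS):
        self.limits = limits
        self.hits = defaultdict(deque)  # api_key -> timestamps of accepted requests

    def allow(self, api_key, now=None):
        """Return True and record the request if every tier still has room."""
        now = time.monotonic() if now is None else now
        q = self.hits[api_key]
        longest = max(window for window, _ in self.limits)
        while q and now - q[0] >= longest:   # drop timestamps older than the longest window
            q.popleft()
        for window, cap in self.limits:
            recent = sum(1 for t in q if now - t < window)
            if recent >= cap:
                return False
        q.append(now)
        return True

limiter = RateLimiter()
print(limiter.allow("demo-key"))
```

Because each key has its own deque, one noisy client cannot consume another client's quota.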

📊 core.telemetry

✅ Live

Privacy-first telemetry with hashed email IDs and full subscore storage.

SQLite storage

SHA-256 email hashing

All subscores persisted

User feedback capture

Non-blocking logging

8,390+ events

✅ 6 / 6 tests • schema + privacy guarantees
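
A minimal sketch of hashed-ID telemetry on SQLite, using only the standard library. The table layout, column names and helper functions are assumptions for this example; the report confirms only SQLite storage, SHA-256 hashing and persisted subscores.

```python
import hashlib
import json
import sqlite3

def hash_id(email_id):
    """SHA-256 hex digest, so the raw identifier is never stored."""
    return hashlib.sha256(email_id.strip().lower().encode("utf-8")).hexdigest()

def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "email_hash TEXT NOT NULL, score REAL NOT NULL, "
        "subscores TEXT NOT NULL, feedback TEXT)"   # hypothetical schema
    )
    return conn

def log_event(conn, email_id, score, subscores):
    """Store one analysis result; subscores are kept as a JSON string."""
    conn.execute(
        "INSERT INTO events (email_hash, score, subscores) VALUES (?, ?, ?)",
        (hash_id(email_id), score, json.dumps(subscores)),
    )
    conn.commit()

conn = open_store()
log_event(conn, "Alice@Example.com", 17.0, {"nlp": 8, "url": 4, "text_boost": 5})
print(conn.execute("SELECT email_hash, score FROM events").fetchone())
```

The non-blocking logging mentioned above is not shown here. Note also that an unsalted hash of a guessable identifier can be reversed by brute force, so a keyed hash such as HMAC may be worth considering.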

API

API Surface

Designed to plug into mail gateways, SOC tooling and scripts

GET /health

Liveness probe for containers / orchestrators.

Public

GET /health/ml

Verifies ML models are loaded and ready.

Public

POST /analyze

Main phishing analysis endpoint (NLP + URL + ML + behavior).

Auth Required

GET /rate-limit

Returns remaining quota for the calling API key.

Auth Optional

GET /telemetry/stats

Aggregated telemetry metrics for dashboards.

Auth Required

GET /progress

Progress view for screenshots (this page).

Public
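
For script integrations, a call to POST /analyze might be assembled as below. Everything beyond the path and method is assumed for illustration: the base URL, the `X-API-Key` header name and the `subject` / `body` JSON fields are hypothetical, since this page does not document the request schema.

```python
import json
import urllib.request

def build_analyze_request(base_url, api_key, subject, body):
    """Build (but do not send) a JSON POST to the /analyze endpoint."""
    payload = json.dumps({"subject": subject, "body": body}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/analyze",
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-API-Key": api_key,   # hypothetical header name
        },
    )

req = build_analyze_request("http://localhost:8000", "demo-key", "Invoice", "Pay now")
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req, timeout=10)
```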

Performance

Current Metrics

Conservative tuning prioritising precision and stability

73%

Baseline F1

Before aggressive telemetry-based tuning

80%

Precision

Roughly 4 in 5 flagged emails are true positives

67%

Recall

Roughly 2 in 3 phishing emails caught, balanced against precision guardrails
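
The three headline figures are consistent with each other: F1 is the harmonic mean of precision and recall, and the quick check below reproduces the 73% from the 80% and 67% shown here.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

baseline = f1(0.80, 0.67)
print(round(baseline, 3))  # 0.729, which rounds to the reported 73%
```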

14

Risk Threshold

Current cut-off for "high risk" emails
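
To show how the threshold might be applied, the sketch below adds the rule and URL subscores to the two ML boosts and compares the total with 14. The additive combination, the inclusive comparison and the function name are assumptions; only the threshold and the boost ranges come from this page.

```python
HIGH_RISK_THRESHOLD = 14  # from the metric above

def final_score(nlp, url, text_boost, behavior_boost):
    """Combine subscores additively (assumed) and flag high-risk emails."""
    text_boost = min(max(text_boost, -5), 10)          # documented range: -5 to +10
    behavior_boost = min(max(behavior_boost, 0), 10)   # documented range: 0 to +10
    subscores = {
        "nlp": nlp,
        "url": url,
        "text_boost": text_boost,
        "behavior_boost": behavior_boost,
    }
    total = sum(subscores.values())
    return {
        "score": total,
        "high_risk": total >= HIGH_RISK_THRESHOLD,  # inclusive cut-off is an assumption
        "subscores": subscores,
    }

print(final_score(nlp=8, url=4, text_boost=3, behavior_boost=0))
```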

⚙️ Conservative Tuning

✅ Active

Automated weight updates driven by datasets + telemetry, with strict guardrails.

Min precision: 70%

Min F1 improvement: +2%

8,424 rows processed

3 datasets in rotation
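
The guardrails can be read as an acceptance test applied to each candidate weight update. In this sketch the +2% is interpreted as two percentage points of F1, which is an assumption, and `accept_update` is a hypothetical name.

```python
MIN_PRECISION = 0.70   # from the guardrail above
MIN_F1_GAIN = 0.02     # "+2%" read as percentage points; an assumption

def accept_update(baseline_f1, candidate_f1, candidate_precision):
    """Accept new weights only if precision holds and F1 improves enough."""
    if candidate_precision < MIN_PRECISION:
        return False
    return candidate_f1 - baseline_f1 >= MIN_F1_GAIN

print(accept_update(0.73, 0.76, 0.80))  # True: precision holds, F1 gains 3 points
print(accept_update(0.73, 0.80, 0.65))  # False: precision falls below the floor
```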

Testing

Dataset Coverage

Iterative runs across three public phishing corpora

📧 CEAS

10.2% complete

39,154 total rows • 4,000 processed • 35,154 remaining

📧 Ling

100% complete

2,859 rows • full pass with current pipeline

📧 Nazario

100% complete

1,565 rows • full pass with current pipeline

8.4K

Total Rows

Processed across CEAS, Ling & Nazario so far

8

Tuning Runs

Iterative passes with conservative constraints
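
The coverage figures follow directly from the row counts above, as this short check shows. The dictionary layout is only for the example.

```python
# (rows processed, total rows) per corpus, copied from the figures above
DATASETS = {"CEAS": (4000, 39154), "Ling": (2859, 2859), "Nazario": (1565, 1565)}

for name, (done, total) in DATASETS.items():
    print(f"{name}: {100 * done / total:.1f}% complete, {total - done} remaining")

total_processed = sum(done for done, _ in DATASETS.values())
print(total_processed)  # 8424, shown as 8.4K
```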

Roadmap

What's Next

From prototype engine to plug-and-play security product

🔗 Phase 4 — URL Intelligence

⏳ Planned

Augment URL scoring with external reputation sources.

VirusTotal API integration

URLhaus lookups

Google Safe Browsing

Redirect-chain following

🎯 Phase 5 — Model Fine-Tuning

⏳ Planned

Close the loop using telemetry and labeled feedback.

DeBERTa fine-tuning pipeline

Model versioning & rollback

A/B testing strategies

"Safe rollout" guardrails