← Back
Knowledge Basev1.3.0

Website Crawling

Website Crawling (2025-12-30): Honestify can ingest public content from your portfolio, blog, or GitHub Pages to enrich your knowledge base without manual copy-paste.

·4 min read·Honestify Engineering

Feature area: Website Crawler

This release covers Website Crawling Version 1.3.0, shipped 2025-12-30. Status: shipped. No breaking changes.

Summary

Honestify can ingest public content from your portfolio, blog, or GitHub Pages to enrich your knowledge base without manual copy-paste.

Engineers with strong public writing still retyped URLs and summaries into Honestify—duplicate work that discouraged knowledge base adoption. A respectful crawler fetches public pages, extracts main content, and indexes text for AI retrieval with user-approved URL allowlists.

What Changed

URL allowlist

New

Users specify domains and max crawl depth before indexing starts.

Content extraction

New

Readability-style main-content detection strips nav and boilerplate.

Crawl scheduling

Improved

Periodic re-crawl option for sites that update frequently.

KB setup time

Before

~25 min

After

~12 min

Users with 5+ public pages

  • URL allowlist — Users specify domains and max crawl depth before indexing starts.
  • Content extraction — Readability-style main-content detection strips nav and boilerplate.
  • Crawl scheduling — Periodic re-crawl option for sites that update frequently.

Why We Built It

Engineers with strong public writing still retyped URLs and summaries into Honestify—duplicate work that discouraged knowledge base adoption.

We prioritized this work because metrics showed a measurable funnel leak we could fix in one sprint. The fix needed to be durable—not a patch—so we addressed root causes in Website Crawler rather than symptoms alone.

Engineers, recruiters, and hiring managers all benefit when Honestify behaves predictably in production. This release reflects that bar.

User Impact

Knowledge base setup time decreased 50% for users with existing portfolio sites.

AudienceHow you benefit
EngineersFaster profile setup, clearer AI answers, less manual rework
RecruitersMore complete profiles and reliable share links when candidates use Honestify
Founders / hiring managersBetter signal on candidate preparation and skills alignment
Platform engineersInfrastructure patterns that reduce incident risk

Relevant skills: rag, python, system design, typescript. Target roles: ai engineer, backend engineer, devops engineer.

Technical Highlights

  • Robots.txt compliance with crawl-delay respect
  • Per-user rate limits and concurrent crawl caps
  • Chunking pipeline for embedding index
  • Dedup via content hash

Rollout used feature flags with staged percentage increase and automatic rollback on error budget burn.

Before

Website Crawling: before vs after

Before

Users manually pasted article text or skipped external content entirely.

After

Submit a root URL; Honestify discovers and indexes linked pages within configured depth and domain bounds.

Users moving from the previous experience should notice submit a root URL; Honestify discovers and indexes linked pages within configured depth and domain bounds.

Screenshots

Future Improvements

What we are building next

  • GitHub README indexing
  • RSS feed subscriptions
  • Selective page exclusion UI

Known limitations

  • · JavaScript-heavy SPAs may require manual paste fallback
  • · Authentication-gated pages not supported

Feedback welcome: Reply via in-app feedback or support—especially if you hit edge cases we did not cover in this release.

This update connects to other Honestify work:

Create your own AI profile

Upload your resume, add expertise, and share a profile link beside LinkedIn so recruiters can ask follow-up questions before the interview.