From Precision to Scale: AI-Enabled Crawler
How combining existing tools and best practices helped me tackle the challenge of discovering and validating educational resources at scale - 262,603 links
Executive Summary
I built a 10-engine AI discovery system that delivered 2,000 high-quality educational resources. When asked to find 100,000+ links instead, I created an entirely different system by combining various tools and approaches: OpenEvolve for improving validation, Firecrawl for deeper crawling, PostgreSQL best practices for efficient data handling, and seed links from my previous work for better targeting.
The journey taught me that crawling is straightforward - I gathered nearly 800k links. The real challenge was validation. Through iterative improvements and learning from failed attempts, I developed a single-table architecture with comprehensive metadata columns and validation procedures. This article shares that journey and the lessons learned along the way.
The final tally: 262,603 links and a tool that can do it all over again.
Introduction: A Different Kind of Challenge
After completing my precision discovery system that produced 2,000 validated educational resources, I received feedback suggesting a different direction: finding as many educational links as possible - ideally 100,000 or more.
Rather than modifying my existing tool, I decided to pivot to something new. This wasn’t about inventing revolutionary approaches - it was about combining existing tools and best practices to tackle a scale challenge.
Building on What Existed
Learning from Previous Work
My original discovery system taught me valuable lessons about educational content patterns. The 2,000 validated links from that project became seed URLs for the new crawler, providing quality starting points for expansion.
Leveraging Colleagues’ Innovations
A colleague’s work on OpenEvolve - using AI to evolve better algorithms - inspired me to apply similar approaches to URL validation. I didn’t invent OpenEvolve; I adapted it to my specific challenge.
Using Established Tools
Firecrawl: Handled JavaScript-heavy sites and complex crawling scenarios
PostgreSQL: Provided robust data storage with JSONB for flexible metadata
Async Python: Enabled concurrent processing at scale
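The concurrency pattern behind that last point can be sketched in a few lines. This is a minimal illustration, not the production code: `check_url`, `check_all`, and the injected `fetch` coroutine are hypothetical names, and a semaphore caps how many validations run at once.

```python
import asyncio

async def check_url(url, fetch, sem):
    """Validate one URL under a concurrency limit.
    `fetch` is any coroutine returning an HTTP status code (injected for testing)."""
    async with sem:
        try:
            status = await fetch(url)
            return url, 200 <= status < 400
        except Exception:
            return url, False

async def check_all(urls, fetch, limit=50):
    """Run validations concurrently, with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)
    return await asyncio.gather(*(check_url(u, fetch, sem) for u in urls))
```

In the real system `fetch` would wrap an HTTP client such as aiohttp; injecting it keeps the concurrency logic testable without network access.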
The OpenEvolve Training Process
The most significant improvement came from training a better validator using OpenEvolve. I started with a basic HTTP validation function that was failing miserably - marking valid educational sites as invalid due to slow response times or non-standard headers.
The training process was iterative:
I created an initial validator with simple rules
Fed it a corpus of known educational URLs from my database
Let OpenEvolve run for 100 iterations, evolving better strategies
Each iteration tested against real URLs, measuring true positive rates
The AI discovered patterns I hadn’t considered:
Educational sites often have longer load times due to rich content
PDF resources require different timeout strategies
Some educational platforms use non-standard HTTP responses
Image URLs return a 200 status but are mostly useless as educational resources
Certain domains consistently require specific handling
The evolved validator learned to adapt its approach based on domain patterns and content types, dramatically improving success rates.
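The evolution loop can be illustrated with a toy version: mutate validator parameters, score each candidate against labeled examples, and keep improvements. Everything here is a simplified sketch - the corpus, the two timeout parameters, and the hill-climbing strategy are illustrative stand-ins for what OpenEvolve actually evolved.

```python
import random

# Toy labeled examples: (load_time_seconds, is_pdf, is_valid_educational_url).
# In the real system these came from known educational URLs in the database.
CORPUS = [(8.0, False, True), (2.0, False, True), (12.0, True, True),
          (25.0, False, False), (30.0, True, False)]

def score(params):
    """Fraction of the corpus a candidate validator classifies correctly."""
    correct = 0
    for load_time, is_pdf, label in CORPUS:
        timeout = params["pdf_timeout"] if is_pdf else params["timeout"]
        predicted = load_time <= timeout  # within timeout -> treated as valid
        correct += predicted == label
    return correct / len(CORPUS)

def evolve(generations=100, seed=0):
    """Randomly perturb timeout parameters, keeping changes that raise accuracy."""
    rng = random.Random(seed)
    best = {"timeout": 5.0, "pdf_timeout": 5.0}
    best_score = score(best)
    for _ in range(generations):
        cand = {k: max(1.0, v + rng.uniform(-3, 3)) for k, v in best.items()}
        s = score(cand)
        if s >= best_score:
            best, best_score = cand, s
    return best, best_score
```

Even this toy loop rediscovers the article's point: PDFs and content-rich pages earn longer timeouts because the corpus rewards it, not because anyone hand-coded the rule.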
Firecrawl’s Dual Purpose: Discovery and Validation
Firecrawl became central to both crawling and validation. For discovery, its JavaScript rendering capabilities unlocked content on modern educational platforms. But I also discovered its /scrape and /extract endpoints could serve validation purposes.
The /scrape endpoint provided full page content, allowing me to verify educational relevance by analyzing:
Presence of grade-level indicators
Educational vocabulary density
Curriculum-related keywords
Content structure patterns
The /extract endpoint with custom schemas became particularly powerful for metadata extraction.
This dual use of Firecrawl - for both discovery and validation - provided rich metadata while managing API credit consumption efficiently.
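The relevance check on scraped content can be sketched as follows. The request payload shape, the keyword list, and the density threshold are my assumptions for illustration - consult Firecrawl's API docs for the actual /scrape schema. The HTTP call is injected so the scoring logic stands alone:

```python
import re

# Illustrative educational vocabulary; the real system used a richer list.
EDU_TERMS = ("lesson", "curriculum", "grade", "worksheet", "standards")

def scrape_and_score(url, post, api_key="YOUR_KEY"):
    """Fetch page content via a Firecrawl-style /scrape endpoint and estimate
    educational relevance from vocabulary density.
    `post(endpoint, json=..., headers=...)` is injected so it can be stubbed;
    the payload fields are assumptions, not the official schema."""
    resp = post(
        "https://api.firecrawl.dev/v1/scrape",
        json={"url": url, "formats": ["markdown"]},
        headers={"Authorization": f"Bearer {api_key}"},
    )
    text = resp.get("data", {}).get("markdown", "")
    words = re.findall(r"[a-z]+", text.lower())
    hits = sum(w in EDU_TERMS for w in words)
    density = hits / max(len(words), 1)
    return {"url": url, "edu_density": density,
            "looks_educational": density > 0.02}
```

In production the same density score fed the confidence column rather than a hard yes/no cutoff.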
The Journey from Complex to Simple
Starting with Four Tables
Initially, I designed a four-table workflow:
discovered_links for raw URLs
valid_links for confirmed resources
links_to_fix for repairable URLs
links_cannot_fix for problematic URLs
This seemed elegant in theory but proved cumbersome in practice.
The Single-Table Realization
After wrestling with complex state transitions and data synchronization, I consolidated everything into one table with rich metadata columns. This simplified approach proved more maintainable and efficient:
URL and domain information
Validation status and timestamps
Educational metadata (grade level, subject)
Processing flags and confidence scores
Fix history and validation attempts
Firecrawl extraction results
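A schema along these lines captures the idea; the column names are my reconstruction of the categories above, not the exact production DDL:

```sql
-- Illustrative single-table schema; column names are a reconstruction.
CREATE TABLE IF NOT EXISTS links (
    id              BIGSERIAL PRIMARY KEY,
    url             TEXT NOT NULL UNIQUE,
    domain          TEXT NOT NULL,
    status          TEXT NOT NULL DEFAULT 'discovered',  -- discovered / fixed / valid / invalid
    confidence      REAL,
    grade_level     TEXT,
    subject         TEXT,
    fix_history     JSONB DEFAULT '[]',
    firecrawl_meta  JSONB,            -- /extract results land here
    validation_meta JSONB,            -- OpenEvolve validator output
    discovered_at   TIMESTAMPTZ DEFAULT now(),
    validated_at    TIMESTAMPTZ
);
CREATE INDEX IF NOT EXISTS links_status_idx ON links (status);
CREATE INDEX IF NOT EXISTS links_domain_idx ON links (domain);
```

With one table, a link's lifecycle is just a status transition plus a metadata update - no cross-table moves to keep in sync.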
Overnight Processing and Incremental Improvement
The system ran continuously, processing links overnight while I slept. Each morning brought new insights:
Which domains validated poorly
What patterns appeared in broken URLs
Where the crawler got stuck
How the OpenEvolve-trained validator performed on different site types
These insights led to incremental improvements in the validation logic, URL fixing patterns, and crawling strategies. The overnight runs became a feedback loop, with each iteration refining the approach.
Applying PostgreSQL Best Practices
Working with hundreds of thousands of URLs required careful database design:
Batch operations instead of individual inserts
Proper indexing on frequently queried columns
JSONB metadata for flexible schema evolution
Connection pooling for concurrent operations
Checkpoint systems for resumable processing
The JSONB columns proved particularly valuable for storing Firecrawl extraction results and OpenEvolve validation metadata without constant schema migrations.
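The batch-insert point can be made concrete with a small helper. The table and column names are the hypothetical ones from the schema discussion, and `conn` is any DB-API connection (e.g. psycopg2); the chunking logic is what matters:

```python
def batched(items, size):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def insert_batches(conn, rows, size=1000):
    """Insert (url, domain) rows in batches, letting the database
    deduplicate via ON CONFLICT instead of checking in Python."""
    sql = ("INSERT INTO links (url, domain) VALUES (%s, %s) "
           "ON CONFLICT (url) DO NOTHING")
    with conn.cursor() as cur:
        for chunk in batched(rows, size):
            cur.executemany(sql, chunk)
    conn.commit()
```

psycopg2's `execute_values` helper would batch more efficiently still; the sketch keeps to plain DB-API calls for clarity.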
The Formula That Emerged
Through trial and error, a definitive workflow emerged:
Smart Crawling: Use seed URLs and Firecrawl for targeted discovery
Conservative Fixing: Only repair URLs with high-confidence patterns (7 specific patterns identified)
Ordered Deduplication: Deduplicate AFTER fixing and BEFORE validation
Enhanced Validation: Apply OpenEvolve-trained validator with domain-specific strategies
Metadata Extraction: Use Firecrawl’s /extract endpoint for educational classification
Continuous Processing: Run overnight with checkpoint recovery
Single Source of Truth: One table with comprehensive metadata
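The fix-then-dedupe ordering in the workflow above is worth spelling out, since two broken variants of a URL often repair to the same resource. A minimal sketch, with illustrative normalization heuristics and a caller-supplied `fix` function standing in for the real repair step:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Canonical form used only for deduplication: lowercase scheme and host,
    drop fragments and trailing slashes. Heuristics are illustrative."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

def dedupe_after_fixing(urls, fix):
    """Apply the repair step first, then dedupe on the normalized form,
    so variants of the same resource collapse before validation spends
    time (and API credits) on each of them."""
    seen, out = set(), []
    for url in urls:
        key = normalize(fix(url))
        if key not in seen:
            seen.add(key)
            out.append(fix(url))
    return out
```

Deduplicating before fixing would miss these collisions; deduplicating after validation would pay to validate the same page twice.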
Lessons from the Journey
Validation Is the Real Challenge
Crawling educational domains to find URLs is relatively straightforward. Determining which URLs contain valuable educational content is where the complexity lies. The combination of OpenEvolve-trained validation and Firecrawl metadata extraction provided the breakthrough.
Simple Architectures Scale Better
The move from four tables to one wasn’t a retreat - it was a recognition that simpler systems are easier to understand, maintain, and debug.
Existing Tools Are Powerful
I didn’t need to invent new approaches. Combining Firecrawl, OpenEvolve concepts, PostgreSQL best practices, and async Python provided all the capabilities required.
AI Training Beats Hand-Coded Rules
The OpenEvolve-trained validator discovered patterns and strategies I never would have coded manually. Letting AI learn from examples proved far more effective than trying to anticipate every edge case.
Incremental Improvement Works
Running the system overnight and making small improvements based on observations proved more effective than trying to design the perfect system upfront.
The Three Ultimate Tools
The final system consolidated into three Python scripts:
crawler.py: Combines multiple discovery approaches with 6 operational modes
fixer.py: Applies conservative URL repair patterns with confidence scoring
validator.py: Uses OpenEvolve-enhanced validation with 7 validation modes
Each tool represents accumulated learning from the entire journey.
Conclusion
This project succeeded not through breakthrough innovations but through thoughtful combination of existing tools and practices. The journey from my precision system to this scale-focused crawler taught valuable lessons about validation challenges and the power of simple architectures.
The system exists and works, processing educational URLs at scale. More importantly, it demonstrates how combining established approaches - OpenEvolve for algorithm improvement, Firecrawl for robust crawling and metadata extraction, PostgreSQL for data management, and lessons from previous projects - can address new challenges effectively.
The definitive insight: validation is everything. A million unvalidated URLs are worthless; a thousand validated educational resources are gold. This system found the middle ground, processing at scale while maintaining quality through intelligent validation.