Personal, 2024

Wikipedia Scraper

Async crawler with 100 concurrent workers, O(1) URL deduplication, and a 20-second global deadline.

Python, asyncio, aiohttp, BeautifulSoup

Challenge

Crawling a large site such as Wikipedia under a strict time limit requires balancing crawl speed against resource use.

Approach

Built a high-concurrency async crawler with 100 workers, O(1) URL deduplication, and a global 20-second deadline using Python's asyncio and aiohttp.

What it does

100 Concurrent Workers

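Keeps up to 100 requests in flight at once, hiding I/O latency and making fuller use of available bandwidth. The workers are concurrent coroutines on one event loop rather than parallel threads.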
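
A minimal sketch of a worker pool of this kind, assuming a shared asyncio.Queue feeds the workers. The names fake_fetch, worker, and crawl are hypothetical, and the fetch is simulated with asyncio.sleep so the example runs without a network; the actual project would issue aiohttp requests at that point.

```python
import asyncio

NUM_WORKERS = 100  # worker count taken from the project description


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for an HTTP request; sleeps to mimic latency."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def worker(queue: asyncio.Queue, results: dict) -> None:
    """Pull URLs off the shared queue until the task is cancelled."""
    while True:
        url = await queue.get()
        try:
            results[url] = await fake_fetch(url)
        finally:
            queue.task_done()  # always mark the item done so join() can return


async def crawl(urls, num_workers: int = NUM_WORKERS) -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)

    results: dict = {}
    workers = [
        asyncio.create_task(worker(queue, results)) for _ in range(num_workers)
    ]
    await queue.join()  # wait until every queued URL has been processed
    for task in workers:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Because every worker spends most of its time awaiting I/O, the waits overlap: 200 simulated fetches of 10 ms each finish in roughly two rounds instead of 200 sequential ones.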

20s Deadline Enforcement

A single global deadline cancels all pending tasks once the 20-second limit is reached.
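
One way such a deadline could be enforced, sketched with asyncio.wait and a timeout. The helper names slow_task and run_with_deadline are hypothetical, and the delays are shortened so the example finishes quickly; the project's own mechanism may differ.

```python
import asyncio


async def slow_task(delay: float) -> float:
    """Hypothetical unit of work that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return delay


async def run_with_deadline(delays, deadline_s: float):
    """Run all tasks, then cancel whatever is still pending at the deadline."""
    tasks = [asyncio.create_task(slow_task(d)) for d in delays]
    done, pending = await asyncio.wait(tasks, timeout=deadline_s)

    for task in pending:
        task.cancel()  # request cancellation of unfinished work
    # Give cancelled tasks a chance to unwind before returning.
    await asyncio.gather(*pending, return_exceptions=True)

    finished = sorted(task.result() for task in done)
    return finished, len(pending)
```

Cancellation in asyncio is cooperative: a task stops at its next await point, so the cutoff lands close to the limit rather than at a hard real-time boundary.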

URL Deduplication

A hash set gives average-case O(1) lookups, preventing redundant fetches and infinite loops on cyclic links.
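
A small sketch of set-based deduplication, assuming URLs are compared as plain strings. The class name SeenUrls and its method are hypothetical.

```python
class SeenUrls:
    """Tracks visited URLs in a hash set (average-case O(1) membership test)."""

    def __init__(self) -> None:
        self._seen = set()

    def first_visit(self, url: str) -> bool:
        """Return True the first time a URL is offered, False afterwards."""
        if url in self._seen:
            return False
        self._seen.add(url)
        return True

    def __len__(self) -> int:
        return len(self._seen)
```

A crawler would call first_visit before enqueueing a link, so pages that link back to each other are fetched only once.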

Non-Blocking Architecture

Runs entirely on the asyncio event loop, with link normalization and protocol handling for every extracted URL.
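
Link normalization might look like the following sketch, built only on the standard library's urllib.parse. The function name normalize_link and the exact rules (resolve relative links, drop fragments, keep only http and https) are assumptions, not the project's confirmed behavior.

```python
from typing import Optional
from urllib.parse import urldefrag, urljoin, urlsplit

ALLOWED_SCHEMES = ("http", "https")  # assumed set of crawlable protocols


def normalize_link(base_url: str, href: str) -> Optional[str]:
    """Resolve href against base_url and return a canonical URL, or None."""
    absolute = urljoin(base_url, href)       # handles relative and //host links
    absolute, _fragment = urldefrag(absolute)  # "#section" never changes the page
    if urlsplit(absolute).scheme not in ALLOWED_SCHEMES:
        return None  # skip mailto: and other non-HTTP links
    return absolute
```

Normalizing before the deduplication check matters: without it, the same page reached through a relative path and through an absolute URL with a fragment would count as two distinct entries.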