Personal, 2024
Wikipedia Scraper
Async crawler with 100 concurrent workers, O(1) URL deduplication, and a 20-second global deadline.
Python, asyncio, aiohttp, BeautifulSoup
Challenge
Crawling a large site under a strict time limit means balancing throughput against resource use: enough concurrency to hide network latency, but bounded so that connections and memory stay under control.
Approach
Built an async crawler on Python's asyncio and aiohttp: 100 concurrent workers fetch pages, a hash set deduplicates URLs in O(1) average time, and a global 20-second deadline stops the crawl.
What it does
100 Concurrent Workers
Keeps up to 100 requests in flight at once, so time spent waiting on one response overlaps with work on the others and I/O latency is masked.
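
One common way to get this behaviour is a fixed pool of worker tasks draining a shared asyncio.Queue. The sketch below illustrates that pattern and is not the project's actual code: fake_fetch is a hypothetical stand-in for an aiohttp request (it sleeps instead of touching the network), and the names worker and crawl are assumptions made for the example.

```python
import asyncio

NUM_WORKERS = 100  # worker count taken from the description above


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for an aiohttp request; sleeps instead of doing I/O."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def worker(queue: asyncio.Queue, results: dict) -> None:
    """Pull URLs off the shared queue until the task is cancelled."""
    while True:
        url = await queue.get()
        try:
            results[url] = await fake_fetch(url)
        finally:
            # Mark the item done even if the fetch raised, so join() can return.
            queue.task_done()


async def crawl(urls: list[str]) -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    results: dict = {}
    for url in urls:
        queue.put_nowait(url)

    workers = [
        asyncio.create_task(worker(queue, results)) for _ in range(NUM_WORKERS)
    ]
    await queue.join()  # wait until every queued URL has been processed
    for task in workers:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

With 250 URLs and 100 workers, the fake fetches complete in roughly three rounds of 10 ms rather than 250 sequential sleeps, which is the latency-masking effect described above.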
20s Deadline Enforcement
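A single global deadline applies to every task: when the 20 seconds elapse, all pending tasks are cancelled. Cancellation in asyncio is cooperative, so each task stops at its next await point rather than at an exact instant.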
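
A minimal sketch of deadline enforcement, assuming asyncio.wait with a timeout is an acceptable way to express it (the project may do this differently). slow_job and run_with_deadline are hypothetical names, and the sleep stands in for a page fetch.

```python
import asyncio

DEADLINE_SECONDS = 20.0  # global limit taken from the description above


async def slow_job(index: int, delay: float) -> int:
    """Hypothetical unit of work standing in for one page fetch."""
    await asyncio.sleep(delay)
    return index


async def run_with_deadline(delays: list[float], deadline: float = DEADLINE_SECONDS):
    """Run all jobs, cancel whatever is still pending at the deadline."""
    tasks = [asyncio.create_task(slow_job(i, d)) for i, d in enumerate(delays)]
    if not tasks:
        return [], 0  # asyncio.wait rejects an empty collection

    done, pending = await asyncio.wait(tasks, timeout=deadline)
    for task in pending:
        task.cancel()  # delivered at the task's next await point
    # Wait for the cancelled tasks to unwind so none is left dangling.
    await asyncio.gather(*pending, return_exceptions=True)

    finished = sorted(task.result() for task in done)
    return finished, len(pending)
```

Calling asyncio.run(run_with_deadline([0.01, 0.02, 5.0], deadline=0.5)) lets the two short jobs finish and cancels the long one. The results of completed tasks are kept, which matters for a crawler that should return whatever it gathered before time ran out.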
URL Deduplication
A hash set of visited URLs gives O(1) average-case lookups, preventing redundant processing and infinite loops on pages that link back to each other.
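
To illustrate why the visited set also prevents infinite loops, here is a small synchronous sketch over an in-memory link graph. The LINKS table and the crawl_order function are made up for the example. Every page links back to an earlier one, so without the set the traversal would never terminate.

```python
from collections import deque

# Hypothetical link graph with cycles: every page links back to /wiki/A.
LINKS = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A"],
    "/wiki/C": ["/wiki/A", "/wiki/B"],
}


def crawl_order(start: str, links: dict) -> list:
    """Breadth-first traversal that visits each URL at most once."""
    seen = {start}  # set membership is O(1) on average
    frontier = deque([start])
    order = []
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # skip anything already queued or visited
                seen.add(target)
                frontier.append(target)
    return order
```

Marking a URL as seen when it is enqueued, rather than when it is fetched, keeps the same link from being queued twice while the first copy is still waiting.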
Non-Blocking Architecture
Runs entirely on the async event loop, with link normalization and protocol handling applied to the links it discovers.
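
The exact normalization rules are not spelled out here, so the following is only a plausible sketch using the standard library's urllib.parse. It assumes that normalization means resolving relative links, dropping fragments, lowercasing the host, and keeping only http and https. normalize_link and ALLOWED_SCHEMES are hypothetical names.

```python
from typing import Optional
from urllib.parse import urldefrag, urljoin, urlsplit, urlunsplit

ALLOWED_SCHEMES = {"http", "https"}  # assumption: only web pages are crawled


def normalize_link(base_url: str, href: str) -> Optional[str]:
    """Resolve href against base_url and return a canonical URL, or None to skip it."""
    absolute = urljoin(base_url, href.strip())  # handles relative and // links
    absolute, _fragment = urldefrag(absolute)   # "#section" never changes the page
    parts = urlsplit(absolute)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.netloc:
        return None  # mailto:, javascript:, and similar links are dropped
    # Assumes the URL carries no user:password part, so lowercasing netloc is safe.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", parts.query, "")
    )
```

Normalizing before the deduplication check means that /wiki/Asyncio and /wiki/Asyncio#History count as the same page.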