Personal project · 2025

Wikipedia Scraper

A crawler that reads as much of Wikipedia as it can in 20 seconds, with 100 workers running concurrently.

Python · asyncio · aiohttp · BeautifulSoup

The problem

Crawling a big site fast means keeping a hundred connections busy without ever fetching the same page twice, and stopping exactly on a deadline.

I wrote an async crawler where 100 workers share one queue and one set of seen URLs, and a global deadline cancels everything at exactly 20 seconds.
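
A minimal sketch of that shape, not the project's actual code: the names crawl and fetch_links are hypothetical, and fetch_links stands in for whatever coroutine downloads a page and returns its links (in the real crawler that would be aiohttp plus BeautifulSoup). The deadline here is enforced with asyncio.wait_for, which is one way to do it and is only as precise as the event loop's scheduling.

```python
import asyncio


async def crawl(start, fetch_links, workers=100, deadline=20.0):
    """Crawl from `start` until the queue drains or `deadline` seconds pass.

    `fetch_links` is a hypothetical coroutine: url -> iterable of URLs.
    Returns the set of every URL that was discovered.
    """
    queue = asyncio.Queue()
    seen = {start}              # shared by all workers; safe because there is one event loop
    queue.put_nowait(start)

    async def worker():
        while True:
            url = await queue.get()
            try:
                for link in await fetch_links(url):
                    # Check and add happen with no await in between,
                    # so two workers cannot enqueue the same URL.
                    if link not in seen:
                        seen.add(link)
                        queue.put_nowait(link)
            except Exception:
                pass            # a bad page should not kill the worker
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    try:
        # Finish early if the queue drains, otherwise stop at the deadline.
        await asyncio.wait_for(queue.join(), timeout=deadline)
    except asyncio.TimeoutError:
        pass
    finally:
        for task in tasks:
            task.cancel()       # cancels in-flight fetches as well as idle workers
        await asyncio.gather(*tasks, return_exceptions=True)
    return seen
```

Marking a URL as seen when it is queued, rather than when it is fetched, is what keeps two workers from picking up the same page.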

100 Workers at Once. A hundred connections stay busy at the same time, so the crawler never sits waiting on a single page.
Stops on the Dot. A global deadline cancels every pending task at exactly 20 seconds.
Never Fetches Twice. One shared set of seen URLs means no page downloads twice and no cycle of links sends the crawler around in circles.
One Event Loop, No Threads. Everything is async, with careful link cleanup so odd URLs do not crash a worker.
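
As an illustration of what link cleanup can look like, here is a small sketch using only the standard library. The function name clean_link and the filtering rules are assumptions, not the project's actual rules: it resolves relative links, drops fragments, keeps only same-host /wiki/ paths, and returns None for anything it cannot parse instead of raising.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def clean_link(base_url, href):
    """Return an absolute article URL, or None if the link should be skipped.

    Hypothetical helper; the rules below are illustrative assumptions.
    """
    if not href:
        return None
    try:
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        parts = urlsplit(absolute)
        base_host = urlsplit(base_url).netloc
    except ValueError:
        return None             # malformed URL: skip it rather than crash
    if parts.scheme not in ("http", "https"):
        return None             # mailto:, javascript:, and similar
    if parts.netloc != base_host:
        return None             # stay on the same host
    if not parts.path.startswith("/wiki/"):
        return None
    if ":" in parts.path[len("/wiki/"):]:
        # Crude heuristic to skip namespaces such as File: or Special:.
        # It also skips the few article titles that contain a colon.
        return None
    return absolute
```

For example, with the page https://en.wikipedia.org/wiki/Python as the base, the href /wiki/Asyncio#History comes back as https://en.wikipedia.org/wiki/Asyncio, while /wiki/File:Logo.png and mailto: links come back as None.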

The full engineering is on GitHub.