Did you know your favorite website can detect when you’re browsing it on public transport and when you’re scrolling it lying in your bed?
Article
In this article, the author goes over the different ways websites detect bots. It's quite detailed, with some extra resources at the end for further reading. If you're familiar with the OSI network model [1], some sections are a bit easier to digest, but all in all it's a great article.
One reason I read the full thing, even though I was familiar with most of the content (based on a quick perusal), was that I've done my fair share of long-running web scraping, and it always pissed me off when the techniques I used would work for some sites and fail horribly for others.
With the knowledge from this article, I could look back on some projects that didn't work and make a pretty good guess as to what bot detection techniques they employed to foil my scraping. Even if you've never done any scraping and you don't foresee yourself doing it anytime soon, I would still suggest being familiar with these techniques.
Why?
Well, if you hang around the interwebs long enough, you'll run into a website that has data you want. It will either have no API for querying the data or have an API which is severely lacking for your needs. In that moment, you'll either decide to move on or scrape it. You would then either get the data you need or run into one of these bot detection techniques. Seeing as you've armed yourself with the knowledge beforehand, you can avoid the initial frustration that comes with failure and proceed to test different approaches till you find something that works.
Thanks for reading and, as always, all comments, critiques and questions are highly appreciated. Here's a link to the previous article response.
[1] - Google and/or LLMs are your friend