While interesting and well-intentioned, the advice in the article can cause issues. The agent isn't downloading CSS or images? Could be a blind user. Pages being downloaded in quick succession? Could be browser prefetch. Lack of mouse movements during navigation? It could be a user on a mobile device or screen reader.
What if the bot is making individual requests from unique IP addresses? What if it's scraping many pages over time rather than a single smash-and-grab approach?
The article admits that this is a hard thing to solve. In most cases, it's probably not very worthwhile to try to detect bots in the ways that are suggested. Focusing on patterns in the server logs to find out what's being targeted might be more beneficial. Then slap a login or captcha around anything valuable that bots shouldn't have access to.
Basically, I've written bots for scraping governmental sites and fashion sites. I've never yet had a problem (although the code on one fashion site looked like all classes were randomly generated, which implied some attempt to keep scraping from working). So what I'm wondering is: what kinds of sites actually go through the trouble of tagging you as a bot? Since you've got experience, care to share?
Ages ago I was one of those bots. Most fun I've had playing cat-and-mouse. Their team was actually quite good. And I learned that Perl can solve any problem.
I joke that I only have two real skills. Asking questions and identifying good ideas.
In the latter category I have heterogeneous load balancing, segregated by traffic type. For example, if you have a read-mostly web application, often the most expensive parts are in the administrative functions. If they run on separate hardware, then a regression doesn't affect everyone, and a client side bug or flood of traffic doesn't automatically lock people out of the admin workflow.
Similarly, with Angular and other interactive frameworks you have server-side rendering for web crawlers and for anyone with JavaScript disabled, which generally involves running a small quantity of specialized services on a few machines.
Shunting all users who smell like a bot to a small set of machines doesn’t lock them out, but it does preserve the experience for everybody else.
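A minimal sketch of that routing idea, with invented pool names and an invented bot_score threshold, just to make the shape concrete:

```python
# Route traffic to separate pools by type, so admin work and bot-ish
# traffic never degrade the main pool. All names/thresholds are made up.

def choose_pool(path: str, bot_score: float) -> str:
    """Pick a backend pool for a request."""
    if path.startswith("/admin"):
        return "admin-pool"     # a regression here doesn't hit readers
    if bot_score > 0.8:
        return "suspect-pool"   # bots still get served, just elsewhere
    return "main-pool"

print(choose_pool("/admin/users", 0.1))   # admin-pool
print(choose_pool("/article/42", 0.95))   # suspect-pool
print(choose_pool("/article/42", 0.05))   # main-pool
```

The point being that a misclassified human still gets a working (if slower) site, rather than a block page.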
I somewhat agree, but those can still be good signals: you just have to gather statistical data on them from known-good users and combine them with other signals. Like, what are the chances that an agent is being used by a real person if it doesn't download images, lacks mouse movements and, for example, is coming from a datacenter? Obviously pretty low. The same way access patterns alone don't give you much, but are good against scrapers when combined with other signals.
And putting captchas or logins on anything you have to protect from bots also "protects" it from real people; most of them simply leave and don't bother with those things.
> Like what are the chances for an agent to be used by a real person if it doesn't download images, lacks mouse movements and, for example, is coming from a datacenter? Obviously pretty low.
Congrats, you just identified a visually impaired user with a screen reader who uses a VPN through AWS/GCP/etc.
Depending on your site's purpose and where your servers are located / where your main customer base is, you might be opening yourself up to extremely negative public relations by blocking this person as a bot, or even to legal repercussions.
This is completely irrelevant to the conversation about technical solutions. My example is just about probabilities and how to use them for bot detection.
What exactly in that logic in any way leads to turning off the site entirely? It's literally the opposite: narrowing down the bots by calculating the probability that they're not real people, using multiple signals and something like Bayesian inference. It's pretty much primitive machine learning.
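A rough naive-Bayes sketch of what combining signals like that could look like. The per-signal likelihoods below are invented placeholders; in practice you would estimate them from logged traffic of known-good users:

```python
# P(signal | bot) and P(signal | human), per observed signal.
# These numbers are illustrative guesses, not measurements.
LIKELIHOODS = {
    "no_images":     (0.90, 0.10),
    "no_mouse":      (0.95, 0.30),   # screen readers make this one noisy
    "datacenter_ip": (0.80, 0.02),
}

def bot_probability(signals, prior_bot=0.5):
    """Naive-Bayes posterior that the client is a bot."""
    p_bot, p_human = prior_bot, 1.0 - prior_bot
    for s in signals:
        l_bot, l_human = LIKELIHOODS[s]
        p_bot *= l_bot
        p_human *= l_human
    return p_bot / (p_bot + p_human)

# One weak signal alone is far from conclusive...
print(round(bot_probability(["no_mouse"]), 3))
# ...but all three together push the posterior close to 1.
print(round(bot_probability(["no_images", "no_mouse", "datacenter_ip"]), 3))
```

Which is exactly why the single-signal counterexamples upthread (screen readers, VPNs) don't break the approach: no one signal decides anything on its own.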
I'm surprised there are so many trolling comments here.
Well, if that was the case, then the parent comment pointing out that you could have had a false positive certainly seems relevant; but you wanted to know what their point was. I thought their point was just that these false positives could create negative consequences for anyone implementing a bot-prevention solution on their site.
In the context of describing an idea to minimize false positives, pointing out a false positive in the given example is at best nitpicking, it's definitely not relevant.
My favorite thing to do was to detect a bot and then show the bot a parallel dataset I'd developed of bogus information. Instead of hitting a block and working its way around it, competitors scraped a bunch of slightly warbled nonsense data. Have fun with that gang! Also made it super duper easy to see who'd try to steal our data.
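A hypothetical sketch of that trick: detected bots get a parallel, slightly-warbled dataset, and each suspected client gets its own deterministic perturbation, so a republished copy points straight back at whoever scraped it. All the names and numbers here are invented:

```python
import hashlib

REAL_PRICES = {"widget": 19.99, "gadget": 34.50}

def prices_for(client_id: str, is_bot: bool) -> dict:
    """Serve real data to humans, traceable nonsense to suspected bots."""
    if not is_bot:
        return REAL_PRICES
    # Deterministic per-client nudge of up to a few percent: the same bot
    # always sees the same lies, and the pattern of lies identifies
    # whoever republishes them.
    seed = int(hashlib.sha256(client_id.encode()).hexdigest(), 16)
    return {
        name: round(price * (1 + ((seed >> (i * 8)) % 7 - 3) / 100), 2)
        for i, (name, price) in enumerate(REAL_PRICES.items())
    }

print(prices_for("alice", is_bot=False))        # the real data
print(prices_for("scraper-123", is_bot=True))   # slightly wrong, traceable
```

The per-client determinism is the part that made "seeing who'd steal our data" easy in the anecdote above: the warble doubles as a watermark.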
As someone on the other side of the game: this is incredibly effective and can be near-impossible for me to figure out you're feeding me BS data.
I have also done this as an anti-botting tactic myself, though, and it can get pretty complicated to set up properly unless you're doing the low-tech "feed these networks BS data" list.
I work for a SaaS company that provides a domain per customer, and one of the consequences of this that never occurred to me before is that even polite crawlers and bots can thrash your servers if you have enough TLDs.
Every few months someone discovers the correlation between our log messages and user agents and we rehash the same discussion about a particular badly behaved crawler that produces enough log noise that it distracts from identifying new regressions.
I coordinated an effort to fix a problem with bad encoding of query parameters ages ago, but we still see broken ones from this one bot.
I just launched a site. I have not mentioned it anywhere and there are no links to it on the internet, yet the logs are full of bots looking for vulnerabilities. Judging by what they are looking for, it seems I could eliminate half of them by if (location.contains('php')).
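A minimal version of that filter, dropping probe requests whose path mentions "php" on a site that doesn't serve PHP at all (the example log paths are invented):

```python
def is_php_probe(path: str) -> bool:
    """Crude heuristic: any mention of 'php' on a non-PHP site is a probe."""
    return "php" in path.lower()

requests = [
    "/index.html",
    "/wp-login.php",
    "/phpMyAdmin/setup.php",
    "/about",
]
legit = [p for p in requests if not is_php_probe(p)]
print(legit)  # ['/index.html', '/about']
```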
Those are kiddies and most of them are dumb but filtering more complex ones is not that easy.
Don't worry about backlinks: they know you've registered a domain, so they're already checking the A records in its DNS. They are much more clever than 10 years ago, when we could expect zero requests to a site that was just set up.
I think bots should have full rights to access the internet, just like humans.
Discriminating against bots just makes the internet worse for everyone. In the future, everyone will have personal bots and agents to help automate many things.
The message I'm getting from this article is that headless Chrome should offer an 'undetectable' mode where its unique, fingerprintable globals are replaced with those from the headful version.
There are many false positives, though: it might believe something is a bot but be wrong. Changing the number in the URL is something I commonly do. Additionally, I often use curl when I want to download a file (rather than using the browser's download manager).
We detect web bots by analyzing the behavior of all web traffic, and via clustering we find entities that are behaving unusually similarly to each other. This way we detect "clusters" or families of bots.
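A toy illustration of that clustering idea: entities whose behavior vectors are unusually similar get grouped into one "family". The features, values, and similarity threshold are all invented; a real system would use far richer signals than this:

```python
import math

def similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Per-entity features: (requests/min, avg seconds between clicks,
# fraction of pages where images were loaded)
entities = {
    "10.0.0.1":  (60.0, 1.0, 0.0),    # suspiciously uniform...
    "10.0.0.2":  (59.0, 1.1, 0.0),    # ...and nearly identical to the first
    "home-user": (3.0, 20.0, 0.98),
}

# Greedy single-pass grouping: join the first cluster you're almost
# identical to, otherwise start a new one.
clusters = []
for name, vec in entities.items():
    for cluster in clusters:
        if similarity(vec, entities[cluster[0]]) > 0.999:
            cluster.append(name)
            break
    else:
        clusters.append([name])

print(clusters)  # the two near-identical IPs land in one family
```

The interesting output isn't "this one request looks botty" but "these fifty IPs behave like clones of each other", which no single-request heuristic can see.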
Since 1/21 I have identified 153,423 attempts to gain access to some servers I run, all from unique IP addresses. It's one thing to identify bots; it's quite another what to do about them.
> it’s safe to assume that attackers are likely to lie on their fingerprints
I hate this sentiment that everyone who doesn't want to be tracked is a criminal.
It's the digital version of "You don't have anything to hide if you aren't doing bad".
---
I don't understand the fear about bots or scraping. As long as bots are behaving nicely (not slamming servers), they are just as much web citizens as humans.
The web is about sharing of information and having an entire company about exterminating the viability of bots is horrifying.
How do bots fuck up your website? I have been involved in running multiple websites (some of which had plenty of bots crawling them) and have only had issues once or twice.
For a site I run, people mine my data to host their own competing services to get a slice of my pie. You can reverse engineer my database by hitting my service with millions of requests, a database I spent a lot of time and money putting together.
Sure, you can try, but it's my right (and sport) to make it as hard as I can for you. :)
People like to imagine feel-good abstract ideas behind both bots and Tor like "the freedom/openness of information" or some journalist trying to visit your website under some strict regime. In reality, 99% of it is just abuse or someone trying to make a buck off you. It's not really as romantic as you think.
Not all sites are valid targets for malicious bots. I presume your sites were not multiplayer web games. I have a serious bot problem in my game where bots ruin the fun for everyone if I let them be. As soon as you add a competitive element to your game you will have a bot problem.
There are many different kinds of websites. For some, bots are a nuisance at most. For others, they can impact or debilitate the entire service. Obviously they're not going to be a significant issue for your read-only cat picture static site.
With a few exceptions like HD video hosting or sites doing a ton of computation for every request, the best response to point 1 is to just lighten the site, which reduces the load from bots and also improves the UX for humans. Rate limiting can also help in some circumstances. Or maybe consider... charging money.
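One concrete form of the rate limiting mentioned above is a token bucket per client. The numbers here are illustrative, not a recommendation:

```python
import time

class TokenBucket:
    """Allow `rate` requests/sec on average, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2.0, burst=5.0)   # 2 req/s, bursts of 5
results = [bucket.allow() for _ in range(8)]
print(results)  # the first 5 pass, the rapid-fire rest are throttled
```

A polite human never notices a limit like this; a smash-and-grab scraper hits it immediately.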
If you don't force botters to take extreme camouflage measures, bots are easily filtered out of logs (and offer a potentially useful metric of their own).
If a business is threatened by "competitors" simply scraping published information, it's probably already doomed from the start. And I would posit that most site owners with this mindset originally got their data by scraping other sites, which is why they feel vulnerable. Compete over elements that are actually valuable!
> > it’s safe to assume that attackers are likely to lie on their fingerprints
> I hate this sentiment that every one who doesn't want to be tracked is a criminal.
The statement you are referring to is more accurately paraphrased as “everyone who is a criminal does not want to be tracked”, which is very different from “everyone who does not want to be tracked is a criminal.”
> It's the digital version of "You don't have anything to hide if you aren't doing bad".
For the reason noted above, the statement at issue is actually the digital version of "if you are doing something bad, you do have something to hide," which is a near-universal truth (I mean, it's not true if you have actual or practical immunity from any accountability for your wrongdoing, but otherwise it's true.)
This makes, and implies, no claims about "probable cause", and isn't about criminal process where "probable cause" would even be relevant; it's about security policy where personal rights are not involved, only private discretionary benefits. It is often entirely appropriate in such a context to apply heuristics which are effective because they minimize false negatives, even when they might have a high false positive rate.
You have players completely destroying the fun for other players in your game. How you should address this (eg bracketed matchmaking, rate limiting, bans for harrassment, etc) has to do with what they're doing, not whether the game input is being generated by a human or computer.
I really don't understand the higher purpose of "bots should never be banned anywhere, ever". You are allowed to think scraping is OK while thinking they should be banned in other contexts at the same time.
And FYI the bots are destroying the in-game economy in my game by doing extreme grinding no normal human being is capable of. It's a very hard problem to tackle with mechanics.
I enthusiastically support banning bots for abusive behaviour. I also enthusiastically support banning humans for abusive behaviour.
> extreme grinding no normal human being is capable of
> a very hard problem to tackle with mechanics.
- diminishing returns for over-farming particular areas (per-player, maybe also total)
- stat penalties for playing too long / bonuses for rest ("You have worked a forty-hour week in this game, don't you have a life you should be getting back to?")
- moving resources[0] around to break pathfinding, and/or adding poisoned ones to discourage blind grabbing.
- or just ban people who regularly dump extreme quantities of grind-farmed stuff into the economy (since that is the behaviour you're trying to stop), bots or no bots.
0: including enemies, who might, for example, run away from someone who's been slaughtering them for the past six hours
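The first suggestion (diminishing returns for over-farming) might look something like this, with made-up numbers: each repeat harvest of the same area yields half as much, so round-the-clock grinding stops paying off.

```python
def farm_yield(base: int, times_farmed_today: int) -> int:
    """Yield shrinks geometrically with repeated farming of one area."""
    return max(1, base // (2 ** times_farmed_today))

yields = [farm_yield(100, n) for n in range(8)]
print(yields)       # [100, 50, 25, 12, 6, 3, 1, 1]
print(sum(yields))  # 198: a whole day of grinding caps out near 2x one harvest
```

A bot grinding 24/7 then earns barely more than a casual player, without anyone needing to prove it's a bot.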
Sure, those are good ideas (even though most are not applicable to my game specifically, it's probably not what you imagine) but I presume you understand that it's much easier for me to just detect and ban bots. Regardless I don't want bots running around that won't respond or interact with other players. That itself is a reason to ban them.
I still don't understand why you think bots should be allowed in my game at all. I don't want them and the players certainly don't want them (bot accusation is the number one drama).
One of my services routinely bans abusive hosts that repeatedly try to evade bans, but via TCP fingerprinting we notice that their machine started up less than a minute ago each time it rotates.
Non-passive: It's also possible to read a GPU name like "VMware SVGA" from JS, or watch for mismatched/wrong hardwareConcurrency.
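A sketch of acting on those non-passive checks server-side, once a client-side script has reported its fields back. The VM-ish GPU strings and the core-count threshold are illustrative, not exhaustive:

```python
# Renderer strings commonly reported by virtualized or software GPUs.
VM_GPU_HINTS = ("vmware svga", "virtualbox", "llvmpipe", "swiftshader")

def looks_virtualized(fingerprint: dict) -> bool:
    """Flag fingerprints whose GPU or core count suggests a VM/headless box."""
    gpu = fingerprint.get("gpu", "").lower()
    cores = fingerprint.get("hardwareConcurrency", 0)
    if any(hint in gpu for hint in VM_GPU_HINTS):
        return True
    # Real consumer hardware almost never reports 0 or 1 cores.
    return cores <= 1

print(looks_virtualized({"gpu": "VMware SVGA 3D", "hardwareConcurrency": 8}))       # True
print(looks_virtualized({"gpu": "NVIDIA GeForce RTX 3060", "hardwareConcurrency": 12}))  # False
```

Like the mouse-movement signal, this is only one input; plenty of legitimate users run VMs, so it should raise a score rather than trigger a block on its own.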
I wonder if they could just make a new browser standard which lets the server put the browser into a mode where the browser itself ensures there’s a live human interacting with the page/app, like using the webcam + OS facial recognition to ensure the server that it’s not a bot.
Unless it was closed source / DRM-y, the scraper developer would just compile their browser from source and fake the detection code.
Scrapers that know what they're doing could implement this in a day. There are already easier ways to detect scrapers that don't know what they're doing.
And that's just the start of a long cat & mouse game that'd inevitably end up even more user-hostile and inaccessible than requiring a webcam to browse a website.
Answering this question was recently a 1B acquisition for F5 with their purchase of Shape Security.
Will be interesting to see what happens to other industry players like PerimeterX. Distil was eaten by the corpse of Imperva, and I don't see Akamai making strong headway with Botman.
Google is going after this too with reCAPTCHA. The HN reaction to that has been interesting.
It's interesting to me how many comments in these threads talk about scraping as the issue with bots. Every sale I saw when I worked on this problem was related to credential stuffing. Seems the enterprise dollars are in the fraud space, but the HN sentiment is in scraping.
Funny how disconnected the community here can be from what I saw first-hand as the "real" issue. Makes me wonder what other topics it gets wrong. Surely my area of expertise isn't special.