How to detect web bots? (antoinevastel.com)
115 points by avastel on Feb 9, 2020 | hide | past | favorite | 75 comments


While interesting and well-intentioned, the advice in the article can cause issues. The agent isn't downloading CSS or images? Could be a blind user. Pages being downloaded in quick succession? Could be browser prefetch. Lack of mouse movements during navigation? It could be a user on a mobile device or screen reader.

What if the bot is making individual requests from unique IP addresses? What if it's scraping many pages over time rather than a single smash-and-grab approach?

The article admits that this is a hard thing to solve. In most cases, it's probably not worthwhile to try to detect bots in the ways that are suggested. Focusing on patterns in the server logs to find out what's being targeted might be more beneficial. Then slap a login or captcha around anything valuable that bots shouldn't have access to.
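The log-mining suggestion could start as small as this sketch; the four-field log format here is a simplified stand-in for a real access-log parser:

```python
from collections import Counter

def top_targets(log_lines, n=3):
    """Tally requested paths from simplified access-log lines of the
    form "IP METHOD PATH STATUS" and return the n most-hit paths."""
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) >= 3:
            counts[parts[2]] += 1
    return counts.most_common(n)

logs = [
    "10.0.0.1 GET /wp-login.php 404",
    "10.0.0.2 GET /prices 200",
    "10.0.0.3 GET /wp-login.php 404",
]
print(top_targets(logs))  # /wp-login.php tops the list with 2 hits
```

Whatever floats to the top of that list is the thing worth putting behind a login or captcha.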


I use a keyboard-driven browser to reduce wrist strain, and I sometimes get tagged as a bot, probably for lack of mouse movement.


Basically, I've written bots for scraping governmental sites and fashion sites. I've never yet had a problem (although the code on one fashion site looked like all classes were randomly generated, which implied some attempt to keep scraping from working). So what I'm wondering is: what kinds of sites actually go through the trouble of tagging you as a bot? Since you've got experience, care to share?


Yandex Search does it to me regularly. Sometimes it takes 5 attempts to get through the captchas.


Ticketmaster


Good response, I definitely never thought of that before. A serious case where you don't want bots buying stuff up.


Ages ago I was one of those bots. Most fun I've had playing cat-and-mouse. Their team was actually quite good. And I learned that Perl can solve any problem.


Have you written about this? Sounds like an entertaining story.


I have not. But I will and will post to HN.


I joke that I only have two real skills. Asking questions and identifying good ideas.

In the latter category I have heterogeneous load balancing, segregated by traffic type. For example, if you have a read-mostly web application, often the most expensive parts are in the administrative functions. If they run on separate hardware, then a regression doesn’t affect everyone, and a client-side bug or flood of traffic doesn’t automatically lock people out of the admin workflow.

Similarly, for Angular and other interactive frameworks, you have server-side rendering for web crawlers and for anyone with JavaScript disabled, which generally involves running a small quantity of specialized services on a few machines.

Shunting all users who smell like a bot to a small set of machines doesn’t lock them out, but it does preserve the experience for everybody else.


I somewhat agree, but those can still be good signals; you just have to gather statistical data on them from known good users and combine it with other signals. Like what are the chances for an agent to be used by a real person if it doesn't download images, lacks mouse movements and, for example, is coming from a datacenter? Obviously pretty low. The same way, access patterns alone don't give you much, but they are good against scrapers when combined with other signals.

And putting captchas or logins on anything you have to protect from bots also "protects" it from real people; most of them simply leave and don't bother with those things.
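The "combine with other signals" idea is essentially naive Bayes. A toy sketch, where every likelihood number is made up purely for illustration:

```python
def bot_probability(signals, likelihoods, prior_bot=0.3):
    """Naive-Bayes combination of independent boolean signals.
    likelihoods maps signal name -> (P(signal|bot), P(signal|human)).
    All the numbers here are invented for illustration only."""
    p_bot, p_human = prior_bot, 1 - prior_bot
    for name, present in signals.items():
        pb, ph = likelihoods[name]
        p_bot *= pb if present else (1 - pb)
        p_human *= ph if present else (1 - ph)
    return p_bot / (p_bot + p_human)

likelihoods = {
    "no_images":     (0.9, 0.05),
    "no_mouse":      (0.8, 0.10),
    "datacenter_ip": (0.7, 0.02),
}
signals = {"no_images": True, "no_mouse": True, "datacenter_ip": True}
print(bot_probability(signals, likelihoods))  # very close to 1: likely a bot
```

The point is that no single signal condemns anyone; it's the joint probability that does, and the likelihood tables would come from measured data on known-good users.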


> Like what are the chances for an agent to be used by a real person if it doesn't download images, lacks mouse movements and, for example, is coming from a datacenter? Obviously pretty low.

Congrats, you just identified a visually impaired user with a screen reader who uses a VPN through AWS/GCP/etc.


And your point is?


That depending on your site's purpose and where your servers are located or where your main customer base is, you might be opening yourself up to extremely negative public relations by blocking this person as a bot, or even to legal repercussions.


This is completely irrelevant to the conversation about technical solutions. My example is just about probabilities and how to use them for bot detection.


If that logic is extended, the true technical solution is to turn the site off entirely.


What exactly in that logic leads in any way to turning off the site entirely? It's literally the opposite: narrowing down the bots by calculating the probability of them not being real people, using multiple signals and something like Bayesian inference. It's pretty much primitive machine learning.

I'm surprised there are so many trolling comments here.


Well, if that was the case, then the parent comment pointing out that you could have had a false positive certainly seems relevant; but you wanted to know what their point was. I thought their point was just that these false positives could create negative consequences for anyone implementing a bot-prevention solution on their site.


In the context of describing an idea to minimize false positives, pointing out a false positive in the given example is at best nitpicking; it's definitely not relevant.


The ADA exists and for good reasons.


My favorite thing to do was to detect a bot and then show the bot a parallel dataset I'd developed of bogus information. Instead of hitting a block and working its way around it, competitors scraped a bunch of slightly warbled nonsense data. Have fun with that, gang! It also made it super duper easy to see who'd try to steal our data.
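A hedged sketch of the decoy-dataset trick: serve deterministically skewed values to flagged clients, so scraped copies are plausible but wrong, and traceable by their consistent per-client skew. All names and numbers here are invented for illustration:

```python
import random

REAL_PRICES = {"widget": 19.99, "gadget": 4.50}

def decoy_prices(real, seed):
    """Deterministically 'warble' real values by 5-15% in either
    direction. Seeding on the client id makes the decoy stable, so
    leaked copies identify which client scraped them."""
    rng = random.Random(seed)
    return {
        k: round(v * (1 + rng.choice([-1, 1]) * rng.uniform(0.05, 0.15)), 2)
        for k, v in real.items()
    }

def prices_for(client_is_bot, client_id):
    # Bots get a stable per-client decoy set; humans get the truth.
    return decoy_prices(REAL_PRICES, client_id) if client_is_bot else REAL_PRICES

print(prices_for(False, "client-abc"))  # real data
print(prices_for(True, "client-abc"))   # warbled data, same shape
```

Keeping the decoy deterministic matters: a bot that diffs two fetches sees identical data, so nothing looks suspicious.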


As someone on the other side of the game: this is incredibly effective and can be near-impossible for me to figure out you're feeding me BS data.

I have also done this as an anti-botting tactic myself, though, and it can get pretty complicated to set up properly unless you're doing the low-tech "feed these networks BS data" list.


That's really, really funny.

I would absolutely love to see that as someone who scrapes sites. It would make me smile immensely to see the trolling.




I really like this approach to handling bots.


I work for a SaaS company that provides a domain per customer, and one of the consequences of this that never occurred to me before is that even polite crawlers and bots can thrash your servers if you have enough domains.

Every few months someone discovers the correlation between our log messages and user agents and we rehash the same discussion about a particular badly behaved crawler that produces enough log noise that it distracts from identifying new regressions.

I coordinated an effort to fix a problem with bad encoding of query parameters ages ago, but we still see broken ones from this one bot.


I just launched a site. I have not mentioned it anywhere and there are no links to it on the internet, yet the logs are full of bots looking for vulnerabilities. Judging by what they are looking for, it seems I could eliminate half of them with if (location.contains('php')).
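That filter really is a one-liner; a sketch mirroring the contains('php') heuristic for a site that serves no PHP at all:

```python
def looks_like_probe(path):
    """Flag requests mentioning PHP on a site that serves no PHP.
    Crude on purpose: it mirrors the contains('php') idea above."""
    return "php" in path.lower()

requests = ["/index.html", "/wp-login.php", "/phpmyadmin/index.PHP", "/about"]
print([p for p in requests if looks_like_probe(p)])
# ['/wp-login.php', '/phpmyadmin/index.PHP']
```

Cheap substring checks like this catch the dumb scanners; anything smarter needs the statistical signals discussed elsewhere in the thread.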


Those are script kiddies and most of them are dumb, but filtering the more complex ones is not that easy.

Don't worry about backlinks: they know you've registered a domain, so they're already checking the A records in its DNS. They are much more clever than 10 years ago, when we could expect zero requests to a site that had just been set up.


I think bots should have full rights to access the internet, just like humans.

Discriminating against bots just makes the internet worse for everyone. In the future, everyone will have personal bots and agents to help automate many things.


I'm actually being serious about this. Hopefully others are worried about the trend to discriminate against bots.

If corporations can have personhood rights, why can't non-human agents?


Corporations should not have human rights. Bots should not either but they should be allowed on the internet, no need to bring rights into it.


Not sure what 'full rights' mean here. My rights involve zero obligation to serve every request to my server.

It doesn't make much sense to me for someone to have the 'right' to demand me to respond to them nor their Python loop.


The message I'm getting from this article is that headless Chrome should offer an 'undetectable' mode where its unique, fingerprintable globals are replaced with those from the headful version.
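For context, the fingerprintable globals meant here include well-known headless-Chrome tells: the "HeadlessChrome" user-agent substring, navigator.webdriver being true, and the window.chrome object being absent. A server-side sketch, assuming a page script has already reported those values back (the js_props dict and its reporting mechanism are assumptions for illustration):

```python
def headless_signals(user_agent, js_props):
    """Collect well-known headless-Chrome tells. js_props holds values
    a page script reported back, keyed by the standard browser global
    names; how they get to the server is outside this sketch."""
    tells = []
    if "HeadlessChrome" in user_agent:
        tells.append("user-agent says HeadlessChrome")
    if js_props.get("navigator.webdriver"):
        tells.append("navigator.webdriver is true")
    if not js_props.get("window.chrome"):
        tells.append("window.chrome object missing")
    return tells

ua = "Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/80.0.3987.0 Safari/537.36"
print(headless_signals(ua, {"navigator.webdriver": True, "window.chrome": False}))
```

An 'undetectable' mode would essentially have to zero out every one of these tells, which is exactly the cat-and-mouse game the article describes.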


There are many false positives, though. It might believe something is a bot but be wrong. Changing the number in the URL is something I commonly do. Additionally, I often use curl when I want to download a file (rather than using the browser's download manager).


Out of interest, why do you prefer curl?


I just don't like the graphical file management.


At the very least you can use these techniques to know what not to do when scraping, and therefore scrape better.


We detect web bots by analyzing the behavior of all web traffic, and via clustering we find entities that behave unusually similarly to each other. This way we detect "clusters" or families of bots.
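The clustering idea might look like this toy, stdlib-only sketch; a real system would run a proper clustering algorithm over many more behavioral features, but the shape is the same:

```python
def cluster_clients(features, tol=0.1):
    """Greedy single-pass clustering: clients whose behavior vectors
    (e.g. requests/min, error rate) sit within `tol` of a cluster's
    representative are grouped together. Purely illustrative; real
    pipelines use k-means, DBSCAN, etc."""
    clusters = []  # list of (representative_vector, [client_ids])
    for cid, vec in features.items():
        for rep, members in clusters:
            if all(abs(a - b) <= tol for a, b in zip(vec, rep)):
                members.append(cid)
                break
        else:
            clusters.append((vec, [cid]))
    return [members for _, members in clusters]

features = {
    "bot-1": (5.0, 0.9),    # near-identical timing and error profiles
    "bot-2": (5.0, 0.9),
    "bot-3": (5.05, 0.88),
    "human": (0.4, 0.02),
}
print(cluster_clients(features))  # the three bots land in one cluster
```

The giveaway isn't any single client's behavior; it's that unrelated "users" behave suspiciously identically.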


Interesting!

I wrote something that leverages AWS Lambdas to get around rate limiting; this solution would've tagged me instantly.


Since 1/21 I have identified 153,423 attempts to gain access to some servers I run, all from unique IP addresses. It's one thing to identify bots; it's quite another to know what to do about them.


> it’s safe to assume that attackers are likely to lie on their fingerprints

I hate this sentiment that everyone who doesn't want to be tracked is a criminal.

It's the digital version of "You don't have anything to hide if you aren't doing bad".

---

I don't understand the fear about bots or scraping. As long as bots are behaving nicely (not slamming servers), they are just as much web citizens as humans.

The web is about sharing of information and having an entire company about exterminating the viability of bots is horrifying.


>having an entire company about exterminating the viability of bots is horrifying.

I have a website I use to make money to feed my family. Shitty bots mess up my website, cost me money. A SaaS offers a solution, which saves me money.

What about that line of logic is horrifying to you?


How do bots fuck up your website? I have been involved in running multiple websites (some of which had plenty of bots crawling them) and have only had issues once or twice.


For a site I run, people mine my data to host their own competing services to get a slice of my pie. You can reverse engineer my database by hitting my service with millions of requests, a database I spent a lot of time and money putting together.

Sure, you can try, but it's my right (and sport) to make it as hard as I can for you. :)

People like to imagine feel-good abstract ideas behind both bots and Tor like "the freedom/openness of information" or some journalist trying to visit your website under some strict regime. In reality, 99% of it is just abuse or someone trying to make a buck off you. It's not really as romantic as you think.


Not all sites are valid targets for malicious bots. I presume your sites were not multiplayer web games. I have a serious bot problem in my game where bots ruin the fun for everyone if I let them be. As soon as you add a competitive element to your game you will have a bot problem.


There are many different kinds of websites. For some, bots are a nuisance at most. For others, they can impact or debilitate the entire service. Obviously they're not going to be a significant issue for your read-only cat picture static site.


More bots, more EC2s, more money

More bots, more noise, bad analytics

More bots, IP theft at scale, competitors now have my content

I don't understand how this is hard to wrap one's head around. This is why we can't have nice things.

Edit - I offer 3 points to refute parent and I'm getting downvoted. This site is moderated by a bunch of bots!


With a few exceptions like HD video hosting or sites doing a ton of computation for every request, the best response to point 1 is to just lighten the site, which reduces the load from bots and also improves the UX for humans. Rate limiting can also help in some circumstances. Or maybe consider... charging money.

If you don't force botters to take extreme camouflage measures, bots are easily filtered out of logs (and offer a potentially useful metric of their own).

If a business is threatened by "competitors" simply scraping published information, it's probably already doomed from the start. And I would posit that most site owners with this mindset originally got their data by scraping other sites, which is why they feel vulnerable. Compete over elements that are actually valuable!
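Rate limiting, mentioned above, is cheap to sketch. A minimal token-bucket limiter (the injectable clock is just for testability; numbers are illustrative):

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: sustain `rate` requests per
    second with bursts up to `capacity`. `now` is injectable so the
    clock can be faked in tests."""
    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.now = now
        self.last = now()

    def allow(self):
        t = self.now()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

clock = [0.0]
tb = TokenBucket(rate=1, capacity=2, now=lambda: clock[0])
print([tb.allow() for _ in range(3)])  # [True, True, False]
clock[0] = 1.0
print(tb.allow())  # True: one token refilled after a second
```

One bucket per client IP (or per session) throttles smash-and-grab scrapers without touching normal browsing.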


> > it’s safe to assume that attackers are likely to lie on their fingerprints

> I hate this sentiment that every one who doesn't want to be tracked is a criminal.

The statement you are referring to is more accurately paraphrased as “everyone who is a criminal does not want to be tracked”, which is very different from “everyone who does not want to be tracked is a criminal.”

> It's the digital version of "You don't have anything to hide if you aren't doing bad".

For the reason noted above, the statement at issue is actually the digital version of “if you are doing something bad, you do have something to hide,” which is a near-universal truth (I mean, it's not true if you have actual or practical immunity from any accountability for your wrongdoing, but otherwise it's true).


The statement, yes; the sentiment is "and therefore we can use not wanting to be tracked as probable cause for treating them as a criminal".


This makes, and implies, no claims about “probable cause”, and it isn't about criminal process, where “probable cause” would even be relevant; it's about security policy, where personal rights are not involved, only private discretionary benefits. It is often entirely appropriate in such a context to apply heuristics which are effective because they minimize false negatives, even when they might have a high false positive rate.


Fair enough about the phrasing, although madamelic is the one who brought criminal process into it.

The sentiment is "and therefore we can discriminate against people who use user-agents[0] that don't actively help us violate their privacy rights".

0: such as screen readers, for that zesty ADA-lawsuit flavour.


I have bots completely destroying the fun for other players in my game. Are you saying I should just ignore it because they have "rights"?


You have players completely destroying the fun for other players in your game. How you should address this (e.g. bracketed matchmaking, rate limiting, bans for harassment, etc.) has to do with what they're doing, not whether the game input is being generated by a human or a computer.


I really don't understand the higher purpose of "bots should never be banned anywhere, ever". You are allowed to think scraping is OK while thinking they should be banned in other contexts at the same time.

And FYI the bots are destroying the in-game economy in my game by doing extreme grinding no normal human being is capable of. It's a very hard problem to tackle with mechanics.


> bots should never be banned anywhere, ever

I enthusiastically support banning bots for abusive behaviour. I also enthusiastically support banning humans for abusive behaviour.

> extreme grinding no normal human being is capable of

> a very hard problem to tackle with mechanics.

- diminishing returns for over-farming particular areas (per-player, maybe also total)

- stat penalties for playing too long / bonuses for rest ("You have worked a forty-hour week in this game, don't you have a life you should be getting back to?")

- moving resources[0] around to break pathfinding, and/or adding poisoned ones to discourage blind grabbing.

- or just ban people who regularly dump extreme quantities of grind-farmed stuff into the economy (since that is the behaviour you're trying to stop), bots or no bots.

0: including enemies, who might, for example, run away from someone who's been slaughtering them for the past six hours
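The diminishing-returns suggestion boils down to a single decay formula; a sketch where the half-life parameter is made up for illustration:

```python
def farm_yield(base, times_farmed, half_life=10):
    """Diminishing returns: repeated farming of the same spot halves
    in value every `half_life` repeats, so extreme grinding flattens
    out. The half_life of 10 is an arbitrary illustrative choice."""
    return base * 0.5 ** (times_farmed / half_life)

print(round(farm_yield(100, 0), 1))    # 100.0  first visit, full value
print(round(farm_yield(100, 10), 1))   # 50.0   halved after 10 repeats
print(round(farm_yield(100, 100), 1))  # 0.1    grinding is pointless
```

The appeal is that it punishes the *behavior* (inhuman repetition) rather than trying to classify the player, so it needs no bot detection at all.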


Sure, those are good ideas (even though most are not applicable to my game specifically; it's probably not what you imagine), but I presume you understand that it's much easier for me to just detect and ban bots. Regardless, I don't want bots running around that won't respond to or interact with other players. That itself is a reason to ban them.

I still don't understand why you think bots should be allowed in my game at all. I don't want them and the players certainly don't want them (bot accusation is the number one drama).


>"bots should never be banned anywhere, ever".

Who said that? You can ban bots for bad behaviour.

>As long as bots are behaving nicely (not slamming servers), they are just as much web citizens as humans.


Look at the comment I'm replying to.


That causes problems for their counterparties. Passively scraping a website doesn't, generally.


Affirming the consequent.


> ...the OS and its version, as well as whether it is a VM...

Could anybody link me up with something to read on how to detect it is a VM based on TLS fingerprint?


One of my services routinely bans abusive hosts that repeatedly try to evade bans, but via TCP fingerprinting we notice that their machine started up under a minute ago each time the IP rotates.

Non-passive: It's also possible to read a GPU name like "VMware SVGA" from JS, or watch for mismatched/wrong hardwareConcurrency.
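For context on the TCP-fingerprinting trick: TCP timestamps (RFC 7323 TSval) tick at a fixed per-host rate, and many stacks start the counter near zero at boot, so a small TSval hints at a recent boot. A rough sketch, assuming the tick rate has already been inferred, and noting that modern kernels may randomize the timestamp offset, so this is a hint rather than proof:

```python
def estimated_uptime_seconds(tsval, hz=1000):
    """Rough host-uptime guess from a TCP timestamp value (TSval).
    The tick rate is commonly 100-1000 Hz and must be inferred by
    watching the counter advance over a known interval."""
    return tsval / hz

# A TSval of 45_000 at 1000 Hz suggests the peer booted ~45 s ago --
# suspicious if this "returning visitor" was banned two minutes ago.
print(estimated_uptime_seconds(45_000))  # 45.0
```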


Thanks, I will be googling a bit more on these terms and techniques.


Bots have a PR problem.


What about users of RPA software like UiPath, Blue Prism, etc.?


I wonder if they could just make a new browser standard which lets the server put the browser into a mode where the browser itself ensures there’s a live human interacting with the page/app, like using the webcam + OS facial recognition to ensure the server that it’s not a bot.


This makes my skin crawl. I understand the intent, but this is morally wrong on many different levels.

I hate credential stuffing and malicious activity as much as the next person, but I would never sacrifice user privacy like that.

Also, you can spoof a webcam (but that's not the real problem).


It would be better than the unpaid labor we have to do for Captcha.


Captchas don't stop bots anymore; they only stop humans. I don't know why websites are still using them.


Unless it was closed source / DRM-y, the scraper developer would just compile their browser from source and fake the detection code.

Scrapers that know what they're doing could implement this in a day. There are already easier ways to detect scrapers that don't know what they're doing.

And that's just the start of a long cat & mouse game that'd inevitably end up even more user-hostile and inaccessible than requiring a webcam to browse a website.


Answering this question was recently a $1B acquisition for F5 with their purchase of Shape Security.

Will be interesting to see what happens to other industry players like PerimeterX. Distil was eaten by the corpse of Imperva, and I don't see Akamai making strong headway with Botman.

Google is going after this too with reCAPTCHA. The HN reaction to that has been interesting.

It's interesting to me how many comments in these threads talk about scraping as the issue with bots. Every sale I saw when I worked on this problem was related to credential stuffing. Seems the enterprise dollars are in the fraud space, but the HN sentiment is in scraping.

Funny how disconnected the community here can be from what I saw first-hand as the "real" issue. Makes me wonder what other topics it gets wrong. Surely my area of expertise isn't special.


I think that scraping is a more divisive and undecided thing. Credential stuffing is pretty cut-and-dry a bad thing.

No one credential stuffs for innocent fun.


Please go on!



