In the last week I've had to deal with two large-scale influxes of traffic on one particular web server in our organization.
The first involved requests from 300,000 unique IPs in a span of a few hours. I analyzed them and found that ~250,000 were from Brazil. I'm used to using ASNs to block network ranges sending this kind of traffic, but in this case they were spread thinly over 6,000+ ASNs! I ended up blocking all of Brazil (sorry).
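For what it's worth, the per-ASN analysis can be sketched in a few lines. The toy data below stands in for a real GeoIP/ASN lookup (e.g. a MaxMind database); all IPs, ASNs, and counts are made up for illustration:

```python
from collections import Counter

def summarize(ip_info):
    """ip_info: dict mapping IP -> (asn, country_code).

    Returns per-ASN and per-country counts, so you can see whether an
    ASN-level block is feasible or only a country-level one is."""
    asn_counts = Counter(asn for asn, _ in ip_info.values())
    country_counts = Counter(cc for _, cc in ip_info.values())
    return asn_counts, country_counts

# Toy data standing in for a real GeoIP/ASN database:
ips = {
    "198.51.100.1": ("AS64500", "BR"),
    "198.51.100.2": ("AS64501", "BR"),
    "203.0.113.7":  ("AS64502", "TR"),
}
asns, countries = summarize(ips)
print(countries["BR"])     # most sample IPs are Brazilian
print(max(asns.values()))  # but no single ASN dominates
```

When the second number stays tiny while the first is huge, you're in exactly the "spread thinly over 6,000+ ASNs" situation described above.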
A few days later this same web server was on fire again. I performed the same analysis on IPs and found a similar number of unique addresses, but spread across Turkey, Russia, Argentina, Algeria and many more countries. What is going on?! Eventually I think I found a pattern to identify the requests, in that they were using ancient Chrome user agents. Chrome 40, 50, 60 and up to 90, all released 5 to 15 years ago. Then, just before I could implement a block based on these user agents, the traffic stopped.
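A user-agent filter like the one almost deployed above can be sketched as follows; the cutoff of 100 is an arbitrary illustration, not a recommendation:

```python
import re

ANCIENT_CHROME = re.compile(r"Chrome/(\d+)\.")

def is_ancient_chrome(user_agent, cutoff=100):
    """Flag UAs claiming a Chrome major version below `cutoff`.

    Chrome 40-90 were current roughly 2015-2021, so a real browser
    sending one of those versions today is vanishingly rare."""
    m = ANCIENT_CHROME.search(user_agent)
    return bool(m) and int(m.group(1)) < cutoff

ua = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36")
print(is_ancient_chrome(ua))                                          # True
print(is_ancient_chrome(ua.replace("Chrome/60.0", "Chrome/124.0")))   # False
```

Of course the operators can trivially rotate to fresh user agents, which is presumably why the traffic stopped before the block went in.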
In both cases the traffic from datacenter networks was limited because I already rate limit a few dozen of the larger ones.
It's a reverse proxy that presents a PoW (proof-of-work) challenge to every new visitor. It shifts the initial cost of accessing your server's resources back onto the client. Assuming your uplink can handle 300k clients requesting a single 70 kB web page, it should solve most of your problems.
I've seen a few attacks where the operators placed malicious code on high-traffic sites (e.g. some government thing, larger newspapers), and then just let browsers load your site as an img. Did you see images, CSS, JS being loaded from these IPs? If they were expecting images, they wouldn't parse the HTML and thus wouldn't load other resources.
It's a pretty effective attack because you get large numbers of individual browsers to contribute. Hosters don't care, so unless the site owners are technical enough, these attacks can stay online for quite a while.
If they work with Referrer Policy, they should be able to mask themselves fairly well - the ones I saw back then did not.
We all agree that AI crawlers are a big issue as they don't respect any established best practices, but we rarely talk about the path forward. Scraping has been around for as long as the internet, and it was mostly fine. There are many very legitimate use cases for browser automation and data extraction (I work in this space).
So what are the potential solutions? We're somehow still stuck with CAPTCHAs, a 25-year-old concept that wastes millions of human hours and billions in infra costs [0].
How can we enable beneficial automation while protecting against abusive AI crawlers?
My pet peeve is that using the term "AI crawler" for this conflates things unnecessarily. Some people are angry at it out of anti-AI bias and an unwillingness to share information, while others are more concerned about the large amount of bandwidth it consumes and the servers it overloads.
Not to mention that it's unknown if these are actually from AI companies, or from people pretending to be AI companies. You can set anything as your user agent.
It's more appropriate to mention the specific issue one has with the crawlers, like "they request things too quickly" or "they're overloading my server". From there, it is easier to come to a solution than from just "I hate AI". For example, one would realize that things like Anubis have existed forever; they are just called DDoS protection, specifically the kind using proof-of-work schemes (e.g. https://github.com/RuiSiang/PoW-Shield).
This also shifts the discussion away from something that adds to the discrimination against scraping in general, and more towards what is actually the issue: overloading servers, or in other words, DDoS.
It won't fully solve the problem, but with the problem relatively identified, you must then ask why people are engaging in this behavior. Answer: money, for the most part. Therefore, follow the money and identify the financial incentives driving this behavior. This leads you pretty quickly to a solution most people would reject out-of-hand: turn off the financial incentive that is driving the enshittification of the web. Which is to say, kill the ad-economy.
Or at least better regulate it while also levying punitive damages that are significant enough to both dissuade bad actors and encourage entities to view data breaches (or the potential therein) and "leakage[0]" as something that should actually be effectively secured against. After all, there are some upsides to the ad-economy that, without it, would present some hard challenges (e.g., how many people are willing to pay for search? what happens to the vibrant sphere of creators of all stripes that are incentivized by the ad-economy? etc.).
Personally, I can't imagine this would actually happen. Pushback from monied interests aside, most people have given up on the idea of data-privacy or personal ownership of their data, if they ever even cared in the first place. So, in the absence of willingness to do something about the incentive for this malign behavior, we're left with few good options.
The best solution I've seen is to hit everyone with a proof of work wall and whitelist the scrapers that are welcome (search engines and such).
Running SHA hash calculations for a second or so once a week is not bad for users, but since scrapers constantly start new sessions, they end up spending most of their time running useless JavaScript, slowing them down significantly.
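The asymmetry is the whole point: the client burns through thousands of hashes while the server verifies with a single one. A minimal sketch of such a scheme (not Anubis's actual implementation; difficulty kept tiny so the demo runs instantly):

```python
import hashlib, itertools

def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute-force a nonce (~2^difficulty hashes on average)."""
    for nonce in itertools.count():
        h = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if int.from_bytes(h, "big") >> (256 - difficulty) == 0:
            return nonce

def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Server side: one hash, essentially free."""
    h = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return int.from_bytes(h, "big") >> (256 - difficulty) == 0

nonce = solve(b"session-abc123", 12)
print(verify(b"session-abc123", nonce, 12))  # True
```

A legitimate user pays this once per session cookie; a scraper farm that discards cookies pays it on every page.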
The most effective alternative to proof of work calculations seems to be remote attestation. The downside is that you're getting captchas if you're one of the 0.1% who disable secure boot and run Linux, but the vast majority of web users will live a captcha free life. This same mechanism could in theory also be used to authenticate welcome scrapers rather than relying on pure IP whitelists.
> I am now of the opinion that every form of web-scraping should be considered abusive behaviour and web servers should block all of them. If you think your web-scraping is acceptable behaviour, you can thank these shady companies and the “AI” hype for moving you to the bad corner.
I imagine that e.g. Youtube would be happy to agree with this. Not that it would turn them against AI generally.
yeah, but you can't, that's the problem. Plenty of service operators would like to block every scraper that doesn't obey their robots.txt, but there's no good way to do that without blocking human traffic too (Anubis et al are okay, but they are half-measures).
On a separate note, I believe open web scraping has been a massive benefit to the internet on net, and almost entirely positive pre-2021. Web scraping & crawling enables search engines, services like Internet Archive, walled-garden-busting (like Invidious, yt-dlp, and Nitter), mashups (Spotube, IFTT, and Plaid would have been impossible to bootstrap without web scraping), and all kinds of interesting data science projects (e.g. scraping COVID-19 stats from local health departments to patch together a picture of viral spread for epidemiologists).
We should have a way to verify the user-agents of valid and useful scrapers such as the Internet Archive. Some kind of cryptographic signature of their user-agents, verifiable by any reverse proxy, seems like a good start.
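As a toy illustration of that idea, here's a symmetric (HMAC) version; a real deployment would use asymmetric signatures instead (e.g. Ed25519 with the crawler operator's public key published somewhere verifiable) so the proxy holds no secret. The key and UA strings below are entirely hypothetical:

```python
import hmac, hashlib

# Hypothetical shared secret between crawler operator and proxy.
# In practice you'd want a public-key scheme so only the crawler
# can sign and anyone can verify.
SHARED_KEY = b"demo-key-known-to-proxy-and-crawler"

def sign_ua(user_agent: str) -> str:
    return hmac.new(SHARED_KEY, user_agent.encode(), hashlib.sha256).hexdigest()

def verify_ua(user_agent: str, signature: str) -> bool:
    # constant-time comparison to avoid timing side channels
    return hmac.compare_digest(sign_ua(user_agent), signature)

ua = "archive.org_bot/1.0"                 # hypothetical UA string
sig = sign_ua(ua)
print(verify_ua(ua, sig))                  # True
print(verify_ua("evil-scraper/1.0", sig))  # False
```

The hard part isn't the crypto, it's the registry: deciding who gets to be a "valid and useful" scraper in the first place.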
> Plenty of service operators would like to block every scraper that doesn't obey their robots.txt, but there's no good way to do that without blocking human traffic too (Anubis et al are okay, but they are half-measures)
Anubis, go-away, etc are great, don't get me wrong -- but what Anubis does is impose a cost on every query. The website operator is hoping that the compute will have a rate-limiting effect on scrapers while minimally impacting the user experience. It's almost like chemotherapy, in that you're poisoning everyone in the hope that the aggressive bad actors will be more severely affected than the less aggressive good actors. Even the Anubis readme calls it a nuclear option. In practice it appears to work pretty well, which is great!
It's a half-measure because:
1. You're slowing down scrapers, not blocking them. They will still scrape your site content in violation of robots.txt.
2. Scrapers with more compute than IP proxies will not be significantly bottlenecked by this.
3. This may lead to an arms race where AI companies respond by beefing up their scraping infrastructure, necessitating more difficult PoW challenges, and so on. The end result of this hypothetical would be a more inconvenient and inefficient internet for everyone, including human users.
To be clear: I think Anubis is a great tool for website operators, and one of the best self-hostable options available today. However, it's a workaround for the core problem that we can't reliably distinguish traffic from badly behaving AI scrapers from legitimate user traffic.
Welcome scrapers (IA, maybe Google and Bing) can publish their IP addresses and get whitelisted. Websites that want to prevent being on the Internet Archive can pretty much just ask for their website to be excluded (even retroactively).
a large chunk of internet archive's snapshots are from archiveteam, where "warriors" bring their own ips (and they crawl respectfully!). save page now is important too, but you don't realise what is useful until you lose it.
They have the right to try to convince me to let them scrape me. Most of the time they're thinly veiled data traders. I haven't seen any new company try to scrape my stuff since maybe Kagi.
Kagi is welcome to scrape from their IP addresses. Other bots that behave are fine too (Huawei and various other Chinese bots don't and I've had to put an IP block on those).
It's interesting but so far there is no definitive proof it's happening.
People are jumping to conclusions a bit fast here. Yes, technically it's possible, but this kind of behavior would be relatively easy to spot, because the app would have to make direct connections to the websites it wants to scrape.
Your calculator app, for instance, connecting to CNN.com...
iOS has an App Privacy Report where one can check which connections are made by each app, how often, when the last one occurred, etc.
Android by Google doesn't have such a useful feature, of course, but you can run a third-party firewall like PCAPdroid, which I highly recommend.
macOS (Little Snitch).
Windows (Fort Firewall).
Not everyone runs these apps, obviously, only the most nerdy like myself, but we're also the kind of people who would report an app using our devices to build what is, in fact, a zombie or bot network.
I'm not saying it's necessarily false but imo it remains a theory until proven otherwise.
This is a hilariously optimistic, naive, disconnected-from-reality take. What sort of "proof" would be sufficient for you? TFA of course includes data from the author's own server logs, but it also references real SDKs and businesses selling this exact product. You can view the pricing page yourself, right next to stats on how many IPs are available for you to exploit. What else do you need to see?
> iOS have app privacy report where one can check what connections are made by app, how often, last one, etc.
Privacy reports do not include that information. They include broad areas of information the app claims to gather. There is zero connection between those claimed areas and what the app actually does unless app review notices something that doesn't match up. But none of that information is updated dynamically, and it has never actually included the domains the app connects to. You may be confusing it with the old domain declarations for less secure HTTP connections. Once the connections met the system standards you no longer needed to declare it.
I wasn't aware of this feature. But apparently it does include that information. I just enabled it and can see the domains that apps connect to. https://support.apple.com/en-us/102188
There is already a lot of proof. Just ask for a sales pitch from companies selling these data and they will gladly explain everything to you.
Go to a data conference like Neudata and you will see. You can have scraped data from user devices, real-time locations, credit card, Google analytics, etc.
Given this is a thing even in browser plugins, and that so very few people analyse their firewalls, I'd not discount it at all. Much of the world's users have no clue, and app stores are notoriously bad at reacting even to publicised malware, e.g. 'free' VPNs in the iOS App Store.
All it takes is one person to find out and raise the alarm. The average user doesn't read the source code behind openssl or whatever either, that doesn't mean there's no gains in open sourcing it.
The average user is also not reading these raised “alarms”. And if an app has a bad name, another one will show up with a different name on the same day.
You're on a tech forum; you must have seen one of the many posts about apps, on Android or iPhone, that act like spyware.
They happen from time to time. The last one was not more than two weeks ago, when it was shown that many apps were able to read the list of all other apps installed on an Android device, and that Google refused to fix it.
Do you really believe that an app used to make your device part of a bot network wouldn't be posted over here ?
"You're on a tech forum", that's exactly the point. The "average user" is not on a tech forum though, the average user opens the app store of their platform, types "calculator" and installs the first one that's free.
> So there is a (IMHO) shady market out there that gives app developers on iOS, Android, MacOS and Windows money for including a library into their apps that sells users network bandwidth
AKA "why do Cloudflare and Google make me fill out these CAPTCHAs all day"
I don't know why Play Protect/MS Defender/whatever Apple has for antivirus don't classify apps that embed such malware as such. It's ridiculous that this is allowed to go on when detection is so easy. I don't know a more obvious example of a trojan than an SDK library making a user's device part of a botnet.
The implication is that the users that are being constantly presented with CAPTCHAs are experiencing that because they are unwittingly proxying scrapers through their devices via malicious apps they've installed.
or just that they don't run Windows/macOS with Chrome like everyone else, and that's "suspicious".
I get Cloudflare CAPTCHAs all the time with Firefox on Linux... (and I'm pretty sure there's no such app on my home network!)
When a random device on your network gets infected with crap like this, your network becomes a bot egress point, and anti bot networks respond appropriately. Cloudflare, Akamai, even Google will start showing CAPTCHAs for every website they protect when your network starts hitting random servers with scrapers or DDoS attacks.
This is even worse with CG-NAT if you don't have IPv6 to solve the CG-NAT problem.
I don't think the data they collect is used to train anything these days. Cloudflare is using AI generated images for CAPTCHAs and Google's actual CAPTCHAs are easier for bots than humans at this point (it's the passive monitoring that makes it still work a little bit).
I don't know if I should be surprised about what's described in this article, given the current state of the world. Certainly I didn't know about it before, and I agree with the article's conclusion.
Personally, I think the "network sharing" software bundled with apps should fall into the category of potentially unwanted applications along with adware and spyware. All of the above "tag along" with something the user DID want to install, and quietly misuse the user's resources. Proxies like this definitely have an impact for metered/slow connections - I'm tempted to start Wireshark'ing my devices now to look for suspicious activity.
There should be a public repository of apps known to have these shady behaviours. Having done some light web scraping for archival/automation before, it's a pity that it'll become collateral damage in the anti-AI-botfarm fight.
I agree, this should be called spyware, and malware. There are many other kinds of software that also should be, but netcat and ncat (probably) aren't malware.
I agree, but the harm done to the users is only one part of the total harm. I think it's quite plausible that many users wouldn't mind some small amount of their bandwidth being used, if it meant being able to use a handy browser extension that they would otherwise have to pay actual dollars for -- but the harm done to those running the servers remains.
> Has anyone tried to compile a list of software that uses these libraries? It would be great to know what apps to avoid
I wouldn't mind reading a comprehensive report on SOTA with regard to bot-blocking.
Sure, there's Anubis (although someone elsethread called it a half-measure, and I'd like to know why), there's captcha's, there's relying on a monopoly (cloudflare, etc) who probably also wants to run their own bots at some point, but what else is there?
In the case of Android, εxodus has one[1], though I couldn't find the malware library listed in TFA. Aurora Store[2], a FOSS Google Play Store client, also integrates it.
That seems to be looking at tracking and data collection libraries, though, for things like advertising and crash reporting. I don't see any mention of the kind of 'network sharing' libraries that this article is about. Have I missed it?
No but here's the thing. Being in the industry for many years I know they are required to mention it in the TOS when using the SDKs. A crawler pulling app TOSs and parsing them could be a thing. List or not, it won't be too useful outside this tech community.
The broken thing about the web is that in order for data to remain readable, a unique sysadmin somewhere has to keep a server running in the face of an increasingly hostile environment.
If instead we had a content addressed model, we could drop the uniqueness constraint. Then these AI scrapers could be gossiping the data to one another (and incidentally serving it to the rest of us) without placing any burden on the original source.
Having other parties interested in your data should make your life easier (because other parties will host it for you), not harder (because now you need to work extra hard to host it for them).
Can you point me at what you mean? I'm not immediately finding something that indicates that it is not fit for this use case. The fact that bad actors use it to resist those who want to shut them down is, if anything, an endorsement of its durability. There's a bit of overlap between resisting the AI scrapers and resisting the FBI. You can either have a single point of control and a single point of failure, or you can have neither. If you're after something that's both reliable and reliably censorable--I don't think that's in the cards.
That's not to say that it is a ready replacement for the web as we know it. If you have hash-linked everything then you wind up with problems trying to link things together, for instance. Once two pages exist, you can't after-the-fact create a link between them because if you update them to contain that link then their hashes change so now you have to propagate the new hash to people. This makes it difficult to do things like have a comments section at the bottom of a blog post. So you've got to handle metadata like that in some kind of extra layer--a layer which isn't hash linked and which might be susceptible to all the same problems that our current web is--and then the browser can build the page from immutable pieces, but the assembly itself ends up being dynamic (and likely sensitive to the users preference, e.g. dark mode as a browser thing not a page thing).
But I still think you could move maybe 95% of the data into an immutable hash-linked world (think of these as nodes in a graph), the remaining 5% just being tuples of hashes and public keys indicating which pages are trusted by which users, which ought to be linked to which others, which are known to be the inputs and outputs of various functions, and, you know... structure stuff (these are our graph's edges).
The edges, being smaller, might be subject to different constraints than the web as we know it. I wouldn't propose that we go all the way to a blockchain where every device caches every edge, but it might be feasible for my devices to store all of the edges for the 5% of the web I care about, and your devices to store the edges for the 5% that you care about... the nodes only being summoned when we actually want to view them. The edges can be updated when our devices contact other devices (based on trust, like you know that device's owner personally) and ask "hey, what's new?"
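A toy sketch of that node/edge split: content-addressed immutable nodes, plus a small mutable edge list kept outside the hash-linked world. A real system would sign the edges with the endorsing user's key; everything here is purely illustrative:

```python
import hashlib

store = {}   # hash -> immutable node content

def put(content: bytes) -> str:
    """Nodes: content-addressed and immutable, so anyone can mirror them."""
    h = hashlib.sha256(content).hexdigest()
    store[h] = content
    return h

# Edges: small mutable tuples linking hashes, maintained separately.
# In a real design each edge would carry a public key and signature.
edges = []

post = put(b"my blog post")
comment = put(b"nice post!")
edges.append({"from": post, "to": comment, "rel": "comment"})

print(edges[0]["rel"])                  # comment
print(put(b"my blog post") == post)     # True: same content, same address
```

Note the comments-section problem from above disappears: adding the edge never changes the post's hash.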
I've sort of been freestyling on this idea in isolation, probably there's already some projects that scratch this itch. A while back I made a note to check out https://ceramic.network/ in this capacity, but I haven't gotten down to trying it out yet.
Assuming the right incentives can be found to prevent widespread leeching, a distributed content-addressed model indeed solves this problem, but introduces the problem of how to control your own content over time. How do you get rid of a piece of content? How do you modify the content at a given URL?
I know, as far as possible it's a good idea to have content-immutable URLs. But at some point, I need to make www.myexamplebusiness.com show new content. How would that work?
Except no one wants content addressed data - because if you knew what it was you wanted, then you would already have stored it. The web as we know it is an index - it's a way to discover that data is available and specifically we usually want the latest data that's available.
AI scrapers aren't trying to find things they already know exist, they're trying to discover what they didn't know existed.
Yes, for the reasons you describe, you can't be both a useful web-like protocol and also 100% immutable/hash-linked.
But there's a lot of middle ground to explore here. Loading a modern web page involves making dozens of requests to a variety of different servers, evaluating some JavaScript, and then doing it again a few times, potentially moving several MB of data. The part people want, the thing you don't already know exists, is hidden behind that rather heavy door. It doesn't have to be that way.
If you already know about one thing (by its cryptographic hash, say) and you want to find out which other hashes it's now associated with--associations that might not have existed yesterday--that's much easier than we've made it. It can be done:
- by moving kB, not MB; we're just talking about a tuple of hashes here, maybe a public key and a signature
- without placing additional burden on whoever authored the first thing, they don't even have to be the ones who published the pair of hashes that your scraper is interested in
Once you have the second hash, you can then reenter immutable-space to get whatever it references. I'm not sure if there's already a protocol for such things, but if not then we can surely make one that's more efficient and durable than what we're doing now.
> because if you knew what it was you wanted, then you would already have stored it.
"Content-addressable" has a broader meaning than what you seem to be thinking of -- roughly speaking, it applies if any function of the data is used as the "address". E.g., git commits are content-addressable by their SHA1 hashes.
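For example, git's blob addressing is just a SHA-1 over a short `blob <length>\0` header plus the content, which you can reproduce in a couple of lines:

```python
import hashlib

def git_blob_hash(data: bytes) -> str:
    """Compute the same object ID that `git hash-object` would."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()

print(git_blob_hash(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

The address is a deterministic function of the data, so you don't need to have the data to know its address; you just need someone to have told you the hash.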
We need a list of apps that include these libraries and any malware scanner - including Windows Defender, Play Protect and whatever Apple calls theirs - need to put infected applications into quarantine immediately.
Just because it's not directly causing damage to the device the malware is running on, that doesn't mean it's not malware.
My iPhone occasionally displays an interrupt screen to remind me that my weather app has been accessing my location in the background and to confirm continued access.
It should also do something similar for apps making chatty background requests to domains not specified at app review time. The legitimate use cases for that behaviour are few.
On the one hand, yes this could work for many cases. On the other hand, good bye p2p. Not every app is a passive client-server request-response. One needs to be really careful with designing permission systems. Apple has already killed many markets before they had a chance to even exist, such as companion apps for watches and other peripherals.
P2P was practically dead on iPhone even back in 2010. The whole "don't burn the user's battery" thing precludes mobile phones doing anything with P2P other than leeching off of it. The only exceptions are things like AirDrop; i.e. locally peer-to-peer things that are only active when in use and don't try to form an overlay or mesh network that would require the phone to become a router.
And, AFAIK, you already need special permission for anything other than HTTPS to specific domains on the public Internet. That's why apps ping you about permissions to access "local devices".
You mean, good bye using my bandwidth without my permission? That's good. And if I install a bittorrent client on my phone, I'll know to give it permission.
> such as companion apps for watches and other peripherals
That's just apple abusing their market position in phones to push their watch. What does it have to do with p2p?
It’s an example of how, when you design sandboxes/firewalls, it’s very easy to assume all apps are one big homogeneous blob doing REST calls and everything else is malicious or suspicious. You often need strange permissions to do interesting things. Apple gives themselves these perms all the time.
Maybe there could be a special entitlement that Apple's reviewers would only grant to applications that have a legitimate reason to require such connections.
Then only applications granted that permission would be able to make requests to arbitrary domains / IP addresses.
That's how it works with other permissions most applications should not have access to, like accessing user locations. (And private entitlements third party applications can't have are one way Apple makes sure nobody can compete with their apps, but that's a separate issue.)
Android is so fucking anti-privacy that they still don't have an INTERNET access revoke toggle. The one they have currently is broken and can easily be bypassed with google play services (another highly privileged process running for no reason other than to sell your soul to google). GrapheneOS has this toggle luckily. Whenever you install an app, you can revoke the INTERNET access at the install screen and there is no way that app can bypass it
Do you suggest outright forbidding TCP connections for user software? Because you can compile OpenSSL or any other TLS library and make a TCP connection to port 443 which will be opaque to the operating system. They can do wild things like kernel-level DPI on outgoing connections to find out the host, but that quickly turns into a ridiculous competition.
I think capability based security with proxy capabilities is the way to do it, and this would make it possible for the proxy capability to intercept the request and ask permission, or to do whatever else you want it to do (e.g. redirections, log any accesses, automatically allow or disallow based on a file, use or ignore the DNS cache, etc).
The system may have some such functions built in, and asking permission might be a reasonable thing to include by default.
Try actually using a system like this. OpenSnitch and Little Snitch do it for Linux and macOS respectively. Fedora has a pretty good interface for SELinux denials.
I've used all of them, and it's a deluge: it is too much information to reasonably react to.
Your broad choice is either deny or accept, but there's no sane way to reliably know which you should do.
This is not and cannot be an individual problem: the easy part is building high fidelity access control, the hard part is making useful policy for it.
I suggested proxy capabilities, that it can easily be reprogrammed and reconfigured; if you want to disable this feature then you can do that too. It is not only allow or deny; other things are also possible (e.g. simulate various error conditions, artificially slow down the connection, go through a proxy server, etc). (This proxy capability system would be useful for stuff other than network connections too.)
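A minimal sketch of such a proxy capability: the app never holds a raw socket, only this wrapper, which enforces a default-deny policy and logs every attempt. All names here are hypothetical, and a real one would also support the prompting, throttling, and error-simulation behaviors described above:

```python
class ConnectDenied(Exception):
    pass

class NetCapability:
    """A capability the app is handed instead of raw socket access."""

    def __init__(self, policy):
        self.policy = policy   # host -> "allow" | "deny"
        self.log = []          # every attempt is recorded, allowed or not

    def connect(self, host, port):
        decision = self.policy.get(host, "deny")   # default-deny
        self.log.append((host, port, decision))
        if decision != "allow":
            raise ConnectDenied(host)
        return f"socket-to-{host}:{port}"          # stand-in for a real socket

cap = NetCapability({"api.example.com": "allow"})
print(cap.connect("api.example.com", 443))
try:
    cap.connect("cnn.com", 443)                    # the calculator-app case
except ConnectDenied:
    print("denied")
```

Because the capability is an ordinary object, swapping in a version that prompts the user, slows the connection, or routes through a proxy requires no changes to the app holding it.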
> it is too much information to reasonably react to.
Even if it asks, that does not necessarily mean it has to ask every time, if the user lets it keep the answer (either for the current session or until the user deliberately deletes this data). Also, if it asks too much because it tries to access too many remote servers, then it might be spyware, malware, etc. anyway, and is worth investigating in case that is what it is.
> the hard part is making useful policy for it.
What the default settings should be is a significant issue. However, changing the policies in individual cases for different uses, is also something that a user might do, since the default settings will not always be suitable.
If whoever manages the package repository, app store, etc is able to check for malware, then this is a good thing to do (although it should not prohibit the user from installing their own software and modifying the existing software), but security on the computer is also helpful, and neither of these is the substitute for the other; they are together.
The vast majority of revenue in the mobile app ecosystem is ads, which by design are pulled from third parties (and are part of the broader problem discussed in this post).
I am waiting for Apple to enable /etc/hosts or something similar on iOS devices.
Residential IP proxies have some weaknesses. One is that they often change IP addresses during a single web session. Second, if the IPs come from the same proxy provider, they are often concentrated within a single ASN, making them easier to detect.
We are working on an open‑source fraud prevention platform [1], and detecting fake users coming from residential proxies is one of its use cases.
The first blog post in this series[1], linked at the top of TFA, offers an analysis of the potential of using ASNs to detect such traffic. Their conclusion was that ASNs are not helpful for this use case, showing that across the 50k IPs they've blocked, there are fewer than 4 IP addresses per ASN on average.
What was done manually in the first blog is exactly what tirreno helps to achieve by analyzing traffic, here is live example [1]. Blocking an entire ASN should not be considered a strategy when real users are involved.
Regarding the first post, it's rare to see both datacenter network IPs and mobile proxy IP addresses used simultaneously. This suggests the involvement of more than one botnet.
The main idea is to avoid using IP addresses as the sole risk factor. Instead, they should be considered as just one part of the broader picture of user behavior.
>One is that they often change IP addresses during a single web session. Second, if the IPs come from the same proxy provider, they are often concentrated within a single ASN, making them easier to detect.
Both are pretty easy to mitigate with a geoip database and some smart routing. One "residential proxy" vendor even has session tokens so your source IP doesn't randomly jump between each request.
At least here in the US most residential ISPs have long leases and change infrequently, weeks or months.
Trying to understand your product, where is it intended to sit in a network? Is it a standalone tool that you use to identify these IPs and feed into something else for blockage or is it intended to be integrated into your existing site or is it supposed to proxy all your web traffic? The reason I ask is it has fairly heavyweight install requirements and Apache and PHP are kind of old school at this point, especially for new projects and companies. It's not what they would commonly be using for their site.
Indeed, if it's a real user from a residential IP address, in most cases it will be the same network. However, if it's a proxy from residential IPs, there could be 10 requests from one network, the 11th request from a second network, and the 12th request back from the same network. This is a red flag.
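That alternating-network pattern within one session can be checked mechanically. Here is a minimal sketch; the `/24` grouping as a stand-in for "same network", the function names, and the two-switch threshold are my own assumptions, not tirreno's actual logic:

```python
from ipaddress import ip_network

def network_of(ip: str) -> str:
    # Collapse an IPv4 address to its /24 as a rough proxy for "same network".
    return str(ip_network(f"{ip}/24", strict=False))

def count_network_switches(session_ips: list[str]) -> int:
    # Count how often consecutive requests in one session jump between networks.
    nets = [network_of(ip) for ip in session_ips]
    return sum(1 for a, b in zip(nets, nets[1:]) if a != b)

def looks_like_residential_proxy(session_ips: list[str], threshold: int = 2) -> bool:
    # Two or more mid-session network jumps is the hypothetical red flag here.
    return count_network_switches(session_ips) >= threshold
```

A session that goes network A, network B, back to network A produces two switches and trips the flag, while a user whose address merely changes within one provider's range does not.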
Thank you for your question. tirreno is a standalone app that needs to receive API events from your main web application. It can work perfectly well with 512GB of RAM for Postgres, or even less; however, in most cases we're dealing with millions of events, and it's those that demand resources.
It's much easier to write a stable application without dependencies based on mature technologies. tirreno is fairly 'boring software'.
Effective fraud prevention relies on both the full user context and the behavioral patterns of known online fraudsters. The key idea is that an IP address cannot be used as a red flag on its own without considering the broader context of the account.
However, if we know that the fraudsters we're dealing with are using mobile network proxies and are randomly switching between two mobile operators, that is certainly a strong risk signal.
An awful lot of free Wi-Fi networks you find in malls are operated by different providers. Walking from one side of a mall to the other while my phone connects to all the Wi-Fi networks I’ve used previously would have you flag me as a fraudster if I understand your approach correctly.
We are discussing user behavior in the context of a web system. The fact that your device has connected to different Wi-Fi networks doesn't necessarily mean that all of them were used to access the web application.
Finally, as mentioned earlier, there is no silver bullet that works for every type of online fraudster. For example, in some applications, a Tor connection might be considered a red flag. However, if we are talking about HN visitors, many of them use Tor on a daily basis.
When the enshittification initially hit the fan, I had little flashbacks of Phil Zimmermann talking about the Web of Trust, and amused myself thinking maybe we need humans proving they're humans to other humans, so we know we aren't arguing with LLMs on the internet or letting them scan our websites.
But it just doesn't scale to internet size, so I'm fucked if I know how we should fix it. We all have that cousin or dude from our high school class who would do anything for a bit of money, introducing his 'friend' Paul who is in fact a bot whose owner paid for the lie. And not like enough money to make it a moral dilemma, just drinking money or enough for a new video game. So once you get past about 10,000 people you're pretty much back where we are right now.
I think it should be possible to build something that generalises the idea of Web of Trust so that it's more flexible, and less prone to catastrophic breakdown past some scaling limit.
Binary "X trusts Y" statements, plus transitive closure, can lead to long trust paths that we probably shouldn't actually trust the endpoints of. Could we not instead assign probabilities like "X trusts Y 95%", multiply probabilities along paths starting from our own identity, and take the max at each vertex? We could then decide whether to finally trust some Z if its percentage is more than some threshold T%. (Other ways of combining in-edges may be more suitable than max(); it's just a simple and conservative choice.)
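That max-of-products propagation is essentially Dijkstra's algorithm run on probabilities instead of distances. A rough sketch of the idea (the graph shape and function names are illustrative, not from any existing system):

```python
import heapq

def trust_scores(edges: dict[str, dict[str, float]], me: str) -> dict[str, float]:
    """Best-path trust: multiply edge probabilities along a path, then take the
    max over paths. A Dijkstra variant, using a max-heap via negated scores."""
    best = {me: 1.0}  # full trust in ourselves
    heap = [(-1.0, me)]
    while heap:
        neg, x = heapq.heappop(heap)
        score = -neg
        if score < best.get(x, 0.0):
            continue  # stale heap entry, already found a better path to x
        for y, p in edges.get(x, {}).items():
            cand = score * p
            if cand > best.get(y, 0.0):
                best[y] = cand
                heapq.heappush(heap, (-cand, y))
    return best

def trusted(edges, me, z, threshold=0.5):
    # Final decision: trust Z if its propagated score clears the threshold T.
    return trust_scores(edges, me).get(z, 0.0) >= threshold
```

With edges me→A at 95%, A→C at 90%, and me→B at 60%, B→C at 99%, the path through A wins (0.855 vs. 0.594), so C is trusted at T=80% but not at T=90%.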
Perhaps a variant of backprop could be used to automatically update either (a) all or (b) just our own weights, given new information ("V has been discovered to be fraudulent").
True. Perhaps a collective vote past 2 degrees of separation, where multiple parties need to vouch for the same person before you believe they aren't a bot. Then you're using the exponential number of people to provide diminishing weight instead of an increasing likelihood of malfeasance.
> So there is a (IMHO) shady market out there that gives app developers on iOS, Android, MacOS and Windows money for including a library into their apps that sells users network bandwidth.
This is yet another reason why we need to be wary of popular apps, add-ons, extensions, and so forth changing hands, by legitimate sale or more nefarious methods. Initially innocent utilities can be quickly coopted into being parts of this sort of scheme.
Strange that HolaVPN, i.e. Bright Data, is not mentioned. They've been using users' hosts for those purposes for years, and also selling proxies en masse. Fun fact: they don't have any servers for the VPN. All the VPN traffic is routed through... other users!
> I am now of the opinion that every form of web-scraping should be considered abusive behaviour and web servers should block all of them. If you think your web-scraping is acceptable behaviour, you can thank these shady companies and the “AI” hype for moving you to the bad corner.
Why jump to that conclusion?
If a scraper clearly advertises itself, follows robots.txt, and has reasonable backoff, it's not abusive. You can easily block such a scraper, but then you're encouraging stealth scrapers because they're still getting your data.
I'd block the scrapers that try to hide and waste compute, but deliberately allow those that don't. And maybe provide a sitemap and API (which besides being easier to scrape, can be faster to handle).
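A polite scraper along those lines is easy to sketch with Python's standard library; the robots.txt content, user agent, and retry count below are made up for illustration:

```python
import time
import urllib.robotparser

# Hypothetical robots.txt a well-behaved scraper would honor.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def make_checker(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

def fetch_politely(rp, agent, path, fetch, max_tries=3):
    """Honor robots.txt, identify ourselves, and back off exponentially on
    failure. `fetch` is whatever function actually performs the request."""
    if not rp.can_fetch(agent, path):
        return None  # respect the Disallow rules instead of going stealth
    delay = rp.crawl_delay(agent) or 1
    for attempt in range(max_tries):
        try:
            return fetch(path)
        except IOError:
            time.sleep(delay * (2 ** attempt))  # exponential backoff
    return None
```

The point of the sketch: a scraper that is this easy to identify and rate-limit is the kind you may want to allow, because blocking it only rewards the stealthy ones.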
This isn't obvious: 99% of apps make multiple calls to multiple services, and these SDKs are embedded in the app. How can you tell what's legit outbound/inbound traffic? A fingerprint search for the worst culprits might help catch some, but it would likely be a game of cat and mouse.
Their marketing tells you it's for protection. What they omit is that it's for their revenue protection: observe that as long as you do not threaten their revenue models, or the revenue models of their partners, you are allowed through. It has never been about the users or developers.
Nobody said that; it's your choice to take whatever action fits your scenario. I have clients where VPNs are blocked, yes; it depends on the industry, fraud rate, chargeback rates, etc.
Sandboxing means you can limit network access. For example, on Android you can disallow wi-fi and cellular access (not sure about bluetooth) on a per-app basis.
Network access settings should really be more granular for apps that have a legitimate need.
App store disclosure labels should also add network usage disclosure.
That's also my reaction when the call is for Google, Apple, or Microsoft to fix this: since DDoS is illegal, shouldn't the first reaction instead be to contact law enforcement?
If you treat platforms like they are all-powerful, then that's what they are likely to become...
Let me get this straight: we want computers knowing everything, to solve current and future problems, but we don't want to give them access to our knowledge?
I don't want computers to know everything. Most knowledge on the internet is false and entirely useless.
The companies selling us computers that supposedly know everything should pay for their database, or they should give away the knowledge they gained for free. Right now, the scraping and copying is free and the knowledge is behind a subscription to access a proprietary model that forms the basis of their business.
Humanity doesn't benefit, the snake oil salesmen do.
> Let me get this straight: we want computers knowing everything, to solve current and future problems, but we don't want to give them access to our knowledge?
Who said that?
There's basically two extremes:
1. We want access to all of human knowledge, now and forever, in order to monetise it and make more money for us, and us alone.
and
2. We don't want our freely available knowledge sold back to us, with no credits to the original authors.
I feel like this could be automated. Spin up a virtual device on a monitored network. Install one app, click on some stuff for awhile, uninstall and move onto the next. If the app reaches out to a lot of random sites then flag it
Google could do this. I'm sure Apple could as well. Third parties could for a small set of apps
This is being done by a couple of SDKs, it'd be much easier to just find and flag those SDK files. Finding apps becomes a matter of a single pass scan over the application contents rather than attempting to bypass the VM detection methods malware is packed full of.
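SDK fingerprinting aside, the "flag apps that reach out to lots of random sites" heuristic from the parent comment could look something like this; the log format and the 20-domain threshold are arbitrary assumptions for the sketch:

```python
from collections import defaultdict

def flag_chatty_apps(dns_log: list[tuple[str, str]], max_domains: int = 20) -> set[str]:
    """dns_log is (app_id, domain) pairs captured on a monitored test network.
    Apps resolving an unusually wide spread of domains get flagged for review."""
    domains: dict[str, set[str]] = defaultdict(set)
    for app, domain in dns_log:
        domains[app].add(domain)
    return {app for app, ds in domains.items() if len(ds) > max_domains}
```

A weather app talking to one API endpoint passes; a flashlight app resolving fifty unrelated hosts gets flagged. The hard part, as noted above, is that malware-style VM detection and legitimately chatty apps both erode the signal.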
I think tech can still be beautiful in a less grandiose and "omniparadisical" way than people used to dream of. "A wide open internet, free as in speech this, free as in beer that, open source wonders, open gardens..." Well, there are a lot of incentives that fight that, and game theory wins. Maybe we download software dependencies from our friends, the ones we actually trust. Maybe we write more code ourselves--more homesteading families that raise their own chickens, jar their own pickled carrots, and code their own networking utilities. Maybe we operate on servers we own, or our friends own, and we don't get blindsided by news that the platforms are selling our data and scraping it for training.
Maybe it's less convenient and more expensive and onerous. Do good things require hard work? Or did we expect everyone to ignore incentives forever while the trillion-dollar hyperscalers fought for an open and noble internet and then wrapped it in affordable consumer products to our delight?
It reminds me of the post here a few weeks ago about how Netflix used to be good and "maybe I want a faster horse" - we want things to be built for us, easily, cheaply, conveniently, by companies, and we want those companies not to succumb to enshittification - but somehow when the companies just follow the game theory and turn everything into a TikToky neural-networks-maximizing-engagement-infinite-scroll-experience, it's their fault, and not ours for going with the easy path while hoping the corporations would not take the easy path.
It's funny, I've never heard of or thought about the possibility of this happening, but in hindsight it seems almost too obvious to not be a thing.
Many years ago cybercriminals used to hack computers to use them as residential proxies, now they purchase them online as a service.
In most cases they are used for conducting real financial crimes, but the police investigators are also aware that there is a very low chance that sophisticated fraud is committed directly from a residential IP address.
It's a fair point, but very dynamic to sort out. This needs a full research team to figure out. Or, you know... all of us combined! It is definitely a problem.
TINFOIL: I've sometimes wondered if Azure or AWS used bots to push site traffic hits to generate money... they know you are hosted with them. They have your info. Send out bots to drive micro-accumulation. Slow boil...
I think that's mostly that they don't care about having malicious bots on their networks as long as they pay.
GCE is rare in my experience. Most bots I see are on AWS. The DDOS-adjacent hyper aggressive bots that try random URLs and scan for exploits tend to be on Azure or use VPNs.
AWS is bad when you report malicious traffic. Azure has been completely unresponsive and didn't react, even for C&C servers.
In the sense that people are voluntarily installing and running this malware on their computers, rather than being tricked into running it? Is that the only difference?
Which is ironic considering that I strongly disagree with one of the primary walled garden justifications, used particularly in the case of Apple, which amounts to "the end user is too stupid to decide on his own". Unfortunately, even if I disagree with it as a guiding principle sometimes that statement proves true.
It’s not about stupidity, but practicality. People can’t give informed consent for 100 ToS for different companies, and keep those up to date. That’s why there are laws.
No doubt in a dense wall of text that the user must accept to use the application, or worse is deemed to have accepted by using the application at all.
> So if you as an app developer include such a 3rd party SDK in your app to make some money — you are part of the problem and I think you should be held responsible for delivering malware to your users, making them botnet members.
I suspect that this goes for many different SDKs. Personally, I am really, really sick of hearing "That's a solved problem!", whenever I mention that I tend to "roll my own," as opposed to including some dependency, recommended by some jargon-addled dependency addict.
Bad actors love the dependency addiction of modern developers, and have learned to set some pretty clever traps.
I’m constantly amazed at how careless developers are with pulling 3rd party libraries into their code. Have you audited this code? Do you know everything it does? Do you know what security vulnerabilities exist in it? On what basis do you trust it to do what it says it is doing and nothing else?
But nobody seems to do this diligence. It’s just “we are in a rush. we need X. dependency does X. let’s use X.” and that’s it!
I think developers are paid to competently deliver software to their employer, and part of that competence is properly vetting the code you are delivering. If I wrote code that ended up having serious bugs like crashing, I’d expect to have at least a minimum consequence, like root causing it and/or writing a postmortem to help avoid it in the future. Same as I’d expect if I pulled in a bad dependency.
Your expectations do not match the employment market as I have ever experienced it.
Have you ever worked anywhere that said "go ahead and slow down on delivering product features that drive business value so you can audit the code of your dependencies, that's fine, we'll wait"?
Yea, and that’s the problem. If such absolute rock bottom minimal expectations (know what the code does) are seen as too slow and onerous, the industry is cooked!
Due diligence is a sliding scale. Work at a webdev agency is "get it done as fast as possible for this MVP we need". Work at NASA or a biomedical device company? Every line of code is triple-checked. It's entirely dependent on the cost/benefit analysis.
If a car manufacturer sources a part from a third party, and that part has a serious safety problem, who will the customer blame? And who will be responsible for the recall and the repairs?
Malware, botnets… it is very similar. And people, including developers, are, in 80 percent of cases, eager to make money, because… is greed good? No, it isn't. It is a plague.
You're a developer who devoted time to develop a piece of software. You discover that you are not generating any income from it: few people can even find it in the sea of similar apps, few of those are willing to pay for it, and those who are willing to pay for it are not willing to pay much. To make matters worse, you're going to lose a cut of what is paid to the middlemen who facilitate the transaction.
Is that greed?
I can find many reasons to be critical of that developer, things like creating a product for a market segment that is saturated, and likely doing so because it is low-hanging fruit (both conceptually and in terms of complexity). I can be critical of their moral judgment for how they decided to generate income from their poor business judgment. But I don't think it's right to automatically label them as greedy. They may be greedy, but they may also be trying to generate income from their work.
Umm, yes? You are not owed anything in this life, certainly not income for your choice to spend your time on building a software product no one asked for. Not making money on it is a perfectly fine outcome. If you desperately need guaranteed money, don't build an app expecting it to sell; get a job.
> If you desperately need guaranteed money, don't build an app expecting it to sell; get a job.
Technically true, but a bit of perspective might help. The consumer market is distorted by free (as in beer) apps that do a bunch of shitty things that should in many cases be illegal, or require much more informed consent than today, like tracking everything they can. Then you have VC-funded ”free” as well, where the end game is to raise prices slowly to boil the frog. Then you have loss leaders from megacorps, and a general anti-competitive business culture.
Plus, this is not just in the Wild West shady places, like the old piratebay ads. The top result for ”timer” on the App Store (for me) is indeed a timer app, but with IAP of $800/y subscription… facilitated by Apple Inc, who gets 15-30% of the bounty.
Look, the point is it’s almost impossible to break into consumer markets because everyone else is a predator. It’s a race to the bottom, ripping off clueless customers. Everyone would benefit from a fairer market. Especially honest developers.
No I think it’s designed to catch misclicks and children operating the phone and such, sold as $17/week possibly masquerading as one-time payment. They pay for App Store ads for it too.
We could have people ask for software in a more convenient way.
Not making money could be an indication the software isn't useful, but what if it is? What can the collective do in that zone?
I imagine one could ask and pay for unwritten software then get a refund if it doesn't materialize before your deadline.
Why is discovery (of any creation) willingly handed over to a handful of megacorps?? They seem to think I want to watch and read about Trump and Elon every day.
Promoting something because it is good is a great example of a good thing that shouldn't pay.
There was an earlier discussion on HN about whether advertising should be more heavily regulated (or even banned outright). I'm starting to wonder whether most of the problems on the Web are negative side effects of the incentives created by ads (including all botnets, except those that enable ransomware and espionage). Even the current worldwide dopamine addiction is driven by apps and content created for engagement, whose entire purpose is ad revenue.
This is especially true for script kiddies, which is why I am so thankful for https://e18e.dev/
AI is making this worse than ever though, I am constantly having to tell devs that their work is failing to meet requirements, because AI is just as bad as a junior dev when it comes to reaching for a dependency. It’s like we need training wheels for the prompts juniors are allowed to write.
I agree that there are things with too many dependencies, and I try to avoid that. I think it is a good idea to minimize how many dependencies are needed (even indirect dependencies; however, in some cases a dependency is not a specific implementation, and in that case indirect dependencies are less of a problem, although having a good implementation with fewer indirect dependencies is still beneficial). I may write my own, in many cases. However, another reason for writing my own is other kinds of problems in the existing programs. Not all problems are malicious; many are just that they do not do what I need, or do too much more than what I need, or both. (However, most of my stuff is C rather than JavaScript; the problem seems to be more severe with JavaScript, but I do not use it much.)
That may be true but I think you're missing the point here.
The "network sharing" behavior in these SDKs is the sole purpose of the SDK. It isn't being included as a surprise along with some other desirable behavior. What needs to stop is developers including these SDKs as a secondary revenue source in free or ad-supported apps.
My personal beef is that most of the time it acts like hidden global dependencies, and the configuration of those dependencies, along with their lifetimes, becomes harder to understand by not being traceable in the source code.
Dependency injection is just passing your dependencies in as constructor arguments rather than as hidden dependencies that the class itself creates and manages.
It's equivalent to partial application.
An uninstantiated class that follows the dependency injection pattern is equivalent to a family of functions with N+Mk arguments, where Mk is the number of parameters in method k.
Upon instantiation by passing constructor arguments, you've created a family of functions, each with a distinct set of Mk parameters and N arguments in common.
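In Python terms, that equivalence can be shown with `functools.partial`: a one-method "class" is just a function whose dependency argument gets fixed up front (the names here are illustrative):

```python
from functools import partial

# A "class" viewed as a family of functions: report(sink, text) has one
# shared dependency (sink, the N common arguments) and per-call
# arguments (text, the Mk method parameters).
def report(sink: list, text: str) -> str:
    line = f"REPORT: {text}"
    sink.append(line)  # 'sink' is the injected dependency
    return line

# "Constructor injection" is exactly fixing the dependency argument:
# instantiation == partial application.
log: list[str] = []
reporter = partial(report, log)
```

Calling `reporter("disk full")` now behaves like a method call on an object that was constructed with `log` injected.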
> Dependency injection is just passing your dependencies in as constructor arguments rather than as hidden dependencies that the class itself creates and manages.
That's the best way to think of it fundamentally. But the main implication is that at some point something has to know how to resolve those dependencies; they can't just be constructed and then injected from magic land. So global cradles/resolvers/containers/injectors/providers (depending on your language and framework) are also typically part and parcel of DI, and that can have some big implications on the structure of your code that some people don't like. Also, you can inject functions and methods, not just constructors.
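A toy version of such a container/resolver, just to make the "something has to know how to resolve those dependencies" point concrete (this is a sketch, not any particular framework's API):

```python
class Container:
    """Minimal resolver: register factories by name, resolve on demand,
    and cache the ones marked as singletons."""
    def __init__(self):
        self._factories = {}
        self._singletons = {}

    def register(self, name, factory, singleton=False):
        self._factories[name] = (factory, singleton)

    def resolve(self, name):
        if name in self._singletons:
            return self._singletons[name]
        factory, singleton = self._factories[name]
        obj = factory(self)  # factories may resolve their own dependencies
        if singleton:
            self._singletons[name] = obj
        return obj

# Wiring lives in one place, away from the classes that use the dependencies.
c = Container()
c.register("config", lambda c: {"dsn": "sqlite://"}, singleton=True)
c.register("db", lambda c: ("db", c.resolve("config")["dsn"]))
```

Even this toy shows the trade-off the parent describes: the wiring is explicit and centralized, but from the consuming code's point of view the dependency appears "from the container" rather than from a visible call site.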
> Dependency injection is just passing your dependencies in as constructor arguments rather than as hidden dependencies that the class itself creates and manages.
This is all well and good, but you also need a bunch of code that handles resolving those dependencies, which oftentimes ends up being complex and hard to debug and will also cause runtime errors instead of compile time errors, which I find to be more or less unacceptable.
Edit: to elaborate on this, I’ve seen DI frameworks not be used in “enterprise” projects a grand total of zero times. I’ve done DI directly in personal projects and it was fine, but in most cases you don’t get to make that choice.
Just last week, when working on a Java project that’s been around for a decade or so, there were issues after migrating it from Spring to Spring Boot - when compiled through the IDE and with the configuration to allow lazy dependency resolution it would work (too many circular dependencies to change the code instead), but when built within a container by Maven that same exact code and configuration would no longer work and injection would fail.
I’m hoping it’s not one of those weird JDK platform bugs but rather an issue with how the codebase is compiled during the container image build, but the issue is mind boggling. More fun, if you take the .jar that’s built in the IDE and put it in the container, then everything works, otherwise it doesn’t. No compilation warnings, most of the startup is fine, but if you build it in the container, you get a DI runtime error about no lazy resolution being enabled even if you hardcode the setting to be on in Java code: https://docs.spring.io/spring-boot/api/kotlin/spring-boot-pr...
I’ve also seen similar issues before containers, where locally it would run on Jetty and use Tomcat on server environments, leading to everything compiling and working locally but throwing injection errors on the server.
What’s more, it’s not like you can (easily) put a breakpoint on whatever is trying to inject the dependencies - after years of Java and Spring I grow more and more convinced that anything that doesn’t generate code that you can inspect directly (e.g. how you can look at a generated MapStruct mapper implementation) is somewhat user hostile and will complicate things. At least modern Spring Boot is good in that more of the configuration is just code, because otherwise good luck debugging why some XML configuration is acting weird.
In other words, DI can make things more messy due to a bunch of technical factors around how it’s implemented (also good luck reading those stack traces), albeit even in the case of Java something like Dagger feels more sane https://dagger.dev/ despite never really catching on.
Of course, one could say that circular dependencies or configuration issues are project specific, but given enough time and projects you will almost inevitably get those sorts of headaches. So while the theory of DI is nice, you can’t just have the theory without practice.
Inclined to agree. Consider that a singleton dependency is essentially a global, and differs from a traditional global, only in that the reference is kept in a container and supplied magically via a constructor variable. Also consider that constructor calls are now outside the application layer frames of the callstack, in case you want to trace execution.
Dependency injection is not hidden. It's quite the opposite: dependency injection lists explicitly all the dependencies in a well defined place.
Hidden dependencies are: untyped context variable; global "service registry", etc. Those are hidden, the only way to find out which dependencies given module has is to carefully read its code and code of all called functions.
To me it's rather anti-functional. Normally, when you instantiate a class, the resulting object's behavior only depends on the constructor arguments you pass it (= the behavior is purely a function of the arguments). With dependency injection, the object's behavior may depend on some hidden configuration, and not even inspecting the class's source code will be able to tell you the source of that behavior, because there's only an @Inject annotation without any further information.
Conversely, when you modify the configuration of which implementation gets injected for which interface type, you potentially modify the behavior of many places in the code (including, potentially, the behavior of dependencies your project may have), without having passed that code any arguments to that effect. A function executing that code suddenly behaves differently, without any indication of that difference at the call site, or traceable from the call site. That’s the opposite of the functional paradigm.
> because there’s only an @Inject annotation without any further information
It sounds like you have a gripe with a particular DI framework and not the idea of Dependency Injection. Because
> Normally, when you instantiate a class, the resulting object’s behavior only depends on the constructor arguments you pass it (= the behavior is purely a function of the arguments)
With Dependency Injection this is generally still true, even more so than normal, because you're making the constructor's dependencies explicit in the arguments. If you have a class CriticalErrorLogger(), you can't directly tell where it logs to: is it using a flat file, stdout, or a network logger? If you instead have a class CriticalErrorLogger(logger *io.writer), then when you create it you know exactly what it's using to log, because you had to instantiate it and pass it in.
Or like Kortilla said, instead of passing in a class or struct you can pass in a function, so using the same example, something like CriticalErrorLogger(fn write)
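The same constructor-injection idea, sketched in Python for concreteness (the class name follows the comment above; the `writer` interface, i.e. anything with a `write` method, is an assumption):

```python
import io

class CriticalErrorLogger:
    # The writer is injected, so the call site decides where logs go:
    # sys.stderr, a file, a socket wrapper, or an in-memory buffer for tests.
    def __init__(self, writer):
        self.writer = writer

    def log(self, msg: str) -> None:
        self.writer.write(f"CRITICAL: {msg}\n")

# Injecting an in-memory buffer makes the logger trivially testable.
buf = io.StringIO()
logger = CriticalErrorLogger(buf)
logger.log("disk full")
```

No framework, container, or annotation is involved; the dependency is simply a constructor argument, which is the point being made above.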
I don't quite understand your example, but I don't think the particulars make much of a difference. We can go with the most general description: With dependency injection, you define points in your code where dependencies are injected. The injection point is usually a variable (this includes the case of constructor parameters), whose value (the dependency) will be set by the dependency injection framework. The behavior of the code that reads the variable and hence the injected value will then depend on the specific value that was injected.
My issue with that is this: From the point of view of the code accessing the injected value (and from the point of view of that code's callers), the value appears like out of thin air. There is no way to trace back from that code where the value came from. Similarly, when defining which value will be injected, it can be difficult to trace all the places where it will be injected.
In addition, there are often lifetime issues involved, when the injected value is itself a stateful object, or may indirectly depend on mutable, cached, or lazy-initialized, possibly external state. The time when the value's internal state is initialized or modified, or whether or not it is shared between separate injection points, is something that can't be deduced from the source code containing the injection points, but is often relevant for behavior, error handling, and general reasoning about the code.
All of this makes it more difficult to reason about the injected values, and about the code whose behavior will depend on those values, from looking at the source code.
> whose value (the dependency) will be set by the dependency injection framework
I agree with your definition except for this part, you don't need any framework to do dependency injection. It's simply the idea that instead of having an abstract base class CriticalErrorLogger, with the concrete implementations of StdOutCriticalErrorLogger, FileCriticalErrorLogger, AwsCloudwatchCriticalErrorLogger which bake their dependency into the class design; you instead have a concrete class CriticalErrorLogger(dep *dependency) and create dependency objects externally that implement identical interfaces in different ways. You do text formatting, generating a traceback, etc, and then call dep.write(myFormattedLogString), and the dependency handles whatever that means.
I agree with you that most DI frameworks are too clever and hide too much, and some forms of DI like setter injection and reflection based injection are instant spaghetti code generators. But things like Constructor Injection or Method Injection are so simple they often feel obvious and not like Dependency Injection even though they are. I love DI, but I hate DI frameworks; I've never seen a benefit except for retrofitting legacy code with DI.
And yeah it does add the issue or lifetime management. That's an easy place to F things up in your code using DI and requires careful thought in some circumstances. I can't argue against that.
But DI doesn't need frameworks or magic methods or attributes to work. And there's a lot of situations where DI reduces code duplication, makes refactoring and testing easier, and actually makes code feel less magical than using internal dependencies.
The basic principle is much simpler than most DI frameworks make it seem. Instead of initializing a dependency internally, receive the dependency in some way. It can be through overly abstracted layers or magic methods, but it can also be as simple as adding an argument to the constructor or a given method that takes a reference to the dependency and uses that.
The pattern you are describing is what I know as the Strategy pattern [0]. See the example there with the Car class that takes a BrakeBehavior as a constructor parameter [1]. I have no issue with that and use it regularly. The Strategy pattern precedes the notion of dependency injection by around ten years.
The term Dependency Injection was coined by Martin Fowler with this article: https://martinfowler.com/articles/injection.html. See how it presents the examples in terms of wiring up components from a configuration, and how it concludes with stressing the importance of "the principle of separating service configuration from the use of services within an application". The article also presents constructor injection as only one of several forms of dependency injection.
That is how everyone understood dependency injection when it became popular 10-20 years ago: A way to customize behavior at the top application/deployment level by configuration, without having to pass arguments around throughout half the code base to the final object that uses them.
Apparently there has been a divergence of how the term is being understood.
[1] The fact that Car is abstract in the example is immaterial to the pattern, and a bit unfortunate in the Wikipedia article, from a didactic point of view.
They're not really exclusive ideas. The Constructor Injection section in Fowler's article is exactly the same as the Strategy pattern. But no one talks about the Strategy pattern anymore, it's all wrapped into the idea of DI and that's what caught on.
It was interesting reading this exchange. I have a similar understanding of DI to you. I have never even heard of a DI framework and I have trouble picturing what it would look like. It was interesting to watch you two converge on where the disconnect was.
It starts off feeling like a superpower, allowing you to change a system's behaviour without changing its code directly. Every time I've encountered it, though, it has quickly devolved into a maintenance nightmare.
I'm talking more specifically about Aspect Oriented Programming though and DI containers in OOP, which seemed pretty clever in theory, but have a lot of issues in reality.
I take no issue with currying in functional programming.
AI scrapers and "sneaker bots" are just the tip of the iceberg.
Why are all these entities concentrated and metastasizing from just a few superhubs?
Why do they look, smell and behave like state-level machinery?
If you've researched this, you'll know exactly what I'm talking about.
Unless complicit, tech leaders (Apple, Google, Microsoft) have a duty to respond swiftly and decisively.
This has been going on far too long.
"Infatica is partnered with Bitdefender, a global leader in cybersecurity, to protect our SDK users from malicious web traffic and content, including infected URLs, untrusted web pages, fraudulent and phishing links, and more."
In the last week I've had to deal with two large-scale influxes of traffic on one particular web server in our organization.
The first involved requests from 300,000 unique IPs in a span of a few hours. I analyzed them and found that ~250,000 were from Brazil. I'm used to using ASNs to block network ranges sending this kind of traffic, but in this case they were spread thinly over 6,000+ ASNs! I ended up blocking all of Brazil (sorry).
A few days later this same web server was on fire again. I performed the same analysis on IPs and found a similar number of unique addresses, but spread across Turkey, Russia, Argentina, Algeria and many more countries. What is going on?! Eventually I think I found a pattern to identify the requests, in that they were using ancient Chrome user agents. Chrome 40, 50, 60 and up to 90, all released 5 to 15 years ago. Then, just before I could implement a block based on these user agents, the traffic stopped.
In both cases the traffic from datacenter networks was limited because I already rate limit a few dozen of the larger ones.
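The ancient-Chrome pattern described above could be matched with something like this sketch. The version cutoff and the regex are assumptions, and note that Edge and other Chromium-based UAs also carry a `Chrome/` token, so a real rule would need allowances:

```python
import re

# Capture the Chrome major version from a User-Agent string.
CHROME_RE = re.compile(r"Chrome/(\d+)\.")
MIN_CHROME_MAJOR = 100  # hypothetical cutoff; tune to your real traffic

def looks_ancient(user_agent: str) -> bool:
    m = CHROME_RE.search(user_agent)
    return m is not None and int(m.group(1)) < MIN_CHROME_MAJOR

ua = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/40.0.2214.85 Safari/537.36")
print(looks_ancient(ua))  # the Chrome 40 UA from the attack matches
```

In practice you'd wire this into your reverse proxy or WAF rather than application code, but the matching logic is the same.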
Sysadmin life...
Try Anubis: <https://anubis.techaro.lol>
It's a reverse proxy that presents a PoW (proof-of-work) challenge to every new visitor. It shifts the initial cost of accessing your server's resources back onto the client. Assuming your uplink can handle 300k clients requesting a single 70 kB web page, it should solve most of your problems.
For science, can you estimate your peak QPS?
I've seen a few attacks where the operators placed malicious code on high-traffic sites (e.g. some government thing, larger newspapers), and then just let browsers load your site as an img. Did you see images, CSS, JS being loaded from these IPs? If they were expecting an image, they wouldn't parse the HTML or load other resources.
It's a pretty effective attack because you get large numbers of individual browsers to contribute. Hosters don't care, so unless the site owners are technical enough, these attacks can stay online quite a while.
If they work with Referrer Policy, they should be able to mask themselves fairly well - the ones I saw back then did not.
We all agree that AI crawlers are a big issue as they don't respect any established best practices, but we rarely talk about the path forward. Scraping has been around for as long as the internet, and it was mostly fine. There are many very legitimate use cases for browser automation and data extraction (I work in this space).
So what are potential solutions? We're somehow still stuck with CAPTCHAs, a 25-year-old concept that wastes millions of human hours and billions in infra costs [0].
How can we enable beneficial automation while protecting against abusive AI crawlers?
[0] https://arxiv.org/abs/2311.10911
My pet peeve is that using the term "AI crawler" for this conflates things unnecessarily. There's some people who are angry at it due to anti-AI bias and not wishing to share information, while there are others who are more concerned about it due to the large amount of bandwidth and server overloading.
Not to mention that it's unknown if these are actually from AI companies, or from people pretending to be AI companies. You can set anything as your user agent.
It's more appropriate to mention the specific issue one has with the crawlers, like "they request things too quickly" or "they're overloading my server". Then from there, it is easier to come to a solution than just "I hate AI". For example, one would realize that things like Anubis have existed forever, they are just called DDoS protection, specifically those using proof-of-work schemes (e.g. https://github.com/RuiSiang/PoW-Shield).
This also shifts the discussion away from something that adds to the discrimination against scraping in general, and more towards what is actually the issue: overloading servers, or in other words, DDoS.
I wrote an article about a possible proof of personhood solution idea: https://mjaseem.github.io/tech/2025/04/12/proof-of-humanity.....
The broad idea is to use zero knowledge proofs with certification. It sort of flips the public key certification system and adds some privacy.
For this to get into place, the powers in charge would need to be swayed.
Blame the "AI" companies for that. I am glad the small web is pushing hard against these scrapers, with the rise of Anubis as a starting point
> Blame the "AI" companies for that. I am glad the small web is pushing hard towards these scrapers, with the rise of Anubis as a starting point
Did you mean "against"?
Corrected, thanks
> So what are potential solutions?
It won't fully solve the problem, but with the problem relatively identified, you must then ask why people are engaging in this behavior. Answer: money, for the most part. Therefore, follow the money and identify the financial incentives driving this behavior. This leads you pretty quickly to a solution most people would reject out-of-hand: turn off the financial incentive that is driving the enshittification of the web. Which is to say, kill the ad-economy.
Or at least better regulate it while also levying punitive damages that are significant enough to both dissuade bad-actors and encourage entities to view data-breaches (or the potential therein) and "leakage[0]" as something that should actually be effectively secured against. After all, there are some upsides to the ad-economy that, without it, would present some hard challenges (e.g., how many people are willing to pay for search? what happens to the vibrant sphere of creators of all stripes that are incentivized by the ad-economy? etc).
Personally, I can't imagine this would actually happen. Pushback from monied interests aside, most people have given up on the idea of data-privacy or personal ownership of their data, if they ever even cared in the first place. So, in the absence of willingness to do something about the incentive for this malign behavior, we're left with few good options.
0: https://news.ycombinator.com/item?id=43716704 (see comments on all the various ways people's data is being leaked/leached/tracked/etc)
The best solution I've seen is to hit everyone with a proof of work wall and whitelist the scrapers that are welcome (search engines and such).
Running SHA hash calculations for a second or so once every week is not bad for users, but with scrapers constantly starting new sessions they end up spending most of their time running useless JavaScript, slowing them down significantly.
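The core of such a proof-of-work wall is tiny: the server issues a random challenge, the client grinds for a nonce whose hash clears a difficulty target, and the server verifies with a single hash. A minimal sketch (the difficulty constant is an illustrative assumption; real systems like Anubis add session binding and expiry):

```python
import hashlib
import secrets

DIFFICULTY_BITS = 16  # ~65k hashes on average; deployments tune this

def meets_target(challenge: bytes, nonce: int) -> bool:
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    # Accept only if the first DIFFICULTY_BITS bits of the hash are zero.
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

def solve(challenge: bytes) -> int:
    nonce = 0
    while not meets_target(challenge, nonce):
        nonce += 1
    return nonce

challenge = secrets.token_bytes(16)    # server-issued, tied to the session
nonce = solve(challenge)               # the client-side grind
assert meets_target(challenge, nonce)  # server-side check is one hash
```

The asymmetry is the point: verification costs the server one hash, while each new scraper session pays the full search cost.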
The most effective alternative to proof of work calculations seems to be remote attestation. The downside is that you're getting captchas if you're one of the 0.1% who disable secure boot and run Linux, but the vast majority of web users will live a captcha free life. This same mechanism could in theory also be used to authenticate welcome scrapers rather than relying on pure IP whitelists.
But people don’t interact with your website anymore; they ask an AI. So the AI crawler is a real user.
I say we ask Google Analytics to count an AI crawler as a real view. Let’s see who’s most popular.
I hate this but I suspect a login-only deanonymised web (made simple with chrome and WEI!) is the future. Firefox users can go to hell.
We won't.
FWIW, Trend Micro wrote up a decent piece on this space in 2023.
It is still a pretty good lay-of-the-land.
https://www.trendmicro.com/vinfo/us/security/news/vulnerabil...
> I am now of the opinion that every form of web-scraping should be considered abusive behaviour and web servers should block all of them. If you think your web-scraping is acceptable behaviour, you can thank these shady companies and the “AI” hype for moving you to the bad corner.
I imagine that e.g. Youtube would be happy to agree with this. Not that it would turn them against AI generally.
yeah, but you can't, that's the problem. Plenty of service operators would like to block every scraper that doesn't obey their robots.txt, but there's no good way to do that without blocking human traffic too (Anubis et al are okay, but they are half-measures).
On a separate note, I believe open web scraping has been a massive benefit to the internet on net, and almost entirely positive pre-2021. Web scraping & crawling enables search engines, services like Internet Archive, walled-garden-busting (like Invidious, yt-dlp, and Nitter), mashups (Spotube, IFTT, and Plaid would have been impossible to bootstrap without web scraping), and all kinds of interesting data science projects (e.g. scraping COVID-19 stats from local health departments to patch together a picture of viral spread for epidemiologists).
We should have a way to verify the user agents of valid and useful scrapers such as the Internet Archive. Some kind of cryptographic signature of their user agents, validatable by any reverse proxy, seems like a good start.
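One toy version of this idea, using a shared secret and HMAC (everything here is assumed for illustration; a real scheme would use public-key signatures so that the proxy never holds signing material, and would sign a timestamp to prevent replay):

```python
import hmac
import hashlib

# Hypothetical secret distributed to the reverse proxy out of band.
SECRET = b"demo-secret-rotate-me"

def sign_agent(user_agent: str) -> str:
    tag = hmac.new(SECRET, user_agent.encode(), hashlib.sha256).hexdigest()
    return f"{user_agent}; sig={tag}"

def verify_agent(header: str) -> bool:
    try:
        ua, sig = header.rsplit("; sig=", 1)
    except ValueError:
        return False  # no signature present
    expected = hmac.new(SECRET, ua.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

signed = sign_agent("ArchiveBot/2.0 (+https://archive.org)")
print(verify_agent(signed))             # True
print(verify_agent("FakeBot; sig=bad")) # False
```

`compare_digest` avoids timing side channels when checking the tag.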
Self signed, I hope.
Or do you want a central authority that decides who can do new search engines?
Using DANE is probably the best idea even though it's still not mainstream
> Plenty of service operators would like to block every scraper that doesn't obey their robots.txt, but there's no good way to do that without blocking human traffic too (Anubis et al are okay, but they are half-measures)
Why is Anubis-type mitigations a half-measure?
Anubis, go-away, etc are great, don't get me wrong -- but what Anubis does is impose a cost on every query. The website operator is hoping that the compute will have a rate-limiting effect on scrapers while minimally impacting the user experience. It's almost like chemotherapy, in that you're poisoning everyone in the hope that the aggressive bad actors will be more severely affected than the less aggressive good actors. Even the Anubis readme calls it a nuclear option. In practice it appears to work pretty well, which is great!
It's a half-measure because:
1. You're slowing down scrapers, not blocking them. They will still scrape your site content in violation of robots.txt.
2. Scrapers with more compute than IP proxies will not be significantly bottlenecked by this.
3. This may lead to an arms race where AI companies respond by beefing up their scraping infrastructure, necessitating more difficult PoW challenges, and so on. The end result of this hypothetical would be a more inconvenient and inefficient internet for everyone, including human users.
To be clear: I think Anubis is a great tool for website operators, and one of the best self-hostable options available today. However, it's a workaround for the core problem that we can't reliably distinguish traffic from badly behaving AI scrapers from legitimate user traffic.
Yeah, also this means the death of archival efforts like the Internet Archive.
Welcome scrapers (IA, maybe Google and Bing) can publish their IP addresses and get whitelisted. Websites that want to prevent being on the Internet Archive can pretty much just ask for their website to be excluded (even retroactively).
[Cloudflare](https://developers.cloudflare.com/cache/troubleshooting/alwa...) tags the internet archive as operating from 207.241.224.0/20 and 208.70.24.0/21 so disabling the bot-prevention framework on connections from there should be enough.
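Checking a client IP against those two ranges is a one-liner with the stdlib, e.g. in whatever middleware decides whether to serve the PoW challenge:

```python
import ipaddress

# Internet Archive egress ranges cited above.
IA_NETS = [ipaddress.ip_network(n) for n in
           ("207.241.224.0/20", "208.70.24.0/21")]

def is_internet_archive(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in IA_NETS)

print(is_internet_archive("207.241.229.10"))  # True, inside the /20
print(is_internet_archive("198.51.100.7"))    # False
```

Unlike user-agent strings, source IPs on an established TCP connection can't be trivially spoofed, which is why IP allowlists are the usual mechanism for welcome crawlers.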
a large chunk of internet archive's snapshots are from archiveteam, where "warriors" bring their own ips (and they crawl respectfully!). save page now is important too, but you don't realise what is useful until you lose it.
That's basically asking to close the market in favor of the current actors.
New actors have the right to emerge.
They have the right to try to convince me to let them scrape me. Most of the time they're thinly veiled data traders. I haven't seen any new company try to scrape my stuff since maybe Kagi.
Kagi is welcome to scrape from their IP addresses. Other bots that behave are fine too (Huawei and various other Chinese bots don't and I've had to put an IP block on those).
No they don't.
There's no rule that you have to let anyone in who claims to be a web crawler.
which is why they will stop claiming to be one.
So who decides that you can be one? Right now it's Cloudflare, a literal monopoly...
The truth is that I sympathize with the people trying to use mobile connections to bypass such a cartel.
What Cloudflare is doing now is worse than the web crawlers themselves and the legality of blocking crawlers with a monopoly is dubious at best.
so what happened to competition fostering a better outcome for all then?
This sounds like it would be a good idea. Create a whitelist of IPs and block the rest.
It's interesting but so far there is no definitive proof it's happening.
People are jumping to conclusions a bit fast over here. Yes, technically it's possible, but this kind of behavior would be relatively easy to spot, because the app would have to make direct connections to the website it wants to scrape.
Your calculator app for instance connecting to CNN.com ...
iOS has an App Privacy Report where one can check what connections are made by each app, how often, the last one, etc.
Android by Google doesn't have such a useful feature, of course, but you can run a third-party firewall like PCAPdroid, which I highly recommend.
Macos (little snitch).
Windows (fort firewall).
Not everyone runs these apps, obviously, only the most nerdy like myself, but we're also the kind of people who would report an app using our devices to make what is, in fact, a zombie or bot network.
I'm not saying it's necessarily false but imo it remains a theory until proven otherwise.
This is a hilariously optimistic, naive, disconnected-from-reality take. What sort of "proof" would be sufficient for you? TFA of course includes data from the author's own server logs^, but it also references real SDKs and businesses selling this exact product. You can view the pricing page yourself, right next to stats on how many IPs are available for you to exploit. What else do you need to see?
^ edit: my mistake, the server logs I mentioned were from the author's prior blog post on this topic, linked to at the top of TFA: https://jan.wildeboer.net/2025/02/Blocking-Stealthy-Botnets/
> iOS has an App Privacy Report where one can check what connections are made by each app, how often, the last one, etc.
Privacy reports do not include that information. They include broad areas of information the app claims to gather. There is zero connection between those claimed areas and what the app actually does unless app review notices something that doesn't match up. But none of that information is updated dynamically, and it has never actually included the domains the app connects to. You may be confusing it with the old domain declarations for less secure HTTP connections. Once the connections met the system standards you no longer needed to declare it.
I wasn't aware of this feature. But apparently it does include that information. I just enabled it and can see the domains that apps connect to. https://support.apple.com/en-us/102188
Pretty neat, actually. Thanks for looking up that link.
There is already a lot of proof. Just ask for a sales pitch from companies selling these data and they will gladly explain everything to you.
Go to a data conference like Neudata and you will see. You can have scraped data from user devices, real-time locations, credit card, Google analytics, etc.
Given this is a thing even in browser plugins, and that so very few people analyse their firewalls, I'd not discount it at all. Much of the world's users have no clue, and app stores are notoriously bad at reacting even to publicised malware, e.g. 'free' VPNs in the iOS Store.
> iOS has an App Privacy Report where one can check what connections are made by each app, how often, the last one, etc.
How often is the average calculator app user checking their Privacy Report? My guess: not many!
All it takes is one person to find out and raise the alarm. The average user doesn't read the source code behind openssl or whatever either, that doesn't mean there's no gains in open sourcing it.
The real solution is to add a permission for network access, with the default set to deny.
The average user is also not reading these raised “alarms”. And if an app has a bad name, another one will show up with a different name on the same day.
You're on a tech forum, you must have seen one of the many post about app, either on Android or iPhone, that acts like spyware.
They happen from time to time; the last one was not more than two weeks ago, when it was shown that many apps were able to read the list of all other apps installed on an Android device, and that Google refused to fix that.
Do you really believe that an app used to make your device part of a bot network wouldn't be posted over here ?
"You're on a tech forum", that's exactly the point. The "average user" is not on a tech forum though, the average user opens the app store of their platform, types "calculator" and installs the first one that's free.
Botnets as a Service are absolutely happening, but as you allude to, the scope of the abuse is very different on iOS than, say, Windows.
> So there is a (IMHO) shady market out there that gives app developers on iOS, Android, MacOS and Windows money for including a library into their apps that sells users network bandwidth
AKA "why do Cloudflare and Google make me fill out these CAPTCHAs all day"
I don't know why Play Protect/MS Defender/whatever Apple has for antivirus don't classify apps that embed such malware as such. It's ridiculous that this is allowed to go on when detection is so easy. I don't know a more obvious example of a trojan than an SDK library making a user's device part of a botnet.
Cloudflare and Google use CAPTCHAs to sell web scrapers? I don't get your point. I was under the impression the data is used to train models.
The implication is that the users that are being constantly presented with CAPTCHAs are experiencing that because they are unwittingly proxying scrapers through their devices via malicious apps they've installed.
.. or that other people on their network/Shared public IP have installed
or just that they don't run Windows/macOS with Chrome like everyone else and it's "suspicious". I get Cloudflare CAPTCHAs all the time with Firefox on Linux... (and I'm pretty sure there's no such app in my home network!)
When a random device on your network gets infected with crap like this, your network becomes a bot egress point, and anti bot networks respond appropriately. Cloudflare, Akamai, even Google will start showing CAPTCHAs for every website they protect when your network starts hitting random servers with scrapers or DDoS attacks.
This is even worse with CG-NAT if you don't have IPv6 to solve the CG-NAT problem.
I don't think the data they collect is used to train anything these days. Cloudflare is using AI generated images for CAPTCHAs and Google's actual CAPTCHAs are easier for bots than humans at this point (it's the passive monitoring that makes it still work a little bit).
Trojans in your mobile apps ruin your IP's reputation which comes back to you in the form of frequent, annoying CAPTCHAs.
it's not technically malware, you agreed to it when you accepted the terms of service :^)
It's malware if it does something malicious.
I don't know if I should be surprised about what's described in this article, given the current state of the world. Certainly I didn't know about it before, and I agree with the article's conclusion.
Personally, I think the "network sharing" software bundled with apps should fall into the category of potentially unwanted applications along with adware and spyware. All of the above "tag along" with something the user DID want to install, and quietly misuse the user's resources. Proxies like this definitely have an impact for metered/slow connections - I'm tempted to start Wireshark'ing my devices now to look for suspicious activity.
There should be a public repository of apps known to have these shady behaviours. Having done some light web scraping for archival/automation before, it's a pity that it'll become collateral damage in the anti-AI-botfarm fight.
I agree, this should be called spyware, and malware. There are many other kinds of software that also should be, though netcat and ncat (probably) aren't malware.
I agree, but the harm done to the users is only one part of the total harm. I think it's quite plausible that many users wouldn't mind some small amount of their bandwidth being used, if it meant being able to use a handy browser extension that they would otherwise have to pay actual dollars for -- but the harm done to those running the servers remains.
Has anyone tried to compile a list of software that uses these libraries? It would be great to know what apps to avoid
> Has anyone tried to compile a list of software that uses these libraries? It would be great to know what apps to avoid
I wouldn't mind reading a comprehensive report on SOTA with regard to bot-blocking.
Sure, there's Anubis (although someone elsethread called it a half-measure, and I'd like to know why), there are CAPTCHAs, there's relying on a monopoly (Cloudflare, etc.) that probably also wants to run its own bots at some point, but what else is there?
In the case of Android, εxodus has one[1], though I couldn't find the malware library listed in TFA. Aurora Store[2], a FOSS Google Play Store client, also integrates it.
[1] https://reports.exodus-privacy.eu.org/en/trackers/ [2] https://f-droid.org/packages/com.aurora.store/
That seems to be looking at tracking and data collection libraries, though, for things like advertising and crash reporting. I don't see any mention of the kind of 'network sharing' libraries that this article is about. Have I missed it?
No but here's the thing. Being in the industry for many years I know they are required to mention it in the TOS when using the SDKs. A crawler pulling app TOSs and parsing them could be a thing. List or not, it won't be too useful outside this tech community.
A good portion of free VPN apps sell their traffic. This was a thing even before the AI bot explosion.
The broken thing about the web is that in order for data to remain readable, a unique sysadmin somewhere has to keep a server running in the face of an increasingly hostile environment.
If instead we had a content addressed model, we could drop the uniqueness constraint. Then these AI scrapers could be gossiping the data to one another (and incidentally serving it to the rest of us) without placing any burden on the original source.
Having other parties interested in your data should make your life easier (because other parties will host it for you), not harder (because now you need to work extra hard to host it for them).
there is no incentive for different companies to share data with each other, or with anyone really (facebook leeching books?)
Are there any systems like that, even if experimental?
IPFS
I had high hopes for IPFS, but even it has vectors for abuse.
See https://arxiv.org/abs/1905.11880 [Hydras and IPFS: A Decentralised Playground for Malware]
Can you point me at what you mean? I'm not immediately finding something that indicates that it is not fit for this use case. The fact that bad actors use it to resist those who want to shut them down is, if anything, an endorsement of its durability. There's a bit of overlap between resisting the AI scrapers and resisting the FBI. You can either have a single point of control and a single point of failure, or you can have neither. If you're after something that's both reliable and reliably censorable--I don't think that's in the cards.
That's not to say that it is a ready replacement for the web as we know it. If you have hash-linked everything then you wind up with problems trying to link things together, for instance. Once two pages exist, you can't after-the-fact create a link between them because if you update them to contain that link then their hashes change so now you have to propagate the new hash to people. This makes it difficult to do things like have a comments section at the bottom of a blog post. So you've got to handle metadata like that in some kind of extra layer--a layer which isn't hash linked and which might be susceptible to all the same problems that our current web is--and then the browser can build the page from immutable pieces, but the assembly itself ends up being dynamic (and likely sensitive to the users preference, e.g. dark mode as a browser thing not a page thing).
But I still think you could move maybe 95% of the data into an immutable hash-linked world (think of these as nodes in a graph), the remaining 5% just being tuples of hashes and pubic keys indicating which pages are trusted by which users, which ought to be linked to which others, which are known to be the inputs and output of various functions, and you know... structure stuff (these are our graph's edges).
The edges, being smaller, might be subject to different constraints than the web as we know it. I wouldn't propose that we go all the way to a blockchain where every device caches every edge, but it might be feasible for my devices to store all of the edges for the 5% of the web I care about, and your devices to store the edges for the 5% that you care about... the nodes only being summoned when we actually want to view them. The edges can be updated when our devices contact other devices (based on trust, like you know that device's owner personally) and ask "hey, what's new?"
I've sort of been freestyling on this idea in isolation, probably there's already some projects that scratch this itch. A while back I made a note to check out https://ceramic.network/ in this capacity, but I haven't gotten down to trying it out yet.
Assuming the right incentives can be found to prevent widespread leeching, a distributed content-addressed model indeed solves this problem, but introduces the problem of how to control your own content over time. How do you get rid of a piece of content? How do you modify the content at a given URL?
I know, as far as possible it's a good idea to have content-immutable URLs. But at some point, I need to make www.myexamplebusiness.com show new content. How would that work?
Except no one wants content addressed data - because if you knew what it was you wanted, then you would already have stored it. The web as we know it is an index - it's a way to discover that data is available and specifically we usually want the latest data that's available.
AI scrapers aren't trying to find things they already know exist, they're trying to discover what they didn't know existed.
Yes, for the reasons you describe, you can't be both a useful web-like protocol and also 100% immutable/hash-linked.
But there's a lot of middle ground to explore here. Loading a modern web page involves making dozens of requests to a variety of different servers, evaluating some JavaScript, and then doing it again a few times, potentially moving several MB of data. The part people want, the thing you don't already know exists, is hidden behind that rather heavy door. It doesn't have to be that way.
If you already know about one thing (by its cryptographic hash, say) and you want to find out which other hashes it's now associated with--associations that might not have existed yesterday--that's much easier than we've made it. It can be done:
- by moving KB, not MB; we're just talking about a tuple of hashes here, maybe a public key and a signature
- without placing additional burden on whoever authored the first thing, they don't even have to be the ones who published the pair of hashes that your scraper is interested in
Once you have the second hash, you can then reenter immutable-space to get whatever it references. I'm not sure if there's already a protocol for such things, but if not then we can surely make one that's more efficient and durable than what we're doing now.
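The node/edge split sketched above fits in a few lines. This is a toy in-memory version (a real system would gossip the edge tuples and sign them, as the comment suggests):

```python
import hashlib

store = {}   # hash -> immutable content: the "nodes"
edges = []   # (src_hash, dst_hash) tuples: the small, mutable layer

def put(content: bytes) -> str:
    key = hashlib.sha256(content).hexdigest()
    store[key] = content          # idempotent: same bytes, same key
    return key

def link(src: str, dst: str) -> None:
    edges.append((src, dst))      # cheap to publish separately, any time

post = put(b"original blog post")
comment = put(b"a comment on it")
link(post, comment)               # added after the fact; hashes unchanged
```

Note how `link` never touches `store`: associating two existing pages doesn't rewrite either one, which is exactly the property the comment is after.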
But we already have HEAD requests and etags.
It is entirely possible to serve a fully cached response that says "you already have this". The problem is...people don't implement this well.
> because if you knew what it was you wanted, then you would already have stored it.
"Content-addressable" has a broader meaning than what you seem to be thinking of -- roughly speaking, it applies if any function of the data is used as the "address". E.g., git commits are content-addressable by their SHA1 hashes.
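For the git example: a blob's address is the SHA-1 of a short header plus the content, so anyone holding the same bytes derives the same address. A minimal reimplementation:

```python
import hashlib

def git_blob_hash(data: bytes) -> str:
    # Git hashes "blob <size>\0" followed by the content.
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()

print(git_blob_hash(b"hello\n"))
# → ce013625030ba8dba906f756967f9e9ca394464a, same as `git hash-object`
```

The address is a pure function of the data, with no server or URL involved — which is the property that lets anyone re-serve the content verifiably.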
Are there any lists with known c&c servers for these services that can be added to Pihole/etc?
You can use one of the list from here: https://github.com/hagezi/dns-blocklists
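Those lists are mostly in the plain hosts/domain format, so consuming one outside Pi-hole is straightforward. A sketch with made-up domain names, matching a domain and its parents against the list:

```python
# Tiny inline sample in the hosts format these lists commonly use
# (the domains here are invented placeholders, not real entries).
sample = """
# comment line
0.0.0.0 proxy-sdk.example.com
0.0.0.0 residential-net.example.org
"""

blocked = set()
for line in sample.splitlines():
    line = line.strip()
    if line and not line.startswith("#"):
        blocked.add(line.split()[-1])   # last field is the domain

def is_blocked(domain: str) -> bool:
    # Match the domain itself and every parent domain on the list.
    parts = domain.split(".")
    return any(".".join(parts[i:]) in blocked for i in range(len(parts)))

print(is_blocked("api.proxy-sdk.example.com"))  # True via parent match
print(is_blocked("example.net"))                # False
```

In a real setup you'd fetch the list over HTTPS on a schedule and feed it to your resolver instead of checking in application code.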
We need a list of apps that include these libraries, and every malware scanner - including Windows Defender, Play Protect and whatever Apple calls theirs - needs to put infected applications into quarantine immediately. Just because it's not directly causing damage to the device the malware is running on, that doesn't mean it's not malware.
Apps should be required to ask for permission to access specific domains. Similar to the tracking protection, Apple introduced a while ago.
Not sure how this could work for browsers, but the other 99% of apps I have on my phone should work fine with just a single permitted domain.
My iPhone occasionally displays an interrupt screen to remind me that my weather app has been accessing my location in the background and to confirm continued access.
It should also do something similar for apps making chatty background requests to domains not specified at app review time. The legitimate use cases for that behaviour are few.
On the one hand, yes this could work for many cases. On the other hand, good bye p2p. Not every app is a passive client-server request-response. One needs to be really careful with designing permission systems. Apple has already killed many markets before they had a chance to even exist, such as companion apps for watches and other peripherals.
P2P was practically dead on iPhone even back in 2010. The whole "don't burn the user's battery" thing precludes mobile phones doing anything with P2P other than leeching off of it. The only exceptions are things like AirDrop; i.e. locally peer-to-peer things that are only active when in use and don't try to form an overlay or mesh network that would require the phone to become a router.
And, AFAIK, you already need special permission for anything other than HTTPS to specific domains on the public Internet. That's why apps ping you about permissions to access "local devices".
> On the other hand, good bye p2p.
You mean, good bye using my bandwidth without my permission? That's good. And if I install a bittorrent client on my phone, I'll know to give it permission.
> such as companion apps for watches and other peripherals
That's just apple abusing their market position in phones to push their watch. What does it have to do with p2p?
> using my bandwidth without my permission
What are you talking about?
> What does it have to do with p2p?
It’s an example of when you design sandboxes/firewalls it’s very easy to assume all apps are one big homogenous blob doing rest calls and everything else is malicious or suspicious. You often need strange permissions to do interesting things. Apple gives themselves these perms all the time.
Wait, why should applications be allowed to do REST calls by default?
> What are you talking about?
That’s the main use case for p2p in an application, isn’t it? Reducing the vendor’s bandwidth bill…
Maybe there could be a special entitlement that Apple's reviewers would only grant to applications that have a legitimate reason to require such connections. Then only applications granted that permission would be able to make requests to arbitrary domains / IP addresses.
That's how it works with other permissions most applications should not have access to, like accessing user locations. (And private entitlements third party applications can't have are one way Apple makes sure nobody can compete with their apps, but that's a separate issue.)
Android is so fucking anti-privacy that they still don't have an INTERNET access revoke toggle. The one they have currently is broken and can easily be bypassed with google play services (another highly privileged process running for no reason other than to sell your soul to google). GrapheneOS has this toggle luckily. Whenever you install an app, you can revoke the INTERNET access at the install screen and there is no way that app can bypass it
Asus added this to their phones which is nice.
Do you suggest outright forbidding TCP connections for user software? Because you can compile OpenSSL or any other TLS library and make a TCP connection to port 443 which will be opaque to the operating system. They can do wild things like kernel-level DPI on outgoing connections to find out the host, but that quickly turns into ridiculous competition.
> but that quickly turns into ridiculous competition.
Except the platform providers hold the trump card. Fuck around, if they figure it out you'll be finding out.
I think capability based security with proxy capabilities is the way to do it, and this would make it possible for the proxy capability to intercept the request and ask permission, or to do whatever else you want it to do (e.g. redirections, log any accesses, automatically allow or disallow based on a file, use or ignore the DNS cache, etc).
The system may have some such functions built in, and asking permission might be a reasonable thing to include by default.
Try actually using a system like this. OpenSnitch and LittleSnitch do it for Linux and MacOS respectively. Fedora has a pretty good interface for SELinux denials.
I've used all of them, and it's a deluge: it is too much information to reasonably react to.
Your broad choice is either deny or accept, but there's no sane way to reliably know which you should pick.
This is not and cannot be an individual problem: the easy part is building high fidelity access control, the hard part is making useful policy for it.
I suggested proxy capabilities, that it can easily be reprogrammed and reconfigured; if you want to disable this feature then you can do that too. It is not only allow or deny; other things are also possible (e.g. simulate various error conditions, artificially slow down the connection, go through a proxy server, etc). (This proxy capability system would be useful for stuff other than network connections too.)
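To make the idea concrete, here is a minimal Python sketch of such a proxy capability. The `policy` callback, the delay knob, and all names are illustrative assumptions, not a real API; the point is that the capability object, not the app, decides what happens to each connection request:

```python
import socket
import time

class ProxyCapability:
    """A stand-in 'proxy capability' for outbound connections.

    Code is handed this object instead of opening sockets directly,
    so each request can be allowed, denied, logged, delayed, or
    redirected by whoever configured the capability.
    """

    def __init__(self, policy, delay_s=0.0, log=print):
        self.policy = policy      # callable: (host, port) -> bool
        self.delay_s = delay_s    # artificial slowdown, if any
        self.log = log

    def connect(self, host, port):
        self.log(f"connection requested: {host}:{port}")
        if not self.policy(host, port):
            raise PermissionError(f"blocked: {host}:{port}")
        if self.delay_s:
            time.sleep(self.delay_s)  # simulate a slow or throttled link
        return socket.create_connection((host, port), timeout=10)

# Example policy: allow only an explicit allowlist of (host, port) pairs.
allowed = {("example.com", 443)}
cap = ProxyCapability(policy=lambda h, p: (h, p) in allowed)
```

Swapping in a different `policy` (ask the user, consult a file, always deny) needs no change to the code holding the capability, which is the appeal of the pattern.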
> it is too much information to reasonably react to.
Even if it asks, that does not necessarily mean it has to ask every time, if the user lets it keep the answer (either for the current session or until the user deliberately deletes this data). Also, if it asks too much because the app tries to access too many remote servers, then it might be spyware, malware, etc. anyways, and is worth investigating in case that is what it is.
> the hard part is making useful policy for it.
What the default settings should be is a significant issue. However, changing the policies in individual cases for different uses, is also something that a user might do, since the default settings will not always be suitable.
If whoever manages the package repository, app store, etc is able to check for malware, then this is a good thing to do (although it should not prohibit the user from installing their own software and modifying the existing software), but security on the computer is also helpful, and neither of these is the substitute for the other; they are together.
The vast majority of revenue in the mobile app ecosystem is ads, which are by design pulled from 3rd parties (and are part of the broader problem discussed in this post).
I am waiting for Apple to enable /etc/hosts or something similar on iOS devices.
Oh, that's an interesting idea. A local DNS where I have to add every entry. A white list rather than Australia's national blacklist.
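Worth noting that /etc/hosts alone can't quite do this: names it doesn't list still fall through to normal DNS. A local resolver can, though. A rough sketch with dnsmasq, deny-by-default; the specific host entry is a made-up example:

```
# dnsmasq.conf -- allowlist resolver (sketch)
no-resolv                               # never forward to upstream DNS
address=/#/0.0.0.0                      # wildcard: answer everything with 0.0.0.0
host-record=example.com,93.184.216.34   # explicit allowlist entries win over the wildcard
```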
Residential IP proxies have some weaknesses. One is that they often change IP addresses during a single web session. Second, if IPs come from the same proxy provider, they are often concentrated within a single ASN, making them easier to detect.
We are working on an open‑source fraud prevention platform [1], and detecting fake users coming from residential proxies is one of its use cases.
[1] https://www.github.com/tirrenotechnologies/tirreno
The first blog post in this series[1], linked to at the top of TFA, offers an analysis of the potential of using ASNs to detect such traffic. Their conclusion was that ASNs are not helpful for this use case, showing that across the 50k IPs they've blocked there are fewer than 4 IP addresses per ASN, on average.
[1] https://jan.wildeboer.net/2025/02/Blocking-Stealthy-Botnets/
What was done manually in the first blog is exactly what tirreno helps to achieve by analyzing traffic; here is a live example [1]. Blocking an entire ASN should not be considered a strategy when real users are involved.
Regarding the first post, it's rare to see both datacenter network IPs and mobile proxy IP addresses used simultaneously. This suggests the involvement of more than one botnet. The main idea is to avoid using IP addresses as the sole risk factor. Instead, they should be considered as just one part of the broader picture of user behavior.
[1] https://play.tirreno.com
>One is that they often change IP addresses during a single web session. Second, if IPs come from the same proxy provider, they are often concentrated within a single ASN, making them easier to detect.
Both are pretty easy to mitigate with a geoip database and some smart routing. One "residential proxy" vendor even has session tokens so your source IP doesn't randomly jump between each request.
And this is the exact reason why IP addresses cannot be considered as the one and only signal for fraud prevention.
At least here in the US most residential ISPs have long leases and change infrequently, weeks or months.
Trying to understand your product, where is it intended to sit in a network? Is it a standalone tool that you use to identify these IPs and feed into something else for blockage or is it intended to be integrated into your existing site or is it supposed to proxy all your web traffic? The reason I ask is it has fairly heavyweight install requirements and Apache and PHP are kind of old school at this point, especially for new projects and companies. It's not what they would commonly be using for their site.
Indeed, if it's a real user from a residential IP address, in most cases it will be the same network. However, if it's a proxy over residential IPs, there could be 10 requests from one network, the 11th request from a second network, and the 12th request back from the first. This is a red flag.
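That flip-flop pattern is easy to sketch. A toy Python check, where the (ip, asn) input shape and the switch threshold are assumptions; one legitimate network change (home wi-fi to cellular) stays under the threshold:

```python
def flag_network_flapping(requests, max_switches=2):
    """Flag a session whose requests flip between networks.

    `requests` is an ordered list of (ip, asn) pairs for one session.
    A real user may change networks once or twice; bouncing back and
    forth between networks repeatedly is the proxy pattern.
    """
    switches = 0
    prev_asn = None
    for _ip, asn in requests:
        if prev_asn is not None and asn != prev_asn:
            switches += 1
        prev_asn = asn
    return switches > max_switches

# One hop (home wi-fi -> mobile) is fine; A->B->A->B is suspicious.
session = [("203.0.113.5", "AS100")] * 10 + [("198.51.100.7", "AS200")]
```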
Thank you for your question. tirreno is a standalone app that needs to receive API events from your main web application. It can work fine with 512 MB of RAM for Postgres, or even less; however, in most cases we're talking about millions of events, which do require resources.
It's much easier to write a stable application without dependencies based on mature technologies. tirreno is fairly 'boring software'.
My phone will be on the home network until I walk out of the house and then it will change networks. This should not be a red flag.
Effective fraud prevention relies on both the full user context and the behavioral patterns of known online fraudsters. The key idea is that an IP address cannot be used as a red flag on its own without considering the broader context of the account. However, if we know that the fraudsters we're dealing with are using mobile networks proxies and are randomly switching between two mobile operators, that is certainly a strong risk signal.
An awful lot of free Wi-Fi networks you find in malls are operated by different providers. Walking from one side of a mall to the other while my phone connects to all the Wi-Fi networks I’ve used previously would have you flag me as a fraudster if I understand your approach correctly.
We are discussing user behavior in the context of a web system. The fact that your device has connected to different Wi-Fi networks doesn't necessarily mean that all of them were used to access the web application.
Finally, as mentioned earlier, there is no silver bullet that works for every type of online fraudster. For example, in some applications, a Tor connection might be considered a red flag. However, if we are talking about HN visitors, many of them use Tor on a daily basis.
When the enshittification initially hit the fan, I had little flashbacks of Phil Zimmerman talking about Web of Trust and amusing myself thinking maybe we need humans proving they're humans to other humans so we know we aren't arguing with LLMs on the internet or letting them scan our websites.
But it just doesn't scale to internet size so I'm fucked if I know how we should fix it. We all have that cousin or dude in our highschool class who would do anything for a bit of money and introducing his 'friend' Paul who is in fact a bot whose owner paid for the lie. And not like enough money to make it a moral dilemma, just drinking money or enough for a new video game. So once you get past about 10,000 people you're pretty much back where we are right now.
I think it should be possible to build something that generalises the idea of Web of Trust so that it's more flexible, and less prone to catastrophic breakdown past some scaling limit.
Binary "X trusts Y" statements, plus transitive closure, can lead to long trust paths that we probably shouldn't actually trust the endpoints of. Could we not instead assign probabilities like "X trusts Y 95%", multiply probabilities along paths starting from our own identity, and take the max at each vertex? We could then decide whether to finally trust some Z if its percentage is more than some threshold T%. (Other ways of combining in-edges may be more suitable than max(); it's just a simple and conservative choice.)
Perhaps a variant of backprop could be used to automatically update either (a) all or (b) just our own weights, given new information ("V has been discovered to be fraudulent").
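The max-over-paths-of-products rule above maps directly onto a Dijkstra-style search, since multiplying probabilities ≤ 1 behaves like accumulating non-negative costs. A rough Python sketch using the simple max() combination; the edge weights are made up:

```python
import heapq

def trust_scores(edges, me):
    """Propagate trust as the max over paths of products of edge weights.

    edges: {(x, y): p} meaning "x trusts y with probability p" (0..1).
    Returns the best achievable trust score for every reachable node.
    Because all weights are <= 1, a best-first search pops each node
    at its final (highest) score, like Dijkstra on -log(p).
    """
    graph = {}
    for (x, y), p in edges.items():
        graph.setdefault(x, []).append((y, p))
    best = {me: 1.0}
    heap = [(-1.0, me)]  # max-heap via negated scores
    while heap:
        neg_score, node = heapq.heappop(heap)
        score = -neg_score
        if score < best.get(node, 0.0):
            continue  # stale heap entry
        for nxt, p in graph.get(node, []):
            cand = score * p
            if cand > best.get(nxt, 0.0):
                best[nxt] = cand
                heapq.heappush(heap, (-cand, nxt))
    return best

edges = {("me", "alice"): 0.95, ("alice", "bob"): 0.9, ("me", "bob"): 0.5}
scores = trust_scores(edges, "me")
# bob is better reached via alice: 0.95 * 0.9 = 0.855, beating direct 0.5
```

Long chains decay naturally here: a path of ten 0.95 links bottoms out around 0.6, so a threshold T handles the "transitive closure goes too far" problem without a hard hop limit.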
True. Perhaps a collective vote past 2 degrees of separation out, where multiple parties need to vouch for the same person before you believe they aren't a bot. Then you're using the exponential number of people to provide diminishing weight instead of increasing likelihood of malfeasance.
But do we need an infinite and global web of trust?
How about restricting them to everyone-knows-everyone sized groups, of like a couple hundred people?
One can be a member of multiple groups so you're not actually limited. But the groups will be small enough to self regulate.
> So there is a (IMHO) shady market out there that gives app developers on iOS, Android, MacOS and Windows money for including a library into their apps that sells users network bandwidth.
This is yet another reason why we need to be wary of popular apps, add-ons, extensions, and so forth changing hands, by legitimate sale or more nefarious methods. Initially innocent utilities can be quickly coopted into being parts of this sort of scheme.
Strange that HolaVPN, i.e. Brightdata, is not mentioned. They've been using user hosts for those purposes for over a decade, and also selling proxies en masse. Fun fact: they don't have any servers for the VPN. All the VPN traffic is routed through ... other users!
Hola is mentioned in the author's prior post on this topic, linked to at the top of TFA: https://jan.wildeboer.net/2025/02/Blocking-Stealthy-Botnets/
They are even the first to do it and the most litigious of all. Trying to push patents on everything possible, even on water if they can.
Is it really strange if the logo is right there in the article?
> I am now of the opinion that every form of web-scraping should be considered abusive behaviour and web servers should block all of them. If you think your web-scraping is acceptable behaviour, you can thank these shady companies and the “AI” hype for moving you to the bad corner.
Why jump to that conclusion?
If a scraper clearly advertises itself, follows robots.txt, and has reasonable backoff, it's not abusive. You can easily block such a scraper, but then you're encouraging stealth scrapers because they're still getting your data.
I'd block the scrapers that try to hide and waste compute, but deliberately allow those that don't. And maybe provide a sitemap and API (which besides being easier to scrape, can be faster to handle).
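A well-behaved scraper along those lines is short to sketch. A Python version using only the stdlib; the user agent string is a made-up example:

```python
import time
import urllib.error
import urllib.request
import urllib.robotparser

AGENT = "ExampleBot/1.0 (+https://example.com/bot)"  # hypothetical UA

def allowed(url, robots_txt):
    """Check an already-fetched robots.txt body for our user agent."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(AGENT, url)

def polite_fetch(url, robots_txt, delay=1.0, retries=3):
    """Fetch only if robots.txt allows, with exponential backoff on errors."""
    if not allowed(url, robots_txt):
        return None  # disallowed: a well-behaved scraper walks away
    for attempt in range(retries):
        try:
            req = urllib.request.Request(url, headers={"User-Agent": AGENT})
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.read()
        except urllib.error.URLError:
            time.sleep(delay * 2 ** attempt)  # back off before retrying
    return None
```

A clearly identifying user agent, honoring robots.txt, and backing off on errors are exactly the traits that make such a scraper trivial to block, which is the commenter's point about the perverse incentive.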
What is the point of app stores holding up releases for review if they don't even catch obvious malware like this?
They pretend to do a review to justify their 30% cartel tax.
Oh no, they review thoroughly, to make sure you don’t try to avoid the tax.
This isn't obvious; 99% of apps make multiple calls to multiple services, and these SDKs are embedded into the app. How can you tell what's legit outbound/inbound? Doing a fingerprint search for the worst culprits might help catch some, but it would likely be a game of cat and mouse.
> How can you tell what's legit outbound/inbound?
If the app isn't a web browser, none are legit?
Their marketing tells you it's for protection. What they omit is that it's for their revenue protection - observe that as long as you do not threaten their revenue models, or the revenue models of their partners, you are allowed through. It has never been about the users or developers.
Money
The definition of malware is fuzzy.
I've had some success catching most of them at https://visitorquery.com
I went to your website.
Is the premise that users should not be allowed to use vpns in order to participate in ecommerce?
Nobody said that; it's your choice to take whatever action fits your scenario. I have clients where VPNs are blocked, yes; it depends on the industry, fraud rate, chargeback rates, etc.
Checked my connection via VPN by Google/Cloudflare WARP: "Proxy/VPN not detected"
Could be, I don't claim 100% success rate. I'll have a look at one of those and see why I missed it. Thank you for letting me know.
Measuring latency between different endpoints? I see the WebRTC TURN relay request..
further reading
https://krebsonsecurity.com/?s=infatica
https://krebsonsecurity.com/tag/residential-proxies/
https://spur.us/blog/
https://bright-sdk.com/ <- way bigger than infatica
I thought the closed-garden app stores were supposed to protect us from this sort of thing?
That's what they want you to think.
Once again this demonstrates that closed gardens only benefit the owners of the garden, not the users.
What good is all the app vetting and sandbox protection in iOS (dunno about Android) if it doesn't really protect me from those crappy apps...
At the very least, Apple should require conspicuous disclosure of this kind of behavior that isn't just hidden in the TOS.
Sandboxing means you can limit network access. For example, on Android you can disallow wi-fi and cellular access (not sure about bluetooth) on a per-app basis.
Network access settings should really be more granular for apps that have a legitimate need.
App store disclosure labels should also add network usage disclosure.
Also, my reaction when the call is for Google, Apple, and Microsoft to fix this: DDoS being illegal, shouldn't the first reaction instead be to contact law enforcement?
If you treat platforms like they are all-powerful, then that's what they are likely to become...
If you find yourself in a walled garden, understand that you're the crop being grown and harvested.
Let me get this straight: we want computers knowing everything, to solve current and future problems, but we don't want to give them access to our knowledge?
I don't want computers to know everything. Most knowledge on the internet is false and entirely useless.
The companies selling us computers that supposedly know everything should pay for their database, or they should give away the knowledge they gained for free. Right now, the scraping and copying is free and the knowledge is behind a subscription to access a proprietary model that forms the basis of their business.
Humanity doesn't benefit, the snake oil salesmen do.
> Let me get this straight: we want computers knowing everything, to solve current and future problems, but we don't want to give them access to our knowledge?
Who said that?
There's basically two extremes:
1. We want access to all of human knowledge, now and forever, in order to monetise it and make more money for us, and us alone.
and
2. We don't want our freely available knowledge sold back to us, with no credits to the original authors.
Most people don’t want computers to know everything - ask the average person if they want more or less of their lives recorded and stored.
I don't want your computer to know everything about me, in fact.
Not sure we do.
How would I know if an app on my device was doing this?
Install a network monitor or go even deeper and sniff packets.
I feel like this could be automated. Spin up a virtual device on a monitored network. Install one app, click on some stuff for a while, uninstall, and move on to the next. If the app reaches out to a lot of random sites, then flag it.
Google could do this. I'm sure Apple could as well. Third parties could for a small set of apps
This is being done by a couple of SDKs, it'd be much easier to just find and flag those SDK files. Finding apps becomes a matter of a single pass scan over the application contents rather than attempting to bypass the VM detection methods malware is packed full of.
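That single-pass scan is simple in outline. A Python sketch treating an APK as the zip archive it is; the fingerprint strings here are placeholders, not real SDK package names (a real list would come from published malware research):

```python
import zipfile

# Hypothetical fingerprints: package paths or marker strings that would
# identify known bandwidth-selling SDKs. These are placeholders.
SDK_FINGERPRINTS = [
    b"com/example/proxysdk",
    b"peer-bandwidth-share",
]

def scan_apk(apk_file):
    """Return which fingerprints appear in any entry of an APK (a zip).

    Accepts a path or a file-like object, as zipfile.ZipFile does.
    """
    hits = set()
    with zipfile.ZipFile(apk_file) as apk:
        for name in apk.namelist():
            data = apk.read(name)
            hits.update(fp for fp in SDK_FINGERPRINTS if fp in data)
    return hits
```

Unlike dynamic analysis, this never runs the app, so the VM-detection tricks malware ships with don't apply; the trade-off is that packers and string obfuscation can defeat it.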
I think tech can still be beautiful in a less grandiose and "omniparadisical" way than people used to dream of. "A wide open internet, free as in speech this, free as in beer that, open source wonders, open gardens..." Well, there are a lot of incentives that fight that, and game theory wins. Maybe we download software dependencies from our friends, the ones we actually trust. Maybe we write more code ourselves--more homesteading families that raise their own chickens, jar their own pickled carrots, and code their own networking utilities. Maybe we operate on servers we own, or our friends own, and we don't get blindsided by news that the platforms are selling our data and scraping it for training.
Maybe it's less convenient and more expensive and onerous. Do good things require hard work? Or did we expect everyone to ignore incentives forever while the trillion-dollar hyperscalers fought for an open and noble internet and then wrapped it in affordable consumer products to our delight?
It reminds me of the post here a few weeks ago about how Netflix used to be good and "maybe I want a faster horse" - we want things to be built for us, easily, cheaply, conveniently, by companies, and we want those companies not to succumb to enshittification - but somehow when the companies just follow the game theory and turn everything into a TikToky neural-networks-maximizing-engagement-infinite-scroll-experience, it's their fault, and not ours for going with the easy path while hoping the corporations would not take the easy path.
it's funny, i've never heard of or thought about the possibility of this happening but actually in hindsight it seems almost too obvious to not be a thing.
This is nasty in other ways too. What happens when someone uses these P2B residential proxies to commit crimes that get traced back to you?
Anything incorporating anything like this is malware.
Many years ago cybercriminals used to hack computers to use them as residential proxies, now they purchase them online as a service.
In most cases they are used for conducting real financial crimes, but the police investigators are also aware that there is a very low chance that sophisticated fraud is committed directly from a residential IP address.
Couldn't Apple and Google (and, to a lesser extent, Microsoft) pretty easily shut down almost all the apps that steal bandwidth?
How can I detect such behaviour on my devices / in my home network?
I'd expect this to be against app store and google play rules, they are very picky.
Are ad blockers like AdBlock, uBlock effective against these?
i don't believe extensions can modify other extensions
It's a fair point but very dynamic to sort out. This needs a full research team to figure out. Or you know.. all of us combined!! It is definitely a problem.
TINFOIL: I've sometimes wondered if Azure or AWS used bots to push site traffic hits to generate money... they know you are hosted with them.. They have your info.. Send out bots to drive micro accumulation. Slow boil..
I think that's mostly that they don't care about having malicious bots on their networks as long as they pay.
GCE is rare in my experience. Most bots I see are on AWS. The DDOS-adjacent hyper aggressive bots that try random URLs and scan for exploits tend to be on Azure or use VPNs.
AWS is bad when you report malicious traffic. Azure has been completely unresponsive and didn't react, even for C&C servers.
Do you think there’s a realistic path forward for better transparency or detection—maybe at the OS level or through network-level anomaly detection?
I’m really struggling to understand how this is different than malware we’ve had forever. Can someone explain what’s novel about this?
That it's not being treated like malware.
In the sense that people are voluntarily installing and running this malware on their computers, rather than being tricked into running it? Is that the only difference?
They are still tricked into running it, since it's normally not an advertised "feature" of any app that uses such SDKs.
I think it is funny that the mobile OS is trying to be as secure as possible, but then they allow this to run on top
"Peer-to-business network"! Amazing. uBlock Origin gets rid of this, right?
>Apple, Microsoft and Google should act.
Do nothing, win.
They are the primary beneficiaries buying this data, since they are the largest AI players.
How is this not just illegal? Surely there’s something in GDPR that makes this not allowed.
IIUC, they do actually ask the user for permission
Which is ironic considering that I strongly disagree with one of the primary walled garden justifications, used particularly in the case of Apple, which amounts to "the end user is too stupid to decide on his own". Unfortunately, even if I disagree with it as a guiding principle sometimes that statement proves true.
It’s not about stupidity, but practicality. People can’t give informed consent for 100 ToS for different companies, and keep those up to date. That’s why there are laws.
No doubt in a dense wall of text that the user must accept to use the application, or worse is deemed to have accepted by using the application at all.
when the shit hits the fan, this seems like the product.
> So if you as an app developer include such a 3rd party SDK in your app to make some money — you are part of the problem and I think you should be held responsible for delivering malware to your users, making them botnet members.
I suspect that this goes for many different SDKs. Personally, I am really, really sick of hearing "That's a solved problem!", whenever I mention that I tend to "roll my own," as opposed to including some dependency, recommended by some jargon-addled dependency addict.
Bad actors love the dependency addiction of modern developers, and have learned to set some pretty clever traps.
I’m constantly amazed at how careless developers are with pulling 3rd party libraries into their code. Have you audited this code? Do you know everything it does? Do you know what security vulnerabilities exist in it? On what basis do you trust it to do what it says it is doing and nothing else?
But nobody seems to do this diligence. It’s just “we are in a rush. we need X. dependency does X. let’s use X.” and that’s it!
> Have you audited this code?
Wrong question. “Are you paid to audit this code?” And “if you fail to audit this code, who’se problem is it?”
I think developers are paid to competently deliver software to their employer, and part of that competence is properly vetting the code you are delivering. If I wrote code that ended up having serious bugs like crashing, I’d expect to have at least a minimum consequence, like root causing it and/or writing a postmortem to help avoid it in the future. Same as I’d expect if I pulled in a bad dependency.
Your expectations do not match the employment market as I have ever experienced it.
Have you ever worked anywhere that said "go ahead and slow down on delivering product features that drive business value so you can audit the code of your dependencies, that's fine, we'll wait"?
I haven't.
Yea, and that’s the problem. If such absolute rock bottom minimal expectations (know what the code does) are seen as too slow and onerous, the industry is cooked!
Yeah, about that, businesses are pushing and introducing code written by AI/LLM now, so now you won't even know what your own code does.
Due diligence is a sliding scale. Work at a webdev agency is "get it done as fast as possible for this MVP we need". Work at NASA or a biomedical device company? Every line of code is triple-checked. It's entirely dependent on the cost/benefit analysis.
"who'se" is wild.
If a car manufacturer sources a part from a third party, and that part has a serious safety problem, who will the customer blame? And who will be responsible for the recall and the repairs?
Malware, botnets… it is very similar. And people, including developers, are - in 80 percent of cases - eager to make money, because… Is greed good? No, it isn’t. It is a plague.
You're a developer who devoted time to develop a piece of software. You discover that you are not generating any income from it: few people can even find it in the sea of similar apps, few of those are willing to pay for it, and those who are willing to pay for it are not willing to pay much. To make matters worse, you're going to lose a cut of what is paid to the middlemen who facilitate the transaction.
Is that greed?
I can find many reasons to be critical of that developer, things like creating a product for a market segment that is saturated, and likely doing so because it is low hanging fruit (both conceptually and in terms of complexity). I can be critical of their moral judgement for how they decided to generate income from their poor business judgment. But I don't thinks it's right to automatically label them as greedy. They may be greedy, but they may also be trying to generate income from their work.
> Is that greed?
Umm, yes? You are not owed anything in this life, certainly not income for your choice to spend your time on building a software product no one asked for. Not making money on it is a perfectly fine outcome. If you desperately need guaranteed money, don't build an app expecting it to sell; get a job.
> If you desperately need guaranteed money, don't build an app expecting it to sell; get a job.
Technically true but a bit of perspective might help. The consumer market is distorted by free (as in beer) apps that do a bunch of shitty things that should in many cases be illegal or require much more informed consent than today, like tracking everything they can. Then you have VC-funded "free" as well, where the end game is to raise prices slowly to boil the frog. Then you have loss leaders from megacorps, and a general anti-competitive business culture.
Plus, this is not just in the Wild West shady places, like the old piratebay ads. The top result for ”timer” on the App Store (for me) is indeed a timer app, but with IAP of $800/y subscription… facilitated by Apple Inc, who gets 15-30% of the bounty.
Look, the point is it’s almost impossible to break into consumer markets because everyone else is a predator. It’s a race to the bottom, ripping off clueless customers. Everyone would benefit from a fairer market. Especially honest developers.
>$800/year IAP
That’s got to be money laundering or something else illicit? No one is actually paying that for a timer app?
No I think it’s designed to catch misclicks and children operating the phone and such, sold as $17/week possibly masquerading as one-time payment. They pay for App Store ads for it too.
I prefer to focus on the technical shortcomings.
We could have people ask for software in a more convenient way.
Not making money could be an indication the software isn't useful, but what if it is? What can the collective do in that zone?
I imagine one could ask and pay for unwritten software then get a refund if it doesn't materialize before your deadline.
Why is discovery (of many creations) willingly handed over to a handful of mega corps?? They seem to think I want to watch and read about Trump and Elon every day.
Promoting something because it is good is a great example of a good thing that shouldn't pay.
There was an earlier discussion on HN about whether advertising should be more heavily regulated (or even banned outright). I'm starting to wonder whether most of the problems on the Web are negative side effects of the incentives created by ads (including all botnets, except those that enable ransomware and espionage). Even the current worldwide dopamine addiction is driven by apps and content created for engagement, whose entire purpose is ad revenue.
These are kind of separate issues. Apps using Infatica know that they're selling access to their users' bandwidth. It's intentional.
This is especially true for script kiddies, which is why I am so thankful for https://e18e.dev/
AI is making this worse than ever though, I am constantly having to tell devs that their work is failing to meet requirements, because AI is just as bad as a junior dev when it comes to reaching for a dependency. It’s like we need training wheels for the prompts juniors are allowed to write.
I agree that there are things with too many dependencies and I try to avoid that. I think it is a good idea to minimize how many dependencies are needed (even indirect dependencies; however, in some cases a dependency is not a specific implementation, and in that case indirect dependencies are less of a problem, although having a good implementation with less indirect dependencies is still beneficial). I may write my own, in many cases. However, another reason for writing my own is because of other kind of problems in the existing programs. Not all problems are malicious; many are just that they do not do what I need, or do too much more than what I need, or both. (However, most of my stuff is C rather than JavaScript; the problem seems to be more severe with JavaScript, but I do not use that much.)
That may be true but I think you're missing the point here.
The "network sharing" behavior in these SDKs is the sole purpose of the SDK. It isn't being included as a surprise along with some other desirable behavior. What needs to stop is developers including these SDKs as a secondary revenue source in free or ad-supported apps.
> I think you're missing the point here
Doubt it. This is just one -of many- carrots that are used to entice developers to include dodgy software into their apps.
The problem is a lot bigger than these libraries. It's an endemic cultural issue. Much more difficult to quantify or fix.
"Bad actors love the dependency addiction of modern developers"
Brings a new meaning to dependency injection.
I mean, as far as patterns go, dependency injection is also quite bad.
I have found that the dependency injection pattern makes it far easier to write clean tests for my code.
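To make that concrete, here is a minimal sketch (all names are illustrative, not from anyone's actual codebase): because the clock is injected through the constructor, the test can substitute a fake and stays fully deterministic.

```python
import time

class Clock:
    """Real dependency: wraps the system clock."""
    def now(self):
        return time.time()

class SessionChecker:
    def __init__(self, clock):
        # The clock is injected, so tests can substitute a fake.
        self.clock = clock

    def is_expired(self, started_at, ttl):
        return self.clock.now() - started_at > ttl

class FakeClock:
    """Test double with a fixed, controllable time."""
    def __init__(self, t):
        self.t = t
    def now(self):
        return self.t

# Test: no sleeping, no real clock, fully deterministic.
checker = SessionChecker(FakeClock(t=1000))
assert checker.is_expired(started_at=0, ttl=500)       # 1000 - 0 > 500
assert not checker.is_expired(started_at=900, ttl=500) # 1000 - 900 <= 500
```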
Elaborate on this please. It seems a great boon in having pushed the OO world towards more functional principles, but I'm willing to hear dissent.
How is dependency injection more functional?
My personal beef is that most of the time it acts like hidden global dependencies, and the configuration of those dependencies, along with their lifetimes, becomes harder to understand by not being traceable in the source code.
Dependency injection is just passing your dependencies in as constructor arguments rather than as hidden dependencies that the class itself creates and manages.
It's equivalent to partial application.
An uninstantiated class that follows the dependency injection pattern is equivalent to a family of functions, where function k has N+Mk arguments: the N constructor parameters plus the Mk parameters of method k.
Upon instantiation by passing constructor arguments, you've created a family of functions, each with its own distinct set of Mk parameters, and the N arguments fixed in common.
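The equivalence can be shown in a few lines (a sketch; the names are illustrative): here N=1 constructor argument and one method with M1=1 parameter.

```python
from functools import partial

# DI style: N=1 constructor argument, a method with its own M1=1 parameter.
class Greeter:
    def __init__(self, greeting):   # dependency fixed at construction
        self.greeting = greeting
    def greet(self, name):          # M1 = 1
        return f"{self.greeting}, {name}!"

# Plain-function style: one function with N + M1 = 2 arguments.
def greet(greeting, name):
    return f"{greeting}, {name}!"

hello = Greeter("Hello")            # "instantiation" ...
hello_fn = partial(greet, "Hello")  # ... is partial application

assert hello.greet("Ada") == hello_fn("Ada") == "Hello, Ada!"
```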
> Dependency injection is just passing your dependencies in as constructor arguments rather than as hidden dependencies that the class itself creates and manages.
That's the best way to think of it fundamentally. But the main implication is that at some point something has to know how to resolve those dependencies - i.e. they can't just be constructed and then injected from magic land. So global cradles/resolvers/containers/injectors/providers (depending on your language and framework) are also typically part and parcel of DI, and that can have some big implications on the structure of your code that some people don't like. Also you can inject functions and methods, not just constructors.
I don't understand what you're describing has to do with dependency injection. See https://news.ycombinator.com/item?id=43740196.
> Dependency injection is just passing your dependencies in as constructor arguments rather than as hidden dependencies that the class itself creates and manages.
This is all well and good, but you also need a bunch of code that handles resolving those dependencies, which oftentimes ends up being complex and hard to debug and will also cause runtime errors instead of compile time errors, which I find to be more or less unacceptable.
Edit: to elaborate on this, I’ve seen DI frameworks not be used in “enterprise” projects a grand total of zero times. I’ve done DI directly in personal projects and it was fine, but in most cases you don’t get to make that choice.
Just last week, when working on a Java project that’s been around for a decade or so, there were issues after migrating it from Spring to Spring Boot - when compiled through the IDE and with the configuration to allow lazy dependency resolution it would work (too many circular dependencies to change the code instead), but when built within a container by Maven that same exact code and configuration would no longer work and injection would fail.
I’m hoping it’s not one of those weird JDK platform bugs but rather an issue with how the codebase is compiled during the container image build, but the issue is mind boggling. More fun, if you take the .jar that’s built in the IDE and put it in the container, then everything works, otherwise it doesn’t. No compilation warnings, most of the startup is fine, but if you build it in the container, you get a DI runtime error about no lazy resolution being enabled even if you hardcode the setting to be on in Java code: https://docs.spring.io/spring-boot/api/kotlin/spring-boot-pr...
I’ve also seen similar issues before containers, where locally it would run on Jetty and use Tomcat on server environments, leading to everything compiling and working locally but throwing injection errors on the server.
What’s more, it’s not like you can (easily) put a breakpoint on whatever is trying to inject the dependencies - after years of Java and Spring I grow more and more convinced that anything that doesn’t generate code that you can inspect directly (e.g. how you can look at a generated MapStruct mapper implementation) is somewhat user hostile and will complicate things. At least modern Spring Boot is good in that more of the configuration is just code, because otherwise good luck debugging why some XML configuration is acting weird.
In other words, DI can make things more messy due to a bunch of technical factors around how it’s implemented (also good luck reading those stack traces), albeit even in the case of Java something like Dagger feels more sane https://dagger.dev/ despite never really catching on.
Of course, one could say that circular dependencies or configuration issues are project specific, but given enough time and projects you will almost inevitably get those sorts of headaches. So while the theory of DI is nice, you can’t just have the theory without practice.
Inclined to agree. Consider that a singleton dependency is essentially a global, and differs from a traditional global only in that the reference is kept in a container and supplied magically via a constructor variable. Also consider that constructor calls are now outside the application-layer frames of the call stack, in case you want to trace execution.
Dependency injection is not hidden. It's quite the opposite: dependency injection lists explicitly all the dependencies in a well defined place.
Hidden dependencies are: untyped context variable; global "service registry", etc. Those are hidden, the only way to find out which dependencies given module has is to carefully read its code and code of all called functions.
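The contrast can be sketched like this (hypothetical names; REGISTRY stands in for a global service-registry style):

```python
# Hidden dependency: the class reaches into a global registry.
REGISTRY = {"mailer": lambda msg: print("sent:", msg)}

class SignupHidden:
    def register(self, user):
        # Invisible from the class's signature; you must read the body to find it.
        REGISTRY["mailer"](f"welcome {user}")

# Explicit dependency: everything the class needs is listed in one place.
class SignupExplicit:
    def __init__(self, mailer):
        self.mailer = mailer  # the constructor IS the dependency list
    def register(self, user):
        self.mailer(f"welcome {user}")

sent = []
SignupExplicit(mailer=sent.append).register("ada")
assert sent == ["welcome ada"]
```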
Because you’re passing functions to call.
??? What functions?
To me it‘s rather anti-functional. Normally, when you instantiate a class, the resulting object’s behavior only depends on the constructor arguments you pass it (= the behavior is purely a function of the arguments). With dependency injection, the object’s behavior may depend on some hidden configuration, and not even inspecting the class’ source code will be able to tell you the source of that behavior, because there’s only an @Inject annotation without any further information.
Conversely, when you modify the configuration of which implementation gets injected for which interface type, you potentially modify the behavior of many places in the code (including, potentially, the behavior of dependencies your project may have), without having passed that code any arguments to that effect. A function executing that code suddenly behaves differently, without any indication of that difference at the call site, or traceable from the call site. That’s the opposite of the functional paradigm.
> because there’s only an @Inject annotation without any further information
It sounds like you have a gripe with a particular DI framework and not the idea of Dependency Injection. Because
> Normally, when you instantiate a class, the resulting object’s behavior only depends on the constructor arguments you pass it (= the behavior is purely a function of the arguments)
With Dependency Injection this is generally still true, even more so than normal because you're making the constructor's dependencies explicit in the arguments. If you have a class CriticalErrorLogger(), you can't directly tell where it logs to, is it using a flat file or stdout or a network logger? If you instead have a class CriticalErrorLogger(logger *io.writer), then when you create it you know exactly what it's using to log because you had to instantiate it and pass it in.
Or like Kortilla said, instead of passing in a class or struct you can pass in a function, so using the same example, something like CriticalErrorLogger(fn write)
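In Python terms the same idea looks like this (a sketch; the class and parameter names follow the example above): the call site decides where logs go, and the class never knows or cares.

```python
import io
import sys

class CriticalErrorLogger:
    def __init__(self, writer):
        # Anything with a .write() method works: a file, a socket, sys.stderr...
        self.writer = writer

    def log(self, msg):
        self.writer.write(f"CRITICAL: {msg}\n")

# Inject an in-memory buffer (e.g. for a test):
buf = io.StringIO()
CriticalErrorLogger(buf).log("disk full")
assert buf.getvalue() == "CRITICAL: disk full\n"

# Same class, real destination: just pass a different writer.
CriticalErrorLogger(sys.stderr).log("disk full")
```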
I don't quite understand your example, but I don't think the particulars make much of a difference. We can go with the most general description: With dependency injection, you define points in your code where dependencies are injected. The injection point is usually a variable (this includes the case of constructor parameters), whose value (the dependency) will be set by the dependency injection framework. The behavior of the code that reads the variable and hence the injected value will then depend on the specific value that was injected.
My issue with that is this: From the point of view of the code accessing the injected value (and from the point of view of that code's callers), the value appears like out of thin air. There is no way to trace back from that code where the value came from. Similarly, when defining which value will be injected, it can be difficult to trace all the places where it will be injected.
In addition, there are often lifetime issues involved, when the injected value is itself a stateful object, or may indirectly depend on mutable, cached, or lazy-initialized, possibly external state. The time when the value's internal state is initialized or modified, or whether or not it is shared between separate injection points, is something that can't be deduced from the source code containing the injection points, but is often relevant for behavior, error handling, and general reasoning about the code.
All of this makes it more difficult to reason about the injected values, and about the code whose behavior will depend on those values, from looking at the source code.
> whose value (the dependency) will be set by the dependency injection framework
I agree with your definition except for this part, you don't need any framework to do dependency injection. It's simply the idea that instead of having an abstract base class CriticalErrorLogger, with the concrete implementations of StdOutCriticalErrorLogger, FileCriticalErrorLogger, AwsCloudwatchCriticalErrorLogger which bake their dependency into the class design; you instead have a concrete class CriticalErrorLogger(dep *dependency) and create dependency objects externally that implement identical interfaces in different ways. You do text formatting, generating a traceback, etc, and then call dep.write(myFormattedLogString), and the dependency handles whatever that means.
I agree with you that most DI frameworks are too clever and hide too much, and some forms of DI like setter injection and reflection based injection are instant spaghetti code generators. But things like Constructor Injection or Method Injection are so simple they often feel obvious and not like Dependency Injection even though they are. I love DI, but I hate DI frameworks; I've never seen a benefit except for retrofitting legacy code with DI.
And yeah it does add the issue or lifetime management. That's an easy place to F things up in your code using DI and requires careful thought in some circumstances. I can't argue against that.
But DI doesn't need frameworks or magic methods or attributes to work. And there's a lot of situations where DI reduces code duplication, makes refactoring and testing easier, and actually makes code feel less magical than using internal dependencies.
The basic principle is much simpler than most DI frameworks make it seem. Instead of initializing a dependency internally, receive the dependency in some way. It can be through overly abstracted layers or magic methods, but it can also be as simple as adding an argument to the constructor or a given method that takes a reference to the dependency and uses that.
edit: made some examples less ambiguous
The pattern you are describing is what I know as the Strategy pattern [0]. See the example there with the Car class that takes a BrakeBehavior as a constructor parameter [1]. I have no issue with that and use it regularly. The Strategy pattern precedes the notion of dependency injection by around ten years.
The term Dependency Injection was coined by Martin Fowler with this article: https://martinfowler.com/articles/injection.html. See how it presents the examples in terms of wiring up components from a configuration, and how it concludes with stressing the importance of "the principle of separating service configuration from the use of services within an application". The article also presents constructor injection as only one of several forms of dependency injection.
That is how everyone understood dependency injection when it became popular 10-20 years ago: A way to customize behavior at the top application/deployment level by configuration, without having to pass arguments around throughout half the code base to the final object that uses them.
Apparently there has been a divergence of how the term is being understood.
[0] https://en.wikipedia.org/wiki/Strategy_pattern
[1] The fact that Car is abstract in the example is immaterial to the pattern, and a bit unfortunate in the Wikipedia article, from a didactic point of view.
They're not really exclusive ideas. The Constructor Injection section in Fowler's article is exactly the same as the Strategy pattern. But no one talks about the Strategy pattern anymore, it's all wrapped into the idea of DI and that's what caught on.
I'm curious, which language/dev communities did you pick this up from? Because I don't think it's universal, certainly not in the Java world.
DI in Java is almost completely disconnected from what the Strategy pattern is, so it doesn't make sense to use one to refer to the other there.
It was interesting reading this exchange. I have a similar understanding of DI to you. I have never even heard of a DI framework and I have trouble picturing what it would look like. It was interesting to watch you two converge on where the disconnect was.
How is the configuration hidden? Presumably you configured the DI container.
It starts off feeling like a superpower, allowing you to change a system's behaviour without changing its code directly. Every time I've encountered it, though, it has quickly devolved into a maintenance nightmare.
I'm talking more specifically about Aspect Oriented Programming though and DI containers in OOP, which seemed pretty clever in theory, but have a lot of issues in reality.
I take no issues with currying in functional programming.
[flagged]
AI scrapers and "sneaker bots" are just the tip of the iceberg. Why are all these entities concentrated and metastasizing from just a few superhubs? Why do they look, smell and behave like state-level machinery? If you've researched you'll know exactly what I'm talking about.
Unless complicit, tech leaders (Apple Google Microsoft) have a duty to respond swiftly and decisively. This has been going on far too long.
"Infatica is partnered with Bitdefender, a global leader in cybersecurity, to protect our SDK users from malicious web traffic and content, including infected URLs, untrusted web pages, fraudulent and phishing links, and more."
That's not good.
[flagged]
@dang can this entire account be banned please
the next time you copy paste an ad remove the trailing quotation mark