
LLMs ruin text-based browsing

I am an Emacs user. I have recently started to do more of my browsing in EWW, Emacs' built-in browser. It is a text-based browser that also shows images, but it does not run JavaScript. It's not for running webapps, but it is a great browser for web pages. It has a nice reader mode that focuses on the content of the page and gets rid of clutter. It also integrates well with my RSS reader (elfeed) and makes it easy to open links I find in other documents within my text editor.

Large Language Models ("AI") are making life harder for people who host websites. Scraping their content adds cost for the site owners, who have to pay for the data transfer or spin up extra infrastructure so the site does not go down under the extra load. The scrapers tend not to respect the robots.txt files that tell them not to scrape the sites. The legality of this scraping under current copyright law is questionable. The LLMs' owners are especially interested in software source code, since this is an area where companies and users are willing to pay serious money to use LLMs. Free software git forges like codeberg.org and sourcehut.org are scraped so hard that it almost amounts to a DDoS attack. One of the companies doing this is Microsoft, which scrapes to improve its Copilot LLM. Microsoft also owns the proprietary git forge GitHub, the market-leading, near-monopoly competitor of these free forges.
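To show what respecting robots.txt would look like, here is a minimal sketch in Python using the standard library's urllib.robotparser. The robots.txt content, the bot name "ExampleBot" and the example.org URLs are made up for illustration; they are not taken from any real site or scraper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is asked to stay away entirely,
# everyone else is only asked to skip /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the robots.txt above permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed("ExampleBot", "https://example.org/repo"))
    print(allowed("SomeBrowser", "https://example.org/repo"))
```

A well-behaved crawler makes a check like this before every request. Nothing enforces it, though: robots.txt is only a request, which is why scrapers can simply ignore it.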

To combat LLM content scraping, many websites use a program called Anubis. It runs some JavaScript in the user's browser to check whether the visitor is in fact a human with a browser and not an LLM scraper that has changed the user-agent string that identifies which browser is being used. Anubis demands that this JavaScript is run before it serves the content. This means that when I go to a site that uses Anubis with EWW, I get a message that I have to turn on JavaScript, which I cannot do since I use a text-based browser. So in effect, LLMs ruin text-based browsing.

© Einar Mostad 2010 - 2025. Content is licensed under the terms of CC BY SA except code which is GNU GPL v3 or later.
Made with GNU Emacs, Org-Static-Blog, and Codeberg Pages on GNU/Linux.