Jason A. Bloomer
AI Snowballing: A new threat to search engines
Published: May 22, 2025
Procedural Generation, Applied to the Web
If you are at all familiar with modern PC games or their development, you have likely heard of Procedural Generation. Many games use it to "create" new content algorithmically, as in the Diablo or Path of Exile games - whose claims to fame are their "random" procedurally generated maps. Generally speaking, this means the map and your environment are different every time you play. As a fun experiment, what if we applied this concept to web development? What would that look like, and what could you do with it? What would search engines do with it?

I had this thought a few years ago, but I recently decided to revisit the idea. It actually stems from a lesser-known network security technique called Tarpitting. The implementations that sparked this use Markov text chains to generate useless text on 404 pages, to "poison" scrapers that collect LLM training data without permission. That got me thinking: what if we used an LLM to generate the text? Instead of creating a "poisoned well", we could do almost the opposite and create what appears to be a "gold mine" to scrapers and indexing bots. Fundamentally the tech stack is exactly the same, and we can use their code as a jumping-off point to get started.

The efficacy and ethics of this are very much up for debate. I do not condone this particular use of LLMs; it merely makes for a fun experiment that offers much deeper insight into how search engines like Google actually index pages and calculate rankings, while also calling attention to a flaw with the potential to be exploited nefariously. Search drives a lot of revenue in 2025, and Google has battled "SEO hacks" and "black-hat" techniques for the entirety of its existence. This is one of a new wave of such methods, and it is only possible with the use of advanced LLMs.

DISCLAIMER: Utilizing any of the following techniques or methods on a live website could result in that website being entirely delisted and/or blacklisted from ALL search engines. Moreover, it is generally accepted that this is not how LLMs should be used. This article exists for research and educational purposes only.
Putting It Together
I will briefly cover how the whole system works, because it may be interesting to someone. If you don't want technical details, skip to the next section.

One might think this a difficult task, but the reality is quite the opposite. Most of the functional code is contained in a single PHP file. I added a homepage and a search page after the fact, but they aren't truly necessary; they really only exist to make indexing the generated content a bit easier for search engines. To get started, you only need a machine running Apache with PHP, and an OpenAI or xAI API key. The system is so simple overall that someone with zero programming or webdev knowledge could probably maintain it with ease.

The bulk of the work is done by generate.php, which has only two basic tasks:
  • Get the requested URL, and see if we have already generated and cached a page for it. If so, fetch and display it.
  • If not, we send a request to generate content for the page via an LLM API, then cache and display the output.
First we need to ensure that all URL requests, including ones that do not actually exist, are sent to generate.php, while still responding to each client with a "200 OK" code.
To do this, we need to modify our .htaccess rules as follows:
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ generate.php [L]

And that's... basically it. If we now navigate to a non-existent URL on our server, like https://test.com/This-Is-A-Test-URL/, the generation file should receive the request.
We can now use a carefully crafted set of instructions in our prompt to ask Grok or ChatGPT to write some content pertaining to whatever the topic of the page is. We can even ask it to pre-format the output as HTML, so that we can cache the response to disk and serve it statically to reduce token usage.
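To make that concrete, here is a rough sketch of what a generate.php along these lines could look like. The cache layout, model name, prompt wording, and the LLM_API_KEY variable are placeholders I've picked for this article, not the exact code running on my site; the endpoint shown is xAI's OpenAI-compatible chat completions API, and any compatible provider would work the same way.

<?php
// generate.php - a minimal sketch of the two tasks described above.
// Cache layout, model name, prompt wording, and LLM_API_KEY are placeholders.

$apiKey    = getenv('LLM_API_KEY');
$cacheDir  = __DIR__ . '/cache';
$slug      = trim(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH), '/') ?: 'home';
$topic     = str_replace('-', ' ', rawurldecode($slug));
$cacheFile = $cacheDir . '/' . rawurlencode($slug) . '.html';

// Task 1: if we've already generated this "page", serve the cached copy.
if (is_file($cacheFile)) {
    echo file_get_contents($cacheFile);
    exit;
}

// Task 2: otherwise ask the LLM to write the page, then cache and serve it.
$payload = json_encode([
    'model'    => 'grok-3-mini', // placeholder model name
    'messages' => [
        ['role' => 'system', 'content' => 'You write encyclopedia-style articles as complete HTML documents.'],
        ['role' => 'user',   'content' => "Write an article about: {$topic}"],
    ],
]);

$ch = curl_init('https://api.x.ai/v1/chat/completions'); // OpenAI-compatible endpoint
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST           => true,
    CURLOPT_HTTPHEADER     => [
        'Content-Type: application/json',
        'Authorization: Bearer ' . $apiKey,
    ],
    CURLOPT_POSTFIELDS     => $payload,
]);
$response = json_decode(curl_exec($ch), true);
curl_close($ch);

$html = $response['choices'][0]['message']['content'] ?? '<p>Generation failed.</p>';

if (!is_dir($cacheDir)) {
    mkdir($cacheDir, 0755, true);
}
file_put_contents($cacheFile, $html);

http_response_code(200); // every request gets a "200 OK", even for pages that never existed
echo $html;

Serving the cached copy on every subsequent request is what keeps token usage (and cost) roughly proportional to the number of unique URLs crawled, rather than to raw traffic.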
The Snowball Effect
One of the cool aspects of this project is that we are essentially exploiting the behavior of these indexing bots and turning it to our advantage.

Any time a bot finds a link on our site, it will eventually navigate to its destination. If the bot navigates to a "page" that doesn't yet exist, our site generates the content that should be there. We can simply tell the LLM to ensure that each page contains some minimum number of links to non-existent pages (a prompt along the lines of the sketch below), thus driving the indexing bot ever deeper. It creates an infinite loop of sorts.
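For the sake of illustration, an instruction like the following in the system prompt is all it takes to keep the loop going; the exact wording and the minimum link count here are arbitrary choices of mine, not the prompt used on the live site.

<?php
// Illustrative system prompt; wording and link count are arbitrary.
$systemPrompt = <<<PROMPT
You write encyclopedia-style articles as complete HTML documents.
Every article must contain at least 10 internal links of the form
<a href="/Some-Related-Topic/">Some Related Topic</a>, where each linked
slug is a plausible related topic that may not yet exist on this site.
Never link to external domains.
PROMPT;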

This starts what I would call a "snowball effect" - a single small 20 KB script that can generate its own content and turn into a massive text repository, all on its own. Thus I have dubbed the concept "AI Snowballing", a term that will also cover other similar implementations of LLM technology which I am sure we will see eventually.
Deriving A Sense Of Purpose
We now have a procedurally generated website which (at least as long as you have API credits) has content for any and ALL keyword combinations that could ever be typed into a search box. Moreover, the content itself isn't just random garbage like most tarpit scripts produce - it has actually been crafted intelligently, presents somewhat accurate facts, and cites "sources", although oftentimes they are just 404s. The most obvious use case for this tech was a sort of Wikipedia-like site, where pages aren't "edited" but rather are periodically regenerated by LLMs and served statically.

And thus, Grok-Pedia.com was born - the first entirely AI-managed wiki.

The main unknowns and areas of intrigue with this project obviously lie outside the project itself: How will search engines like Google and Bing react to such a site? How will they rank it? Based on what we already know, we can make a couple of educated guesses:

Two of the primary metrics they use to determine ranking are click-through rates (CTR) for given keywords and in-links (that is, other sites linking to your site), neither of which we are artificially creating here. We don't have any verifiable popularity or an established presence in any given keyword-space, which will be one of the biggest hurdles for sites like this when attempting to saturate search engines. This is a good thing, but it's not all roses and sunshine.

What we lack in terms of an established presence in Google's eyes, we could partially make up for with the sheer breadth and scope of content we are capable of serving. Given the overwhelming simplicity of this site's inner workings, it is entirely possible that someone intent on disruption could use a single low-powered host to create dozens, if not hundreds, of instances of this site. If they all linked to each other, and you had some additional external sites you could put links on, you could likely "fudge" that aspect of search too - and it's not clear to me how Google could ever detect it.

I have no desire for anarchic disruption, so obviously I have forgone this as a final step. Someone, though, will eventually try it.
Findings and Results
In its first day, the site generated roughly 140 pages from interacting with bots. On the second day, this figure jumped to over 1,500.
Midway into the third day, the site had burned through all $150 worth of my free xAI API credits, totaling over 16,000 pages. I shut off page generation around 2 PM, so this figure likely could have been nearly double had it been allowed to finish out the day.
The rate of page growth, based on this extremely limited sample, appears to be just over 10x every 24 hours (roughly 140 to 1,500 to 16,000+).

Unfortunately, the info we're truly after comes from Google, and I would need to wait a bit for the metrics to be generated. Due to how fast these pages were created, it was absolutely necessary to programmatically create a sitemap to speed up the indexing process. Once it was submitted, I would still need to wait a few weeks or months to see how the site performs. Google won't index all the pages at once either - rather, they seem to process an increasing number of pages each day. This indexing stage is the most likely place where Google and others might catch this sort of scheme, but we seem to be in the clear.
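For anyone curious, the sitemap step can be as simple as walking the cache directory and emitting one <url> entry per generated page. The sketch below assumes the cache layout from the earlier generate.php sketch and a placeholder domain; it is not the exact script I used.

<?php
// build_sitemap.php - sketch of the programmatic sitemap step.
// Assumes one URL-encoded, slug-named .html file per generated page (see generate.php sketch).

$base     = 'https://grok-pedia.com'; // placeholder domain
$cacheDir = __DIR__ . '/cache';

$xml = new XMLWriter();
$xml->openURI(__DIR__ . '/sitemap.xml');
$xml->startDocument('1.0', 'UTF-8');
$xml->startElement('urlset');
$xml->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');

foreach (glob($cacheDir . '/*.html') ?: [] as $file) {
    $slug = basename($file, '.html'); // already URL-encoded by generate.php
    $xml->startElement('url');
    $xml->writeElement('loc', $base . '/' . $slug . '/');
    $xml->writeElement('lastmod', date('Y-m-d', filemtime($file)));
    $xml->endElement(); // </url>
}

$xml->endElement(); // </urlset>
$xml->endDocument();
$xml->flush();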

I have heard rumors that Google uses LLMs to try to "detect" generated content like this - if they do, it doesn't seem to be working in this case. Without correspondence with these companies, I can only speculate on the mechanics. Even so, one would still have to wonder what could happen if an actor were using a more advanced LLM than they could detect.

Initially, the site only got hits for very odd keyword combos that didn't have many other existing search results. After a few days, I noticed that I was picking up some higher-level keywords for more obscure events, people, and places. About two weeks in, we got our first real spike of impressions, jumping from a mere 7 to 185. Around the same time, I noticed we were picking up keywords for much more prominent subjects like "Microsoft Corporation" and "Facebook Network". Our CTR is still very low, though since the website doesn't have any content that would be valuable to a human, it probably always will be (and should be). This metric will likely matter even more to search engines in the future than it already does.

This experiment is still ongoing, so more research will be done into this type of "attack", and the details will be posted later on.
Theoretical Outcomes
This project exemplifies the fragility of the internet in its current state - I shudder to think of what would happen if I were to post this code publicly, but my doing so isn't even necessary. Someone with 15 minutes of spare time and a moderately low amount of brain-power could easily come up with a similar system. If that person were to spin up 500+ instances of the site, supplied with ample API credits, the internet could be effectively flooded overnight. And this completely ignores the potential for these sites to spread browser-borne malware or other nefarious code, which I will address later in a "Part 2" to this article.

What effect this would have is difficult to say. The most obvious one would be heavily decreased use of search in daily life, with most individuals likely opting to navigate directly to sites they trust or know from prior experience. It could also push users away from search engines and toward LLMs like ChatGPT or Grok to find content on the web at large. It may also affect domain pricing or the registration process: registrars may start to ask questions about what type of site you will be hosting, and hosts might do the same.

Regardless of how you look at it, this is a small piece of the next generation of web-related SEO attacks at large, and it is a problem that will only become more difficult to deal with as time goes on.