Natural Patterns Pt. 2
I’ve decided to write a relatively brief follow-up to last week’s Natural Patterns in Cookie Stuffing Schema. Today I’d like to discuss something I’m going to call throttled click-through ratio patterns. As far as I know, this concept is not being discussed in the open and procedures to combat their obvious footprint have not been implemented into any cookie stuffing applications that are currently on the market.
In their most basic form, throttled CTR patterns are the tell-tale signs that a cookie stuffing application is controlling the click-through ratio by decreasing the number of users who are cookie stuffed while maintaining the number of impression pixels that are shown. The expressed intention of this method is to avoid detection by attempting to model the incoming traffic to affiliate networks like the eBay Partner Network and Amazon Associates so that it appears to be entirely legitimate traffic.
While CTR throttling is certainly a bright idea and one that has only been implemented into a small number of cookie stuffing applications, it has one major pitfall: in almost all cases the click traffic it creates is very easy to detect. To discuss this in greater detail, it is helpful to understand a little bit of math. For example, if we throttle the CTR to 10%, that means that for every 100 visitors to the website, 10 will have their cookies stuffed. This, in and of itself, is fine. However, it is the linear order in which these cookies are stuffed that can become problematic. If exactly one visitor in every ten is stuffed and the other nine receive an impression pixel, it becomes very obvious, very quickly, that something is awry, especially if that pattern repeats itself. It becomes even more obvious as the number of hits increases.
As a quick example, if you receive 20 visitors and if your CTR is throttled at 10% without any kind of footprint randomization, this pattern emerges:
- Impression pixel
- Impression pixel
- Impression pixel
- Impression pixel
- Impression pixel
- Impression pixel
- Impression pixel
- Impression pixel
- Impression pixel
- Click
- Impression pixel
- Impression pixel
- Impression pixel
- Impression pixel
- Impression pixel
- Impression pixel
- Impression pixel
- Impression pixel
- Impression pixel
- Click
Obviously, this type of emergent pattern would be pretty easy to discover. Even if the clicks don’t fall on every multiple of 10, a pattern could still emerge. In my personal and humble opinion, the best way to create CTR throttling without being easily discovered would be to randomize the 10% out over both time and a larger number of visitors. That way, at the end of the day, the same number of impression pixels are shown and the same number of clicks are made, but the order in which they occurred would not be as noticeable.
As always, if anyone would like to discuss in more detail methods for avoiding click-through ratio pattern detection, feel free to comment.
After dissent was expressed on the BHW forum, I decided to go into a little more depth of the ideas of randomizing CTR in an attempt to make it undetectable. Here is what I wrote:
“It would depend on the scale and variance of the parameters, but yes, this would be the general idea. The problem comes when you keep in mind the relatively large amount of traffic data they are going to have collected on any site that creates a reasonable amount of money. I’m not sure how familiar many of you are with probability theory, but one of the most basic concepts is that patterns quickly emerge out of systems with finite limits. That is to say, even ‘random distributions’ aren’t truly random at all.
There are algorithms at work to discern the two. The most common method for this kind of analysis is something called Markov chain modeling. In very simple terms, it takes all previous data, creates a current state, and extrapolates a series of models of potential outcomes based on that state. If the new data that comes in correlates with one of those models, the algorithm can ascertain with some degree of accuracy whether or not the traffic that you’re providing them is in line with the traffic data they’ve gathered from all of their other affiliates.
I suppose whether or not any cookie stuffing program currently on the market can defeat this is still up in the air, though my experience with Markov chain models and genetic algorithms says that it’s only a matter of time before they are discovered, as each would have a very distinctive footprint if they weren’t designed by people who have a relatively deep understanding of statistical analysis. I’m not talking about programs that will work undetected for 3 or 6 months, I’m talking about changing the game entirely by integrating the same methods the mathematicians and scientists use to randomize their statistics and environments.
If CTR throttling were to be implemented in any program with the distinct intention of staying undetected, the first thing that I would try to code would involve integrating a form of a mathematical concept called the drunkard’s walk. The name itself is a great way to describe what it is. Let’s say we are in a finite system like one section of 100 city blocks. A drunk walks out of a bar and onto the street looking for home. At every corner, he has the option to go in one of four directions. Because he is so drunk, his decisions are not based on any sort of logic and he walks what is essentially “randomly.” The question that is often asked is, does he ever get home? Because the size of the system is finite, he does find home, and not through a process of elimination, but rather just by meandering. If we consider home to be our established CTR and the drunkard to be the CTR second-to-second, minute-to-minute, or hour-to-hour, I think you can see how such methods might be useful in creating seemingly random traffic with an expressed purpose. We could ensure that this “drunken” CTR didn’t get beyond, say, 50%, but by doing so, you’d be reducing the size of the system and therefore making it easier to detect.
This isn’t something that can be simply done with one line of code.”