Gain Deep Insight into the Robots Crawling Your Apache Web Server
Two kinds of robots crawl your website: good bots and bad bots.
Good bots identify themselves in their user agent string and obey the rules set forth in your robots.txt file. They also provide some kind of value to your company in return for the bandwidth required to serve them. For example, you typically want Googlebot to crawl your site so that it shows up in search engine results.
Bad bots, on the other hand, don’t play by the rules. They not only consume server resources to the detriment of your human users, but often scrape proprietary information for their own use. Their intent can be even more malicious, including denial-of-service attacks and automated scanning for security vulnerabilities.

In this article, we’ll survey several techniques for identifying both good and bad bots by analyzing Apache log data. Once you’ve identified the bots on your own site, you can optimize the good ones by altering your robots.txt file, or block the bad ones by IP address in .htaccess.
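For the blocking case, a minimal .htaccess sketch using Apache 2.4’s Require directives might look like the following (the address is a placeholder; on Apache 2.2 you would use Order and Deny directives instead):

<RequireAll>
  # Allow everyone by default...
  Require all granted
  # ...except the offending client (placeholder address)
  Require not ip 192.0.2.10
</RequireAll>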
Finding robots is a more advanced application of analyzing traffic metrics, so this article assumes you’ve already read through the basics in Apache Traffic Analysis.
Identifying Good Bots
Well-behaved bots identify themselves in the user agent portion of the combined log format. This makes it relatively straightforward to isolate log entries created by good bots:
_sourceCategory=Apache/Access ("Googlebot" OR "AskJeeves" OR "Digger" OR "Lycos" OR "msnbot" OR "Inktomi Slurp" OR "Yahoo" OR "Nutch" OR "bingbot" OR "BingPreview" OR "Mediapartners-Google" OR "proximic" OR "AhrefsBot" OR "AdsBot-Google" OR "Ezooms" OR "AddThis.com" OR "facebookexternalhit" OR "MetaURI" OR "Feedfetcher-Google" OR "PaperLiBot" OR "TweetmemeBot" OR "Sogou web spider" OR "GoogleProducer" OR "RockmeltEmbedder" OR "ShareThisFetcher" OR "YandexBot" OR "rogerbot-crawler" OR "ShowyouBot" OR "Baiduspider" OR "Sosospider" OR "Exabot") | parse regex "\"[A-Z]+\s+\S+\s+HTTP/[\d\.]+\"\s+\S+\s+\S+\s+\S+\s+\"(?<agent>[^\"]+?)\"" | parse regex field=agent "(?<bot_name>facebook)externalhit?\W+" nodrop | parse regex field=agent "Feedfetcher-(?<bot_name>Google?)\S+" nodrop | parse regex field=agent "(?<bot_name>PaperLiBot?)/.+" nodrop | parse regex field=agent "(?<bot_name>TweetmemeBot?)/.+" nodrop | parse regex field=agent "(?<bot_name>msn?)bot\W" nodrop | parse regex field=agent "(?<bot_name>Nutch?)-.+" nodrop | parse regex field=agent "(?<bot_name>Google?)bot\W" nodrop | parse regex field=agent "Feedfetcher-(?<bot_name>Google?)\W" nodrop | parse regex field=agent "(?<bot_name>Yahoo?)!\s+Slurp[;/].+" nodrop | parse regex field=agent "(?<bot_name>bing?)bot\W" nodrop | parse regex field=agent "(?<bot_name>Bing?)Preview\W" nodrop | parse regex field=agent "(?<bot_name>Sogou?)\s+web\s" nodrop | parse regex field=agent "(?<bot_name>Yandex?)Bot\W" nodrop | parse regex field=agent "(?<bot_name>rogerbot?)\W" nodrop | parse regex field=agent "(?<bot_name>AddThis\.com?)\s+robot\s+" nodrop | parse regex field=agent "(?<bot_name>ShareThis?)Fetcher/.+" nodrop | parse regex field=agent "(?<bot_name>Ahrefs?)Bot/.+" nodrop | parse regex field=agent "(?<bot_name>MetaURI?)\s+API/.+" nodrop | parse regex field=agent "(?<bot_name>Showyou?)Bot\s+" nodrop | parse regex field=agent "(?<bot_name>Google?)Producer;" nodrop | parse regex field=agent "(?<bot_name>Ezooms?)\W" nodrop | parse regex field=agent "(?<bot_name>Rockmelt?)Embedder\s+" nodrop | parse regex field=agent "(?<bot_name>Sosospider?)\W" nodrop | parse regex field=agent "(?<bot_name>Baidu?)spider" nodrop | parse regex field=agent "(?<bot_name>Exabot?)\W" | where bot_name != "" | if (bot_name="bing","Bing",bot_name) as bot_name | count as hits by bot_name | sort by hits | limit 20</bot_name></bot_name></bot_name></bot_name></bot_name></bot_name></bot_name></bot_name></bot_name></bot_name></bot_name></bot_name></bot_name></bot_name></bot_name></bot_name></bot_name></bot_name></bot_name></bot_name></bot_name></bot_name></bot_name></bot_name></bot_name></agent>
This is a long query, but don’t be intimidated. All it’s doing is extracting the user agent from every log message and looking for bot-specific strings. Running this query in Sumo Logic will return a list of the top 20 bots crawling your site:

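To see what those parse expressions are matching against, here is a hypothetical combined-format log entry from a well-behaved crawler (the IP address, timestamp, and path are made up for illustration; the final quoted field is Googlebot’s published user agent string):

192.0.2.56 - - [18/Mar/2024:09:12:45 -0700] "GET /products.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"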
Obviously, you’re not going to want to rewrite this query every time you want to analyze your bot traffic. Instead, you can save it to your Sumo Logic library by clicking the Save As button underneath the search bar.
Analyzing Bot Traffic Volume
The above query gives you some idea about who’s crawling your site, but to do anything useful with that information, we need to dig deeper. With a few modifications, we can compare the volume of bot traffic to normal traffic over time:
_sourceCategory=Apache/Access
| parse regex "\"[A-Z]+\s+\S+\s+HTTP/[\d\.]+\"\s+\S+\s+(?<size>\d+)\s+\S+\s+\"(?<agent>[^\"]+?)\""
| parse regex field=agent "(?<bot_name>facebook)externalhit?\W+" nodrop
| parse regex field=agent "Feedfetcher-(?<bot_name>Google?)\S+" nodrop
| parse regex field=agent "(?<bot_name>PaperLiBot?)/.+" nodrop
| parse regex field=agent "(?<bot_name>TweetmemeBot?)/.+" nodrop
| parse regex field=agent "(?<bot_name>msn?)bot\W" nodrop
| parse regex field=agent "(?<bot_name>Nutch?)-.+" nodrop
| parse regex field=agent "(?<bot_name>Google?)bot\W" nodrop
| parse regex field=agent "Feedfetcher-(?<bot_name>Google?)\W" nodrop
| parse regex field=agent "(?<bot_name>Yahoo?)!\s+Slurp[;/].+" nodrop
| parse regex field=agent "(?<bot_name>bing?)bot\W" nodrop
| parse regex field=agent "(?<bot_name>Bing?)Preview\W" nodrop
| parse regex field=agent "(?<bot_name>Sogou?)\s+web\s" nodrop
| parse regex field=agent "(?<bot_name>Yandex?)Bot\W" nodrop
| parse regex field=agent "(?<bot_name>rogerbot?)\W" nodrop
| parse regex field=agent "(?<bot_name>AddThis\.com?)\s+robot\s+" nodrop
| parse regex field=agent "(?<bot_name>ShareThis?)Fetcher/.+" nodrop
| parse regex field=agent "(?<bot_name>Ahrefs?)Bot/.+" nodrop
| parse regex field=agent "(?<bot_name>MetaURI?)\s+API/.+" nodrop
| parse regex field=agent "(?<bot_name>Showyou?)Bot\s+" nodrop
| parse regex field=agent "(?<bot_name>Google?)Producer;" nodrop
| parse regex field=agent "(?<bot_name>Ezooms?)\W" nodrop
| parse regex field=agent "(?<bot_name>Rockmelt?)Embedder\s+" nodrop
| parse regex field=agent "(?<bot_name>Sosospider?)\W" nodrop
| parse regex field=agent "(?<bot_name>Baidu?)spider" nodrop
| parse regex field=agent "(?<bot_name>Exabot?)\W" nodrop
| if (bot_name="","Normal Traffic", "Bot") as traffic_type
| timeslice 1m
| (size/1048576) as mbytes
| sum(mbytes) by traffic_type, _timeslice
| transpose row _timeslice column traffic_type
Note that we added the nodrop option to the last parse line. This keeps non-bot log entries in the results, whereas the previous query discarded them. The query then labels each entry as "Bot" or "Normal Traffic", converts the response size from bytes to megabytes, and sums it per one-minute timeslice.
Visualizing the results as a stacked column chart makes it easy to see when bots are flooding your site. This matters because heavy bot traffic can slow down response times for your human users.

Since this query only looks for good bots, you should be able to control their crawl frequency and block irrelevant URLs by tweaking your robots.txt file. While not all bots (notably Googlebot) will honor crawl-delay instructions, it’s still a good start towards optimizing bot traffic.
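For example, a minimal robots.txt along these lines asks one of the heavier crawlers from the earlier report to slow down and keeps all crawlers out of a section that offers no search value (the path is a hypothetical placeholder):

# Ask AhrefsBot to wait 10 seconds between requests
# (Crawl-delay is non-standard; Googlebot ignores it)
User-agent: AhrefsBot
Crawl-delay: 10

# Keep every crawler out of internal search results (hypothetical path)
User-agent: *
Disallow: /internal-search/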
Identifying Misbehaving Bots
However, there’s only so much optimization you can do for good bots because they are, well, good to begin with. By their very nature, they shouldn’t be flooding your site with requests or visiting pages that you’ve declared as “off-limits.”
As a system administrator, you should really be more concerned with misbehaving bots. Bad bots don’t typically identify themselves in their user agent string, which means the only way to detect them is by analyzing their behavior.
The rest of this article introduces a few ways to detect suspicious behavior from specific IP addresses. Regardless of whether these users are bots or humans, their abnormal browsing behavior is often enough cause to block those IPs from visiting your site.
Request Frequency by IP
Let’s start by getting a high-level view of visitor behavior. The following query returns the number of hits every minute, broken down by IP address:
_sourceCategory=Apache/Access
| parse regex "(?<client_ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
| timeslice 1m
| count as hits by _timeslice, client_ip
| transpose row _timeslice column client_ip
Displaying this information as a line chart makes it easy to see how each user is interacting with your website:

This chart contains a lot of information, so it’s worth taking a moment to understand how to interpret it. Each line is a single IP address, and the y-axis shows the number of requests they made each minute. Both axes provide clues for identifying bots.
Humans generally browse web pages one at a time, which means they should be towards the bottom of the y-axis. They also spend some time reading each page, and they eventually leave your site, so you should also see intervals with no interaction. In other words, humans are represented as an irregular jagged line at the bottom of the chart.
Bot behavior differs in a few ways. First, bots can request several pages in parallel, in which case they’ll have many more requests per time interval than their human counterparts. Second, they make requests at a relatively constant interval. And third, they often crawl a large portion of your site instead of just visiting a few pages. On the chart, this shows up as high y-axis values or as relatively constant lines that don’t periodically drop to zero hits.
The brown and purple lines towards the top of the above chart are examples of potential bot behavior.
Tracking an IP Address
The goal of the previous query was to find suspicious IPs to investigate. Once you have those IPs, you can stalk their path through your website with the following query (be sure to change the where clause to use an IP you found in your own log data):
_sourceCategory=Apache/Access
| parse regex "(?<client_ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
| parse regex "[A-Z]+ (?<url>.+) HTTP/1.1"
| where client_ip = "166.94.146.84"
| timeslice 1m
| count as hits by url, _timeslice
| transpose row _timeslice column url
This returns every URL the user with IP 166.94.146.84 requested each minute. Visualizing this as a stacked column chart shows you where they’ve been spending their time.

Again, there are clues in both dimensions. Simultaneous requests lie along the y-axis, and request frequency can be found along the x-axis. In addition, you can see precisely which pages and media resources they’ve been visiting.
This last part is a powerful tool for identifying scraping behavior. For example, if your company aggregates real-time price points for a particular industry, you want to know if people are stealing this valuable information. If you see the pricing URL pop up over and over again (as in the above screenshot), you’ll know that this IP address is constantly hitting this page to see if you’ve published new data.
Inspecting Image-to-HTML Ratio
The previous examples identify bot-like behavior only by request and traffic volume. This is great for identifying potential DoS attacks, performance problems, and scraping activities, but other lower-traffic bots can also be a concern. For instance, spam bots collecting email addresses or submitting spam through your contact and comment forms won’t show up as high-traffic users, but they can be just as problematic.
Many of these bots avoid downloading images and other media resources, which means we can identify them by comparing the number of media requests to HTML content requests.
_sourceCategory=Apache/Access
| parse regex "(?<client_ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
| parse regex "\"[A-Z]+ (?<url>.+) HTTP" nodrop
| where url != "-"
| parse regex field=url "(?<type>jpg|jpeg|png|gif)" nodrop
| if (type == "", 1, 0) as content_resource
| if (type != "", 1, 0) as media_resource
| sum(content_resource) as content_resource, sum(media_resource) as media_resource by client_ip
| (media_resource/content_resource) as media_to_content_ratio
| sort by media_to_content_ratio asc
| fields - media_resource, content_resource
This query calculates the image-to-HTML ratio for each IP address. Alone, this number won’t tell you much, but establishing a baseline for human users and comparing that to potential outliers can help identify bots. A simple bar chart makes this much easier:

The IP at the top of the chart is downloading significantly fewer static media resources than the rest of your users. This could indicate a bot, but it could also be a human using a text-only browser.
Summary
It’s important to understand that the methods we discussed in this article are only heuristics for identifying misbehaving bots. They don’t define a magic numerical threshold that distinguishes bots from humans.
There are potential consequences to blocking IPs, so it’s important to be very careful while analyzing bot traffic. Needless to say, mistaking your best customers for bots is not going to be good for business. Unfortunately, this is easier to do than you might think.
For example, a human user who opens a few links in separate tabs will generate several simultaneous hits, which is also a telltale sign of bot activity. Deeper analysis is almost always required to make an informed decision about whether it’s worth blocking a particular IP.
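As a sketch of what that deeper analysis can look like, the following variation on the earlier queries lists every user agent string presented by a suspicious IP (reusing the address tracked above). An IP that cycles through many different user agents, or sends none at all, is much easier to block with confidence:

_sourceCategory=Apache/Access
| parse regex "(?<client_ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
| parse regex "\"[A-Z]+\s+\S+\s+HTTP/[\d\.]+\"\s+\S+\s+\S+\s+\S+\s+\"(?<agent>[^\"]+?)\"" nodrop
| where client_ip = "166.94.146.84"
| count as hits by agent
| sort by hits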