Squid doesn’t cache? Check your headers

I am a sysadmin at Ivinco and wanted to share my recent experience configuring Squid. We faced a problem with Squid caching which took me quite a while to fix. The write-up below may be an interesting read for anyone who wants to understand what problems to expect when configuring web page caching with Squid.

Our configuration on this project is simple and quite standard: we use Squid in front of the Apache servers. There are two Squid servers, load-balanced on the firewall, with their caches shared between each other (they are configured as siblings). We force caching in Squid by sending HTTP headers from the web app.
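For reference, a sibling setup like this boils down to a couple of cache_peer lines in squid.conf on each box; the hostnames and ports below are placeholders rather than our actual configuration:

# on squid1: treat squid2 as a sibling and query it over ICP before fetching from the origin
cache_peer squid2.example.com sibling 3128 3130 proxy-only
# the origin Apache server sits behind both Squids
cache_peer apache1.example.com parent 80 0 no-query originserver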

Recently we noticed that our Squid does not cache all dynamic pages. After some initial research I found a couple of weird things:
– Squid was not caching any files larger than ~50K
– Sometimes a file larger than 50K would get cached, but even when it happened, the sibling Squid server would still not use this cached object.

Deeper investigation showed that static content larger than 50K was not being cached either. The following two options have something to do with that; setting them made at least caching of big static files work:

range_offset_limit 16 MB
quick_abort_min -1

Also I found that there were some recent bugs in Squid related to the maximum_object_size option – if it was defined after cache_dir, it was silently ignored. I moved it before cache_dir just in case. Additionally, I added the max-size=16777216 option to the cache_dir definition, i.e.:

cache_dir aufs /mnt/data/squid/cache 256000 256 512 max-size=16777216
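So, to be explicit about the ordering, the relevant part of our squid.conf ended up looking roughly like this (the sizes are the ones mentioned above; adjust them to your setup):

maximum_object_size 16 MB
cache_dir aufs /mnt/data/squid/cache 256000 256 512 max-size=16777216
range_offset_limit 16 MB
quick_abort_min -1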

But all this helped only with static files. There was still a problem with our PHP-generated content – it was sometimes getting cached and sometimes not, without any clear pattern.

Researching further, I found a cool service – http://redbot.org. It can analyze any URL for its “cacheability” and give some recommendations. Using this service, I found the key problem. Our PHP scripts were returning the “Expires” header using PHP’s date(“r”) function. This format does not match what Squid expects to see in the header: date('r') produces an RFC 2822 date, while Squid wants the RFC 822 style date required by HTTP, with the timezone written as GMT. The difference is in how the timezone is specified:

date('r'): "Expires: Tue, 10 Feb 2015 14:40:43 +0000"
what Squid expects: "Expires: Tue, 10 Feb 2015 14:40:43 GMT".

This minor difference seriously lowered the chance of an object being cached. After fixing it, our dynamic page caching started working properly.
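On the PHP side the fix is essentially a one-liner: format the header with gmdate() instead of date('r'). Something along these lines should do, where the one-hour TTL is just an illustrative value:

header('Expires: ' . gmdate('D, d M Y H:i:s', time() + 3600) . ' GMT');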

There was one more issue, related to incorrect testing. While running tests, I was doing something like this:

"for i in `seq -w 1 1000`; do curl -is http://domain.com/test.html | head -20 | grep Age; done;"

This prints the ‘Age: NNN’ header, which means the object is cached. But what I saw was that when the request went to the neighbor Squid server, my testing script returned no such header, meaning the object was still not properly cached. After digging deeper I found that the problem was simply in my method of testing. Doing “|head -20”, or cutting off part of the response and terminating the request in any other way, makes Squid skip caching, since the client disconnects before receiving the full response. After changing the test method to this:

"for i in `seq -w 1 1000`; do wget -S http://domain.com/test.html 2>&1 | grep Age; done;"

the problem was gone.
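As a side note, if you prefer curl for this kind of test, you can dump the headers separately while still letting the body download in full, for example:

"for i in `seq -w 1 1000`; do curl -s -o /dev/null -D - http://domain.com/test.html | grep Age; done;"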

Basically that’s it. The summary is that at first we thought Squid was glitching, since its caching behavior was quite unpredictable, but after deeper research we figured out a number of internal issues, and once they were fixed, Squid started working properly.

Just be careful with the HTTP headers you return to control Squid’s behavior, because it won’t give you any feedback or clue when you do this wrong. Otherwise Squid proved to be a good web caching solution.

Massive Indexing with Sphinx

Introduction

Imagine the following situation: multiple Sphinx daemons are serving a huge amount of data on numerous hardware servers. All indexes are hosted on the same servers as the searchd daemons in order to minimize network data transfers. The request rate is high, the Sphinx servers are busy but not overloaded, and the system as a whole is healthy. So far so good. And now you need to rebuild all these Sphinx indexes. That’s where the challenge comes in.

The problem

Servers have limited capacity, in terms of both CPU and I/O, and are already running searchd daemons under significant load. The search request rate is high and maintenance downtime is not possible. Running indexer on the same server as searchd introduces additional load, which can easily saturate the hardware. Since searchd provides the system’s key functionality, we can’t allow performance degradation during the indexing period, which adds up to hours and hours in total.

 

Problem Summary

  1. Indexing requires a significant amount of resources (CPU especially), and performing it on the Sphinx servers affects production performance.
  2. Management tends to be laborious and complex

+ simple and straightforward

– limited scalability and maintenance complexity

 

What are we looking for?

Under these circumstances we want to decouple building indexes from serving them. A proof-of-concept solution would be to produce indexes on several dedicated servers that are not directly involved in handling search requests, and to distribute the resulting indexes to the target Sphinx servers in the background, preferably with network shaping.

 

General solution features

  1. Stand-alone indexing
  2. Results distribution over target servers
  3. Ease of management

+ scalability and flexibility

– complexity

 

Proposed solution – Distributed Indexer

A unified solution for massive Sphinx-based indexing. The key idea is to provide a highly available and easy-to-use solution for scalable, decoupled massive indexing.

 

Distributed Indexer – components

As a distributed system running on multiple servers, Distributed Indexer consists of several components, the most notable of which are workers, jobs and managers. General schema:

(Distributed Indexer schema diagram)

Main Components

Worker — a daemon running on an indexing server; it receives a job from the Managers, handles it according to the job description, and reports the result back to the Managers. Typically workers are counted in the hundreds. The replicated indexing process performed by workers is described in more depth here.

 

Job — a bundle including an indexer configuration file and a Sphinx configuration file. A job description can contain calls to other applications within Distributed Indexer or to any other external application. The typical workflow of a job on a worker consists of the following parts (a rough sketch of such a job follows the list):

  1. Produce indexes
  2. Check Indexes
  3. Deliver results
  4. Rotate Sphinx
  5. Hooks — external programs can be called at certain points and events
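
To make this more concrete, on a worker such a job conceptually boils down to shell steps like the following. The paths, index name, target host and bandwidth limit are illustrative assumptions, not our actual job definitions:

# 1. Produce indexes from the bundled Sphinx configuration
indexer --config sphinx.conf --all

# 2. Check the resulting indexes before shipping them
indextool --config sphinx.conf --check myindex

# 3. Deliver the result to a target Sphinx server with bandwidth shaping,
#    as ".new" files so searchd can rotate them in seamlessly
rsync -a --bwlimit=20000 myindex.new.* target-host:/var/lib/sphinx/data/

# 4. Rotate Sphinx: tell searchd on the target to pick up the .new files
ssh target-host 'kill -HUP `cat /var/run/sphinx/searchd.pid`'

# 5. Hooks: any external program can be invoked around the steps above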

 

Manager — a daemon choreographing the whole distributed process. It consists of the following subsystems:

  1. Scheduler — two types — «start at» and «countdown». Can be intermixed.
  2. Console/Communication — via command console and command language
  3. Failover — peer-to-peer failover between managers
  4. Configuration parser
  5. Auto-rebalancer

Good practice is to have two Managers running on different servers in order to provide failover and mitigate possible downtime.

We’ve been using Distributed Indexer in production for some time now and can emphasize the following advantages:

 

  1. Scalable solution. Adding a new server to the indexing pool is as easy as installing the worker package; the system handles everything else by itself. No manual management of each indexing run is required: launch the worker, and indexes get built on the server.
  2. Ability to utilize idle resources. If you have underloaded servers with spare capacity, they can easily be put to useful work building and distributing indexes. The same as above: launch the worker, and indexes get built on the server.
  3. Automated rebalancing. To spread request load over the cluster evenly, we have to mix frequently and rarely requested data on each server. Since search queries tend to drift over time, shifting popularity from one set of data to another while new data is added continuously, this intermixing has to be redone periodically, on a regular basis. Having it automated by Distributed Indexer relieves the admin team of routine tasks.

Real-World Numbers

Currently we have Distributed Indexer running on 20 servers, 2 of which are fully dedicated to it and 18 of which are used partially in order to utilize idle resources. Around 1000 workers are running on the system, building indexes constantly. Index build times vary from seconds up to several days for different indexes. The built indexes consume around 6 TB of disk space.

Summary:
  Servers: 20
  Workers: 1000
  Disk: 6 TB

Conclusion

Distributed Indexer has proven itself as an easy-to-use solution for massive Sphinx-based indexing, with clear advantages and strengths. The system is constantly evolving based on typical use cases.

WordPress Sphinx Search plugin update 3.9.4

Hi everyone!

We are glad to let you know that WordPress Sphinx Search plugin has been updated to version 3.9.4 and here are some details:

  • Checked compatibility with the new WordPress 4.0 “Benny” release.
  • Updated the Sphinx search engine binaries to 2.1.9.

Please find out more about WordPress Sphinx Search plugin here.

WordPress Sphinx Search plugin update

Hi everyone!

We are glad to let you know that WordPress Sphinx Search plugin has been updated to version 3.9 and here are some details:

  • Checked compatibility with the new WordPress 3.9 release.
  • Style updates to fit the new WP styles.

Please find out more about WordPress Sphinx Search plugin here.

DokuWiki Sphinx Search plugin update

Hi everyone!

We are glad to let you know that DokuWiki Sphinx Search Plugin has been updated.

Please find out more about DokuWiki Sphinx Search Module here.
