Putting a reverse proxy server in front of a web server is needed for almost every big site, and it is a good thing to do: the reverse proxy serves anonymous traffic from its cache, so your web server is not flooded with requests. One of the most popular combinations of web server and reverse proxy is Apache and Varnish.
Setting up and configuring Varnish in front of Apache is an interesting and challenging task in itself, and the thrill peaks when Varnish stops caching pages and you are asked to debug it. Non-trivial, to say the least. "Wasn't it working a while ago? What just happened?" And you know that a long night is awaiting you.
I am sharing one such experience of debugging Varnish, which taught me to “read and understand both the positives and negatives before using anything”.
In one of our client's Drupal projects, we were using Varnish in front of Apache, with the Varnish and Apache services running on two different servers. We set up the servers, tested them with a vanilla Drupal installation, and found everything working perfectly: Varnish was caching all anonymous requests and serving them without calling the backend. Life was good up to this point. We then hosted the Drupal project on this server architecture.
Day 1:
While we were testing Varnish, it stopped caching all of a sudden. We had just changed the configuration file, so we restarted the service, thinking the change had not been picked up because we forgot to restart. We cross-checked the Varnish configuration in Drupal, and it all seemed correct. Yet the same configuration had worked on a fresh vanilla Drupal setup.
From this we concluded that the problem was not in the Varnish configuration but at the application level: somehow our application was stopping Varnish from caching. The most obvious thing to check was the cache control headers in the response. They were set to
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
whereas for Varnish to cache a page they must be set to
Cache-Control: public, max-age={some number}
Then a weird problem appeared: Varnish started working on its own on some random pages in an incognito window. This was alarming: the same page would be served from Varnish in incognito but would reach the backend in a normal window.
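The difference between the two Cache-Control values quoted above can be checked mechanically. The following is a minimal Python sketch, not what we ran at the time: the function names are hypothetical, and the rule it applies (no no-cache, no-store or private directive, plus a positive max-age) is only a rough approximation, since what Varnish actually caches depends on its version and VCL.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a {directive: argument} dict."""
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        # Directives without an argument (e.g. "public") map to None.
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def looks_cacheable(value):
    """Rough guess at whether a shared cache would store this response.

    Assumption: the response is treated as cacheable only when it carries
    no no-cache / no-store / private directive and has a positive max-age.
    """
    directives = parse_cache_control(value)
    if any(d in directives for d in ("no-cache", "no-store", "private")):
        return False
    max_age = directives.get("max-age")
    return max_age is not None and max_age.isdigit() and int(max_age) > 0


if __name__ == "__main__":
    for header in (
        "no-cache, must-revalidate, post-check=0, pre-check=0",
        "public, max-age=3600",  # 3600 is an arbitrary example value
    ):
        print(header, "->", looks_cacheable(header))
```

Running a check like this over the headers of a handful of pages gives a quick list of which ones the application is marking as uncacheable.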
This added to the confusion about whether the issue was at the server level or the application level, so we set out to settle that first. We started inspecting the headers of different requests in incognito and normal windows. After investing a decent amount of time, we found two cookies (device and device_type) set in the response headers by a Drupal module called context_mobile_detect. This module identifies the user's device so that content can be rendered accordingly, and it sets these two cookies based on that device.
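Comparing headers by eye is slow, so here is a small Python sketch of the same comparison. It assumes you have copied the raw Set-Cookie values from the two browser windows into lists; the helper names and the sample values are hypothetical, only the cookie names device and device_type come from our case.

```python
def cookie_names(set_cookie_values):
    """Return the cookie names found in a list of raw Set-Cookie header values."""
    names = []
    for value in set_cookie_values:
        # The cookie itself is the first "name=value" pair; attributes follow ";".
        first_pair = value.split(";", 1)[0]
        name = first_pair.split("=", 1)[0].strip()
        if name:
            names.append(name)
    return names


def cookies_only_in_first(first, second):
    """Names of cookies set in the first capture but not in the second."""
    return sorted(set(cookie_names(first)) - set(cookie_names(second)))


if __name__ == "__main__":
    # Hypothetical captures; the values and attributes are made up.
    capture_a = [
        "device=desktop; path=/",
        "device_type=2; path=/",
        "has_js=1; path=/",
    ]
    capture_b = ["has_js=1; path=/"]
    print(cookies_only_in_first(capture_a, capture_b))
```

Any name this prints is a cookie worth tracing back to the module that sets it.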
After identifying this, the feeling was “Hurray! We have nailed the issue”. Disabling the module or removing these cookies should resolve the problem, so we went ahead and uninstalled the module. Alas! Varnish was still not working. Ah, we forgot to clear the cache :) So we cleared the cache and kept our fingers crossed. Yay! Varnish was now working for the home page.
Time to relax and party, or so we thought. Before doing that, we wanted to check some more pages to be doubly sure that everything was fine. We checked new pages: Varnish was not working. Aaah. We checked the one that was working earlier. More aaah. That wasn't working either.
Now the random behaviour began: some pages started to come from Varnish, but as soon as we browsed a page that did not come from Varnish, the whole cache seemed to get purged, i.e. Varnish stopped serving cached pages. The situation demanded more debugging. But one thing became more and more certain: the issue was at the application level, not in the server or its configuration.
And it was not just us: our client and the development team were frustrated as well. The site had gone live and, without Varnish, it was loading slowly. It was an urgent and important situation, which is fine when you know the cause of the problem and not so good when you don't. The whole day passed with no solution and no root cause identified.
Day 2:
With the new day, we started with new energy and jumped on the issue again, beginning with the headers. Now both types of Cache-Control headers were being set, seemingly at random, on random pages, and we noticed an error message on some pages. We thought some code might be setting the wrong Cache-Control headers, but no luck. Various modules were playing with the cache control headers, but none of them looked to be the culprit.
The first half of the day passed just debugging and understanding code. It was not clear why that error was appearing on that page for anonymous users. We checked the "Logging & Errors" configuration for the anonymous user, and it was set to none, i.e. no error or warning messages should be shown to that user.
So now there were two things we needed to resolve:
1. The Varnish cache problem.
2. Anonymous users were seeing the error even though the configuration was set not to show it.
The error we were facing: https://www.drupal.org/node/1937436
We started working on #2 but had no clue how to go about it. We talked to the development team, and they confirmed that they were facing the same issue and had not yet been able to fix it. The whole day passed in debugging, but all our efforts were to no avail. Then we decided to turn off the contributed modules we were using, one by one, and see if that would give us a hint. And we finally found one :-)
In this case, it was the CSM module that was throwing this error, and uninstalling the module fixed the error. A little relief, but problems #1 and #2 were still there.
Day 3:
The battle with the problem was still on. To be honest, we were fast running out of ideas. In an attempt to relate issues #1 and #2 to each other, we found the culprit. Whenever Drupal sets a message, it also sets a session cookie for that message, and when a session cookie is present, Varnish does not cache the page.
That explained the random behaviour of Varnish: once an error message is set on some page, Varnish stops serving pages from the cache because it sees the session cookie. After debugging further, we found that messages_alter, a submodule of the csm module, was setting messages for anonymous users even after the main configuration was turned off.
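A quick way to spot this situation is to look for a session cookie on requests that are supposed to be anonymous. The Python sketch below is only an illustration: it assumes the Drupal 7 naming convention, where session cookies start with SESS (or SSESS over HTTPS) followed by a hash, and the function name and sample cookie values are hypothetical.

```python
import re

# Assumption: Drupal 7 style session cookie names, "SESS" or "SSESS" plus a hash.
SESSION_COOKIE_NAME = re.compile(r"^S?SESS[0-9a-zA-Z]+$")


def session_cookies(cookie_header):
    """Return the names of session-like cookies in a raw Cookie request header."""
    names = []
    for pair in cookie_header.split(";"):
        name = pair.split("=", 1)[0].strip()
        if SESSION_COOKIE_NAME.match(name):
            names.append(name)
    return names


if __name__ == "__main__":
    # Hypothetical Cookie headers from two "anonymous" browser sessions.
    clean = "has_js=1"
    tainted = "has_js=1; SESS0123456789abcdef=AbCdEf"
    for header in (clean, tainted):
        found = session_cookies(header)
        verdict = "likely bypasses the cache" if found else "no session cookie"
        print(header, "->", verdict, found)
```

If an anonymous visitor's requests carry such a cookie, the next step is to find out which module started the session.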
Finally, uninstalling this module fixed all the problems.
In short, the three-day battle taught us that session cookies set by the application can stop Varnish from caching.
If you get stuck on a Varnish problem too, check the cookies first. They might be the real culprit :-)
Let me know in the comments if there is a quicker and better way to debug this type of issue.