You’ve heard it before: it’s never too early to prepare for worst-case scenarios on Black Friday and Cyber Monday (or other high-load days) so that you can avoid or mitigate critical issues and site failures. And it’s an inescapable truth that if you don’t have holiday readiness plans in place well in advance, you may have just enough time to discover why your site is lagging or crashing but not enough time to address issues, improve the user experience, and avoid as many problems as possible that are within your control.
What you may be overlooking is that peak-event readiness is about more than just load testing or ensuring that your servers are up throughout a specific timeframe.
It’s about ensuring that your front end is also working perfectly, that your site can deliver a delightful experience to your users or customers, and that it is functional – even when it’s experiencing up to seven or more times the typical traffic load. Therefore, it’s critical that you have both a technical and a business contingency plan in place so that if anything does go wrong (because the odds are that something will, unfortunately), you can recover from disaster and protect your business’s reputation.
While peak-event readiness is of particular importance for ecommerce, businesses of all types must be prepared for heavy-load days. Some are predictable – not just Black Friday but also Mother’s Day (flowers? cards? candy?), Memorial Day (decorations? food delivery?), Tax Day in the United States (banking? shipping? research?), and so on – while others arise based on factors like running a Super Bowl ad or a special marketing promotion.
Additionally, keep in mind that while business stakeholders are likely already aware of these events or promotions, technical teams may not have the same insight. Be sure that the lines of communication between teams are open so that all involved on both sides of an organization have early awareness of upcoming high-load days and can plan accordingly to manage them.
The Cost of Site Failure During High-Load Days
When sites don’t function properly – or go down completely – on high-load days, businesses lose money and take a hit to their reputation. Case studies abound. In one high-profile example, Amazon dealt with significant outages on Prime Day 2018, which may have cost them as much as $99 million in sales. (They seem to have learned from their mistakes for Prime Day 2019.)
J.Crew’s site went down on Black Friday 2018 for five hours, costing the company more than $700,000 in sales and impacting approximately 323,000 shoppers. And Nordstrom dealt with crashes and other issues at the beginning of its popular anniversary sale in 2019.
The numbers tell the story – planning is essential.
Start Thinking About High-Load Days Early (and Often)
Being proactive is a far better use of your team’s time than being reactive.
According to Rich Howard, CEO of Optimal, a business dedicated to optimizing websites and mobile apps, there are ten steps your business can take now to start preparing your website or web application for the upcoming holiday season.
Whether you’re in the early stages of planning or you’re already further along, here’s a checklist that will help you ensure you’re covering all of your bases:
10 Steps to Peak-Load Readiness
STEP 1: Invest in the Right Tooling and Get Everyone On Board
There are many tools available that can help you monitor, manage, and optimize your website and application performance: APM, synthetics, RUM, alerting, alert integrations, and more. How are you monitoring end-user performance today?
Next, look into the tools you are already using. Are they set up sufficiently? Are they providing meaningful, actionable data? Do you have a good sense of what will occur during any peak periods? At a minimum, Optimal recommends that you have RUM, APM, synthetic, and load test tools set up, instrumented, and producing meaningful data.
If you have a small budget or are just getting started, there are free web performance monitoring tools that can provide one-time reports and basic information for a rudimentary baseline.
However, peak-event readiness typically requires more robust information, including data over multiple time periods rather than just a single point. For that, you’ll want to invest in best-in-breed solutions (like Rigor) that can provide ongoing, actionable reporting.
STEP 2: Use Synthetic Monitoring to Baseline Your Current Performance
RUM or analytics data (funnel analysis) can help you determine the user flows that drive the majority of your business and that will be most critical on high-load days. This information can then be used to set up tests within a synthetic monitoring data platform such as Rigor to test how a user experiences those flows on every type of device, network, connection, and location.
For example, an ecommerce business would likely want to test mission-critical conversion paths, API flows, and business transactions on both mobile and desktop.
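To make the idea of a baseline concrete, here is a minimal Python sketch. It assumes you have exported page load times (in seconds) for one user flow from your synthetic monitoring tool. The sample values and function names are hypothetical, and the nearest-rank percentile is just one reasonable way to summarize the data.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def baseline(samples: List[float]) -> Dict[str, float]:
    """Summarize a flow with its median and 95th percentile load time."""
    return {"p50": percentile(samples, 50), "p95": percentile(samples, 95)}


# Hypothetical load times (seconds) for a checkout flow, one value per synthetic run.
checkout_load_times = [1.8, 2.1, 1.9, 2.0, 5.2, 2.2, 1.7, 2.0, 2.3, 1.9]
print(baseline(checkout_load_times))
```

Tracking a high percentile alongside the median is useful because a single slow run barely moves the median, yet it is exactly what some of your users will experience.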
If the high-load event is recurring (like Black Friday) and your business has gone through it at least once before, you can also run a retrospective by pulling traffic and performance data to give you insight into what worked well and where failures occurred.
Additionally, make sure that the business side is aligned with the technical side in terms of expectations and reality based on historical data and current capabilities. Both teams should work together to understand any technical limitations and also to see what plans are in place to upgrade or optimize the site. That information can help the business team determine such goals as conversion rates and can help technical teams evangelize the need for a delightful user experience to reach those goals.
STEP 3: Audit Your Performance Alert Integrations and Workflow
Take an inventory of the tools and integrations you are currently using to set up and send alerts. Are you using the alerting features within your monitoring solution, or are you taking advantage of a service like PagerDuty, OpsGenie, ServiceNow, or Slack?
Review all alerts to ensure that their thresholds are appropriate and fall within your performance budget. The performance of any one component needs to be seen in the context of a) where it normally performs and b) where it should be performing.
For peak events, you’ll want to be looking for anything that is anomalous – for example, make sure key players are alerted when something degrades to more than 25% outside the expected range.
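As a rough illustration of the 25% rule, the Python sketch below flags a measurement that deviates from an expected value by more than a given fraction. It simplifies the "expected range" to a single baseline number, and the function name and sample values are assumptions made for the example rather than part of any monitoring product.

```python
def is_anomalous(observed: float, expected: float, tolerance: float = 0.25) -> bool:
    """Return True when observed deviates from expected by more than tolerance (a fraction)."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return abs(observed - expected) / expected > tolerance


# Hypothetical page load times in seconds against a 2.0 s baseline.
print(is_anomalous(2.4, 2.0))  # 20% slower than baseline
print(is_anomalous(2.6, 2.0))  # 30% slower than baseline
```

The check is deliberately two-sided: a page that suddenly loads far faster than normal can also signal a problem, such as an error page being served in place of real content.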
According to Optimal, some key monitoring alerts that you should have in place are:
- Page load time degradation [RUM or Synthetic]
- Traffic patterns outside of normal [RUM or Analytics]
- Servers down or server health issues [APM]
- Drop in cart completions, increase in abandonment or other key conversion events [RUM or Analytics]
- Response code or issues with the rendered page [Synthetic]
- Third-party load time increase [RUM or Synthetic]
- Third-party broken [Synthetic or RUM]
- Slow DNS times [Synthetic]
- Slow CDN times [Synthetic or RUM]
- DDoS attack or malicious behavior [Security/WAF dashboarding tools]
- DNS slowdowns [Synthetic]
- CDN offload issues [CDN]
- Stock level issues [E-commerce engine]
- Credit card gateway error responses [Synthetic or Gateway provider]
STEP 4: Combine the Power of Load Testing and Synthetics
Of course, load testing and stress testing are essential – you likely already have your eye on those numbers. However, to be prepared for a peak event, you need a view not only of your back-end performance but also of the user experience as the site load increases, which relates to front-end performance. Using synthetic data in combination with load testing can deliver a snapshot of your users’ experiences during high volume conditions.
Commonly, with synthetics, you will see measurements that return erroneous data, indicating that something on the page has broken (e.g., the checkout button no longer works). In other cases, a third party may be slow and impacting the render of your page. Issues like these could cause a customer to lose interest and move on, despite the fact that you’ve managed to deliver the page from your server within, for example, 0.8 seconds.
Remember that just because your servers are active doesn’t mean your site is being delivered quickly to your users or even presenting back the information or providing the functionality necessary for them to complete their task. In fact, if your users perceive your site as down on a high-load day, then it’s down, no matter what any other measurements say.
You need to be able to manage both sides.
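One way to picture "managing both sides" is a synthetic check that validates what the page contains as well as how fast it arrived. The Python sketch below is illustrative only: the `CheckResult` fields, the marker strings, and the 1.0-second limit are all assumptions, and a real synthetic tool would exercise the page in a browser rather than inspect a string.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckResult:
    status_code: int
    response_seconds: float
    body: str  # rendered page content captured by the check


def evaluate(result: CheckResult, required_markers: List[str], max_seconds: float) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    if result.status_code != 200:
        problems.append(f"unexpected status {result.status_code}")
    if result.response_seconds > max_seconds:
        problems.append(f"slow response: {result.response_seconds:.2f}s")
    for marker in required_markers:
        if marker not in result.body:
            problems.append(f"missing content: {marker}")
    return problems


# A fast, successful response that is still broken for the user:
# the (hypothetical) checkout button never made it into the page.
fast_but_broken = CheckResult(200, 0.8, "<html><h1>Your cart</h1></html>")
print(evaluate(fast_but_broken, ['id="checkout-button"'], max_seconds=1.0))
```

A server-side view would report this request as healthy, while the content check shows that the user cannot complete a purchase.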
STEP 5: Review Your Security Compliance
Security holes in your site open you to exploit, which could not only compromise your customers’ data but also impact your performance and availability.
If your site has security concerns that are not appropriately addressed:
- Malicious actors could cause damage to your data or software, which can have long-lasting effects on your business.
- DDoS attacks could bring your site down, which can be especially devastating to your business’s revenue and reputation on a high-load day.
- Attackers could steal your site’s data and expose it, potentially revealing confidential information about you or your users and tarnishing your brand’s reputation.
To decrease your chances of an attack, start with the following five steps:
- Ensure you are running the latest versions of software on all systems that deliver your application. This will ensure that you are not running software with a known vulnerability that has already been fixed.
- Engage a third party to conduct an external application penetration test. This will give you an idea of security issues with your publicly facing assets.
- Review OWASP’s recommendations and best practices with your developers. This will prompt discussions about the methods you are using to develop applications securely.
- Use a static analysis tool or a vulnerability scanner (available from all major vendors such as Qualys, Veracode, or Micro Focus) to assess how well you are complying with security best practices.
- Ensure you have a Web Application Firewall (or other security device) in place and that its rules engine is up to date. This will help detect attacks as they happen.
STEP 6: Tweak Your Bot Management Plan
A peak event creates not only a timeframe for your business to boost its revenue and reputation but also, unfortunately, an ideal time for a cyber-attacker to bring down your site if they wish to do so.
Because of this increased vulnerability, if you haven’t already, you need to get a plan in place for bot management. Bots range from benign to a nuisance to full-on malicious. While a DDoS attack could bring down your site, bots can also be used to scrape your site for vulnerabilities, to steal PII, to make fraudulent purchases, and more.
STEP 7: Tune Your CDN Performance
Prior to any peak event, take the time to review your CDN’s cache design and offload metrics to determine what improvements can be made. Ultimately, the key to surviving a peak event is to have very good CDN cache designs.
CDNs offload traffic from your core infrastructure, delivering assets much more quickly than you would normally be able to. What specifically gets offloaded is typically subject to a series of (sometimes complicated) configuration rules.
For example, you may choose to enable offload of all your JS and CSS, but not your HTML. In this case, when you make a request for your HTML page, the CDN then has to reach back to your origin every time to fetch this, and it becomes the source of a bottleneck.
Ideally, you want to offload as much as possible instead.
In addition to deciding what to offload, you also decide how long the CDN can hold onto an object before it becomes stale (known as CDN cache time). If the cache time is too low, then the amount of offload will not be great, making your website less scalable.
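To see why a short cache time hurts offload, consider a deliberately simplified model, sketched in Python below: one object, one edge server, steady traffic, and exactly one origin fetch each time the cached copy expires. Real CDNs are more complicated (many edges, evictions, revalidation), so treat the function and its numbers as a back-of-the-envelope assumption rather than a prediction.

```python
def estimated_offload(requests_per_second: float, ttl_seconds: float) -> float:
    """Fraction of requests served from cache under the simplified model above."""
    requests_per_window = requests_per_second * ttl_seconds
    if requests_per_window <= 1:
        # At most one request per cache lifetime: every request goes to origin.
        return 0.0
    # One request per window reaches the origin; the rest are cache hits.
    return 1.0 - 1.0 / requests_per_window


# Hypothetical object requested 5 times per second at one edge server.
for ttl in (1, 10, 300):
    print(f"cache time {ttl:>3}s -> offload {estimated_offload(5, ttl):.1%}")
```

Under these assumptions, lengthening the cache time from one second to a few minutes moves the object from a noticeable origin load to almost none.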
CDNs also can operate using tiered distribution. If the edge server a user is talking to does not have a copy of a required asset, that server can ask one of its parent servers whether another CDN node has the asset before reaching back to your origin. This process is often quicker than making the origin request and further reduces the load on your infrastructure.
Checking your CDN and enabling tiered distribution can be a quick win in your quest to scale your website for peak load days.
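The lookup order behind tiered distribution can be sketched in a few lines of Python. Plain dictionaries stand in for the edge and parent caches, and the `fetch` function and asset names are hypothetical; the point is only to show the order in which each layer is consulted.

```python
from typing import Callable, Dict, Tuple


def fetch(
    key: str,
    edge: Dict[str, str],
    parent: Dict[str, str],
    origin: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (content, source), checking the edge, then the parent, then the origin."""
    if key in edge:
        return edge[key], "edge"
    if key in parent:
        edge[key] = parent[key]  # fill the edge cache on the way back
        return edge[key], "parent"
    content = origin(key)  # only now does the request reach your infrastructure
    parent[key] = content
    edge[key] = content
    return content, "origin"


origin_hits = []


def origin_server(key: str) -> str:
    origin_hits.append(key)
    return f"contents of {key}"


edge_cache: Dict[str, str] = {}
parent_cache: Dict[str, str] = {"logo.png": "contents of logo.png"}

print(fetch("logo.png", edge_cache, parent_cache, origin_server))  # served by the parent
print(fetch("app.js", edge_cache, parent_cache, origin_server))    # falls through to origin
print(fetch("app.js", edge_cache, parent_cache, origin_server))    # now served by the edge
print(len(origin_hits))  # origin was contacted once
```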
STEP 8: Review Third-Party Tags and Domains
Third-party vendors and tags can have a bigger impact on your performance and user experience than you think. Tech and business teams need to be aligned about which third-party applications are most critical for a user and which, such as commenting systems or chatbots, for example, could be temporarily disabled during high-load periods.
Synthetic monitoring can help identify which of these tags and domains cause the greatest lag, and the business team can work to identify which are critical to meet their goals for the high-load event.
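If your synthetic tool can export per-request timings, ranking third-party domains by total load time is straightforward. The Python sketch below assumes a list of (URL, milliseconds) pairs; the data shape, the hostnames, and the function name are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple
from urllib.parse import urlparse


def slowest_third_parties(
    timings: List[Tuple[str, float]], first_party_host: str, top_n: int = 3
) -> List[Tuple[str, float]]:
    """Total load time per third-party host, slowest first."""
    totals: Dict[str, float] = defaultdict(float)
    for url, millis in timings:
        host = urlparse(url).hostname or ""
        is_first_party = host == first_party_host or host.endswith("." + first_party_host)
        if host and not is_first_party:
            totals[host] += millis
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical timings (milliseconds) from one synthetic page load.
timings = [
    ("https://www.example.com/index.html", 180.0),
    ("https://static.example.com/app.js", 90.0),
    ("https://tags.example.net/tag.js", 420.0),
    ("https://chat.example.org/widget.js", 950.0),
    ("https://tags.example.net/pixel.gif", 130.0),
]
print(slowest_third_parties(timings, "example.com"))
```

A ranked list like this gives the business and technical teams a shared starting point for deciding which tags are worth their cost during a peak event.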
Similarly, make a plan to disable A/B testing and similar initiatives during peak load times wherever possible.
STEP 9: S&*! Happens: Establish a Contingency Plan for High-Load Days
Despite your best efforts, your careful checklist completion, and your detailed plans, it’s still possible that your site will face an unforeseen, unpreventable catastrophic event on a high-load day. What will you do if your site slows, crashes, or becomes unusable?
Your contingency plan should be two-pronged:
- A business plan that addresses your users
- A technical plan to get your site back on track as quickly as possible
Technically, plan for how you will handle the flood of traffic that can’t access your site because either the front end or the back end has failed. Your technical contingency plan should include:
- The ability to quickly pull levers or implement quick changes to systems – e.g., “feature switches” that can disable or degrade certain components and reduce system load.
- A way to quickly disable a third-party integration if necessary.
- Clear ownership and accountability for key systems, established early. This includes ensuring that senior technical resources will be available to troubleshoot when required during a heavy load period.
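A feature switch, in its simplest form, is a named on/off flag that page code consults before doing optional work. The Python sketch below is a minimal in-memory version; the class, the switch names, and the page sections are hypothetical, and a production setup would typically read the flags from shared configuration so they can be changed without a deploy.

```python
from typing import Dict, List


class FeatureSwitches:
    """In-memory switches; a real system would load these from shared config."""

    def __init__(self, defaults: Dict[str, bool]) -> None:
        self._flags = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        # Unknown switches default to off so a typo cannot add load.
        return self._flags.get(name, False)

    def disable(self, name: str) -> None:
        self._flags[name] = False


def render_product_page(switches: FeatureSwitches) -> List[str]:
    sections = ["product details", "add to cart"]  # core sections, always rendered
    if switches.is_enabled("recommendations"):
        sections.append("recommendations")
    if switches.is_enabled("third_party_chat"):
        sections.append("chat widget")
    return sections


switches = FeatureSwitches({"recommendations": True, "third_party_chat": True})
print(render_product_page(switches))

# During an incident, turn off the third-party integration with one call.
switches.disable("third_party_chat")
print(render_product_page(switches))
```

The core purchase path never depends on a switch, so degrading optional features cannot take checkout down with them.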
For example, your team could set up waiting rooms to lessen the load, as Macy’s did when it had a failure on Black Friday 2016. The company posted a message on its desktop site stating, “To make sure everyone gets the best shopping experience possible, we’re asking new shoppers to wait approximately 10 seconds, and then we’ll refresh your browser and welcome you in.”
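The core of a waiting room is simple admission control: let in visitors up to a capacity limit and queue the rest in arrival order. The Python sketch below illustrates that idea only. It is not a description of how Macy’s implemented theirs, and the class name, capacity, and visitor IDs are invented for the example.

```python
from collections import deque
from typing import Deque, Set


class WaitingRoom:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.active: Set[str] = set()
        self.queue: Deque[str] = deque()

    def arrive(self, visitor_id: str) -> str:
        """Admit the visitor if there is room and nobody is queued ahead; otherwise queue them."""
        if visitor_id in self.active:
            return "admitted"
        if len(self.active) < self.capacity and not self.queue:
            self.active.add(visitor_id)
            return "admitted"
        if visitor_id not in self.queue:
            self.queue.append(visitor_id)
        return "waiting"

    def leave(self, visitor_id: str) -> None:
        """Free a slot and admit queued visitors in first-come, first-served order."""
        self.active.discard(visitor_id)
        while self.queue and len(self.active) < self.capacity:
            self.active.add(self.queue.popleft())


room = WaitingRoom(capacity=2)
print(room.arrive("alice"), room.arrive("bob"), room.arrive("carol"))
room.leave("alice")
print(room.arrive("carol"))  # carol was promoted when alice left
```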
On the business side, make sure you have business processes in place to handle customer service should the website have issues. For example, if your payment gateway goes down, you can establish a plan to email users with a link that takes them back to their shopping cart with all items – and offers – intact.
Additionally, make sure there is a team available to address any issues in real-time on your social media channels and to respond to unhappy customers.
The hit to a company’s reputation after a site failure is as real as the hit to its bottom line, so messaging quickly if and when an outage occurs is key to mitigating both types of losses.
STEP 10: Run a Retrospective Soon After the Event
The best time to review peak-load performance is as close to the event as possible. It needs to be fresh in everyone’s mind – before business as usual returns. Invite members of every team to the table so that they can express their wins and their concerns from their specific point of view.
Pull out any anomalies and combine them with running commentary using an editable wiki page or shared document sectioned by discipline with dedicated space for general or cross-team issues. Use the outcome of your retrospective to scope and plan for the next high-load event, whether that is in a week, a month, or a year.
Above all, be mindful of the data retention policy of any of your system monitoring tools and be sure to grab and archive critical data from the event so it can be reviewed as needed before it’s lost forever.
Whether it’s holiday readiness or preparation for another high volume period that will affect your website, it’s critical to have a plan in place that you follow well before the event itself.
From ensuring that you have critical tools in place to reviewing third-party tags to establishing a bot management plan, you need to get started now – not later.
If you’re already monitoring site performance, you’re several steps into holiday readiness. Make sure you have the information in place to pivot, correct, and move forward as needed.
The Rigor digital performance management platform delivers ongoing, actionable insights that can help you establish a baseline and develop your action plan for high-load days like Black Friday and Cyber Monday. Reach out now for your free trial.
And if you want a professional service engagement to assist you in your plans for holidays and other peak-load events, contact us to get started today.
How confident are you that your web application is holiday-ready? What steps are you taking? Tell us in the comments!