
Hi there!!! 👋

It’s the second day of the #100dayschallenge, and today I will talk about the history and evolution of SRE.

I have planned the content for the next 100 days, and I will be posting one blog post every day under the hashtag #100daysofSRE. ✌️

I hope you tag along and share valuable feedback as I grow my knowledge and share my findings. 🙌

Alright! Let’s begin…

Preface

In the Day 01 post, I talked about what SRE is, its key concepts and benefits, and how the role differs from DevOps and traditional SysAdmin roles.

#100daysofSRE (Day 01): Introduction to Site Reliability Engineering (SRE)

SRE is a relatively new field that has recently gained a lot of attention. Its origins, however, can be traced back to the early days of the Internet, and especially to the emergence of large-scale web services and cloud computing.

In this post, let’s explore the roots and evolution of SRE as an important role for tech companies.

The Origins of SRE

Tim Berners-Lee proposed the World Wide Web in 1989 and built the first web server in 1990, and the first website went public in 1991. In the early days of the web, most websites were simple static pages that didn’t require much maintenance.

However, as the web grew and became more complex, it became clear that a new approach was needed to ensure that websites and web applications remained reliable and available.

In the early 2000s, Google was experiencing frequent outages of its services. The company’s engineers were spending most of their time fixing issues instead of working on new features, which was affecting the company’s growth.

To solve this problem, Google created a new role called Site Reliability Engineer (SRE). The first SRE team was formed around 2003 under Ben Treynor Sloss, and its core responsibility was to ensure the reliability of Google’s services. The SREs were tasked with monitoring Google’s systems, spotting potential issues before they turned into outages, and developing solutions to prevent them from recurring.

With the help of SREs, Google was able to significantly reduce the frequency and duration of its outages, and free up its engineers to work on new features.

The Evolution of SRE

Even though SRE was initially introduced to improve the reliability of Google’s infrastructure, it has since been adopted by many other companies.

One of the critical features of SRE is its focus on automation. By automating routine tasks, such as deployment, configuration, and monitoring, SRE teams can reduce the risk of human error and improve the reliability of their systems.
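To make the automation idea a bit more concrete, here is a minimal sketch of an automated pre-deployment gate. It is only an illustration, not how any particular company does it: the function names `run_checks` and `safe_to_deploy` and the example check names are my own assumptions.

```python
from typing import Callable, Dict, List


def run_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of the ones that failed.

    A check that raises an exception is treated as a failure, so one
    broken check cannot stop the remaining checks from running.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


def safe_to_deploy(checks: Dict[str, Callable[[], bool]]) -> bool:
    """Allow a deployment only when every check passes."""
    return not run_checks(checks)


if __name__ == "__main__":
    # Hypothetical checks; real ones would query a config store,
    # a monitoring system, a disk-usage metric, and so on.
    example_checks = {
        "config_is_valid": lambda: True,
        "disk_has_space": lambda: True,
        "error_rate_is_low": lambda: False,
    }
    print("Failed checks:", run_checks(example_checks))
    print("Safe to deploy:", safe_to_deploy(example_checks))
```

The point of a gate like this is that the same checks run the same way every time, instead of relying on someone remembering a manual checklist.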

Another important aspect of SRE is its emphasis on measurement and monitoring. SRE teams use a wide range of tools and techniques to measure their systems’ performance and identify areas for improvement. This allows them to address issues before they become serious problems and to continuously improve their systems’ reliability and performance.
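One common way to turn "measurement" into a number is to compute availability from request counts and compare it with a target, often called a service level objective (SLO). The sketch below is a simplified illustration under my own assumptions: the function names and the 99.9% target are made up for the example, and real systems usually pull these counts from a monitoring tool.

```python
def availability(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded (a simple availability measure)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def error_budget_remaining(
    successful_requests: int, total_requests: int, slo_target: float
) -> float:
    """Share of the error budget still unspent.

    1.0 means no failures so far, 0.0 means the budget is exactly used up,
    and a negative value means the target has been missed.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    actual_failures = total_requests - successful_requests
    return 1 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical numbers: 1,000,000 requests, 500 of them failed,
    # measured against an assumed 99.9% availability target.
    print(f"Availability: {availability(999_500, 1_000_000):.4%}")
    remaining = error_budget_remaining(999_500, 1_000_000, 0.999)
    print(f"Error budget remaining: {remaining:.1%}")
```

With these example numbers, roughly half of the allowed failures have been used, which is the kind of signal a team can use to decide whether to slow down risky changes.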

SRE has also evolved to encompass a broader range of responsibilities. Initially, it focused primarily on infrastructure and operations, but it has since expanded to include security, compliance, and incident management.

This reflects the growing recognition that reliability is not just a technical issue but also a business issue that affects the bottom line.

As the scale and complexity of software systems continue to grow, I believe SRE will become even more important in the years to come.

Significance and Concluding Remarks

Imagine a large e-commerce company that sells products to customers all over the world. The company has a website that receives millions of visitors each day, and it uses a complex system of servers and databases to process orders and payments.

One day, the company experiences a major outage that lasts for several hours, during which time customers are unable to access the website or make purchases. The company’s IT team works tirelessly to resolve the issue, but they are unable to do so quickly enough. As a result, the company loses millions of dollars in sales and damages its reputation.

After the incident, the company decides to invest in Site Reliability Engineering, hiring a team of SREs to monitor its systems, identify potential issues, and develop solutions to prevent future outages.

With the help of SREs, the company is able to significantly reduce the frequency and duration of its outages and keep its website highly available to customers.
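To put an outage of several hours in perspective, it helps to see how much downtime a given availability target actually allows. This is just simple arithmetic in code form; the function name and the targets in the example are my own choices for illustration.

```python
def allowed_downtime_minutes(availability_target: float, period_days: int = 365) -> float:
    """Minutes of downtime permitted over a period for a given availability target."""
    if not 0 <= availability_target <= 1:
        raise ValueError("availability_target must be between 0 and 1")
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    total_minutes = period_days * 24 * 60
    return (1 - availability_target) * total_minutes


if __name__ == "__main__":
    # Example targets, often described as "two nines", "three nines", and so on.
    for target in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} over a year: about {minutes:.0f} minutes ({minutes / 60:.1f} hours)")
```

For example, a 99.9% target over a 365-day year allows about 525.6 minutes, or roughly 8.8 hours, of downtime. A single outage lasting several hours would use up most of that allowance in one go.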

This seems to be the story of many companies that provide cloud-based services and run their own data centers or infrastructure.

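If you are using Amazon Web Services or Microsoft Azure, the provider takes care of much of the underlying infrastructure, although the reliability of your own applications is still your responsibility. If you run your own infrastructure, you will need SREs all around it.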

That’s all for today! See you in the next post!!! 🤝



Thank you for reading my blog post! 🙏
