Performance Nerd

Monday, April 25, 2016

How to Compete with Teridion

So you want to beat Teridion to the software defined networking space? Are you interested in building a solution to improve performance on content, like file uploads, that a traditional Content Delivery Network (CDN) cannot address? Too late. Teridion launched late last year. However, if you are interested in understanding the main concepts of Teridion, this article should help.
As I have heard my team state multiple times, the solution is elegantly simple. Teridion comprises three components: the Teridion Measurement Agents (TMA), the Teridion Management System (TMS), and the Teridion Cloud Routers (TCR). Trademarked names aside, the platform can be described as data collection, data analytics, and a virtual backbone network.

The Internet Heat Map

Agent technology has been around longer than the internet. Teridion Measurement Agents are specifically designed to gather latency, bandwidth, congestion, and other network health indicators. These agents live and die in cloud data centers where Teridion can currently create virtual backbone networks (VBNs). As opposed to other agent technologies like ThousandEyes or Gomez, TMAs look only at network statistics and not application-centric metrics. This is because Teridion is a layer 3 technology and only cares about route optimization, not application acceleration in the general sense.

The Brains of the Operation

Like all properly designed systems, Teridion Measurement Agents send data to the Teridion Management System for analysis. The TMS leverages Elasticsearch to determine, as quickly as possible, where the optimum path is for a virtual backbone network. The algorithm used by the system is proprietary. I could say it is too secret to share, but honestly I have not asked what the formula is. Once an optimum path is discovered, the TMS will either spin up a new TCR or update the current TCRs' routing tables accordingly.
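
Since the real algorithm is proprietary, here is only a toy sketch of the general idea of picking a path from measured latencies. It uses plain Dijkstra shortest-path, which is my stand-in and not Teridion's method. The region names and millisecond values are made up for illustration.

```python
import heapq

def fastest_path(latency_ms, source, target):
    """Return (total_ms, path) for the lowest-latency route (Dijkstra)."""
    queue = [(0.0, source, [source])]
    best = {source: 0.0}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for neighbor, ms in latency_ms.get(node, {}).items():
            new_cost = cost + ms
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor, path + [neighbor]))
    return float("inf"), []  # target is unreachable

# Hypothetical agent measurements: latency in milliseconds between regions.
measurements = {
    "us-west": {"us-east": 70.0, "us-central": 25.0},
    "us-central": {"us-east": 30.0},
    "us-east": {"eu-west": 80.0},
    "eu-west": {},
}

cost, path = fastest_path(measurements, "us-west", "eu-west")
print(cost, " -> ".join(path))
```

With these invented numbers the two-hop detour through us-central beats the direct us-west to us-east link, which is the kind of decision a measurement-driven system can make and a static route cannot.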

The (Virtual) Backbone of the Solution

If you think of the other two components as setting the table, the Teridion Cloud Routers are the main course. These are Linux VMs turned routers that create HOV lanes of communication for a majority of the packet's life on the internet. Simply put, TCRs are the on- and off-ramps of the virtual backbone network (VBN). The VBN is set up much like a highway. An optimum highway would have an on-ramp close to every driver and an off-ramp close to the destination.

Does it Work?

Yes. 'Nuff said? Well, for those of you who need "proof," send a message to sales@teridion.com and get a trial started. For those of you looking for current data points on real-life customer gains, here is a nice chart:

For those of you joining in from a mobile device, or who just cannot read the axes, the chart depicts the time to download a 5MB file using the public internet and using the Teridion network. In this comparison, Teridion makes the internet ~20x faster, and content delivery is much less susceptible to degradation in performance. When congestion happens on the internet, Teridion pivots to a new routing path.

Wednesday, March 30, 2016

I Don't Believe You: The Story of Teridion

I have nearly five years of experience in the APM space. I have lost count of how many times I have drawn this picture:

Although this vastly simplifies the world of today, it is an (albeit terribly drawn) depiction of how applications work in theory. There is an end user that communicates over some network to some data center. APM strategies in general focus on understanding this transaction through data collection points at critical components and delivering a picture of this flow.
Over my years in this space, I have come to realize there are many great ways to get visibility into the transaction. I always had the confidence to suggest performance improvements at any point in this diagram except one: the public internet. Do not get me wrong. There are a ton of tools like ThousandEyes or Outage Analyzer to determine WHERE a problem exists, but I have never seen a great strategy that SOLVES this problem.

I was extremely happy at Dynatrace. I was 26 years old, working in the Bay Area and regularly meeting with the logos that are associated with Silicon Valley. In the past two years I lost less than 10% of opportunities to competition. I was becoming the new technology SME and had opportunities like presenting at Docker Meetups. I was having meetings with Site Reliability and Application Performance Engineers who are well known in the development and deployment world. I was not the only one feeling the success of the product. Dynatrace as a company is wrapping up another stellar year with the announcement of the unification of Ruxit and Dynatrace. The organization's market share is still number one and growing. The future looks extremely bright for Dynatrace and I was looking forward to enjoying the ride. Then, in steps Teridion.

Optimizing the Pipe and Not Building Water Towers

As my experience with Dynatrace grew, so did my understanding of just how complex applications could be. A good APM solution will be able to tell if the problem is on the end user device, inside the data center (the cloud is just someone else's data center) or in the network between the two (most often the public internet). I could have significant performance improvement conversations with customers who were interested in fixing problems on the end user device or inside their data center. However, the public internet only had one solution that I knew of: use the internet less. The only ways I knew to improve the performance of the internet were to enable caching, push content out to CDNs, add more data centers closer to end users, or build out a private fiber network. Most large sites utilize a CDN to address this concern since it is the easiest "fix."

Content Delivery Networks (CDNs) are a way to help reduce this problem. They are great for storing cacheable content near end users. However, the internet is moving more and more towards individual experiences. Pages are composed of more dynamic content and less static content. For example, when I look at a news feed, I click on titles that interest me, read a few sentences and immediately go to the comments section. Any item revolving around the personal experience will be a challenge for a CDN to deliver on. A better theory for delivery is needed for the modern web. I was very aware of this issue, but never had an answer. I did not know there was a better way.

Holding the Internet Accountable

What do Comcast, Dish, and Sprint all have in common? If you answered they all represent last mile content delivery, you would be correct. If you also answered they were voted into the Top 10 Most Hated Companies in the US according to 24/7 Wall St, you would ALSO be correct. There is personally nothing worse (exaggeration) than coming home from a hard day, firing up Netflix and... the stream is not working. Who gets blamed? ISPs! Who's at fault? I DO NOT CARE I JUST WANT TO WATCH BOB'S BURGERS! ISPs are the face of the end user's frustration. They are just the last leg of a long journey of content delivery that spans multiple handlers over many literal miles. They get blamed because the path is complicated and they are the face of that complication. Just how complicated can this path be? Here is a traceroute output from my terminal to google.com:

Eleven hops. Double-digit points of failure that are mostly set statically and are spread across multiple players. Teridion asked me a very simple question to start off the conversation: "Who controls the internet?" Unless you subscribe to conspiracy theories, you probably know that no one controls the internet. There are big players, but even when you go to Google there are at least half a dozen touch points that the traffic is routed through. Coming from the APM space, I knew the user-to-datacenter bandwidth was inconsistent. I just did not realize how much of a performance drag it was! Teridion showed me an elegantly simple demo:

"I don't believe you." This was my response to the demo. This is also the exact response I wanted any potential lead to have when I was showcasing a product. The logic behind the solution makes sense as well. I had to get involved. I saw an answer to a problem that all IT organizations worried about but only had a Band-Aid for. How do you control the public internet? If you are Facebook or Google, you lay your own fiber. If you are not those guys, Teridion is the solution to provide reliability and performance.

Modern Routing for the Modern Web

Whenever I have a customer who says those four words stated in the title, I have to follow it up with a simple explanation of how it works. Whelp, Teridion is the Waze of the internet (the "Uber of ..." line died in 2015). The main protocol used to route traffic across the internet is BGP. Think of BGP like a paper map of the United States. If you need to take a trip from Washington, D.C. to San Francisco, you would most likely plot out a course using the main highways. The map cannot tell you about construction, congestion, weather or anything else that will impact your journey. BGP is older than me. I have never used a paper map to plan a road trip. Waze is a GPS application that also takes in current road conditions to plot out the course that will be the fastest. Similarly, Teridion proactively determines the fastest way to get from the user to the location of the content by taking in multiple metrics from agents constantly testing the internet. This data is fed into a singular analytics engine which can then create HOV lanes for internet traffic. It is "elegantly simple."
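
To make the paper map versus Waze comparison concrete, here is a small sketch that scores the same two routes twice: once with a fixed, map-style cost and once with live measurements. This is a toy model of the analogy only. It is not how BGP selects paths and not Teridion's logic, and every route name and number is invented.

```python
def total(route, cost_per_segment):
    """Sum the cost of every segment on a route."""
    return sum(cost_per_segment[segment] for segment in route)

def pick_route(routes, cost_per_segment):
    """Return the name of the cheapest route under the given costs."""
    return min(routes, key=lambda name: total(routes[name], cost_per_segment))

# Two hypothetical ways to cross the country.
routes = {
    "northern": ["dc-chi", "chi-sf"],
    "southern": ["dc-dal", "dal-sf"],
}

# "Paper map": a fixed cost per segment that never changes.
static_cost = {"dc-chi": 7, "chi-sf": 21, "dc-dal": 13, "dal-sf": 17}

# "Waze": latency in ms measured right now, with congestion on chi-sf.
live_latency_ms = {"dc-chi": 11, "chi-sf": 40, "dc-dal": 20, "dal-sf": 26}

print(pick_route(routes, static_cost))      # choice from the static view
print(pick_route(routes, live_latency_ms))  # choice from live measurements
```

The static view keeps sending traffic north because the map never changes, while the live view switches south as soon as the measured numbers say it is faster.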

Want to learn more? Can't believe it is true? Check out Teridion for more information.

Thursday, November 19, 2015

All I Want for Christmas is a Perfect Performance Tool

Dear Santa,

It has been a few years since we last talked. Again, I apologize for being three and crying hysterically on your lap at the mall. I know you were probably gearing up for your trip around the globe, and little me did not make that easier. Please remember the good times we had, like this one:

Purple was in that year

Anyways, I have come up with a list of requirements I would like for today and tomorrow's complex world. If you and the elves can work your magic and get this delivered on December 25th, I promise to have some gourmet milk and cookies waiting for you. Below is my Christmas list for the perfect performance tool.

Logically Monitors Applications

I am a firm believer that transaction-based monitoring from the end user perspective is the only way to truly understand where and when a problem occurs in today's applications. Would it not be great if there were a solution that could tell me EXACTLY what services are invoked in a delivery chain? I drew up something to help visualize this communication path:

The tool should be able to inject its own tag to follow a transaction across multiple tiers. I hate solutions that require someone to define how transactions flow through an environment. We can put robots on Mars, so we should be able to figure out how services communicate within an application without having to manually define each interaction.
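
Here is a rough sketch of what "inject its own tag" could look like, with three pretend tiers passing a header along. The header name, the service names, and the in-memory span list are all hypothetical. A real tool would do this inside the process without any code changes.

```python
import uuid

TRACE_HEADER = "x-trace-id"  # hypothetical header name
spans = []                   # stand-in for the monitoring tool's data store

def handle(service, headers, downstream=()):
    """Record this hop, then call downstream tiers with the same tag."""
    headers = dict(headers)
    # Inject a tag only if the request does not already carry one.
    headers.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    spans.append((headers[TRACE_HEADER], service))
    for call in downstream:
        call(headers)
    return headers[TRACE_HEADER]

def database(headers):
    return handle("database", headers)

def app_server(headers):
    return handle("app-server", headers, downstream=[database])

def web_server(headers):
    return handle("web-server", headers, downstream=[app_server])

trace_id = web_server({})  # a request arrives with no tag
flow = [service for tag, service in spans if tag == trace_id]
print(" -> ".join(flow))
```

Nobody told the tool that the web server calls the app server, which calls the database. The flow falls out of the recorded tags.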

Eliminates the Need for a War Room

I have a problem with war rooms. Fundamentally, when a war room is called together, the individuals are not looking for the root cause, but looking for data to back up the fact they are not at fault. If I was on a call at 8PM on a Friday, I am looking to get my butt off that call ASAP so I can go out... I mean go do good Christian things like volunteer at a soup kitchen. In most cases a war room can cause more confusion and frustration. They are an illogical approach to resolving an issue because when an application problem happens, it is most likely because of a singular event that cascades to other incidents. The tool should be able to determine what service/process/host/link is causing the end user impacting issues immediately. On top of that, how many times do multiple alarms go off because of a single problem? The tool should be able to use causation to get to the real root cause as opposed to guessing a user action response time anomaly is caused by a singular query degradation:
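
One simplified way to picture "use causation" is to walk a dependency map and keep only the alarming services whose own dependencies are healthy. The sketch below does exactly that and nothing more. The service names and the dependency map are invented, and a real tool would have to discover them on its own.

```python
def root_causes(depends_on, alarming):
    """Return alarming services none of whose direct dependencies are alarming."""
    return sorted(
        service
        for service in alarming
        if not any(dep in alarming for dep in depends_on.get(service, ()))
    )

# Hypothetical dependency map: each service lists what it calls.
depends_on = {
    "checkout-page": ["cart-service"],
    "cart-service": ["inventory-db"],
    "search-page": ["search-index"],
}

# Three alarms fire at once, but they share one underlying problem.
alarming = {"checkout-page", "cart-service", "inventory-db"}

print(root_causes(depends_on, alarming))
```

Three alarms collapse into one suspect, which is a lot easier to act on at 8PM on a Friday.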

Powerful, Simple, and Trustworthy Analysis

If a tool is blaming my component for a problem, it better have some damn (err DARN) good proof that I am at fault. I want to challenge it with questions like: "So, you're telling me this method has high execution time, what is that execution time breakdown?" I want to know that the timeout exception seen at the front end tier is directly correlated to a downstream call:

Data driven decisions are key. Without the proper data, I end up making doubt driven decisions which ultimately do not resolve the open issue.

Deployable to any Environment

Today's world is all about the cloud. Tomorrow's world will probably be microservice, container-based deployments. The solution should address the needs of today, with a roadmap to properly prepare my initiatives for future endeavors. I do not want to work with a tool that is going to rely on the same architecture forever. The tool should already start answering my concerns around the internet of things and Docker-based applications:

Let us not forget about today as well. I still work with mainframes and large ERP-like systems which are not going away any time soon. These applications, although stable most of the time, have their own performance issues. Please make sure we have coverage for these types of systems as well. Something like this should help:

Easy to Setup AND Maintain

Today's applications are not homogeneous. However, the way to get visibility into a .NET process on Windows should be practically the same as monitoring a Docker JVM deployed on Linux. On top of that, I should not have to bring on a certified expert every time I want to change my dashboards or upgrade the monitoring tool. I am going to swing for the fences and say the administration of the tool should be able to get handled by less than one full-time resource. I cannot really portray this in a screenshot, so here is a picture I drew for you:

Thanks Santa!
PS: Give Mrs. Claus my love
PPS: I also want a drone

Friday, August 14, 2015

Containers: The Concept is Small, but the Impact is Big!

Due for a Disrupt

Next year AWS turns 10 years old. The cloud is getting extremely mature and there are multiple cloud providers touting the individual features that make each unique in value. There is a huge following (myself included) that believes "the next cloud" will be containers. The AWS for containers will be Docker. I have attended a few Meetups regarding Docker specifically, and I am always impressed with the innovations that are constantly being released. Docker version 1.8 was released this week and I see it as the first release that makes the solution enterprise friendly. There are a few items to call out specifically in this release that make getting started with Docker that much easier, most notably Docker Toolbox. One of the items I do not see discussed enough is the overall value containers bring to each group within the software development life cycle. Containers are not just a new way to develop and deploy applications. Containers provide significant value to each team that interacts with an application. Let us investigate how containers impact development, deployment, operations, and business teams.

Developer's Dream

Take a look at any downloaded application, and you will see something like this:

To get each one of these icons, developers have to create specialized libraries specific to each operating system. Imagine a world where this is not needed. Imagine the reduction in work required by a developer to produce an application that can be run on any operating system. Imagine if a developer did not have to push out fix packs for supporting a new security patch. This would fundamentally cut down on the time required to reach all potential users of that application. Do you want to develop faster? Containers are the option for you!

Microservices Deploy Faster

It only takes one syntax error to break a deployment. Too many times have I been sitting on a deployment call where someone fat-fingered a host name or forgot to open a port and the whole deployment failed. Microservices are the answer to address these items, and containers effectively ensure that microservices are deployed correctly. If there is an issue with a single configuration, it is easily identifiable from the container's perspective and will only impact one microservice as opposed to corrupting the whole push. There are a lot of players developing container management solutions, which in turn are making microservices extremely easy to manage and control.

Growing Elastically for Production

A question I always ask customers is: "How many concurrent users do you have?" When I do get an answer (which is very infrequently), it is often wrong and not true for every scenario. Virtualization was the original answer to address the requirement for elastic environments. This did work for a time, but now environments are becoming too large to manage. An application developed in containers can grow horizontally and vertically to account for any event (even catastrophic outages). Getting ready for Black Friday? Add more containers. Want to cut down on cloud charges during non-peak hours? Scale down your containers. Want to embrace continuous deployment/integration? Containers allow for the smallest of small incremental changes and are built to handle true A/B testing in any environment! Containers allow for 100% availability of any application that can still be constantly improved without impact to the end user!
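
The "add more containers, scale down your containers" idea boils down to simple arithmetic once you have a concurrent user number. Below is a back-of-the-napkin sketch. The per-container capacity and the minimum and maximum bounds are assumptions I made up, and real orchestration tools use richer signals than a single user count.

```python
import math

def containers_needed(concurrent_users, users_per_container, minimum=2, maximum=50):
    """Return how many containers to run, clamped between minimum and maximum."""
    if users_per_container <= 0:
        raise ValueError("users_per_container must be positive")
    wanted = math.ceil(concurrent_users / users_per_container)
    return max(minimum, min(maximum, wanted))

# Hypothetical capacity: one container comfortably serves 500 users.
print(containers_needed(12_000, 500))   # a busy shopping day
print(containers_needed(300, 500))      # the middle of the night
print(containers_needed(100_000, 500))  # demand beyond the configured ceiling
```

The minimum keeps a spare container around for availability, and the maximum keeps the cloud bill from running away.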

Protecting the Brand

I have the impression that the business is the hardest group to convince that containers are the best strategy going forward. Docker's image of a whale carrying shipping containers does not paint a picture that reflects simple and easy.

However, there is significant value to the brand when utilizing containers. One of the terms business/feature folks always seem to know is Six Sigma, which they tend to use as shorthand for a guarantee that the application will be running practically all of the time. Containers are THE way to ensure that end users are not impacted by hiccups within the delivery chain. On top of this, with the reduction of development time required for cross-OS support, new features can be added into the product more efficiently and can be market tested on a subset of users before being deployed to everyone. If you are the key feature driver of the application, and have ever asked "why can't we get this in quicker", you should have your team investigate containers!

The Better Way is Here!

Will containers become the standard for application development? I strongly believe the answer is yes. Living in the software world, I often see key solutions that benefit a specific group. Containers are a technology that has a key value prop regardless of what team you sit on. I for one am going to hitch my brain to this Docker wagon and see how much value I can derive!

Thursday, July 2, 2015

The Problem With Big Data

Over the past few years, Big Data has been a marquee term in the technology space. Multiple organizations have tried to tackle the fundamental problems that are associated with Big Data analytics. Heavy investments have come into key areas mainly focused around scalability and request efficiency. Even recently, IBM made a huge investment into the Big Data space. Ultimately, Big Data technologies are meant to answer questions based on noticed trends that would be impossible for a user or groups of users to determine themselves. Although these technologies are quite impressive, they often lack an easy way to COLLECT metrics without involving someone familiar with each component's specific outputs. This is the problem with Big Data: the question of "how do I get the data I want to be consumed by the Big Data solution" is still one no one has an industry-recognized solution for.

"Just Get Us the Data, and We'll Take it From There"

If you read my previous article, you know I am borderline obsessed with the phrase "there has to be a better way". There are plenty of examples of Big Data tools utilizing common communication frameworks and protocols. The real effort in this process is figuring out how an application owner pulls out that data and submits it to the Big Data solution with minimal effort. Many times, I see organizations put the effort of outputting the data on developers, who either write REST interfaces into the application or, even worse (from a performance perspective), write out the data to log files. Both of these efforts end up solving the problem, but they could introduce security issues, alongside the fact that you are asking a developer to implement a brand new component into the solution for one-off requests. Try this exercise:

1. Think of a question you would like to ask your application
2. Write down all of the metrics or points of data required to come up with an answer
3. Picture all of the individuals required to get at each one of those points of data
4. What if one of those individuals defines a metric differently, what would be the impact to the answer?
5. Can anyone maliciously use this data if accessed?

These are just the topics you need to cover for answering one question. The only way to prevent this from becoming unbearable is to get to the same results using a different path.

If You are Relying on Logs, You are Doing IT Wrong

Log files are up for interpretation. Did a developer come up with that string that is written? Then there are probably edge cases where that output is not right. Basically, stop writing log files to get at specific points of data. There are solutions out there that will instrument the application and provide a much richer context of what is going on within the stack. This context could never be captured in a string written to the file system. Instead of writing one off messages to solve one off problems, work on how to implement a single framework that can be utilized across the organization to get at all points of data.
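
To show why a developer-written string is "up for interpretation," here is a small sketch that pulls a duration out of a log line with a regular expression, next to the same data emitted as a structured event. The log format, field names, and event name are all invented. The structured event is only a stand-in for richer context, not a substitute for the instrumentation solutions mentioned above.

```python
import json
import re

def parse_duration_ms(log_line):
    """Pull a duration out of a free-form log string, if the format matches."""
    match = re.search(r"took (\d+)ms", log_line)
    return int(match.group(1)) if match else None

# Works for the string the developer had in mind...
print(parse_duration_ms("Order 1234 took 250ms for user bob"))
# ...and silently finds nothing when another code path logs in seconds.
print(parse_duration_ms("Order 1235 took 1.2s for user amy"))

# A structured event carries named, typed fields instead of prose.
event = {"name": "order.processed", "order_id": 1234, "duration_ms": 250, "user": "bob"}
encoded = json.dumps(event)
print(json.loads(encoded)["duration_ms"])
```

The second log line is the edge case: the data is there, but the consumer never sees it because two developers described the same thing differently.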

The Three Points

Coming full circle, a real Big Data implementation comprises three components (similar to monitoring tools): data collection, data analysis, and data presentation. I get the feeling there are a large number of players trying to corner the latter two areas. The data collection area is still extremely green in my eyes, and I am eagerly waiting for someone to really make a play for answering that question.

Sunday, May 31, 2015

Velocity Takeaways

I had the privilege of attending the 2015 Velocity Conference in Santa Clara. It was an amazing show with a ton of great speakers and even more exhibitors. I always find the interactions at the booths are where a lot of knowledge transfer happens. I did hear a pretty distinct pattern in a lot of the offerings though. The pitch went something along the lines of "we get you the metrics to find root cause". This messaging sounds pretty amazing at face value! I believe it is imperative to determine what the speaker means by root cause, and how many scenarios they are ready for. To determine what types of problem patterns a monitoring solution could potentially solve, I look at three components: data collection, data analysis and data presentation. This article will cover the first.

Data Collection

Monitoring solutions can only solve problems they can see. Are they collecting end user clicks? If not, then they cannot determine the impact to end users. Seems simple enough, but you should have an in-depth conversation about the mechanism by which they collect those statistics.

Some tips on collection

1. If they do not have a library or agent somehow injected into the running process, they will not get root cause on the call stack. Examples:

PurePath view from Dynatrace Application Monitoring

sync issues
CPU consumption hogs
exceptions
correlating log messages to called transaction

2. If they do not have network based capture, they will not get network issue resolution. Example:

Network health breakdown from Dynatrace DCRUM

Retransmission issues
Packet loss
Network redirects

3. End user device capture was the biggest opportunity for business at Velocity this year. The key callouts for this to be possible are:

Visit capture screen from Dynatrace User Experience Management

SDK for native devices
Ability to see non-web based transactions (think traditional thick client requirements)
JS agents injected into browser/mobile browser*

*Watch out for W3C-based capturing if you are using AngularJS. W3C timings will not get the same visibility as they do with other frameworks.

Final Thoughts

All in all, the conference was a ton of fun. I was able to see what others in the industry are releasing into this space. I even got a demo of New Relic's and AppDynamics' portals:

Although a lot of organizations were touting the "root cause" messaging, I only saw glimmers of others demonstrating a true root cause analysis. The more complex applications have become, the more powerful monitoring tools have started to pull away in this fight. I have yet to find the silver bullet for performance analysis, and probably never will, but the journey is very entertaining!

Monday, May 18, 2015

DevOps is Not Just a Tool

I was forwarded an interesting article last week and it really made me think. The article in question is written by John Allspaw and entitled "An Open Letter To Monitoring/Metrics/Alerting Companies". The main point John makes is that regardless of how advanced and powerful a piece of software is, there will always be a need for a user on the other side of the screen making sense of the data. Software at its core is just a product. I agree with John that it is disingenuous for a company to pitch their product as "the end all be all solution for better troubleshooting". There are other components in making a true DevOps practice a reality.

I am a huge fan of the CNBC show The Profit. If you have never watched it before, the host, Marcus Lemonis, travels around and invests his personal capital in struggling businesses. He dedicates his time and effort to getting businesses on an exponential growth path. Marcus always refers to a simple formula for success. The formula consists of the three P's: Product, Process, and People. A true DevOps environment will also include the three P's.

Product

Is it all there?

The tools of an organization working on DevOps practices must have the ability to facilitate cross-team conversations. This means that the tool (or tools) implemented must present data in multiple ways that can be interpreted by different teams. So logically, the tool must be able to do one thing above all else: collect data. Analyzing incomplete pictures will always lead to incomplete results. Be wary of solutions that will always "filter out noise". The hardest problems to solve always seem to be in the "noise" of the environment.

Process

It's ok to look down

Why does your company do things the way it does? Why do you have those status calls? Why do you have an internal forum? If you cannot answer these questions, I do not want to talk to you. You should have a clear understanding of the process of your organization and the "why" behind that process. In simpler terms, process is meant for the sake of progress. In a DevOps world the process should facilitate communication between all the groups involved in the SDLC. I have seen some of the best organizations put Ops members on the weekly Dev touch bases and vice versa. Production problems suck. They are stressful, a huge drain on the business and a black eye on the brand. If an organization takes the steps required to allow honest discussions regularly, you will see a noticeable drop in P1 production issues.

People

No computer will replace you

The center of the DevOps movement is the people. Bottom line. To build the best applications, you need the best people. I have worked with some amazing engineers, and some not so amazing engineers. I have sadly had to walk away from potential deals because I know the group that would be responsible for understanding the data was not capable of doing so. The reason why I walk away from those situations is the same reason a chainsaw company will not sell its product to a 10 year old. The child is not ready and will likely hurt itself as opposed to providing any value.

Final Rant

I hate the fact that organizations look down on services when evaluating potential purchases. Requiring services does not mean the product is hard to understand. It means your organization requires some assistance in perfecting your individual products, process, and people. You cannot honestly think your team will instantly become better just by installing something in your environment. Just like learning anything new, it will require changes on all three fronts to truly become a DevOps shop.