October 26, 2021

ipSpace.net Blog (Ivan Pepelnjak)

Interactions Between BFD and Graceful Restart

We have school holidays this week, so I’m reposting wonderful comments that would otherwise be lost somewhere in the page margins. Today: Dmitry Perets on the interactions between BFD and GR.


Well, assuming that the C-bit is set honestly (will be funny if not) and assuming that the Helper is using this bit correctly (and I think it’s pretty well defined what “correctly” means - see section 4.3 in RFC 5882), the answer is pretty clear.
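
The helper-side reasoning in the comment can be sketched as a small decision function. This is a simplified reading of RFC 5882 section 4.3, not a transcription of it; the names `HelperAction` and `helper_action` are made up for illustration, and the ambiguous case is deliberately left to local policy.

```python
from enum import Enum


class HelperAction(Enum):
    CONTINUE_GR = "continue graceful restart"
    ABORT_GR = "abort graceful restart and reconverge"
    LOCAL_POLICY = "ambiguous: decide by local policy"


def helper_action(bfd_session_down: bool, c_bit_set: bool) -> HelperAction:
    """Simplified GR helper decision based on BFD state and the neighbor's C-bit."""
    if not bfd_session_down:
        # Forwarding plane still looks alive, so keep helping the restarting neighbor.
        return HelperAction.CONTINUE_GR
    if c_bit_set:
        # BFD is independent of the control plane, so a BFD failure
        # points at a real forwarding-plane failure.
        return HelperAction.ABORT_GR
    # BFD shares fate with the control plane: its failure may simply be
    # a side effect of the restart itself.
    return HelperAction.LOCAL_POLICY


print(helper_action(bfd_session_down=True, c_bit_set=True).value)
```

The interesting branch is the last one: without an honest C-bit the helper cannot tell a restarting control plane from a dead forwarding plane.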

October 26, 2021 06:52 AM

October 25, 2021

Packet Pushers

The Three Key Factors For a Successful SD-WAN Project

Aruba Networks has collected customer best practices and guidance on what it takes for a successful SD-WAN implementation. Those stories and best practices have been captured in a new eBook from Aruba, available now.

The post The Three Key Factors For a Successful SD-WAN Project appeared first on Packet Pushers.

by Sponsored Blog Posts at October 25, 2021 07:50 PM

ipSpace.net Blog (Ivan Pepelnjak)

Feedback: How Networks Really Work

A few weeks ago I asked my subscribers which webinar they’d like to see in November (thanks a million to everyone who replied!). Not surprisingly, network automation got the top spot, but I was a bit sad to see my long-term pet project at the bottom of the list:

October 25, 2021 08:14 AM

XKCD Comics

October 24, 2021

ipSpace.net Blog (Ivan Pepelnjak)

Worth Reading: Making a Case for Automation Architecture

In case you’re ever asked to justify an investment in network automation, read How to Make the Case for Automation Architecture first. Not surprisingly, it includes the evergreen “what problem are you trying to solve?”

October 24, 2021 07:15 AM

October 23, 2021

ipSpace.net Blog (Ivan Pepelnjak)

Worth Reading: Network Validation Evolution at Hostinger

Network validation is becoming another overhyped buzzword with many opinionated pundits talking about it and few environments using it in practice (why am I not surprised?)

As always, there are exceptions. They don’t have to be members of the FAANG club, and some of them get the job done with open-source tools regardless of what vendor marketers would like you to believe. For example, Donatas Abraitis described how the Hostinger networking team gradually implemented network validation using Cumulus VX, Vagrant, SuzieQ, PyTest and Test Kitchen. Enjoy!

October 23, 2021 06:54 AM

October 22, 2021

The Networking Nerd

Fast Friday Thoughts From Security Field Day

It’s a busy week for me thanks to Security Field Day but I didn’t want to leave you without some thoughts that have popped up this week from the discussions we’ve been having. Security is one of those topics that creates a lot of thought-provoking ideas and makes you seriously wonder if you’re doing it right all the time.

  • Never underestimate the value of having plumbing that connects all your systems. You may look at a solution and think to yourself “All this does is aggregate data from other sources”. Which raises the question: How do you do it now? Sure, antivirus fires alerts like a car alarm. But when you get breached and find out that those alerts caught it weeks ago you’re going to wish you had a better idea of what was going on. You need a way to send that data somewhere to be dealt with and cataloged properly. This is one of the biggest reasons why machine learning is being applied to the massive amount of data we gather in security. Having an algorithm working to find the important pieces means you don’t miss things that are important to you.
  • Not every solution is going to solve every problem you have. My dishwasher does a horrible job of washing my clothes or vacuuming my carpets. Is it the fault of the dishwasher? Or is it my issue with defining the problem? We need to scope our issues and our solutions appropriately. Just because my kitchen knives can open a package in a pinch doesn’t mean that the makers need to include package-opening features in a future release because I use them exclusively for that purpose. Once we start wanting the vendors to build a one-stop-shop kind of solution we’re going to create the kind of technical debt that we need to avoid. We also need to remember to scope problems so that they’re solvable. Postulating that there are corner cases with no clear answers is important for threat hunting or policy creation. Not so great when shopping through a catalog of software.
  • Every term in every industry is going to have a different definition based on who is using it. A knife to me is either a tool used on a campout or a tool used in a kitchen. Others see a knife as a tool for spreading butter or even doing surgery. It’s a matter of perspective. You need to make sure people know the perspective you’re coming from before you decide that the tool isn’t going to work properly. I try my best to put myself in the shoes of others when I’m evaluating solutions or use cases. Just because I don’t use something in a certain way doesn’t mean it can’t be used that way. And my environment is different from everyone else’s. Which means best practices are really just recommended suggestions.
  • Whatever acronym you’ve tied yourself to this week is going to change next week because there’s a new definition of what you should be doing according to some expert out there. Don’t build your practice on whatever is hot in the market. Build it on what you need to accomplish and incorporate elements of new things into what you’re doing. The story of people ripping and replacing working platforms because of an analyst suggestion sounds horrible but happens more often than we’d like to admit. Trust your people, not the brochures.

Tom’s Take

Security changes faster than any area that I’ve seen. Cloud is practically a glacier compared to EPP, XDR, and SOPV. I could even make up an acronym and throw it on that list and you might not even notice. You have to stay current but you also have to trust that you’re doing all you can. Breaches are going to happen no matter what you do. You have to hope you’ve done your best and that you can contain the damage. Remember that good security comes from asking the right questions instead of just plugging tools into the mix to solve issues you don’t have.

by networkingnerd at October 22, 2021 04:23 PM

ipSpace.net Blog (Ivan Pepelnjak)

Video: Introduction to AI/ML Hype

In May 2021, Javier Antich ran a great webinar explaining the principles of Artificial Intelligence and Machine learning and how they apply (or not) to networking.

He started with a brief overview of AI/ML hype that should help you understand why there’s a bit of a difference between self-driving cars (not that we got there) and self-driving networks.

You need a Free ipSpace.net Subscription to access this webinar.

October 22, 2021 10:10 AM

XKCD Comics

October 21, 2021

Packet Pushers

Reimagining ‘Show IP Interface Brief’

Welcome back! In my first post here on Packet Pushers (Applying A Software Design Pattern To Network Automation – Packet Pushers) we explored the Model View Controller (MVC) software design pattern and how it can be applied to network automation. This post will go a little deeper into how this is achieved and the mix […]

The post Reimagining ‘Show IP Interface Brief’ appeared first on Packet Pushers.

by John Capobianco at October 21, 2021 01:00 PM

ipSpace.net Blog (Ivan Pepelnjak)

Circular Dependencies Considered Harmful

A while ago my friend Nicola Modena sent me another intriguing curveball:

Imagine a CTO who has invested millions in a super-secure data center and wants to consolidate all compute workloads. If you were asked to run a BGP Route Reflector as a VM in that environment, and would like to bring OSPF or ISIS to that box to enable BGP ORR, would you use a GRE tunnel to avoid a dedicated VLAN or boring other hosts with routing protocol hello messages?

While there might be good reasons for doing that, my first knee-jerk reaction was:

October 21, 2021 06:48 AM

October 20, 2021

Potaroo blog

Fifty Years On

What's likely to happen in computer networking in the next 50 years? Let's polish up the crystal ball and see what awaits!

October 20, 2021 11:00 PM

Packet Pushers

Automating Data Center VXLAN/EVPN Using CI/CD: Gluware LiveStream Video [6/8]

Chris DiPaola, Senior Systems Engineer – Network at Acuity, chats with Ethan Banks of the Packet Pushers about Acuity’s EVPN/VXLAN network. Chris & his team used the Gluware API to automate their EVPN deployments, all while tied into their company’s CI/CD pipeline. If Gluware might be a fit for your network automation needs, visit here. […]

The post Automating Data Center VXLAN/EVPN Using CI/CD: Gluware LiveStream Video [6/8] appeared first on Packet Pushers.

by The Video Delivery at October 20, 2021 04:00 PM

ipSpace.net Blog (Ivan Pepelnjak)

Do We Need Multiple Global IPv6 Addresses Per Interface (RFC 7934)

I was happily munching popcorn while watching the latest season of the Lack of DHCPv6 on Android soap opera on the v6ops mailing list when one of the lead actors, trying to justify the current state of affairs with a technical argument, quoted an RFC to prove his rightful indignation with DHCPv6 and the decision not to implement it in Android:

[…not having multiple IPv6 addresses per interface…] is also harmful for a variety of reasons, and for general purpose devices, it’s not recommended by the IETF. That’s exactly what RFC 7934 is about - explaining why it’s harmful.

If you’re new to this discussion, you might want to start with the Why Does DHCPv6 Matter blog post.

October 20, 2021 06:21 AM

XKCD Comics

October 19, 2021

Packet Pushers

Aruba Puts DPUs Into New Top-of-Rack Switch – 5 Questions

Aruba Networks has announced a new top-of-rack switch that includes two Data Processing Units from Pensando that can offload and accelerate functions such as stateful firewalling and DDoS protection. How does Aruba's approach compare to other methods for distributing services in a data center?

The post Aruba Puts DPUs Into New Top-of-Rack Switch – 5 Questions appeared first on Packet Pushers.

by Drew Conry-Murray at October 19, 2021 09:14 PM

ipSpace.net Blog (Ivan Pepelnjak)

Graceful Restart and BFD

The whole High Availability Switching series started with a question along the lines of “does it make sense to run BFD together with Graceful Restart”. After Non-Stop Forwarding 101, Graceful Restart 101, and Graceful Restart and Convergence Speed we finally have enough information to answer that question.

TL&DR: Most probably not.

A more nuanced answer depends (as always) on a gazillion implementation details.

October 19, 2021 06:51 AM

October 18, 2021

Packet Pushers

Today’s Scripts Are Tomorrow’s Technical Debt: Gluware LiveStream Video [5/8]

Michael Haugh, VP Of Product Marketing at Gluware, joins Greg Ferro of the Packet Pushers for a discussion of building a network automation system on top of a platform instead of DIY with Python, Ansible, etc. If Gluware might be a fit for your network automation needs, visit here. Thanks! You can subscribe to the […]

The post Today’s Scripts Are Tomorrow’s Technical Debt: Gluware LiveStream Video [5/8] appeared first on Packet Pushers.

by The Video Delivery at October 18, 2021 04:00 PM

ipSpace.net Blog (Ivan Pepelnjak)

netsim-tools: Start a Virtual Lab with a Single Command

In mid-October I finally found time to add the icing to the netsim-tools cake: the netlab up command takes a lab topology and does everything needed to have a running virtual lab:

  • Create Vagrantfile or containerlab topology file
  • Create Ansible inventory
  • Start the lab with vagrant up or containerlab deploy
  • Deploy device configurations, from LLDP and interface addressing to routing protocols and Segment Routing

October 18, 2021 06:57 AM

XKCD Comics

October 17, 2021

ipSpace.net Blog (Ivan Pepelnjak)

Worth Reading: The Software Industry IS STILL the Problem

Every other blue moon someone writes (yet another) article along the lines of “professional liability would solve so many broken things in the IT industry.” This time it’s Poul-Henning Kamp of the FreeBSD and Varnish fame with The Software Industry IS STILL the Problem. Unfortunately it’s just another stab at the windmills considering how much money that industry pours into lobbying.

October 17, 2021 07:11 AM

October 16, 2021

ipSpace.net Blog (Ivan Pepelnjak)

MUST READ: ARP Problems in EVPN

Decades ago there was a trick question on the CCIE exam exploring the intricate relationships between the MAC and ARP tables. I always understood the explanation for about 10 minutes and then I was back to “I knew why that’s true, but now I lost it.”

Fast forward 20 years, and we’re still seeing the same challenges, this time in EVPN networks using in-subnet proxy ARP. For more details, read the excellent ARP problems in EVPN article by Dmytro Shypovalov (I understood the problem after reading the article, and now it’s all a blur 🤷‍♂️).

October 16, 2021 07:04 AM

October 15, 2021

Potaroo blog

DNSSEC with RSA-4096 keys

The role of cryptography is to keep one step ahead of advances in computing capability. One response is to keep using the same algorithm, but extend the key lengths. Here we look at the viability of DNSSEC when we use a 4,096-bit RSA key.
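
A quick back-of-the-envelope sketch shows why key length matters for DNS response sizes: an RSA signature is as long as the key's modulus, so every signature in a response grows with the key. The helper name below is made up for illustration, and the sketch ignores all other record overhead.

```python
def rsa_signature_bytes(key_bits: int) -> int:
    """An RSA signature is as long as the modulus, rounded up to whole bytes."""
    return (key_bits + 7) // 8


for bits in (1024, 2048, 4096):
    print(f"RSA-{bits}: {rsa_signature_bytes(bits)}-byte signature")
```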

October 15, 2021 07:00 PM

The Networking Nerd

Choosing the Least Incorrect Answer

My son was complaining to me the other day that he missed one question on a multiple choice quiz in his class and he got a low B grade instead of getting a perfect score. When I asked him why he was frustrated he told me, “Because it was easy and I missed it. But I think the question was wrong.” As usual, I pressed him further to explain his reasoning and found out that the question was indeed ambiguous but the answer choices were pretty obviously wrong all over. He asked me why someone would write a test like that. Which is how he got a big lesson on writing test questions.

Spin the Wheel

When you write a multiple choice test question for any reputable exam you are supposed to pick “wrong” answers, known as distractors, that ensure that the candidate doesn’t have a better than 25% chance of guessing the correct answer. You’ve probably seen this before because you took some kind of simple quiz that had answers that were completely wrong to the point of being easy to pick out. Those quizzes are usually designed to be passed with the minimum amount of effort.

This also extends to a question that includes answer choices that are paired. If you write a question that says “pick the three best answers” with six options that are binary pairs you’re basically saying to the candidate “Pick between these two three times and you’re probably going to get it right”. I’ve seen a number of these kinds of questions over the years and it feels like a shortcut to getting one on the house.

The most devious questions come from the math side of the house. Some of my friends have been known to write questions for their math tests and purposely work the problem wrong at a critical point to get a distractor that looks very plausible. You make the same mistake and you’re going to see the correct answer in the choices and get it wrong. The extra effort here matters because if you see too many students getting the same wrong distractor as the answer you know that there may be confusion about the process at that critical point. Also, the effort to make math question distractors look plausible is impressive and way too time consuming.

Why Is It Wrong?

Compelling distractors are a requirement for any sufficiently advanced testing platform. The professionals that write the tests understand that guessing your way through a multiple choice exam is a bad precedent and the whole format needs to be fair. The secret to getting a leg up on these exams is more than just knowing the right answer. It’s about knowing why things are wrong.

Take an easy example: OSPF LSAs. A question may ask you about a particular router in a diagram and ask you which LSAs it sees. If the answer choices are fairly configured you’re going to be faced with some plausible looking answers. Say the question is about a not-so-stubby-area (NSSA). If you know the specifics of what makes this area unique you can start eliminating choices from the question. What if it’s asking about which LSAs are not allowed? Well, if you forgot the answer to that you can start by reading the answer choices and applying logic.

You can usually improve your chances of getting a question right by figuring out why the answers given are wrong for the question. In the above example, if LSA Type 1 is listed as an answer choice ask yourself “Why is this the wrong answer?” For the question about disallowed LSA types you can eliminate this choice because LSA Type 1 is always present inside an area. For a question about visibility of that LSA outside of an area you’d be asking a different question. But if you know that Type 1 LSAs are local and always visible you can cross off that as a potential answer. That means you boosted your chances of guessing the answer to 33%!
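
The arithmetic behind that 33% is easy to check. This small sketch assumes a question with four answer choices and exactly one correct answer; the function name is made up for illustration.

```python
from fractions import Fraction


def guess_probability(choices: int, eliminated: int = 0) -> Fraction:
    """Chance of guessing right when picking at random among the remaining choices."""
    remaining = choices - eliminated
    if remaining < 1:
        raise ValueError("at least one answer choice must remain")
    return Fraction(1, remaining)


for eliminated in range(3):
    p = guess_probability(4, eliminated)
    print(f"{eliminated} eliminated: {float(p):.0%}")
```

Every distractor you can rule out moves you from 25% toward a coin flip.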

The question itself is easy if you know that NSSAs use Type 7 LSAs to convey information because Type 5 LSAs aren’t allowed. But if you understand why the other answers are wrong for the question asked you can also check your work. Why would you want to do that? Because the wording of the question can trip you up. How many times have you skimmed the question looking for keywords and missing things like “not” or “except”? If you work the question backwards looking for why answers are wrong and you keep coming up with them being right you may have read the question incorrectly in the first place. Likewise, if every answer is wrong somehow you may have a bad question on your hands.
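
The elimination approach can be sketched as a lookup. The table below is a simplified textbook view of which LSA types show up in each area type; treat it as an assumption for illustration rather than a complete statement of OSPF behavior, and the function name is hypothetical.

```python
# Simplified, assumed view of LSA types present per OSPF area type
ALLOWED_LSA_TYPES = {
    "normal": {1, 2, 3, 4, 5},
    "stub": {1, 2, 3},
    "nssa": {1, 2, 3, 7},
}


def disallowed_lsas(area_type: str, answer_choices: list) -> list:
    """Return the answer choices that are NOT allowed in the given area type."""
    allowed = ALLOWED_LSA_TYPES[area_type]
    return [lsa for lsa in answer_choices if lsa not in allowed]


# "Which LSA type is not allowed in an NSSA?" with choices 1, 3, 5 and 7
print(disallowed_lsas("nssa", [1, 3, 5, 7]))
```

If the function returns more than one choice, or none at all, that is your hint that you either misread the question or are looking at a bad one.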

What happens if the question is poorly worded and all the answer choices are wrong? Well, that’s when you get to pick the least incorrect answer and leave feedback. It’s not about picking the perfect answer in these situations. You have to know that a lot of hands touch test questions and there are times when things are rewritten and the intent can be changed somehow. If you know that you are dealing with a question that is ambiguous or flat-out wrong you should leave feedback in the question comments so it can be corrected. But you still have to answer the question. So, use the above method to find the piece that is the least incorrect and go with that choice. It may not be “right” according to the test question writer, but if enough people pick that answer you’re going to see someone taking a hard look at the question.


Tom’s Take

We are going to take a lot of tests in our lives. Multiple choice tests are easier but require lots of work, both on the part of the writer and the taker. It’s not enough to just memorize what the correct answers are going to be. If you study hard and understand why the distractors are incorrect you’ll have a more complete understanding of the material and you’ll be able to check your work as you go along. Given that most certification exams don’t allow you to go back and change answers once you’ve moved past the question the ability to check yourself in real time gives you an advantage that can mean the difference between passing and retaking the exam. And that same approach can help you when everything on the page looks wrong.

by networkingnerd at October 15, 2021 05:11 PM

ipSpace.net Blog (Ivan Pepelnjak)

Lessons Learned: Complexity Will Kill Your System

You wouldn’t believe the intricate network designs I created decades ago until I learned that having an uninterrupted sleep is worth more than proving I can get the impossible to work (see also: using EBGP instead of IGP in a 4-node data center fabric).

Once I started valuing my free time, I tried to design things to be as simple as possible. However, as my friend Nicola Modena once said, “Consultants must propose new technologies because they must be seen as bringing innovation,” and we all know complexity sells. Go figure.

You’ll need a Free ipSpace.net Subscription to watch the video.

October 15, 2021 06:49 AM

XKCD Comics

October 14, 2021

Packet Pushers

Flexible Automation For A Complex Enterprise: Gluware LiveStream Video [4/8]

Angelo Rossi, GNS LAN-WAN Architect at WSP joins Drew Conry-Murray of the Packet Pushers to explain how WSP automated their brownfield network with Gluware. If Gluware might be a fit for your network automation needs, visit here. Thanks! You can subscribe to the Packet Pushers’ YouTube channel for more videos as they are published. It’s […]

The post Flexible Automation For A Complex Enterprise: Gluware LiveStream Video [4/8] appeared first on Packet Pushers.

by The Video Delivery at October 14, 2021 04:00 PM

ipSpace.net Blog (Ivan Pepelnjak)

BGP Optimal Route Reflection 101

Almost a decade ago I described a scenario in which a perfectly valid IBGP topology could result in a permanent routing loop. While one wouldn’t expect to see such a scenario in a well designed network, it’s been known for ages that using BGP route reflectors could result in suboptimal forwarding.
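
A toy illustration of the suboptimal forwarding problem, assuming two exit routers advertise the same prefix with otherwise identical BGP attributes, so the IGP cost to the next hop decides. The router names and costs are made up.

```python
def closest_exit(igp_cost: dict) -> str:
    """Pick the exit router with the lowest IGP cost from one vantage point."""
    return min(igp_cost, key=igp_cost.get)


# Hypothetical IGP costs to the two exit routers
cost_from_reflector = {"exit_A": 10, "exit_B": 50}
cost_from_client = {"exit_A": 60, "exit_B": 5}

reflected = closest_exit(cost_from_reflector)  # best path from the reflector's viewpoint
optimal = closest_exit(cost_from_client)       # what the client would have picked itself

print(f"reflector advertises {reflected}, client would prefer {optimal}")
```

Optimal Route Reflection addresses the mismatch by having the reflector run the selection from the client's position in the IGP topology instead of its own.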

Here’s a simple description of how that could happen:

October 14, 2021 06:23 AM

October 13, 2021

ipSpace.net Blog (Ivan Pepelnjak)

Why Does DHCPv6 Matter?

In case you missed it, there’s a new season of the Lack of DHCPv6 on Android soap opera on the v6ops mailing list. Before going into the juicy details, I wanted to look at the big picture: why would anyone care about lack of DHCPv6 on Android?

Please note that I’m not a DHCPv6 fan. DHCPv6 is just a tool not unlike a sink plunger – nobody loves it (I hope), but when you need it, you’d better have it handy.

The requirements for DHCPv6-based address allocation come primarily from enterprise environments facing legal/compliance/other layer 8-10 reasons to implement policy (are you allowed to use the network), control (we want to decide who uses the network) and attribution (if something bad happens, we want to know who did it).

October 13, 2021 06:35 AM

XKCD Comics

October 12, 2021

Packet Pushers

Evolving From CLI To Infrastructure-as-Code: Gluware LiveStream Video [3/8]

Chris Ellerman, VP of Presales + Service Delivery at Gluware, chats with Ethan Banks of the Packet Pushers about Gluware’s multi-vendor network device data modeling, no matter if that device is CLI-only, API-capable, or cloud-native. If Gluware might be a fit for your network automation needs, visit here. Thanks! You can subscribe to the Packet […]

The post Evolving From CLI To Infrastructure-as-Code: Gluware LiveStream Video [3/8] appeared first on Packet Pushers.

by The Video Delivery at October 12, 2021 04:00 PM

ipSpace.net Blog (Ivan Pepelnjak)

Graceful Restart and Routing Protocol Convergence

I’m always amazed when I encounter networking engineers who want to have a fast-converging network using Non-Stop Forwarding (which implies Graceful Restart). It’s even worse than asking for smooth-running heptagonal wheels.

As we discussed in the Fast Failover series, any decent router uses a variety of mechanisms to detect adjacent device failure:

  • Physical link failure;
  • Routing protocol timeouts;
  • Next-hop liveliness checks (BFD, CFM…)
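
To put rough numbers on the difference between these mechanisms: BFD declares a session down after roughly the negotiated packet interval times the detect multiplier. The timer values below are assumed examples, not recommendations, and the function name is made up.

```python
def bfd_detection_time_ms(interval_ms: int, detect_multiplier: int) -> int:
    """Time without received BFD packets before the session is declared down."""
    return interval_ms * detect_multiplier


# Assumed example timers for comparison
detection_ms = {
    "BFD (300 ms x 3)": bfd_detection_time_ms(300, 3),
    "routing protocol hold timer (assumed 40 s)": 40_000,
}

for mechanism, ms in detection_ms.items():
    print(f"{mechanism}: failure detected after ~{ms / 1000:.1f} s")
```

Graceful Restart asks the neighbors to ignore exactly the kind of failure these timers are built to catch, which is where the heptagonal wheels come in.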

October 12, 2021 06:35 AM

October 11, 2021

ipSpace.net Blog (Ivan Pepelnjak)

New Content in AWS Networking Webinar

Last week’s update session of the AWS Networking webinar covered two hours’ worth of new (or not-yet-covered) features, including:

  • Transit Gateway Connect functionality (GRE tunnel+BGP between Transit Gateway and in-cloud SD-WAN appliances)
  • AWS Private Link
  • Intra-VPC static routes that you can use to send inter-subnet traffic to a BYOD security appliance
  • IGMPv2 support
  • Custom global accelerators
  • Assigning whole IP prefixes to VM interfaces

The recordings have already been published, either as independent videos or integrated with the existing materials. Enjoy ;)

October 11, 2021 06:34 AM

XKCD Comics

October 10, 2021

About Networks

How to simulate a host in a real network?

Like some other posts, I didn’t think I would write this one because it seemed obvious. But, after talking to a lot of engineers and customers, I realized that not everyone knows this trick. So here it is. The question is this: how to simulate a real host in a physical network environment when you don’t have a computer at your disposal? Well, let’s take an example.

The environment

Here is an example with a very simple VXLAN topology consisting of two spines and two leafs. I’m using Cisco Nexus switches…

The post How to simulate a host in a real network? appeared first on AboutNetworks.net.

by Jerome Tissieres at October 10, 2021 05:01 PM

ipSpace.net Blog (Ivan Pepelnjak)

OMG: Democratizing Network Automation

I totally understand that entities relying on sponsors have to become creative while promoting whatever their sponsors want to sell, but in my opinion this is a bridge too far:

[…] explore how Gluware aims to democratize automation; that is, get you quick wins around common tasks such as configuration changes and OS updates.

Democratizing automation? Because it’s authoritarian now? By providing the abilities like configuration changes and OS updates that have been available in network management tools like CiscoWorks or SolarWinds for ages?

You know what’s really hard when automating existing networks? Figuring out how to simplify them to the point where it makes sense to automate them. Will any shrink-wrapped GUI product solve that? Of course not.

October 10, 2021 06:45 AM

October 09, 2021

ipSpace.net Blog (Ivan Pepelnjak)

Must Read: BGP Private AS Range

We all know that you have to use an AS number between 64512 and 65535 for private BGP autonomous systems, right? Well, we’re all wrong – the high end of the range is 65534, and Chris Parker wrote a nice blog post explaining the reasons behind that change.
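
If you validate AS numbers in your automation scripts, the corrected range is easy to encode. The sketch below follows the ranges reserved for private use in RFC 6996 (which also covers 32-bit AS numbers, not discussed above); the function name is made up for illustration.

```python
# Private-use AS number ranges per RFC 6996 (inclusive)
PRIVATE_ASN_RANGES = (
    (64512, 65534),            # 16-bit private range
    (4200000000, 4294967294),  # 32-bit private range
)


def is_private_asn(asn: int) -> bool:
    """Return True if the AS number falls into a private-use range."""
    return any(low <= asn <= high for low, high in PRIVATE_ASN_RANGES)


for asn in (64512, 65534, 65535):
    print(asn, "private" if is_private_asn(asn) else "not private")
```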

October 09, 2021 06:55 AM

October 08, 2021

The Networking Nerd

What Can You Learn From Facebook’s Meltdown?

I wanted to wait to put out a hot take on the Facebook issues from earlier this week because failures of this magnitude always have details that come out well after the actual excitement is done. A company like Facebook isn’t going to do the kind of in-depth post-mortem that we might like to see but the amount of information coming out from other areas does point to some interesting circumstances causing this situation.

Let me start off the whole thing by reiterating something important: Your network looks absolutely nothing like Facebook. The scale of what goes on there is unimaginable to the normal person. The average person has no conception of what one billion looks like. Likewise, the scale of the networking that goes on at Facebook is beyond the ken of most networking professionals. I’m not saying this to make your network feel inferior. More that I’m trying to help you understand that your network operations resemble those at Facebook in the same way that a model airplane resembles a space shuttle. They’re alike on the surface only.

Facebook has unique challenges that they have to face in their own way. Network automation there isn’t a bonus. It’s a necessity. The way they deploy changes and analyze results doesn’t look anything like any software we’ve ever used. I remember moderating a panel that had a Facebook networking person talking about some of the challenges they faced all the way back in 2013:

<figure class="wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio">
<iframe allowfullscreen="true" class="youtube-player" height="329" sandbox="allow-scripts allow-same-origin allow-popups allow-presentation" src="https://www.youtube.com/embed/mcrRQFRBpVw?version=3&amp;rel=1&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;fs=1&amp;hl=en&amp;autohide=2&amp;wmode=transparent" style="border:0;" width="584"></iframe>
</figure>

That technology that Najam Ahmad is talking about is two or three generations removed from what is being used today. They don’t manage switches. They manage racks and rows. They don’t buy off-the-shelf software to do things. They write their own tools to scale the way they need them to scale. It’s not unlike a blacksmith making a tool for a very specific use case that would never be useful to any non-blacksmith.

Ludicrous Case Scenarios

One of the things that compounded the problems at Facebook was the inability to see what the worst case scenario could bring. The little clever things that Facebook has done to make their lives easier and improve reaction times ended up harming them in the end. I’ve talked before about how Facebook writes things from a standpoint of unlimited resources. They build their data centers as if the network will always be available and bandwidth is an unlimited resource that never has contention. The average Facebook programmer likely never lived in a world where a dial-up modem was high-speed Internet connectivity.

To that end, the way they build the rest of their architecture around those presumptions creates the possibility of absurd failure conditions. Take the report of the door entry system. According to reports, part of the reason things were slow to come back up was that the door entry system for the Facebook data centers wouldn’t allow access to the people that knew how to revert the changes that caused the issue. Usually, the card readers will retain their last good configuration in the event of a power outage to ensure that people with badges can access the system. It could be that the ones at Facebook work differently or just went down with the rest of their network. But whatever the case, the card readers weren’t allowing people into the data center. Another report says that the doors didn’t even have the ability to be opened by a key. That’s the kind of planning you do when you’ve never had to break open a locked door.

Likewise, I find the situation with the DNS servers to be equally crazy. Per other reports the DNS servers at Facebook are constantly monitoring connectivity to the internal network. If that goes down for some reason the DNS servers withdraw the BGP routes being advertised for the Facebook AS until the issue is resolved. That’s what caused the outage from the outside world. Why would you do this? Sure, it’s clever to basically have your infrastructure withdraw the routing info in case you’re offline to ensure that users aren’t hammering your system with massive amounts of retries. But why put that decision in the hands of your DNS servers? Why not have some other more reliable system do it instead?
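
The mechanism described in those reports boils down to a health-check-driven advertisement decision. The sketch below is not Facebook's implementation; the site names, probe results, and function names are all hypothetical. It only shows how a rule that looks sensible per site behaves when the internal network fails everywhere at once.

```python
def should_advertise(probe_results, min_healthy=1):
    """Advertise the prefix only while enough internal connectivity probes succeed."""
    return sum(1 for ok in probe_results if ok) >= min_healthy


def advertising_sites(probes_by_site, min_healthy=1):
    """List the sites that would keep advertising their routes."""
    return [
        site
        for site, probes in probes_by_site.items()
        if should_advertise(probes, min_healthy)
    ]


# Hypothetical probe results per site
partial_failure = {"site-a": [True, True], "site-b": [False, False]}
backbone_down = {"site-a": [False, False], "site-b": [False, False]}

print(advertising_sites(partial_failure))  # the healthy site keeps advertising
print(advertising_sites(backbone_down))    # every site withdraws at the same time
```

Each site makes a locally reasonable choice, and the sum of those choices is that nobody advertises anything.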

I get that the mantra at Facebook has always been “fail fast” and that their architecture is built in such a way as to encourage individual systems to go down independently of others. That’s why Messenger can be down but the feed stays up or why WhatsApp can have issues but you can still use Instagram. However, why was there no test of “what happens when it all goes down?” It could be that the idea of the entire network going offline is unthinkable to the average engineer. It could also be that the response to the whole network going down all at once was to just shut everything down anyway. But what about the plan for getting back online? Or, worse yet, what about all the things that impacted the ability to get back online?

Fruits of the Poisoned Tree

That’s where the other part of my rant comes into play. It’s not enough that Facebook didn’t think ahead to plan on a failure of this magnitude. It’s also that their teams didn’t think of what would be impacted when it happened. The door entry system. The remote tools used to maintain the networking equipment. The ability for anyone inside the building to do anything. There was no plan for what could happen when every system went down all at once. Whether that was because no one knew how interdependent those services were or because no one could think of a time when everything would go down all at once is immaterial. You need to plan for the worst and figure out what dependencies look like.
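
Figuring out what dependencies look like can start with something as simple as writing them down and checking for loops. The dependency map below is a made-up reading of the reports mentioned above, not a description of Facebook's actual systems, and `find_cycle` is an illustrative helper using a plain depth-first search.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {node: WHITE for node in deps}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for dep in deps.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GREY:
                # Found a path back to something we are still exploring.
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical "X needs Y to recover" relationships
recovery_deps = {
    "backbone_network": ["config_rollback"],
    "config_rollback": ["data_center_access"],
    "data_center_access": ["door_entry_system"],
    "door_entry_system": ["backbone_network"],
}

cycle = find_cycle(recovery_deps)
print(" -> ".join(cycle) if cycle else "no circular dependencies")
```

Any cycle this finds is a place where you need an out-of-band answer, like a physical key.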

Amazon learned this the hard way a few years ago when US-East-1 went offline. No one believed it at the time because the status dashboard still showed green lights. The problem? The board was hosted on the zone that went down and the lights couldn’t change! That problem was remedied soon afterwards but it was a chuckle-worthy issue for sure.

Perhaps it’s because I work in an area where disasters are a bit more common but I’ve always tried to think ahead to where the issues could crop up and how to combat them. What if you lose power completely? What if your network connection is offline for an extended period? What if the same tornado that takes out your main data center also wipes out your backup tapes? It might seem a bit crazy to consider these things but the alternative is not having an answer in the off chance it happens.

In the case of Facebook, the question should have been “what happens if a rogue configuration deployment takes us down?” The answer better not be “roll it back” because you’re not thinking far enough ahead. With the scale of their systems it isn’t hard to create a change to knock a bunch of it offline quickly. Most of the controls that are put in place are designed to prevent that from happening but you need to have a plan for what to do if it does. No one expects a disaster. But you still need to know what to do if one happens.

Thus Endeth The Lesson

What we need to take away from this is that our best intentions can’t defeat the unexpected. Most major providers were silent on the schadenfreude of the situation because they know they could have been the one to suffer from it. You may not have a network like Facebook but you can absolutely take away some lessons from this situation.

You need to have a plan. You need to have a printed copy of that plan. It needs to be stored in a place where people can find it. It needs to be written in a way that people who find it can implement it step-by-step. You need to keep it updated to reflect changes. You need to practice for disaster and quit assuming that everything will keep working correctly 100% of the time. And you need to have a backup plan for everything in your environment. What if the doors seal shut? What if the person with the keys to unlock the racks is missing? How do we ensure the systems don’t come back up in a degraded state before they’re ready? The list is endless but that’s only because you haven’t started writing it yet.


Tom’s Take

There is going to be a ton of digital ink spilled on this outage. People are going to ask questions that don’t have answers and pontificate about how it could have been avoided. Hell, I’m doing it right now. However, I think the issues that compounded the problems are ones that can be addressed no matter what technology you’re using. Backup plans are important for everything you do, from campouts to dishwasher installations to social media websites. You need to plan for the worst and make sure that the people you work with know where to find the answers when everything fails. This is the best kind of learning experience because so many eyes are on it. Take what you can from this and apply it where needed in your enterprise. Your network may not look anything like Facebook, but with some planning now you don’t have to worry about it crashing like theirs did either.

by networkingnerd at October 08, 2021 04:21 PM

Packet Pushers

Intelligent Low Code Network Automation: Gluware LiveStream Video [2/8]

Gluware’s Michael Haugh, VP of Product Marketing & Greg Ferro of the Packet Pushers discuss the state of network automation. How are the hand-crafted, artisanal scripts & playbooks working out? How do legacy NCCMs fit? If Gluware might be a fit for your network automation needs, visit here. Thanks! You can subscribe to the Packet […]

The post Intelligent Low Code Network Automation: Gluware LiveStream Video [2/8] appeared first on Packet Pushers.

by The Video Delivery at October 08, 2021 04:00 PM

ipSpace.net Blog (Ivan Pepelnjak)

Video: Theoretical View of Network Addressing

After explaining the basics of (network) names, addresses and routes, I wasted a few minutes of everyone’s time discussing the theoretical aspects of layered addressing, and then got back to practical issues like address scopes, namespaces, and address provisioning.

The video ends with a simple (and unappreciated) truth: if you have a point-to-point link between two nodes you don’t need data-link-layer addresses. The consequences of that fact are left as an exercise for the viewer (or you can wait till the next video ;)

You need Free ipSpace.net Subscription to watch the video, and the Standard ipSpace.net Subscription to register for upcoming live sessions.

October 08, 2021 06:45 AM

XKCD Comics

October 07, 2021

ipSpace.net Blog (Ivan Pepelnjak)

Should You Build or Buy a Router?

Patrik Schindler sent me an interesting comment to my Open-Source DMVPN Alternatives blog post:

I’ve done searches myself some time ago about the readymade Linux distros supporting DMVPN and got exactly what I asked for.

Glancing over that page appalled me: Different stuff with different configuration languages, probably the need to restart things, thus generating service outages for configuration changes…

Your blog is heavily biased towards big deployments with good opportunities for automation, and the diversity of different components can be easily hidden behind automation scripts of choice. Smaller deployments are almost never being able to compensate the initial overhead of creating all the automation fuzz, and from that perspective, I must admit that configuring a Cisco router feels way more smooth to me.

Welcome to the build-or-buy dilemma, router edition.

October 07, 2021 07:11 AM

October 06, 2021

Potaroo blog

Learning from Facebook's Mistakes

On October 4th Facebook managed to achieve one of the more impactful outages in the entire history of the Internet, assuming that the metric of "impact" is how many users one can annoy with a single outage. What can we as an industry learn from this outage to ensure that we can avoid a recurrence of such a widespread outage in other important and popular service platforms?

October 06, 2021 11:00 PM

Packet Pushers

Protecting Anywhere Workers With SD-WAN And Zero Trust Network Access

The following post is by Drew Conry-Murray on behalf of Fortinet. We thank Fortinet for being a sponsor. Fortinet’s Zero Trust Network Access (ZTNA) is a smarter way to control which applications your end users connect to. Unlike a typical VPN client that gives a remote user full access to the corporate network, ZTNA provides […]

The post Protecting Anywhere Workers With SD-WAN And Zero Trust Network Access appeared first on Packet Pushers.

by Sponsored Blog Posts at October 06, 2021 04:36 PM

Who Is Gluware? With CEO Jeff Gray: Gluware LiveStream Video [1/8]

Gluware CEO Jeff Gray explains to Greg Ferro of the Packet Pushers who Gluware is and what they do. If Gluware might be a fit for your network automation needs, visit here. Thanks! You can subscribe to the Packet Pushers’ YouTube channel for more videos as they are published. It’s a diverse mix of […]

The post Who Is Gluware? With CEO Jeff Gray: Gluware LiveStream Video [1/8] appeared first on Packet Pushers.

by The Video Delivery at October 06, 2021 04:00 PM

XKCD Comics

October 05, 2021

Packet Pushers

Juniper Says Its New Chassis Switch Is Just Fine For Your Leaf-Spine

Juniper's newest switch, the QFX5700, is a 5RU chassis switch that can mix and match line cards and interfaces from 10G to 400G. Juniper positions the switch for enterprise data centers, service providers, and clouds.

The post Juniper Says Its New Chassis Switch Is Just Fine For Your Leaf-Spine appeared first on Packet Pushers.

by Drew Conry-Murray at October 05, 2021 06:36 PM

Hybrid Security Just Got A Lot More SASE

The following post is by Anupam Upadhyaya, VP of Product Management at Palo Alto Networks. We thank Palo Alto Networks for being a sponsor. Businesses today not only have to deal with the increased onslaught of cyber attacks brought about by the pandemic, but also the arduous task of modernizing their infrastructures to accommodate their […]

The post Hybrid Security Just Got A Lot More SASE appeared first on Packet Pushers.

by Sponsored Blog Posts at October 05, 2021 01:07 PM

Honest Networker

Middle-management consulting their most BGP-savvy engineers in post-facebook meltdown world.

<figure class="wp-block-video wp-block-embed is-type-video is-provider-videopress">
<iframe allowfullscreen="allowfullscreen" data-resize-to-parent="true" frameborder="0" height="512" src="https://video.wordpress.com/embed/1v1kxFbo?autoPlay=1&amp;loop=1&amp;muted=1&amp;persistVolume=0&amp;preloadContent=metadata&amp;hd=1&amp;cover=1" width="908"></iframe><script src="https://v0.wordpress.com/js/next/videopress-iframe.js?m=1632495956"></script>
<figcaption>Every reader of this account right now…</figcaption></figure>

by ohseuch4aeji4xar at October 05, 2021 12:12 PM

Cloudflare bloggers when they scramble to publish a blogpost about the outage at another company before the outage is even over.

<figure class="wp-block-video wp-block-embed is-type-video is-provider-videopress">
<iframe allowfullscreen="allowfullscreen" data-resize-to-parent="true" frameborder="0" height="516" src="https://video.wordpress.com/embed/KWWy2WzS?autoPlay=1&amp;loop=1&amp;muted=1&amp;persistVolume=0&amp;preloadContent=metadata&amp;hd=1&amp;cover=1" width="908"></iframe><script src="https://v0.wordpress.com/js/next/videopress-iframe.js?m=1632495956"></script>
<figcaption>Cloudflare bloggers</figcaption></figure>

by ohseuch4aeji4xar at October 05, 2021 08:53 AM