August 17, 2018

The Networking Nerd

The Cargo Cult of Google Tools

You should definitely watch this amazing video from Ben Sigelman of LightStep that was recorded at Cloud Field Day 4. The good stuff comes right up front.

[Video: LightStep Rethinking Observability with Ben Sigelman]

In less than five minutes, he takes apart crazy notions that we have in the world today. I like the observation that you can’t build a system that scales across more than three or four orders of magnitude. Yes, you really shouldn’t be using Hadoop for simple things. And Machine Learning is not a magic wand that fixes every problem.

However, my favorite thing was the quick mention of how emulating Google for the sake of using their tools for every solution is folly. Ben should know, because he is an ex-Googler. I think I can sum up this entire discussion in less than a minute of his talk here:

Google’s solutions were built for scale that basically doesn’t exist outside of maybe a handful of companies with a trillion dollar valuation. It’s foolish to assume that their solutions are better. They’re just more scalable. But they are actually very feature-poor. There’s a tradeoff there. We should not be imitating what Google did without thinking about why they did it. Sometimes the “whys” will apply to us, sometimes they won’t.

Gee, where have I heard something like this before? Oh yeah. How about this post. Or maybe this one on OCP. If I had a microphone I would have handed it to Ben so he could drop it.

Building a Laser Mousetrap

We’ve reached the point in networking and other IT disciplines where we have built cargo cults around Facebook and Google. We practically worship every tool they release into the wild and try to emulate that style in our own networks. And it’s not just the tools we use, either. We also keep trying to emulate the service provider style of Facebook and Google where they treated their primary users and consumers of services like your ISP treats you. That architectural style is being lauded by so many analysts and forward-thinking firms that you’re probably sick of hearing about it.

Guess what? You are not Google. Or Facebook. Or LinkedIn. You are not solving massive problems at the scale that they are solving them. Your 50-person office does not need Cassandra or Hadoop or TensorFlow. Why?

  • Google Has Massive Scale – Ben mentioned it in the video above. The published scale of Google is massive, and even that is on the low side of the real number. The real numbers could even be an order of magnitude higher than what we realize. When you have to start quoting throughput in “Library of Congress” units to make sense to normal people, you’re in a class by yourself.
  • Google Builds Solutions For Their Problems – It’s all well and good that Google has built a ton of tools to solve their issues. It’s even nice of them to have shared those tools with the community through open source. But realistically speaking, when are you really going to use Cassandra to solve anything but the most complicated and complex database issues? It’s like a guy that goes out to buy a pneumatic impact wrench to fix the training wheels on his daughter’s bike. Sure, it will get the job done. But it’s going to be way overpowered and cause more problems than it solves.
  • Google’s Tools Don’t Solve Your Problems – This is the crux of Ben’s argument above. Google’s tools aren’t designed to solve a small flow issue in an SME network. They’re designed to keep the lights on in an organization that maps the world and provides video content to billions of people. Google tools are purpose built. And they aren’t flexible outside that purpose. They are built to be scalable, not flexible.

Down To Earth

Since Google’s scale numbers are hard to comprehend, let’s look at a better example from days gone by. I’m talking about the Cisco Aironet-to-LWAPP Upgrade Tool:

I used this a lot back in the day to upgrade autonomous APs to LWAPP controller-based APs. It was a very simple tool. It did exactly what it said in the title. And it didn’t do much more than that. You fed it an image and pointed it at an AP and it did the rest. There was some magic on the backend of removing and installing certificates and other necessary things to pave the way for the upgrade, but it was essentially a batch TFTP server.

It was simple. It didn’t check that you had the right image for the AP. It didn’t throw out good error codes when you blew something up. It only ran on a maximum of 5 APs at a time. And you had to close the tool every three or four uses because it had a memory leak! But it was still a better choice than trying to upgrade those APs by hand through the CLI.

This tool is over ten years old at this point and is still available for download on Cisco’s site. Why? Because you may still need it. It doesn’t scale to 1,000 APs. It doesn’t give you any functionality other than upgrading 5 Aironet APs at a time to LWAPP (or CAPWAP) images. That’s it. That’s the purpose of the tool. And it’s still useful.

Tools like this aren’t built to be the ultimate solution to every problem. They don’t try to pack in every possible feature to be a “single pane of glass” problem solver. Instead, they focus on one problem and solve it better than anything else. Now, imagine that tool running at a scale your mind can’t comprehend. And you’ll know now why Google builds their tools the way they do.

Tom’s Take

I have a constant discussion on Twitter about the phrase “begs the question”. Begging the question is a logical fallacy. Almost every time the speaker really means “raises the question”. Likewise, every time you think you need to use a Google tool to solve a problem, you’re almost always wrong. You’re not operating at the scale necessary to need that solution. Instead, the majority of people looking to implement Google solutions in their networks are like people that put chrome everything on a car. They’re looking to show off instead of getting things done. It’s time to retire the Google Cargo Cult and instead ask ourselves what problems we’re really trying to solve, as Ben Sigelman mentions above. I think we’ll end up much happier in the long run and find our work lives much less complicated.

by networkingnerd at August 17, 2018 01:29 PM

XKCD Comics

August 16, 2018

Aaron's Worthless Words

Nyansa Voyance at NFD18

Disclaimer: I was lucky enough to have been invited to attend Network Field Day 18 this past July in Silicon Valley. This event brings independent thought leaders to a number of IT product vendors to share information and opinions. I was not paid to attend any of these presentations, but Tech Field Day did provide travel, room, and meals for the event. There is no expectation of providing any blog content, and any posts that come from the event are from my own interest. I’m writing about Nyansa strictly from demonstrations of the product. I’ve not installed it on my own network and have no experience running it.

Anyway,…on with the show!

Nyansa (pronounced nee-ahn’-sa) is focused on user experience on the access network. Their product, Voyance, analyzes data from a list of sources to provide a view into what client machines are seeing. This is more than just logs from the machine itself. We’re talking about taking behaviors on the wireless, access network, WAN, and Internet, and correlating those data points to predict user experience issues and recommend actions to remediate those problems. As we discussed in the presentation, there are products that do each of these (wireless, access, WAN, Internet), but Voyance is built to use ALL of them to show exactly what’s going on with your users. I sat in with Nyansa last year at Network Field Day 14, and, to tell the truth, I had no interest in what they were offering. At the time, the focus for Voyance was on the wireless experience with some access thrown in there. Since that presentation, Nyansa has made big leaps to expand what it can see.

Check out this screengrab I took from the presentation.

The big, shining star of the Voyance platform is the Crawler.  This device sits on your network to collect data from all over the place.  It talks to your wireless controllers to get data about the wireless clients (client densities, RFI, SNR, other wireless stuff I don’t really care about).  It’s a SPAN destination so it can watch packets run across your network.  It’s also a NetFlow server (sFlow and jFlow are coming, by the way), a syslog server, and an API client to get information from applications like Skype for Business.  The Crawler does its magic on this data and sends the resulting metadata to a backend for the real crunching.

Why does it need so many data points? Simply put, the more you know, the better your decision is. If, for example, you were to focus on just NetFlow data, you can see the source/destination info, when the flow started, how much data was sent, and when the flow ended. If your wireless network is having issues, NetFlow doesn’t really help you figure out why users are reporting problems. When you add other data sources like the packet data and wireless status, you can get a real picture of exactly what’s happening to the user and see what needs to be done. We’ve all gotten the ticket saying “the Internet is down” when it’s not, right? Voyance, with its wide data set, is able to show that your DHCP server is having issues, or your Internet circuit is saturated, or the admission server isn’t letting any new clients on the network.

One recent addition to the Voyance platform is the client agent.  This is a piece of software that runs on your Windows, OSX, and Android machines (no iOS support) to provide yet another data point for analysis.  My first reaction to this was “you already have all the data, why do you need this?”  Well, you don’t have all the data, actually.  If the client machine was having problems connecting to the wireless, you would never see that.  If your users are remote and can’t VPN in due to local ISP issues, you wouldn’t get that information, either.  The client agent reports those events from the user’s perspective right from the start, filling in those data gaps.  The client agent also runs synthetic tests to your gateway, to the data center, to the Internet, to whatever, in order to compare against already-established baselines to know if the client is having issues or not.  The client agent runs in the background, has minimal resource consumption, and provides a lot of data that could potentially be lost.  It’s a good addition to the Voyance platform.

That’s enough about collecting data.  Let’s see what information we can get out of all those data collection points.

We’re talking about user experience here, so baselines are probably the most important part of the analysis by Voyance.  Is my 150ms ping time to the data center normal?  Don’t know without a baseline.  Voyance is able to tell you very easily that 150ms is great for that specific connection from Kansas City to Singapore since a baseline has been established with historical data.  Let’s not forget, too, that latency is only one part of what’s measured here.  Check out this screengrab that shows some of the baselines for sample sites.

The performance tab shows that the Daly City and South San Francisco sites have a Radius problem where 44% of users are affected. Palo Alto, Manila, and Bangalore have problems getting to the Internet. Pittsburgh has wifi issues. These are established and measured baselines taken from data reported from the various sources. I guess we can consider this the norm.

Compare that screen to this one.

The incidents tab shows that the Mexico office is having problems with the Radius server. Eleven percent of users are being affected, which is a 5.1x deviation from the baseline. Since this measurement is in deviations from baseline, we know this is a new, atypical issue. Or at least the scale is atypical. What can we do about it?
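To make the "5.1x deviation" figure a little more concrete, here is a minimal Python sketch. It assumes the multiple is simply the current affected percentage divided by the baseline percentage; how Voyance actually computes it wasn't covered, so treat that as my assumption. The function names are hypothetical, and the baseline value is back-calculated from the 11% and 5.1x figures rather than read from the product.

```python
def deviation_multiple(current_pct: float, baseline_pct: float) -> float:
    """Return how many times larger the current value is than the baseline."""
    if baseline_pct <= 0:
        raise ValueError("baseline must be positive")
    return current_pct / baseline_pct


def implied_baseline(current_pct: float, multiple: float) -> float:
    """Back out the baseline implied by a current value and its multiple."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return current_pct / multiple


# Figures from the incidents tab: 11% of users affected, 5.1x the baseline.
# Assumption: "5.1x deviation" is a plain ratio of current to baseline.
baseline = implied_baseline(11.0, 5.1)
print(f"implied baseline: {baseline:.2f}% of users")
print(f"deviation: {deviation_multiple(11.0, baseline):.1f}x")
```

Under that assumption, the normal state of affairs in the Mexico office would be a little over 2% of users having Radius trouble, which is why 11% stands out.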

Here’s a shot of the recommendations tab.

Not only does Voyance tell you what to fix but it also tells you the positive impact that the recommendation will make. This particular shot shows wifi problems in an office due to interference on the 2.4 GHz band from rogue APs. You can see a metric called client hours, which is one user having a poor experience for one hour. The wider the impact, the bigger this number will be. In this example, we see that fixing the interference will save 476 client hours. This number can be compared against other recommendations to maximize the impact of your time. If the Radius problem in Mexico calculates out to 210 client hours, then your time may be better spent dealing with the interference problem.
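The client-hours comparison is easy to sketch in Python. This is only an illustration of the idea, not how Voyance does it: the `Recommendation` class, its fields, and the per-user inputs are all hypothetical, and the input numbers are invented so that the totals match the 476 and 210 figures above.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    name: str
    affected_clients: int   # users with a poor experience (invented input)
    hours_affected: float   # hours each of those users was affected (invented)

    @property
    def client_hours(self) -> float:
        # One client hour = one user having a poor experience for one hour.
        return self.affected_clients * self.hours_affected


def prioritize(recs):
    """Sort recommendations so the biggest client-hour saving comes first."""
    return sorted(recs, key=lambda r: r.client_hours, reverse=True)


# Inputs are made up so the totals match the post: 210 and 476 client hours.
recs = [
    Recommendation("Fix Radius issue in Mexico", affected_clients=30, hours_affected=7),
    Recommendation("Remove 2.4 GHz interference", affected_clients=68, hours_affected=7),
]
for rec in prioritize(recs):
    print(f"{rec.client_hours:>6.0f} client hours  {rec.name}")
```

Sorting by a single number is the whole point of the metric: it turns "which fire do I fight first?" into a list you can read top to bottom.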

I’m kinda drooling a little here. I would love to have a tool that watched every step and hop in a user’s experience and let me know when things are going wrong. This is well beyond the traditional methods of monitoring where you have a million data sets that you have to correlate by hand. Think about an issue where a user reports that they can’t get to GMail. In a traditional top-down or bottom-up troubleshooting approach, you would have to look at wireless statistics and logs, switch logs, firewall logs, router statistics, NetFlow data, SNMP data, etc., and try to match up events and statistics by hand (or by head). I’m afraid I’ve become a little too good at that over the years, but my junior network admin and our helpdesk can’t do that. Voyance can not only tell them what’s happening, but it can tell them what they should do about it. This is great stuff here.

Of course, there are some things Voyance doesn’t do.

  • It’s not a tool designed to monitor your applications in your data centers.  Though these can be used as client agent targets, Voyance isn’t going to provide the metrics you want when serving your own applications.  You’ll want your own monitoring to fill that gap.
  • Voyance is also not going to fix your network by itself. Nyansa is thinking about adding that feature, but it doesn’t exist today. It’s also a whole new can of worms getting into the management arena. API calls into Voyance from your own tool were discussed as a viable option.
  • It’s not a traditional monitoring system.  Though that data is collected (and you can click enough to see that data), this system isn’t for showing you your link utilization or the number of clients on an AP.  These are just pieces of the whole picture that Voyance knows about.  If you insist on sending an alert when your WAN link gets over 80% utilized, stick with your current tool (since you’re totally missing the point here).

How many times have you been called in the middle of the night because the “Internet is down” in an office, only to learn that the DHCP server isn’t responding any more? Or that “the network is down” because a user can’t connect to an application that’s crashed? A tool like Voyance can be a God-send for under-staffed helpdesks and even for the engineers in charge of the access network. Like I said earlier, I had no interest in Nyansa and Voyance the last time I spoke with them. Today’s features, though, round out that platform into something that has piqued my interest.

Check out Nyansa’s presentation at NFD18.  Be sure to compare that to their NFD14 presentation.  Their website is here.

Send any cool logos questions to me.

by Aaron Conaway at August 16, 2018 08:28 PM

Blog (Ivan Pepelnjak)

GitOps in Networking

This blog post was initially sent to the subscribers of my SDN and Network Automation mailing list. Subscribe here.

Tom Limoncelli published a must-read article in ACM Queue describing GitOps – the idea of using Pull Requests together with a CI/CD pipeline to give your users the ability to request changes to infrastructure configuration.

Using GitOps in networking is nothing new – Leslie Carr talked about this concept almost three years ago @ RIPE 71, and I described some of the workflows you could use in the Network Automation 101 webinar.

Read more ...

by Ivan Pepelnjak at August 16, 2018 02:26 PM

Networking Now (Juniper Blog)

Dutch Water Council Stays Afloat with Juniper Networks

There are 21 water councils in the Netherlands that collectively have the job of protecting the 25 percent of the country (four million people) that is somewhere between 10 meters below and one meter above sea level. These regional water authorities are among the oldest forms of local government. They provide clean drinking water, manage every aspect of water provision and control within the region and are considered Category A Critical Infrastructure by the Dutch Ministry of Justice and Security.

by mtjonenfa at August 16, 2018 02:00 PM

August 15, 2018

My Etherealmind

Automation Learning Charter

The world changes. The hit novel “Who moved my cheese?”, written twenty years ago, has sold over 25 million copies to help people experiencing change. For those who work with networking technology, we’re experiencing seismic activity in the world of change and new continents are forming from scattered islands. Some of these continents, so to speak, are uncharted and misunderstood. This generation of engineers are the explorers of the new world and the lands are ripe for pillaging.

Common feedback around learning includes:

  • I just don’t know where to begin
  • Is Python really where the world is going?
  • There is so much to learn
  • If I learn a programming language, my problems are solved
  • I feel like I can’t catch up
  • There is nothing to hold on to
  • I can’t seem to drag myself out of despair

Some of this feedback has led me to write and publish this article based on my own sanity-saving methodology.


The relationship between change and progress is interesting. Not all change is progress, but all progress is change. In IT, sometimes we’ve played both polar opposite parts in the “Change for change’s sake” murder novel.

Change, rate of change, disruption and progress have all played a game of leapfrog and have changed order and role in the last few years. We’re all feeling change and, if we’re honest, sometimes struggle. Opportunities to learn something new and to explore new technology are riper than they have ever been, whilst commoditization as an underlying theme has destroyed the notion of proprietary and has made a lot of networking technology transparent. Proprietary technology used to yield a sales lead, and therefore a competitive lead, for a short while. But with industry shifts (look at merchant silicon as an example), there is no competing with the behemoths, so companies that used to develop silicon for their own products now consume it from these behemoths. That levels off the competitive lead and focusses competition on either customer service or speed to market on the enshrouding software interface. The big organizations get bigger and the small disappear through acquisition or destruction.

Over the years, readers of this blog have emailed asking a variety of questions around learning to code, Python, P4 and one or more SDN controller platforms. The underlying driver is change and humans base planning on what the other human sheep are doing. When everyone bleats Python, the sheep on fields nearby, bleat the same with very high probability. Professional certifications used to be a driving force of learning and I recall radically improving both operations and network designs based on knowledge gained through the certification process. Thanks to the increase in the rate of change, I now plan to learn skills that are broader and figure out how to focus when required.

Learning Charter

With change being a constant and progress being tumultuous, whilst working for Brocade I formed my learning charter. This learning charter was and still is my method to embrace the change constant and make sure my brain doesn’t feel like it’s left behind. Our brains are emotional little critters and mine prefers to enjoy the mental food of building something I don’t know how to build. My process is to build and learn in that, at first glance, strange order. Despite being left behind at a steady constant (lag might be the better term), my brain is happy and not flailing with panic. The cold chilling facts are thus: There is one of you and many people churning out technology, therefore you will always be lagging. Enjoy the most famous words from my favorite author: Don’t panic!

Building Things

As the industry experienced the roller coaster ride of SDN controllers, I invested significant time into building and forcibly bending a wide selection to my will. Even trying to build some of them, new skills were picked up along with some creative swear words. Some of those skills are now vague memories, but here’s the trick: I know where to go and what to look for if I need to use those tools again. I didn’t start out by learning the build tool then figuring out what to build with it. Building stuff is the network engineer’s “lab it out” mantra when trying to make some RFC paragraph mentally stick, and the change here has been from configuring products to building systems from a set of components that fuse with some probability (look up version or dependency hell).

Speaking transparently about my own journey, experimentation and the need to build a certain type of tool or service has led me into the arms of specific programming languages and tools. My tools may not be your tools. Golang offers absolute usefulness as a tool to learn other things, and the language tricks other people might find to be the driver are not what draws me to it. When exploring the process of consuming from a Kafka bus, transforming the data and publishing it to RabbitMQ, Go is my preferred tool because it empowers me to build processes with simplicity. If the method of exploration is simple, then I can focus on the process and on learning the mechanisms that lead to success. You will learn many sorts of things without setting out to learn about RabbitMQ or Kafka. In the building phase, learnings will present themselves as one approaches building, targeted from the design encapsulated within the flowchart.

Deciding what to build can be a challenge and justifying the time spent can also be difficult. If it’s challenging to construct the system to run an end-to-end workflow, then there is value in building. Justifying your personal time is sometimes easier than work time, depending on the fun factor; either way, consult the management team.

For the last couple of months, processing pipelines have consumed me and I’ve created a number of them. Listed below is a short set of outcomes.

  • Increase of Golang debugging skills (pprof & gops)
  • Increase of InfluxDB knowledge
  • Increase of Kafka knowledge
  • Increase of ZooKeeper knowledge
  • NETCONF refresher
  • Modification of ‘go to’ template for a Dockerfile
  • 2x successful projects (publication TBC)
  • 1x botched project but with valuable learnings
  • 1x rejected PR for an InfluxDB agent

I want to emphasize, I didn’t start out to learn blobs of technology but instead drew a flowchart, researched each flowchart block, and created a pipeline that acquired data, transformed it and placed it into InfluxDB and Kafka.
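To show the shape of such a pipeline, here is a minimal sketch of the acquire, transform and publish blocks. It is written in Python rather than Go purely for brevity, and it is not the code from my projects: the function names, the `device`/`rx_bytes` fields and the in-memory lists standing in for InfluxDB and Kafka are all assumptions made for illustration.

```python
import json
from typing import Callable, Iterable, Iterator, Sequence


def acquire(raw_messages: Iterable[str]) -> Iterator[dict]:
    """Acquire: parse raw JSON messages, skipping anything malformed."""
    for raw in raw_messages:
        try:
            yield json.loads(raw)
        except json.JSONDecodeError:
            continue  # a real pipeline would log or count these


def transform(records: Iterable[dict]) -> Iterator[dict]:
    """Transform: keep only the fields we care about and convert bytes to bits."""
    for rec in records:
        if not isinstance(rec, dict) or "device" not in rec or "rx_bytes" not in rec:
            continue
        yield {"device": rec["device"], "rx_bits": rec["rx_bytes"] * 8}


def publish(records: Iterable[dict], sinks: Sequence[Callable[[dict], None]]) -> int:
    """Publish: hand every record to every sink and return how many were sent."""
    count = 0
    for rec in records:
        for sink in sinks:
            sink(rec)
        count += 1
    return count


# In-memory lists stand in for the database and the message bus (hypothetical).
fake_influx = []
fake_kafka = []

raw = ['{"device": "sw1", "rx_bytes": 100}', "not json", '{"device": "sw2"}']
sent = publish(transform(acquire(raw)), [fake_influx.append, fake_kafka.append])
print(sent, fake_influx)
```

Each function maps to one flowchart block, so swapping a fake sink for a real client later only touches one block. That is the point of drawing the flowchart first.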

There is a need in all of us to make this beautiful, but as this is a learning exercise, it does not have to be pretty. Whilst it’s good to exercise security controls and production build configuration, most READMEs, forums and open source documentation will advise what to do and where, given the scenario. Do not worry so much about production grade when broad-learning.

As a last word in building, your employer may be focussed heavily on Java or Python, Salt Stack or RunDeck. That’s fine. Choose your own set of tools and share the knowledge of data acquisition, transformation and conditions that resulted in success. Your engineering ability is not tied to any tool, but how you solve problems and share the knowledge.

Learning Things

When going through “Build Something”, I keep a notepad next to my keyboard where I write things down. I started out using Trello for this and discovered my sinister subconscious used it to generate additional mental weight as the list only seemed to grow, stemming from the time based calculus of “learning > listing things to learn”. Almost being mentally bankrupt, I ditched Trello for this task. There is a limited amount of real estate on an A5 sized writing pad and I’m very measured with what I learn after consuming a leaf. This might seem strange and I view this as mechanical sympathy for the brain. We’re organic matter and until memory chips are available from the local Costco, I’m trying to work with whatever grey matter that remains after my somewhat enjoyed teenage years.

Imagine that you yourself want to create a web application that consumes data from a Kafka bus, published with data from a system transmitting telemetry data. There are lots of components here to learn about, and if familiarity is approximately zero, you’re in for a fun ride! If you, reader, are anything like me, the sheer weight of the learning tasks can sit heavily on you. One approach would be to learn each topic then bring it together. After a few years of experience, network engineers are conditioned to think in this mode. My brain, as an example, shoots to “I’m going to create a UML flow chart diagram of the process”. A prerequisite to achieving that was learning a tool called ‘yUML’ that generates Unified Modeling Language (UML) type diagrams from Markdown-like text, not because I set out to learn the tool, but because there was a need to create diagrams easily. This tool is now part of my daily arsenal of weaponry. Here is such an activity diagram for the scenario that has just been explored.

// {type: activity}
(start) -> |a| -> (Start web app)
|a| -> (Start Kafka bus) -> (Start Kafka\ndata collector) -> (Get latest| JSON Transform) -> (Publish to topic) -> (TOPIC)
(note: Looping process) -> (Get latest| JSON Transform)
(note: Looping process) -> (Publish to topic)
(Start web app) -> (App Query) -> (Acquire Data from topic)
(TOPIC) -> (Acquire Data from topic) -> (Serve to web user) -> (end)

Note, the tool used here integrates into Visual Studio Code and is called ‘yUML’.
The project wiki is useful for syntax, which can be found here.

Starting out as an R&D engineer, I do things differently and therefore propose you try this:

  1. Draw a flow chart of data acquisition, processing and decision making for the system.
  2. Describe in each block the task at hand succinctly with absolute clarity.
  3. Use your Google powers using clear and descriptive key words.
  4. Make a note of the links, READMEs and how to guides.
  5. Discard anything that doesn’t cover 80% of your individual task.
  6. Target component commonality where functions and features are re-used. Reduce where possible.
  7. Build controlled prototypes.
  8. Version control everything and create READMEs.

Before we move on, 80% is a universal number in technology. 80% gives us hope that the result is in reach and 20% means we can close the gap without rebuilding the whole tool. This is perfect project contribution territory along with ‘in reach’ levels of hope.

Point 6 is interesting. This is where experience has taught me to lean on platforms like StackStorm. They are great for tying systems together conceptually whilst exploring how to do it. Whilst writing this I saw Tweets around building Rube Goldberg machines and authors of those Tweets asking people not to build them. I’m not talking about the physical versions, because they’re super fun and we all love them. If you created a system like the one being discussed in this post using twenty different components, and point 6 allowed you to boil it down to 3, then the original was probably looking like a Rube Goldberg machine. Engineers use abstraction to encapsulate behavior of a system and make it easier to interface with other systems. The issue with abstraction is people get a little abstract crazy and before you know it, the code projects itself in a separate universe of meaninglessness. The same is true for system components. The idea is that you’ve placed your chips on the poker table and have used the best of your Google powers to hunt out components that do what you require. Aim for minimal real estate in terms of compute and complexity whilst still achieving the same set of asks.

After getting your arms and hands dirty from the farm yard hacking, you witness a working system. Sure, it might only work once then collapse in a heap of broken, unintelligible scripts, but it works. You instantiated your flow of logic from beginning to end and proved out your ability to build a system. What follows is targeted learning around data normalization, transformation, error handling and correctness in terms of what should happen and where. The MVP (minimum viable product) repetitive cycle begins here and given a few iterations, it will work more than once and your skill level rockets.


Using the examples contained to share the charter, I have particular fondness for Golang and StackStorm because they get the job done. When you use “Build & Learn” as your driver, common tools make themselves known. Before long you have a go-to set of tools that you didn’t target to learn, but did because of the approach, not because of the industry. Your GitHub footprint grows not because of the need to exercise Git but because of the systems you’ve tried to build and learn from.

Your situation may require that tools are pre-selected for any exploration. This is true of some companies that have seen success or pivot around fear. Embrace your constraint and make the best of it. You never know, there may be an opportunity to improve the whole approach through your own learnings.

Narrowing down the list of learnings is a hard task. Anything that is too cumbersome or requires a hero, I try and avoid. A real example would be that of a web application kit that serves as a common starting point. If that kit comes with a tool that helps you bootstrap databases and set up themes, a learning opportunity has been removed and the tool has abstracted what’s happening under the hood. That means you can’t fix it because of your own lack of awareness, and whilst mewing over the glorious result, you’ve not learned anything other than how to use the tool itself. Abstraction & automation are great when you understand the topic at hand and that understanding comes from the ‘hacking it together’ experience.

Modern businesses require that ‘we do more with less’ and those businesses willing to invest in their employees with hacking time often benefit the most. Their employees are familiar with technology, know where to optimize and how to fix when the opportunities arise.

If this post twanged a nerve string, then create a challenge, hack it together and learn from it. It will be painful the first time. Your laptop might not have basic tools installed, or you might not know what a Makefile is; it doesn’t matter. Step away from fear and into this enormous learning opportunity you too can enjoy.

IPEngineer Learning Charter

1 Target a logical workflow that solves a challenge.

  • Draw a flow chart of data acquisition, processing and decision making for the system.
  • Describe in each block the task at hand succinctly with absolute clarity.
  • Use your Google powers using clear and descriptive key words.
  • Make a note of the links, READMEs and how to guides.
  • Discard anything that doesn’t cover 80% of your individual task.
  • Target component commonality where functions and features are re-used. Reduce where possible.
  • Build controlled prototypes.
  • Version control everything and create READMEs.

2 Learn from the experience of building it and target specific areas of difficulty or intrigue

3 Iterate using Reliability as the keyword and Rejoice
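As a rough illustration of step 1, the flow chart of data acquisition, processing and decision making can be hacked together as three small functions before any tooling is involved. Everything in this sketch (the function names, the interface data and the error limit) is made up for the example and is not part of the charter itself.

```python
def acquire():
    """Data acquisition block: return raw observations (hard-coded here)."""
    return [
        {"interface": "eth0", "errors": 0},
        {"interface": "eth1", "errors": 42},
    ]


def process(raw, error_limit=10):
    """Processing block: keep only interfaces above a hypothetical error limit."""
    return [item["interface"] for item in raw if item["errors"] > error_limit]


def decide(flagged):
    """Decision block: turn the processed data into a single action."""
    if flagged:
        return "open ticket for " + ", ".join(flagged)
    return "no action"


# Chain the blocks exactly as they would appear in the flow chart.
result = decide(process(acquire()))
print(result)
```

Each function maps to one block of the flow chart, which makes it easy to replace the hard-coded data with a real source once the workflow itself is understood.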

The post Automation Learning Charter appeared first on

by David Gee at August 15, 2018 01:22 PM

XKCD Comics

August 14, 2018

Blog (Ivan Pepelnjak)

Updated: First Set of Building Next-Generation Data Centers Self-Study Materials

When I started the Building Next-Generation Data Centers online course, I didn’t have the automated infrastructure to support it, so I had to go with the next best solution: a reasonably-flexible Content Management System, and Mediawiki turned out to be a pretty good option.

In the meantime, we developed a full-blown course support system, including guided self-paced study (available with most online courses) and progress tracking. It was time to migrate the data center material into the same format.

Read more ...

by Ivan Pepelnjak at August 14, 2018 07:10 AM

Potaroo blog

Measuring ECDSA in DNSSEC - A Final Report

Four years ago we started looking at the level of support for ECDSA in DNSSEC. At the time we concluded that ECDSA was just not supported broadly enough to be usable. Four years later, let's see if we can provide an updated answer to the question of the viability of ECDSA.

August 14, 2018 05:15 AM

August 13, 2018

Blog (Ivan Pepelnjak)

Schneier’s Law Applied to Networking

A while ago I stumbled upon Schneier’s law (must-read):

Any person can invent a security system so clever that she or he can't think of how to break it.

I’m pretty sure there’s a networking equivalent:

Any person can create a clever network design that is so complex that she or he can't figure out how it will fail in production.

I know I’ve been there with my early OSPF network designs.

by Ivan Pepelnjak at August 13, 2018 02:48 PM

XKCD Comics

August 10, 2018

Dyn Research (Was Renesys Blog)

Civil War in Yemen Begins to Divide Country’s Internet

The latest development in Yemen’s long-running civil war is playing out in the global routing table.  The country’s Internet is now being partitioned along the conflict’s battle lines with the recent activation of a new telecom in government-controlled Aden.

Control of YemenNet

The Iranian-backed Houthi rebels currently hold the nation’s capital Sana’a in the north, while Saudi-backed forces loyal to the president hold the port city of Aden in the south (illustrated in the map below from Al Jazeera).  One advantage the Houthis enjoy while holding Sana’a is the ability to control Yemen’s national operator YemenNet.  Last month, the Houthis cut fiber optic lines severing 80% of Internet service in Yemen.

Launch of AdenNet

In response to the loss of control of YemenNet, the government of President Hadi began plans to launch a new Yemeni telecom, AdenNet, that would provide service to Aden without relying on (or sending revenue to) the Houthi-controlled incumbent operator.  Backed with funding from UAE and built using Huawei gear, AdenNet (AS204317) went live in the past week exclusively using transit from Saudi Telecom (AS39386), as depicted below in a view from Dyn Internet Intelligence.

The new Aden-based telecom would also allow the Yemeni government to restrict access to the submarine cables that land in Aden without impacting their own Internet service.

More recently, the government of President Hadi has been lobbying RIPE NCC to regain control of Yemen’s Internet numbers and ICANN to regain control of the country’s ccTLD, which would restore their control over domains ending with .ye.  These vital components to operating the Internet of Yemen are traditionally controlled by YemenNet, now in the hands of the rebels.

Divided Internet

Internet service in Yemen faces myriad challenges in this troubled nation from hackers to sabotage.  As the conflict rages on in Yemen, the country’s Internet is now being partitioned between YemenNet (AS12486, AS30873), controlled by the Houthi rebels, and now AdenNet (AS204317), controlled by the Saudi-backed Yemeni government.

The Internet doesn’t exist in a vacuum.  From Cuba to Crimea, a country’s Internet is regularly shaped by events and conditions on the ground.  And in the case of Yemen, divided along the lines of an intractable civil war.  On the upside, Yemen now has two backbone providers, which could ultimately improve resiliency and increase competition within the market.

Thanks to Fahmi Albaheth, President of the Internet Society of Yemen for assistance in this analysis.

by Doug Madory at August 10, 2018 07:00 PM

The Networking Nerd

Are We Seeing SD-WAN Washing?

You may have seen a tweet from me last week referencing a news story that Fortinet was now in the SD-WAN market:

It came as a shock to me because Fortinet wasn’t even on my radar as an SD-WAN vendor. I knew they were doing brisk business in the firewall and security space, but SD-WAN? What does it really mean?

SD Boxes

Fortinet’s claim to be a player in the SD-WAN space brings the number of vendors doing SD-WAN to well over 50. That’s a lot of players. But how did they come out of left field to land a deal rumored to be over a million dollars for a space that they weren’t even really playing in six months ago?

Fortinet makes edge firewalls. They make decent edge firewalls. When I used to work for a VAR we used them quite a bit. We even used their smaller units as remote appliances to allow us to connect to remote networks and do managed maintenance services. At no time during that whole engagement did I ever consider them to be anything other than a firewall.

Fast forward to 2018. Fortinet is still selling firewalls. Their website still focuses on security as the primary driver for their lines of business. They do talk about SD-WAN and have a section for it with links to whitepapers going all the way back to May. They even have a contributed article for SDxCentral back in February. However, going back that far the article reads more like a security company that is saying their secure endpoints could be considered SD-WAN.

This reminds me of stories of Oracle counting database licenses as cloud licenses so they could claim to be the fourth largest cloud provider. Or if a company suddenly decided that every box they sold counted as an IPS because it had a function that could be enabled for a fee. The numbers look great when you start counting them creatively but they’re almost always a bit of a fib.

Part Time Job

Imagine if Cisco suddenly decided to start counting ASA firewalls as container engines because of a software update that allowed you to run Kubernetes on the box. People would lose their minds. Because no one buys an ASA to run containers. So for a company like Cisco to count them as part of a container deployment would be absurd.

The same can be said for any company that has a line of business that is focused on one specific area and then suddenly decides that the same line of business can be double-counted for a new emerging market. It may very well be the case that Fortinet has a huge deployment of SD-WAN devices that customers are very happy with. But if those edge devices were originally sold as firewalls or UTM devices that just so happened to be able to run SD-WAN software, it shouldn’t really count, should it? If a customer thought they were buying a firewall, they wouldn’t really believe it was actually an SD-WAN router.

The problem with this math is that everything gets inflated. Maybe those SD-WAN edge devices are dedicated. But if they run Fortinet’s security suite, are they also being counted in the UTM numbers? Is Cisco going to start counting every ISR sold in the last five years as a Viptela deployment after the news this week that Viptela software can run on all of them? Where exactly are we going to draw the line? Is it fair to say that every x86 chip sold in the last 10 years should count for a VMware license because you could conceivably run a hypervisor on them? It sounds ridiculous when you put it like that, but only because of the timelines involved. Some crazier ideas have been put forward in the past.

The only way that this whole thing really works is if the devices are dedicated to their function and are only counted for the purpose they were installed and configured for. You shouldn’t get to add a UTM firewall to both the security side and the SD-WAN side. Cisco routers should only count as traditional layer 3 or SD-WAN, not both. If you try to push the envelope to put up big numbers designed to wow potential customers and get a seat at the big table, you need to be ready to defend your reporting of those numbers when people ask tough questions about the math behind those numbers.

Tom’s Take

If you had told me last year that Fortinet would sell a million dollars worth of SD-WAN in one deal, I’d ask you who they bought to get that expertise. Today, it appears they are content with saying their UTM boxes with a central controller count as SD-WAN. I’d love to put them up against Viptela or VeloCloud or even CloudGenix and see what kind of advanced feature sets they produce. If it’s merely a WAN aggregation box with some central control and a security suite I don’t think it’s fair to call it true SD-WAN. Just a rinse and repeat of some washed up marketing ideas.

by networkingnerd at August 10, 2018 05:39 AM

XKCD Comics

August 09, 2018

Network Design and Architecture

EIGRP in the Service Provider Networks

EIGRP in the Service Provider Networks. If you are wondering whether EIGRP (Enhanced Interior Gateway Routing Protocol) is used in Service Provider networks, then continue reading this post. EIGRP is very uncommon in Service Provider networks. As I teach network design training to thousands of students and through my […]

The post EIGRP in the Service Provider Networks appeared first on Cisco Network Design and Architecture | CCDE Bootcamp |

by Orhan Ergun at August 09, 2018 07:43 PM

Submarine cables carry whole Internet Traffic ! More than 95%

Submarine cables carry the whole Internet Traffic. I am not exaggerating. Today, 95% of Internet traffic is carried over submarine cables. They are so important, but as a network engineer, how much do you know about submarine cables? I explained the fundamentals of submarine cables in this post. If […]

The post Submarine cables carry whole Internet Traffic ! More than 95% appeared first on Cisco Network Design and Architecture | CCDE Bootcamp |

by Orhan Ergun at August 09, 2018 08:00 AM

Blog (Ivan Pepelnjak)

Updated: Building Next-Generation Data Centers Live Sessions

After fixing the Building Network Automation Solutions materials, I decided to tackle the next summer janitorial project: creating standard curriculum pages for the Building Next Generation Data Centers online course and splitting it into more granular modules (the course is ~150 hours long, and some modules have more than 40 hours of self-study materials).

Read more ...

by Ivan Pepelnjak at August 09, 2018 06:49 AM

August 08, 2018

Network Design and Architecture

Istanbul/Turkey Onsite CCDE Training – 33% OFF until August-15, 2018

Istanbul/Turkey Onsite CCDE Training will be held between August 30 – September 3, 2018. The course will be in English as usual; every day will run from 9am to 6pm, 9 hours. I am going to extend my CCDE Materials for this course as there were new scenarios and technologies after August 29, […]

The post Istanbul/Turkey Onsite CCDE Training – 33% OFF until August-15, 2018 appeared first on Cisco Network Design and Architecture | CCDE Bootcamp |

by Orhan Ergun at August 08, 2018 08:57 PM

Ask these questions before you replace any technology in your network !

If you are replacing one technology with another, these are the questions you should be asking. This may not be the complete list, and one may be more important than another for your network, but definitely keep them in mind or come back to this post and check before you replace […]

The post Ask these questions before you replace any technology in your network ! appeared first on Cisco Network Design and Architecture | CCDE Bootcamp |

by Orhan Ergun at August 08, 2018 08:19 PM

Blog (Ivan Pepelnjak)

Another Benefit of Open-Source Networking Software

You probably know my opinion on nerd knobs and the resulting complexity, but sometimes you desperately need something to get the job done.

In the traditional vendor-driven networking world, you might be able to persuade your vendor to implement the knob (you think) you need in 3 years by making it a mandatory requirement for a $10M purchase order. In the open-source world you implement the knob, write the unit tests, and submit a pull request.

Read more ...

by Ivan Pepelnjak at August 08, 2018 09:04 AM

XKCD Comics

August 07, 2018

Network Design and Architecture

Discussion with Maldivian Operator Dhiraagu (AS7642)

I discussed the BGP Route Reflector design, Settlement Free Peering, Transit Operator choice, Internet Gateways and the Route Reflector connections, MPLS deployment options at the Internet Edge and many other things with the Operator from the Maldives. The operator’s name is Dhiraagu. Its Autonomous System Number is 7642. Engineer from the ISP Core team, who is […]

The post Discussion with Maldivian Operator Dhiraagu (AS7642) appeared first on Cisco Network Design and Architecture | CCDE Bootcamp |

by Orhan Ergun at August 07, 2018 08:05 PM

How many labels for VPN in MPLS

How many labels for VPN in MPLS? Those who already have a good amount of knowledge of MPLS may know the answer. Or if you have taken my CCDE course before, this question is basic for you. But understanding this fundamental piece of knowledge is key to understanding MPLS Applications. MPLS […]

The post How many labels for VPN in MPLS appeared first on Cisco Network Design and Architecture | CCDE Bootcamp |

by Orhan Ergun at August 07, 2018 07:50 PM

Networking Now (Juniper Blog)

Be Ready for Cloud, 5G and IoT with Advanced Security Acceleration

Now more than ever, our networks and infrastructure require security that keeps pace not only with cybercrime, but with the demands of ubiquitous streaming, a myriad of devices and accelerated cloud evolution. This explosive growth and fluid environment mean that organizations need more muscle from their firewalls.


by Amy James at August 07, 2018 11:45 AM

Blog (Ivan Pepelnjak)

Updated: Building Network Automation Solutions Materials and Descriptions

The materials and descriptions for the Building Network Automation Solutions online course got a slight makeover: all live session recordings are now part of self-study materials, and the module description pages use a consistent format for self-study materials and live sessions.

Next on the janitor’s list: a similar makeover for the Data Center online course.

by Ivan Pepelnjak at August 07, 2018 06:21 AM

August 06, 2018

Dyn Research (Was Renesys Blog)

Last Month in Internet Intelligence: July 2018

In June, we launched the Internet Intelligence microsite, including the new Internet Intelligence Map. In July, we published the inaugural “Last Month in Internet Intelligence” overview, covering Internet disruptions observed during the prior month. The first summary included insights into exam-related outages and problems caused by fiber cuts. In this month’s summary, covering July, we saw power outages and fiber cuts, as well as exam-related and government-directed shutdowns, disrupt Internet connectivity. In addition, we observed Internet disruptions in several countries where we were unable to ascertain a definitive cause.

Power Outages

It is no surprise that power outages can wreak havoc on Internet connectivity – not every data center or router is connected to backup power, and last mile access often becomes impossible as well.

At approximately 20:00 GMT on July 2, the Internet Intelligence Map Country Statistics view showed a decline in the traceroute completion ratio and DNS query rate for Azerbaijan, related to a widespread blackout. These metrics gradually recovered over the next day. Published reports (Reuters, Washington Post) noted that the blackout was due to an explosion at a hydropower station, following an overload of the electrical system due to increased use of air conditioners, driven by a heat wave that saw temperatures exceed 100° F. Power was restored after several hours, but reportedly failed again, causing a second blackout, which again impacted the traceroute and DNS metrics as seen around 15:00 GMT on July 3.
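To make the traceroute completion ratio more concrete, here is a minimal sketch of how such a metric and a "decline" could be computed. It assumes the ratio is simply completed traceroutes divided by attempted traceroutes per hour, and that a decline means falling below 70% of a short baseline; the post does not describe the actual method used by the Internet Intelligence Map, and the hourly counts are made up.

```python
from statistics import median


def completion_ratio(completed, attempted):
    """Fraction of traceroutes that reached their target."""
    return completed / attempted if attempted else 0.0


def flag_declines(samples, baseline_hours=3, threshold=0.7):
    """Return labels of samples whose ratio is below threshold * baseline median."""
    ratios = [(label, completion_ratio(c, a)) for label, c, a in samples]
    baseline = median(r for _, r in ratios[:baseline_hours])
    return [label for label, r in ratios[baseline_hours:] if r < threshold * baseline]


# Made-up hourly counts: (hour label, completed traceroutes, attempted traceroutes)
samples = [
    ("17:00", 900, 1000),
    ("18:00", 910, 1000),
    ("19:00", 890, 1000),
    ("20:00", 450, 1000),
    ("21:00", 400, 1000),
    ("22:00", 880, 1000),
]

declines = flag_declines(samples)
print(declines)
```

Comparing against a baseline rather than a fixed value matters because the normal completion ratio differs widely from country to country.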

Just a day later, Tropical Storm Maria caused an islandwide power outage in Guam, which disrupted Internet service on the island for several hours. However, Guam Power Authority (GPA) responded quickly once the storm had passed, with the Guam Daily Post noting that the GPA expected “to have substantial load for power restoration around 11 am”. In looking at the graphs shown below, they appear to have hit that target as the traceroute completion ratio and BGP routes count returned to prior levels around that time (Guam is GMT+10).


At the end of the month, Venezuela experienced a large power failure that left most of the capital city of Caracas without electricity, which caused a disruption in Internet connectivity as well. As shown in the figure below, both the traceroute and DNS metrics saw minor declines at around 13:00 GMT. Approximately two hours later, a Tweet from the country’s Energy Minister stated that 90% of the service had been restored in Caracas, and a subsequent Tweet several hours later explained that the initial fault in Caracas originated from voltage transformer control cables being cut. It appears that the measured metrics for Venezuela returned to regular levels several hours after power was restored.


Fiber/Cable Cuts

On July 4, Twitter user @ADMIRAL12 posted the following Tweet:

Oracle Director of Internet Analysis Doug Madory responded, noting “DNS query rate is down. Otherwise BGP routes and completing traceroutes are unaffected.” Minutes later, Madory also commented, “Both YemenNet ASNs lost transit from @GlobalCloudX and @etisalat (AS15412 and AS8966) at that time.” YemenNet’s issues can be seen in the Traffic Shifts graphs below.

AS12486 YemenNet

AS30873 YemenNet

A published report indicated that Houthi rebels disrupted Internet service to nearly 80% of Yemen by damaging a fiber optic cable in the port city of Hodeidah. The publication quoted a source from the Public Telecommunication Corporation, who explained “The cable that connects the country to the Internet was cut in three places in the districts of Al Kanawes and Al Marawya in Hodeidah as the Houthi militia continues to dig trenches in the area.”

Just days later, Internet connectivity in Haiti was disrupted for more than a day, including a complete outage for local telecommunications provider Digicel Haïti. The Internet disruptions occurred in the midst of widespread protests over government plans to raise gas prices. Several Digicel fiber optic lines were cut, and the U.S. Embassy in the country stated “Telecommunications services, including Internet and phone lines, have been affected throughout Haiti.” The disruption to Haiti’s Internet connectivity, as well as Digicel’s outage, can be seen in the graphs below.


AS27653 Digicel Haiti

Following this damage to Digicel’s infrastructure, the company’s Chairman took to Twitter to provide a status update on repairs:

As would be expected, Digicel’s outage impacted connectivity for downstream customers. As seen in the graph below, traceroutes to targets in AS263685 (Sogebank, one of Haiti’s three largest commercial banks) passed through Digicel ahead of the fiber cuts, as seen in the yellow area on the left side of the graph. Concurrent with the fiber cut, traceroutes fail to reach Sogebank for several hours, before they shift to using Télécommunications de Haití as an upstream provider, as seen in the green area on the graph. They maintained this connectivity arrangement for approximately three days before shifting back to Digicel.


On July 9, incumbent provider Telecommunication Services of Trinidad and Tobago (TSTT) was down for over three hours, causing a partial disruption to Internet connectivity in Trinidad and Tobago, as seen in the graphs below. A published report quoted a TSTT executive as stating that a major break in a fiber optic cable in the Chaguaramas area had caused a temporary disruption in all mobile data and Internet services.

Trinidad & Tobago

AS5639 - Telecommunication Services of Trinidad & Tobago

Similar to the Sogebank discussion above, the Traffic Shifts graph below shows the impact of this cable cut on AS26317 (Lisa Communications), which uses TSTT as an upstream provider. As the graph shows, the vast majority of traceroutes to Lisa Communications passed through TSTT in the days prior to the cut, as seen in the yellow area on the left side of the graph. Concurrent with the cut, the number of completed traceroutes briefly declines to approximately half of its average rate, although provider Columbus Communications Trinidad Limited quickly picked up the slack, as seen in the blue area. After approximately half a day, the majority of the Lisa Communications-bound traceroutes begin passing through TSTT once again, as seen in the return of the yellow area on the right side of the graph.

AS26317 - Lisa Communications Ltd

A fiber cut caused a multi-hour Internet disruption in Kenya on July 22, starting at approximately 06:30 GMT. A published report indicated that service was restored by 11:00 GMT, which aligns with the traceroute completion ratio and DNS query rates shown in the figure below.


Safaricom, Kenya’s largest telecommunications provider, issued a statement to customers that noted “We wish to apologize to our customers and partners that are currently experiencing voice and data outage, caused by multiple fiber link cuts affecting critical transmission equipment”. To that end, the Traffic Shifts graph below shows the impact of the cut on One Communications. A subsidiary of Safaricom since 2008, it relies on its parent for Internet connectivity. The cut caused a complete loss of completed traceroutes to targets in One Communications for several hours, until service was restored.

AS37061 - One Communications Ltd

While not explicitly a cable cut, Internet connectivity in Bangladesh was significantly impacted at the end of July as the SeaMeWe-4 (SMW4) submarine cable was taken down from July 25-30 for maintenance, resulting in a loss of almost half of the country’s international Internet capacity. Repairs to the SMW4 cable also impacted the Internet in Bangladesh in May 2018, October 2017, and August 2011, as did cuts to the cable in June 2012. Taking down SMW4 for repairs resulted in a significant shift in how traffic reaches Bangladesh, as shown in the Traffic Shifts graph below for AS17494 (Bangladesh Telecommunications Company Limited). The biggest impact appeared to occur during the first few days of the repair period, stabilizing by July 27.

AS17494 - Bangladesh Telecommunications Company Limited


On the heels of exam-related Internet shutdowns on June 21 and 27 (covered in last month’s post), similar disruptions were observed in Iraq on July 1, 4, 7, and 11 as seen in the figures below. A table published by media advocacy and development organization SMEX listed high school diploma exams as taking place between June 21 and July 12, which aligns with the shutdowns discussed here. In addition, the issues observed in June and July also fit the profile of similar past actions – a significant, but not complete, outage lasting two to three hours.

Iraq July 1-4

Iraq July 4-7

Iraq July 7-11

As seen in the figure below, similar Internet disruptions in Iraq were also observed in the Internet Intelligence Map on July 17 and 19. While they appear to be similar in profile to the exam-related outages discussed above, there was no available information that could be found regarding exams taking place on these two days.

Iraq July 17-19

Heading into the end of July, Syria closed out the month with three multi-hour outages where the Internet was shut down nationwide to prevent cheating on high school exams. As seen in the figure below, the number of completed traceroutes into Syrian endpoints dropped to near zero, and the number of routed networks in Syria also dropped to near zero. However, as we have seen with similar prior shutdowns in Syria, the number of DNS requests from resolvers within the country jumps sharply during the shutdown. We believe that this indicates that the shutdown was implemented asymmetrically – that is, traffic from within Syria can reach the global Internet, but traffic from outside the country can’t get in. These spikes in DNS traffic are likely related to local DNS resolvers retrying when they don’t receive the response from Oracle Dyn authoritative nameservers – normally, the client traffic they are making requests on behalf of would be served from the resolver’s cache.



Sandwiched between the exam-related outages referenced above, Iraq experienced a nationwide Internet blackout that lasted nearly two days, stemming from a shutdown ordered in response to a week of widespread protests. The disruption, shown in the figure below, lasted from July 14-16, and had a significant impact on all three measured metrics.

Iraq July 13-16

Unfortunately, as noted in a blog post on the disruption, “Government-directed Internet outages have become a part of regular life in Iraq.” The first such outage documented by the Internet Intelligence team occurred in 2013 and revolved around a pricing dispute between the Iraqi Ministry of Communications and various telecommunications companies operating there. Over the subsequent five years, we have seen several more such Internet disruptions.

The Internet Intelligence blog post referenced above highlighted that not all of Iraq was taken offline during the weekend disruption, with about 400 BGP routes (out of a total of 1,300 for Iraq) staying online. Some telecommunications providers with independent Internet connections through the north of Iraq stayed online, as did those with independent satellite links.

AS60929-ITC Iraq

ITC operates the Iraqi fiber backbone, and the impact of the government-directed disruption is clearly evident in the Traffic Shifts graph above over the July 14-15 weekend period. Iraqi provider Earthlink is based in Baghdad and is one of Iraq’s largest ISPs. It was also down during the same period, as seen in the Traffic Shifts graph below.



On July 9, Twitter user @Abdalla_Salmi posted the following Tweet:

The Country Statistics graphs below show that there was no change in the number of routed networks geolocated to Eritrea, but there were significant declines in the traceroute completion ratio and DNS query rate metrics during the time period highlighted in the above Tweet. As @Abdalla_Salmi noted, the Internet disruption in Eritrea was coincident with a visit from the Ethiopian Prime Minister. (The visit marked a shift in relations between Ethiopia and Eritrea, which have been locked in two decades of conflict.) In some instances in the past, we have observed state-ordered disruptions to a country’s Internet connectivity as a means of limiting their citizens from being able to organize protests around political events of this type. However, such government involvement in an Internet shutdown is often reported in the press and/or on social media; in this case, no such reports have been found.


The Eritrea Telecommunication Service Corporation (EriTel) is the national telecommunications service provider, and is the state-owned monopoly for fixed and mobile connectivity. As the graph below shows, the number of completed traceroutes into EriTel dropped to approximately 10% of their previous rate during the period of disruption. While no publicly available information on a root cause has been found for the issues observed at a country level and with EriTel, the disruptions were corroborated by colleagues at Akamai and CAIDA through data they collect and analyze.

AS30987-Eritrea Telecommunication Service Corporation (EriTel)

On July 12/13 and again on July 17/18, the Internet Intelligence Map highlighted Internet disruptions in Bhutan, as shown in the figure below. Although the observed issues appeared to last less than a day in each case, they left artifacts across all three metrics. Unfortunately, the root cause of these disruptions is unknown, as there were no published reports found on state involvement, fiber cuts, power outages, or the like.


Just after midnight GMT on July 23, the Internet Intelligence Map Country Statistics view for Syria showed an approximately 30% decline in the traceroute completion ratio metric, as seen in the graph below. This reduced ratio persisted through the end of the month and may represent the “new normal”, although the reduced rate of DNS queries from Syrian resolvers returned to previous levels after a few days; the number of routed networks from Syria remained unchanged. This type of profile is often indicative of last mile access issues or catastrophic technical failure closer to the edge of the network. However, in this case we believe that this observed disruption may have been due to a change in network configuration at Syrian Telecom.


The impact of this possible network configuration change on traceroutes into AS29256 (Syrian Telecom) can be seen in the Traffic Shifts graph below. In this case, the number of completed traceroutes into Syrian Telecom appears to drop right before midnight GMT on July 23 – just ahead of the significant drop in the country-level traceroute completion ratio graph above.

AS29256-Syrian Telecom


July was a busy month for Internet disruptions around the world as observed within Oracle’s Internet Intelligence Map. For better or worse, the disruptions were largely due to familiar causes, with related information found in local or international press coverage, on Twitter, or on telecommunications provider Web sites. However, some had impacts large enough to leave artifacts in the Internet Intelligence graphs, but without correlated press coverage, provider apologies, or user complaints on Twitter. Although root cause information can be hard to find, we feel that it is valuable to highlight all significant Internet disruptions in support of #keepiton efforts around the world.

by David Belson at August 06, 2018 12:32 PM

Blog (Ivan Pepelnjak)

New Design

During the last weeks I migrated the whole site (apart from the workgroup administration pages) to the new design. Most of the changes should be transparent (apart from the pages looking better than before ;); I also made a few more significant changes:

Read more ...

by Ivan Pepelnjak at August 06, 2018 05:04 AM

XKCD Comics

August 03, 2018

Dyn Research (Was Renesys Blog)

BGP/DNS Hijacks Target Payment Systems

In April 2018, we detailed a brazen BGP hijack of Amazon’s authoritative DNS service in order to redirect users of a crypto currency wallet service to a fraudulent website ready to steal their money.

In the past month, we have observed additional BGP hijacks of authoritative DNS servers with a technique similar to what was used in April. This time the targets included US payment processing companies.

As in the Amazon case, these more recent BGP hijacks enabled imposter DNS servers to return forged DNS responses, misdirecting unsuspecting users to malicious sites.  By using long TTL values in the forged responses, recursive DNS servers held these bogus DNS entries in their caches long after the BGP hijack had disappeared — maximizing the duration of the attack.
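The effect of a long TTL can be illustrated with a toy resolver cache. This is only a sketch: the hostname, the forged address, the 30-minute hijack window and the five-day TTL are hypothetical values chosen for the example, not figures from the incidents described here.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    address: str
    expires_at: float  # absolute time (seconds) at which the entry must be dropped


class ResolverCache:
    """Toy recursive-resolver cache keyed by hostname."""

    def __init__(self):
        self._entries = {}

    def store(self, name, address, ttl, now):
        self._entries[name] = CacheEntry(address, now + ttl)

    def lookup(self, name, now):
        entry = self._entries.get(name)
        if entry is None or now >= entry.expires_at:
            self._entries.pop(name, None)
            return None  # cache miss: the resolver would query upstream again
        return entry.address


HIJACK_END = 30 * 60        # hijack withdrawn after 30 minutes (illustrative)
FORGED_TTL = 5 * 24 * 3600  # hypothetical long TTL chosen by the attacker

cache = ResolverCache()
# During the hijack (t = 0) the resolver caches the forged answer.
cache.store("www.example.test", "203.0.113.66", ttl=FORGED_TTL, now=0)

# One hour after the hijack ended, the forged record is still being served.
still_poisoned = cache.lookup("www.example.test", now=HIJACK_END + 3600)
# Only once the TTL has run out does the resolver ask the real servers again.
after_expiry = cache.lookup("www.example.test", now=FORGED_TTL + 1)
print(still_poisoned, after_expiry)
```

The BGP announcement only needs to exist long enough for the forged answer to be cached; after that, the cache lifetime is set by the attacker's TTL, not by the duration of the hijack.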

The Hijacks

At 23:37:18 UTC on 6 July 2018, Digital Wireless Indonesia (AS38146) announced the following prefixes for about thirty minutes.  These prefixes didn’t propagate very far and were only seen by a handful of our peers.

> Savvis
> Vantiv, LLC
> Vantiv, LLC
> Q9 Networks Inc.
> Q9 Networks Inc.

Three were more-specific announcements (,, of existing routes.

Then at 22:17:37 UTC on 10 July 2018, Malaysian operator Extreme Broadband (AS38182) announced the exact same five prefixes listed above.  For about 30 minutes, these hijack prefixes weren’t propagated very far.  Then they were announced again at 23:37:47 UTC for about 15 minutes but to a larger set of peers — 48 peers instead of 3 peers in the previous hour.  It appears a change of BGP communities from 24218:1120 to 24218:1 increased the route propagation.

According to a brochure on the company’s website, Datawire is a “patented connectivity service that transports financial transactions securely and reliably over the public Internet to payment processing systems.”  Datawire’s nameservers, and, resolve to and respectively, addresses that were in the hijacked networks shown above.

Vantiv and First Third Processing are former names of Worldpay, a major US payment processing service.  Vantiv’s nameservers, and, resolve to and respectively, addresses in the hijacked networks shown above.

At 00:29:24 UTC on 11 July 2018, AS38182 began hijacking a new set of prefixes in two separate incidents for minutes each time.

> Mercury Payment Systems
> Mercury Payment Systems
> Level 3
> CERFnet

Mercury Payment Systems is a credit card processing service also owned by Worldpay (formerly Vantiv).  Mercury’s nameservers, and, resolve to and  These IP addresses were hijacked as part of and, both more-specifics of their normal routes.
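Why a more-specific announcement is enough to pull traffic away from the legitimate route comes down to longest-prefix matching. The sketch below is only an illustration: the prefixes come from the documentation range and the AS numbers are reserved documentation ASNs, not the networks involved in these incidents, and `best_route` is a hypothetical helper rather than anything a real router exposes.

```python
import ipaddress


def best_route(routes, destination):
    """Return the (network, origin) pair with the longest matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    matches = []
    for prefix, origin in routes:
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matches.append((network, origin))
    if not matches:
        return None
    # The most specific (longest) prefix wins, regardless of who announced it.
    return max(matches, key=lambda match: match[0].prefixlen)


# Hypothetical routing table: a legitimate covering route and a hijacker's more-specific.
routes = [
    ("192.0.2.0/24", "AS64500"),  # legitimate origin (illustrative)
    ("192.0.2.0/25", "AS64511"),  # hijacker's more-specific (illustrative)
]

print(best_route(routes, "192.0.2.53"))   # falls inside the /25, so the hijacker wins
print(best_route(routes, "192.0.2.200"))  # outside the /25, so the /24 still applies
```

Any nameserver address that falls inside the more-specific prefix follows the hijacker's route for as long as that announcement is accepted.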

Then at 21:51:36 UTC on 12 July 2018, AS38182 began hijacking the same five routes as had been targeted twice previously.

> Savvis
> Vantiv, LLC
> Vantiv, LLC
> Q9 Networks Inc.
> Q9 Networks Inc.

These hijacks lasted for almost three hours.  The Vantiv hijacks are visualized below:

Illustrated below, Q9 evidently noticed their routes were being hijacked and started announcing the same routes in an effort to regain control of the IP address space.

Then at 23:06:32 UTC on 12 July 2018, AS38182 began hijacking various routes, including two for major DNS service provider UltraDNS (owned by Neustar),  for approximately 10 minutes.

> UltraDNS Corporation
> UltraDNS Corporation
> UltraDNS Corp
> Internet Media Network
> Internet Media Network
> Internet Media Network
> CenturyLink

Forged DNS responses
Users of these payment systems began to report problems as early as 10 July.  Participants on the Outages email distribution list reported problems connecting to Datawire shortly after the first hijack.

Passive DNS observations between the 10th and 13th of July showed * domains resolving to – IP address space registered as being in the Dutch Caribbean island of Curaçao, but routed out of the breakaway region of Luhansk in eastern Ukraine.

Similarly the hijack of Amazon’s Route53 service in April was directed to, which is registered as being German IP space, but is also routed out of Luhansk in eastern Ukraine.

These similarities indicate that these two BGP hijacks of authoritative DNS servers may be related.

In last month’s hijacks, the perpetrators showed attention to detail, setting the TTL of the forged response to ~5 days.  The normal TTL for the targeted domains was 10 minutes (600 seconds).  By configuring a very long TTL, the forged record could persist in the DNS caching layer for an extended period of time, long after the BGP hijack had stopped.
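The arithmetic behind the TTL trick is easy to check. This is a rough sketch that assumes "~5 days" means exactly five days and that a resolver cached the forged answer at the start of the 12 July hijack; both are illustrative assumptions, and real resolvers may cap TTLs or evict entries earlier.

```python
from datetime import datetime, timedelta, timezone

NORMAL_TTL = 600               # seconds, the normal TTL cited above
FORGED_TTL = 5 * 24 * 60 * 60  # "~5 days", approximated here as exactly 5 days


def cache_expiry(cached_at, ttl_seconds):
    """Time at which a resolver would drop a record cached at `cached_at`."""
    return cached_at + timedelta(seconds=ttl_seconds)


# Hypothetical caching moment: the start of the 12 July hijack.
cached_at = datetime(2018, 7, 12, 21, 51, 36, tzinfo=timezone.utc)

print("normal record expires:", cache_expiry(cached_at, NORMAL_TTL))
print("forged record expires:", cache_expiry(cached_at, FORGED_TTL))
print("lifetime multiplier:", FORGED_TTL // NORMAL_TTL)
```

Under those assumptions the forged record lives 720 times longer than a legitimate one, which is why a hijack lasting minutes can misdirect users for days.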


If previous hijacks were shots across the bow, these incidents show the Internet infrastructure is now taking direct hits. Unfortunately, there is no reason not to expect to see more of these types of attacks against the Internet.

As Job Snijders of NTT Communications has suggested, our only hope is to use the consolidation of the Internet industry to our advantage. He wrote to me recently:

If the major DNS service providers (both on the authoritative and recursive side of the house) sign their routes using RPKI, and validate routes received via EBGP, the impact of attacks like these would be reduced because protected paths are formed back and forth. Only a small specific group of densely connected organizations needs to deploy RPKI based BGP Origin Validation to positively impact the Internet experience for billions of end users.
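To make the origin validation idea concrete, here is a minimal sketch of the classification described in RFC 6811: a route is valid if some ROA covers it with a matching origin AS and an acceptable prefix length, invalid if it is covered but nothing matches, and not-found otherwise. The `ROA` tuple, the `validate_origin` helper, and the documentation prefixes and AS numbers are all assumptions made for illustration; a production validator works from cryptographically verified RPKI data rather than a hand-written list.

```python
import ipaddress
from typing import NamedTuple


class ROA(NamedTuple):
    prefix: str      # prefix the holder has authorized
    max_length: int  # longest prefix length the holder allows
    asn: int         # AS authorized to originate the prefix


def validate_origin(roas, prefix, origin_asn):
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # Skip ROAs of the other address family or that do not cover the route.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if route.prefixlen <= roa.max_length and origin_asn == roa.asn:
            return "valid"
    return "invalid" if covered else "not-found"


# Hypothetical ROA: AS64500 may originate 192.0.2.0/24 and nothing more specific.
roas = [ROA("192.0.2.0/24", 24, 64500)]

print(validate_origin(roas, "192.0.2.0/24", 64500))     # legitimate announcement
print(validate_origin(roas, "192.0.2.0/25", 64511))     # more-specific from another AS
print(validate_origin(roas, "198.51.100.0/24", 64511))  # no ROA covers this prefix
```

A network that drops invalid routes would discard the second announcement, which is the pattern of the more-specific hijacks described above.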

by Doug Madory at August 03, 2018 07:26 PM

My Etherealmind
The Networking Nerd

Cisco and the Two-Factor Two-Step

In case you missed the news, Cisco announced yesterday that they are buying Duo Security. This is a great move on Cisco’s part. They need to beef up their security portfolio to compete against not only Palo Alto Networks but also against all the up-and-coming startups that are trying to solve problems that are largely being ignored by large enterprise security vendors. But how does an authentication vendor help Cisco?

Who Are You?

The world relies on passwords to run. Banks, email, and even your mobile device have some kind of passcode. We memorize them, write them down, or sometimes just use a password manager (like 1Password) to keep them safe. But passwords can be guessed. Trivial passwords are especially vulnerable. And when you factor in things like rainbow tables, it gets even scarier.

The most secure systems require you to have some additional form of authentication. You may have heard this termed as Two Factor Authentication (2FA). 2FA makes sure that no one is just going to be able to guess your password. The most commonly accepted forms of multi-factor authentication are:

  • Something You Know – Password, PIN, etc
  • Something You Have – Credit Card, Auth token, etc
  • Something You Are – Biometrics

You need at least two of these in order to successfully log into a system. Not having an additional form means you’re locked out. And that also means that the individual components of the scheme are useless in isolation. Knowing someone’s password without having their security token means little. Stealing a token without having their fingerprint is worthless.

But people are starting to get more and more sophisticated with their attacks. One of the most popular forms of 2FA is SMS authentication. It combines Something You Know, in this case your password for your account, with Something You Have, which is a phone capable of receiving an SMS text message. When you log in, the authentication system sends an SMS to the authorized number and you have to type in the short-lived code to get into the system.

Ask Reddit how that worked out for them recently. A hacker (or group) was able to intercept the 2FA SMS codes for certain accounts and use both factors to log in and gather account data. It’s actually not as hard as one might think to intercept SMS codes. It’s much, much harder to crack the algorithm of something like a security token. You’d need access to the source code and months to download everything. Like exactly what happened in 2011 to RSA.

In order for 2FA to work effectively, it needs to be something like an app on your mobile device that can be updated and changed when necessary to validate new algorithms and expire old credentials. It needs to be modern. It needs to be something that people don’t think twice about. That’s what Duo Security is all about. And, from their customer base and the fact that Cisco paid about $2.3 billion for them, they must do it well.
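For a sense of how an authenticator app can produce short-lived codes without any SMS in the loop, here is a small sketch of the open TOTP scheme (RFC 6238, built on HOTP from RFC 4226). This is not a description of how Duo works internally; it is just the widely published algorithm, and the shared secret below is the test key from the RFCs rather than anything you would use in production.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226) for a given counter value."""
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation offset
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Time-based code (RFC 6238): HOTP keyed on the current time step."""
    return hotp(secret, unix_time // step, digits)


def verify(secret: bytes, submitted: str, unix_time: int, step: int = 30, window: int = 1) -> bool:
    """Accept codes from the current step and `window` steps either side (clock drift)."""
    counter = unix_time // step
    for candidate in range(max(0, counter - window), counter + window + 1):
        if hmac.compare_digest(hotp(secret, candidate), submitted):
            return True
    return False


secret = b"12345678901234567890"  # RFC test key, for illustration only
code = totp(secret, 59)
print(code, verify(secret, code, 59))
```

Both sides derive the code from a shared secret and the clock, so there is nothing in transit for an attacker to intercept the way an SMS can be.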

Won’t Get Fooled Again

How does Duo help Cisco? Well, first and foremost I hope that Duo puts an end to telnet access to routers forever. Telnet is the lazy way we enable remote access to devices. SSH is ten times better and a thousand times more secure. But setting it up properly to authenticate with certificates is a huge pain. People want it to work when they need it to work. And tying it to a specific machine or location isn’t the easiest or most convenient thing.

Duo can give Cisco the ability to introduce real 2FA login security to their devices. IOS could be modified to require Duo Security app login authentication. That means that only users authorized to log into that device would get the login codes. No more guessed remote passwords!

Think about integrating Duo with Cisco ISE. That could be a huge boon for systems that need additional security. You could have groups of systems that need 2FA and others that don’t. You could easily manage those lists and move systems in and out as needed. Or, you could start a policy that all systems need 2FA and phase in the requirements over time to make people understand how important it is and give them time to download the app and get it set up. The ISE possibilities are endless.

One caveat is that Duo is a program that works with a large number of third party programs right now. Including integrations with Juniper Networks. As you can imagine, that list might change once Cisco takes control of the company. Some organizations that use Duo will probably see a price increase and will continue to offer the service to their users. Others, possibly Juniper as an example, may be frozen out as Cisco tries to keep the best parts of the company for their own use. If Cisco is smart, they’ll keep Duo available for any third party that wants to use the platform or integrate. It’s the best solution out there for solving this problem and everyone deserves to have good security.

Tom’s Take

Cisco buying a security company is no shock. They need the horsepower to compete in a world where firewalls are impediments at best and hackers have long since figured out how to get around static defenses. They need to get involved in software too. Security isn’t fought in silicon any more. It’s all in code and beefing up the software side of the equation. Duo gives them a component to compete in the broader authentication market. And the acquisition strategy is straight out of the Chambers playbook.

A plea to Cisco: Don’t lock everyone out of the best parts of Duo because you want to bundle them with recurring Cisco software revenue. Let people integrate. Take a page from the Samsung playbook. Just because you compete with Apple doesn’t mean you can’t make chips for them. Keep your competitors close and make sure they use your software and you’ll make more money than freezing everyone out and claiming your software is the best and least used of the bunch.

by networkingnerd at August 03, 2018 02:33 PM

XKCD Comics

August 02, 2018

My Etherealmind

August 01, 2018

XKCD Comics

July 31, 2018

SNOsoft Research Team

Gizmodo Interviews Netragard-Snake Oil Salesmen Plague the Security Industry, But Not Everyone Is Staying Quiet

Adriel Desautels was suddenly in a serious mess, and it was entirely his fault.

Sitting in his college dorm room back in the mid-1990s, Desautels let his curiosity run rampant. He had a hunch that his school’s network was woefully insecure, so he took it upon himself to test it and find out.

“My thoughts at the time were, ‘Hey, it’s university. I’m here to learn. How much harm can there really be in doing it?’” Desautels says in a recent phone call, the hint of a tremor in his voice.

It wasn’t long before he found himself in a dull faculty conference room, university officials hammering him with questions as a pair of ominous-looking men—Desautels says he still doesn’t know who they were, but it’s hard not to assume they had badges in their pockets—stood quietly listening on the sidelines.

Penetrating the school’s network proved simple, he says, and thanks to Desautels’ affable arrogance, talking his way out of trouble was easier still. Forensically speaking, he argued to the school officials, there was no way to prove he did it. It could’ve just as easily been another student, at another computer, in a dorm room that wasn’t his. And he was right; they couldn’t prove shit, Desautels recalls. One of the mystery men smiled knowingly.

Read the full article here

The post Gizmodo Interviews Netragard-Snake Oil Salesmen Plague the Security Industry, But Not Everyone Is Staying Quiet appeared first on Netragard.

by Adriel Desautels at July 31, 2018 03:43 PM

July 30, 2018

My Etherealmind
XKCD Comics

July 27, 2018

The Networking Nerd

It’s About Time and Project Management

I stumbled across a Reddit thread today from /u/Magician_Hiker that posed a question I’ve always found fascinating. When we work on projects, it always seems like there is a disconnect between the project management team and the engineering team doing the work. The statement posted at the top of this thread is as follows:

Project Managers only plan for when things go right.

Engineers always plan for when things go wrong.

How did we get here? And can anything be done about it?

Projecting Management

I’ve had a turn or two at project management. I got my Project+ many years back, and even more years before that I had to learn all about project management in college. The science behind project management is storied and deep. The idea of having someone assigned to keep things running on task and making sure all the little details get taken care of is a huge boon as the size of projects grows.

As an engineer, can you imagine trying to juggle three different installations across 5 different sites that all need to be coordinated together? Can you think about the effort needed to make sure that everything works together and is done on time? The thought alone probably gives you hives.

Project managers are capable of juggling lots of things in their professional capacities. That means keeping all the dishes cooking at the same time and making sure that everything is done on time to eat dinner. It also means that people need to know about timelines and how those timelines intersect and can impact the execution of multiple phases of a project. Sure, it’s easy to figure out that we can’t start installing the equipment until it arrives on the dock. But how about coordinating the installers to be on-site on the right day knowing that the company is drop shipping the equipment to three different receiving docks? That’s a bit harder.

Project managers need to know timelines for things because they have to juggle everything together. If you’ve ever had the misfortune to need to use a Gantt chart you’ll know what I’m talking about. These little jewels have all the timeline needs of a project visualized for everyone to figure out how to make things happen. Stable time is key to a project. Estimates need to make sense. You can’t just spitball something and hope it works. If part of your project timeline is off in either direction, you’re going to get messed up further down the line.


Project timelines need to be consistent. Most people try to err on the side of caution when trying to make them work. They fudge the numbers and pad things out a bit so that everything will work out in the end. Even if that means that there may be a few hours when someone is sitting around with nothing to do.

I worked with a project manager who jokingly told me that the way he figured out the timing for an installation project was to take the estimate from his engineers, double it, and move to the next time unit. So hours became days, and days became weeks. We chuckled about this at the time, but it also wasn’t surprising when their projects always seemed to take a LOT longer than most people budgeted for.

The problem with inflated numbers is that no customer is going to want to pay for wasted time. If you think it’s hard to get a customer to buy off on an installation that might take 30 hours, try getting them to pay when they are telling you your engineers were sitting around for 10 of those hours. Customers only want to pay for the hours worked, not the hours spent on the phone trying to locate shipments or trying to figure out what this weird error message is.

Likewise, trying to go the other direction and get things done more quickly than the estimate is a recipe for disaster too. There’s even a specific term for it: crashing (sounds great, eh?). Crashing a project means adding resources to a project or removing items from the critical execution path to make a deadline or complete something earlier. If you want a textbook example of why messing with a project timeline is a bad idea, go read or watch The Martian. The first resupply mission is a prime example of this practice in action and why it can go horribly wrong.

These are all great reasons why cloud is so appealing to people. Justin Warren (@JPWarren) did a great presentation a couple of years ago about what happens when projects run late and why cloud fixes that:

<iframe allowfullscreen="true" class="youtube-player" height="329" src=";rel=1&amp;fs=1&amp;autohide=2&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;wmode=transparent" style="border:0;" type="text/html" width="584"></iframe>

Watch that whole video and you’ll understand things from a project manager’s point of view. Cloud is predictable and stable and it always works the same way. The variance on things is tight. You don’t have to worry about projects slipping up or taking too much time. Cloud removes uncertainty and doubt about execution. That’s something that project managers love.

Tom’s Take

I used to get asked to quote my projected installation times to the sales managers for projects. Most of the time, I’d give them an estimate that I felt comfortable with and that would be the end of it. One day, I asked them about why a 10-hour project was quoted as 14 on an order. The sales manager told me that they’d developed “Tom Time”, which was 1.4 times the amount of whatever I quoted. So, 10 hours became 14 and 20 hours became 28, and so on. When I asked why, I was told that engineers often run into problems and don’t think to account for them. So project managers need to build in the time somehow. Perhaps that’s one of the reasons why software defined and cloud are more attractive. Because there isn’t any Tom Time involved.
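For the curious, the two padding rules from this post are simple enough to write down. This is a toy sketch; the function names are made up for illustration, and the unit table only goes as far as the joke did (hours to days, days to weeks).

```python
from fractions import Fraction

TOM_TIME_FACTOR = Fraction(14, 10)  # the 1.4x multiplier from the story

# "Double it and move to the next time unit", as far as the joke went.
NEXT_UNIT = {"hours": "days", "days": "weeks"}


def tom_time(engineer_hours):
    """Hours that end up on the order for a given engineer estimate."""
    return float(engineer_hours * TOM_TIME_FACTOR)


def doubled_and_bumped(amount, unit):
    """The other project manager's rule: double the number, bump the unit."""
    return amount * 2, NEXT_UNIT[unit]


print(tom_time(10), tom_time(20))
print(doubled_and_bumped(4, "hours"))
```

The gap between the two rules is the whole point: a 1.4x pad is a contingency, while doubling and bumping the unit turns a four-hour job into eight days on the schedule.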

by networkingnerd at July 27, 2018 07:12 PM

XKCD Comics

July 26, 2018

Blog (Ivan Pepelnjak)

New on Interviews and Guest Podcasts

You can find most of the interviews and guest podcasts I did in the last few years on this web page (also accessible as Resources → Interviews from the new menu).

During the summer break, I’m publishing blog posts about the projects I’m working on – as you can see, they include web site maintenance and other janitorial tasks. Regular blog posts will return in autumn.

by Ivan Pepelnjak at July 26, 2018 05:17 AM

July 25, 2018

About Networks

How to deploy a Cisco Meraki vMX100 into Microsoft Azure

Recently, I was involved in a project where we had to deploy a Cisco Meraki vMX100 into the Microsoft Azure cloud and build site-to-site and client VPNs. The setup process on Azure is relatively simple; however, I lost quite a lot of time on basic issues because the documentation provided by Cisco is not 100% accurate.
Read More »

The post How to deploy a Cisco Meraki vMX100 into Microsoft Azure appeared first on

by Jerome Tissieres at July 25, 2018 02:26 PM

XKCD Comics

July 24, 2018

Potaroo blog

An Update on Securing BGP from IETF 102

In this article I’d like to look at some BGP security topics that have come up during the July 2018 meeting of the Internet Engineering Task Force (IETF) and try to place these items into some bigger context of routing security.

July 24, 2018 11:15 PM

July 23, 2018

Blog (Ivan Pepelnjak)

Overview of Training Options

Describing the differences between various training options has been on my to-do list for ages, but I successfully managed to ignore it till I deployed the new top-level menu that contains a training category.

Our designers never considered menu items without a corresponding link, so I got an ugly mess that needed to be cleaned up either by fixing the CSS or writing the overview document.

End result: a high-level document describing how webinars, courses and workshops fit into the bigger picture.

During the summer break, I’m publishing blog posts about the projects I’m working on. Regular blog posts will return in autumn.

by Ivan Pepelnjak at July 23, 2018 08:00 AM

Aaron's Worthless Words

Automating My World

I’ve told this story 984828934 times in the past year, but bear with me.  We got a new director-type last year, and he has challenged all of us to do things differently.  As in everything.  Anything that we’re doing today should be done differently by next year.  This isn’t saying that we’re doing things wrong.  This is just a challenge to mix things up, integrate new tools, and get rid of the noise.  Our group has responded big-time, and we’re now doing most of our day-to-day tasks with a tool of some kind.  A couple weeks ago, I realized that I did a whole day’s work without logging directly into any gear; everything was through a tool.  It was a proud moment for me and the group.

To kick off this new adventure, we’re starting with writing all our own stuff in-house; we’re obviously not talking about a full, commercial orchestration deployment here.  We’re talking about taking care of the menial tasks that we are way too expensive to be doing.  Simple tasks.  Common tasks.  Repeatable tasks.  All game.  What’s the MAC address of that host?  Need a new host added to an existing object-group in the firewall?  Adding a new VPN tunnel to a customer?  All easily scripted out.

As a group, we got together and decided on a few standard tools to use.  Don’t read too much into that, though; we didn’t involve a full RFP process.  It’s more of a handshake agreement, but it allows us to have a common base of skills so we can help each other out on the way.

Let’s talk about those tools for a bit.

Python – For a basic language, we’ve decided to use Python.  This is a no-brainer since Python is easy to use, is very useful, and has lots of functionality.  Just look around the automation world for 10 seconds and you’ll see it in wide use.  An item of interest : Netmiko

Ansible – For doing a set of tasks across a big list of hosts, Ansible meets our needs.  It’s got a bunch (maybe all) of Python on the backend, and there are all sorts of modules already available.  This took just about as much thought as selecting Python since it’s one of the most popular automation tools out there.  An item of interest : Ansible Vault

Rundeck – This is a web-interface tool for running scripts either on-demand or on a schedule. We use it mostly to enable other groups to do tasks without our input since it has a decent way to control execution and inputs.
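As a taste of what one of those menial tasks might look like, here is a rough sketch of the "what’s the MAC address of that host?" lookup using Python and Netmiko. It is not the script we actually run; it assumes a Cisco IOS device, the usual `show ip arp` column layout (which varies by platform), and placeholder device details that you would supply yourself. The function names are made up for the example.

```python
import re

# Matches IOS-style "show ip arp" rows; the layout is an assumption and differs per platform.
ARP_LINE = re.compile(
    r"^Internet\s+(?P<ip>\d+\.\d+\.\d+\.\d+)\s+\S+\s+"
    r"(?P<mac>[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4})\s+ARPA",
    re.IGNORECASE | re.MULTILINE,
)


def mac_for_ip(arp_output, ip):
    """Return the MAC address recorded for `ip` in ARP table text, or None."""
    for match in ARP_LINE.finditer(arp_output):
        if match.group("ip") == ip:
            return match.group("mac").lower()
    return None


def fetch_arp_table(device):
    """Pull the ARP table from a device described by a Netmiko connection dict."""
    # Imported here so the parser above can be used and tested without Netmiko installed.
    from netmiko import ConnectHandler

    with ConnectHandler(**device) as conn:
        return conn.send_command("show ip arp")


# Canned output standing in for a live device.
SAMPLE = """\
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  10.1.1.1                -   0000.0c07.ac0a  ARPA   Vlan10
Internet  10.1.1.5               12   0011.2233.4455  ARPA   Vlan10
"""

print(mac_for_ip(SAMPLE, "10.1.1.5"))
```

Against real gear you would pass `fetch_arp_table` a dictionary with `device_type` set to `cisco_ios` plus your own `host`, `username`, and `password`, then hand the result to `mac_for_ip`. Keeping the parsing separate from the connection also makes it easy to wrap in a Rundeck job later.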

With these tools in place, the team and I are in great shape to change the way we’re doing things.  Is this the end?  God, no.  This is very much the beginning while we get acquainted with automating ourselves out of our jobs.

This seems like a new line of blog posts, doesn’t it?  Yep.

Send any syntax errors or questions to me.

by Aaron Conaway at July 23, 2018 01:22 AM

XKCD Comics

July 20, 2018

My Etherealmind
Network Design and Architecture

CCDE Written 352-001 Exam Experience – 2018

My recent experience with the CCDE Written Exam. If you are reading this post, you probably know that the CCDE Written (Qualification) Exam is the only prerequisite for the CCDE Practical exam. Also, when you pass the CCDE Practical exam and get the magical number, you need to retake every 2 years CCDE Written or […]

The post CCDE Written 352-001 Exam Experience – 2018 appeared first on Cisco Network Design and Architecture | CCDE Bootcamp |

by Orhan Ergun at July 20, 2018 05:32 PM

The Networking Nerd

Friday Musings on Network Analytics

I’ve been at Networking Field Day this week, and as always the conversations have been great and focused around a variety of networking topics. One that keeps jumping out at me is network analytics. There’s a few things that have come up that were especially interesting to me:

  • Don’t ask yourself whether network monitoring is worth your time. Odds are good you’re already monitoring stuff in your network and you don’t even realize it. Many networking vendors enable basic analytics for troubleshooting purposes. You need to figure out how to build that into a bigger part of your workflows.
  • Remember that analytics can integrate with platforms you’re already using. If you’re using ServiceNow you can integrate everything into it. No better way to learn how analytics can help you than to set up some kind of ticket generation for down networks. And, if that automation causes you to get overloaded with link flaps you’ll have even more motivation to figure out why your provider can’t keep things running.
  • Don’t discount open source tools. The world has come a long way since MRTG and Cacti. In fact, a lot of the flagship analytics platforms are built with open source tools as a starting point. If you can figure out how to use the “free” versions, you can figure out how to implement the bigger stuff too. The paid versions may look nicer or have deeper integrations, but you can bet that they all work mostly the same under the hood.
  • Finally, remember that you can’t possibly deal with all this data yourself. You can collect it but parsing it is like trying to drink from a firehose of pond water. You need to treat the data and then analyze that result. Find tools (probably open source) that help you understand what you’re seeing. If it saves you 10 minutes of looking, it’s worth it.
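On the link flap point above, one way to "treat the data" before it reaches a ticketing system is to collapse a burst of up/down events into a single finding per link. The sketch below is only an illustration: the event tuple format, the five-minute window, and the threshold of three are assumptions, and the step that actually opens a ticket is left out because it depends on how your platform is integrated.

```python
from collections import defaultdict


def flapping_links(events, window_seconds=300, threshold=3):
    """Return {(device, interface): peak event count} for links that flap.

    `events` is an iterable of (unix_time, device, interface, state) tuples.
    A link is reported when at least `threshold` state changes land inside
    any `window_seconds` span.
    """
    by_link = defaultdict(list)
    for timestamp, device, interface, _state in events:
        by_link[(device, interface)].append(timestamp)

    noisy = {}
    for link, stamps in by_link.items():
        stamps.sort()
        start = 0
        peak = 0
        # Sliding window over the sorted timestamps.
        for end, timestamp in enumerate(stamps):
            while timestamp - stamps[start] > window_seconds:
                start += 1
            peak = max(peak, end - start + 1)
        if peak >= threshold:
            noisy[link] = peak
    return noisy


# Hypothetical event feed: one chatty link and one isolated outage.
events = [
    (0, "sw1", "Gi0/1", "down"),
    (10, "sw1", "Gi0/1", "up"),
    (20, "sw1", "Gi0/1", "down"),
    (30, "sw1", "Gi0/1", "up"),
    (1000, "sw2", "Gi0/2", "down"),
]

print(flapping_links(events))
```

One ticket per flapping link, with the count attached, is a lot easier to take to your provider than forty individual alerts.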

Tom’s Take

Be sure to stay tuned to our Gestalt IT On-Premise IT Roundtable podcast in the coming weeks for more great discussion on the analytics topic. We’ve got an episode that should be out soon that will take the discussion of the “expense” of networking analytics in a new direction.

by networkingnerd at July 20, 2018 02:00 PM