How to Think About Virtual Machines: Part Two

To understand virtual machines you have to understand real ones.


Posts in this series:
Part 1
Part 2
Part 3
Part 4
Part 5

The great thing about being technically inclined, about being so curious about how things work that we wind up in some sort of Information Technology field (or, as I like to call it, Data Technology; it isn’t information until someone reads and groks it), is that we can read. By read, I mean we can turn words into understanding, into working knowledge in our minds. We can “see” it. By reading a book we can visualize something through our mind’s eye and even play with it. All this within our minds.

We can wiggle all the bits and parts and comprehend how something would behave in a given situation. And, like the lid on a Sucrets Throat Lozenge can, we take great pleasure in working it back and forth, in opening and closing it. We drift off at night, sometimes, with some neat thing we have learned opening and closing in our mind. This kind of understanding, the mental visualization of mechanism, of seeing how it works, is what enables people to create new technologies. It’s hard to beat the low price of mentally constructed mechanisms.

It’s that kind of understanding of the technological pyramid, atop which Virtual Machines (VMs) sit, that I hope to convey in this blog series. I enjoy running my mind over the surface of all the ideas and workings that create a VM, the modern data system. These ideas are powerful. When we study them from the bottom up, from an intuitive understanding of the workings of a field effect transistor (which I won’t be talking about; fear not), up through the layers of ideas and concepts, to combinational logic with its ability to do IF THEN ELSE functions electrically, to the tiny bits of memory we call registers, constructed from that lower logic, to the state machines of digital design, to the instruction sets that underlie all software, we gain a complete understanding, at least at some level, of the entire stack of the system.

At the top of our pyramid sits the VM, a cyborg, part machine hardware, part software. The full breadth and height of virtualization technology (none of which is rocket science, just a great number of little things) is a two-dimensional conceptual area of true Computer Science (CS) and invention. Almost every CS concept is in there, in those modern VM software/hardware mechanisms. Covering all the concepts, from the base of solid state physics to running your Exchange Server, would be a joyous epic journey through the entire landscape of Data Technology.

But I won’t do all that in this series. I’ll try to just build on the ideas that you will need to have a better understanding of what this VM thing is all about. Then you’ll be able to better understand how the VMs work and how better to deploy them. I hope to shed light on why one would sometimes use containers instead.

Last week, I talked a bit about the history of OSs, with an eye to understanding the first bit of history that led to the rise of the Hypervisor. Today, I’ll look a bit closer at the details of how a machine works by explaining how one particular machine architecture works. This will help us be concrete and less abstract. By learning a real architecture you will be able to visualize what’s going on with not just this machine, but any machine. This opens the question of what machine to use as an example. I am convinced that the best first machine to learn is the one I started with back in the 1970s: the IBM System/360.

I know what some may be thinking. What?! An old mainframe architecture! Why would I want to understand that? Why not tell me more about the Intel?

For those few individuals, let me just say that the problem with not knowing something is that you don’t know what to learn, or in what order to learn it. If you did, you would already know it. At one time, that was the reason we went to an institution of higher education. (The Intel architecture, as described in its 6,000 page tome, is possibly the worst machine in the universe to learn on. The S/360 architecture is completely described in about 80 pages of the Principles of Operation.) Trust me. Simpler is better, even in CISC (Complex Instruction Set Computer). I use the System/360 as an example for three reasons.

First, it played a very important role throughout the history of virtualization, as you will see as our blogs unfold. Second, it is a very good instruction set for beginners. The System/360 instruction set is orthogonal, small, commonsensical, and covers all the relevant things that modern architectures do. It was designed to make assembler programming easy.

One might ask, why not a RISC (Reduced Instruction Set Computer) processor, like the ARM? Hasn’t RISC proven to be a faster way to go? Not really. The ARM is a good instruction set, no doubt about that, but its most attractive feature today is a low gate count and therefore long battery life. Reduced instruction sets seemed faster in the 1980s because they fit nicely within the 100,000 transistor limit of the silicon dies of the time. The performance advantage of staying on the die is huge. The tiny instruction sets let a good part of the silicon real estate go to larger caches instead of complex logic, and the speedups from more cache hits were significant. But as Intel has shown, one can take even the world’s most complex instruction set and make it go really fast. Today we can put a billion transistors and eight levels of interconnect on a somewhat inexpensive die. The advantages of RISC are actually now dated. (I’ll get email on that one!)

Third, it was the first real instruction set I learned. Almost. Actually, my very first instruction set was the delightfully small one in the Motorola 6800 (not the 68000, which I learned a bit later), implemented in 1974 on a 5 mm square chip covered with 4,000 transistors. This 8-bit processor taught me a lot about the very basics of machines. But the first real architecture of sophistication I studied was the IBM 370/158, essentially a System/360 with a handful of instructions added, most of which are used by the supervisor. (Version 0 of the 370 PrincOps was only 32 pages long, an addendum to the 360 PrincOps.) And the 370/158 was physically fascinating. The three large equipment bays, filled with MST modules, stood head high and were arranged in a big T-shape as seen from above, one end of which had a panel filled with blinking lights. It was decidedly not the tiny 4,000 transistor processor of the 6800.

The very first thing to understand about a computer is that it is merely a finite state automaton. The Central Processing Unit (CPU), the “brains” of the computer, is a simple, at least in concept, mechanism that repeats four steps over and over. Like the four stroke (cycle) engine in your car, it just does those four steps repeatedly and very, very quickly. It fetches, decodes, executes, and stores results. It fetches what are called instructions. It fetches them from main memory, which is the DDR sticks in your server. Main memory has been implemented many different ways over the years: tubes of mercury, spirals of square wires, dots on the face of a cathode ray tube, tiny ceramic doughnuts filled with iron filings and strung on small copper wires, six cleverly arranged transistors on a die, and, finally, electron-microscopic capacitors, billions of them, served on a 25 mm square silicon plate.

In our architecture, an instruction is made up of two, four, or six bytes. The first byte is the operation code. In our techno-nerd manner of shortening words until they reach maximum terse obscurity, yet without losing their meaning altogether, we call it the opcode. The opcode determines the length of the rest of the instruction. It also controls what happens in the machine once the instruction gets to the CPU.
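To make the length rule concrete: on the System/360, the two high-order bits of the opcode byte encode the instruction length. Here is a minimal sketch in Python. The function name `instruction_length` is made up for this illustration; only the bit rule and the three opcodes come from the architecture.

```python
def instruction_length(opcode: int) -> int:
    """Return the instruction length in bytes implied by a System/360 opcode.

    The two high-order bits of the opcode select the length:
    00 -> 2 bytes, 01 or 10 -> 4 bytes, 11 -> 6 bytes.
    """
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in one byte")
    top_two_bits = opcode >> 6
    return (2, 4, 4, 6)[top_two_bits]


# Three real opcodes: AR (add register), A (add from memory), MVC (move characters).
for name, opcode in (("AR", 0x1A), ("A", 0x5A), ("MVC", 0xD2)):
    print(name, hex(opcode), instruction_length(opcode), "bytes")
```

The machine can therefore tell how many more bytes to fetch after looking at only the first one.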

Where does this instruction get fetched to? There are memory-like devices in the CPU called registers. Constructed of transistors in a configuration called flip-flops, they offer a very high-speed place to store data. The instruction is loaded into a special register called, naturally enough, the instruction register. As you will see, this instruction register is connected to almost all the parts of the CPU.

Another register that we should mention right at this point is the program counter (PC). It contains the address of the next instruction to be fetched. The PC gets incremented by the length of each instruction as it is fetched, so you can think of the fetching as simply reading the instruction from memory and incrementing the PC so that in the next fetch-decode-execute-store cycle the next instruction in sequence will be fetched.
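The fetch step can be sketched in a few lines of Python. This is an illustration, not how any real machine is built: the `fetch` helper and the three-instruction program are hypothetical, and the length rule (the top two bits of the opcode select 2, 4, or 6 bytes) is the System/360 convention.

```python
from typing import Tuple


def fetch(memory: bytes, pc: int) -> Tuple[bytes, int]:
    """Read the instruction at address `pc`; return it with the updated PC."""
    opcode = memory[pc]
    length = (2, 4, 4, 6)[opcode >> 6]      # System/360 length rule
    instruction = memory[pc:pc + length]
    return instruction, pc + length         # PC now points at the next instruction


# Hypothetical program: LR 1,2 ; A 1,0(0,3) ; AR 1,1
memory = bytes([0x18, 0x12, 0x5A, 0x10, 0x30, 0x00, 0x1A, 0x11])

pc = 0
while pc < len(memory):
    instruction, pc = fetch(memory, pc)
    print(instruction.hex(), "next PC =", pc)
```

Note that the PC advances by two, then four, then two: it moves by the size of whatever was just fetched, not by a fixed step.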

And there are other registers visible to the architecture. In the System/360 they are called the general purpose registers. They serve an important function in getting more speed from the CPU. As I said, registers are very high speed memories implemented with all the design compromises tilted toward being fast rather than taking up less space. Main memory is all about how many bits you can cram into a small space. A general purpose register is about how fast you can go. You can imagine a register as a short and wide rectangle; that’s the way we draw them on the whiteboard. The sixteen general purpose registers in the System/360 are addressed by a number, 0 through 15. In a way that makes them a different, tiny memory, word addressed instead of byte addressed, that is implemented inside the CPU.
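One way to picture the two kinds of storage is as two Python containers. This is only a sketch with made-up names (`memory`, `registers`, `load_word`) and an arbitrary 64-byte memory. It does assume the System/360 convention of storing the most significant byte of a word first.

```python
# Main memory: byte addressed. Each address names one 8-bit byte.
memory = bytearray(64)
memory[16:20] = (123456).to_bytes(4, "big")   # a 32-bit word at address 16

# General purpose registers: sixteen 32-bit words, addressed 0 through 15.
registers = [0] * 16


def load_word(reg: int, address: int) -> None:
    """Copy the four bytes starting at `address` into register `reg`."""
    registers[reg] = int.from_bytes(memory[address:address + 4], "big")


load_word(5, 16)
print(registers[5])   # 123456
```

A register number picks out a whole 32-bit word, while a memory address picks out a single byte, so it takes four consecutive addresses to fill one register.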

Meanwhile, back at the instruction register... I like to envision the bottom of the instruction register as connected to memory. On the top side, I imagine lines running off in all directions throughout the CPU. The rest of the machine is controlled by these bits. These signals control actions of the machine in the next two phases. You can think of the instruction register as a function call, with the operation code controlling the function and the operands as the parameters of the function.
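The function-call picture can be taken almost literally. The sketch below decodes a two-byte register-to-register instruction into an opcode and two register numbers, then uses the opcode to pick a function. The opcodes are real System/360 ones, but the `OPERATIONS` table and helper names are invented for this example, and condition codes are ignored to keep it short.

```python
import operator

MASK32 = 0xFFFFFFFF   # registers hold 32 bits

# Opcode -> (mnemonic, function applied to the two operands).
OPERATIONS = {
    0x14: ("NR", operator.and_),
    0x16: ("OR", operator.or_),
    0x17: ("XR", operator.xor),
    0x1A: ("AR", operator.add),
    0x1B: ("SR", operator.sub),
}


def decode_rr(instruction: bytes):
    """Split a two-byte instruction into opcode, R1, and R2 fields."""
    opcode = instruction[0]
    r1 = instruction[1] >> 4      # high four bits of the second byte
    r2 = instruction[1] & 0x0F    # low four bits of the second byte
    return opcode, r1, r2


def execute_rr(instruction: bytes, registers: list) -> str:
    """Apply the selected function; the result replaces the first operand."""
    opcode, r1, r2 = decode_rr(instruction)
    mnemonic, function = OPERATIONS[opcode]
    registers[r1] = function(registers[r1], registers[r2]) & MASK32
    return mnemonic


registers = [0] * 16
registers[3], registers[4] = 40, 2
print(execute_rr(bytes([0x1A, 0x34]), registers), registers[3])   # AR 42
```

In hardware there is no table lookup, of course. The opcode bits simply enable the right paths through the logic.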

Why am I spending so much time on this? Because without understanding all this you really can’t grok VMs. To understand why a VM might be taking too long, you need to understand the principles behind a running machine. To understand those principles, it helps to understand a real architecture, even if it’s not the Intel one. Now, where was I?

After the instruction is fetched, in the first of our fetch-decode-execute-store cycles, the logic gates in the CPU decode the instruction. You can think of the time it takes the instruction to travel from main memory to the instruction register as the fetch phase, and the time it takes all the logic gates connected to the other side of the instruction register to settle down as the decode phase. Those logic gates, made from CMOS transistors, have something called a propagation delay. Set the input from false to true, and it takes a certain amount of time, usually less than a nanosecond per gate, for the outputs of the logic to reflect the inputs.

This is as good a place as any to talk about the processor clock. Every CPU, from the very first time the EDSAC booted in 1949, has had a clock, a master timer that counts off the steps in this four phase process. Each step, fetch or decode or execute or store, takes some number of clock periods. The fetch might take four cycles of the clock, the decode two, and the execute phase, which we will talk about below, will take a variable number of cycles depending on which instruction is being executed. The speed of this clock is what we talk about when we talk about the clock speed of a processor. (Currently the processors in Coraid EtherDrive Media Arrays lope along at a leisurely 1.6 GHz. We go fast because our efficient software uses so many fewer of those cycles.) The clock period for the original models of the System/360 varied, from 1 microsecond all the way down to 60 nanoseconds. A single instruction would take a variable number of clock cycles to execute, depending on what it had to do. There was a document, IBM document number A22-6824, that specified how many microseconds each instruction would take. You could calculate how long your software would take down to the millionth of a second.
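The arithmetic is simple enough to sketch. The cycle counts below are hypothetical (the execute and store numbers in particular are invented for illustration); the clock figures are the ones mentioned above.

```python
def clock_period_ns(frequency_hz: float) -> float:
    """Length of one clock tick in nanoseconds."""
    return 1e9 / frequency_hz


def instruction_time_ns(cycles: int, period_ns: float) -> float:
    """Time for one instruction that needs `cycles` clock ticks."""
    return cycles * period_ns


# Hypothetical cycle counts for a single instruction, one entry per phase.
phase_cycles = {"fetch": 4, "decode": 2, "execute": 3, "store": 1}
total_cycles = sum(phase_cycles.values())

clocks = {
    "1 microsecond period": 1000.0,
    "60 nanosecond period": 60.0,
    "1.6 GHz clock": clock_period_ns(1.6e9),
}
for label, period in clocks.items():
    print(f"{label}: {instruction_time_ns(total_cycles, period):g} ns")
```

Add up the times for every instruction in a program and you have the kind of calculation those timing tables made possible.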

After the decode logic has had time to settle, the execution phase begins. First, values from the general purpose registers, or from main memory, are made available to the two inputs of what is called the Arithmetic/Logical Unit. It’s the ALU that does all the math. It adds, it negates and adds (that is to say, subtracts), it does bit-wise ANDs, ORs, and XORs. And it shifts, left and right. After it’s done, it sets some condition codes to reflect the result of the operation. You’ll see how we use these codes when we look at branching.
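Here is a small sketch of the add and subtract paths of such an ALU, for 32-bit two's complement values. The function names are made up, and only two of the operations are shown. The condition code values follow the System/360 convention for addition and subtraction: 0 for a zero result, 1 for negative, 2 for positive, 3 for overflow.

```python
MASK32 = 0xFFFFFFFF
SIGN_BIT = 0x80000000


def to_signed(value: int) -> int:
    """Interpret a 32-bit pattern as a two's complement number."""
    value &= MASK32
    return value - (1 << 32) if value & SIGN_BIT else value


def alu_add(a: int, b: int, carry_in: int = 0):
    """Add two 32-bit values; return (result, condition_code)."""
    exact = to_signed(a) + to_signed(b) + carry_in
    result = exact & MASK32
    if to_signed(result) != exact:
        cc = 3                    # overflow: the true sum does not fit in 32 bits
    elif result == 0:
        cc = 0                    # zero
    elif result & SIGN_BIT:
        cc = 1                    # negative
    else:
        cc = 2                    # positive
    return result, cc


def alu_subtract(a: int, b: int):
    """Subtract by negating and adding: invert every bit of b, then add with a carry in of 1."""
    return alu_add(a, ~b & MASK32, carry_in=1)


print(alu_add(2, 3))              # (5, 2)
print(alu_subtract(5, 5))         # (0, 0)
print(alu_add(0x7FFFFFFF, 1))     # (2147483648, 3)
```

Inverting the bits and adding one is exactly what "negate and add" means in two's complement, which is why one adder can serve both operations.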

The final step in our electronic analogy to the four-stroke internal combustion engine’s suck-squeeze-bang-blow cycle is the store phase. Once we have something emerging from the ALU, we have to put it someplace, usually a general purpose register. For our architecture, one moves the values into one of the sixteen general purpose registers, each holding a 32-bit value. While both input values to the ALU can come from registers, or one can come from memory, the result has to go back into a register. This is called a two address architecture. To return a value back into main memory, a store instruction is used.

So, as you can see, the execute phase takes values from a register and memory, or from a register and another register, passes them through the Swiss Army knife of the ALU, and saves the output somewhere. You can see the power and performance enhancement of the general purpose registers. The instruction set has sixteen of them, which seems to be the right number. You find sixteen in the ARM processors, for example. Move data into registers, do math on it with values from memory or other registers, then save the results back into main memory. This forms the basis of all software.

So the Fetch-Decode-Execute-Store cycle repeats over and over and over again, billions of times a second. While the actual deep down internal way the modern CPUs work, something called microarchitecture, is more complex than this description would imply, the model, from the instruction stream point of view, is completely correct.
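The whole cycle fits in one short loop. What follows is a toy sketch, not an emulator: it knows only a handful of real System/360 opcodes (LR, AR, SR, L, A, S, ST), ignores condition codes and address wraparound, and stops when the PC reaches an `end` address, which is a convenience invented for this example. The `run` function and the little program are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def run(memory: bytearray, registers: list, pc: int, end: int) -> int:
    """Repeat fetch-decode-execute-store until the PC reaches `end`."""
    while pc < end:
        # Fetch: the opcode's top two bits give the length; advance the PC.
        opcode = memory[pc]
        length = (2, 4, 4, 6)[opcode >> 6]
        ir = bytes(memory[pc:pc + length])
        pc += length

        # Decode: pick the fields apart and locate the second operand.
        r1 = ir[1] >> 4
        if length == 2:                       # register-to-register form
            operand = registers[ir[1] & 0x0F]
        else:                                 # register-and-memory form
            x2, b2 = ir[1] & 0x0F, ir[2] >> 4
            d2 = ((ir[2] & 0x0F) << 8) | ir[3]
            # Register number 0 in an address field means "no register".
            address = d2 + (registers[x2] if x2 else 0) + (registers[b2] if b2 else 0)
            operand = int.from_bytes(memory[address:address + 4], "big")

        # Execute: the opcode selects the operation.
        if opcode in (0x18, 0x58):            # LR, L: load
            result = operand
        elif opcode in (0x1A, 0x5A):          # AR, A: add
            result = (registers[r1] + operand) & MASK32
        elif opcode in (0x1B, 0x5B):          # SR, S: subtract
            result = (registers[r1] - operand) & MASK32
        elif opcode == 0x50:                  # ST: register back to memory
            memory[address:address + 4] = registers[r1].to_bytes(4, "big")
            continue
        else:
            raise ValueError(f"opcode {opcode:#04x} not in this toy machine")

        # Store: the result always lands in the first-operand register.
        registers[r1] = result
    return pc


memory = bytearray(64)
memory[0:16] = bytes([
    0x58, 0x10, 0x00, 0x20,   # L  1,32   load the word at address 32
    0x5A, 0x10, 0x00, 0x24,   # A  1,36   add the word at address 36
    0x18, 0x21,               # LR 2,1    copy register 1 to register 2
    0x1A, 0x22,               # AR 2,2    double register 2
    0x50, 0x20, 0x00, 0x28,   # ST 2,40   store register 2 at address 40
])
memory[32:36] = (20).to_bytes(4, "big")
memory[36:40] = (1).to_bytes(4, "big")

registers = [0] * 16
run(memory, registers, pc=0, end=16)
print(registers[2], int.from_bytes(memory[40:44], "big"))   # 42 42
```

Everything the loop does is ordinary software, working on an array of bytes and a list of sixteen numbers.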

It’s what we need to fully grok before we can see how we can make this machine out of pure software, before we can make a VM.

Next week we’ll talk about specific machine instructions and learn what’s underneath all the software in the world.

About the Author

Brantley Coile

Inventor, coder, and entrepreneur, Brantley Coile invented stateful packet inspection, network address translation, and Web load balancing used in the Cisco LocalDirector. He went on to create the Coraid line of storage appliances, a product he continues to improve today.
