So, as was mentioned a few seconds ago, I spoke at the previous Merge in 2014. I talked about continuous validation, and continuous validation for me really means how do we ensure that our code lines are safe and sane? How do we ensure that a developer is not going to submit something that is going to break a code line?
So I work for this particular company called Xilinx; you may or may not have heard of us before. We make these. These are FPGAs, or field-programmable gate arrays. They are programmable chips. They differ from a traditional chip in that you can change the logic on them over and over again. With the traditional integrated circuit, you spend a whole bunch of time upfront designing what this thing is going to do, and then you're going to hand it over to somebody who's going to fab a bunch of these things for you. Our product gives you the flexibility to change that design over and over.
And for those who are interested, the way that we get away with this is… If we remember from college, the way an integrated circuit works is you have a bunch of logical gates. Those logical gates are interconnected together, and that gives you your functionality. With an FPGA, we have a bunch of logical blocks. Those logical blocks are all interconnected together, and a logical block can mimic the behavior of a logical gate. And so we're thus able to mimic the same behavior you would get in your traditional integrated circuit.
So what kind of applications are you going to find this in? Basically areas where flexibility is paramount. That's the most important thing. Where you're not so concerned about the absolute cost of a chip. You're not so concerned about perhaps power usage or heat, where flexibility and the ability to change your design is the most important thing to you.
So we're in a lot of things you probably use, but you weren't aware that we are there. We're in medical devices, the automotive industry, aerospace and defense, telecommunications, up in space, and, my personal favorite, we're on the Mars rover — Curiosity.
So what are some of the challenges that we face at Xilinx? So we are a hardware company, we produce hardware, but we're also a software company. You need to be able to program these chips. And to program this chip, you're actually going to write code. You're going to write in a hardware-specific programming language, Verilog or VHDL. And then we're going to take that code and we're going to pass it through a tool chain. We're going to analyze it and make sure that what you've designed is going to work with this chip. And then we're going to go through a series of transformations, ultimately resulting in something the chip can understand.
So this is similar to what you would do in normal software engineering. You're going to code, and you're going to pass it to a tool chain. You're going to go through a compiler and a linker, and you're ultimately going to get something that goes to a CPU. And so the software we produce is very similar to an IDE that you would work in as a normal software engineer. This is an IDE specifically designed for the hardware engineer — designed for them to be able to write their code and do their analysis and work with that tool chain.
We're a reasonably sized software organization of about 350+ active developers; the company's about 3,000. Lots of folks working in hardware and other areas. We have a large code base on the order of tens of millions of lines of code. It's all checked into Perforce. We follow the mainline model, and we have tons of release branches, tons of development branches, integration branches, and patch branches.
So our challenge is just figuring out, “How do we make sure that everything that is going into these branches is ready to go?” So I'm going to start by telling a little bit of a story. This is a bit of a recap from my last talk. I'm going to walk through the different stages of validation that we attempted at Xilinx.
So when I first started at Xilinx back in 2008, I joined a small team of folks that had been brought in through an acquisition a few years earlier. They made a software product that was specifically tailored for one part of an FPGA. And shortly after I got there, Xilinx decided that they were going to sunset their existing software solution (the way that people program chips) and create something new that they felt was going to better meet the needs of future customers. And so they looked around the organization, saw this little team, and said, "We like what you guys do. We like the methodologies that you follow. We'd really like it if you took on the role of building this next generation solution."
So this graph here represents the growth that we saw in developers. Starting from very small to where we are today. So that did not turn out very bright.
When we first started, we were able to get away with nothing more than a nightly cycle. And that worked because we had this very small group of engineers. They knew the code base really well. Each one of them had already developed their own validation strategy. They'd already kind of figured out how they were going to check their code to ensure they weren't going to break their neighbor.
And that was fine when you're small, but very quickly we're getting bigger; we're adding in new people. These are new people coming in from outside the company — folks coming from the old legacy software platform. They're coming into a new code base, new programming languages, new software paradigm, a totally different version control system. So they're just in a new space. And so we get lots of breakages as a result of that.
So the next step of evolution for us, carrying us into the 2010s, was Continuous Integration. We added Continuous Integration, and this solved a lot of our early problems.
Now we're able to see the failures as soon as they come in, react to them, either back out that change, or maybe get the developer to submit a fix. But the problem with Continuous Integration is that it doesn't prevent bad code from getting into your code line. It just tells you that it's there, and you have to react to it.
So as we get bigger and more failures are coming in, we, unfortunately, get to that point where all we have is a continuous cycle that is always broken.
So how can we fix this? So carrying ourselves forward we decided we would look at the failures in our continuous cycle, and see if they fit some sort of broad categories and maybe we could respond to this with some tooling. So we quickly recognized that there were sort of two big problems that developers were hitting. The first is they were working in one dominant platform day to day, and there were several different platforms that we were releasing on. And so all of the breakages, of course, were occurring on these platforms that they weren't building on day in and day out. So we create some tools — make it very easy for them to build and test across all the platforms we support.
The other thing we saw was issues with version control. In the old legacy software platform, folks were used to a modified version of CVS. And so they were very accustomed to being able to touch their files, and then when they did a submit, the version control system would crawl everything that was in their workspace and ensure that all the code they touched came in. With Perforce, they had to do edits, adds, deletes.
This was new to them, and as a result of that, they often missed a file during a submit. So we created some tooling around that to improve workspace integrity. We put these two things together. Create one single workflow, one single command. We call it pre-commit, and we mandate that every single developer run it. And that was what carried us up into 2013. And that worked really well.
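The workspace-integrity half of pre-commit can be pictured as a set comparison between what a developer touched on disk and what they opened for submit. The following is only a minimal sketch under that assumption: `WorkspaceState`, its fields, and both functions are hypothetical names for illustration, not the actual Xilinx tooling and not a Perforce API.

```python
from dataclasses import dataclass, field


@dataclass
class WorkspaceState:
    """Hypothetical summary of a developer workspace (not a Perforce API)."""
    modified_on_disk: set = field(default_factory=set)  # content differs from depot
    new_on_disk: set = field(default_factory=set)       # present locally, unknown to depot
    missing_on_disk: set = field(default_factory=set)   # in depot, deleted locally
    opened: set = field(default_factory=set)            # opened for edit/add/delete


def find_unopened_changes(state: WorkspaceState) -> dict:
    """Return files the developer touched but never opened for submit."""
    return {
        "needs_edit": sorted(state.modified_on_disk - state.opened),
        "needs_add": sorted(state.new_on_disk - state.opened),
        "needs_delete": sorted(state.missing_on_disk - state.opened),
    }


def workspace_is_clean(state: WorkspaceState) -> bool:
    """True when every touched file is part of the pending change."""
    return not any(find_unopened_changes(state).values())
```

A check like this would run before the build-and-test step, so that a forgotten file is reported to the developer instead of surfacing later as a broken build.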
It really reduced the number of breakages we saw coming in, but it wasn't perfect. And the reason it wasn't perfect is because not everybody ran it. Sometimes they wouldn't run it because they forgot. Sometimes they wouldn't run it because they didn't want to. Sometimes they didn't run it because they felt that the particular code change was benign, and they didn't think it was important. They didn't think it was going to break anything.
And so we still had this small amount of breakage going on. And as we get bigger, there's just enough breakage to cause us pain. So we got to take this to the next level. We have to move from just mandating that people run this workflow to enforcing that they do it.
So we need to create some automation. We've got developers and they want to submit changes, and they want to submit changes to Perforce, and we have to get in the middle. So we create some automation, which we lovingly call the wall. And the wall takes the developer's change, does validation on it, and if it's successful, submits it to Perforce on their behalf. And if unsuccessful, rejects that change, throwing it back to them, letting them know what's broken.
And let's dive in, and see a little bit more about what's going on here. When the developer runs the submit command, instead of creating a changelist, we create a shelf. That shelf is handed to the wall. The wall manages a series of workspaces and spawns off workers. A worker takes a workspace and the shelf, combines them together, does conflict resolution, sync up, build, test. And then, if successful, submits it to Perforce on their behalf. If unsuccessful, throws the change back to the user, letting them know what was broken.
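The gating flow just described (take a shelf, validate it, then submit or reject) can be sketched as a short pipeline. This is a minimal illustration, not the wall's implementation: `Shelf`, `Result`, and `validate_and_submit` are hypothetical names, and the validation steps and the submit action are passed in as plain callables rather than real Perforce operations.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Shelf:
    """Hypothetical stand-in for a shelved change handed to the wall."""
    owner: str
    files: List[str]


@dataclass
class Result:
    submitted: bool
    message: str


def validate_and_submit(
    shelf: Shelf,
    steps: List[Callable[[Shelf], bool]],
    submit: Callable[[Shelf], None],
) -> Result:
    """Run each validation step in order; submit only if all of them pass."""
    for step in steps:
        if not step(shelf):
            # Reject: the change goes back to the developer, naming the failed step.
            return Result(False, f"rejected at {step.__name__}")
    submit(shelf)  # submitted on the developer's behalf
    return Result(True, "submitted")
```

Because the developer never submits directly, a failing step means the code line is never touched; the only path into it is through the final `submit` call.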
So that was kind of the overview of what I did in 2014. Just talking about the different stages we went to, ultimately the automation and the wall. I'm here to say that we're still using the wall. We still love the wall. The wall was great.
So what have we been doing over these past few years? We're trying to deal with the fact that Xilinx keeps getting bigger. There's more stuff to deal with, and we have to move faster. So since I last spoke, we've seen a slight increase in the total number of active developers. So right there at 2014 was when I last spoke, and we see a slight increase, not a huge increase, but a reasonable increase.
More interestingly, if I look at year-over-year submissions, also up to 2014, we go through sort of plateaus and peaks and plateaus and peaks. And we hit this plateau phase when I was done with the last talk. And this past year, we've hit another increase in year-over-year submissions — one that seems to be continuing to grow.
So we keep getting bigger. And we've got to move faster in light of the fact that we've got more developers, more code. Those tens of millions of lines of code just keep growing day in and day out.
We've got more change, a lot of code churn, a lot of refactoring going on, many more releases, more products leading us to stricter timelines. And, of course, we've got tighter budgets to deal with. We always have tighter budgets. So we've got to figure out how to do more with less.
So what could we do to our existing automation to make it better? So we decided to take one of our software principles, “DRY” (don't repeat yourself), and look for duplication in our automation and see how we could remove it.
So going back to this picture of what we have, I want to focus first on the workspaces. So in the very first version of the wall, we had what we refer to as static workspaces. And what that meant is that as we had a queue of shelves coming in, when a worker started, it would take a shelf and an available workspace, combine them together, do conflict resolution, sync up, build, test, and either put it into Perforce or send it back to the user. And when it was done, the shelf would be thrown away, and the workspace would be returned to the pool.
Of course, we continue to go through this process with each new shelf that arrives. The problem with static workspaces is we're constantly alternating back and forth between the workspaces that we have.
We never know which one we're going to get. It depends on how long validation took on the previous change. We're doing an incremental build on each one of these workspaces. If a workspace has been sitting around for a long time and has become very stale, then the code delta we have to build has increased, so validation may take longer. And then finally, when we encounter a failure, we want to make sure that failure doesn't infect something else later on. So we take that workspace, take it out of the pool, peel back everything that we did, rebuild it at a known good state, put it back in.
So this is wildly inefficient and causes lots of slowdowns in the wall, especially with huge numbers of failures coming in. So where can we go? Well, luckily for us, our storage vendor got ahold of us. Let us know that they had some cloning technology they wanted us to try, the ability for us to take bits on disk, instantaneously clone them to somewhere else, and so we took advantage of that. Sorry, here we go.
We got rid of the static pool of workspaces that we had before. And instead, what we created was what we refer to as a master cycle.
So on each branch we have this master cycle. The master cycle is checking out the latest and greatest code on that branch, doing build test, creating a snapshot. The snapshot is instantaneously clonable. So now, when a worker runs, it simply clones a snapshot, takes the shelf, combines them together, conflict resolution, build test. And then when it's finished, we get rid of the shelf, and we get rid of the workspace. So now we no longer have to deal with the cleanup cycle we had before. We have a near inexhaustible number of these clonable workspaces, so cloning is a big advantage to what we had in the past.
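The snapshot-and-clone model can be illustrated in memory. This is a minimal sketch with several assumptions: `Snapshot`, `take_snapshot`, `clone`, and `validate_shelf` are hypothetical names, a dictionary of file contents stands in for a workspace, and a plain dictionary copy stands in for the storage-level clone (which, unlike this copy, is described as instantaneous).

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Callable, Dict, Mapping


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical read-only image of a built, tested workspace."""
    changelist: int
    files: Mapping[str, str]


def take_snapshot(changelist: int, files: Dict[str, str]) -> Snapshot:
    # The master cycle publishes a frozen view after a good build and test.
    return Snapshot(changelist, MappingProxyType(dict(files)))


def clone(snapshot: Snapshot) -> Dict[str, str]:
    # Stand-in for the storage-level clone: a private, writable copy.
    return dict(snapshot.files)


def validate_shelf(
    snapshot: Snapshot,
    shelf: Dict[str, str],
    check: Callable[[Dict[str, str]], bool],
) -> bool:
    workspace = clone(snapshot)
    workspace.update(shelf)  # overlay the shelved files on the clone
    # Build and test would run here; the clone is discarded on return,
    # whether the check passed or failed.
    return check(workspace)
```

The point of the design is visible in the last function: a failed validation cannot contaminate anything, because the only thing it modified was a throwaway clone, so no cleanup cycle is needed.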
So with cloning in place, we have to look at what else we can improve on. And we recognize that the faster that master cycle is, the better it is for us. We want those snapshots, those things that we’re cloning, to be as close as possible to the latest and greatest code in a branch. And when we look at our validation cycle, we realize that during the master build, the build phase is what's taking all the time.
We're doing building, we're doing tests, but test is just a tiny fraction of what's going on. So for us, building is the long part. And when we say building, we really mean compilation. Our building phase is all just a bunch of compiles going on.
We're building across multiple different products or platforms. Some of those platforms are rather slow. So how do we speed up this compilation? We introduced an object cache. So what is an object cache?
Fortunately, there we go. If I know what my source file is, and I know what my tools are (my compilers, my linkers), and if I know what my environment is (OS, patch revision, environment variables, command line options), I can take all of this information and feed it into a checksum. I can get a unique string, and I can use that to map to my object, shared object, or DLL.
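The checksum idea can be sketched with a standard hash. This is an illustration under stated assumptions, not the cache Xilinx built: the function name and parameters are hypothetical, SHA-256 is one possible choice of checksum, and `source_bytes` is assumed to already include everything the compiler would read (for example, preprocessed source with headers expanded).

```python
import hashlib
import json
from typing import Dict, List


def object_cache_key(
    source_bytes: bytes,
    tool_versions: Dict[str, str],
    os_name: str,
    env: Dict[str, str],
    command_line: List[str],
) -> str:
    """Hash every input that can influence the compiled output."""
    digest = hashlib.sha256()
    # A length prefix keeps the source and the context from blurring together.
    digest.update(len(source_bytes).to_bytes(8, "big"))
    digest.update(source_bytes)
    # Serialize the rest deterministically so dict ordering cannot change the key.
    context = {
        "tools": tool_versions,
        "os": os_name,
        "env": env,
        "argv": command_line,  # list order is kept: argument order can matter
    }
    digest.update(json.dumps(context, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()
```

Two builds that agree on every input produce the same key and can share one object; changing any single input, such as an optimization flag, yields a different key.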
So this gives us that ability to build once, use everywhere.
If we think back to what we had in our previous cloning model, we're building twice. The first time the change would come in, we would go through the validation process, build it. Then once that went into Perforce, the master cycle would pick it up, and we build it again. So now with caching, we get down to build once. Next step for us is improving on validation.
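The move from building twice to building once comes down to a lookup before every compile. A minimal sketch, assuming an in-memory store and a precomputed key; the `ObjectCache` class and its counters are hypothetical, and a real cache would live on storage shared by the validation workers and the master cycle.

```python
from typing import Callable, Dict


class ObjectCache:
    """Hypothetical in-memory object cache keyed by a checksum string."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get_or_build(self, key: str, compile_fn: Callable[[], bytes]) -> bytes:
        """Return the cached object, compiling only on a miss."""
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        obj = compile_fn()  # the expensive step, run at most once per key
        self._store[key] = obj
        return obj
```

In this picture, validation of a shelf takes the miss and pays for the compile; when the master cycle later picks up the same change, its call with the same key is a hit and simply copies the object in.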
So in the very first version of the wall, we had a very naive validation approach. We aimed for safety with this solution. We said that we needed to make sure that developers trust what they're using. If they don't trust that, they're not going to want to use it.
So when we got a shelf, we looked at all of the different products we knew how to build, and we built and tested every single one of them, which is wildly inefficient. In all likelihood, this particular shelf may not actually impact all of these different products. So as soon as we got the opportunity, as soon as we had this thing in place and we had the ability to add some more features, we started to increase the information we're able to get out of our make system. We have a better understanding of our inter-product dependencies.
And so we started looking at the code that was coming in the shelf and making a determination so that we can reduce our product matrix to just what needs to be built and tested. So in this case, we see that we just have to build and test one product, and there's a runtime dependency between two other products, so they must be tested together.
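Reducing the product matrix can be sketched as a mapping from changed files to products, plus a lookup of runtime partners. Everything here is an assumption for illustration: the function, the path-prefix ownership table, and the product names in the usage note are hypothetical, and in the talk this information comes from the make system rather than from hand-written tables.

```python
from typing import Dict, List, Set, Tuple


def select_products(
    changed_files: List[str],
    owners: Dict[str, str],             # path prefix -> product that builds it
    runtime_deps: Dict[str, Set[str]],  # product -> products it must be tested with
) -> Tuple[Set[str], Set[str]]:
    """Return (products to build, products to test) for a change."""
    to_build = {
        product
        for path in changed_files
        for prefix, product in owners.items()
        if path.startswith(prefix)
    }
    # Runtime partners are tested alongside the built products (one level only).
    to_test = set(to_build)
    for product in to_build:
        to_test |= runtime_deps.get(product, set())
    return to_build, to_test
```

For example, with `owners = {"src/synth/": "synth", "src/route/": "route"}` and `runtime_deps = {"synth": {"route"}}`, a change to `src/synth/a.cpp` builds only `synth` but tests both `synth` and `route`. A production version would also need a safe fallback, such as validating everything, for files that map to no known product.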
So that sort of wraps up the three main changes we introduced to the wall over these past 16 months or so. And that's given us a huge improvement in throughput, and a reduction in the total number of resources that we consume.
So now I want to spend just a little bit of time talking about our next target. What is the next thing that we want to do with our automation? And it's all around infrastructure. We want to improve the infrastructure that we're currently working with.
So at Xilinx we have huge pools of resources, a bunch of compute resources. And those resources are used for either testing or they're used for compilation. Our problem at the moment is that we don't have the ability to shift back and forth between how much testing resource we have versus how much compilation resource we have.
So we have no elasticity and no ability to scale out to these different clusters. They're tightly bound to the hardware we have, even though the hardware is basically all the same. So we want to improve on that. We also want to improve on our host management and utilization. In testing, we find that there are often cases where a host hits the tests in a strange way, we get an interesting behavior, and we really want to be able to analyze that. We want a developer to be able to look at it. And so when that happens, we have to take that host out of our pool of resources, which is bad.
On the utilization front, we know that our tests, again, use lots of CPU, lots of memory. But if you profile them, they're not using all that CPU and memory the whole time. So we need a better way to ensure that we utilize all the hardware we have. And then, the current real pain point for us at the moment is on the storage front.
We use centralized storage for everything. These are all compute resources, they're all communicating with our central storage and we're all doing that through remote protocols, whether it's NFS or SMB. And we find that to be rather painful, especially with SMB. We do a lot of work in the Windows world to try and speed up our build and test cycles, and we have a lot of pain with the SMB protocol being very noisy, being very flaky under incredible amounts of load. So we'd love to find a way to reduce that.
So our next direction is virtualization. I think that's probably obvious to anyone who saw those last couple of bullets. We know that with virtualization we're going to get our scalability and elasticity that we want, that's the whole point of virtualization. We've abstracted away the hardware, makes it very easy to shuffle things around.
We know we're going to get that management and utilization. We know that it's going to be much easier to pull off a VM and let somebody look at it without impacting hardware. And in the case of things like VMware, we get vMotion, so we get the ability to take our hot VM and, when it stops using resources, swap it out for another hot VM. So that means we're going to maximize our hardware, which we're very much looking forward to.
And for me, I'm most excited about the storage front. So with virtualization, we get the ability to clone these VM images, so we get that cloning ability. And the neat thing is that the image that's cloned, when attached to the OS VM, looks like a physical disk, and it's interacting with a native file system. So the images themselves are actually still on shared storage and we're still going through some sort of remote protocol. But at the OS level where the testing is occurring, we're going through and using the native file system. And for us, in our testing, this is a big reduction in total IO operations across everything that the wall is using. It also greatly increases our stability. We don't have all of the problems that we were seeing in the past, especially on Windows.
So we're super excited about where we can go on the storage front.
And so the last thing I wanted to do here is, I just kind of wanted to finish up and talk a little bit about my experience in building this, and any advice I can give to anyone who wants to do the same thing. This is a question that has come to me in the past. Some folks who've seen my last talk have asked me what recommendations I would have.
And the first thing is to kind of figure out when you need to add some sort of automation that is really gating every single change coming into version control. I personally wouldn't do it too early and, of course, I wouldn't wait too late. When you're really small, there's lots of different strategies you can employ that don't require something quite so heavy-handed. And then if you wait too late, then obviously you're going to be fighting a lot of fires. It's going to make this more difficult, so you got to do it somewhere in the middle.
I can't really say where, it depends on your organization and your development community. The one thing I can say, though, is don't repeat a mistake that we made. We came up with this particular idea in early 2011, and we didn't actually implement it until the middle of 2013, and that was a really long time. It was far longer than we should've waited.
We looked at what we were trying to create. We said, “This is really complicated. There must be an easier way to do it.” In our case, there wasn't. And so dragging our feet just caused us a lot of pain and a lot of frustration with our developers. So my main recommendation here is don't wait too long when you think you need it.
And then, lastly, make sure that you're building on some solid foundation. I have had several folks come up in the past who have talked to me and said, “Hey, we tried to build something very similar, and it did not work out. And it didn't work out because we didn't have incremental compiles on all of our platforms, or we didn't have fast full compiles, or we couldn't build on all of our platforms, or we didn't understand all the different dependencies.” So you can't do something like this until you have that really reliable build system.
We did get a very big advantage in that we decided we were going to go through this big revolution with changing our software base, and when we did that, we got the opportunity to change our build system. And we got the opportunity to guarantee that we had really fast incremental builds that we didn't have in the past. So if we hadn't spent that upfront time, we would not have been able to build something like the wall.
You also need that reliable testing system. Reliable testing system with all of those well-defined tests that are going to guarantee that all the functionality is there when you build this thing to meet whatever qualification you're after.
So you need to have those two things in place. And of course you need to have a staff of people for this thing that you're now going to support. I think what's perhaps sometimes missed is the fact that you're building something tightly coupled with Perforce. Your developers expect it to be up like Perforce. So we are a worldwide company, and we have to have staff around the world, responding to stuff 24/7. A developer calls up and says, “I have a weird problem.” We have to be able to respond to that, and help them make sure that working with this new automation for submitting their code is a very, very smooth process.
So that pretty much concludes my talk.
I want to thank everyone for coming out to listen to me speak, and thank Perforce for inviting me to speak at Merge again. And I want to thank my team. This was not done by one person; there were a large number of men and women who built the tools, maintained the infrastructure, and helped create this automation. So I want to thank them too. Thank you.
Speaker 2: Anybody, any questions for Joshua here?
Speaker 3: So I have two questions about this. How frequently do you update the workspace that you're cloning from? Every time?
Joshua: We're cloning from? Basically as fast as we possibly can. So it depends on how long it takes us to go through that build and test cycle. We basically have just one of those running. So as fast as we can go through that process. Let's say it takes 30 minutes, then there'll be a 30-minute delay from what's currently the latest and greatest code.
Speaker 3: Okay. You said that for the smart validation you are checking the actual change to see whether you need to build and test every product or only one product. How do you define this dependency?
Joshua: So all of those dependent… What we basically did is, we extended our make system so that all of those dependencies are defined in the make system. And then when we go to build this change, we basically say, “Make system, here's the change, tell me the products,” take those products, make that decision, build and test them.
Speaker 4: Yeah. You mentioned that storage was one of the pain areas. And I was wondering if you've been doing anything or trying any bonding or maybe utilizing SMB3 to try to get around those?
Joshua: We have tried using SMB3. I would say lots of problems that we have with remote protocols are vendor specific. So each vendor has their own implementation of how that works, and under extreme load, we see interesting behavior and uncover interesting bugs. So we moved to SMB3, but that doesn't necessarily mean we won't hit something that's a vendor-specific implementation issue. So ultimately, we want to get away from that, and say, “Let's get something native that Microsoft is 100% behind, like using native NTFS.”
Speaker 4: Oh, I don't know if I missed it, but can you tell us how you do your clones?
Joshua: So the current cloning system we use today is via NetApp FlexClone technology. So that's how we do our cloning today. Our workspaces are about 350 gigabytes in size, and they get cloned instantaneously. Where we're headed, as I mentioned earlier, is probably cloning through VMware.
Speaker 4: After you clone are you guys IO bound or CPU bound?
Joshua: After we clone, predominantly during the building phase, we are CPU bound.
Speaker 4: And how long does it take for your average build to run?
Joshua: Full or incremental?
Speaker 4: Full.
Joshua: With or without our object cache?
Speaker 4: With the object cache.
Joshua: With the object cache, we can probably build everything in under 15 minutes on Linux.
Speaker 4: And the object cache is copied into the clone?
Speaker 4: And how long does that copy take?
Joshua: It's during that 15 minutes, is how long it takes. So most of those objects are just being copied in, and you end up just building whatever is not already in the cache. So you're getting an incremental compile, and you're copying in all of these surrounding objects.
Speaker 5: If you had infinite IO availability, how much faster would your build run?
Joshua: Infinite IO. I will have to think about that one.
Speaker 2: There's always the option to take that offline. Any more questions? Then thank you for the questions. Thank you Joshua for the great presentation.
Joshua: Thank you.