Scanning Code for Malware: A Conversation with Robert Wood

Hugh Taylor:                       Tell us about yourself. Software is a big part of cyber policy. We really wanted to get a software perspective.

Robert Wood:                    I’m the Chief Security Officer at a company called SourceClear and the founder of Hack Your Cyber Career. At SourceClear we are focused on software composition analysis, basically trying to figure out what makes up a piece of software, including all of the open source libraries and frameworks that are in use, and then help manage risk and governance around this domain. My background is actually in red teaming at various security consultancies. So, I started my career at a very boutique consultancy in upstate New York where I did everything from red teaming to digital forensics to network penetration testing. I then moved to a much bigger consultancy focused more on application security. I did some system-wide security tests, hardware assessments, threat modeling, static analysis, and spent a lot more time red teaming. I then moved into the product company space, building and leading security programs where I was working for a healthcare company, and then transitioned over to SourceClear.

How Code Scanning Works

Hugh Taylor:                      How does code scanning work?

Robert Wood:                    We have a platform and then an agent that you would hook into your build process. Let’s say you’re building a new web application or a new native application. You’d hook it into your build process, and it would build two main things: a call graph and a dependency graph. The dependency graph is going to tell you all of the libraries that developers have included in the development of that application; that’s the first-order set of dependencies. Then those dependencies, libraries in this case, usually rely on more libraries themselves, and that pattern continues for quite a while. What we find is that a developer who includes 15 libraries in a system may end up with over 300 at build time. That’s a lot of code, licenses, and attack surface that they had no idea existed. And so, basically, we strive to give you insight into everything going on within the application that you’re building, hence the term software composition analysis.
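The expansion Wood describes, 15 direct libraries ballooning into hundreds, comes from walking the dependency graph transitively. A minimal sketch of that walk, using an invented toy registry rather than any real package manager's data:

```python
# Hypothetical sketch: each direct library pulls in its own dependencies,
# and the transitive closure can dwarf the list the developer wrote down.

def transitive_dependencies(direct, registry):
    """Walk the dependency graph and return every library
    reachable from the direct dependencies."""
    seen = set()
    queue = list(direct)
    while queue:
        lib = queue.pop()
        if lib in seen:
            continue
        seen.add(lib)
        queue.extend(registry.get(lib, []))
    return seen

# Toy registry: library -> libraries it depends on (names invented)
registry = {
    "web-framework": ["http-parser", "templating"],
    "templating": ["sanitizer"],
    "http-parser": ["string-utils"],
    "orm": ["string-utils", "db-driver"],
}

direct = ["web-framework", "orm"]
all_deps = transitive_dependencies(direct, registry)
print(len(direct), "direct ->", len(all_deps), "total")  # 2 direct -> 7 total
```

Even this tiny example more than triples the dependency count; real ecosystems like NPM routinely do far worse.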

And then, the call graph is where we would map out all of the various control flows and code paths that your particular software generates. If there are any vulnerabilities associated with any of the dependencies you’re using, we can match the vulnerable API calls in your code against any known or proprietary vulnerability in our database, which increases the confidence in any particular finding. What that lets us avoid is just saying to a developer, “Hey, Mr. or Mrs. Developer, you are running this vulnerable version of this library, go fix it.” That often creates contention and friction between security and development teams, because developers respond with, “Well, we’re not using the actual code that’s vulnerable, therefore, we’re not vulnerable.” Both teams will typically bicker back and forth and try to say the burden of proof is on the other, and it ends up just creating needless friction.

What we can do in that case is, because of the call graph, we can say with a very high degree of confidence that you either are or are not using the actual code that is vulnerable in your particular system. And so, it adds a little bit of context and confidence to the whole process.
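The reachability check Wood describes can be sketched as a graph search from the application's entry points to known-vulnerable methods. This is an illustrative toy, not SourceClear's actual engine, and all function names below are invented:

```python
# Hedged sketch: given a call graph, check whether any application entry
# point can actually reach a known-vulnerable method in a dependency.
# Reachability is what turns "you include a vulnerable library" into
# "you actually call the vulnerable code."

def reaches_vulnerable(call_graph, entry_points, vulnerable_methods):
    """Depth-first search from the app's entry points; report which
    vulnerable methods are actually reachable."""
    seen, stack = set(), list(entry_points)
    while stack:
        fn = stack.pop()
        if fn in seen:
            continue
        seen.add(fn)
        stack.extend(call_graph.get(fn, []))
    return seen & set(vulnerable_methods)

call_graph = {
    "app.main": ["app.render", "lib.parse"],
    "app.render": ["lib.escape"],
    "lib.parse": ["lib.unsafe_eval"],   # vulnerable path, actually called
    "lib.other_cve": [],                # in the library, but never called
}

hits = reaches_vulnerable(call_graph, ["app.main"],
                          ["lib.unsafe_eval", "lib.other_cve"])
print(hits)  # {'lib.unsafe_eval'}
```

Note that `lib.other_cve` ships in the dependency but is never reachable, which is exactly the case developers invoke when they say "we're not using the vulnerable code."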

Embedded Software Threats

Hugh Taylor:                       Do you feel that there are security threats embedded in software that are not well detected?

Robert Wood:                    Oh, 100%. If you think about the pace at which software moves, it’s really hard for any person or any computer to keep up. And usually technology will come out first, and then security has to catch up after the fact, adapting to that new technology or that new design pattern, or whatever.

Security is almost always at least one step behind on that front. And with technology iterating faster than we’ve ever seen before, it’s really hard for the industry to keep pace. So, there’s that part of it. And then, secondarily, in the open source world, and this is where things are really, really interesting, vulnerable code can be copied, forked, or outright included as a library itself. Malicious code can also find its way into frameworks in very subtle ways; consider the case of a malicious ransomware library hidden within a seemingly benign Java library. If that library is included by popular frameworks, it could spread very quickly.

Basically, the fascinating thing about open source is there’s all these ways for vulnerable code, and legitimate code, to spread and be shared; this is its strength, but also its weakness. When it replicates the good stuff, it does a lot of good. When bad stuff gets out there, it can permeate and spread really quickly.

Hugh Taylor:                       A recent story reported that code from a Kremlin-connected software company was found in a biometric analysis tool used by the Transportation Security Administration. The company that made the software is French. The Russian code had not been disclosed by the maker. It’s not really clear whether it was some kind of back door, but it certainly presented a risk that Russian intelligence could access a lot of fingerprint and biometric data from the TSA.

Robert Wood:                    Yeah, and raised some eyebrows.

Hugh Taylor:                       I’m curious. Let’s say you were the TSA. Could you have taken a step to try and mitigate that risk in advance, or would you have sent it to SourceClear to be scanned? What are processes or policies you can use to mitigate that sort of risk?

Robert Wood:                    In this case, you would call it your supply chain. And so, there’s a few angles that I think you would want to take. I’m not quite sure if the code in question here was a library, but a company like ours could help if that code was packaged up as an open source library. Let’s say it was a fingerprint reading library. (Just to back up and make sure it’s all out there: a library is typically a set of code that performs a very particular function and is packaged up and distributed through what’s called a package manager, like NPM or Maven. Basically, it makes it very easy for a developer to download, install, and start using that library to get that very specific functionality, whether it’s distributed computation, machine learning, encryption, or session management, whatever it happens to be.)

So, if it was a library, I think any system owner should be investing in trying to figure out what is actually in their libraries. From a general perspective, I think that’s a good idea, and library inclusion is one big aspect of your software supply chain. From a vendor supply chain perspective, a big part of this boils down to, like the TSA in this case, going to the organizations who are writing software for them, whether they’re contractors or they’re just purchasing it, and basically going through a set of security checks on that particular vendor, then asking them to demonstrate that they are going through the same practices and to highlight what their own supply chain looks like. That way, they can note any particular anomalies or potential risks. This won’t get you the full picture, but you as a security leader can see several steps farther into the supply chain than you otherwise would by just reviewing your primary vendor. This process should also be tailored based on the risk of the system you’re building or maintaining.

In this case, the TSA could go to that French company and say, “I would like you to demonstrate: where are you getting all of your code from? What kind of vendors are you working with?” You could ask for some technical reports to be generated to demonstrate this, and if they are telling the truth and doing things properly, they would then disclose that they are affiliated in some way, shape, or form with, or are consuming code from, this Russian-based firm, potentially. And if the TSA has a problem with that, given their own risk profile, they can choose whether to take action on it.

A lot of supply chain issues, I believe, boil down to basically having visibility into what is actually in your supply chain, and then figuring out what your own risk threshold is going to be. This needs to be relative to what’s in the various things that are going into your supply chain and your particular business context.

Malware in Mobile Devices: A Key Cyber Policy Issue

Hugh Taylor:                       Yes. Are you familiar with malware that’s embedded on firmware?

Robert Wood:                    I have not worked directly with firmware-based malware, but I’m certainly familiar with the concept.

Hugh Taylor:                       Okay. I’ll give an example, I’m not sure if this was malware or software, but are you familiar with the issue of Chinese-made smartphones sending data back to Chinese servers?

Robert Wood:                    I have not read about that in particular, but I guess it wouldn’t surprise me.

Hugh Taylor:                       A lot of Chinese-made cell phones are sending data back to China, and most people think it’s for marketing purposes, like they’re collecting usage data. But there’s concern that a lot of these big manufacturers are connected to the Chinese intelligence services, and they’re just getting data on Americans without permission.

I’m trying to connect this to what you’re talking about with software scanning. They’re running Android. Can you scan an Android instance before it gets put onto a device?

Robert Wood:                    Yes. This would be on the manufacturer of the device, or whoever; let’s say it’s Samsung, for example. The way they would distribute a device is they have their own download repository or source that connects to their devices. They would package up a version of Android with all the stuff that they want on it. Like when you buy a new phone from Verizon, it has those bundled Verizon apps on it; that’s exactly the process they’re going through. You would have those Samsung-specific video apps or whatever they choose to distribute.

The manufacturer would bundle everything up together, and then they would install it on the device and ship it. And they can really put whatever they want on it. The Android part of this is just the underlying operating system that’s stitching everything together and handling the file management and the encryption and the way the software interacts with the hardware, etc.

So, any manufacturer could absolutely go through that same process and validate everything that’s going into the build, because Android is, indeed, open source. You can audit the hell out of it; it’s all there for review.

Hugh Taylor:                       But when it’s on the device, has it been compiled already?

Robert Wood:                    Yes.

Hugh Taylor:                       Like if someone hands you a phone, are you able to scan it? To what extent can you scan a code that’s already been installed?

Robert Wood:                    So, the consumer takes the device, they buy it, and now they want to do something to assess their own risk profile. Is that what you’re getting at?

Hugh Taylor:                       Let’s give a hypothetical scenario. Let’s say that the U.S. Navy PX is selling thousands of Chinese-made cell phones to sailors, and they want to know, are these devices telling the Chinese Government the names of our sailors, where they’re deployed?

Robert Wood:                    Yeah, so, that is definitely possible, but assessing compiled code is more difficult. For instance, when Android code is packaged up, one of the things that both legitimate developers and illegitimate developers, like malware authors, can do is put it through a process called obfuscation. That is basically where you take a piece of the code, whether it’s a proprietary algorithm or potentially something malicious that you want to hide, and make it harder to analyze. You can encrypt strings, you can scramble things, you can mislabel class names. There’s a whole bunch of obfuscation techniques, and they make the analysis of that code much harder.

And so, following this hypothetical scenario, if the Navy wanted to do what you described, they would have a lot more work on their hands if they were to analyze every little thing on those devices before they went out. What they would probably do in practice is outsource that to a consultancy or something, or they could do it themselves, I guess. But they would take a risk-based approach, since there’s no such thing as unlimited time in security.

The first thing they would likely do is a combination of static analysis, which is analyzing the packages and everything else on the device itself, and dynamic analysis. In this case, if they were worried about information being leaked back, they might take the devices, set them up with a few mock profiles, like fake user profiles, and start using them. Then they would monitor what the devices are doing and how they’re sending data back to central services. They would basically do a trap and trace of everything those devices are doing.
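The dynamic-analysis side of that trap and trace boils down to capturing outbound connections while exercising the device, then flagging anything headed somewhere unexpected. A simplified sketch of that triage step, with every hostname invented for the example:

```python
# Hedged sketch of trap-and-trace triage: given a log of outbound
# connections captured while the device was used with mock profiles,
# flag traffic to destinations outside an approved baseline.

# Hypothetical baseline of destinations the analysts consider expected.
APPROVED = {"updates.example-vendor.com", "time.android.com"}

def flag_suspicious(connections):
    """Return connections whose destination is not on the approved list."""
    return [c for c in connections if c["dest"] not in APPROVED]

# Invented capture data standing in for real network monitoring output.
captured = [
    {"dest": "updates.example-vendor.com", "bytes": 1200},
    {"dest": "telemetry.unknown-host.cn", "bytes": 48000},
    {"dest": "time.android.com", "bytes": 90},
]

for conn in flag_suspicious(captured):
    print("investigate:", conn["dest"], conn["bytes"], "bytes")
```

In practice the approved baseline itself takes work to build, and as Wood notes next, flagged traffic that turns out to be encrypted opens a whole further investigation.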

In theory, you’re doing very scenario-driven analysis. They would use various apps on the phone, go to various places, and so on, trying to see what the phone was actually doing and what it was sending out. If it happened to be sending data to telecommunications firms, or maybe something back to China, they could capture that particular network traffic. Maybe it’s encrypted, maybe it’s not. If it’s encrypted, that would lead them down a rabbit hole they could investigate further. If it wasn’t, or once they got to the bottom of it, they could see what’s actually being sent to whom, and you end up playing detective at that point.