GitHub’s Product Security Engineering team writes code and implements tools that help secure the code that powers GitHub. We use GitHub Advanced Security (GHAS) to discover, track, and remediate vulnerabilities and enforce secure coding standards at scale. One tool we rely heavily on to analyze our code at scale is CodeQL.
CodeQL is GitHub’s static analysis engine that powers automated security analyses. You can use it to query code in much the same way you would query a database. It provides a much more robust way to analyze code and uncover problems than an old-fashioned text search through a codebase.
The following post will detail how we use CodeQL to keep GitHub secure and how you can apply these lessons to your own organization. You will learn why and how we use:
- Custom query packs (and how we create and manage them).
- Custom queries.
- Variant analysis to uncover potentially insecure programming practices.
Enabling CodeQL at scale
We employ CodeQL in a variety of ways at GitHub.
- Default setup with the default and security-extended query suites
Default setup with the default and security-extended query suites meets the needs of the vast majority of our more than 10,000 repositories. With these settings, pull requests automatically get a security review from CodeQL.
- Advanced setup with a custom query pack
A few repositories, like our large Ruby monolith, need extra attention, so we use advanced setup with a query pack of custom queries tailored to our needs.
- Multi-repository variant analysis (MRVA)
To conduct variant analysis and quick auditing, we use MRVA. We also write custom CodeQL queries to detect code patterns that are either specific to GitHub’s codebases or patterns we want a security engineer to manually review.
The specific custom Actions workflow step we use on our monolith is pretty simple. It looks like this:
- name: Initialize CodeQL
  uses: github/codeql-action/init@v3
  with:
    languages: ${{ matrix.language }}
    config-file: ./.github/codeql/${{ matrix.language }}/codeql-config.yml
Our Ruby configuration is pretty standard, but advanced setup offers a variety of configuration options using custom configuration files. The interesting part is the `packs` option, which is how we enable our custom query pack as part of the CodeQL analysis. This pack contains a collection of CodeQL queries we have written for Ruby, specifically for the GitHub codebase.
So, let’s dive deeper into why we did that—and how!
Publishing our CodeQL query pack
Initially, we published CodeQL query files directly to the GitHub monolith repository, but we moved away from this approach for several reasons:
- It required going through the production deployment process for each new or updated query.
- Queries not included in a query pack were not pre-compiled, which slowed down CodeQL analysis in CI.
- Our test suite for CodeQL queries ran as part of the monolith’s CI jobs. When a new version of the CodeQL CLI was released, it sometimes caused the query tests to fail because of changes in the query output, even when there were no changes to the code in the pull request. This often led to confusion and frustration among engineers, as the failure wasn’t related to their pull request changes.
By switching to publishing a query pack to GitHub Container Registry (GCR), we’ve simplified our process and eliminated many of these pain points, making it easier to ship and maintain our CodeQL queries. So while it’s possible to deploy custom CodeQL query files directly to a repository, we recommend publishing CodeQL queries as a query pack to the GCR for easier deployment and faster iteration.
Creating our query pack
When setting up our custom query pack, we faced several considerations, particularly around managing dependencies like the `ruby-all` package.
To keep our custom queries concise and maintainable, we extend classes from the standard CodeQL libraries, such as the `ruby-all` library. This allows us to leverage existing functionality rather than reinventing the wheel. However, changes to the CodeQL library API can introduce breaking changes, potentially deprecating our queries or causing errors. Since CodeQL runs as part of our CI, we wanted to minimize the chance of this happening, as this can lead to frustration and loss of trust from developers.
We develop our queries against the latest version of the `ruby-all` package, ensuring we’re always working with the most up-to-date functionality. To mitigate the risk of breaking changes affecting CI, we pin the `ruby-all` version when we’re ready to release, locking it in the `codeql-pack.lock.yml` file. This guarantees that when our queries are deployed, they will run with the specific version of `ruby-all` we’ve tested, avoiding potential issues from unintentional updates.
Here’s how we manage this setup:
- In our `qlpack.yml`, we set the dependency to use the latest version of `ruby-all`.
- During development, this configuration pulls in the latest version of `ruby-all` when running `codeql pack install`, ensuring we’re always up to date.
# Our custom query pack's qlpack.yml
library: false
name: github/internal-ruby-codeql
version: 0.2.3
extractor: 'ruby'
dependencies:
  codeql/ruby-all: "*"
tests: 'test'
description: "Ruby CodeQL queries used internally at GitHub"
- Before releasing, we lock the version in the `codeql-pack.lock.yml` file, specifying the exact version to ensure stability and prevent issues in CI.
# Our custom query pack's codeql-pack.lock.yml
lockVersion: 1.0.0
dependencies:
  ...
  codeql/ruby-all:
    version: 1.0.6
This approach allows us to balance developing against the latest features of the `ruby-all` package while ensuring stability when we release.
We also have a set of CodeQL unit tests that exercise our queries against sample code snippets, which helps us quickly determine if any query will cause errors before we publish our pack. These tests are run as part of the CI process in our query pack repository, providing an early check for issues. We strongly recommend writing unit tests for your custom CodeQL queries to ensure stability and reliability.
Altogether, the basic flow for releasing new CodeQL queries via our pack is as follows:
- Open a pull request with the new query.
- Write unit tests for the new query.
- Merge the pull request.
- Increment the pack version in a new pull request.
- Run `codeql pack install` to resolve dependencies.
- Correct unit tests as needed.
- Publish the query pack to the GitHub Container Registry (GCR).
- Repositories with the query pack in their config will start using the updated queries.
We have found this flow balances our team’s development experience while ensuring stability in our published query pack.
Configuring our repository to use our custom query pack
We won’t provide a general recommendation on configuration here, given that it ultimately depends on how your organization deploys code. We opted against locking our pack to a particular version in our CodeQL configuration file (see above). Instead, we chose to manage our versioning by publishing the CodeQL package in GCR. This results in the GitHub monolith retrieving the latest published version of the query pack. To roll back changes, we simply have to republish the package. In one instance, we released a query that had a high number of false positives and we were able to publish a new version of the pack that removed that query in less than 15 minutes. This is faster than the time it would have taken us to merge a pull request on the monolith repository to roll back the version in the CodeQL configuration file.
One of the problems we encountered with publishing the query pack in GCR was how to easily make the package available to multiple repositories within our enterprise. There are several approaches we explored.
- Grant access permissions for individual repositories. On the package management page, you can grant permissions for individual repositories to access your package. This was not a good solution for us: we have too many repositories to do this manually, and there is currently no way to configure it programmatically through an API.
- Mint a personal access token for the CodeQL action runner. We could have minted a personal access token (PAT) that has access to read all packages for our organization and added that to the CodeQL action runner. However, this would have required managing a new token, and it seemed a bit more permissive than we wanted because it could read all of our private packages rather than ones we explicitly allow it to have access to.
- Provide access permissions via a linked repository. We ended up implementing the third solution that we explored. We link a repository to the package and allow the package to inherit access permissions from the linked repository.
CodeQL query pack queries
We write a variety of custom queries to be used in our custom query packs. These cover GitHub-specific patterns that aren’t included in the default CodeQL query pack. This allows us to tailor the analysis to patterns and preferences that are specific to our company and codebase. Some of the types of things we alert on using our custom query pack include:
- High-risk APIs specific to GitHub’s code that can be dangerous if they receive unsanitized user input.
- Use of specific built-in Rails methods for which we have safer, custom methods or functions.
- Required authorization methods not being used in our REST API endpoint definitions and GraphQL object/mutation definitions.
- REST API endpoints and GraphQL mutations that require engineers to define access control methods to determine which actors can access them. (Specifically, the query detects the absence of this method definition to ensure that the actors’ permissions are being checked for these endpoints.)
- Use of signed tokens so we can nudge engineers to include Product Security as a reviewer when using them.
Custom queries can also serve an educational purpose rather than block shipping code. For example, we want to alert engineers when they use the ActiveRecord::decrypt method. This method should generally not be used in production code, as it will cause an encrypted column to become decrypted. We use the recommendation severity in the query metadata so these alerts are treated as informational. That means this may trigger an alert in a pull request, but it won’t cause the CodeQL CI job to fail. We use this lower severity level to allow engineers to assess the impact of new queries without immediate blocking. Additionally, this alert level isn’t tracked through our Fundamentals program, meaning it doesn’t require immediate action, reflecting the query’s maturity as we continue to refine its relevance and risk assessment.
/**
 * @id rb/github/use-of-activerecord-decrypt
 * @description Do not use the .decrypt method on AR models, this will decrypt all encrypted attributes and save
 *              them unencrypted, effectively undoing encryption and possibly making the attributes inaccessible.
 *              If you need to access the unencrypted value of any attribute, you can do so by calling my_model.attribute_name.
 * @kind problem
 * @problem.severity recommendation
 * @name Use of ActiveRecord decrypt method
 * @tags security
 *       github-internal
 */

import ruby
import DataFlow
import codeql.ruby.DataFlow
import codeql.ruby.frameworks.ActiveRecord

/** Match against .decrypt method calls where the receiver may be an ActiveRecord object */
class ActiveRecordDecryptMethodCall extends ActiveRecordInstanceMethodCall {
  ActiveRecordDecryptMethodCall() { this.getMethodName() = "decrypt" }
}

from ActiveRecordDecryptMethodCall call
select call,
  "Do not use the .decrypt method on AR models, this will decrypt all encrypted attributes and save them unencrypted."
Another educational query is the one mentioned above in which we detect the absence of the `control_access` method in a class that defines a REST API endpoint. If a pull request introduces a new endpoint without `control_access`, a comment will appear on the pull request saying that the `control_access` method wasn’t found and it’s a requirement for REST API endpoints. This will notify the reviewer of a potential issue and prompt the developer to fix it.
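To make the target pattern concrete, here is a minimal Ruby sketch of the two endpoint shapes involved: one route block that calls `control_access` and one that does not. Everything in it is an assumption made for illustration. The tiny routing DSL, the `RequestContext` class, the `ReposEndpoints` class, the route paths, and the `:repo_reader` argument are hypothetical stand-ins, not GitHub's actual API framework.

```ruby
# Hypothetical miniature of a Sinatra-style routing DSL, for illustration only.
module Api
  class App
    # Each subclass keeps its own table of path => handler block.
    def self.routes
      @routes ||= {}
    end

    # Registers a handler, mirroring a module-level `get` call with a block.
    def self.get(path, &handler)
      routes[path] = handler
    end

    # Runs a handler in a fresh context and reports whether
    # control_access was called while it ran.
    def self.access_checked?(path)
      context = RequestContext.new
      context.instance_exec(&routes.fetch(path))
      context.access_checked
    end
  end

  class RequestContext
    attr_reader :access_checked

    def initialize
      @access_checked = false
    end

    # Stand-in for the real authorization hook; the argument is invented.
    def control_access(_policy)
      @access_checked = true
    end
  end
end

class ReposEndpoints < Api::App
  # Shape that should not be flagged: the block calls control_access.
  get "/repos/checked" do
    control_access(:repo_reader)
    "ok"
  end

  # Shape that should be flagged: no control_access call in the block.
  get "/repos/unchecked" do
    "ok"
  end
end

puts ReposEndpoints.access_checked?("/repos/checked")
puts ReposEndpoints.access_checked?("/repos/unchecked")
```

The sketch detects the missing call at runtime only because that is easy to demonstrate in plain Ruby. The query that follows makes the same distinction statically, by looking for a `control_access` call anywhere inside the route's block.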
/**
 * @id rb/github/api-control-access
 * @name Rest API Without 'control_access'
 * @description All REST API endpoints must call the 'control_access' method, to ensure that only specified actor types are able to access the given endpoint.
 * @kind problem
 * @tags security
 *       github-internal
 * @precision high
 * @problem.severity recommendation
 */

import codeql.ruby.AST
import codeql.ruby.DataFlow
import codeql.ruby.TaintTracking
import codeql.ruby.ApiGraphs

// Api::App REST API endpoints should generally call the control_access method
private DataFlow::ModuleNode appModule() {
  result = API::getTopLevelMember("Api").getMember("App").getADescendentModule() and
  not result = protectedApiModule() and
  not result = staffAppApiModule()
}

// Api::Admin, Api::Staff, Api::Internal, and Api::ThirdParty REST API endpoints do not need to call the control_access method
private DataFlow::ModuleNode protectedApiModule() {
  result =
    API::getTopLevelMember(["Api"])
        .getMember(["Admin", "Staff", "Internal", "ThirdParty"])
        .getADescendentModule()
}

// Api::Staff::App REST API endpoints do not need to call the control_access method
private DataFlow::ModuleNode staffAppApiModule() {
  result =
    API::getTopLevelMember(["Api"]).getMember("Staff").getMember("App").getADescendentModule()
}

private class ApiRouteWithoutControlAccess extends DataFlow::CallNode {
  ApiRouteWithoutControlAccess() {
    this = appModule().getAModuleLevelCall(["get", "post", "delete", "patch", "put"]) and
    not performsAccessControl(this.getBlock())
  }
}

predicate performsAccessControl(DataFlow::BlockNode blocknode) {
  accessControlCalled(blocknode.asExpr().getExpr())
}

predicate accessControlCalled(Block block) {
  // the method `control_access` is called somewhere inside `block`
  block.getAStmt().getAChild*().(MethodCall).getMethodName() = "control_access"
}

from ApiRouteWithoutControlAccess api
select api.getLocation(),
  "The control_access method was not detected in this REST API endpoint. All REST API endpoints must call this method to ensure that the endpoint is only accessible to the specified actor types."
Variant analysis
Variant analysis (VA) refers to the process of searching for variants of security vulnerabilities. This is particularly useful when we’re responding to a bug bounty submission or a security incident. We use a combination of tools to do this, including GitHub’s code search functionality, custom scripts, and CodeQL. We will often start by using code search to find patterns similar to the one that caused a particular vulnerability across numerous repositories. This is sometimes not good enough, as code search is not semantically aware, meaning that it cannot determine whether a given variable is an Active Record object or whether it is being used in an `if` expression. To answer those types of questions we turn to CodeQL.
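The gap between text search and structure-aware search can be shown with a small Ruby sketch. It uses Ripper, the parser that ships with Ruby's standard library, as a rough stand-in for semantic analysis. The sample source and the helper names `text_match_lines` and `method_call_lines` are invented for illustration. Ripper only sees syntax, so this is far weaker than CodeQL: it cannot tell whether a receiver is an Active Record object, and it does no data flow analysis.

```ruby
require "ripper"

# Invented sample: "decrypt" appears in a comment, a string, and a real call.
SAMPLE = <<~RUBY
  # never call decrypt on a model
  message = "decrypt later"
  record.decrypt
RUBY

# Plain text search: returns every line number containing the needle.
def text_match_lines(source, needle)
  source.lines.each_with_index
        .select { |line, _| line.include?(needle) }
        .map { |_, index| index + 1 }
end

# Syntax-aware search: walks Ripper's S-expression and returns the line
# numbers of `receiver.method_name` calls only.
def method_call_lines(source, method_name)
  lines = []
  walk = lambda do |node|
    next unless node.is_a?(Array)

    # A call node has the shape [:call, receiver, operator, [:@ident, name, [line, col]]].
    name_node = node[3]
    if node[0] == :call && name_node.is_a?(Array) &&
       name_node[0] == :@ident && name_node[1] == method_name
      lines << name_node[2][0]
    end
    node.each { |child| walk.call(child) }
  end
  walk.call(Ripper.sexp(source))
  lines
end

p text_match_lines(SAMPLE, "decrypt")   # => [1, 2, 3]
p method_call_lines(SAMPLE, "decrypt")  # => [3]
```

The text search reports the comment and the string literal as hits, while the parse-based search reports only the actual method call. Questions that depend on types or on how a value flows through the program still need a tool like CodeQL.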
When we write CodeQL queries for variant analysis we are much less concerned about false positives, since the goal is to provide results for security engineers to analyze. The quality of the code is also not quite as important, as these queries will only be used for the duration of the VA effort. Some of the types of things we use CodeQL for during VAs are:
- Where are we using SHA1 hashes?
- One of our internal API endpoints was vulnerable to SQLi according to a recent bug bounty report. Where are we passing user input to that API endpoint?
- There is a problem with how some HTTP request libraries in Ruby handle the proxy setting. Can we look at places we are instantiating our HTTP request libraries with a proxy setting?
One recent example involved a subtle vulnerability in Rails. We wanted to detect when the following condition was present in our code:
- A parameter was used to look up an Active Record object.
- That parameter is later reused after the Active Record object is looked up.
The concern with this condition is that it could lead to an insecure direct object reference (IDOR) vulnerability because Active Record finder methods can accept an array. If the code looks up an Active Record object in one call to determine if a given entity has access to a resource, but later uses a different element from that array to find an object reference, that can lead to an IDOR vulnerability. It would be difficult to write a query to detect all vulnerable instances of this pattern, but we were able to write a query that found potential vulnerabilities that gave us a list of code paths to manually analyze. We ran the query against a large number of our Ruby codebases using CodeQL’s MRVA.
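The following Ruby sketch illustrates the shape of the problem without requiring Rails. It is a simulation under stated assumptions: `Document`, `DocumentStore`, `documents_for`, and `documents_for_safely` are hypothetical names, and the in-memory `find_by` and `where` methods only imitate how a finder treats an array value as a set of candidate IDs.

```ruby
# Hypothetical in-memory stand-in for an Active Record model.
Document = Struct.new(:id, :owner)

class DocumentStore
  ROWS = [
    Document.new(1, "alice"),
    Document.new(2, "bob")
  ].freeze

  # Imitates a finder that accepts a scalar or an array and returns
  # a single matching row (here, the first one).
  def self.find_by(id:)
    ids = Array(id)
    ROWS.find { |row| ids.include?(row.id) }
  end

  # Imitates a query that returns every row matching the id or ids.
  def self.where(id:)
    ids = Array(id)
    ROWS.select { |row| ids.include?(row.id) }
  end
end

# Risky shape: the parameter is checked through one lookup, then reused.
def documents_for(params, current_user)
  checked = DocumentStore.find_by(id: params[:id])
  return [] unless checked && checked.owner == current_user

  # params[:id] is reused after the lookup; an array widens the result.
  DocumentStore.where(id: params[:id])
end

# Safer shape: only the object that passed the check is used afterwards.
def documents_for_safely(params, current_user)
  checked = DocumentStore.find_by(id: params[:id])
  return [] unless checked && checked.owner == current_user

  [checked]
end

p documents_for({ id: [1, 2] }, "alice").map(&:owner)        # => ["alice", "bob"]
p documents_for_safely({ id: [1, 2] }, "alice").map(&:owner) # => ["alice"]
```

In the first method, the ownership check passes on the first element of the array, but the later lookup returns a row owned by someone else. This reuse of a parameter after a lookup is the pattern the variant analysis query tries to surface for manual review.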
The query, which is a bit hacky and not quite production grade, is below:
/**
 * @name wip array query
 * @description an array is passed to an AR finder object
 */

import ruby
import codeql.ruby.AST
import codeql.ruby.ApiGraphs
import codeql.ruby.frameworks.Rails
import codeql.ruby.frameworks.ActiveRecord
import codeql.ruby.frameworks.ActionController
import codeql.ruby.DataFlow
import codeql.ruby.Frameworks
import codeql.ruby.TaintTracking

// Gets the "final" receiver in a chain of method calls.
// For example, in `Foo.bar`, this would give the `Foo` access, and in
// `foo.bar.baz("arg")` it would give the `foo` variable access
private Expr getUltimateReceiver(MethodCall call) {
  exists(Expr recv |
    recv = call.getReceiver() and
    (
      result = getUltimateReceiver(recv)
      or
      not recv instanceof MethodCall and result = recv
    )
  )
}

// Names of class methods on ActiveRecord models that may return one or more
// instances of that model. This also includes the `initialize` method.
// See https://api.rubyonrails.org/classes/ActiveRecord/FinderMethods.html
private string staticFinderMethodName() {
  exists(string baseName |
    baseName = ["find_by", "find_or_create_by", "find_or_initialize_by", "where"] and
    result = baseName + ["", "!"]
  )
  // or
  // result = ["new", "create"]
}

private class ActiveRecordModelFinderCall extends ActiveRecordModelInstantiation, DataFlow::CallNode
{
  private ActiveRecordModelClass cls;

  ActiveRecordModelFinderCall() {
    exists(MethodCall call, Expr recv |
      call = this.asExpr().getExpr() and
      recv = getUltimateReceiver(call) and
      (
        // The receiver refers to an `ActiveRecordModelClass` by name
        recv.(ConstantReadAccess).getAQualifiedName() = cls.getAQualifiedName()
        or
        // The receiver is self, and the call is within a singleton method of
        // the `ActiveRecordModelClass`
        recv instanceof SelfVariableAccess and
        exists(SingletonMethod callScope |
          callScope = call.getCfgScope() and
          callScope = cls.getAMethod()
        )
      ) and
      (
        call.getMethodName() = staticFinderMethodName()
        or
        // dynamically generated finder methods
        call.getMethodName().indexOf("find_by_") = 0
      )
    )
  }

  final override ActiveRecordModelClass getClass() { result = cls }
}

class FinderCallArgument extends DataFlow::Node {
  private ActiveRecordModelFinderCall finderCallNode;

  FinderCallArgument() { this = finderCallNode.getArgument(_) }
}

class ParamsHashReference extends DataFlow::CallNode {
  private Rails::ParamsCall params;

  // TODO: only direct element references against `params` calls are considered
  ParamsHashReference() { this.getReceiver().asExpr().getExpr() = params }

  string getArgString() {
    result = this.getArgument(0).asExpr().getConstantValue().getStringlikeValue()
  }
}

class ArrayPassedToActiveRecordFinder extends TaintTracking::Configuration {
  ArrayPassedToActiveRecordFinder() { this = "ArrayPassedToActiveRecordFinder" }

  override predicate isSource(DataFlow::Node source) { source instanceof ParamsHashReference }

  override predicate isSink(DataFlow::Node sink) {
    sink instanceof FinderCallArgument
  }

  string getParamsArg(DataFlow::CallNode paramsCall) {
    result = paramsCall.getArgument(0).asExpr().getConstantValue().getStringlikeValue()
  }

  // this doesn't check for anything fancy like whether it's reuse in a if/else
  // only intended for quick manual audit filtering of interesting candidates
  // so remains fairly broad to not induce false negatives
  predicate paramsUsedAfterLookups(DataFlow::Node source) {
    exists(DataFlow::CallNode y | y instanceof ParamsHashReference
      and source.getEnclosingMethod() = y.getEnclosingMethod()
      and source != y
      and getParamsArg(source) = getParamsArg(y)
      // we only care if it's used again AFTER an object lookup
      and y.getLocation().getStartLine() > source.getLocation().getStartLine())
  }
}

from ArrayPassedToActiveRecordFinder config, DataFlow::Node source, DataFlow::Node sink
where config.hasFlow(source, sink) and config.paramsUsedAfterLookups(source)
select source, sink.getLocation()
Conclusion
CodeQL can be very useful for product security engineering teams to detect and prevent vulnerabilities at scale. We use a combination of queries that run in CI using our query pack and one-off queries run through MRVA to find potential vulnerabilities and communicate them to engineers. CodeQL isn’t only useful for finding security vulnerabilities, though; it is also useful for detecting the presence or absence of security controls that are defined in code. This saves our security team time by surfacing certain security problems automatically, and saves our engineers time by detecting them earlier in the development process.
Writing custom CodeQL queries
Tips for getting started
We have a large number of articles and resources for writing custom CodeQL queries. If you haven’t written custom CodeQL queries before, here are some resources to help get you started:
Improve the security of your applications today by enabling CodeQL for free on your public repositories, or try GitHub Advanced Security for your organization.
Michael Recachinas, GitHub Staff Security Engineer, also contributed to this blog post.