Today we’ll take a close look at a common feature in web applications: downloading content from user-submitted URLs. While this seems like a simple task on its surface, there are some nasty pitfalls that must be handled in order to secure the implementation and ensure its scalability.

For the purposes of this post, we’ll pretend that we’re writing a REST endpoint that allows users to set their avatar by entering the avatar’s web address into a form. I’ll be providing Java code samples; however, the core concepts can (and should) be applied in any language.

Part 1: Validating the URL

One of the central rules of writing secure software is to consider all user input untrusted. This is particularly important for our application, as we will be using a user-inputted URL to perform expensive network calls to remote servers. We’ll start securing our system by validating that the URL is in fact a URL. RFC 1738 (and its successor, RFC 3986) defines the structure of URLs and guides the implementation of most language-specific URL libraries. We will use these libraries to immediately reject any malformed URLs. Sample Java code follows:

public class URLResolver {
    public byte[] resolveUrl(String url) throws URLResolverException {
        // stub for now
        return null;
    }

    private static URL parseUrl(String url) throws URLResolverException {
        try {
            return new URL(url);
        } catch (MalformedURLException e) {
            throw new URLResolverException(e);
        }
    }

    public static class URLResolverException extends Exception {
        public URLResolverException(String message) {
            super(message);
        }

        public URLResolverException(Throwable cause) {
            super(cause);
        }
    }
}

Java implementation note: Be careful with the URL class. Its equals() and hashCode() methods can make blocking network calls (they resolve host names), so avoid using instances as set members or map keys.
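If you do need to compare or deduplicate addresses, one common workaround (a sketch, not part of our resolver) is to convert to java.net.URI, whose equals() and hashCode() are purely textual:

import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Set;

public class UriComparisonExample {
    public static void main(String[] args) throws URISyntaxException {
        // URI.equals() compares components as strings - no DNS lookups -
        // so URIs are safe to use as set members or map keys.
        Set<URI> seen = new HashSet<>();
        seen.add(new URI("http://example.com/avatar.png"));
        System.out.println(seen.contains(new URI("http://example.com/avatar.png"))); // prints true
    }
}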

We now know that we’re dealing with a well-formed URL. However, we can’t stop there. As per the URL spec, a URL’s scheme part can be much more than just http or https. All of the following are valid URL schemes:

  - ftp
  - file
  - mailto
  - data
  - jar

There are many more valid schemes than just those listed above. Thus, it’s important that we whitelist the URL schemes that our application will support. For our purposes, just http and https will do. Our implementation now becomes:

public class URLResolver {
    private static final Set<String> allowedSchemes = new HashSet<>();

    static {
        allowedSchemes.add("http");
        allowedSchemes.add("https");
    }

    public byte[] resolveUrl(String url) throws URLResolverException {
        // stub for now
        return null;
    }

    private URL parseUrl(String urlStr) throws URLResolverException {
        URL url;

        try {
            url = new URL(urlStr);
        } catch (MalformedURLException e) {
            throw new URLResolverException(e);
        }

        if (!allowedSchemes.contains(url.getProtocol())) {
            throw new URLResolverException("Invalid scheme.");
        }

        return url;
    }

    public static class URLResolverException extends Exception {
        URLResolverException(Throwable e) {
            super(e);
        }

        URLResolverException(String message) {
            super(message);
        }
    }
}

Now for some fun stuff. Our URL is both well-formed and describes the location of an HTTP resource, but we have no idea where that resource actually lives on the web. A malicious user could provide an internal IP address, or a domain name whose A record points to one - a gaping security hole, known as Server-Side Request Forgery (SSRF), that punctures the defenses of your firewall. So, let’s fix this. Java, luckily, makes it easy via the InetAddress.isSiteLocalAddress() and InetAddress.isLoopbackAddress() methods:

InetAddress address = InetAddress.getByName("192.168.0.1");
assert address.isSiteLocalAddress();
InetAddress address2 = InetAddress.getByName("localhost");
assert address2.isLoopbackAddress();

isSiteLocalAddress()’s implementation can easily be translated to other languages. All it does under the hood is return true if the remote IP address falls in one of the RFC 1918 private ranges: 10/8, 172.16/12, or 192.168/16.
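For illustration, here’s roughly what that check looks like written by hand (a sketch of the equivalent logic, not the JDK’s actual source):

// Returns true if an IPv4 address falls in one of the RFC 1918
// private ranges: 10/8, 172.16/12, or 192.168/16.
private static boolean isRfc1918(byte[] addr) {
    if (addr.length != 4) {
        return false; // IPv6 needs separate handling
    }

    int first = addr[0] & 0xFF;
    int second = addr[1] & 0xFF;

    return first == 10
        || (first == 172 && second >= 16 && second <= 31)
        || (first == 192 && second == 168);
}

You’d feed it the raw bytes from InetAddress.getAddress().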

Putting it all together:

public class URLResolver {
    private static final Set<String> allowedSchemes = new HashSet<>();

    static {
        allowedSchemes.add("http");
        allowedSchemes.add("https");
    }

    public byte[] resolveUrl(String url) throws URLResolverException {
        // stub for now
        return null;
    }

    private URL parseUrl(String urlStr) throws URLResolverException {
        URL url;

        try {
            url = new URL(urlStr);
        } catch (MalformedURLException e) {
            throw new URLResolverException(e);
        }

        if (!allowedSchemes.contains(url.getProtocol())) {
            throw new URLResolverException("Invalid scheme.");
        }

        try {
            InetAddress address = InetAddress.getByName(url.getHost());

            if (address.isSiteLocalAddress() || address.isLoopbackAddress()) {
                throw new URLResolverException("Refusing to download site-local or loopback address.");
            }
        } catch (UnknownHostException e) {
            throw new URLResolverException(e);
        }

        return url;
    }

    public static class URLResolverException extends Exception {
        URLResolverException(Throwable e) {
            super(e);
        }

        URLResolverException(String message) {
            super(message);
        }
    }
}

We now have a valid URL that we can use to open a connection to the remote server and begin downloading resources. The next section will tackle doing that securely.

Part 2: Resolving the URL

We’ll start by opening our connection to the remote server. In Java, this is easy - after parsing the string with parseUrl, simply call openConnection() on the resulting URL object and cast the returned URLConnection to an HttpURLConnection:

// rest of class elided for brevity
public byte[] resolveUrl(String urlStr) throws URLResolverException {
    URL url = parseUrl(urlStr);

    try {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    } catch (IOException e) {
        throw new URLResolverException(e);
    }

    return null; // stub - we'll use conn shortly
}

We’ve now reached out to the remote server. Some computer floating around in the ether is now communicating directly with our backend. Everything it sends us is untrusted user input too. We’ll start validating its input by armoring against the null case - what happens if the server never sends us anything at all? Let’s set some timeouts so that it can’t tie up our server’s network threads for too long:

// rest of class elided for brevity
public byte[] resolveUrl(String urlStr) throws URLResolverException {
    URL url = parseUrl(urlStr);

    try {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
    } catch (IOException e) {
        throw new URLResolverException(e);
    }

    return null; // stub - we'll read the response next
}

Now let’s start handling the response code from the remote server. HTTP response codes boil down to three general categories: everything is OK, something bad happened, or the thing you want exists somewhere else. That final case - the redirect case - is the most problematic for us. Let’s tackle it now.

Safely Handling Redirects

I’ll start off by saying that it’s likely your production application doesn’t need to handle redirects at all and can simply reject any URLs that attempt to redirect. In fact, I’d say that our profile picture application should do that. However, following redirects makes for a better user experience and is common enough that I’ll cover it.
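If you go the rejection route, the handling collapses to a single status check. A minimal sketch, assuming conn is the HttpURLConnection from the previous snippet (IOException handling elided, as above):

// Refuse anything that isn't a plain 200 OK, including all redirects.
// Disabling instance-level redirect following is required here; otherwise
// the JDK silently follows same-protocol redirects before we see the status.
conn.setInstanceFollowRedirects(false);

int status = conn.getResponseCode();

if (status != HttpURLConnection.HTTP_OK) {
    throw new URLResolverException("Expected 200 OK, got: " + status);
}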

First, we should always set a hard limit on the maximum number of redirects that we’ll follow in order to avoid falling into a redirect loop. RFC 1945, which defines HTTP/1.0, recommends never automatically redirecting more than 5 times. Let’s roll with that number.

On to implementation. HttpURLConnection does have the ability to automatically follow redirects - one simply needs to call setInstanceFollowRedirects(true) on the HttpURLConnection instance. HttpURLConnection reads its maximum redirect count from the Java system property http.maxRedirects, which is read once on startup and can be set via System.setProperty() or a command-line parameter. A minimal sketch of this built-in approach:
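// JVM-wide cap on automatic redirects, read once by the HTTP handler.
System.setProperty("http.maxRedirects", "5");

HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setInstanceFollowRedirects(true); // true is also the default
// conn.getInputStream() now transparently follows redirects, but the
// intermediate URLs are never surfaced for us to validate.

However, I personally view this solution as a bit inflexible. First, http.maxRedirects applies to your entire JVM instance, which likely isn’t desired. Second, you lose access to any intermediate URLs in a chain of redirects. This is an issue - it’s possible that one of those redirects points at a site-local address or an unsupported scheme! To combat this, we’ll roll our own redirection logic. I’ll post the code first, followed by an explanation: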

// rest of class elided for brevity
private static final int MAX_REDIRECTS = 5;

private HttpURLConnection connect(URL url, int redirectCount) throws URLResolverException {
	try {
	    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
	
	    conn.setConnectTimeout(5000);
	    conn.setReadTimeout(5000);
	    conn.setInstanceFollowRedirects(false);
	    conn.setRequestProperty("User-Agent", "your-user-agent");
	
	    int status = conn.getResponseCode();
	
	    switch (status) {
	        case HTTP_OK:
	            return conn;
	        case HTTP_MOVED_TEMP:
	        case HTTP_MOVED_PERM:
	        case HTTP_SEE_OTHER:
	            return handleRedirect(conn, redirectCount);
	        default:
	            throw new URLResolverException("Received non-OK or redirect status code: " + status);
	    }
	} catch (IOException e) {
	    throw new URLResolverException(e);
	}
}

private HttpURLConnection handleRedirect(HttpURLConnection conn, int redirectCount) throws URLResolverException {
    if (redirectCount == MAX_REDIRECTS) {
        throw new URLResolverException("Reached max redirects.");
    }

    String location = conn.getHeaderField("Location");
    
    if (location == null) {
    	throw new URLResolverException("Received null redirect.");
    }

    URL newUrl;

    try {
        if (location.startsWith("/")) {
            newUrl = new URL(conn.getURL(), location);
        } else {
            newUrl = new URL(location);
        }
    } catch (MalformedURLException e) {
        throw new URLResolverException(e);
    }

    // Re-run the Part 1 validation (scheme whitelist, site-local/loopback
    // rejection) on the redirect target before following it.
    validateUrl(newUrl);

    return connect(newUrl, redirectCount + 1);
}

Here’s a breakdown of what’s happening in the snippet above.

  1. First, we open our connection as described earlier.
  2. We set instanceFollowRedirects to false so that we can leverage our custom redirection logic.
  3. Next, we get the connection’s response code and branch over its value. We’re explicit about which status codes we’ll support. Any unsupported status code will result in an error being thrown.
  4. When we encounter a 301 (Moved Permanently), 302 (Found, historically Moved Temporarily), or 303 (See Other) status code, we extract the Location header and recursively open a new connection to the URL it provides us. Note that some servers provide Location headers as absolute paths rather than fully-qualified URLs, so if the location starts with / we rebuild a fully-qualified URL against the previous one.
  5. For each recursive call, we increment the value of redirectCount. If the maximum redirect count is exceeded, we throw an error.

It’s entirely possible to rewrite the above snippet without recursion; I’ll leave that as an exercise for the reader. However you decide to implement it, it’s important that the same checks we perform on the user-inputted URL are also performed on every URL provided as part of a redirect. That’s the job of the validateUrl call above - in Part 3 we’ll extract those checks out of parseUrl into a validateUrl helper so both code paths share them.

Handling the Response Body

At the end of the redirect journey, we’re left with a URL that points to a real resource somewhere on the web. It’s time to do the work we originally signed up for - downloading that resource. Luckily, HttpURLConnection gives us an InputStream to work with, so it’s easy to get started:

// rest of class elided for brevity

private byte[] handleOk(HttpURLConnection conn) throws URLResolverException {
    try (
        InputStream cStream = conn.getInputStream();
        BufferedInputStream bStream = new BufferedInputStream(cStream)
    ) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();

        int read;

        while ((read = bStream.read()) != -1) {
            baos.write(read);
        }

        return baos.toByteArray();
    } catch (IOException e) {
        throw new URLResolverException(e);
    }
}

Here, we’re wrapping the connection’s InputStream in a BufferedInputStream and copying the resulting bytes into a ByteArrayOutputStream. Then, we return the output stream’s bytes. Voilà! We have a byte array containing the content at the remote location.
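Incidentally, reading one byte at a time is acceptable here because BufferedInputStream amortizes the underlying socket reads. If you’d rather copy in chunks, an equivalent loop looks like this (a sketch):

byte[] buffer = new byte[8192];
int n;

while ((n = bStream.read(buffer)) != -1) {
    baos.write(buffer, 0, n);
}

That said, the size check we add later in this section is simpler to enforce byte-by-byte, which is why the snippets in this post stick with the single-byte loop.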

We’re not done yet, though. The above method can easily be improved. First, it’s common practice for web servers to compress content with GZIP in order to reduce bandwidth usage. A server signals compressed content with the Content-Encoding: gzip response header, so that’s the header we’ll check. Let’s add the ability to handle GZIP’d content to our implementation:

// rest of class elided for brevity

private byte[] handleOk(HttpURLConnection conn) throws URLResolverException {
    String contentEncoding = conn.getContentEncoding();

    try (
        InputStream cStream = conn.getInputStream();
        InputStream iStream = isGzip(contentEncoding) ? new GZIPInputStream(cStream)
            : cStream;
        BufferedInputStream bStream = new BufferedInputStream(iStream)
    ) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();

        int read;

        while ((read = bStream.read()) != -1) {
            baos.write(read);
        }

        return baos.toByteArray();
    } catch (IOException e) {
        throw new URLResolverException(e);
    }
}

GZIPInputStream will decompress our InputStream on the fly as we read from it. However, we’ve now magnified an as-of-yet unhandled attack vector: large files. With the above implementation, attackers can tie up our application by asking it to download extremely large files. Adding GZIP support makes this even easier, since attackers can craft zip bombs that inflate to many orders of magnitude larger than their compressed size. To fix this, we need to set a maximum file size and enforce it by counting the bytes we’ve written to our output stream. We can’t rely on the Content-Length header, since it can be forged. Here’s the final code:

// rest of class elided for brevity
// 10 MB maximum size
private static final long MAX_SIZE = 10000000L;

private byte[] handleOk(HttpURLConnection conn) throws URLResolverException {
    String contentEncoding = conn.getContentEncoding();

    try (
        InputStream cStream = conn.getInputStream();
        InputStream iStream = isGzip(contentEncoding) ? new GZIPInputStream(cStream)
            : cStream;
        BufferedInputStream bStream = new BufferedInputStream(iStream)
    ) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();

        int read;

        while ((read = bStream.read()) != -1) {
            if (baos.size() >= MAX_SIZE) {
                // try-with-resources will close the streams for us
                throw new URLResolverException("Reached maximum file size.");
            }

            baos.write(read);
        }

        return baos.toByteArray();
    } catch (IOException e) {
        throw new URLResolverException(e);
    }
}

That should do it. In the next section, we’ll refactor our class for usage in our application.

Part 3: Putting It All Together

We’ve written a fair amount of code thus far. Here’s the entire class, including imports:

package com.mslipper.urlresolver;

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.*;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.GZIPInputStream;

import static java.net.HttpURLConnection.*;

public class URLResolver {
    private static final long MAX_SIZE = 10000000L;

    private static final int MAX_REDIRECTS = 5;

    private static final Set<String> allowedSchemes = new HashSet<>();

    static {
        allowedSchemes.add("http");
        allowedSchemes.add("https");
    }

    public byte[] resolveUrl(String urlStr) throws URLResolverException {
        URL url = parseUrl(urlStr);
        HttpURLConnection conn = connect(url, 0);
        return handleOk(conn);
    }

    private HttpURLConnection connect(URL url, int redirectCount) throws URLResolverException {
        try {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();

            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            conn.setInstanceFollowRedirects(false);
            conn.setRequestProperty("User-Agent", "your-user-agent");

            int status = conn.getResponseCode();

            switch (status) {
                case HTTP_OK:
                    return conn;
                case HTTP_MOVED_TEMP:
                case HTTP_MOVED_PERM:
                case HTTP_SEE_OTHER:
                    return handleRedirect(conn, redirectCount);
                default:
                    throw new URLResolverException("Received non-OK or redirect status code: " + status);
            }
        } catch (IOException e) {
            throw new URLResolverException(e);
        }
    }

    private HttpURLConnection handleRedirect(HttpURLConnection conn, int redirectCount) throws URLResolverException {
        if (redirectCount == MAX_REDIRECTS) {
            throw new URLResolverException("Reached max redirects.");
        }

        String location = conn.getHeaderField("Location");

        if (location == null) {
            throw new URLResolverException("Received null redirect.");
        }

        URL newUrl;

        try {
            if (location.startsWith("/")) {
                newUrl = new URL(conn.getURL(), location);
            } else {
                newUrl = new URL(location);
            }
        } catch (MalformedURLException e) {
            throw new URLResolverException(e);
        }

        // Re-run the Part 1 validation on the redirect target before following it.
        validateUrl(newUrl);

        return connect(newUrl, redirectCount + 1);
    }

    private URL parseUrl(String urlStr) throws URLResolverException {
        URL url;

        try {
            url = new URL(urlStr);
        } catch (MalformedURLException e) {
            throw new URLResolverException(e);
        }

        validateUrl(url);

        return url;
    }

    // The Part 1 checks, extracted into a helper so that parseUrl() and
    // handleRedirect() validate URLs identically.
    private void validateUrl(URL url) throws URLResolverException {
        if (!allowedSchemes.contains(url.getProtocol())) {
            throw new URLResolverException("Invalid scheme.");
        }

        try {
            InetAddress address = InetAddress.getByName(url.getHost());

            if (address.isSiteLocalAddress() || address.isLoopbackAddress()) {
                throw new URLResolverException("Refusing to download site-local or loopback address.");
            }
        } catch (UnknownHostException e) {
            throw new URLResolverException(e);
        }
    }

    private byte[] handleOk(HttpURLConnection conn) throws URLResolverException {
        String contentEncoding = conn.getContentEncoding();

        try (
            InputStream cStream = conn.getInputStream();
            InputStream iStream = isGzip(contentEncoding) ? new GZIPInputStream(cStream)
                : cStream;
            BufferedInputStream bStream = new BufferedInputStream(iStream)
        ) {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();

            int read;

            while ((read = bStream.read()) != -1) {
                if (baos.size() >= MAX_SIZE) {
                    // try-with-resources will close the streams for us
                    throw new URLResolverException("Reached maximum file size.");
                }

                baos.write(read);
            }

            return baos.toByteArray();
        } catch (IOException e) {
            throw new URLResolverException(e);
        }
    }

    private boolean isGzip(String contentEncoding) {
        return contentEncoding != null && contentEncoding.toLowerCase().contains("gzip");
    }

    public static class URLResolverException extends Exception {
        public URLResolverException(String message) {
            super(message);
        }

        public URLResolverException(Throwable cause) {
            super(cause);
        }
    }
}
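Usage is straightforward. Here’s a quick, hypothetical example of calling the resolver from an avatar endpoint (storeAvatar and currentUserId are illustrative stand-ins, not part of the class):

URLResolver resolver = new URLResolver();

try {
    byte[] avatarBytes = resolver.resolveUrl("https://example.com/me.png");
    // e.g. verify the bytes are actually an image, then persist them
    storeAvatar(currentUserId, avatarBytes);
} catch (URLResolver.URLResolverException e) {
    // surface a 400-style error to the user
    System.err.println("Couldn't fetch avatar: " + e.getMessage());
}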

If you’re so inclined, you could refactor this class to accept some of the constants we’ve defined (e.g., the timeouts and maximum file size) as constructor parameters in order to provide more configurability.
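A minimal sketch of what that might look like (the field and parameter names here are my own, not fixed API):

public class URLResolver {
    private final int connectTimeoutMillis;
    private final int readTimeoutMillis;
    private final long maxSizeBytes;
    private final int maxRedirects;

    public URLResolver(int connectTimeoutMillis, int readTimeoutMillis,
                       long maxSizeBytes, int maxRedirects) {
        this.connectTimeoutMillis = connectTimeoutMillis;
        this.readTimeoutMillis = readTimeoutMillis;
        this.maxSizeBytes = maxSizeBytes;
        this.maxRedirects = maxRedirects;
    }

    // connect() would then read connectTimeoutMillis and readTimeoutMillis,
    // handleOk() maxSizeBytes, and handleRedirect() maxRedirects.
}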

I hope this post was of use. If you have any questions or comments, please don’t hesitate to send me an e-mail at inquiries@matthewslipper.com and I’d be happy to answer them.