URL

URLs, URIs, and URNs

Uniform Resource Locators (URL - rfc1738), also informally called "web addresses", are able to describe the name and location of a resource. A URL scheme, such as http, identifies the method used to access the resource. A URL host, such as www.boost.org, is used to identify where the resource is located. The interpretation of a URL might depend on scheme-specific requirements.

Table 1.1. Example: URLs

URL                               Scheme  Host           Resource
https://www.boost.org/index.html  https   www.boost.org  index.html
ftp://host.dom/etc/motd           ftp     host.dom       etc/motd
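
The columns of the table above can be reproduced with a few lines of standard C++. The following is a minimal sketch, not the library's parser: the names simple_url_parts and split_simple_url are hypothetical, and the code assumes the input has the simple form scheme://host/resource with no port, userinfo, query, or fragment.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper type for this illustration; not part of Boost.URL.
struct simple_url_parts
{
    std::string scheme;
    std::string host;
    std::string resource;
};

// Splits strings of the form scheme://host/resource.
// Returns std::nullopt when "://" is missing or the scheme is empty.
// Ports, userinfo, queries, and fragments are deliberately not handled.
std::optional<simple_url_parts> split_simple_url(std::string_view s)
{
    auto const sep = s.find("://");
    if (sep == std::string_view::npos || sep == 0)
        return std::nullopt;

    simple_url_parts parts;
    parts.scheme = std::string(s.substr(0, sep));

    std::string_view const rest = s.substr(sep + 3);
    auto const slash = rest.find('/');
    if (slash == std::string_view::npos)
    {
        parts.host = std::string(rest);      // nothing follows the host
    }
    else
    {
        parts.host = std::string(rest.substr(0, slash));
        parts.resource = std::string(rest.substr(slash + 1));
    }
    return parts;
}
```

A string without a scheme, such as www.boost.org, yields std::nullopt in this sketch.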


Classical View

URLs are often compared to Uniform Resource Names (URN - rfc2141), a scheme whose primary purpose is labeling resources with location-independent identifiers. URNs, like other schemes, have their own syntax. The scheme urn: is reserved for URNs, which do not specify how to locate a resource:

Table 1.2. Example: URN

URN                 Resource        Namespace  Identifier
urn:isbn:045145052  isbn:045145052  isbn       045145052
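
The decomposition in the table above can be sketched as follows. The names urn_parts and split_urn are hypothetical, the prefix is compared case-sensitively for simplicity, and no namespace-specific validation is attempted.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper type for this illustration; not part of Boost.URL.
struct urn_parts
{
    std::string resource;    // everything after "urn:"
    std::string name_space;  // for example "isbn"
    std::string identifier;  // namespace-specific string
};

// Splits "urn:<namespace>:<identifier>" into its parts.
// Returns std::nullopt when the prefix is not "urn:" or a part is empty.
std::optional<urn_parts> split_urn(std::string_view s)
{
    constexpr std::string_view prefix = "urn:";
    if (s.substr(0, prefix.size()) != prefix)
        return std::nullopt;

    std::string_view const resource = s.substr(prefix.size());
    auto const colon = resource.find(':');
    if (colon == std::string_view::npos ||
        colon == 0 ||
        colon + 1 == resource.size())
        return std::nullopt;

    urn_parts parts;
    parts.resource = std::string(resource);
    parts.name_space = std::string(resource.substr(0, colon));
    parts.identifier = std::string(resource.substr(colon + 1));
    return parts;
}
```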


Uniform Resource Identifiers (URI - rfc3986) define a general scheme-independent syntax for references to abstract or physical resources. The initial URI specification (rfc2396) described them as either URLs or URNs (rfc2396, Section 1.2). The current specifications (rfc3986) refer to this hierarchy as the Classical View (rfc3305, Section 2.1) of URI partitioning:

Table 1.3. URIs: Classical View

URI                          Category
https://www.boost.org        URL
mailto:mduerst@ifi.unizh.ch  URL
telnet://melvyl.ucop.edu/    URL
urn:isbn:0-486-27557-4       URN


The following are examples of invalid URIs:

Table 1.4. Invalid URIs

Component                     Example            Note
Protocol-Relative Link (PRL)  www.boost.org [a]  Missing scheme.
URI-reference                 index.html         Missing scheme. Missing urn: scheme and requirements.

[a] Formally, www.boost.org is either a URI-reference with path www.boost.org (rfc3986 Section 4.1) or a Protocol-Relative Link (PRL). It is not a URI according to rfc3986, although often described as such in some sources.


Contemporary View

The Classical View of URI partitioning, where a URI is either a URL or a URN, caused enough confusion to justify a specification about URI partitioning (rfc3305).

Common sources of confusion in the Classical View were:

  1. Most possible URIs were also URLs.
  2. URLs and relative references are not required to locate a resource, yet they are not URNs either.
  3. Scheme-independent URLs and URIs have the same grammar. A single algorithm is used for parsing both.
  4. URNs have scheme-specific requirements beyond the URI specification.

Thus, the URL/URN hierarchy became less relevant and the Contemporary View of URI partitioning (rfc3305, Section 2.2) is now that:

  1. URLs don't refer to a formal partition of the URI space.
  2. A scheme does not have to be classified into the discrete URL/URN categories.
  3. The urn: scheme is one of many possible URI schemes.
  4. All schemes can define subspaces and urn: namespaces are URN subspaces.
  5. Any URI can be a locator, a name, or both.

In this view, URLs and URIs share the same grammar, and the two terms are used interchangeably in that regard.

Table 1.5. URLs (or URIs): Contemporary View

Example                           Scheme  Host (Locator Component)  Path (Name Component)
https://www.boost.org/index.html  https   www.boost.org             index.html
telnet://melvyl.ucop.edu/         telnet  melvyl.ucop.edu
mailto:mduerst@ifi.unizh.ch       mailto                            mduerst@ifi.unizh.ch
urn:isbn:0-486-27557-4            urn                               isbn:0-486-27557-4


The Contemporary View has been endorsed by rfc3305 (Section 5), and has been in use in all other specifications since then, including the current URI grammar (rfc3986, Section 1.1.3).

Although URIs and URLs have the same grammar, it's often useful to standardize on one of these terms. Recent RFC documents standardize on the term URI rather than the more restrictive term URL. However, the term URL remains far more common in other contexts because it is more specific, which makes communication clearer.

This library also adheres to this Contemporary View of URI partitioning and standardizes on the term "URL".

Notation

Following the syntax in rfc3986, a single algorithm is used for URLs, URIs and IRIs. When discussing particular grammars, their rules are presented exactly as they appear in the literature.

A URL string can be parsed using one of the parsing functions.

Table 1.6. Parsing Functions

Function             Grammar        Example                                                Notes
parse_uri            URI            http://www.boost.org/index.html?field=value#downloads  Supports fragment #downloads
parse_absolute_uri   absolute-URI   http://www.boost.org/index.html?field=value            Does not support fragment
parse_relative_ref   relative-ref   //www.boost.org/index.html?field=value#downloads       Does not require scheme
parse_uri_reference  URI-reference  http://www.boost.org/index.html                        Any URI or relative-ref
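
The notes in the table above come down to two questions: does the string start with a scheme, and does it contain a fragment? The sketch below answers both using only the standard library. It is a rough pre-classification under stated assumptions, not the library's parser: starts_with_scheme, grammar_hint, and classify are hypothetical names, and the remainder of the string is not validated against the grammar.

```cpp
#include <cstddef>
#include <string_view>

// rfc3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
// Returns true when `s` begins with a scheme followed by ':'.
bool starts_with_scheme(std::string_view s)
{
    auto const is_alpha = [](char c)
    { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); };
    auto const is_digit = [](char c)
    { return c >= '0' && c <= '9'; };

    if (s.empty() || !is_alpha(s[0]))
        return false;
    for (std::size_t i = 1; i < s.size(); ++i)
    {
        char const c = s[i];
        if (c == ':')
            return true;
        if (!is_alpha(c) && !is_digit(c) && c != '+' && c != '-' && c != '.')
            return false;
    }
    return false;   // reached the end without finding ':'
}

// Hypothetical classification used only in this illustration.
enum class grammar_hint
{
    uri_only,      // scheme and fragment: URI, but not absolute-URI
    absolute_uri,  // scheme, no fragment: absolute-URI (and also URI)
    relative_ref   // no scheme: candidate relative-ref
};

grammar_hint classify(std::string_view s)
{
    if (!starts_with_scheme(s))
        return grammar_hint::relative_ref;
    return s.find('#') != std::string_view::npos
        ? grammar_hint::uri_only
        : grammar_hint::absolute_uri;
}
```

Every string in any of the three categories is also a candidate URI-reference, which is why that grammar needs no category of its own here.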


The library uses the convention that each function parse_<component> operates according to the grammar rule <component> specified in rfc3986. That document inherits its terminology from rfc2396, which defines no URL, absolute-URL, or URL-reference rules. Thus, for consistency, the names of the main parsing functions refer to URIs rather than URLs.

The collective grammars parsed by these algorithms are specified below.

absolute-URI    = scheme ":" hier-part [ "?" query ]

relative-ref    = relative-part [ "?" query ] [ "#" fragment ]

URI             = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

URI-reference   = URI / relative-ref

hier-part       = "//" authority path-abempty
                / path-absolute
                / path-rootless
                / path-empty

relative-part   = "//" authority path-abempty
                / path-absolute
                / path-noscheme
                / path-empty

Example

The following is an example URI and its main parts:

     foo://example.com:8042/over/there?name=ferret#nose
     \_/   \______________/\_________/ \_________/ \__/
      |           |            |            |        |
   scheme     authority       path        query   fragment
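
The same five parts can be extracted with the regular expression given in rfc3986, Appendix B. The sketch below applies it with std::regex. The names uri_components and split_uri_reference are hypothetical, and the expression only separates components; it does not check that each one is valid, so it is not a substitute for the parsing functions.

```cpp
#include <cstddef>
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper type for this illustration; not part of Boost.URL.
struct uri_components
{
    std::optional<std::string> scheme;
    std::optional<std::string> authority;
    std::string path;                      // always present, may be empty
    std::optional<std::string> query;
    std::optional<std::string> fragment;
};

// Separates a URI-reference into components using the regular
// expression from rfc3986, Appendix B. No validation is performed.
std::optional<uri_components> split_uri_reference(std::string const& s)
{
    static std::regex const re(
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");

    std::smatch m;
    if (!std::regex_match(s, m, re))
        return std::nullopt;   // e.g. a line break inside the fragment

    // An optional component is present only if its outer group matched.
    auto const group = [&m](std::size_t present, std::size_t value)
        -> std::optional<std::string>
    {
        if (!m[present].matched)
            return std::nullopt;
        return m[value].str();
    };

    uri_components c;
    c.scheme = group(1, 2);
    c.authority = group(3, 4);
    c.path = m[5].str();
    c.query = group(6, 7);
    c.fragment = group(8, 9);
    return c;
}
```

Using std::optional for every part except the path keeps the distinction between an absent component and an empty one, for example between index.html and index.html? with an empty query.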

For the complete specification please refer to rfc3986.

Note

This documentation refers to the Augmented Backus-Naur Form (ABNF) notation of rfc2234 to specify particular grammars used by algorithms and containers. While a complete understanding of the notation is not a requirement for using the library, it may help for understanding how valid components of URLs are defined. In particular, this will be of interest to users who wish to compose parsing algorithms using the combinators provided by the library.

Functions

All parsing functions accept a string_view and return a result<url_view>. The following example parses a string literal containing a URI:

urls::result< urls::url_view > r = urls::parse_uri( "https://www.example.com/path/to/file.txt" );

if( r.has_value() )                         // parsing was successful
{
    urls::url_view u = r.value();           // extract the urls::url_view

    std::cout << u;                         // format the URL to cout
}
else
{
    std::cout << r.error().message();       // parsing failure; print error
}

While the parsing function refers to the URI grammar rule, the result refers to a url_view. The convention parse_<component> produces parse_uri for the URI grammar rule defined in rfc3986. However, as the library adheres to the Contemporary View of URI partitioning and standardizes on the term "URL", it makes reference to the term "URL" elsewhere.

When the input does not match the URL grammar, the error is reported through a result<url_view>. The result is a variant-like object which holds a url_view, or an error_code in the case where the parsing failed. Note that like a string view, the URL view does not own the underlying character buffer. Instead, it references the string passed to the parsing function. The caller is required to ensure that the lifetime of the string extends until the view is destroyed.
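
The ownership rule is the same one that applies to std::string_view, so it can be illustrated without the library. In the sketch below, before_query and owned_before_query are hypothetical helpers: the first returns a view into the caller's buffer, and the second copies the viewed characters while the buffer is still alive.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: returns the part of `s` before the first '?'.
// The returned view points into the caller's buffer; nothing is copied.
std::string_view before_query(std::string_view s)
{
    return s.substr(0, s.find('?'));
}

// Returns an owning copy that stays valid after the source buffer is gone.
std::string owned_before_query()
{
    std::string buffer = "/path/to/file.txt?id=1";
    std::string_view const view = before_query(buffer);   // refers to `buffer`

    // Copy while `buffer` is still alive. Returning `view` itself would
    // hand the caller a reference to a destroyed string.
    return std::string(view);
}
```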

Copying

The function url_view::collect may be used to create a copy of the underlying character buffer and attach ownership of the buffer to a newly returned view, which is wrapped in a shared pointer. The following code calls collect to create a read-only copy:

// This will hold our copy
std::shared_ptr<urls::url_view const> sp;
{
    std::string s = "/path/to/file.txt";

    // result::value() will throw an exception if an error occurs
    urls::url_view u = urls::parse_relative_ref( s ).value();

    // create a copy with ownership and string lifetime extension
    sp = u.collect();

    // At this point the string goes out of scope
}

// but `*sp` remains valid since it has its own copy
std::cout << *sp << "\n";

The interface of url_view decomposes the URL into its individual parts and allows for inspection of the various parts as well as returning metadata about the URL itself. These non-modifying observer operations are described in the sections that follow.

To create a mutable copy of the url_view, one can just create a url:

// This will hold our mutable copy
urls::url v;
{
    std::string s = "/path/to/file.txt";

    // result::value() will throw an exception if an error occurs
    v = urls::parse_relative_ref(s).value();

    // At this point the string goes out of scope
}

// but `v` remains valid since it has its own copy
std::cout << v << "\n";

// and it's mutable
v.set_encoded_fragment("anchor");

std::cout << v << "\n";

Return Type

In many places, functions in the library have a return type which uses the result alias template. This class allows the parsing algorithms to report errors without referring to exceptions.

The functions result::has_value and result::has_error can be used to check if the result contains an error.

urls::result< urls::url_view > r = urls::parse_uri( "https://www.example.com/path/to/file.txt" );

if( r.has_value() )                         // parsing was successful
{
    urls::url_view u = r.value();           // extract the urls::url_view

    std::cout << u;                         // format the URL to cout
}
else
{
    std::cout << r.error().message();       // parsing failure; print error
}

Checking has_value first ensures that result::value will not throw. In contexts where throwing is acceptable, result::value can be called directly, without the check.

Check the reference for result for a synopsis of the type. For complete information please consult the full result documentation in Boost.System.
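
To make the idea of a variant-like return type concrete, the following sketch builds a much smaller stand-in on top of std::variant. It is an illustration only: simple_result and parse_port are hypothetical names, and the real result type in Boost.System has a richer interface than what is shown here.

```cpp
#include <string>
#include <system_error>
#include <utility>
#include <variant>

// Hypothetical stand-in for illustration; not the Boost.System type.
template <class T>
class simple_result
{
    std::variant<T, std::error_code> v_;

public:
    simple_result(T value)
        : v_(std::in_place_index<0>, std::move(value)) {}
    simple_result(std::error_code ec)
        : v_(std::in_place_index<1>, ec) {}

    bool has_value() const noexcept { return v_.index() == 0; }
    bool has_error() const noexcept { return v_.index() == 1; }

    // Throws std::system_error when an error is stored.
    T const& value() const
    {
        if (has_error())
            throw std::system_error(std::get<1>(v_));
        return std::get<0>(v_);
    }

    // Returns a default-constructed (success) code when a value is stored.
    std::error_code error() const noexcept
    {
        return has_error() ? std::get<1>(v_) : std::error_code();
    }
};

// Hypothetical parser used only to exercise simple_result:
// accepts 1 to 5 decimal digits with a value of at most 65535.
simple_result<int> parse_port(std::string const& s)
{
    if (s.empty() || s.size() > 5)
        return std::make_error_code(std::errc::invalid_argument);
    int port = 0;
    for (char const c : s)
    {
        if (c < '0' || c > '9')
            return std::make_error_code(std::errc::invalid_argument);
        port = port * 10 + (c - '0');
    }
    if (port > 65535)
        return std::make_error_code(std::errc::result_out_of_range);
    return port;
}
```

A caller can test has_value before calling value, or call value directly and let the exception propagate, which mirrors the two styles described above.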

