
RFC 3986

Functions like parse_uri are sufficient for parsing standalone URLs, but they require that the entire string be consumed. When URLs appear as components of a larger grammar, it is preferable to compose parsing rules so that URLs can be processed together with other elements that may be unrelated to resource locators. To achieve this, the library provides rules for the top-level BNF productions found in RFC 3986, along with a rule for matching percent-encoded strings.

Percent Encoding

The percent-encoding mechanism is used to represent a data octet in a component when the corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component. A percent-encoded octet (also called an escape) is written as the percent character ('%') followed by two hexadecimal digits representing the octet's numeric value:

pct-encoded   = "%" HEXDIG HEXDIG
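As an illustration of the production above, the following is a minimal sketch of percent-encoding written with the standard library only. It is not the library's implementation; the names is_unreserved and pct_encode_sketch are hypothetical, and the sketch assumes that every octet outside the RFC 3986 unreserved set should be escaped.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper (not part of the library):
// true for the RFC 3986 "unreserved" characters.
inline bool is_unreserved(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ||
           (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~';
}

// Hypothetical helper (not part of the library):
// replaces every octet outside the unreserved set with
// "%" HEXDIG HEXDIG, using uppercase hexadecimal digits.
inline std::string pct_encode_sketch(std::string_view in)
{
    static constexpr char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in)
    {
        if (is_unreserved(c))
        {
            out.push_back(static_cast<char>(c));
        }
        else
        {
            out.push_back('%');
            out.push_back(hex[c >> 4]);     // high nibble
            out.push_back(hex[c & 0x0F]);   // low nibble
        }
    }
    return out;
}
```

With this sketch, a space (octet 0x20) becomes "%20" and a slash (octet 0x2F) becomes "%2F", while unreserved characters pass through unchanged.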

URL components that may contain percent-encoded characters are specified in the component's BNF using the pct-encoded production, together with the characters that may appear unescaped in that component. For example, this is how path characters are described in section 3.3 (Path) of the RFC:

pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

The library provides pct_encoded_rule for matching percent-encoded strings. This function is passed the set of characters that may be used without escapes and returns a suitable rule. If the input is valid, that is, if it contains no invalid escape sequences, the rule returns a pct_string_view, which refers to the validated percent-encoded string. A decode_view can be obtained from it (for example by dereferencing it). This is a forward range of characters which performs percent-decoding when iterated. It also supports equality and comparison to unencoded strings, without allocating memory. In the example below we parse the string s as a series of zero or more pchars:

result< pct_string_view > rv = parse( s, pct_encoded_rule( pchars ) );
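To make the validity condition concrete, here is a hedged standard-library sketch of what a full-string match against "allowed characters plus valid escapes" involves. It is not how the library is implemented: the names hex_value and decode_if_valid are hypothetical, and unlike the library's non-allocating view, this sketch decodes eagerly into a std::string.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Value of a hexadecimal digit, or -1 if c is not a HEXDIG.
inline int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Hypothetical helper (not part of the library):
// accepts the input only if every character is either in the
// allowed set or part of a well-formed "%" HEXDIG HEXDIG escape.
// Returns the decoded octets on success, std::nullopt otherwise.
template <class Pred>
std::optional<std::string>
decode_if_valid(std::string_view in, Pred allowed)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        char const c = in[i];
        if (c == '%')
        {
            // An escape needs two more characters.
            if (in.size() - i < 3)
                return std::nullopt;
            int const hi = hex_value(in[i + 1]);
            int const lo = hex_value(in[i + 2]);
            if (hi < 0 || lo < 0)
                return std::nullopt;
            out.push_back(static_cast<char>(hi * 16 + lo));
            i += 2; // skip the two hex digits
        }
        else if (allowed(c))
        {
            out.push_back(c);
        }
        else
        {
            return std::nullopt; // character needs escaping
        }
    }
    return out;
}
```

The predicate plays the role of the character set passed to the rule: a truncated escape such as "%2" or a non-hexadecimal escape such as "%zz" is rejected, as is any unescaped character outside the set.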

The library defines and uses the following character set constants to specify rules for percent-encoded URL components:

Table 1.8. URL Character Sets

Name                BNF

gen_delim_chars     gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"

pchars              pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

reserved_chars      (everything but unreserved_chars)

sub_delim_chars     sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                                  / "*" / "+" / "," / ";" / "="

unreserved_chars    unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
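The relationships between these sets can be sketched with a simple lookup table. The following is an illustrative stand-in only, assuming C++17; the type char_table and the constants ending in _set are hypothetical names and do not describe how the library's constants are defined. Note that the pct-encoded alternative of pchar is not a character set, so it is left to the escape-matching logic.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for a character set:
// one flag for each possible octet value.
struct char_table
{
    std::array<bool, 256> bits{};

    constexpr char_table() = default;

    // Build a set from an explicit list of characters.
    constexpr explicit char_table(std::string_view chars)
    {
        for (char c : chars)
            bits[static_cast<unsigned char>(c)] = true;
    }

    // Membership test.
    constexpr bool operator()(char c) const
    {
        return bits[static_cast<unsigned char>(c)];
    }
};

// Set union.
constexpr char_table operator+(char_table a, char_table const& b)
{
    for (std::size_t i = 0; i < 256; ++i)
        a.bits[i] = a.bits[i] || b.bits[i];
    return a;
}

// Set complement.
constexpr char_table operator~(char_table a)
{
    for (std::size_t i = 0; i < 256; ++i)
        a.bits[i] = !a.bits[i];
    return a;
}

// Hypothetical constants mirroring the table above.
constexpr char_table unreserved_set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~");
constexpr char_table gen_delim_set(":/?#[]@");
constexpr char_table sub_delim_set("!$&'()*+,;=");
constexpr char_table reserved_set = ~unreserved_set;
constexpr char_table pchar_set =
    unreserved_set + sub_delim_set + char_table(":@");
```

Because every operation is constexpr, the tables in this sketch are computed at compile time and a membership test is a single array lookup.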

URL Rules

When a URL can appear in the context of a larger grammar, it may be desirable to express the enclosing grammar in a single rule that incorporates the URL as an element. To achieve this, the library makes public the rules used to implement high-level parsing of complete strings as URL components, so that these components may be parsed as part of a larger string containing non-URL elements. Here we present a rule suitable for parsing the HTTP request-line:

// request-line   = method SP request-target SP HTTP-version CRLF

constexpr auto request_line_rule = tuple_rule(
    not_empty_rule( token_rule( alpha_chars ) ),    // method
    squelch( delim_rule( ' ' ) ),                   // SP
    variant_rule(
        absolute_uri_rule,                          // absolute-uri or
        relative_ref_rule),                         // relative-ref
    squelch( delim_rule( ' ' ) ),                   // SP
    squelch( literal_rule( "HTTP/" ) ),             // "HTTP/"
    delim_rule( digit_chars ),                      // DIGIT
    squelch( delim_rule( '.' ) ),                   // "."
    delim_rule( digit_chars ),                      // DIGIT
    squelch( literal_rule( "\r\n" ) ) );            // CRLF
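To show which elements of the request-line survive once the delimiters are discarded, here is a hand-written sketch of the same shape using only the standard library. It is not a substitute for the rule above: the names request_line_parts and parse_request_line_sketch are hypothetical, and the request-target is treated as an opaque run of characters up to the next space rather than being validated as an absolute-URI or relative-ref.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical result type: only the elements that are
// not discarded (method, target, two version digits).
struct request_line_parts
{
    std::string method;
    std::string target;   // NOT validated in this sketch
    char major = '0';
    char minor = '0';
};

inline bool is_alpha_char(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

inline bool is_digit_char(char c)
{
    return c >= '0' && c <= '9';
}

// Hypothetical helper (not part of the library):
// request-line = method SP request-target SP HTTP-version CRLF
inline std::optional<request_line_parts>
parse_request_line_sketch(std::string_view s)
{
    request_line_parts r;

    // method: one or more ALPHA
    std::size_t i = 0;
    while (i < s.size() && is_alpha_char(s[i]))
        ++i;
    if (i == 0)
        return std::nullopt;
    r.method = std::string(s.substr(0, i));

    // SP
    if (i >= s.size() || s[i] != ' ')
        return std::nullopt;
    ++i;

    // request-target: everything up to the next SP
    std::size_t const sp = s.find(' ', i);
    if (sp == std::string_view::npos)
        return std::nullopt;
    r.target = std::string(s.substr(i, sp - i));

    // "HTTP/" DIGIT "." DIGIT CRLF is exactly 10 characters,
    // and nothing may follow it.
    std::string_view const rest = s.substr(sp + 1);
    if (rest.size() != 10 ||
        rest.substr(0, 5) != "HTTP/" ||
        !is_digit_char(rest[5]) ||
        rest[6] != '.' ||
        !is_digit_char(rest[7]) ||
        rest.substr(8) != "\r\n")
        return std::nullopt;
    r.major = rest[5];
    r.minor = rest[7];
    return r;
}
```

The result keeps four values (method, target, and the two version digits), while the spaces, the "HTTP/" literal, the dot, and the CRLF are checked and then dropped, which is the role squelch plays in the rule above.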

The library offers the following rules so that custom rule definitions can incorporate the various forms of valid URLs:

Table 1.9. RFC3986 Rules

Name                  BNF

absolute_uri_rule     absolute-URI    = scheme ":" hier-part [ "?" query ]

                      hier-part       = "//" authority path-abempty
                                      / path-absolute
                                      / path-rootless
                                      / path-empty

authority_rule        authority       = [ userinfo "@" ] host [ ":" port ]

origin_form_rule      origin-form     = absolute-path [ "?" query ]

                      absolute-path   = 1*( "/" segment )

relative_ref_rule     relative-ref    = relative-part [ "?" query ] [ "#" fragment ]

uri_reference_rule    URI-reference   = URI / relative-ref

uri_rule              URI             = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
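For readers who want to see how the five top-level components of a URI-reference relate to one another, the sketch below applies the regular expression given in Appendix B of RFC 3986 using std::regex. It only splits a string into components and performs no validation, so it is not equivalent to the rules in the table; the names uri_parts and split_uri_reference are hypothetical.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical result type. A component that is absent is
// std::nullopt, which is distinct from present-but-empty.
struct uri_parts
{
    std::optional<std::string> scheme;
    std::optional<std::string> authority;
    std::string path;
    std::optional<std::string> query;
    std::optional<std::string> fragment;
};

// Hypothetical helper (not part of the library):
// splits a URI-reference with the regular expression from
// RFC 3986, Appendix B. No component is validated.
inline std::optional<uri_parts>
split_uri_reference(std::string const& s)
{
    static std::regex const re(
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");

    std::smatch m;
    if (!std::regex_match(s, m, re))
        return std::nullopt;

    uri_parts out;
    if (m[2].matched) out.scheme    = m[2].str(); // before ":"
    if (m[4].matched) out.authority = m[4].str(); // after "//"
    out.path = m[5].str();                        // always present
    if (m[7].matched) out.query     = m[7].str(); // after "?"
    if (m[9].matched) out.fragment  = m[9].str(); // after "#"
    return out;
}
```

Applied to the example from the RFC, "http://www.ics.uci.edu/pub/ietf/uri/#Related", this yields the scheme "http", the authority "www.ics.uci.edu", the path "/pub/ietf/uri/", no query, and the fragment "Related".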

