What exactly is a URL?
Since the early days of the internet, URLs have provided a uniform way of clearly identifying network resources. The URL, an RFC standard since 1994, gives internet users a general syntax for locating and retrieving public content on demand. This makes the URL one of the most basic technologies of the internet. Internet users rely on URLs every day to access resources through a browser, and their use is not limited to addressing web pages.
What is it?
The abbreviation “URL” stands for “Uniform Resource Locator”. URLs are a subtype of Uniform Resource Identifiers (URIs), so the structure of a URL follows URI syntax.
URIs make it possible to identify resources unambiguously, both locally and worldwide on the internet. Although the URL is only one subtype of identifier, the term is often used interchangeably with “internet address”. This is because of the URL’s main use: addressing web pages. However, URLs are not limited to this function. Files in the local file system can also be located using URLs, for example. This means that every internet address is a URL, but not every URL is an internet address.
URLs allow you to uniquely address resources and request them as needed. For example, internet users enter URLs in the browser’s address bar to access web pages or download files.
Every URL consists of a scheme and a scheme-specific part:
Scheme: the URL scheme specifies both the kind of resource and the method needed to access it. The scheme often has the same name as the application-layer protocol used for access. Common schemes are mailto, file, ftp and http/https.
Scheme-specific part: depending on the scheme, the scheme-specific part of the URL is made up of a number of segments that contain the resource’s location as well as optional processing parameters.
The separator between the scheme and the scheme-specific part is a colon. Depending on the scheme, two slashes may follow the colon. In URI syntax they introduce the authority component, which is why they appear in http and ftp URLs but not in mailto URLs.
A URL is based on the following URI syntax:
scheme://[user[:password]@]host[:port][/path][?query][#fragment]
Each segment of the scheme-specific part has its own function. The user, password, host and port segments together form the “authority”. The authority indicates which computer a resource can be found on and what name is assigned to it.
- user and password: these segments contain the username and password of the person authorized to access the resource, separated from each other by a colon. Both are only required if the resource requests authentication. An @ sign separates the username and password from the host segment
- host: the host segment usually contains a domain name made up of a top-level, second-level and third-level domain, indicating the host from which the resource is to be retrieved. Alternatively, the computer can be specified by its IP address
- port: the port number addresses a specific TCP/IP port on the host. Since most schemes already have a default port, specifying one is optional. For example, the default ports are 80 for HTTP, 443 for HTTPS and 21 for FTP. A port number only needs to be given if the scheme defines no default port or if a non-default port is being used. The port number is separated from the host segment by a colon
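The authority segments described above can be picked apart with Python's standard `urllib.parse` module. The following is a minimal sketch: the URLs, the credentials and the helper name `describe_authority` are hypothetical, and the default-port table only covers the three protocols mentioned in the text.

```python
from urllib.parse import urlsplit

# Default ports named in the text; other schemes are not covered here.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def describe_authority(url: str) -> dict:
    """Split a URL and return the parts of its authority (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,      # None if no user is given
        "password": parts.password,  # None if no password is given
        "host": parts.hostname,
        # Fall back to the default port when the URL does not name one.
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
    }


info = describe_authority("ftp://alice:secret@ftp.example.org:2121/pub/report.txt")
print(info)
```

For a URL without credentials or an explicit port, such as `https://www.example.org/index`, the same helper reports `None` for user and password and falls back to port 443.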
The domain in the “authority” is usually specified in human-readable form. Computers, on the other hand, work with IP addresses. Visiting a website therefore requires an intermediate step that the user does not notice: name resolution via the Domain Name System (DNS).
Note: DNS is an IP-based network service responsible for resolving domain names into IP addresses. Internet service providers typically operate DNS servers for this purpose. When an internet user visits a web page, their router first forwards the request to the responsible DNS server. The DNS server then looks up the IP address that matches the requested domain and sends it back. Once the router has received the IP address, the corresponding web server can be addressed.
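Name resolution can also be triggered by hand. The sketch below asks the operating system's resolver for the addresses behind a host name using `socket.getaddrinfo`. The helper name `resolve` is hypothetical, and the demonstration call uses an IP literal so that it runs without network access. Passing a real domain name would require a reachable DNS server.

```python
import socket


def resolve(host: str) -> list[str]:
    """Return the IP addresses the system resolver reports for a host (hypothetical helper)."""
    results = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({sockaddr[0] for *_, sockaddr in results})


# An IP literal is returned as-is without a DNS query.
print(resolve("127.0.0.1"))
```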
The URI’s authority is followed by an indication of where the resource is located on the computer, as well as the optional components: query string and fragment identifier.
- path: the path segment references the resource and gives its location on the target computer. The path always starts with a slash (/)
- query: some websites contain executable components and expect a “query string” (also called the query part) in addition to the path. It contains parameters (such as user input) that are passed along with the URL and processed by the server. This is common for dynamic web pages, which are generated from database records only at the time of retrieval. The query string always begins with a question mark (?)
- fragment: if a specific location within a resource needs to be referenced, the URI ends with a fragment identifier. It is introduced by a hash sign (#) and usually refers to a uniquely identified anchor within an HTML document, such as a subheading
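The three segments that follow the authority can be separated in the same way. This is a small sketch using `urllib.parse`. The shop URL and its parameter names are made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical URL with a path, a query string and a fragment.
url = "https://www.example.org/shop/search?category=books&page=2#results"
parts = urlsplit(url)

path = parts.path              # everything from the first slash up to "?"
params = parse_qs(parts.query) # query parameters as a dict of value lists
fragment = parts.fragment      # the part after "#"

print(path)
print(params)
print(fragment)
```

`parse_qs` returns a list per parameter because the same name may legally appear more than once in a query string.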
Which elements of the URI syntax a URL contains depends on the scheme. The structure of the URL is determined by the type of resource. The following are the most common URL types:
Web pages are retrieved using HTTP (Hypertext Transfer Protocol) or HTTPS (Hypertext Transfer Protocol Secure). The latter transmits data over a connection encrypted with TLS (formerly SSL). The URL structure is the same for both protocols.
Retrieving a web page usually requires no authentication. The “authority” then only contains the domain under which the website can be accessed. The username and password are omitted.
Mailto is a URL scheme for email addresses that allows website operators to include email hyperlinks on their website. When an internet user clicks on a mailto link, most browsers open the system’s default email program with a new email window. The email address given in the scheme-specific part is entered as the recipient address in that window. The user does not have to start the program themselves, nor do they have to type in the email address manually.
In URLs that use the mailto scheme, the addressee’s email address is listed in the scheme-specific part. The scheme and the scheme-specific part are again separated by a colon, but the double slash is omitted. Using a query string, you can set mail headers, for example to pre-fill the subject and body of the email.
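To illustrate how a mailto URL with pre-filled headers is put together, here is a minimal sketch. The helper `build_mailto`, the address and the header values are assumptions for this example. Spaces are percent-encoded as `%20` rather than `+`, which is an encoding choice made here because mail clients do not all treat `+` as a space.

```python
from urllib.parse import quote, urlencode


def build_mailto(address: str, subject: str = "", body: str = "") -> str:
    """Build a mailto URL with optional subject and body (hypothetical helper)."""
    # Only include headers that were actually provided.
    headers = {k: v for k, v in (("subject", subject), ("body", body)) if v}
    url = "mailto:" + quote(address, safe="@")  # colon only, no double slash
    if headers:
        # quote_via=quote encodes spaces as %20 instead of "+".
        url += "?" + urlencode(headers, quote_via=quote)
    return url


link = build_mailto("info@example.org", subject="Hello there", body="Line 1")
print(link)
```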
The file scheme is used to open specific files on your own computer. If you enter a valid file path as a URL in the address bar of a web browser, it will open the requested directory or file.
Since the file scheme refers to a local resource, the authority is omitted. The file path always starts with a slash. This results in a URL with three consecutive slashes.
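The three consecutive slashes follow directly from the empty authority, as this short sketch shows. It assumes a POSIX-style absolute path, and the helper name `file_url` and the example path are made up.

```python
from urllib.parse import quote, urlsplit


def file_url(absolute_path: str) -> str:
    """Turn an absolute POSIX path into a file URL (hypothetical helper)."""
    if not absolute_path.startswith("/"):
        raise ValueError("expected an absolute POSIX path")
    # "file://" + empty authority + path that itself begins with "/".
    return "file://" + quote(absolute_path)


url = file_url("/home/user/my notes.txt")
parts = urlsplit(url)
print(url)
print(repr(parts.netloc), parts.path)
```

Parsing the result again confirms that the authority is an empty string and that the path carries the leading slash.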
URLs with the ftp scheme allow access to files located on another machine (remote access). The File Transfer Protocol (FTP), from which the scheme takes its name, is used for transmission.
A user who wants to access files in a remote file system using FTP usually has to authenticate themselves. Therefore, URLs that reference FTP resources often contain access data (username and password).
Permitted characters in a URL
The URL standard only supports a limited set of characters from the American Standard Code for Information Interchange (ASCII). In addition, various characters already have defined functions, such as separating individual segments, which is what allows a URL to be broken down and processed.
The following characters have already been assigned a specific function in the URL standard:
- : / ? # [ ] @ ! $ & ' ( ) * + , ; =
For example, the question mark (?) introduces a query string. The individual parameters in the query string are delimited with the ampersand (&). The separator between parameter name and value is the equals sign (=). The hash sign (#) introduces the jump label.
Characters without a predefined function include all letters and digits and the special characters listed below:
- A-Z, a-z
- 0-9
- - . _ ~
ASCII characters other than those listed here, as well as non-ASCII characters, may not be used directly in URLs and must be encoded. It is also possible to encode a reserved character to prevent it from being interpreted with its predefined meaning. To encode ASCII characters, the URL standard uses the escape character % (percent) followed by the character’s value from the ASCII table in hexadecimal notation. Non-ASCII characters are also encoded using percent notation. RFC 3986 recommends basing this encoding on UTF-8. This recommendation is not binding, and service providers ultimately decide which encoding is used.
In contrast, special characters in domain names are converted to ASCII-compatible strings using Punycode.
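Percent-encoding can be tried out with `quote` and `unquote` from Python's `urllib.parse`, which encode non-ASCII characters via UTF-8 by default. The sample strings below are made up for illustration.

```python
from urllib.parse import quote, unquote

# A space is not allowed in a URL and becomes %20 (hex 20 in the ASCII table).
encoded_space = quote("annual report.pdf")

# safe="" also encodes reserved characters so they lose their special meaning.
encoded_reserved = quote("a&b=c", safe="")

# Non-ASCII characters are first encoded as UTF-8 bytes, then percent-encoded.
encoded_umlaut = quote("München")

# unquote reverses the encoding.
decoded = unquote(encoded_umlaut)

print(encoded_space, encoded_reserved, encoded_umlaut, decoded)
```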
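The conversion of a domain name can be sketched with Python's built-in `idna` codec, which applies Punycode to each label that contains non-ASCII characters. The domain below is a made-up example, and the built-in codec follows the older IDNA 2003 rules, so results for some edge cases may differ from what current browsers produce.

```python
# Hypothetical internationalized domain name.
unicode_domain = "bücher.example"

# Labels with non-ASCII characters get the "xn--" prefix plus a Punycode string.
ascii_domain = unicode_domain.encode("idna").decode("ascii")

# Decoding with the same codec restores the readable form.
restored = ascii_domain.encode("ascii").decode("idna")

print(ascii_domain)
print(restored)
```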
Absolute and relative URLs
URLs can be absolute or relative. Absolute URLs are universally valid and include all segments required for the given scheme. Relative URLs, on the other hand, are only valid in a specific context and inherit certain properties from it, so that the corresponding URL segments become redundant and can be omitted. The information that the context provides can include the protocol, the domain or even the path to the resource.
Relative URLs are used, for example, in hyperlinks that lead to other subpages of the same website. In this case, the context of the link URL is the web page that contains the link.
The following examples show a link from www.example.org/index/page1 to www.example.org/index/page2, first with an absolute and then with a relative URL.
Hyperlink with an absolute URL:
- <a href="http://www.example.org/index/page2">Link text</a>
Hyperlink with a relative URL:
- <a href="/index/page2">Link text</a>
Relative URLs have the advantage that they are significantly shorter and contribute to streamlined, clear source code. In addition, hyperlinks with relative URLs make it easier to move a website to a new domain. If the domain changes, every internal link with an absolute URL must be updated manually or handled with redirects. This effort is unnecessary for relative URLs, which have no “authority” and therefore contain no domain information.
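How a relative URL inherits the missing segments from its context can be reproduced with `urljoin` from `urllib.parse`, which resolves a reference against a base URL in the way a browser does. This sketch uses the example.org pages from the text, while the second domain, www.example.net, is a hypothetical new home for the site.

```python
from urllib.parse import urljoin

# The page that contains the link provides the context.
base = "http://www.example.org/index/page1"

# A path starting with "/" keeps scheme and domain and replaces the path.
from_root = urljoin(base, "/index/page2")

# A bare name is resolved relative to the directory of the current page.
from_sibling = urljoin(base, "page2")

# After a move to a new domain, the same relative link still resolves correctly.
after_move = urljoin("http://www.example.net/index/page1", "/index/page2")

print(from_root, from_sibling, after_move)
```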