Communication Networks/HTTP Protocol

An HTTP request is, roughly, a string of the following form:

<METHOD> <URL> <VERSION> \r\n
<Header1>: <HeaderValue1>\r\n
<Header2>: <HeaderValue2>\r\n
...
<HeaderN>: <HeaderValueN>\r\n
\r\n
<Body Data....>

Common values of <METHOD> are GET, POST, PUT and DELETE. <VERSION> is, for example, HTTP/1.1.
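
The layout above can be sketched in a few lines of Python. This is only an illustration: the helper names build_request and parse_request, the path /index.html and the host example.org are made up for the example, and real parsers handle many more cases (repeated headers, case-insensitive names, malformed input).

```python
def build_request(method, target, headers, body=b"", version="HTTP/1.1"):
    """Serialize a request: request line, header lines, blank line, body."""
    lines = [f"{method} {target} {version}"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    head = "\r\n".join(lines) + "\r\n\r\n"  # the empty line ends the headers
    return head.encode("ascii") + body


def parse_request(raw):
    """Split raw request bytes into (method, target, version, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    request_line, *header_lines = head.decode("ascii").split("\r\n")
    method, target, version = request_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return method, target, version, headers, body


# Hypothetical request for illustration only.
raw = build_request("GET", "/index.html", {"Host": "example.org"})
print(raw)
print(parse_request(raw))
```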

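The claim that a URL can only contain ASCII comes from RFC 1738, which specifies a subset of ASCII, [a-zA-Z0-9$-_.+!*'(),], whose characters can be used in a URL without being encoded. Everything else must be percent-encoded (Percent-encoding, also called URL encoding); a space, for example, is an ASCII character but cannot be used directly in a URL. Chinese characters may be visible in a browser's address bar, but when the request is sent the browser first turns them into a real URL by applying a character encoding and then percent-encoding the resulting bytes, and only then sends the request to the server. The Chinese text in the address bar is there purely to give the user a better experience.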
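
The two-step conversion (character encoding, then percent-encoding) can be tried with Python's standard urllib.parse module. The sample text is made up for this sketch, and UTF-8 is assumed as the character encoding. Python's quote follows the newer RFC 3986, so the set of characters it leaves unencoded differs slightly from the RFC 1738 list.

```python
from urllib.parse import quote, unquote

text = "中文 path"        # sample text, chosen only for illustration
encoded = quote(text)     # encode to UTF-8 bytes, then write each byte as %XX
print(encoded)            # %E4%B8%AD%E6%96%87%20path
print(unquote(encoded))   # back to the original text
```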

The encoding of the HTTP body can be specified fairly explicitly with Content-Type. For example:

POST xxxxxx HTTP/1.1
...
Content-Type: application/x-www-form-urlencoded ; charset=UTF-8

Here Content-Type defines both the format of the request body (application/x-www-form-urlencoded) and its character encoding (UTF-8).

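A POST request issued directly by the browser is a form submission, and a form submission normally uses one of two formats: application/x-www-form-urlencoded, for simple key-value data, and multipart/form-data, for forms that contain only file uploads or a mix of files and key-value fields. For a POST request sent by Ajax or another HTTP client, the body format is much freer: JSON, XML, plain text and CSV are common, and you can even invent your own format, as long as the front end and the back end agree on it.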
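
As a rough illustration, the same key-value data can be serialized either as a form body or as a JSON body, with Content-Type telling the server which parser to use. The field names and values here are invented for the example.

```python
import json
from urllib.parse import urlencode

fields = {"name": "Alice", "city": "São Paulo"}  # hypothetical form data

# Form submission: key=value pairs, percent-encoded from UTF-8 bytes.
form_type = "application/x-www-form-urlencoded; charset=UTF-8"
form_body = urlencode(fields).encode("ascii")

# Ajax-style submission: any agreed format, here JSON.
json_type = "application/json"
json_body = json.dumps(fields).encode("utf-8")

print(form_type, form_body)
print(json_type, json_body)
```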

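By convention, all "control" information in HTTP goes in the request headers and the actual data goes in the request body. The server therefore always parses all of the request headers first. Once it knows the control information, it can decide how to handle the request: reject it, call the appropriate parser for the data according to the Content-Type, or forward it directly using zero copy. As a further optimization, the client can use HTTP's 100 Continue mechanism (the Expect: 100-continue request header): the client first sends only the request headers and lets the server check them. If they pass, the server replies "100 Continue" and the client then sends the rest of the data. If the request is rejected, the server replies with an error such as 400 and the exchange ends. This avoids wasting bandwidth on a request body that would be rejected, at the cost of one extra round trip. If the request body is small, sending everything in one go may well be better.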
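
The server-side decision described above might look like the following sketch. The function name after_headers and the SUPPORTED set are assumptions made for this example, and 415 is just one possible rejection status; the text above mentions 400 as another.

```python
# Hypothetical set of body formats this server knows how to parse.
SUPPORTED = {"application/json", "application/x-www-form-urlencoded"}


def after_headers(headers):
    """Decide what to send once the headers, but not the body, have arrived.

    Returns a status line, or None if the server simply waits for the body.
    """
    media_type = headers.get("Content-Type", "").split(";")[0].strip().lower()
    if media_type not in SUPPORTED:
        # Reject early: the client never has to upload the body.
        return "HTTP/1.1 415 Unsupported Media Type"
    if headers.get("Expect", "").lower() == "100-continue":
        # Headers look fine: invite the client to send the body.
        return "HTTP/1.1 100 Continue"
    return None


print(after_headers({"Content-Type": "application/json", "Expect": "100-continue"}))
print(after_headers({"Content-Type": "text/csv", "Expect": "100-continue"}))
```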

Hypertext Transfer Protocol (HTTP)

The Hypertext Transfer Protocol (HTTP) is an application layer protocol that is used to transmit virtually all files and other data on the World Wide Web, whether they're HTML files, image files, query results, or anything else. Usually, HTTP takes place through TCP/IP sockets.

A browser is an HTTP client because it sends requests to an HTTP server (Web server), which then sends responses back to the client. The standard (and default) port for HTTP servers to listen on is 80, though they can use any port.

HTTP is based on the TCP/IP protocols, and is used commonly on the Internet for transmitting web-pages from servers to browsers.

Network Application:

Client Server Paradigm

The client and the server are end systems, also known as hosts. The client initiates contact with the server to request a service: for the Web, the client is implemented in the web browser, and for e-mail it is the mail reader. Similarly, the server provides the requested service to the client: a web server returns the requested web page, and a mail server delivers e-mail.

Peer to Peer Paradigm

In a peer-to-peer network, a peer can join or leave the network at any time, and a peer can act as a client or as a server. Scalability is the main advantage of a peer-to-peer network. Besides the pure client-server and peer-to-peer paradigms, real-world applications also use hybrids of the two.

Network applications rely on application-layer protocols, many of which we use in our day-to-day life: HTTP for the Web, SMTP for e-mail, and VoIP protocols when we talk to another person by telephone over the Internet. An application-layer protocol defines the types of messages exchanged and the syntax they use. It also specifies the actions taken in response to a message.

Identifying Applications

When processes communicate, two things are needed to identify a process: 1. IP Address: the address of the host running the process. An IPv4 address is a 32-bit value that uniquely identifies the host, and it is used to reach the host over the network. 2. Port Number: identifies the process on that host. The combination of an IP address and a port number is called a socket. Hence, Socket = (IP address, Port number)

So whenever the client application communicates with the web server, the connection is identified by four components, also called the TCP connection tuple. This tuple consists of: 1. Client IP address 2. Client Port number 3. Server IP address 4. Server Port number
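
The four values can be observed on a live connection. The sketch below is only an illustration and assumes the loopback interface is available: it opens a throwaway server on 127.0.0.1 with a port chosen by the operating system, connects to it, and reads both socket addresses from the client's side.

```python
import socket

# Throwaway server on the loopback interface; port 0 lets the OS pick a free port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
server_addr = server.getsockname()

client = socket.create_connection(server_addr)
conn, peer = server.accept()  # peer is the client's (IP address, port)

client_ip, client_port = client.getsockname()  # local end of the connection
server_ip, server_port = client.getpeername()  # remote end of the connection
print((client_ip, client_port, server_ip, server_port))

conn.close()
client.close()
server.close()
```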

HTTP uses TCP to establish a reliable connection between the client (the web browser) and the server (wikibooks.org). In HTTP/1.x all commands are in plain text, and by default HTTP requests are sent to TCP port 80, although any port can be used. The TCP connection has to be made to an IP address, not to a DNS name. So if we want to load www.wikibooks.org, we first need to resolve the wikibooks.org IP address from a DNS server, and then connect to that address and send the request. Let's say (and this is impossible, because the address is reserved for private networks) that the IP address for wikibooks.org is 192.168.1.1. Then, to load this very page, we would open a TCP connection to 192.168.1.1 on port 80 and send the following text:

GET /wiki/Communication_Systems/HTTP_Protocol HTTP/1.1
Host: www.wikibooks.org

The first part of the request line, the word "GET", is our HTTP command. The middle part is the path from the URL (Uniform Resource Locator) of the page we want to load, and the last part ("HTTP/1.1") tells the server which version of HTTP the request is going to use. The Host header, which HTTP/1.1 requires, names the site being requested.

When the server gets the request, it will reply with a status code, that is defined in the HTTP standard. For instance:

HTTP/1.1 200 OK

or the infamous

HTTP/1.1 404 Not Found

The first part of the reply is the version of HTTP being used, the second part is the status code, and the last part is the message in plain, human-readable text.

Web

The Web consists of numerous web pages, each made up of objects that are addressed by a URL. Most web pages consist of a base HTML page with a few referenced objects. A URL (Uniform Resource Locator) consists of a host name and a path name.

For example, in www.sjsu.edu/student/tower.gif the host name is www.sjsu.edu and the path name is /student/tower.gif.
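
The split into host name and path name can be checked with Python's standard urllib.parse module. The http:// scheme is added here as an assumption, because urlsplit needs a scheme to recognize the host part.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.sjsu.edu/student/tower.gif")
print(parts.hostname)  # www.sjsu.edu
print(parts.path)      # /student/tower.gif
```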

The web user sends requests to the web server through a user agent such as Internet Explorer or Firefox, which handles all HTTP requests to the web server. Likewise, on the server side, server software such as Apache or Microsoft Internet Information Server sends the information back to the web user.

HTTP is the Web's application-layer protocol and follows the client-server model. The client, through its user agent, requests HTML pages and objects from the server, and the server responds with the requested objects, which the client then displays.

How does this work?

HTTP uses the TCP transport service, accessed through sockets, to transfer data. The client initiates a TCP connection to the server on port 80, and the server accepts the connection. The requests for HTML pages and objects, and the corresponding responses, are then exchanged between the client browser and the web server. After the request has been completed, the TCP connection is closed.

HTTP is a stateless protocol: the server does not keep information about previous client requests. This keeps the protocol simple. A protocol that maintains past client state is complex, because the server has to keep a record for every client, and if the server crashes that state is very difficult to recover.

HTTP Connections

A web page consists of objects, each addressed by a URL. As there can be one or many objects, the type of HTTP connection determines how the objects are requested.

HTTP has evolved to improve its performance, and there are two types of connections: • Non-Persistent (HTTP/1.0) • Persistent (HTTP/1.1)

The major difference between non-persistent and persistent connection is the number of TCP connection(s) required for transmitting the objects.

Non-Persistent HTTP – Each object is delivered over its own, individually established TCP connection. Setting up each TCP connection usually costs one round-trip time of delay. Suppose the user requests a page that contains text as well as 5 images. Six TCP connections are then needed: one for the page text and one for each of the 5 images.

Persistent HTTP - This type of connection is also called HTTP keep-alive or HTTP connection reuse. The idea is to send and receive multiple HTTP requests and responses over the same TCP connection. Using persistent connections is important for performance.

Persistent HTTP without Pipelining – The client has to wait until the previously requested object has been received before issuing a request for another object. Thus, not counting the initial TCP establishment (one RTT), each object needs at least one RTT plus the transmission time of the object by the server.

Persistent HTTP with Pipelining – The client can send out multiple requests without waiting for the responses, so the server receives the requests back to back and sends the responses (objects) one after another. HTTP/1.1 uses persistent connections by default and allows pipelining. The shortest time with pipelining is one initial RTT, one RTT for the requests and responses, and the transmission time of all the objects by the server.

Thus, the number of RTTs required in each of the above cases, for a page with some text and three objects, would be:

1. Non-persistent HTTP:

2. Persistent HTTP:

a) Without Pipelining:

b) With Pipelining:
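
The counts for these cases can be worked out from a simple model. The sketch below is an illustration under stated assumptions, not the original figures: the base page (the text) has to be received before the objects it references can be requested, transmission times are ignored, only one connection is open at a time, and the function names are made up for this example.

```python
def rtts_non_persistent(objects):
    # Base page and every object: 1 RTT for TCP setup + 1 RTT for request/response.
    return 2 * (1 + objects)


def rtts_persistent(objects):
    # 1 RTT for TCP setup, then 1 RTT each for the base page and every object.
    return 1 + (1 + objects)


def rtts_pipelined(objects):
    # 1 RTT for TCP setup, 1 RTT for the base page,
    # then 1 RTT for all referenced objects requested back to back.
    return 1 + 1 + (1 if objects else 0)


n = 3  # some text plus three objects
print(rtts_non_persistent(n))  # 8
print(rtts_persistent(n))      # 5
print(rtts_pipelined(n))       # 3
```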

Response Time (Modeling)

Round Trip Time (RTT): The time taken to send a packet to a remote host and receive a response. It is used to measure the delay on the network at a given time.

Response time:

The response time is the time required to set up the TCP connection (one RTT), plus the time for the HTTP request to be sent and the first bytes of the response to come back (one RTT), plus the file transmission time.

The following example illustrates the response time:

From the figure we can state that the response time is:

2 RTT + File transmit time.
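
As a worked example of this formula, the sketch below takes the file transmit time to be the file size divided by the link rate. The function name and all numbers are assumptions chosen for illustration.

```python
def response_time(rtt_s, file_bits, rate_bps):
    """2 RTT (TCP setup + request/response) plus the file transmit time."""
    return 2 * rtt_s + file_bits / rate_bps


# Hypothetical values: 250 ms RTT, 1 Mbit file, 4 Mbit/s link.
print(response_time(0.25, 1_000_000, 4_000_000))  # 0.75 seconds
```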

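HTTP Message Format:

HTTP uses two kinds of messages – 1. Request messages 2. Response messages

1. Request message:

The request line has three parts: the method name, the local path of the requested resource, and the HTTP version, separated by spaces. It is normally ASCII-encoded so that it is human-readable.

For example: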

GET /path/to/the/file.html HTTP/1.0

The general format of an HTTP request message:

The method is the type of request made for the URL, such as GET, POST or HEAD. The URL field contains the requested URL. The version field denotes the HTTP version, either HTTP/1.0 or HTTP/1.1. The header lines carry information such as the host, the browser type, and the language of the requested page. For example:

The entity body is used by the POST method. When the user enters information on the page, the entity body contains that information.

HTTP 1.0 has the GET, POST and HEAD methods, while HTTP 1.1 adds methods such as PUT and DELETE to GET, POST and HEAD.

Uploading the information in Web pages

The POST method

Web pages that ask for input from the user use the POST method. The information filled in by the web user is uploaded to the server in the entity body of the request.

A typical form submission uses the POST method. The Content-Type is usually application/x-www-form-urlencoded, and the Content-Length is the length of the URL-encoded form data.

The URL method

The URL method uses the GET method to send the user's input. It appends the information to be uploaded to the server to the URL field.
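
The two upload methods can be compared side by side. In this sketch the form fields, the /search path and the host example.org are all invented for illustration. The same encoded data ends up either in the URL (GET) or in the entity body (POST), where Content-Length gives its length in bytes.

```python
from urllib.parse import urlencode

form = {"q": "http protocol", "page": "2"}  # hypothetical user input
encoded = urlencode(form)                   # q=http+protocol&page=2

# URL method: the data is appended to the URL of a GET request.
get_request = f"GET /search?{encoded} HTTP/1.1\r\nHost: example.org\r\n\r\n"

# POST method: the same data travels in the entity body.
body = encoded.encode("ascii")
post_request = (
    "POST /search HTTP/1.1\r\n"
    "Host: example.org\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    f"Content-Length: {len(body)}\r\n"
    "\r\n"
).encode("ascii") + body

print(get_request)
print(post_request.decode("ascii"))
```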

2. Response message:

The first line of an HTTP response message also has three parts separated by spaces: the HTTP version, a response status code giving the result of the request, and an English phrase describing the status code. This first line is also called the status line.

The HTTP Response message format is shown below:

E.g.:

Below are some HTTP response status codes:

200 OK The request succeeded, and the resulting resource (e.g. file or script output) is returned in the message body.

404 Not Found The requested resource doesn't exist.

301 Moved Permanently

302 Moved Temporarily

303 See Other (HTTP 1.1 only)

The resource has moved to another URL (given by the Location: response header), and should be automatically retrieved by the client. This is often used by a CGI script to redirect the browser to an existing file.

500 Server Error

An unexpected server error. The most common cause is a server-side script that has bad syntax, fails, or otherwise can't run correctly.
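
The first digit of a status code gives its class, which can be used to group the codes listed above. The sketch below uses Python's standard http.HTTPStatus enumeration; the CLASSES table and the describe helper are made up for this example. Python's reason phrases follow the current HTTP specification, so a few differ from the list above (302 is "Found" and 500 is "Internal Server Error").

```python
from http import HTTPStatus

# Class of a status code, keyed by its first digit.
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code):
    status = HTTPStatus(code)
    return f"{code} {status.phrase} ({CLASSES[code // 100]})"


for code in (200, 301, 303, 404, 500):
    print(describe(code))
```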

User-Server Identification

HTTP is a stateless protocol, so there has to be a mechanism for the web server to identify the user. Various techniques are used:

1. Authentication 2. Cookies

1. Authentication:

Whenever the client requests a web page from the web server, the server authenticates the user. So each time the web user or client requests an object, it has to provide a name and password to be identified by the server. Authentication is needed so that the server has control over access to its documents. And since HTTP is stateless, the client has to provide this information each time it requests a web page. The authorization is carried in a header line of the request. Generally, the browser caches the name and password of the web user, so that the user does not have to enter the same information each time.
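
The text does not name a particular scheme. Assuming the common Basic scheme, the header line carrying the name and password could be built as in this sketch; the helper name is made up, and the credentials are the well-known sample pair from the HTTP authentication RFCs. Base64 is an encoding, not encryption, so Basic authentication only protects the password when used over HTTPS.

```python
import base64


def basic_auth_header(user, password):
    """Build the Authorization header line for the Basic scheme."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Authorization: Basic {token}"


print(basic_auth_header("Aladdin", "open sesame"))
# Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```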

2. Cookies

Cookies are used by web servers to identify the web user. A cookie is a small piece of data stored on the web user's disk, and cookies are used by all major websites. As said, a cookie is data, not code. It is stored on the web user's machine when the browser visits the server's site.

So how do cookies function exactly?

When the web user's browser requests a file from a web server, the server sends the file along with a cookie. The next time the browser requests a file from the same server, it sends the cookie back to the web server, so that the server can tell that this browser has previously requested a file. In this way the web server coordinates your access to different pages of its website.

A typical example is online shopping, where a cookie is used to keep track of your shopping basket.

The four major components of cookies are:

1. Cookie header line in the HTTP response message. 2. Cookie header line in the HTTP request message. 3. Cookie file stored on the user's host and managed by the user's browser. 4. Back-end database at the web site.

So we can say that cookies are used to keep the state of the web browser. Since HTTP is stateless, there has to be some means for the server to remember the state of the client's requests.

Cookies come in two flavors: persistent and non-persistent. Persistent cookies remain on the web user's machine for the lifetime specified when they were created, while non-persistent cookies are deleted as soon as the web user's browser is closed.
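
The two cookie header lines can be sketched with Python's standard http.cookies module. The cookie names session_id and basket and their values are hypothetical. A cookie with no lifetime attribute is non-persistent, and one with Max-Age is persistent.

```python
from http.cookies import SimpleCookie

# Server side: cookies to set in the HTTP response.
jar = SimpleCookie()
jar["session_id"] = "abc123"      # no lifetime given: non-persistent
jar["basket"] = "42"
jar["basket"]["max-age"] = 3600   # kept for one hour: persistent
for morsel in jar.values():
    print("Set-Cookie:", morsel.OutputString())

# Browser side: the Cookie header line sent with the next request.
cookie_header = "; ".join(f"{m.key}={m.value}" for m in jar.values())
print("Cookie:", cookie_header)

# Server side again: parse the returned header to recognize the browser.
received = SimpleCookie()
received.load(cookie_header)
print(received["session_id"].value)
```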

Cookies bring a number of useful applications in today’s Internet world. With the help of cookie, you can have: • User accounts • Online Shopping • Web Portals • Advertising

But cookies can also be used to secretly track a web user's habits. Whenever a web browser sends a request to a web server, the request carries its IP address, the type of browser you use and your operating system, and this information is also logged in the server's log file.

Advertising is the main issue with cookies: they are held in low regard because they are used to track individuals' browsing and buying habits. As the server's log file has all this information, it becomes easier to track you. An advertising firm has many clients, including several other advertising firms, and so has contracts with many other agencies. These clients place an image file on their web sites. When you click on it, you are not clicking on an image but on a link to the advertising firm's site, which sends you a cookie when you request that page. Your IP address is thus tracked, and whenever you request a page from one of their sites they can track the number of your visits, which pages you have visited and how often. In this way they learn about your interests. This information is valuable to them for tracking your preferences and targeting you based on them.

Web Caching (Proxy Server)

A proxy server's main goal is to satisfy the client's request without involving the original web server. It acts as a buffer between the client's web browser and the web server. It accepts requests from the user and responds to them itself if it holds the requested page. If it does not have the requested page, it requests it from the original web server and then responds to the client. It uses a cache to store the pages, so if the requested web page is in the cache, it fulfills the request quickly.

Working of Proxy Server

The two main purposes of a proxy server are:

1. Improve Performance –

The proxy saves results for a particular period of time. If the same result is requested again and it is still present in the proxy server's cache, the request can be fulfilled in less time, which can improve performance considerably. The major online search services use arrays of proxy servers to serve their large number of web users.

2. Filter Requests -

Proxy servers can also be used to filter requests. Suppose a company wants to prevent its users from accessing a specific set of sites; this can be done with the help of a proxy server.
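
Both purposes can be shown in one small sketch. It is a toy model, not a real proxy: the class name CachingProxy, the fixed lifetime (ttl_seconds), the 403 reply for blocked hosts and the stand-in origin function are all assumptions made for illustration, and no network traffic is involved.

```python
import time


class CachingProxy:
    def __init__(self, fetch_from_origin, ttl_seconds=60, blocked_hosts=()):
        self.fetch = fetch_from_origin      # called only on a cache miss
        self.ttl = ttl_seconds              # how long a saved result stays valid
        self.blocked = set(blocked_hosts)   # sites users may not access
        self.cache = {}                     # url -> (expiry time, body)

    def get(self, host, path):
        if host in self.blocked:            # purpose 2: filter requests
            return 403, b"blocked by proxy"
        url = host + path
        entry = self.cache.get(url)
        if entry and entry[0] > time.monotonic():
            return 200, entry[1]            # purpose 1: answer from the cache
        body = self.fetch(url)              # miss: ask the original web server
        self.cache[url] = (time.monotonic() + self.ttl, body)
        return 200, body


calls = []


def origin(url):
    """Stand-in for the original web server; records each request it receives."""
    calls.append(url)
    return b"page for " + url.encode("ascii")


proxy = CachingProxy(origin, blocked_hosts={"blocked.example"})
print(proxy.get("example.org", "/a"))      # fetched from the origin
print(proxy.get("example.org", "/a"))      # served from the cache
print(proxy.get("blocked.example", "/"))   # filtered
print(len(calls))                          # the origin was contacted once
```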

Conditional GET

A conditional GET is the same as the GET method, except that it includes an If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match or If-Range header field. The request is satisfied only under the given condition. This method is used to reduce network usage: cached entities can be used to fulfill requests if they have not been modified, which avoids transferring data unnecessarily.

Working of Conditional GET:

Whenever the client requests an HTML page from the server, the proxy server looks for the requested page in its cache and checks the Last-Modified date recorded in the header.

If the requested page has been modified since the cache entry was stored, the proxy server requests the updated page from the main server. The main server responds to that request and sends the updated page to the proxy server, which forwards it to the client. In the meantime, the proxy server stores the modified page in its cache.
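
The main server's side of this exchange can be sketched as a function that compares the page's modification time with the If-Modified-Since value and answers 304 Not Modified when the cached copy is still current. The function name, the timestamps and the placeholder body are assumptions for illustration; only If-Modified-Since is handled here.

```python
from email.utils import formatdate, parsedate_to_datetime


def origin_response(resource_mtime, if_modified_since=None):
    """resource_mtime is a POSIX timestamp; returns (status, headers, body)."""
    if if_modified_since is not None:
        since = parsedate_to_datetime(if_modified_since).timestamp()
        if int(resource_mtime) <= since:
            return 304, {}, b""  # cached copy is still valid: no body is sent
    last_modified = formatdate(resource_mtime, usegmt=True)  # HTTP date format
    return 200, {"Last-Modified": last_modified}, b"<html>...</html>"


mtime = 1_000_000_000  # hypothetical modification time of the page

status, headers, body = origin_response(mtime)  # first request: full page
print(status, headers)

# Revalidation by the proxy, page unchanged: 304 and no data transferred.
print(origin_response(mtime, headers["Last-Modified"])[0])

# Revalidation after the page changed: the updated page is sent again.
print(origin_response(mtime + 10, headers["Last-Modified"])[0])
```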

HTTPS

HTTPS is a secure version of HTTP. It indicates that port 443 should be used instead of port 80. It is widely used in security-sensitive areas such as online payment transactions. The protocol identifier HTTPS tells the browser to connect to the server over a secure connection.

HTTPS follows a procedure to set up a secure connection, and this is done automatically. The steps are: 1. The client authenticates the server using the server's digital certificate. 2. The client and server negotiate the cipher suite (a set of security algorithms) they will use for the connection. 3. The client and server generate session keys for encrypting and decrypting data. 4. The client and server establish a secure encrypted connection.

The HTTPS session ends if the client and server cannot agree on a cipher suite. The cipher suite can be based on any of the following: 1. Digest Based –

• Message Digest 5 (MD5) • Secure Hash Algorithm 1 (SHA-1)

2. Public Key-Based –

• Rivest-Shamir-Adleman (RSA) encryption/decryption. • Digital Signature Algorithm (DSA) • Diffie-Hellman Key-exchange/Key-generation.

3. X.509 digital certificates.
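
The client's side of this setup can be inspected with Python's standard ssl module without opening a connection. This is only a sketch: the default context requires a valid server certificate (step 1) and lists the cipher suites the client is willing to negotiate (step 2). The suites offered by a current library may differ from the algorithms listed above.

```python
import ssl

context = ssl.create_default_context()

# Step 1: the server's certificate must verify and match the host name.
print(context.verify_mode == ssl.CERT_REQUIRED)
print(context.check_hostname)

# Step 2: cipher suites this client would offer during negotiation.
for suite in context.get_ciphers()[:3]:
    print(suite["name"], suite["protocol"])
```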

Summary

This chapter has covered HTTP, the protocol on which the Web is built, and how it uses TCP connections to transfer data. It gave detailed information on request and response messages and on user identification methods such as authentication and cookies, with cookies explained in detail. It also described the types of HTTP connections, non-persistent and persistent, the proxy server and how it works, and lastly HTTPS, the secure version of HTTP.