Overview of the Apache EBCDIC Port
As of Version 1.3, the Apache HTTP Server includes a port to
(non-ASCII) mainframe machines which use the EBCDIC character
set as their native codeset.
(Initially, that support covered only the Fujitsu-Siemens
family of mainframes running the
BS2000/OSD operating system, a mainframe OS which features
an SVR4-derived POSIX subsystem. Later, the two IBM mainframe
operating systems TPF and OS/390 were added.)
The EBCDIC-related directives EBCDICConvert, EBCDICConvertByType,
and EBCDICKludge are
available only if the platform's character set is EBCDIC
(currently this is the case only on Fujitsu-Siemens' BS2000/OSD
and IBM's OS/390 and TPF operating systems). EBCDIC stands for
Extended Binary-Coded-Decimal Interchange Code and is
the codeset used on mainframe machines, in contrast to ASCII,
which is ubiquitous on almost all microcomputers today. ASCII
(or its extension Latin-1) is the basis for the HTTP
transfer protocol; therefore, all EBCDIC-based platforms need a
way to configure the code set conversion rules required between
the EBCDIC-based mainframe host and the HTTP socket
protocol.
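To make the difference between the two code sets concrete, the following minimal Python sketch converts the same request keyword between ASCII and an EBCDIC code page. It uses Python's built-in cp037 codec purely as a stand-in; that choice is an assumption, since the actual EBCDIC variant depends on the platform, and the helper names are hypothetical.

```python
# Stand-in EBCDIC code page (assumption): cp037 ships with Python.
EBCDIC_CODEC = "cp037"

def host_to_wire(data: bytes) -> bytes:
    """Convert EBCDIC host bytes to the Latin-1/ASCII bytes used on the wire."""
    return data.decode(EBCDIC_CODEC).encode("latin-1")

def wire_to_host(data: bytes) -> bytes:
    """Convert Latin-1/ASCII wire bytes to EBCDIC host bytes."""
    return data.decode("latin-1").encode(EBCDIC_CODEC)

ascii_bytes = "GET".encode("ascii")        # hex 47 45 54
ebcdic_bytes = wire_to_host(ascii_bytes)   # hex c7 c5 e3 in cp037
```

The same three letters occupy entirely different byte values in the two code sets, which is why protocol text cannot simply be passed through unchanged.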
On an EBCDIC-based system, HTML files and other text files
are usually saved encoded in the native EBCDIC code set, while
image files and other binary data are stored in the same
encoding as on ASCII-based machines. When the Apache server
accesses documents, it must therefore make a distinction
between text files (to be converted to/from ASCII, depending on
the transfer direction) and binary files (to be delivered
unconverted). Such a distinction can be made based on the
assigned MIME type, or based on the file extension
(i.e., files sharing a common file suffix).
By default, the configuration is symmetric for input and
output (i.e., when a PUT request is executed for a
document which was returned by a previous GET request, then the
resulting uploaded copy should be identical to the original
file). However, the conversion directives allow for specifying
different conversions for input and output.
The directives EBCDICConvert and EBCDICConvertByType
are used to assign the conversion setting (On or Off) based on
file extensions or MIME types. Each configuration setting can
be defined for input only (e.g., PUT method), output
only (e.g., GET method), or both input and output. By
default, the conversion setting is applied for input and
output.
Note that after modifying the conversion settings for a
group of files, it is not sufficient to restart the server. The
reason for this is the fact that a cached copy of a document
(in a browser or proxy cache) will not get revalidated by
contents, but only by date. Since the modification time of the
document did not change, browsers will assume they can reuse
the cached copy.
To recover from this situation, you must either clear all
cached copies (browser and proxy cache!), or update the
modification time of the documents (using the touch
command on the server).
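As a rough illustration of the second option, the sketch below updates the modification time of all files with given suffixes under a directory, which has the same effect as running touch on each of them. The function name and its parameters are hypothetical and not part of Apache.

```python
import os
from pathlib import Path

def touch_documents(root, suffixes):
    """Set the modification time of matching files under root to now."""
    touched = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            os.utime(path, None)  # None means "use the current time"
            touched.append(path)
    return touched
```

For example, touch_documents("/path/to/htdocs", {".html", ".txt"}) would refresh only the text documents whose conversion setting was changed (the path is a placeholder).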
Note also that server-parsed documents (CGI scripts, .shtml
files, and other interpreted files like PHP scripts etc.) are
not subject to any input conversion and must therefore be
stored in EBCDIC form on the server side.
In the absence of any EBCDICConvertByType
directive, and if no matching EBCDICConvert was found,
Apache falls back to an internal heuristic which assumes that
all documents with MIME types starting with
"text/", "message/" or
"multipart/" as well as the MIME type
"application/x-www-form-urlencoded" are text
documents stored in EBCDIC, whereas all other documents are
binary files.
In order to provide backward compatibility with older
versions of Apache, the EBCDICKludge directive
allows for a less powerful mechanism to control the conversion
of documents to and from EBCDIC.
Note:
The EBCDICKludge directive is deprecated, since its
functionality is superseded by the more powerful EBCDICConvert and EBCDICConvertByType
directives.
The directives are applied in the following order:
- First, the configured EBCDICConvert
directives in the current context are evaluated in
configuration file order. As soon as a matching file
extension is found, the search stops and the configured
conversion is applied.
EBCDICConvert settings inherited from parent directories are
tested after the more specific (deeper) directory
levels.
- If the EBCDICKludge directive is in
effect, the next step tests for a MIME type of the format
type/x-ascii-subtype. If
the document has such a type, then the
"x-ascii-" substring is removed and the
conversion set to Off.
- In the next step, the configured EBCDICConvertByType
directives are evaluated in configuration file order. If the
document has a matching MIME type, the search stops and the
configured conversion is applied.
EBCDICConvertByType settings inherited from parent
directories are tested after the more specific (deeper)
directory levels.
If no EBCDICConvertByType
directive at all exists in the current context, the server
falls back to the simple heuristics which assume that MIME
types starting with "text/", "message/" or "multipart/" (plus
the special type "application/x-www-form-urlencoded" used in
simple POST requests) imply a conversion, while all the rest
is delivered unconverted (i.e., binary).
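The evaluation order described above can be sketched in a few lines of Python. This is a simplified model, not the server's implementation: the function and parameter names are hypothetical, the In/Out distinction is ignored, directory inheritance is modeled only by the order of the lists, and wildcard matching is delegated to fnmatch as an assumption.

```python
from fnmatch import fnmatchcase

TEXT_PREFIXES = ("text/", "message/", "multipart/")

def heuristic(mime_type):
    """Built-in fallback: text-like types are converted, the rest is binary."""
    return (mime_type.startswith(TEXT_PREFIXES)
            or mime_type == "application/x-www-form-urlencoded")

def resolve(filename, mime_type, by_ext=(), by_type=(), kludge=False):
    """Return (mime_type, convert) following the documented order.

    by_ext:  [(".html", True), ...]   models EBCDICConvert entries
    by_type: [("text/*", True), ...]  models EBCDICConvertByType entries
    Both lists are expected in search order (deeper directories first).
    """
    # Step 1: the first matching file extension wins.
    for ext, convert in by_ext:
        if filename.endswith(ext):
            return mime_type, convert
    # Step 2: legacy type/x-ascii-subtype marker (EBCDICKludge).
    if kludge:
        major, _, minor = mime_type.partition("/")
        if minor.startswith("x-ascii-"):
            return major + "/" + minor[len("x-ascii-"):], False
    # Step 3: the first matching MIME type pattern wins.
    for pattern, convert in by_type:
        if fnmatchcase(mime_type, pattern):
            return mime_type, convert
    # Step 4: heuristic. Applying it when by_type is non-empty but nothing
    # matched is an assumption of this sketch.
    return mime_type, heuristic(mime_type)
```

With no rules configured, resolve("logo.gif", "image/gif") yields no conversion, while resolve("index.html", "text/html") yields conversion, matching the fallback heuristic.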
Since all Apache input and output is based upon the BUFF
data type and its methods, the easiest solution was to add the
actual conversion to the BUFF handling routines. The conversion
must be settable at any time, so BUFF flags were added which
define whether a BUFF object has currently enabled conversion
or not. Two such flags exist: one for data read from the client
(ASCII to EBCDIC conversion) and one for data returned to the
client (EBCDIC to ASCII conversion).
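The idea of two independent conversion flags can be modeled with a small Python class. This is only a toy illustration of the concept, not the BUFF API: the class and method names are hypothetical, and cp037 is again used as a stand-in EBCDIC code page.

```python
EBCDIC_CODEC = "cp037"  # stand-in code page (assumption)

class ConvertingBuffer:
    """Toy model of a buffer with independent input and output flags."""

    def __init__(self):
        self.convert_in = True    # client -> server: ASCII to EBCDIC
        self.convert_out = True   # server -> client: EBCDIC to ASCII

    def read_from_client(self, wire_bytes):
        """Return the bytes as the server sees them."""
        if self.convert_in:
            return wire_bytes.decode("latin-1").encode(EBCDIC_CODEC)
        return wire_bytes

    def write_to_client(self, host_bytes):
        """Return the bytes as they go out on the socket."""
        if self.convert_out:
            return host_bytes.decode(EBCDIC_CODEC).encode("latin-1")
        return host_bytes
```

With both flags set, text uploaded by a client and later returned crosses the two conversions and arrives unchanged, which is the symmetric default behavior. With the output flag cleared, binary data such as a GIF image passes through byte for byte.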
During sending of the header, Apache determines (based on
the returned MIME type for the request) whether conversion
should be used or the document returned unconverted. It uses
this decision to initialize the BUFF flag when the response
output begins. Modules should therefore determine the MIME type
for the current request before initiating the response by
calling ap_send_http_headers().
The BUFF flag is modified at several points in the HTTP
protocol:
- set (In and Out) before a request is
received (because the request and the request header lines
are always in ASCII format)
- set/unset (for Input data) when the
request body is received - depending on the content type of
the request body (because the request body may contain ASCII
text or a binary file)
- set (for returned Output) before a
response header is sent (because the response header lines
are always in ASCII format)
- set/unset (for returned Output) when the
response body is sent - depending on the content type of the
response body (because the response body may contain text or
a binary file)
Additional transparent transitions may occur for
extracting/inserting the HTTP/1.1 chunking information
from/into the input/output body data stream, and for generating
multipart headers for range requests. (See
RFC 2616 and src/main/http_protocol.c for details.)
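The flag transitions listed above, including the chunking case, can be illustrated with a short Python sketch that assembles the bytes of a chunked response. It is a simplified model under stated assumptions: the function names are hypothetical, cp037 stands in for the host code page, and the chunk framing is written directly in ASCII instead of being converted from host strings.

```python
EBCDIC_CODEC = "cp037"  # stand-in code page (assumption)

def to_wire(host_bytes):
    """EBCDIC host bytes -> ASCII/Latin-1 wire bytes."""
    return host_bytes.decode(EBCDIC_CODEC).encode("latin-1")

def chunked_response(host_headers, body_chunks, convert_body):
    """Assemble wire bytes for a chunked response.

    host_headers: header block as EBCDIC bytes (always converted)
    body_chunks:  list of byte strings as stored on the host
    convert_body: True for EBCDIC text, False for binary data
    """
    out = [to_wire(host_headers)]       # headers are always sent as ASCII
    for chunk in body_chunks:
        data = to_wire(chunk) if convert_body else chunk
        # Chunk framing is protocol text, so it is always ASCII,
        # regardless of the body conversion setting.
        out.append(b"%x\r\n" % len(data))
        out.append(data + b"\r\n")
    out.append(b"0\r\n\r\n")            # terminating chunk
    return b"".join(out)
```

The point of the sketch is that the conversion setting applies only to the body data, while the header lines and the chunk size lines around it are always ASCII.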
- The relevant changes in the source are #ifdef'ed into two
categories:
#ifdef CHARSET_EBCDIC
- Code which is needed for any EBCDIC based machine.
This includes character translations, differences in
contiguity of the two character sets, flags which
indicate which part of the HTTP protocol has to be
converted and which part doesn't etc.
#ifdef _OSD_POSIX | TPF | OS390
- Code which is needed for the Fujitsu-Siemens
BS2000/OSD | IBM TPF | IBM OS/390 mainframe platforms
only. This deals with include file differences and socket
and fork implementation topics which are only required on
the respective platform.
- The possibility to translate between ASCII and EBCDIC at
the socket level (on BS2000 POSIX, there is a socket option
which supports this) was intentionally not chosen,
because the byte stream at the HTTP protocol level consists
of a mixture of protocol related strings and non-protocol
related raw file data. HTTP protocol strings are always
encoded in ASCII (the GET request, any Header: lines, the
chunking information etc.) whereas the file transfer
parts (i.e., GIF images, CGI output etc.)
should usually be just "passed through" by the server. This
separation between "protocol string" and "raw data" is
reflected in the server code by functions like bgets() or
rvputs() for strings, and functions like bwrite() for binary
data. A global translation of everything would therefore be
inadequate.
(In the case of text files of course, provisions must be
made so that EBCDIC documents are always served in
ASCII.)
This port therefore features a built-in protocol level
conversion for the server-internal strings (which the
compiler translated to EBCDIC strings) and thus for all
server-generated documents.
- By examining the call hierarchy for the BUFF management
routines, I added an "ebcdic/ascii conversion layer" which
would be crossed on every puts/write/get/gets, and conversion
flags which allowed enabling/disabling the conversions
on-the-fly. Usually, a document crosses this layer twice from
its origin source (a file or CGI output) to its destination
(the requesting client): file -> Apache, and
Apache -> client.
The server can now read the header lines of a CGI-script
output in EBCDIC format, and then find out that the remainder
of the script's output is in ASCII (like in the case of the
output of a WWW Counter program: the document body contains a
GIF image). All header processing is done in the native
EBCDIC format; the server then determines, based on the type
of document being served, whether the document body (except
for the chunking information, of course) is in ASCII already
or must be converted from EBCDIC.
- By default, Apache assumes that documents with the MIME
types "text/*", "message/*", "multipart/*" and
"application/x-www-form-urlencoded" are text documents and
are stored as EBCDIC files, whereas all other files are
binary files (and stored in a byte-identical encoding as on
an ASCII machine).
These defaults can be overridden on a by-MIME-type
and/or by-file-extension
basis, using the directives
EBCDICConvertByType {On|Off}[={In|Out|InOut}] mimetype [...]
EBCDICConvert {On|Off}[={In|Out|InOut}] fileext [...]
where the mimetype argument may contain
wildcards.
- Before adding the flexible conversion, non-text documents
were always served "binary" without conversion. This seemed
to be the most sensible choice for, e.g.,
GIF/ZIP/AU file types (it of course requires the user to copy
them to the mainframe host using the "rcp -b" binary switch),
but proved to be inadequate for MIME types like
model/vrml, application/postscript
and application/x-javascript.
- Server parsed files are always assumed to be in native
(i.e., EBCDIC) format as used on the machine
(because they do not cross the conversion layer when being
read), and are converted after processing.
- For CGI output, the CGI script determines whether a
conversion is needed or not: by setting the appropriate
Content-Type, text files can be converted, or GIF output can
be passed through unmodified (depending on the conversion
configured in the script's context).
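The directive syntax given above, {On|Off}[={In|Out|InOut}] followed by one or more arguments, can be illustrated with a small parser. This is a sketch for explanation only and not how the server reads its configuration: the function name is hypothetical, matching is done with exact case as a simplification, and the file extension and MIME type arguments used in the examples are illustrative.

```python
def parse_directive(line):
    """Parse a line such as 'EBCDICConvert On=In .html .txt'.

    Returns (name, convert, direction, targets). The direction defaults
    to "InOut" when omitted, since a setting applies to both input and
    output unless stated otherwise.
    """
    name, setting, *targets = line.split()
    if name not in ("EBCDICConvert", "EBCDICConvertByType"):
        raise ValueError("unknown directive: " + name)
    if not targets:
        raise ValueError("at least one file extension or MIME type required")
    state, _, direction = setting.partition("=")
    if state not in ("On", "Off"):
        raise ValueError("expected On or Off, got " + state)
    direction = direction or "InOut"
    if direction not in ("In", "Out", "InOut"):
        raise ValueError("bad direction: " + direction)
    return name, state == "On", direction, targets
```

For instance, parse_directive("EBCDICConvertByType On=Out model/vrml") describes a rule that converts VRML documents on output only, leaving uploads of the same type untouched.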
Binary Files
When exchanging binary files between the mainframe host and
a Unix machine or Windows PC, be sure to use the ftp "binary"
(TYPE I) command, or use the
rcp -b command from the mainframe host (the
-b switch is not supported by Unix rcp implementations).
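When the transfer is scripted, the binary mode can be selected programmatically. The following sketch assumes a connection object from Python's standard ftplib module (or anything with a compatible storbinary method); the function name is hypothetical.

```python
import os

def upload_binary(ftp, local_path):
    """Upload a file unmodified using an ftplib.FTP-like connection.

    ftplib's storbinary() switches the transfer to TYPE I (image/binary)
    before sending, so no character set translation is requested.
    """
    remote_name = os.path.basename(local_path)
    with open(local_path, "rb") as fh:
        ftp.storbinary("STOR " + remote_name, fh)
    return remote_name
```

Opening the local file in "rb" mode matters as well, so that the bytes are not altered on the sending side either.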
Text Documents
The default assumption of the server is that Text Files
(i.e., all files whose Content-Type:
starts with text/) are stored in the native
character set of the host, EBCDIC.
Server Side Included Documents
SSI documents must currently be stored in EBCDIC only. No
provision is made to convert them from ASCII before processing.
The same holds for documents processed by other interpreters, such as
mod_perl or mod_php.