objcopy: Embedding a Text File in a C Executable with 'objcopy' or 'xxd'

Description

Embed text/binary files as objects in C executables using either:

ld(1) and objcopy(1).

xxd(1) and gcc/clang.

Quick start

make demo

Check Makefile.demo for details on how to create linkable object files from text files.

Notes about `objcopy` and `xxd`

It doesn't matter much whether you use ld+objcopy or xxd+gcc; both have limitations.
One limitation of ld+objcopy is that the output format is binary, so the generated symbol name and type cannot be changed.
On the other hand, xxd (with the -i option) outputs C code to stdout, which makes it possible to modify the output before passing it to the compiler.
Example:

    echo hello > hello.txt
    xxd -i hello.txt

The output:

    unsigned char hello_txt[] = {
      0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x0a
    };
    unsigned int hello_txt_len = 6;
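
As a minimal sketch of how a program might consume these symbols: the definitions below are copied from the xxd output above so that the example is self-contained (in a real build they would live in the generated object file and be declared `extern`). `write_blob()` is a hypothetical helper name, not something xxd provides.

```c
#include <stdio.h>

/* Copied from the `xxd -i hello.txt` output. When linking against the
 * generated object file instead, declare them as:
 *   extern unsigned char hello_txt[];
 *   extern unsigned int hello_txt_len;
 */
unsigned char hello_txt[] = {
  0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x0a
};
unsigned int hello_txt_len = 6;

/* Hypothetical helper: write an embedded blob to a stream.
 * Returns 0 on success, -1 if fewer than `len` bytes were written. */
int write_blob(FILE *out, const unsigned char *data, unsigned int len)
{
  return fwrite(data, 1, len, out) == len ? 0 : -1;
}
```

Calling `write_blob(stdout, hello_txt, hello_txt_len)` writes the file contents unchanged. Passing the length explicitly matters because the array is not terminated by a zero byte.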

Notes about `objcopy` and `xxd`: Risk for conflicting names

Both ld+objcopy and xxd use the file name (possibly including a path) to create the symbol name.
Both hyphens and slashes are converted to underscores (i.e. - to _, and / to _).
This means that files with different names may create the very same symbol name:

    echo dummy > dummy_hello.txt
    echo dummy > dummy-hello.txt
    mkdir dummy
    echo dummy > dummy/hello.txt

    xxd -i dummy_hello.txt
    xxd -i dummy-hello.txt
    xxd -i dummy/hello.txt

Same output for all 3 files:

    unsigned char dummy_hello_txt[] = {
      0x64, 0x75, 0x6d, 0x6d, 0x79, 0x0a
    };
    unsigned int dummy_hello_txt_len = 6;
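
The collision can be reproduced in C with a rough approximation of the name mangling. This is a sketch that assumes every non-alphanumeric character becomes an underscore; `mangle_name()` is a hypothetical name, and the real tools may treat some edge cases (such as a leading digit) differently.

```c
#include <ctype.h>
#include <stddef.h>

/* Approximate the file-name-to-symbol mangling: every character that is
 * not alphanumeric becomes '_'. Writes at most outsz-1 characters plus a
 * terminating NUL. Returns 0 on success, -1 if `out` is too small. */
int mangle_name(const char *path, char *out, size_t outsz)
{
  size_t i;

  for (i = 0; path[i] != '\0'; i++) {
    if (i + 1 >= outsz)
      return -1;
    out[i] = isalnum((unsigned char)path[i]) ? path[i] : '_';
  }
  if (outsz == 0)
    return -1;
  out[i] = '\0';
  return 0;
}
```

With this function, "dummy_hello.txt", "dummy-hello.txt" and "dummy/hello.txt" all map to the same string, which is why a build step could use a check like this to detect conflicts before linking.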

An easy way to avoid conflicts is to never use - or _ in filenames.

Notes about objcopy and xxd: Only for text files handled as C strings - modify the `xxd` output

If you want a more "C string-friendly" approach, and just want to printf() the text, it would be more convenient to handle the text as a NUL-terminated string (char array):

    const char hello_txt[] = {
      0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x0a, 0x00
    };
    const size_t hello_txt_len = 7;

The array length has to be incremented by one to account for the terminator, and should be of type size_t to match C string functions such as strlen().

This "C string-friendly" approach has its disadvantages though, as when compiling from stdin (convenient to avoid creating an intermediate C file):

    xxd -i hello.txt | cc -c -xc -g - -o hello.o
    (no output)

    xxd -i hello.txt | sed -s 's/unsigned int/size_t/' | cc -c -xc -g - -o hello.o
    <stdin>:4:1: error: unknown type name ‘size_t’

Oops. The output from xxd does not include any C headers, so only core C language definitions may be used.
This can be resolved by including a header, but it also makes things more complicated:

    (echo '#include <stddef.h>' && xxd -i hello.txt) | sed -s 's/unsigned int/size_t/' | cc -c -xc -g - -o hello.o
    (no output)

Let's keep it simple, and use the output from xxd "as is". Or, at least, almost. Declaring the variables as const makes sense:

    xxd -i hello.txt | sed -s 's/unsigned/const unsigned/'

    const unsigned char hello_txt[] = {
      0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x0a
    };
    const unsigned int hello_txt_len = 6;

And as we know the length of the text, we can still use printf() with an explicit precision, even though the string isn't NUL-terminated (a plain %s would read past the end of the array):

    printf("hello_txt=%.*s\n", (int)hello_txt_len, (const char *)hello_txt);

Notes (xxd only): Speed up parsing using C

In this project, the input files are parsed by a Makefile rule (see the rule 'struct' in the Makefile).
The Makefile rule works, but is really big and not very elegant.
It is also probably quite slow for a large number of input files, as it makes heavy use of shell commands:
- Lots of printf's
- find (both files and directories)
- xxd
- file -b --mime-type to get the MIME type
- Piped output to either gcc or clang (uses cc to make it portable)
To speed things up (and make the Makefile rule much smaller), replace the shell commands with a C program:
- Use readdir(3) instead of find to get files and directories.
- Use the xxd source code to embed xxd inside the C program.
- Use the libmagic(3) library to get the MIME type instead of using file -b --mime-type
- Use popen(3) to pipe output to cc.
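
As a rough sketch of the second point, the function below writes a byte buffer as C source in a layout similar to `xxd -i`. It is not taken from the xxd source code; `emit_c_array()` is a hypothetical name, and the `const` qualifiers follow the recommendation made earlier in these notes. The `FILE *` argument could be a stream returned by popen(3) for cc.

```c
#include <stdio.h>
#include <stddef.h>

/* Write `data` as a C array definition named `symbol`, followed by a
 * matching `<symbol>_len` variable, 12 bytes per line.
 * Returns 0 on success, -1 on a write error. */
int emit_c_array(FILE *out, const char *symbol,
                 const unsigned char *data, size_t len)
{
  size_t i;

  fprintf(out, "const unsigned char %s[] = {", symbol);
  for (i = 0; i < len; i++) {
    /* Start a new indented line every 12 bytes, otherwise add a space. */
    fprintf(out, "%s0x%02x", (i % 12 == 0) ? "\n  " : " ",
            (unsigned int)data[i]);
    if (i + 1 < len)
      fputc(',', out);
  }
  fprintf(out, "\n};\nconst unsigned int %s_len = %lu;\n",
          symbol, (unsigned long)len);
  return ferror(out) ? -1 : 0;
}
```

The caller is expected to pass a symbol name that is already a valid C identifier.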

Notes about objcopy and xxd: Summary

Use xxd instead of objcopy, as it allows us to tweak the output.
Modify the xxd output as little as possible (but do declare variables as const unsigned ...).
Pipe xxd output to gcc/clang to avoid creating intermediate C files.
Parse the last line of the xxd output to get the symbol name, e.g. hello_txt_len.
Avoid filenames containing - or _, to avoid symbol conflicts when linking many files together.
Optionally, append C code to the xxd output before piping it to gcc/clang (see below for details).
Consider writing a C program to speed up file parsing.

Library with web content as in-memory objects: `libwwwmem`

Step 1: Static HTML

One example of using embedded files is web content files (HTML/CSS/JS/JPEG/PNG/GIF/SVG) compiled
into a C web server as object files, thus the term "wwwmem".
An advantage of using embedded web content is the speed, avoiding disk access to get the web contents.
A disadvantage is the limited flexibility, as all web content has to be available at compile time, but this is not necessarily a problem for simple web servers.

Step 2: Lookup methods

One problem to solve is how to map a file path in a URI request to a C symbol which points to the requested content.
Each file path must be saved in a C struct together with the pointer to the file contents.
Directory paths must also be saved, to be able to handle Directory Index and Directory Listings.
This C struct must be generated by either a script (in our case, done by the Makefile), or the C program which parses the files.

The lookup method to find a specific path may vary, from a simple loop (slow and inefficient, but simple) to more advanced techniques such as MPHF (Minimal Perfect Hash Function), which are faster, but may be considered overkill for a web server serving just a few pages.
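
A minimal sketch of the simplest variant, a table of paths searched with a loop, might look like this. The struct layout, the names `path_entry`, `path_table` and `lookup_path()`, and the two stand-in byte arrays are all assumptions made for illustration; in a real build the data pointers would refer to the symbols generated by xxd or objcopy.

```c
#include <stddef.h>
#include <string.h>

/* One entry per embedded file: request path plus pointer/length of data. */
struct path_entry {
  const char *path;
  const unsigned char *data;
  size_t len;
};

/* Stand-ins for embedded objects (hypothetical contents). */
static const unsigned char index_html[] = "<h1>hi</h1>";
static const unsigned char style_css[] = "body{}";

static const struct path_entry path_table[] = {
  { "/index.html",    index_html, sizeof index_html - 1 },
  { "/css/style.css", style_css,  sizeof style_css - 1 },
};

/* Linear search: O(n) string comparisons. Returns NULL when not found,
 * in which case the server would answer with a 404. */
const struct path_entry *lookup_path(const char *path)
{
  size_t i;

  for (i = 0; i < sizeof path_table / sizeof path_table[0]; i++)
    if (strcmp(path_table[i].path, path) == 0)
      return &path_table[i];
  return NULL;
}
```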

Step 3: Directory index and listing

By default, trying to access a URI which is a directory returns either the directory's index.html file, or 403 Forbidden.
Use either a global configuration (command-line arguments or configuration file), or per-directory .htaccess files to change the default behaviour.

Step 4: Using HTML templates

If the web server is CGI-enabled, using HTML templates instead of static HTML pages offers more flexibility.

Step 5: Dynamically loaded library

To make web contents maintenance as flexible as possible, separate the web server code from the libwwwmem library, which is dynamically loaded at run-time.
The libwwwmem library contains a list of functions, one function to access each web content object, and the web content objects themselves.
Being a dynamically loaded library, the web content may be modified/added/deleted and then recompiled and reloaded without recompiling, or even stopping, the web server itself.

Step 6: Test suite

As libwwwmem is compiled, test routines may be added to validate static HTML, .htaccess directives, and template syntax at compile time.

`libwwwmem` details

Step 1: Static HTML

"Objectify" a file, for example www/css/style.css:

Using objcopy:

    rm -rf build; mkdir build
    ld -r -o build/www_css_style_css.o [-z execstack] --format=binary www/css/style.css
    objcopy --rename-section .data=.rodata,alloc,load,readonly,data,contents build/www_css_style_css.o

    nm build/www_css_style_css.o
        00000000000000b2 D _binary_www_css_style_css_end
        00000000000000b2 A _binary_www_css_style_css_size
        0000000000000000 D _binary_www_css_style_css_start

Using xxd:

    rm -rf build; mkdir build
    xxd -i www/css/style.css | cc -c -xc -g -Wall -Wextra -Werror -ansi -pedantic - -o build/www_css_style_css.o

    nm build/www_css_style_css.o
        0000000000000000 D www_css_style_css
        00000000000000b4 D www_css_style_css_len

Map the file www/css/style.css to the C symbol www_css_style_css.
Let's assume that the web server root directory is www/.
An HTTP request to get www/css/style.css would look like this:

        GET /css/style.css HTTP/1.0

The HTTP response:

        HTTP/1.0 200 OK
        Content-Length: 178
        Content-Type: text/css
        Date: Fri, 31 Dec 1999 23:59:59 GMT

        body {
            font-family: Arial;
            background-color: #bbf0ff;
        }
        .
        .
        .

The problem is that there is no way to map the path string /css/style.css to the symbol _binary_www_css_style_css_start from within the web server program at run time.

Some additional information is required to map the file to the symbol:

To be able to return either the HTTP/1.0 200 OK or the HTTP/1.0 404 Not Found, there must be a way to search for a symbol, and return 404 if the symbol does not exist. The search is done using a hash table.

The index of each entry (bucket) in the hash table is the hash value of the full path (i.e. hash_function("/css/style.css")).
Each bucket is a linked list, but normally with a single element (unless collisions occurred).

The element is a key/value pair.
The key is the string holding the full path to the file (e.g. /css/style.css).
The key is compared with the requested path to find the matching element in the list. The comparison is needed even for a single-element list, since a path that does not exist may hash to an occupied bucket.
The value is a structure:

A pointer to the web contents (i.e. _binary_www_css_style_css_start).
The Content-Length header (i.e. _binary_www_css_style_css_size, or _binary_www_css_style_css_end - _binary_www_css_style_css_start).
The Content-Type header, to indicate the MIME type (e.g. text/css).
A pointer to the next element in the linked list.

This results in 3 structures:

    #include <stddef.h>                   /* size_t */

    /* The web content object struct */
    typedef struct web_content_t
    {
      const unsigned char *content_start; /* _binary_www_css_style_css_start */
      const size_t content_len;           /* _binary_www_css_style_css_size  */
      const char *mime;                   /* MIME type, e.g. "text/css"      */
      struct web_content_t *next;
    } web_content_t;

    /* The hash element (bucket) struct */
    typedef struct hash_element_t
    {
      const char *key;                    /* The file path, e.g. "/css/style.css" */
      web_content_t *value;               /* The web content for this path        */
      struct hash_element_t *next;        /* Next element in the bucket's list    */
    } hash_element_t;

    /* The hash table struct */
    typedef struct hash_table_t
    {
      hash_element_t *bucket;             /* The dynamic array of buckets of keys/values */
    } hash_table_t;
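
The lookup described above could be sketched as follows. This is an illustration under several assumptions rather than the project's implementation: the structs are simplified copies so that the block compiles on its own, the bucket array holds list heads and has an explicit `nbuckets` field, the hash is djb2 (the text does not name a hash function), and `hash_insert()` and `hash_lookup()` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

typedef struct web_content_t {
  const unsigned char *content_start;
  size_t content_len;
  const char *mime;
} web_content_t;

typedef struct hash_element_t {
  const char *key;                 /* request path */
  const web_content_t *value;
  struct hash_element_t *next;     /* next element in the same bucket */
} hash_element_t;

typedef struct hash_table_t {
  hash_element_t **bucket;         /* array of nbuckets list heads */
  size_t nbuckets;
} hash_table_t;

/* djb2 string hash (an assumption, see the note above). */
static unsigned long hash_path(const char *s)
{
  unsigned long h = 5381;

  while (*s != '\0')
    h = h * 33 + (unsigned char)*s++;
  return h;
}

/* Push `e` onto the front of its bucket's list. */
void hash_insert(hash_table_t *t, hash_element_t *e)
{
  size_t i = hash_path(e->key) % t->nbuckets;

  e->next = t->bucket[i];
  t->bucket[i] = e;
}

/* Return the content for `path`, or NULL (the caller would answer 404). */
const web_content_t *hash_lookup(const hash_table_t *t, const char *path)
{
  const hash_element_t *e = t->bucket[hash_path(path) % t->nbuckets];

  for (; e != NULL; e = e->next)
    if (strcmp(e->key, path) == 0)
      return e->value;
  return NULL;
}
```

The key is compared on every lookup, even when a bucket holds one element, so that an unknown path landing in an occupied bucket still yields NULL.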

MIME

The MIME type may be obtained using the file tool:

    file -b --mime-type www/css/style.css
        text/plain

The commonly used MIME types on the web:

text/plain
text/html
text/css
text/javascript
image/png
image/gif
image/jpeg
image/svg+xml
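
Since file reports text/plain for the CSS file in the example above, one possible alternative is to derive the type from the file extension. The sketch below is an assumption, not part of the project: `mime_from_path()` is a hypothetical name, the extension table covers only the types listed above, and the text/plain fallback is an arbitrary choice.

```c
#include <string.h>

/* Map a file name to a MIME type by its extension. Unknown or missing
 * extensions fall back to "text/plain". */
const char *mime_from_path(const char *path)
{
  static const struct { const char *ext; const char *mime; } map[] = {
    { ".txt",  "text/plain" },
    { ".html", "text/html" },
    { ".css",  "text/css" },
    { ".js",   "text/javascript" },
    { ".png",  "image/png" },
    { ".gif",  "image/gif" },
    { ".jpg",  "image/jpeg" },
    { ".jpeg", "image/jpeg" },
    { ".svg",  "image/svg+xml" },
  };
  const char *dot = strrchr(path, '.');   /* last '.' in the path */
  size_t i;

  if (dot == NULL)
    return "text/plain";
  for (i = 0; i < sizeof map / sizeof map[0]; i++)
    if (strcmp(dot, map[i].ext) == 0)
      return map[i].mime;
  return "text/plain";
}
```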

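objcopy prepends the directory pathname to the symbol name (see https://stackoverflow.com/questions/15594988/objcopy-prepends-directory-pathname-to-symbol-name). With xxd, the symbol name can be changed before compiling:

    xxd -i input.txt | sed 's/input_txt/test/' | gcc -c -xc - -o obj.o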

Step 2: Lookup methods

The path for all web content files and directories must be included in a list. This list of paths is used as keys for the lookup method.

The lookup method to find a specific path may vary, from a loop using strcmp (slow and inefficient, but simple) to more advanced techniques such as MPHF (Minimal Perfect Hash Function), which are faster, but may be considered overkill for a web server serving just a few pages.

To be able to handle Directory index and Directory listing, all directory paths must be included as separate keys in the list. That is, if /path/to/an/image/pic1.png is a file, the following entries have to be in the list:

/path/to/an/image/pic1.png
/path/to/an/image/
/path/to/an/
/path/to/
/path/
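
The directory keys for a given file path could be derived with a helper like the one below. This is a sketch: `parent_dir_keys()` and `KEY_MAX` are hypothetical names, and the root "/" is skipped only because it does not appear in the list above.

```c
#include <stddef.h>
#include <string.h>

#define KEY_MAX 256

/* Fill `out` with the directory keys of `path`, deepest first, each ending
 * in '/'. The root "/" itself is skipped.
 * Returns the number of keys written (at most `max`). */
size_t parent_dir_keys(const char *path, char out[][KEY_MAX], size_t max)
{
  size_t n = 0;
  size_t end = strlen(path);

  while (n < max) {
    /* Step back to just after the previous '/' before position `end`. */
    while (end > 0 && path[end - 1] != '/')
      end--;
    if (end <= 1 || end >= KEY_MAX)   /* reached the root, or key too long */
      break;
    memcpy(out[n], path, end);
    out[n][end] = '\0';
    n++;
    end--;                            /* skip this '/' and keep scanning */
  }
  return n;
}
```

For /path/to/an/image/pic1.png this yields the four directory entries shown in the list; the file path itself is added separately.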

There are probably more efficient ways to solve this problem.

The lookup methods:

|-------------------|-----------------|-------------------------------------------------------------------------------------|
| Number of keys    | Search function | Description                                                                         |
|-------------------|-----------------|-------------------------------------------------------------------------------------|
| very small (< 10) | loop            | Not using hash function at all.                                                     |
| very small (< 10) | hf_oa_minimal   | Open-addressing probing, with collisions.                                           |
| small (< 100)     | hf_oa_perfect   | Open-addressing probing, without collisions.                                        |
| medium (< 10 000) | hf_mphf_bob     | MPHF by Bob Jenkins.                                                                |
| big (> 10 000)    | hf_bdz_ph       | BDZ_PH, extracted from the CMPH library.                                            |
|-------------------|-----------------|-------------------------------------------------------------------------------------|
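
To illustrate what open-addressing probing means in the table above, here is a generic linear-probing sketch. It is not the project's hf_oa_minimal or hf_oa_perfect code; the table size, the hash function and the names `oa_insert()` and `oa_find()` are assumptions chosen for the example.

```c
#include <stddef.h>
#include <string.h>

#define SLOTS 16

static const char *slots[SLOTS];   /* NULL marks an empty slot */

static size_t slot_hash(const char *s)
{
  size_t h = 0;

  while (*s != '\0')
    h = h * 31 + (unsigned char)*s++;
  return h % SLOTS;
}

/* Insert `key`; returns its slot index, or -1 if the table is full.
 * On a collision the next slot is tried (linear probing). */
int oa_insert(const char *key)
{
  size_t i = slot_hash(key), n;

  for (n = 0; n < SLOTS; n++, i = (i + 1) % SLOTS) {
    if (slots[i] == NULL || strcmp(slots[i], key) == 0) {
      slots[i] = key;
      return (int)i;
    }
  }
  return -1;
}

/* Return the slot index of `key`, or -1 if it is not in the table. */
int oa_find(const char *key)
{
  size_t i = slot_hash(key), n;

  for (n = 0; n < SLOTS; n++, i = (i + 1) % SLOTS) {
    if (slots[i] == NULL)
      return -1;                   /* an empty slot ends the probe sequence */
    if (strcmp(slots[i], key) == 0)
      return (int)i;
  }
  return -1;
}
```

The returned slot index could then be used to index a parallel array of web content structs.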

Step 3: Directory index and listing

Directory index file

The default behaviour for accessing the URI /path/to/ is to return /path/to/index.html and status 200 OK.
Use DirectoryIndex to use another filename instead of index.html, or a list of filenames, checking for existence from left to right.

Directory listing

The default behaviour when /path/to/index.html (or whatever is indicated by DirectoryIndex) does not exist is to return status 403 Forbidden.
Use Options +Indexes to return status 200 OK and list all files in the directory.
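
The two rules above can be sketched as a single resolution function. Everything here is illustrative: `resolve_directory()` is a hypothetical name, and `path_exists()` is a stand-in with a fixed file set that would be replaced by the real path lookup.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the real path lookup: a fixed set of files. */
static int path_exists(const char *path)
{
  static const char *files[] = { "/index.html", "/docs/index.cgi" };
  size_t i;

  for (i = 0; i < sizeof files / sizeof files[0]; i++)
    if (strcmp(files[i], path) == 0)
      return 1;
  return 0;
}

/* Resolve a directory URI (ending in '/').
 * `index_names` is a space-separated list, tried left to right.
 * Returns 200 and fills `out` with the index file path when one exists.
 * Otherwise `out` is set to the directory itself and the result is 200
 * when listing is enabled, or 403 when it is not. */
int resolve_directory(const char *dir, const char *index_names,
                      int listing_enabled, char *out, size_t outsz)
{
  const char *p = index_names;

  while (*p != '\0') {
    size_t len = strcspn(p, " ");     /* length of the next file name */

    if (len > 0) {
      snprintf(out, outsz, "%s%.*s", dir, (int)len, p);
      if (path_exists(out))
        return 200;
    }
    p += len;
    p += strspn(p, " ");              /* skip separators */
  }
  snprintf(out, outsz, "%s", dir);
  return listing_enabled ? 200 : 403;
}
```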

Usage

To use these options, do one of the following:

Create a .htaccess file in any or all directories:

    -----------------------------------
    DirectoryIndex index.html index.cgi
    Options +Indexes
    -----------------------------------

Enable .htaccess parsing from the command line:

   ./PRG --parse-htaccess=1

Use command-line arguments to set the options globally (this ignores .htaccess parsing):
```
   ./PRG --directory-index='index.html index.cgi' --directory-listing=1
```

Combine global options with per-directory .htaccess parsing:

   ./PRG --directory-index='index.html index.cgi' --directory-listing=1 --parse-htaccess=1
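
Parsing the two .htaccess directives shown above could look roughly like this. It is a sketch under assumptions: `struct dir_options` and `parse_htaccess_line()` are hypothetical names, only `DirectoryIndex` and `Options +Indexes` are recognised, and each directive is expected on its own line.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct dir_options {
  char directory_index[128];  /* space-separated file names */
  int  listing;               /* 1 when "Options +Indexes" was seen */
};

/* Apply one .htaccess line to `opt`. Returns 1 if the line was
 * recognised, 0 otherwise (unknown lines are ignored). */
int parse_htaccess_line(const char *line, struct dir_options *opt)
{
  static const char di[] = "DirectoryIndex ";
  static const char op[] = "Options ";

  line += strspn(line, " \t");              /* skip leading whitespace */
  if (strncmp(line, di, sizeof di - 1) == 0) {
    const char *val = line + sizeof di - 1;
    size_t len = strcspn(val, "\r\n");      /* value without line ending */

    snprintf(opt->directory_index, sizeof opt->directory_index,
             "%.*s", (int)len, val);
    return 1;
  }
  if (strncmp(line, op, sizeof op - 1) == 0) {
    const char *val = line + sizeof op - 1;

    if (strncmp(val, "+Indexes", 8) == 0) {
      opt->listing = 1;
      return 1;
    }
  }
  return 0;
}
```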

Step 4: Using HTML templates

The HTML template: the tags

The HTML template is inspired by these two libraries:

The Perl HTML::Template module
The C Template Library 1.0

These are the original HTML::Template template tags:

TMPL_VAR
TMPL_LOOP
TMPL_INCLUDE
TMPL_IF
TMPL_ELSE
TMPL_UNLESS

The C Template Library offers some additional tags, but to keep things simple (and compatible with HTML::Template), the additional tags are not supported:

TMPL_ELSIF
TMPL_BREAK
TMPL_CONTINUE

The HTML template: the tags - details

Tags may be written as HTML comments (useful when validating templates as valid HTML):

<TMPL_VAR NAME="PARAM1">
<!-- TMPL_VAR NAME=PARAM1 -->

Variables may be escaped as HTML, JS, URL:

<TMPL_VAR NAME="PARAM1" ESCAPE=HTML>
<TMPL_VAR NAME="PARAM1" ESCAPE=JS>
<TMPL_VAR NAME="PARAM1" ESCAPE=URL>
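
As an illustration of the HTML case, a minimal escaping function might look like the sketch below. The set of five escaped characters is an assumption (the exact set used by libwwwmem is not stated here), and `escape_html()` is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Copy `in` to `out`, replacing the characters & < > " ' with HTML
 * entities. Returns 0 on success, or -1 if `out` (of size `outsz`) is too
 * small; `out` is always NUL-terminated when outsz > 0. */
int escape_html(const char *in, char *out, size_t outsz)
{
  size_t used = 0;

  if (outsz == 0)
    return -1;
  for (; *in != '\0'; in++) {
    const char *rep;
    char one[2];
    size_t len;

    switch (*in) {
    case '&':  rep = "&amp;";  break;
    case '<':  rep = "&lt;";   break;
    case '>':  rep = "&gt;";   break;
    case '"':  rep = "&quot;"; break;
    case '\'': rep = "&#39;";  break;
    default:   one[0] = *in; one[1] = '\0'; rep = one; break;
    }
    len = strlen(rep);
    if (used + len >= outsz) {      /* keep room for the terminator */
      out[used] = '\0';
      return -1;
    }
    memcpy(out + used, rep, len);
    used += len;
  }
  out[used] = '\0';
  return 0;
}
```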

Variables may have a default value:

<TMPL_VAR NAME="PARAM1" DEFAULT="the devil">

A libwwwmem template which includes another template uses the same syntax as HTML::Template:

<TMPL_INCLUDE NAME="filename.tmpl">

The difference is that, when using libwwwmem, "filename.tmpl" is internally mapped to the pointer _binary_filename_tmpl_start instead of reading "filename.tmpl".

The HTML template: the library functions

Difference from HTML::Template:

The libwwwmem library functions are named with the HTML::Template method names in mind, but not all methods are implemented.
Most HTML::Template methods are overloaded, which is not permitted in C, so libwwwmem includes either only one "version" of each corresponding HTML::Template method, or none.

Examples:

The HTML::Template->param() method may be called in several ways:
- param(): return the list of current template parameters. Not implemented by libwwwmem.
- param(PARAM): return the value of "PARAM". Not implemented by libwwwmem.
- param(PARAM => 'value'): assign a single value to "PARAM".
- param(LOOP_PARAM => array_ref): assign an array ref to "LOOP_PARAM".
- param(SUB_PARAM => sub { return 'value' }): assign a sub ref to "SUB_PARAM".
  libwwwmem implements one single function, wwwmem_tmpl_param(struct tmpl_param *).
  The tmpl_param struct contains 3 linked lists: 1 for single values, 1 for loop arrays, and 1 for sub refs (list of function pointers).
  This way, parameters of any kind may be set using one single call.
The HTML::Template->config() method with arguments sets one or more configuration options.
With no arguments, the current configuration is returned. libwwwmem only permits configuration flags to be set using the wwwmem_tmpl_new() function, but does not implement any config method.
The HTML::Template->output() method supports the optional argument print_to => *STDOUT.
libwwwmem implements two functions, wwwmem_tmpl_output() and wwwmem_tmpl_output_to(int fd).
wwwmem_tmpl_output() is actually a wrapper for wwwmem_tmpl_output_to(STDERR_FILENO).

As a template always is an in-memory object, the HTML::Template methods for dealing with file caching do not make sense with libwwwmem.

libwwwmem functions:

wwwmem_tmpl_new()
wwwmem_tmpl_config()
wwwmem_tmpl_param()
wwwmem_tmpl_clear_params()
wwwmem_tmpl_output()
wwwmem_tmpl_output_to()
wwwmem_tmpl_query()

%%% ascii/utf8/utf16 ???