objcopy

Embedding a Text File in a C Executable with 'objcopy' or 'xxd'

Description

Embed text/binary files as objects in C executables using either ld+objcopy or xxd+gcc.

Quick start

make demo

Check Makefile.demo for details on how to create linkable object files from text files.

Notes about objcopy and xxd

Neither ld+objcopy nor xxd+gcc is clearly better; both have limitations.
One limitation of ld+objcopy is that the output format is binary, so the generated symbol name and type cannot be changed.
On the other hand, xxd (with the -i option) outputs C code to stdout, which makes it possible to modify the output before passing it to the compiler.
Example:

    echo hello > hello.txt
    xxd -i hello.txt

The output:

    unsigned char hello_txt[] = {
      0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x0a
    };
    unsigned int hello_txt_len = 6;

Notes about objcopy and xxd: Risk for conflicting names

Both ld+objcopy and xxd use the file name (possibly including a path) to create the symbol name.
Hyphens, slashes, and dots are all converted to underscores (i.e. - to _, / to _, and . to _).
This means that files with different names may create the very same symbol name:

    echo dummy > dummy_hello.txt
    echo dummy > dummy-hello.txt
    mkdir dummy
    echo dummy > dummy/hello.txt

    xxd -i dummy_hello.txt
    xxd -i dummy-hello.txt
    xxd -i dummy/hello.txt

Same output for all 3 files:

    unsigned char dummy_hello_txt[] = {
      0x64, 0x75, 0x6d, 0x6d, 0x79, 0x0a
    };
    unsigned int dummy_hello_txt_len = 6;

An easy way to avoid conflicts is to use neither - nor _ in file names.
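The collision can be checked without running xxd. The sketch below assumes a simplified form of the naming rule (every character that is not a letter or digit becomes an underscore). The helper names `symbol_from_path` and `paths_collide` are hypothetical and are not part of xxd or objcopy.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: derive a C identifier from a file path by replacing
 * every character that is not a letter or digit with '_'.
 * This approximates the rule described above; it is not xxd's actual code. */
static void symbol_from_path(const char *path, char *out, size_t out_size)
{
    size_t i;

    assert(path != NULL && out != NULL && out_size > 0);
    for (i = 0; path[i] != '\0' && i + 1 < out_size; i++) {
        out[i] = isalnum((unsigned char)path[i]) ? path[i] : '_';
    }
    out[i] = '\0';
}

/* Returns 1 if both paths map to the same symbol name, 0 otherwise. */
static int paths_collide(const char *p1, const char *p2)
{
    char s1[256];
    char s2[256];

    symbol_from_path(p1, s1, sizeof s1);
    symbol_from_path(p2, s2, sizeof s2);
    return strcmp(s1, s2) == 0;
}
```

A build script could run such a check over all input files and stop with an error before two objects with the same symbol name reach the linker.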

Notes about objcopy and xxd: Only for text files handled as C strings - modify the xxd output

If you want a more "C string-friendly" approach, and just want to printf() the text, it is more convenient to handle the text as a NUL-terminated string (char array):

    const char hello_txt[] = {
      0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x0a, 0x00
    };
    const size_t hello_txt_len = 7;

The length has to be incremented by one to include the terminator, and should be of type size_t to match C string functions such as strlen().

This "C string-friendly" approach has its disadvantages though, for example when compiling from stdin (convenient to avoid creating an intermediate C file):

    xxd -i hello.txt | cc -c -xc -g - -o hello.o
    (no output)
    xxd -i hello.txt | sed -s 's/unsigned int/size_t/' | cc -c -xc -g - -o hello.o
    <stdin>:4:1: error: unknown type name ‘size_t’

Oops. The output from xxd does not include any C headers, so only core C language definitions may be used.
This can be resolved by including a header, but it also makes things more complicated:

    (echo '#include <stddef.h>' && xxd -i hello.txt) | sed -s 's/unsigned int/size_t/' | cc -c -xc -g - -o hello.o
    (no output)

Let's keep it simple, and use the output from xxd "as is". Or, at least, almost. Declaring the variables as const makes sense:

    xxd -i hello.txt | sed -s 's/unsigned/const unsigned/'

    const unsigned char hello_txt[] = {
      0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x0a
    };
    const unsigned int hello_txt_len = 6;

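And as we know the length of the text, we can still use printf() with an explicit precision, even though the string isn't NUL-terminated:

    /* Unsafe here: %s keeps reading until it finds a NUL byte */
    /* printf("hello_txt=%s\n", hello_txt); */
    printf("hello_txt=%.*s\n", (int)hello_txt_len, hello_txt);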
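As a slightly fuller sketch of the same idea, the helper below formats the embedded data into a buffer using the length instead of a terminator. The function name `format_blob` is hypothetical; the array is a copy of the const variant of the xxd output shown above.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/* Same data as the xxd output above: "hello\n" without a terminating NUL. */
static const unsigned char hello_txt[] = {
  0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x0a
};
static const unsigned int hello_txt_len = 6;

/* Hypothetical helper: write "name=<data>" into dst.
 * The precision argument of "%.*s" must be an int, hence the cast.
 * Returns what snprintf() returns (the length that was needed). */
static int format_blob(char *dst, size_t dst_size, const char *name,
                       const unsigned char *data, unsigned int len)
{
    assert(dst != NULL && name != NULL && data != NULL);
    assert(len <= (unsigned int)INT_MAX);
    return snprintf(dst, dst_size, "%s=%.*s", name, (int)len,
                    (const char *)data);
}
```

Note that "%.*s" still stops at the first NUL byte inside the data, so for binary content something like fwrite(hello_txt, 1, hello_txt_len, stdout) is probably the safer choice.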

Notes (xxd only): Speed up parsing using C

Notes about objcopy and xxd: Summary

Library with web content as in-memory objects: libwwwmem

Step 1: Static HTML

One example of using embedded files is web content files (HTML/CSS/JS/JPEG/PNG/GIF/SVG) compiled
into a C web server as object files, thus the term "wwwmem".
An advantage of using embedded web content is the speed, avoiding disk access to get the web contents.
A disadvantage is the limited flexibility, as all web content has to be available at compile time, but this is not necessarily a problem for simple web servers.

Step 2: Lookup methods

One problem to solve is how to map a file path in a URI request to a C symbol which points to the requested content.
Each file path must be saved in a C struct together with the pointer to the file contents.
Directory paths must also be saved, to be able to handle Directory Index and Directory Listings.
This C struct must be generated by either a script (in our case, done by the Makefile), or the C program which parses the files.

The lookup method to find a specific path may vary, from a simple loop (slow and inefficient, but simple) to more advanced techniques such as MPHF (Minimal Perfect Hash Function), which are faster, but may be considered overkill for a web server serving just a few pages.

Step 3: Directory index and listing

By default, trying to access a URI which is a directory returns either the directory's index.html file, or 403 Forbidden.
Use either a global configuration (command-line arguments or configuration file), or per-directory .htaccess files to change the default behaviour.

Step 4: Using HTML templates

If the web server is CGI-enabled, using HTML templates instead of static HTML pages offers more flexibility.

Step 5: Dynamically loaded library

To make web contents maintenance as flexible as possible, separate the web server code from the libwwwmem library, which is dynamically loaded at run-time.
The libwwwmem library contains a list of functions, one function to access each web content object, and the web content objects themselves.
Being a dynamically loaded library, the web content may be modified/added/deleted and then recompiled and reloaded without recompiling, or even stopping, the web server itself.

Step 6: Test suite

As libwwwmem is compiled, test routines may be added to validate static HTML, htaccess directives, and template syntax at compile time.

libwwwmem details

Step 1: Static HTML

Using objcopy:

    rm -rf build; mkdir build
    ld -r -o build/www_css_style_css.o [-z execstack] --format=binary www/css/style.css
    objcopy --rename-section .data=.rodata,alloc,load,readonly,data,contents build/www_css_style_css.o
    nm build/www_css_style_css.o
        00000000000000b2 D _binary_www_css_style_css_end
        00000000000000b2 A _binary_www_css_style_css_size
        0000000000000000 D _binary_www_css_style_css_start
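From C, symbols like these are usually declared as extern arrays, and the size is taken as the distance between the start and end symbols. The sketch below wraps that calculation in a hypothetical helper named `blob_size`. The extern declarations are shown only in a comment, since they link only when the ld-generated object file is part of the build.

```c
#include <assert.h>
#include <stddef.h>

/*
 * In a program linked with build/www_css_style_css.o, the symbols listed by
 * nm above would be declared like this:
 *
 *   extern const unsigned char _binary_www_css_style_css_start[];
 *   extern const unsigned char _binary_www_css_style_css_end[];
 */

/* Hypothetical helper: number of bytes between the start and end symbols. */
static size_t blob_size(const unsigned char *start, const unsigned char *end)
{
    assert(start != NULL && end != NULL && end >= start);
    return (size_t)(end - start);
}
```

With the declarations above in place, blob_size(_binary_www_css_style_css_start, _binary_www_css_style_css_end) should give 0xb2 (178) for the file in this example, matching the nm output.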

Using xxd:

    rm -rf build; mkdir build
    xxd -i www/css/style.css | cc -c -xc -g -Wall -Wextra -Werror -ansi -pedantic - -o build/www_css_style_css.o
    nm build/www_css_style_css.o
        0000000000000000 D www_css_style_css
        00000000000000b4 D www_css_style_css_len

The HTTP request:

        GET /css/style.css HTTP/1.0

The HTTP response:

        HTTP/1.0 200 OK
        Content-Length: 178
        Content-Type: text/css
        Date: Fri, 31 Dec 1999 23:59:59 GMT
        Date: Fri, 31 Dec 1999 23:59:59 GMT

        body {
            font-family: Arial;
            background-color: #bbf0ff;
        }
        .
        .
        .

The problem is that there is no way to map the path string /css/style.css to the symbol _binary_www_css_style_css_start from within the web server program at run time.

Some additional information is required to map the file to the symbol:

To be able to return either the HTTP/1.0 200 OK or the HTTP/1.0 404 Not Found, there must be a way to search for a symbol, and return 404 if the symbol does not exist. The search is done using a hash table.

The index of each entry (bucket) in the hash table is the hash value of the full path (i.e. hash_function("/css/style.css")).
Each bucket is a linked list, normally with a single element (unless collisions occurred).

The element is a key/value pair.
The key is the string of full path to the file (i.e. /css/style.css).
The key is only used when there is more than one element in the linked list, to search for the matching element in the list.
The value is a structure:

This results in 3 structures:

    #include <stddef.h>                   /* size_t */

    /* The web content object struct */
    typedef struct web_content_t
    {
      const unsigned char *content_start; /* _binary_www_css_style_css_start */
      const size_t content_len;           /* _binary_www_css_style_css_size  */
      const char *mime;                   /* MIME type, e.g. "text/plain"    */
      struct web_content_t *next;
    } web_content_t;


    /* The hash element (bucket) struct */
    typedef struct hash_element_t
    {
      const char *key;                    /* The file path, e.g. "/css/style.css" */
      web_content_t *value;               /* The web content object for the path  */
      struct hash_element_t *next;        /* Next element in the same bucket      */
    } hash_element_t;

    /* The hash table struct */
    typedef struct hash_table_t
    {
      hash_element_t *bucket;             /* The dynamic array of buckets of keys/values */
    } hash_table_t;
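A minimal sketch of how a lookup over such a chained hash table might work is shown below. Several details are assumptions: the hash function is djb2 (the text does not name one), the table stores an array of bucket head pointers plus a bucket count, and the structs are trimmed to the fields the lookup needs. Unlike the description above, the key is always compared, because a path that does not exist can still hash to an occupied bucket and must give 404.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct web_content_t {
    const unsigned char *content_start;
    size_t content_len;
    const char *mime;
} web_content_t;

typedef struct hash_element_t {
    const char *key;                 /* file path, e.g. "/css/style.css" */
    const web_content_t *value;
    struct hash_element_t *next;     /* next element in the same bucket  */
} hash_element_t;

typedef struct hash_table_t {
    hash_element_t **bucket;         /* array of bucket head pointers    */
    size_t n_buckets;
} hash_table_t;

/* djb2 string hash (an assumption, any string hash would do). */
static unsigned long hash_function(const char *s)
{
    unsigned long h = 5381;

    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Put an element at the head of its bucket's linked list. */
static void hash_insert(hash_table_t *t, hash_element_t *e)
{
    size_t i;

    assert(t != NULL && e != NULL && t->n_buckets > 0);
    i = hash_function(e->key) % t->n_buckets;
    e->next = t->bucket[i];
    t->bucket[i] = e;
}

/* Return the content for path, or NULL (the server would answer 404). */
static const web_content_t *hash_lookup(const hash_table_t *t, const char *path)
{
    const hash_element_t *e;

    assert(t != NULL && path != NULL && t->n_buckets > 0);
    for (e = t->bucket[hash_function(path) % t->n_buckets]; e != NULL; e = e->next)
        if (strcmp(e->key, path) == 0)
            return e->value;
    return NULL;
}
```

Since all content is known at compile time, the elements and the bucket array could be emitted as static data by the build, so no allocation would be needed at run time.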

MIME

The MIME type may be obtained using the file tool:

    file -b --mime-type www/css/style.css
        text/plain

The commonly used MIME types on the web:

text/plain
text/html
text/css
text/javascript
image/png
image/gif
image/jpeg
image/svg+xml
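Since file reports text/plain for the CSS file above, a server may prefer to derive the MIME type from the file extension. The sketch below is one possible mapping for the types listed; the function name `mime_from_path`, the extension table, and the fallback type application/octet-stream are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Extension to MIME type table for the types listed above. */
static const struct {
    const char *ext;
    const char *mime;
} mime_table[] = {
    { ".txt",  "text/plain" },
    { ".html", "text/html" },
    { ".css",  "text/css" },
    { ".js",   "text/javascript" },
    { ".png",  "image/png" },
    { ".gif",  "image/gif" },
    { ".jpg",  "image/jpeg" },
    { ".jpeg", "image/jpeg" },
    { ".svg",  "image/svg+xml" }
};

/* Hypothetical helper: look up the MIME type by the last ".ext" in path.
 * The comparison is case-sensitive; unknown extensions get a generic type. */
static const char *mime_from_path(const char *path)
{
    const char *ext;
    size_t i;

    assert(path != NULL);
    ext = strrchr(path, '.');
    if (ext != NULL) {
        for (i = 0; i < sizeof mime_table / sizeof mime_table[0]; i++)
            if (strcmp(ext, mime_table[i].ext) == 0)
                return mime_table[i].mime;
    }
    return "application/octet-stream";
}
```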

With xxd, the symbol name can be changed before compiling (see also https://stackoverflow.com/questions/15594988/objcopy-prepends-directory-pathname-to-symbol-name):

    xxd -i input.txt | sed 's/input_txt/test/' | gcc -c -xc - -o obj.o

Step 2: Lookup methods

The paths of all web content files and directories must be included in a list. This list of paths is used as the keys for the lookup method.

The lookup method to find a specific path may vary, from a loop using strcmp (slow and inefficient, but simple) to more advanced techniques such as MPHF (Minimal Perfect Hash Function), which are faster, but may be considered overkill for a web server serving just a few pages.
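The simplest variant, a linear search with strcmp, might look like the sketch below. The struct `path_entry_t` and the function `lookup_loop` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: a path and the content it maps to. */
typedef struct path_entry_t {
    const char *path;               /* key, e.g. "/css/style.css" */
    const unsigned char *content;   /* start of the embedded data */
    size_t content_len;             /* length of the data in bytes */
} path_entry_t;

/* Linear search: compare the requested path with every key in turn.
 * Returns the matching entry, or NULL if the path is unknown. */
static const path_entry_t *lookup_loop(const path_entry_t *table, size_t n,
                                       const char *path)
{
    size_t i;

    assert(table != NULL && path != NULL);
    for (i = 0; i < n; i++)
        if (strcmp(table[i].path, path) == 0)
            return &table[i];
    return NULL;
}
```

The cost grows linearly with the number of keys, which is why this is only suggested for very small sites.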

To be able to handle Directory index and Directory listing, all directory paths must be included as separate keys in the list. That is, if /path/to/an/image/pic1.png is a file, the following entries have to be in the list:

/path/to/an/image/pic1.png
/path/to/an/image/
/path/to/an/
/path/to/
/path/

There are probably more efficient ways to solve this problem.
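One way to generate the directory keys is to strip the last path component repeatedly. The helper below, with the hypothetical name `parent_dir`, truncates a path in place to its parent directory. Note that, as written, it also yields the root directory /, which the list above leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: truncate path in place to its parent directory,
 * keeping the trailing '/'.
 * Returns 1 if a parent was found, 0 if there is none (path is left as is). */
static int parent_dir(char *path)
{
    size_t len;

    assert(path != NULL);
    len = strlen(path);
    /* Skip a trailing '/' so that "/a/b/" moves up to "/a/". */
    if (len > 0 && path[len - 1] == '/')
        len--;
    /* Drop the last path component. */
    while (len > 0 && path[len - 1] != '/')
        len--;
    if (len == 0)
        return 0;
    path[len] = '\0';
    return 1;
}
```

Calling it in a loop on a copy of each file path, and adding every result to the key list (skipping duplicates), would produce the directory entries.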

The lookup methods:

|-------------------|-----------------|-------------------------------------------------------------------------------------|
| Number of keys    | Search function | Description                                                                         |
|-------------------|-----------------|-------------------------------------------------------------------------------------|
| very small (< 10) | loop            | Not using hash function at all.                                                     |
| very small (< 10) | hf_oa_minimal   | Open-addressing probing, with collisions.                                           |
| small (< 100)     | hf_oa_perfect   | Open-addressing probing, without collisions.                                        |
| medium (< 10 000) | hf_mphf_bob     | MPHF by Bob Jenkins.                                                                |
| big (>= 10 000)   | hf_bdz_ph       | BDZ_PH, extracted from the CMPH library.                                            |
|-------------------|-----------------|-------------------------------------------------------------------------------------|

Step 3: Directory index and listing

Directory index file

The default behaviour for accessing the URI /path/to/ is to return /path/to/index.html and status 200 OK.
Use DirectoryIndex to specify another filename instead of index.html, or a list of filenames, which are checked for existence from left to right.

Directory listing

The default behaviour when /path/to/index.html (or whatever file is indicated by DirectoryIndex) does not exist is to return status 403 Forbidden.
Use Options +Indexes to return status 200 OK and list all files in the directory.
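The decision described above can be sketched as a small function. Everything here is hypothetical: the names `path_exists` and `resolve_directory`, the plain array of known paths standing in for the real lookup method, and the simplification that DirectoryIndex holds a single filename rather than a list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the real lookup method: linear search in a list of keys. */
static int path_exists(const char *const *paths, size_t n, const char *path)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (strcmp(paths[i], path) == 0)
            return 1;
    return 0;
}

/* Decide how to answer a request for the directory URI dir (ending in '/').
 * Returns the HTTP status. On 200 with a non-empty file, serve that file;
 * on 200 with an empty file, generate a directory listing. */
static int resolve_directory(const char *const *paths, size_t n,
                             const char *dir, const char *index_name,
                             int indexes_enabled,
                             char *file, size_t file_size)
{
    int len;

    assert(paths != NULL && dir != NULL && index_name != NULL);
    assert(file != NULL && file_size > 0);

    len = snprintf(file, file_size, "%s%s", dir, index_name);
    if (len >= 0 && (size_t)len < file_size && path_exists(paths, n, file))
        return 200;                       /* index file exists */
    file[0] = '\0';
    return indexes_enabled ? 200 : 403;   /* listing, or forbidden */
}
```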

Usage

To use these options, do one of the following:

Step 4: Using HTML templates

The HTML template: the tags

The HTML template is inspired by these two libraries:

HTML::Template (Perl)
C Template Library (C)

These are the original HTML::Template template tags:

TMPL_VAR
TMPL_LOOP
TMPL_INCLUDE
TMPL_IF
TMPL_ELSE
TMPL_UNLESS

The C Template Library offers some additional tags, but to keep things simple (and compatible with HTML::Template), the additional tags are not supported:

TMPL_ELSIF
TMPL_BREAK
TMPL_CONTINUE

The HTML template: the tags - details

Tags may be written as HTML comments (useful when validating templates as valid HTML):

<TMPL_VAR NAME="PARAM1">
<!-- TMPL_VAR NAME=PARAM1 -->

Variables may be escaped as HTML, JS, URL:

<TMPL_VAR NAME="PARAM1" ESCAPE=HTML>
<TMPL_VAR NAME="PARAM1" ESCAPE=JS>
<TMPL_VAR NAME="PARAM1" ESCAPE=URL>
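As an illustration of what ESCAPE=HTML might do to a value, the sketch below replaces the characters that are special in HTML with entities. The function name `escape_html` and the exact set of escaped characters (&, <, >, double and single quotes) are assumptions, not the documented behaviour of libwwwmem or HTML::Template.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src to dst, replacing HTML special characters
 * with entities. dst is always NUL-terminated and never ends in a partial
 * entity. Returns the length the full result needs, so a return value
 * >= dst_size means the output was truncated. */
static size_t escape_html(const char *src, char *dst, size_t dst_size)
{
    size_t needed = 0;
    size_t written = 0;
    int full = 0;

    assert(src != NULL && dst != NULL && dst_size > 0);
    for (; *src != '\0'; src++) {
        char single[2];
        const char *rep;
        size_t len;

        switch (*src) {
        case '&':  rep = "&amp;";  break;
        case '<':  rep = "&lt;";   break;
        case '>':  rep = "&gt;";   break;
        case '"':  rep = "&quot;"; break;
        case '\'': rep = "&#39;";  break;
        default:
            single[0] = *src;
            single[1] = '\0';
            rep = single;
            break;
        }
        len = strlen(rep);
        if (!full && written + len < dst_size) {
            memcpy(dst + written, rep, len);
            written += len;
        } else {
            full = 1;   /* stop copying, but keep counting */
        }
        needed += len;
    }
    dst[written] = '\0';
    return needed;
}
```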

Variables may have a default value:

<TMPL_VAR NAME="PARAM1" DEFAULT="the devil">

A libwwwmem template that includes another template uses the same syntax as HTML::Template:

<TMPL_INCLUDE NAME="filename.tmpl">

The difference is that, when using libwwwmem, "filename.tmpl" is internally mapped to the pointer _binary_filename_tmpl_start instead of reading "filename.tmpl".

The HTML template: the library functions

Difference from HTML::Template:

The libwwwmem library functions are named with the HTML::Template method names in mind, but not all methods are implemented.
Most HTML::Template methods are overloaded, which is not permitted in C, so libwwwmem includes either only one "version" of each corresponding HTML::Template method, or none.

Examples:

As a template always is an in-memory object, the HTML::Template methods for dealing with file caching do not make sense with libwwwmem.

libwwwmem functions:

wwwmem_tmpl_new()
wwwmem_tmpl_config()
wwwmem_tmpl_param()
wwwmem_tmpl_clear_params()
wwwmem_tmpl_output()
wwwmem_tmpl_output_to()
wwwmem_tmpl_query()

%%% ascii/utf8/utf16 ???

Step 5: Dynamically loaded library

TODO: DETAILS

Step 6: Test suite

TODO: DETAILS

Links

man(1) objcopy:
https://linux.die.net/man/1/objcopy

man(1) ld:
https://linux.die.net/man/1/ld

man(1) xxd:
https://linux.die.net/man/1/xxd

xxd source code:
https://github.com/lwilletts/xxd

Tutorials:
http://www.linuxjournal.com/content/embedding-file-executable-aka-hello-world-version-5967
https://dvdhrm.wordpress.com/tag/objcopy/
https://gareus.org/wiki/embedding_resources_in_executables

HTML template:
https://metacpan.org/pod/HTML::Template (Perl)
http://libctemplate.sourceforge.net/doc.html (C)

HTTP:
https://www.jmarshall.com/easy/http/