Spell check the Fossil SCM source code

Part 1 : Basic usage

Read this section for basic usage of the fosspell script.

Description

This software checks the Fossil SCM source code for spelling errors, duplicated words, and trailing spaces.
The result, a list of filenames and words, could be posted to the Fossil SCM mailing list.

Prerequisites

This software requires perl to run the script, and both aspell and hunspell for spell checking.
aspell and hunspell each require their own English dictionary.
The dictionaries normally have to be installed separately.

BSD/Linux

AFAIK, all BSD/Linux distributions come with Perl already installed.
All common distributions have packages for both aspell and hunspell.
The dictionary package names may vary among distributions:
en-aspell or aspell-en, en-hunspell or hunspell-en, etc.

MS Windows (using MSYS2)

On MSYS2, perl has to be installed like any other package:


    pacman -S perl
    pacman -S mingw-w64-x86_64-aspell
    pacman -S mingw-w64-x86_64-aspell-en
    pacman -S mingw-w64-x86_64-hunspell
    pacman -S mingw-w64-x86_64-hunspell-en   # may fail as an unrecognized package

If the installation of the hunspell-en dictionary package fails, download and install the dictionaries manually from http://wordlist.aspell.net/dicts/:

    pacman -S wget
    wget http://downloads.sourceforge.net/wordlist/hunspell-en_US-2016.06.26.zip
    wget http://downloads.sourceforge.net/wordlist/hunspell-en_CA-2016.06.26.zip
    wget http://downloads.sourceforge.net/wordlist/hunspell-en_GB-ise-2016.06.26.zip
    pacman -S unzip
    unzip hunspell-en_US-2016.06.26.zip
    unzip hunspell-en_CA-2016.06.26.zip
    unzip hunspell-en_GB-ise-2016.06.26.zip
    mv en_GB-ise.aff en_GB.aff
    mv en_GB-ise.dic en_GB.dic
    mkdir -p /c/msys64/mingw64/share/hunspell
    mv *.aff *.dic /c/msys64/mingw64/share/hunspell/

Mac OSX

Not tested.

Setup

  1. Clone the Fossil repository:

    fossil clone  http://www.fossil-scm.org/  fossil.fossil
    
  2. Clone this repository:

    fossil clone  http://kuu.se/fossil/fosspell/  fosspell.fossil
    
  3. Open the Fossil repository:

    mkdir fossil
    cd fossil
    fossil open /path/to/fossil.fossil
    
  4. Open this repository (inside the Fossil repository):

    mkdir spell
    cd spell
    fossil open --nested /path/to/fosspell.fossil
    

Usage

Run

    ./fosspell COMMAND

where COMMAND is one of

Run ./fosspell all to check the entire Fossil source code tree.
Run ./fosspell help to get all the gory details.

Result

The resulting typos are stored in three text files, one for each type of typo:

Part 2 : Internals

This section covers the fosspell internals.
If you are only interested in basic usage, you can stop reading here.

Cache

fosspell uses the UNIX file utility to detect different types of files, to know if and how to spell check them.
This is somewhat time-consuming, and is normally done only once.
The result is stored in cache files.
Re-running fosspell will use the cache instead of running file.
The cache is updated only when fosspell detects new files in the Fossil source code tree.
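
The caching idea described above can be sketched in a few lines of shell. This is only an illustration, not the code fosspell actually uses: the names CACHE_FILE, DETECT and cached_type are made up for this example, and the cache format (one tab-separated "path, type" record per line) is an assumption.

```shell
# Hypothetical names, not taken from fosspell: CACHE_FILE, DETECT, cached_type.
CACHE_FILE=${CACHE_FILE:-filetypes.cache}
# Command used to detect the file type; defaults to the UNIX file utility.
DETECT=${DETECT:-"file -b --mime-type"}

# Print the type of a file, running the detector only on a cache miss.
cached_type() {
  path=$1
  touch "$CACHE_FILE"
  # Assumed cache format: one "path<TAB>type" record per line.
  hit=$(awk -F '\t' -v p="$path" '$1 == p { print $2; exit }' "$CACHE_FILE")
  if [ -z "$hit" ]; then
    # Not cached yet: run the (slow) detector once and remember the result.
    hit=$($DETECT "$path")
    printf '%s\t%s\n' "$path" "$hit" >> "$CACHE_FILE"
  fi
  printf '%s\n' "$hit"
}
```

Calling cached_type ../src/main.c twice would run file only the first time; the second call is answered from the cache file.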

Personal dictionaries

Two dictionaries are used together with fosspell:

American English versus British English

The Fossil source code contains spellings both in American English...

skins/xekri/css.txt: /* example ticket colors */

... and in British English:

skins/black_and_white/css.txt: /* consistent colours */

Using the en_US dictionary, colours is detected as a misspelled word:

    echo colours colors | hunspell -d en_US -l
    colours

Using the en_GB dictionary, colors is detected as a misspelled word:

    echo colours colors | hunspell -d en_GB -l
    colors

The trick to accept both spellings is to use both dictionaries:

    echo colours colors | hunspell -d en_US,en_GB -l
    <no output>

False positives when spell checking source code

Compared to text written in a natural language, spell checking source code inevitably produces many more false positives.
Many parts of the source code should obviously be filtered out before spell checking takes place.
For example, in a .c or a .h file, it only makes sense to spell check comments and strings.
Another example is the .wiki files, containing HTML tags, where the tags themselves should not be spell checked, only the literal strings.
Even so, there will be many false positives.
Source code is by nature full of special technical terms, not always included in a standard English dictionary.
For example, the word SQL is a known word to hunspell's US English dictionary, but unknown to aspell:

    echo SQL | hunspell -l -d en_US
    <no output>

    echo SQL | aspell list --lang=en_US
    SQL

The example above is an example of a false positive, which can easily be fixed by adding the word to a personal dictionary:

    echo personal_ws-1.1 en 0 > my.false.positives.for.aspell.txt
    echo SQL >> my.false.positives.for.aspell.txt
    echo SQL | aspell list --lang=en_US --personal=./my.false.positives.for.aspell.txt
    <no output>

We consider these false positives easy to handle, as we add them once, and forget about the problem.

Unfortunately, there are also false positives that are tricky to handle.
One example is the word notfound, which, under normal circumstances, is always a spelling error for not found.
In Fossil terminology, however, notfound is an option, used together with the fossil ui command, among others.
This means that notfound may or may not be a spelling error, depending on the context.
We cannot just add notfound to our personal dictionary as we did with SQL, as that would prevent us from catching future misspellings of not found.
Instead, we add all the lines where notfound appears to a separate list of tricky false positives.
Text added in the future, containing a line with notfound, will be detected as a spelling error, unless the entire line matches a line in the existing list.
The user will then have two options for the error to disappear:

Finally, there are technical terms containing symbols, such as function and variable names, for example blob_appendf().
In any software, it is common to refer to function and variable names in comments.
Function and variable names frequently contain an underscore (_), which becomes a real headache when working with hunspell and aspell.
Both spell checkers treat _ as a word separator, so in their eyes, blob_appendf() is split into blob and appendf. blob is considered a valid word, while appendf is interpreted as a misspelling of append.
Logical for a spell checker, unfortunate for us. ☹
This part needs some real hacking... TBD.
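
One possible direction, shown here only as a rough sketch and not as what fosspell does, is to strip anything that looks like an identifier before the text reaches the spell checker. The function name strip_identifiers is made up for this example, and the rule "a token containing an underscore is an identifier" is an assumption that will also throw away a few legitimate words.

```shell
# Hypothetical pre-filter (not part of fosspell): delete every token that
# contains an underscore, together with a directly following "()".
strip_identifiers() {
  sed -E 's/[[:alnum:]_]*_[[:alnum:]_]*(\(\))?//g'
}
```

With this filter, the text "Use of blob_appendf() to construct SQL text" is reduced to "Use of  to construct SQL text" before spell checking, so neither blob nor appendf is ever seen by hunspell or aspell. Identifiers without an underscore are not caught by this rule.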

False positives, easy to handle

Let's take a string from src/blob.c as an example:


    char *blob_sql_text(Blob *p){
      blob_is_init(p);
      if( (p->blobFlags & BLOBFLAG_NotSQL) ){
        fossil_fatal("Internal error: Use of blob_appendf() to construct SQL text"); /* <--- LET'S SPELL CHECK THIS STRING */
      }
      return blob_str(p);
    }

There are no visible spelling errors in the string.
But when we run hunspell on the string, one false positive is detected, appendf:

    echo 'Internal error: Use of blob_appendf() to construct SQL text' | hunspell -d en_US,en_GB -l

    appendf

%%% TODO %%% CHECK HOWTO CUSTOM ASPELL DICTIONARY AND ccpp mode %%% cat ../bld/blob_.c | aspell list --lang=en --mode=ccpp

The solution is to create a personal dictionary containing false positives:

    echo appendf >> false.positives.easy.txt

Test the dictionary to check that appendf disappears:

    echo 'Internal error: Use of blob_appendf() to construct SQL text' | hunspell -d en_US,en_GB -l -p false.positives.easy.txt
    <no output>

To apply this technique to all files in the Fossil source code tree, fosspell does something similar to one of the examples in the hunspell(1) man page:

    EXAMPLES
          ...
          hunspell -l *.odt | sort | uniq >unrecognized
          Saving unrecognized words of ODF documents  (filtering  duplications).

          hunspell -p unrecognized_but_good *.odt
          Interactive  spell  checking  of ODF documents, using the previously saved and reduced
          word list, as a personal dictionary, to speed up spell checking.

In our case, the spell checking must be done in three steps:

  1. hunspell -l <all_text_files_but_src_code> | sort | uniq > unrecognized.words.from.text.files.txt
  2. echo <all_words_extracted_from_c_comments> | hunspell -l | sort | uniq > unrecognized.words.from.c.comments.txt
  3. echo <all_words_extracted_from_c_strings> | hunspell -l | sort | uniq > unrecognized.words.from.c.strings.txt
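
Steps 2 and 3 depend on first pulling comments and strings out of the C files. A deliberately simplistic sketch of such an extraction is shown here. The function names are made up, and this is not fosspell's own extraction code. It handles only comments that start and end on the same line, treats two comments on one line as a single match, and ignores escaped quotes inside strings.

```shell
# Hypothetical helpers (not fosspell's implementation). Both read C source
# on standard input and print one extracted item per line.

# Contents of double-quoted string literals; escaped quotes are not handled.
extract_c_strings() {
  grep -oE '"[^"]*"' | sed -e 's/^"//' -e 's/"$//'
}

# Contents of /* ... */ comments that open and close on the same line.
extract_c_comments() {
  grep -oE '/\*.*\*/' |
    sed -E -e 's#^/\*[[:space:]]*##' -e 's#[[:space:]]*\*/$##'
}
```

The output could then be fed to the spell checker in the same way as in the steps above, for example extract_c_strings < ../src/blob.c | hunspell -d en_US,en_GB -l | sort | uniq.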

To join the three files and delete duplicates:

    cat unrecognized.words.from.text.files.txt unrecognized.words.from.c.comments.txt unrecognized.words.from.c.strings.txt \
    | sort | uniq > unrecognized.words.txt

Now begins the tedious task: editing unrecognized.words.txt to cut out the true spelling errors and paste them into another file.
(It is A Good Thing™ to report the contents of this file to the Fossil mailing list.)
Separating false positives from true ones has to be done manually, or at least mostly manually. The bright side of this tedious task is that it only has to be done once. That means that when you are using this software, the databases for false positives already exist.

There are no silver bullets to help us create the dictionaries, but there are a few methods to reduce the task:

By using the -H flag, we have now reduced the number of words to check from 792 to 602.

Our list of unrecognized words has no * (OK) words. The words marked as & (Miss) offer suggestions, and may be misspelled words, even if most of them actually are false positives.
The words marked as # (None) offer no suggestions, so we can be pretty sure that they are false positives.

Let's separate the words into two files:

    cat unrecognized.words.from.wiki.files.with.H.flag.txt | hunspell -d en_US,en_GB | grep '^#' | \
    cut -d ' ' -f 2 > unrecognized.words.without.suggestions.from.wiki.files.txt

    cat unrecognized.words.from.wiki.files.with.H.flag.txt | hunspell -d en_US,en_GB | grep '^&' | \
    cut -d ' ' -f 2 > unrecognized.words.with.suggestions.from.wiki.files.txt

    wc -l unrecognized.words.with*suggestions.from.wiki.files.txt
        565 unrecognized.words.with.suggestions.from.wiki.files.txt
         37 unrecognized.words.without.suggestions.from.wiki.files.txt
        602 total

We can assume that all the words without suggestions are false positives.
In fact, almost all 37 words in unrecognized.words.without.suggestions.from.wiki.files.txt are hash strings.
This file will be used as a "base" for false positives, so we can just copy the file:

    cp unrecognized.words.without.suggestions.from.wiki.files.txt false.positives.easy.txt

Now

Even so, the vast majority of these words will be false positives.
Save the edited file as false.positives.easy.txt.

%%% COPY WORDS EITHER TO easy OR TO tricky OR TO true.positives. %%%

False positives, tricky to handle

The section above shows an obvious case of a false positive:

However, there are less obvious and more ambiguous cases.
One example is the string notfound; the commonly used command fossil ui has an option called --notfound (see src/main.c, for example).
Thus, there are several C comments and strings containing the word notfound.
The common expression not found is also present in the Fossil source code (in www/tech_overview.wiki, for example).
How can hunspell know if notfound is a typo (we really meant to type not found) or not (we are referring to the notfound option mentioned above)?
Obviously, it can't.
The problem is that simply adding notfound to the dictionary of false positives would not solve the problem, as misspelling not found as notfound would never be detected.
(Note that misspelling the other way around, notfound as not found, is much less of a problem, as the compiler or test programs would detect this misspelling as something similar to "unknown option --not found".)

One way to deal with such words is to add the entire line where the word occurs to a special database for "known tricky words".
The database contains the Any occurrences of notfound

    grep -n notfound ../src/main.c | while IFS= read -r l; do printf '%s\n' "$l"; done
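
The "entire line" matching could look roughly like the sketch here. It is an illustration only: the function name unknown_lines and the idea of keeping the known lines in a plain text file are assumptions, not fosspell's actual implementation.

```shell
# Hypothetical helper: print every line that contains WORD unless exactly
# the same line is already listed in the file of known (accepted) lines.
# Usage: unknown_lines WORD KNOWN_LINES_FILE FILE...
unknown_lines() {
  word=$1
  known=$2
  shift 2
  # First grep:  -h no file name prefix, -F fixed string (no regex).
  # Second grep: -v invert, -x whole-line match, -f patterns from a file.
  grep -hF -- "$word" "$@" | grep -vxF -f "$known"
}
```

A line that was accepted once stays silent on later runs, while a new or edited line containing notfound shows up again and has to be reviewed.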

%%% --personal=./false.positives.easy.txt

    notfound: src/main.c: (2361,7) (2540,51) ...

The column position is needed to deal with lines like this (src/main.c, line 2361):

    **   --notfound URL   use URL as "HTTP 404, object not found" page.
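
A list of (line,column) pairs in the format shown above could be produced with a small awk helper like this one. The function name word_positions is made up for this sketch. Columns are counted from 0, which is an assumption based on the (2361,7) example, where --notfound is preceded by seven characters.

```shell
# Hypothetical helper: print "WORD: FILE: (line,col) (line,col) ..." for
# every occurrence of WORD in FILE. Columns are 0-based (assumption).
# Usage: word_positions WORD FILE
word_positions() {
  awk -v w="$1" -v f="$2" '
    {
      line = $0
      offset = 0                      # characters already consumed on this line
      while ((i = index(line, w)) > 0) {
        out = out sprintf(" (%d,%d)", NR, offset + i - 1)
        offset += i - 1 + length(w)
        line = substr(line, i + length(w))
      }
    }
    END { if (out != "") printf "%s: %s:%s\n", w, f, out }
  ' "$2"
}
```

On the comment line quoted above, only the option name at column 7 is reported; the phrase not found later on the same line does not match.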

%%% The easiest way to do this is probably to:

  1. run ./fosspell spell all - creates typo_spell.txt
  2. typo_spell.txt is quite big, so it is rather tedious to check all false positives manually. Anyway, this is a one-time job. As a helping tool,

Even when filtering out very long words and/or strings that obviously are not words (such as hashes), there are still far more false positives than real misspelled words.
The solution is to create a database (a plain text file) of false positives.
The easiest way to create the database is to delete the (relatively few) real misspelled words from a result, and keep the false positives as a database.
Subsequent runs of fosspell, using the same version of the Fossil source code tree, will then have no false positives.
Running fosspell using an updated version of the Fossil source code tree will probably cause a few false positives (new variable names, etc.),
but it should be a minor task to add them to the database.
Use the following command to add false positives to the database:

    ./fosspell addfalse ?FILENAME? | ?TEXT?

Words can be added from a file, or as words directly from the command line.
The new words are added, one per line, to the database, which is then re-sorted alphabetically.
A warning message is shown when trying to add already existing words to the database.
TBD.
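
Until the command is specified in detail, the behaviour described above (append, warn on duplicates, re-sort) can be approximated with a few lines of shell. This is a sketch under assumptions: the function name add_false_positives and the default database file name are chosen for this example only, and only words given on the command line are handled, not a FILENAME argument.

```shell
# Hypothetical sketch of the addfalse behaviour; DB is an assumed file name.
DB=${DB:-false.positives.easy.txt}

# Usage: add_false_positives WORD...
add_false_positives() {
  touch "$DB"
  for word in "$@"; do
    if grep -qxF -- "$word" "$DB"; then
      # The word is already in the database: warn and skip it.
      echo "warning: '$word' is already in $DB" >&2
    else
      printf '%s\n' "$word" >> "$DB"
    fi
  done
  # Re-sort the database in place.
  sort -o "$DB" "$DB"
}
```

For example, add_false_positives appendf sqlite adds both words; running the same command again prints two warnings and leaves the file unchanged.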

Why use both aspell and hunspell?

%%% BASICALLY: ASPELL FOR SOME STUFF, HUNSPELL FOR OTHER. NOT VERY WELL DOCUMENTED example 1: DOCUMENTATION FOR ccpp ASPELL FILTER MODE example 2: special characters possible in hunspell? perl MODE NOT DOCUMENTED

Well, aspell may be faster/more stable/other reason than hunspell, but:

  1. Spell checking source code means including many non-alphabetic characters.
    hunspell deals better with such characters than aspell.

    cat ../src/main.c | aspell  -a --lang en| grep -i Error # NOT OK
    cat ../src/main.c | hunspell -d en_US | grep -i Error # SLOWER, BUT OK (-a FLAG NOT NEEDED)
    
  2. When detecting tricky false positives, it is useful to be able to print the entire line where the spell checked word is found.
    This can be done in hunspell using the -L option:

    cat ../src/main.c | hunspell -d en_US  -L | grep notfound
    

AFAIK, there is no such option in aspell.

Links

This page:
http://kuu.se/fossil/fosspell.cgi
Hunspell:
http://hunspell.github.io//
Text::Hunspell Perl module:
http://search.cpan.org/~eleonora/text_hunspell_1.3/Hunspell.pm