Spell check the Fossil SCM source code
Part 1: Basic usage
Read this section for basic usage of the fosspell script.
Description
This software checks the Fossil SCM source code for spelling errors, duplicated words, and trailing spaces.
The result, a list of filenames and words, can be posted to the Fossil SCM mailing list.
Prerequisites
This software requires perl to run the script, and both aspell and hunspell for spell checking.
Both aspell and hunspell require their own English dictionary.
The dictionaries normally have to be installed separately.
BSD/Linux
AFAIK, all BSD/Linux distributions come with Perl already installed.
All common distributions have packages for both aspell and hunspell.
The dictionary package names may vary among distributions: en-aspell or aspell-en, en-hunspell or hunspell-en, etc.
MS Windows (using MSYS2)
On MSYS2, perl has to be installed like any other package.
pacman -S perl
pacman -S mingw-w64-x86_64-aspell
pacman -S mingw-w64-x86_64-aspell-en
pacman -S mingw-w64-x86_64-hunspell
pacman -S mingw-w64-x86_64-hunspell-en # MAY FAIL AS AN UNRECOGNIZED PACKAGE
If the installation of the hunspell-en dictionary package fails, download and install the dictionaries manually from http://wordlist.aspell.net/dicts/:
pacman -S wget
wget http://downloads.sourceforge.net/wordlist/hunspell-en_US-2016.06.26.zip
wget http://downloads.sourceforge.net/wordlist/hunspell-en_CA-2016.06.26.zip
wget http://downloads.sourceforge.net/wordlist/hunspell-en_GB-ise-2016.06.26.zip
pacman -S unzip
unzip hunspell-en_US-2016.06.26.zip
unzip hunspell-en_CA-2016.06.26.zip
unzip hunspell-en_GB-ise-2016.06.26.zip
mv en_GB-ise.aff en_GB.aff
mv en_GB-ise.dic en_GB.dic
mkdir -p /c/msys64/mingw64/share/hunspell
mv *.aff *.dic /c/msys64/mingw64/share/hunspell/
Mac OSX
Not tested.
Setup
Clone the Fossil repository:
fossil clone http://www.fossil-scm.org/ fossil.fossil
Clone this repository:
fossil clone http://kuu.se/fossil/fosspell/ fosspell.fossil
Open the Fossil repository:
mkdir fossil
cd fossil
fossil open /path/to/fossil.fossil
Open this repository (inside the Fossil repository):
mkdir spell
cd spell
fossil open --nested /path/to/fosspell.fossil
Usage
Run
./fosspell COMMAND
where COMMAND is one of
all
dup
false
help
scan
setup
spc
spell
version
Run ./fosspell all to check the entire Fossil source code tree.
Run ./fosspell help to get all the gory details.
Result
The resulting typos are stored in three text files, one for each type of typo:
typo_spell.txt: each entry contains the filename, line number, and misspelled word
typo_dup.txt: each entry contains the filename and the paragraph (possibly multiline) with the duplicated word
typo_spc.txt: each entry contains the filename and line number
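The duplicated-word and trailing-space checks can be approximated with standard tools. The following is a minimal sketch, not the actual fosspell implementation: the function names find_trailing_spaces and find_duplicated_words are made up for illustration, and the output layout only loosely follows the entries described above.

```shell
# Hypothetical helpers, not taken from fosspell itself.

# Print "filename:line" for every line that ends in a space or tab.
# /dev/null is passed as an extra file so grep always prints the filename.
find_trailing_spaces() {
  grep -n '[[:blank:]]$' /dev/null "$@" | cut -d: -f1,2
}

# Print "filename:line: word" when a word repeats the previous word,
# also across a line break. A blank line (paragraph break) resets the check.
find_duplicated_words() {
  awk '
    FNR == 1 || NF == 0 { prev = "" }
    {
      for (i = 1; i <= NF; i++) {
        w = tolower($i)
        if (w == prev && w ~ /^[a-z]+$/)
          print FILENAME ":" FNR ": " $i
        prev = w
      }
    }
  ' "$@"
}
```

Both functions take one or more filenames as arguments. Legitimate repetitions such as "had had" are reported too, so the output still needs a manual review.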
Part 2: Internals
This section covers the fosspell internals.
If you are only interested in basic usage, you can stop reading here.
Cache
fosspell uses the UNIX file utility to detect the type of each file, to know if and how to spell check it.
This is somewhat time-consuming, and is normally done only once.
The result is stored in cache files.
Re-running fosspell will use the cache instead of running file.
The cache is updated only when fosspell detects new files in the Fossil source code tree.
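The caching idea can be sketched in a few lines of shell. This is only an illustration under stated assumptions: the cache file name filetypes.cache, its tab-separated layout, and the function name cached_file_type are all hypothetical and do not describe the real fosspell cache files.

```shell
# Hypothetical cache file; one "<path><TAB><type>" entry per line.
CACHE=${CACHE:-./filetypes.cache}

# Print the type of a file, running file(1) only on a cache miss.
cached_file_type() {
  path=$1
  [ -e "$CACHE" ] || : > "$CACHE"
  # Look the path up in the cache first.
  ftype=$(awk -F '\t' -v p="$path" '$1 == p { print $2; exit }' "$CACHE")
  if [ -z "$ftype" ]; then
    # New file: classify it once and remember the answer.
    ftype=$(file -b "$path")
    printf '%s\t%s\n' "$path" "$ftype" >> "$CACHE"
  fi
  printf '%s\n' "$ftype"
}
```

Calling cached_file_type src/main.c twice runs file only the first time. The second call is answered from the cache.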
Personal dictionaries
Two dictionaries are used together with fosspell:
false.positives.easy.txt
Words which are easy to classify as false positives.
The file format is one word per line.
This file can be used directly as a personal dictionary for spell checking:
hunspell -p false.positives.easy.txt ...
false.positives.tricky.txt
False positives, where each word has its own section of one or more lines.
This is the format:
[notfound]
the --notfound option is used.
a "notfound:" tag to tell where to redirect if the particular repository requested
notfound: http://url-to-go-to-if-repo-not-found/
American English versus British English
The Fossil source code contains spellings both in American English...
skins/xekri/css.txt:
/* example ticket colors */
... and in British English:
skins/black_and_white/css.txt:
/* consistent colours */
Using the en_US dictionary, colours is detected as a misspelled word:
echo colours colors | hunspell -d en_US -l
colours
Using the en_GB dictionary, colors is detected as a misspelled word:
echo colours colors | hunspell -d en_GB -l
colors
The trick to accept both spellings is to use both dictionaries:
echo colours colors | hunspell -d en_US,en_GB -l
<no output>
False positives when spell checking source code
Compared to text written in a natural language, spell checking source code inevitably produces many more false positives.
Many parts of a source file should obviously be filtered out before spell checking takes place.
For example, in a .c or a .h file, it only makes sense to spell check comments and strings.
Another example is the .wiki files, which contain HTML tags: the tags themselves should not be spell checked, only the literal strings.
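As a rough illustration of such filtering, the sketch below pulls string literals and single-line comments out of C files with grep. It is an assumption about how this could be done, not how fosspell does it. The function names are hypothetical, multi-line comments are not handled, and a quote character inside a comment (or a comment marker inside a string) will confuse it.

```shell
# Hypothetical helpers for illustration only.

# Print the double-quoted string literals found in the given C files.
# The pattern accepts backslash escapes such as \" inside a literal.
extract_c_strings() {
  grep -h -o -E '"([^"\\]|\\.)*"' "$@"
}

# Print /* ... */ comments that start and end on the same line.
extract_c_comments() {
  grep -h -o -E '/\*([^*]|\*+[^*/])*\*+/' "$@"
}
```

The output of either function can then be piped into hunspell -l or aspell list.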
Even so, there will be many false positives.
Source code is by nature full of special technical terms, not always included in a standard English dictionary.
For example, the word SQL is known to hunspell's US English dictionary, but unknown to aspell:
echo SQL | hunspell -l -d en_US
<no output>
echo SQL | aspell list --lang=en_US
SQL
This is an example of a false positive that can easily be fixed by adding the word to a personal dictionary:
echo personal_ws-1.1 en 0 > my.false.positives.for.aspell.txt
echo SQL >> my.false.positives.for.aspell.txt
echo SQL | aspell list --lang=en_US --personal=./my.false.positives.for.aspell.txt
<no output>
We consider these false positives easy to handle, as we add them once, and forget about the problem.
Unfortunately, there are also false positives that are tricky to handle.
One example is the word notfound, which under normal circumstances is always a spelling error for not found.
In Fossil terminology, however, notfound is an option, used together with the fossil ui command, among others.
This means that notfound may or may not be a spelling error, depending on the context.
We cannot just add notfound to our personal dictionary as we did with SQL, as that would prevent us from catching future misspellings of not found.
Instead, we add all the lines where notfound appears to a separate list of tricky false positives.
Text added in the future that contains a line with notfound will be detected as a spelling error, unless the entire line matches a line in the existing list.
The user will then have two options for making the error disappear:
- Either: Fix the typo (if she/he really meant to type not found)
- Or: Add the entire line containing notfound to the existing list.
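The whole-line comparison described above can be sketched with grep alone. This is a simplified illustration, not the fosspell implementation. The function name unknown_occurrences and the idea of keeping the known lines in a plain file, one per line, are assumptions.

```shell
# Hypothetical helper: print the lines of FILE that contain WORD
# but do not appear verbatim in the list of known lines.
#   $1 = tricky word, $2 = file with known lines, $3 = file to check
unknown_occurrences() {
  word=$1; known=$2; file=$3
  # -F: fixed strings, -x: match whole lines only, -v: invert the match
  grep -F -- "$word" "$file" | grep -F -x -v -f "$known"
}
```

Any line printed by this function is either a real typo or a new legitimate use that should be appended to the list of known lines.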
Finally, there are the technical terms containing symbols, such as function and variable names, for example blob_appendf().
In any software, it is common to refer to function and variable names in comments.
Function and variable names frequently contain an underscore (_), which becomes a real headache when working with hunspell and aspell.
Both spell checkers consider _ a word separator, so in their eyes, blob_appendf() is split into blob and appendf.
blob is considered a valid word, while appendf is interpreted as a misspelling of append.
Logical for a spell checker, unfortunate for us. ☹
This part needs some real hacking... TBD.
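One possible direction, offered only as a sketch and not as the fosspell solution, is to drop identifier-like tokens before the text reaches the spell checker. The function name strip_identifiers is hypothetical, and the rule "contains an underscore, or is followed by ()" is an assumption about what counts as an identifier.

```shell
# Hypothetical pre-filter: replace identifier-like tokens with a space.
#   1st rule: any token containing an underscore, with optional "()"
#   2nd rule: any name directly followed by "()"
strip_identifiers() {
  sed -E \
    -e 's/[A-Za-z0-9_]*_[A-Za-z0-9_]*(\(\))?/ /g' \
    -e 's/[A-Za-z_][A-Za-z0-9_]*\(\)/ /g'
}
```

For example, echo 'Use of blob_appendf() to construct SQL text' | strip_identifiers | hunspell -d en_US,en_GB -l no longer sees appendf. The trade-off is that typos inside such tokens are never reported.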
False positives, easy to handle
Let's take a string from src/blob.c as an example:
char *blob_sql_text(Blob *p){
  blob_is_init(p);
  if( (p->blobFlags & BLOBFLAG_NotSQL) ){
    fossil_fatal("Internal error: Use of blob_appendf() to construct SQL text"); /* <--- LET'S SPELL CHECK THIS STRING */
  }
  return blob_str(p);
}
There are no visible spelling errors in the string.
But when we run hunspell on the string, one false positive is detected, appendf:
echo 'Internal error: Use of blob_appendf() to construct SQL text' | hunspell -d en_US,en_GB -l
appendf
%%% TODO %%% CHECK HOWTO CUSTOM ASPELL DICTIONARY AND ccpp mode %%% cat ../bld/blob_.c | aspell list --lang=en --mode=ccpp
The solution is to create a personal dictionary containing false positives:
echo appendf >> false.positives.easy.txt
Test the dictionary to check that appendf disappears:
echo 'Internal error: Use of blob_appendf() to construct SQL text' | hunspell -d en_US,en_GB -l -p false.positives.easy.txt
<no output>
To apply this technique to all files in the Fossil source code tree, fosspell does something similar to one of the examples in the hunspell(1) man page:
EXAMPLES
...
hunspell -l *.odt | sort | uniq >unrecognized
Saving unrecognized words of ODF documents (filtering duplications).
hunspell -p unrecognized_but_good *.odt
Interactive spell checking of ODF documents, using the previously saved and reduced
word list, as a personal dictionary, to speed up spell checking.
In our case, the spell checking must be done in three steps:
hunspell -l <all_text_files_but_src_code> | sort | uniq > unrecognized.words.from.text.files.txt
echo <all_words_extracted_from_c_comments> | hunspell -l | sort | uniq > unrecognized.words.from.c.comments.txt
echo <all_words_extracted_from_c_strings> | hunspell -l | sort | uniq > unrecognized.words.from.c.strings.txt
To join the three files and delete duplicates:
cat unrecognized.words.from.text.files.txt unrecognized.words.from.c.comments.txt unrecognized.words.from.c.strings.txt \
| sort | uniq > unrecognized.words.txt
Now begins the tedious task: editing unrecognized.words.txt to cut out the true spelling errors and paste them into another file.
(It is A Good Thing™ to report the contents of this file to the Fossil mailing list.)
Separating false positives from true ones has to be done manually, or at least mostly manually.
The bright side of this tedious task is that it only has to be done once.
That means that when you are using this software, the databases for false positives already exist.
There are no silver bullets to help us create the dictionaries, but a few methods reduce the work:
For example, the www/ directory contains mainly .wiki files, with a format similar to HTML, so if we tell hunspell to parse these files as HTML, we may reduce the number of false positives:
hunspell -l -d en_US,en_GB `find www/ -type f -name "*.wiki"` | sort | \
uniq > unrecognized.words.from.wiki.files.txt
hunspell -H -l -d en_US,en_GB `find www/ -type f -name "*.wiki"` | sort | \
uniq > unrecognized.words.from.wiki.files.with.H.flag.txt
wc -l unrecognized.words.*
794 unrecognized.words.from.wiki.files.txt
602 unrecognized.words.from.wiki.files.with.H.flag.txt
By using the -H flag, we have now reduced the number of words to check from 794 to 602.
When hunspell is run to offer suggestions, each output line starts with one of these signs:
OK: *
Miss: & <original> <count> <offset>: <miss>, <miss>, ...
None: # <original> <offset>
Our list of unrecognized words has no * (OK) words.
The words marked as & (Miss) come with suggestions, and may be misspelled words, even if most of them actually are false positives.
The words marked as # (None) come with no suggestions, so we can be pretty sure that they are false positives.
Let's separate the words into two files:
cat unrecognized.words.from.wiki.files.with.H.flag.txt | hunspell -d en_US,en_GB | grep '^#' | \
cut -d ' ' -f 2 > unrecognized.words.without.suggestions.from.wiki.files.txt
cat unrecognized.words.from.wiki.files.with.H.flag.txt | hunspell -d en_US,en_GB | grep '^&' | \
cut -d ' ' -f 2 > unrecognized.words.with.suggestions.from.wiki.files.txt
wc -l unrecognized.words.with*suggestions.from.wiki.files.txt
565 unrecognized.words.with.suggestions.from.wiki.files.txt
37 unrecognized.words.without.suggestions.from.wiki.files.txt
602 total
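The same split can also be done in a single pass over the hunspell output, so that hunspell only has to run once. The sketch below assumes the & and # line layout shown above. The function name split_by_suggestions and its two file arguments are hypothetical.

```shell
# Hypothetical helper: read hunspell suggestion output on stdin and
# write the checked words to two files.
#   $1 = file for words with suggestions    (lines starting with "&")
#   $2 = file for words without suggestions (lines starting with "#")
split_by_suggestions() {
  awk -v f_with="$1" -v f_without="$2" '
    $1 == "&" { print $2 > f_with }
    $1 == "#" { print $2 > f_without }
  '
}
```

A possible invocation is cat unrecognized.words.from.wiki.files.with.H.flag.txt | hunspell -d en_US,en_GB | split_by_suggestions with.txt without.txt, where the two output file names are placeholders.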
We can assume that all the words without suggestions are false positives.
In fact, almost all 37 words in unrecognized.words.without.suggestions.from.wiki.files.txt are hash strings.
This file will be used as a "base" for false positives, so we can just copy the file:
cp unrecognized.words.without.suggestions.from.wiki.files.txt false.positives.easy.txt
Now
Even so, the vast majority of these words will be false positives.
Save the edited file as false.positives.easy.txt.
%%% COPY WORDS EITHER TO easy OR TO tricky OR TO true.positives. %%%
False positives, tricky to handle
The section above shows an obvious case of a false positive:
appendf: (part of) a function name
However, there are less obvious and more ambiguous cases.
One example is the string notfound; the commonly used command fossil ui has an option called --notfound (see src/main.c, for example).
Thus, there are several C comments and strings containing the word notfound.
The common expression not found is also present in the Fossil source code (in www/tech_overview.wiki, for example).
How can hunspell know whether notfound is a typo (we really meant to type not found) or not (we are referring to the --notfound option)?
Obviously, it can't.
The problem is that simply adding notfound to the dictionary of false positives would not solve the problem, as misspelling not found as notfound would never be detected.
(Note that misspelling the other way around, notfound as not found, is much less of a problem, as the compiler or test programs would detect this misspelling as something similar to "unknown option --not found".)
One way to deal with such words is to add the entire line where the word occurs to a special database for "known tricky words".
The database contains the
Any occurrences of notfound
for l in "`grep -n notfound ../src/main.c`"; do printf "%s\n" "$l"; done
%%% --personal=./false.positives.easy.txt
notfound: src/main.c: (2361,7) (2540,51) ...
The column position is needed to deal with lines like this (src/main.c, line 2361):
** --notfound URL use URL as "HTTP 404, object not found" page.
%%% The easiest way to do this is probably to:
- run ./fosspell spell all (creates typo_spell.txt)
typo_spell.txt is quite big, so it is rather tedious to check all false positives manually. Anyway, this is a one-time job. As a helping tool,
Even after filtering out very long words and strings that obviously are not words (such as hashes), there are still many more false positives than real misspelled words.
The solution is to create a database (a plain text file) of false positives.
The easiest way to create the database is to delete the (relatively few) real misspelled words from a result, and keep the false positives as a database.
Subsequent runs of fosspell, using the same version of the Fossil source code tree, will then have no false positives.
Running fosspell on an updated version of the Fossil source code tree will probably cause a few false positives (new variable names, etc.), but it should be a minor task to add them to the database.
Use the following command to add false positives to the database:
./fosspell addfalse ?FILENAME? | ?TEXT?
Words can be added from a file, or directly from the command line.
The new words are added to the database, one per line, and the database is then re-sorted alphabetically.
A warning message is shown when trying to add words that already exist in the database.
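The described behaviour (append, warn on duplicates, re-sort) can be sketched as a small shell function. This is an illustration only: fosspell itself is a Perl script, and the function name add_false and its argument order are assumptions.

```shell
# Hypothetical helper: add words to a false-positives database.
#   $1 = database file (one word per line), remaining arguments = words
add_false() {
  db=$1; shift
  [ -e "$db" ] || : > "$db"
  for w in "$@"; do
    if grep -F -x -q -- "$w" "$db"; then
      # Word is already present: warn and skip it.
      echo "warning: '$w' is already in $db" >&2
    else
      printf '%s\n' "$w" >> "$db"
    fi
  done
  # Keep the database sorted alphabetically.
  sort -o "$db" "$db"
}
```

For example, add_false false.positives.easy.txt appendf sqlite adds both words, and repeating the call prints two warnings and leaves the file unchanged.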
TBD.
Why use both aspell and hunspell?
%%% BASICALLY: ASPELL FOR SOME STUFF, HUNSPELL FOR OTHER. NOT VERY WELL DOCUMENTED example 1: DOCUMENTATION FOR ccpp ASPELL FILTER MODE example 2: special characters possible in hunspell? perl MODE NOT DOCUMENTED
Well, aspell may be faster or more stable than hunspell, or preferable for some other reason, but:
Spell checking source code means including many non-alphabetic characters.
hunspell deals better with such characters than aspell.
cat ../src/main.c | aspell -a --lang en | grep -i Error # NOT OK
cat ../src/main.c | hunspell [-a] -d en_US | grep -i Error # SLOWER, BUT OK (-a FLAG NOT NEEDED)
When detecting tricky false positives, it is useful to be able to print the entire line where the spell checked word is found.
This can be done in hunspell using the -L option:
cat ../src/main.c | hunspell -d en_US -L | grep notfound
AFAIK, there is no such option in aspell.
Links
This page:
http://kuu.se/fossil/fosspell.cgi
Hunspell:
http://hunspell.github.io//
Text::Hunspell Perl module:
http://search.cpan.org/~eleonora/text_hunspell_1.3/Hunspell.pm