#
# README for the docmetrics.py Script
#
# $Id: docmetrics-README.txt 33967 2008-08-19 07:37:23Z toms $


------------------------------------------------------------------
1. Purpose
------------------------------------------------------------------
The docmetrics tool collects statistics about the releases of the
documentation department. It mainly gathers page counts and word
counts and outputs them in a form suitable for consumption by
humans or by other programs.


------------------------------------------------------------------
2. Precondition
------------------------------------------------------------------
The script needs PDF and XML files in order to work correctly. To
minimize the programming effort, the PDF and XML files of a book must
have the same base name and differ only in their file extension. For
example, if you have a book FOO, you need FOO.pdf and FOO.xml.gz
(yes, packed!).

All PDF and XML files are stored under /suse/lxbuch/Export/RELEASE.
RELEASE is a placeholder for an openSUSE or SLE (business) product.
Currently, openSUSE releases are 100, 101, 102, 103, 110, 111, ...,
whereas the business products are mainly SLES and SLED. They are stored
in SLES10, SLES10-SP1, SLES10-SP2, etc. The same applies to SLED.
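
The naming and storage conventions above can be sketched in Python as
follows. This is only an illustration: the helper name
deliverable_paths and its export_root parameter are assumptions and
are not taken from docmetrics.py.

```python
import os

EXPORT_ROOT = "/suse/lxbuch/Export"  # directory named in this README


def deliverable_paths(release, pdf_name, export_root=EXPORT_ROOT):
    """Return (pdf_path, xml_path) for one deliverable in one release.

    Hypothetical helper: the XML file shares the base name of the PDF
    and carries the extension .xml.gz instead of .pdf.
    """
    base, ext = os.path.splitext(pdf_name)
    if ext.lower() != ".pdf":
        raise ValueError("expected a .pdf file name, got %r" % pdf_name)
    release_dir = os.path.join(export_root, release)
    return (os.path.join(release_dir, pdf_name),
            os.path.join(release_dir, base + ".xml.gz"))
```

For example, deliverable_paths("SLES10-SP1", "FOO.pdf") yields the
paths of FOO.pdf and FOO.xml.gz under /suse/lxbuch/Export/SLES10-SP1.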

NOTE: Each XML file MUST contain only the respective book! This keeps
structural changes in other parts of the documentation, which have
nothing to do with your book, out of the comparison.
Normally the root element is book or article, NOT set! If you get
a set after "make bigfile", run the following command (you have to
source an ENV file first):

$ xsltproc --output XML_REDUCED --stringparam rootid $ROOTID \
  $DTDROOT/novdoc/xslt/misc/reduce-from-set.xsl \
  BIGFILE

(Maybe this will be automated in the future.) Pack the XML_REDUCED
file with gzip -9, rename it if necessary, and store it under
/suse/lxbuch/Export/RELEASE/.
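
If you prefer to pack the file from Python instead of calling gzip
on the command line, a minimal sketch could look like this. The
function name pack_xml is an assumption; it is not part of
docmetrics.py.

```python
import gzip
import shutil


def pack_xml(xml_path, gz_path):
    """Compress xml_path into gz_path at level 9, similar to `gzip -9`.

    Unlike the gzip command, the uncompressed file is left in place.
    """
    with open(xml_path, "rb") as src, \
            gzip.open(gz_path, "wb", compresslevel=9) as dst:
        shutil.copyfileobj(src, dst)
    return gz_path
```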


------------------------------------------------------------------
3. Project Files
------------------------------------------------------------------
A project file is essentially a CSV (comma-separated values) file with
optional comments. It lists all deliverables of a certain project.
For example, the openSUSE project file contains the KDE and GNOME
Userguide, the KDE and GNOME Quickstart, Startup, Reference, and a
few others.

A project file must have a header and can have optional comments
(beginning with #). The header for the openSUSE release looks like this:

Name,100,101,102,103,110

The header entries are separated by commas. The first entry
("Name") labels the column that gives each deliverable a distinctive,
meaningful name. The following entries name the releases. Each release
entry has two purposes: (1) it names the column of the following rows,
and (2) it points to a directory under /suse/lxbuch/Export.

In the above example, we have 5 releases (100, 101, 102, 103, and 110).
Each of these releases has its own directory under /suse/lxbuch/Export.
So the first release (100) points to /suse/lxbuch/Export/100, the
second release (101) to /suse/lxbuch/Export/101, and so forth.

The subsequent lines define the deliverables. For example, the line
for the KDE Quickstart looks like this:

KDEQuick,,,opensuse-kdequick_en.pdf,opensuse-kdequick_en.pdf,opensuse-kdequick_en.pdf

The first entry (KDEQuick) is just the name of this deliverable. The next
two entries are empty, meaning there is no KDE Quickstart for the 100 and
101 releases. The following entries point to the PDF file of this book. In
this case, they are all named opensuse-kdequick_en.pdf. But remember, these
files live in different directories!
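
To illustrate the format, here is a minimal sketch that reads such a
project file with Python's csv module. The function name read_project
and the returned data layout are assumptions; docmetrics.py may
organize this differently. Skipping blank lines is also an assumption.

```python
import csv
import io


def read_project(lines):
    """Split project-file lines into (releases, deliverables).

    `lines` is any iterable of text lines. Comment lines (starting
    with '#') and blank lines are skipped. The first remaining row is
    the header.
    """
    rows = csv.reader(line for line in lines
                      if line.strip() and not line.startswith("#"))
    header = next(rows)
    releases = header[1:]            # header[0] is the "Name" column
    deliverables = {}
    for row in rows:
        name, files = row[0], row[1:]
        # Pad short rows so every deliverable has one entry per release.
        files += [""] * (len(releases) - len(files))
        deliverables[name] = dict(zip(releases, files))
    return releases, deliverables


sample = io.StringIO(
    "# openSUSE project file\n"
    "Name,100,101,102,103,110\n"
    "KDEQuick,,,opensuse-kdequick_en.pdf,"
    "opensuse-kdequick_en.pdf,opensuse-kdequick_en.pdf\n"
)
releases, books = read_project(sample)
print(releases)
print(repr(books["KDEQuick"]["101"]), repr(books["KDEQuick"]["102"]))
```

Keeping one dictionary per deliverable, keyed by release, makes it
easy to pick the two releases you want to compare later on.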


------------------------------------------------------------------
4. New Books, Obsolete Books, and Changed Books
------------------------------------------------------------------
The project file from the previous section does more than list the
deliverables. Depending on which releases you compare, it also tells
you which books are new, obsolete, or changed.

Look at the KDE Quickstart from the previous section. Let's print both
the header and the deliverable line:

Name,100,101,102,103,110
KDEQuick,,,opensuse-kdequick_en.pdf,opensuse-kdequick_en.pdf,opensuse-kdequick_en.pdf

For the sake of readability, we insert some spaces. However, don't
insert them into the CSV file! It's just for this example.

Name     , 100, 101, 102                      , 103                      , 110
KDEQuick ,    ,    , opensuse-kdequick_en.pdf , opensuse-kdequick_en.pdf , opensuse-kdequick_en.pdf

Now this looks much nicer! Let's say you want to compare 10.1 with 10.2
(columns 101 and 102). The script grabs the line, splits it, and creates
a list containing only the entries for these two releases. You get
something like this:

[ "", "opensuse-kdequick_en.pdf"]

As you can see, there was no KDE Quickstart in 10.1; it was started
in 10.2. So this is obviously a new book. The same applies the other way
around: if the second entry in your list is empty, this is an obsolete
book. If both entries contain something, it must be a changed book.
However, if both entries are empty, the deliverable is not available in
either release and is excluded from any further processing.
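
The rules above can be sketched as a small Python function. The name
classify and the returned strings are assumptions chosen for this
example.

```python
def classify(source, dest):
    """Classify a deliverable from its entries in the two releases.

    `source` is the entry for the older release, `dest` the entry for
    the newer one. Returns "new", "obsolete", "changed", or None when
    the deliverable exists in neither release.
    """
    if source and dest:
        return "changed"
    if source and not dest:
        return "obsolete"
    if dest:
        return "new"
    return None


# The KDE Quickstart example: nothing in 10.1, a PDF in 10.2.
print(classify("", "opensuse-kdequick_en.pdf"))
```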


------------------------------------------------------------------
5. Technical Dependencies
------------------------------------------------------------------
The script depends on some other programs:

  * Python (of course!)
  * wdiff, to compare the XML files
  * pdfinfo, to grab the pages of your PDF
    (package poppler-tools)
  * xmlformat and the config file from
    https://svn.suse.de/svn/doc/trunk/novdoc/etc/docbook-xmlformat.conf
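
Before running the script, you can check whether the external programs
are installed. The following sketch assumes Python 3 (shutil.which is
not available in Python 2) and that the programs are installed under
the names listed above; the function name missing_tools is made up
for this example.

```python
import shutil

REQUIRED_TOOLS = ("wdiff", "pdfinfo", "xmlformat")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of all tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing programs: " + ", ".join(missing))
```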


------------------------------------------------------------------
6. Output Formatters
------------------------------------------------------------------
The script can output its statistics as plain text, CSV, and XML.
Further formatters can be added easily.

  * The text output is used to get an overview.
  * The CSV output can be used in OpenOffice Calc to get nice and
    shiny graphs.
  * The XML output can be processed with an XSLT stylesheet,
    mainly for Web services or to transform it into other output
    formats.
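
To give an idea of how such formatters can be kept interchangeable,
here is a minimal sketch. The class names, the record layout (name,
words, pages), the XML element names, and the sample values are all
invented for this example; they do not describe the real output of
docmetrics.py.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ("name", "words", "pages")   # hypothetical record layout


class TextFormatter:
    def format(self, records):
        return "\n".join("%-10s %6d words %4d pages" % r for r in records)


class CSVFormatter:
    def format(self, records):
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(FIELDS)
        writer.writerows(records)
        return out.getvalue()


class XMLFormatter:
    def format(self, records):
        root = ET.Element("docmetrics")
        for record in records:
            attrs = {key: str(value) for key, value in zip(FIELDS, record)}
            ET.SubElement(root, "book", attrs)
        return ET.tostring(root, encoding="unicode")


# A new formatter only needs a format() method and an entry here.
FORMATTERS = {"text": TextFormatter, "csv": CSVFormatter, "xml": XMLFormatter}

records = [("KDEQuick", 5200, 12)]    # made-up sample values
print(FORMATTERS["csv"]().format(records))
```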


------------------------------------------------------------------
7. Wdiff Output
------------------------------------------------------------------
The script delegates all statistical calculations to the wdiff tool.
As such, it cannot influence the statistical algorithm; it depends
on the output of wdiff.

The output of wdiff is a bit unfortunate and confusing. Docmetrics
uses the following parts of the output:

A: 3 words  1 33% common  0 0% deleted  2 66% changed
B: 2 words  1 50% common  0 0% inserted  1 50% changed

File A contains 3 words, of which 2 were changed. File B contains only
2 words, of which 1 was changed. 1 word is common to both files.
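
A statistics line in the form shown above can be split into its counts
with a regular expression. The following is a sketch, not the parser
used by docmetrics.py; the names STAT_LINE and parse_stat_line are
assumptions, and the pattern is written only against the two example
lines in this section.

```python
import re

# Matches lines such as:
#   A: 3 words  1 33% common  0 0% deleted  2 66% changed
STAT_LINE = re.compile(
    r"^(?P<file>.+): (?P<words>\d+) words?"
    r"\s+(?P<common>\d+) \d+% common"
    r"\s+(?P<delta>\d+) \d+% (?P<kind>deleted|inserted)"
    r"\s+(?P<changed>\d+) \d+% changed\s*$"
)


def parse_stat_line(line):
    """Return the absolute counts of one statistics line as a dict."""
    match = STAT_LINE.match(line)
    if match is None:
        raise ValueError("not a wdiff statistics line: %r" % line)
    return {
        "file": match.group("file"),
        "words": int(match.group("words")),
        "common": int(match.group("common")),
        # The key is "deleted" for file A and "inserted" for file B.
        match.group("kind"): int(match.group("delta")),
        "changed": int(match.group("changed")),
    }


a = parse_stat_line("A: 3 words  1 33% common  0 0% deleted  2 66% changed")
b = parse_stat_line("B: 2 words  1 50% common  0 0% inserted  1 50% changed")
```

Only the absolute counts are kept. The percentages printed by wdiff
are ignored here because they can be recomputed from the counts.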

The percentage values are calculated like this:

  words%    = A(words) / B(words)
  deleted%  = A(deleted) / A(words)
  inserted% = B(inserted) / B(words)
  common%   = A(common) / A(words)
  changed%  = A(changed) / A(words) # defined by Tom
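
Written as Python, the formulas above could look like the following
sketch. The function name percentages and the dictionary layout are
assumptions. The results are fractions (for example 0.5), not values
multiplied by 100, and the sketch does not guard against files with
zero words.

```python
def percentages(a, b):
    """Apply the formulas above to the counts of file A and file B.

    `a` needs the keys words, deleted, common, and changed;
    `b` needs the keys words and inserted.
    """
    return {
        "words": a["words"] / b["words"],
        "deleted": a["deleted"] / a["words"],
        "inserted": b["inserted"] / b["words"],
        "common": a["common"] / a["words"],
        "changed": a["changed"] / a["words"],
    }


# Counts taken from the wdiff example in this section.
a = {"words": 3, "common": 1, "deleted": 0, "changed": 2}
b = {"words": 2, "common": 1, "inserted": 0, "changed": 1}
result = percentages(a, b)
```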

If there are new or obsolete files, the values are set to:

  * New files
    words%    = 1.0
    deleted%  = 0.0
    inserted% = 1.0
    common%   = 0.0
    changed%  = 1.0

  * Obsolete files
    words%    = -1.0
    deleted%  = -1.0
    inserted% = 0.0
    common%   = 0.0
    changed%  = -1.0


------------------------------------------------------------------
8. Pages
------------------------------------------------------------------
The script needs not only two XML files but also two PDF files.
They are analyzed with pdfinfo. Only the line starting with "Pages:"
is evaluated; everything else is skipped.

The percentage value is calculated similarly to the wdiff section:

  pages% = B(pages) / A(pages)
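
Extracting the page count from the text printed by pdfinfo can be
sketched like this. The names PAGES_LINE, page_count, and pages_ratio
are assumptions, and the sample text is shortened, made-up pdfinfo
output used only for illustration.

```python
import re

PAGES_LINE = re.compile(r"^Pages:\s+(\d+)\s*$", re.MULTILINE)


def page_count(pdfinfo_output):
    """Return the number from the "Pages:" line of pdfinfo's output."""
    match = PAGES_LINE.search(pdfinfo_output)
    if match is None:
        raise ValueError("no 'Pages:' line found")
    return int(match.group(1))


def pages_ratio(pages_a, pages_b):
    """pages% = B(pages) / A(pages), returned as a fraction."""
    return pages_b / pages_a


sample = "Title:          Example\nPages:          12\nEncrypted:      no\n"
print(page_count(sample))
```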



------------------------------------------------------------------
9. Algorithm
------------------------------------------------------------------
The script uses the following algorithm:

  1. Open the project file and get header and deliverables
    a. Read the header
    b. Omit any lines beginning with #. These are comments.
    c. Append each deliverable to a Python list
  2. Iterate through the list of deliverables
    a. if source and dest
       => put it in a dict of changed books
    b. if source and not dest
       => put it in a dict of obsolete books
    c. if not source and dest
       => put it in a dict of new books
    d. else
       => Neither exists; skip it and continue with the next line
  3. Get information
    a. Get the pages of the PDF file with pdfinfo
    b. Unpack the source and dest XML files and store the unpacked
       files under /tmp with a temporary name
    c. Format the unpacked XML files in place with the xmlformat utility.
    d. Compare the formatted XML files with wdiff -s to get statistical
       information. Suppress the diff output.
    e. Calculate the percentage values
  4. Format the result according to the current formatter classes
    a. Format the dict of new books
    b. Format the dict of obsolete books
    c. Format the dict of changed books
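
The unpacking and comparison steps of the "Get information" part can
be sketched as follows. This is not the code of docmetrics.py: the
function names are invented, Python 3.5 or later is assumed, and the
wdiff call relies on the GNU wdiff options -s (statistics) and -1, -2,
-3 (suppress deleted, inserted, and common words) and on its documented
exit status (1 when the files differ, 2 on trouble). The xmlformat step
is left out.

```python
import gzip
import shutil
import subprocess
import tempfile


def unpack_to_temp(gz_path):
    """Unpack a .xml.gz file into a temporary file and return its path.

    The file is created in the default temporary directory (usually
    /tmp). The caller is responsible for deleting it afterwards.
    """
    with gzip.open(gz_path, "rb") as src, \
            tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as dst:
        shutil.copyfileobj(src, dst)
        return dst.name


def wdiff_statistics(xml_a, xml_b):
    """Run wdiff on two files and return its textual output.

    The word-level diff is suppressed with -1 -2 -3. Lines that are
    not statistics lines may still appear and should be ignored by
    the caller.
    """
    result = subprocess.run(
        ["wdiff", "-s", "-1", "-2", "-3", xml_a, xml_b],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    if result.returncode > 1:
        raise RuntimeError("wdiff failed: %s" % result.stdout)
    return result.stdout
```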
    


