FILE block_variation VERSION 1.2
CHANGE TIME 94/10/28 14:06:26
ACCESS TIME 94/10/28 14:06:27


                             BLOCK SIZE VARIATION
                               by Gregg Walters


Why does the block size vary in a dataset whose documentation claims
that fixed length is used?  And what are these extra bytes?


GENERAL

They are meaningless, extraneous bits appended to the end of the original
data blocks.

Most of our continuing datasets have been collected as a series of updates
for many years.  It is therefore likely that the updates have been processed
on different machines, and that different internal word sizes may have been
involved.  It is also possible that the data have been migrated from older
storage media to newer media.  The same kinds of changes have also occurred
at our data sources.

The consequences of such changes could theoretically be completely eliminated,
but in practice, the workarounds and costs become a factor.  First of all,
a processing machine will usually use as many whole internal words as are
necessary to hold the raw input data, and will usually only be able to move
the entire contents of those whole words to output.  To do this it will append
bits, called "padding," to the original bits.  The software on some output
devices may also do the same thing to ensure that a certain exact multiple of
bytes is written to the target media.  All this padding should be ignored
by the users, but sometimes the software won't let them.  The NCAR system I/O
software supporting our access software handles this padding for users almost
transparently.  It provides users access at a simpler level, where it is
possible to gracefully move a stream of bits, merely given a maximum block
size.  The added expectations (and sometimes bytes) of various so-called
"standard," and generally proprietary, fixed-length blocking schemes do not
lend themselves to simple data transfers.


EXAMPLES

Every dataset has the potential for its own little variation.  These are
just a few examples that should illuminate most problems.  The first example
should be studied by all users, as it shows in detail what happens.


DS108.0 AUSTRALIAN S.HEM TROPO ANALS, DAILY 1972APR-CON

When the Australian dataset was started, we were using a CDC7600, which used
60-bit words.  The Australian data were received in a somewhat inefficient
format, which packed three 10-bit values into each of their 32-bit words.
We eliminated the wasted 2 bits and rewrote the ID at the beginning of the
data block.  Each subsequent update from Australia (about every 3 years) came
in a new format, which we converted to the format we use, so as to maintain a
daily series in one format.  The most recent update came in a very cumbersome
character format.  Considering the direction other data centers have gone
with storing their analyses, this seemed very strange.  Anyway, our "switch"
from 2805 to 2808 bytes is trivial in comparison to what Australia keeps
changing.

But where did the extra 3 bytes come from?  The 374 60-bit words hold 22440
bits.  When these data were migrated from half-inch tape, using the CDC7600,
to our TBM (our old MSS), the block size was preserved.  When the CDC7600 was
replaced by the 64-bit word Crays, we had the effect of adding a "filter."
The Cray would pad the 22440 bits to fill 351 64-bit words (it would add 24
bits), and when the data were written to the TBM, it was moving 3 more bytes,
making 2808 bytes.  When the NCAR systems group migrated our data to our
present MSS, it was done directly from the TBM, so both block sizes were
preserved.  We have considered rewriting several datasets to achieve a
consistent block size (the NMC global analyses in ds082.0 are another case),
but to date we consider the expense prohibitive.  Local users are not
affected, but we help off-site users obtain software solutions.
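
The padding arithmetic above can be checked with a short sketch.  This is
an illustration in Python, not the software that ran on the CDC7600 or the
Crays; the function name pad_to_words is hypothetical.

```python
def pad_to_words(payload_bits, word_bits):
    """Return (words, padded_bits, pad_bits) for a payload held in whole words."""
    words = -(-payload_bits // word_bits)      # ceiling division
    padded_bits = words * word_bits
    return words, padded_bits, padded_bits - payload_bits

cdc_bits = 374 * 60                            # block as written by the CDC7600
words, padded_bits, pad_bits = pad_to_words(cdc_bits, 64)

print(cdc_bits, cdc_bits // 8)                 # 22440 bits, 2805 bytes
print(words, pad_bits, padded_bits // 8)       # 351 words, 24 pad bits, 2808 bytes
```

The same function applies to any word size, so it can be used to predict
the padded length of other blocks that pass through a 64-bit machine.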


DS082.0 NMC GLOBAL TROPO ANALS, DAILY 1976JUL-CON

The story for the "global grids" is very similar to the "Australian grids."
Historically, we receive the updates from NMC on a monthly, if not weekly,
basis, rather than every 3 years or so.  Please read the text above about the
Australian analyses, before considering the following details about the NMC
globals.

The handling of the NMC global analyses has involved 4 different block sizes:
10784, 10785, 10792, and the 10778 as received from NMC, which is 86224 bits.
These required 1438 60-bit words on our bygone CDC7600, or 86280 bits, which
is 10785 bytes.  In 1983, when these 10785 byte blocks were moved from
half-inch tapes to the TBM via the Crays, the Crays padded them out to 1349
64-bit words, which is 10792 bytes.  So for the period 1976jul-1983mar, users
will find blocks of 10792 bytes.  Occasionally there are volumes with a mix
of 10784 and 10792 (where, after the 1983 migration, we inserted gap fillers
for missing data).  Beginning 1983apr, they are always 10784 bytes.  But they
never match the original NMC 10778 bytes because the Crays pad these to 10784.
The compression software for this dataset will always write out as many words
as are used to read the data, although some versions may actually be tailored
to eliminate some of the extraneous padding.
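
The chain of block sizes described above can be traced with a short
sketch.  It only reproduces the arithmetic; the function name padded_bytes
is hypothetical, and the code is not part of our access or compression
software.

```python
def padded_bytes(nbytes, word_bits):
    """Size in bytes after a block is padded out to whole words of word_bits."""
    bits = nbytes * 8
    words = -(-bits // word_bits)              # ceiling division
    return -(-(words * word_bits) // 8)        # round up to whole bytes

nmc = 10778                                    # block as received from NMC
via_cdc = padded_bytes(nmc, 60)                # held in 60-bit CDC7600 words
via_cdc_then_cray = padded_bytes(via_cdc, 64)  # then moved through a Cray
cray_only = padded_bytes(nmc, 64)              # handled on a Cray from the start

print(via_cdc, via_cdc_then_cray, cray_only)   # 10785 10792 10784
```

Each pass through a machine with a different word size can only add
padding, which is why the older blocks are the longest.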


DS353.1  NMC ADP GLOBAL UPPER AIR OBS (MIXED), DAILY 1985-CON
DS353.4  NMC ADP GLOBAL UPPER AIR OBS SUBSETS, DAILY 1973-CON
DS464.0  NMC ADP GLOBAL SFC OBS, DAILY JUL1976-CON

Please read the text above about the Australian analyses, before considering
the following details about these ADP datasets.

Prior to 1976Sep15,18Z, blocks are 5120 bytes.  Beginning 1976Sep15,18Z,
blocks are 6440 or 6432 bytes.

Historically, we receive the updates from NMC on a monthly, if not weekly,
basis.  The observational data contained therein have reports that can vary in
length, and NMC pads the data blocks.  The change from 5120 to 6432 was done
at NMC, probably to improve efficiency by reducing the number of inter-record
gaps on the tapes.  The difference between 6432 and 6440 involves 8 bytes
added by the output device software at NMC.

When we compress a selection of data for a user, this difference is eliminated
by the output algorithm, which always uses 6440.  But when we merely copy the
data volumes, the 5120 (or 6432) is preserved.  Theoretically, we could use the
compress algorithm to "filter" data copies, but this makes for an expensive use
of Cray time.  Fortunately our access software recognizes the report pointers
within any of these block sizes, and properly extracts the data.  This
software will ignore extraneous padding beyond the 5120, 6432 or 6440.  In
theory, we could rewrite everything in still bigger blocks, to eliminate
still more inter-record gaps, but the newer technologies involving data
cartridges have killed this concern.
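
For users writing their own readers, the block sizes quoted above can be
kept in a small lookup.  This is a sketch only: the names CHANGEOVER and
expected_block_sizes are hypothetical, and the date is assumed to be the
time of the data in the block, with naive datetimes taken as UTC.

```python
from datetime import datetime

# Changeover given in the text: 1976 Sep 15, 18Z.
CHANGEOVER = datetime(1976, 9, 15, 18)

def expected_block_sizes(data_time):
    """Return the block sizes (in bytes) quoted for data at the given time."""
    if data_time < CHANGEOVER:
        return {5120}
    return {6432, 6440}

print(sorted(expected_block_sizes(datetime(1975, 1, 1, 0))))
print(sorted(expected_block_sizes(datetime(1976, 9, 15, 18))))
```

A block longer than the sizes returned here is not necessarily bad, since
additional padding may follow the useful data.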


DS24X.X  (various Navy gridded analyses)

Please read the text above about the Australian analyses, before considering
the following details about the Navy analyses.

For some data from July 1973 to July 1984, and then from May 1990 to April
1994 (when the Navy unilaterally decided to terminate shipment) we processed
the Navy grids with an algorithm that implemented our decision to carry some
extraneous bits that appeared after the end of the Navy blocks, but which were
previously ignored.  These were appended to our "fixed-length" blocks as extra
bytes (up to 120).  These are nevertheless of no use to the user, and should
probably be ignored.


CONCLUSION

The extra bytes always appear at the end of the data block, beyond the
useful, documented information.  To solve the problem of variable block
size, you need to be able to read the maximum block size, and then to
ignore the padding.  If the documentation for a particular dataset does
not indicate a variation in block size, you may still need to specify an
input array that can handle a few more bytes or words, and use I/O software
that can tolerate the variation.
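
The approach can be sketched in a few lines of Python.  This is an
illustration under stated assumptions, not our access software: the name
extract_payload is hypothetical, the blocks are assumed to have already been
separated by the I/O layer, and the example takes the 10778 bytes received
from NMC for ds082.0 as the useful length.

```python
def extract_payload(block, documented_len, max_len):
    """Return the documented part of a block, dropping any trailing padding."""
    if not documented_len <= len(block) <= max_len:
        raise ValueError(f"unexpected block length {len(block)}")
    return block[:documented_len]

# Stand-in data, not real grids.
payload = bytes(i % 251 for i in range(10778))

# The three padded sizes reported for ds082.0.
for size in (10784, 10785, 10792):
    # Zero-filled padding is an assumption; real padding bits are arbitrary.
    block = payload + bytes(size - len(payload))
    assert extract_payload(block, 10778, 10792) == payload
```

Because only the leading bytes are kept, the same call works whichever
padded size a given volume happens to contain.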

For observed data, try printing out the observations.  Do they make sense?
For analyzed grids, try contouring.  Do they make sense?  If these simple
dumps of the results of your extractions look reasonable, you very probably
have solved the problem.  And you can move on to more complicated analysis
of the data.  It may be difficult to evaluate the data extraction on the
basis of the results of complicated analyses.  Check the data extraction
as early as possible in your processing.

We regret that staff limitations have prevented us from keeping pace with
needed documentation updates.  The availability of new means to deliver this
information has both mitigated and exacerbated the problem.  Many of our
datasets still have documentation on paper only, and based on a 60-bit word
orientation.  In that regard, the on-line MASTER file for a dataset may often
have information more current than the related on-line format file, or the
hardcopy.  Please contact the specialist for the dataset if you need more
help.