Introduction#

This document will describe the WHP-Exchange formats used by the CCHDO for CTD and bottle. The WHP-exchange formats provide simplified exchange and improved readability of hydrographic data. WHP-exchange data files carry the essential information from CTD and water sample profiles. WHP-exchange is a rigorously-described comma-delimited (csv) format designed to ease data exchange and simplify data import.

Overview#

The WHP-exchange bottle and CTD data formats include these features:

UTF-8 Encoded
Spreadsheet-like
Comma-delimited values (csv)
No special meaning to blank/empty spaces
Station information in every line in the file (bottle) or in the top lines in each file (CTD)
Only one missing data value defined for all parameters
Positions in decimal degrees
Dates in ISO 8601 YYYYMMDD format

File Types and Names#

There are three types of WHP-exchange format files, each with a unique 8-character suffix:

Data Type	Filename Suffix	Description
CTD data	_ct1.csv	One CTD profile in WHP-exchange format
CTD data	_ct1.zip	Zip archive containing one or more _ct1.csv WHP-exchange CTD files
Bottle data	_hy1.csv	Data from one or more bottle profiles in WHP-exchange format

Requirement Levels#

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

About Text Encodings and UTF-8#

Text on Computers#

As strange as it may seem, there is no such thing in computing as ‘plain text’. Computers only understand binary, on or off, commonly represented by zeros and ones. For a series of zeros and ones to have meaning to humans, there needs to be an agreed upon standard for what any specific set of binary data represents. As an example, in 8-bit ASCII (ANSI X3.4-1986) the capital letter A is represented by the binary 01000001 (hex 41). 7-bit ASCII is limited to representing 127 characters, which is fine for most English speaking countries.

As the use of computers spread to non english speaking countries, it became necessary to extend the encodings to support other characters needed. However, most systems still only supported 8-bit bytes, with a maximum of 255 different characters it could represent. With more characters needing to be represented than space available, a proliferation of incompatible encoding standards occurred. There are at least 15 parts of ISO 8859, 6 JIS standards for Japanese, 3 for Chinese, 9 encodings specific to the Windows operating system, and 16 DOS code pages. Unicode was created to provide a unified way of representing all the characters which occur in most writing systems, including those of dead languages.

The unicode standard itself is not an encoding standard, but rather a list characters with a number assigned to each one, these numbers are what are called code points. For example, the capital letter A is the 65th letter in unicode, usually written in the hex 41. In the standard way of writing code points, this would be written as U+0041. You may notice that the unicode point for the capital letter A is the same as in ASCII, this feature was exploited to create the most common text encoding on the internet, UTF-8.

Character encodings were created to represent, in binary, all the code points allowed within unicode. One encoding in particular has become the dominant one for text on the internet [1], UTF-8. UTF-8 is a variable length encoding, meaning a character can take anywhere from 1 to 6 bytes to represent. In UTF-8 the first 127 characters of unicode are encoded with only byte. Since the first 127 code points in unicode are exactly the same as ASCII, the UTF-8 representation of any unicode character less than 128 is ASCII. This allows forward compatibility of ASCII with UTF-8, and if containing only code points below 128, a UTF-8 file to be backwards compatible with ASCII.

Unicode Representation in this Document#

Character in this document will be defined as unicode points in the format U+#### where the # symbols are hexadecimal numbers. Since exchange files are defined to be UTF-8 encoded, this unambiguously specifies the exact bits which must occur in a file.

Versioning#

The format spec follows semantic versioning x.y.z where:

x is the major version number, incrementing this number would be due to breaking changes to the exchange format spec, we do not expect this to happen ever.
y is the minor version number and would indicate “non-breaking” changes to the specification, examples of non-breaking changes could include clarifications, the addition of new synthesized parameters (like _FLAG_W).
z is the release number and would be incremented for things such as grammar changes, spelling corrections, and updating the parameters list, the parameters list is versioned independently.

The components of the numbers are incremented independently, e.g. the version after 1.0.9 would be 1.0.10.