CSON

CSON (Cursive Script Object Notation) is a superset of JSON that can be written by hand and translated to canonical JSON.

Status: Draft design with a known implementation.

Example

# CSON example
pi: 3.141592
e = 2.718281828, 'foo': 'bar'
"nested" = ["JSON array",
            {and = "JSON object"},
            "with a trailing comma", # yes!
            # yes, the comment can be inside JSON arrays/objects as well
           ]
"verbatim": |a verbatim string
            |  keeps the preceding whitespace
            |    and joins all lines with `\n`
            |      as you see, no escape sequence is processed
            |        and this string does not have a trailing \n -->
i18n: {
  한국어: "Korean"
  日本語: "Japanese"
  汉语-or-漢語: "Chinese"
  ᏣᎳᎩ: "Cherokee"
}

should translate to:

{"pi": 3.141592,
 "e": 2.718281828, "foo": "bar",
 "nested": ["JSON array",
            {"and": "JSON object"},
            "with a trailing comma"
           ],
 "verbatim": "a verbatim string\n  keeps the preceding whitespace\n    and joins all lines with `\\n`\n      as you see, no escape sequence is processed\n        and this string does not have a trailing \\n -->",
 "i18n": {
   "\ud55c\uad6d\uc5b4": "Korean",
   "\u65e5\u672c\u8a9e": "Japanese",
   "\u6c49\u8bed-or-\u6f22\u8a9e": "Chinese",
   "\u13e3\u13b3\u13a9": "Cherokee"
 }
}
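
As a quick sanity check, the i18n part of the canonical output can be fed to any ordinary JSON parser. The following is only a sketch using Python's standard json module (Python is an arbitrary choice here and not part of CSON). It shows that the \uXXXX escapes decode back to the bare keys used in the CSON source.

```python
import json

# The "i18n" member of the canonical JSON above, copied as a raw string so
# that the \uXXXX escapes reach the JSON parser untouched.
canonical = r'''
{"i18n": {
   "\ud55c\uad6d\uc5b4": "Korean",
   "\u65e5\u672c\u8a9e": "Japanese",
   "\u6c49\u8bed-or-\u6f22\u8a9e": "Chinese",
   "\u13e3\u13b3\u13a9": "Cherokee"
 }
}
'''

data = json.loads(canonical)
keys = list(data["i18n"])  # json.loads keeps the order of the members
print(keys)

# Going the other way, json.dumps escapes non-ASCII characters by default,
# which gives the escaped form used in the canonical output.
print(json.dumps(keys[0]))
```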

Syntax

CSON is defined as a set of grammar additions to RFC 4627, which formally defines JSON. So without further ado, here is the delta:

  JSON-text = object
            / array
+           / object-items

  begin-array     = ws %x5B ws    ; [ left square bracket
  begin-object    = ws %x7B ws    ; { left curly bracket
  end-array       = ws %x5D ws    ; ] right square bracket
  end-object      = ws %x7D ws    ; } right curly bracket
  name-separator  = ws %x3A ws    ; : colon
+                 / ws %x3D ws    ; = equal sign
  value-separator = ws %x2C ws    ; , comma
+                 / newline ws

  ws = *(
            %x20 /                ; Space
            %x09 /                ; Horizontal tab
-           %x0A /                ; Line feed or New line
-           %x0D                  ; Carriage return
+           newline-char /
+           comment
        )
+ newline = *(%x20 / %x09) newline-char
+ newline-char = %x0A             ; Line feed or New line
+              / %x0D             ; Carriage return
+ comment = sharp *comment-char
+ sharp = %x23                    ; # sharp
+ comment-char = %x00-09 / %x0B-0C / %x0E-10FFFF

- value = false / null / true / object / array / number / string
+ value = false / null / true / object / array / number / string
+       / verbatim-string

  false = %x66.61.6c.73.65        ; false
  null  = %x6e.75.6c.6c           ; null
  true  = %x74.72.75.65           ; true

- object = begin-object [ member *( value-separator member ) ] end-object
+ object = begin-object [ object-items ] end-object
+ object-items = member *( value-separator member ) [ value-separator ]
- member = string name-separator value
+ member = name name-separator value
+ name = string / bare-string

- array = begin-array [ value *( value-separator value ) ] end-array
+ array = begin-array [ array-items ] end-array
+ array-items = value *( value-separator value ) [ value-separator ]

  number = [ minus ] int [ frac ] [ exp ]
  decimal-point = %x2E            ; .
  digit1-9 = %x31-39              ; 1-9
  e = %x65 / %x45                 ; e E
  exp = e [ minus / plus ] 1*DIGIT
  frac = decimal-point 1*DIGIT
  int = zero / ( digit1-9 *DIGIT )
  minus = %x2D                    ; -
  plus = %x2B                     ; +
  zero = %x30                     ; 0

- string = quotation-mark *char quotation-mark
+ string = quotation-mark *dquoted-char quotation-mark
+        / apostrophe-mark *squoted-char apostrophe-mark
- char = unescaped /
-        escape (
+ dquoted-char = dquoted-unescaped / escaped
+ squoted-char = squoted-unescaped / escaped
+ escaped = escape (
+            %x27 /               ; '    apostrophe      U+0027
             %x22 /               ; "    quotation mark  U+0022
             %x5C /               ; \    reverse solidus U+005C
             %x2F /               ; /    solidus         U+002F
             %x62 /               ; b    backspace       U+0008
             %x66 /               ; f    form feed       U+000C
             %x6E /               ; n    line feed       U+000A
             %x72 /               ; r    carriage return U+000D
             %x74 /               ; t    tab             U+0009
             %x75 4HEXDIG )       ; uXXXX                U+XXXX
  escape = %x5C                   ; \
  quotation-mark = %x22           ; "
+ apostrophe-mark = %x27          ; '
- unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
+ dquoted-unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
+ squoted-unescaped = %x20-26 / %x28-5B / %x5D-10FFFF

+ verbatim-string = verbatim-fragment *(newline ws verbatim-fragment)
+ verbatim-fragment = pipe *verbatim-char
+ pipe = %x7C                     ; |
+ verbatim-char = %x20-10FFFF

+ bare-string = id-start *id-end

+ ; a union of JS identifier and XML name. see below for the rationale.
+ id-start = %x24 / %x2D / %x41-5A / %x5F / %x61-7A / %xAA / %xB5
+          / %xBA / %xC0-D6 / %xD8-F6 / %xF8-02FF / %x0370-037D
+          / %x037F-1FFF / %x200C-200D / %x2070-218F / %x2C00-2FEF
+          / %x3001-D7FF / %xF900-FDCF / %xFDF0-FFFD / %x10000-EFFFF
+ id-end = id-start / %x2E / %x30-39 / %xB7 / %x0300-036F / %x203F-2040

Changes from JSON:

The entire value is assumed to be an object if no delimiter is found. You can write a CSON file just like an INI file.
A comma can be replaced with a newline.
A colon can be replaced with an equal sign.
A string can be quoted with apostrophes instead of quotation marks.
A line comment starting with # is allowed.
A bare string, which is analogous to Perl’s bare word, can be used in the key position. A bare string is like a string but without quotes and without whitespace inside. It is not allowed in the value position, so there is no confusion with true/false. (Ah, of course true/false in the key position should be converted to a string as in JavaScript.)
A verbatim string, which is analogous to Python’s raw triple-quoted string, can be used. A verbatim string is one or more lines starting with | (the first line can be prefixed with other constructs). The | of the first line is dropped, and each following occurrence of <newline><spaces>| is replaced with \n. It does not undergo any other processing, so backslashes and quotes are kept exactly as written.
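
The joining rule for verbatim strings can be sketched in a few lines of Python. This is only an illustration and not taken from any CSON implementation: the function name decode_verbatim is made up, and it assumes the caller has already cut the first line down to its own | marker and removed blank lines and comments between fragments.

```python
def decode_verbatim(lines):
    """Join the physical lines of one verbatim string into its value.

    Hypothetical helper: every line must look like <spaces>|<text>.
    """
    fragments = []
    for line in lines:
        stripped = line.lstrip(" \t")      # indentation before | is ignored
        if not stripped.startswith("|"):
            raise ValueError("not a verbatim fragment: %r" % line)
        fragments.append(stripped[1:])     # drop the marker, keep the rest
    return "\n".join(fragments)            # no trailing newline is added


source = [
    "|a verbatim string",
    "            |  keeps the preceding whitespace",
    "            |    and joins all lines with `\\n`",
]
value = decode_verbatim(source)
print(repr(value))
```

Note that the backslash in the last line stays a backslash followed by n, since no escape sequence is processed.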

Otherwise the same usage and restrictions as JSON apply. For example, the Unicode encoding detection specified in Section 3 of RFC 4627 can be adapted to CSON as well.

There is one ambiguity in this grammar. The following fragment…

[
  |one
  |two
  |three
]

…can be interpreted both as ["one", "two", "three"] and as ["one\ntwo\nthree"]. The parser should choose the latter. If you want to write an array of verbatim strings, you can do so as follows:

[ |one
, |two
, |three
]
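
The disambiguation rule amounts to reading verbatim fragments greedily. Here is a rough Python sketch of that behavior for the body of an array that holds nothing but verbatim strings. The function read_verbatim_items is hypothetical, and a real parser would of course handle every kind of value, not just this special case.

```python
def read_verbatim_items(lines):
    """Split the body of an array that holds only verbatim strings.

    A line starting with a comma opens a new item; any other line starting
    with | continues the previous item (the greedy reading).
    """
    items = []
    for line in lines:
        text = line.lstrip(" \t")
        opens_item = not items              # the very first fragment
        if text.startswith(","):
            opens_item = True               # explicit value separator
            text = text[1:].lstrip(" \t")
        if not text.startswith("|"):
            raise ValueError("expected a verbatim fragment: %r" % line)
        fragment = text[1:]
        if opens_item:
            items.append(fragment)
        else:
            items[-1] += "\n" + fragment    # same string, next line
    return items


greedy = read_verbatim_items(["  |one", "  |two", "  |three"])
separate = read_verbatim_items(["|one", ", |two", ", |three"])
print(greedy)
print(separate)
```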

Rationale

What is an intended use of CSON?

Same as JSON: machine-readable semi-structured data. CSON is equivalent to JSON but with a fancier and more readable syntax. It is a strict superset of JSON (its syntax is defined in terms of the RFC 4627 JSON syntax) and can be converted to canonical JSON.

Why not just JSON?

You know, JSON is very annoying to write or edit by hand. Wouldn’t it be great to have syntactic sugar for JSON with all the frills, bells and whistles?

Why not YAML?

YAML is a human friendly data serialization standard for all programming languages.

See the rationale for no user-defined types, no recursive structures and whitespace-insensitive syntax. Those should be enough.

Why not TOML?

TOML is like INI, only better.

While it does have a niche as a configuration format, I strongly feel that TOML has its own flaws:

Its syntax is quite natural, but it is a bit underspecified, which is a PITA for implementations. For example, how about [][].[][]? How about foo bar = "quux"? How about 한글 = "Hangul"? And seriously, no Unicode escape? Really?
It is not a superset of JSON as currently specified. Of course it does not have to be, but its JSON-like fragment does not support the full JSON syntax anyway.
Having a datetime literal is not bad, but do not underestimate the complexity of ISO 8601. TOML should really specify what is allowed and what is not; supporting all of ISO 8601 is quite far from what a simple serialization format should do. And again, no timezone, seriously?
And it still does not fix many other annoyances with JSON: no single quoted string, no verbatim string, no multi-line string.

CSON is explicitly designed to avoid these problems. It has a defined grammar which fixes many problems with JSON but avoids going too far.

Why no user-defined types?

JSON does not have them. See also the rationale for JSON compatibility.

Besides that, having them means that every implementation has to recognize the major user-defined types (otherwise they are useless), which is a major complexity burden. Many user-defined types can be simulated as an object with a reserved key like $type, which can be readily inspected without prior knowledge of those types.
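
To make the $type idea concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the "date" tag, the "value" key and the sample data are made up, and CSON itself does not define any of them. The point is that a generic tool can list the tags without understanding them, while an application that knows a tag can convert the object.

```python
import json
from datetime import date

# Canonical JSON as it might come out of a CSON translator (made-up data).
text = '{"when": {"$type": "date", "value": "2000-01-02"}, "n": 1}'
doc = json.loads(text)


def type_tag(value):
    """Return the $type tag of a value, or None for a plain JSON value."""
    if isinstance(value, dict):
        return value.get("$type")
    return None


# A generic tool can inspect the tags with no knowledge of the types.
tags = {key: type_tag(value) for key, value in doc.items()}


def revive(value):
    """Convert the tags this application knows; pass everything else on."""
    if type_tag(value) == "date":
        return date.fromisoformat(value["value"])
    return value


revived = {key: revive(value) for key, value in doc.items()}
print(tags)
print(revived)
```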

And of course, complexity is the enemy of security. YAML had a big security problem recently.

Why no recursive structures?

JSON does not have them. See also the rationale for JSON compatibility.

Besides that, such a restriction is actually a good thing! Unlike programming languages, data formats should be limited in computational power (and similarly, expressiveness) in order to be efficiently processed. If your data format is Turing-complete, you cannot inspect the data until it has been executed. If your data format is mutable (can be overwritten during data processing), you can’t be sure of its contents until processing has finished.

Yes, recursive structures are not necessarily harmful. LISP has supported recursive structures for a long time and even has proper serialization and deserialization algorithms. But this “feature” is not without complexity: a naive algorithm cannot traverse a recursive structure, and it introduces free-form identifiers which are tagged but not tied to the structures they point to. (This is similar to XML namespaces but much worse.) Restricting yourself to non-recursive structures saves you from such problems.
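
The traversal problem is easy to reproduce. The Python sketch below is only an illustration (the naive_depth function is made up for this purpose). It builds a list that contains itself, shows that the standard json module refuses to serialize it, and shows that a naive recursive walk never terminates on its own.

```python
import json

# A recursive structure: a list whose second element is the list itself.
cycle = ["head"]
cycle.append(cycle)

# Python's json.dumps checks for circular references by default and raises
# ValueError instead of producing output.
try:
    json.dumps(cycle)
    outcome = "serialized"
except ValueError:
    outcome = "rejected"


def naive_depth(value):
    """Nesting depth of lists, written with no protection against cycles."""
    if not isinstance(value, list):
        return 0
    deepest = 0
    for item in value:
        deepest = max(deepest, naive_depth(item))
    return 1 + deepest


# On a tree this works; on the cycle it recurses until Python gives up.
tree_depth = naive_depth(["a", ["b"]])
try:
    naive_depth(cycle)
    walk = "finished"
except RecursionError:
    walk = "recursion limit hit"

print(outcome, tree_depth, walk)
```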

If you really want, you can always use a supplementary standard like JSPON on top of CSON. That is much better than a rule-’em-all serialization format.

Why not whitespace-sensitive?

It is not a programming language but a data format that can potentially be rewritten in a variety of ways. Whitespace is not visible, so it is very prone to being changed or misinterpreted; confusion between tabs and spaces is a common problem even in Python.

CSON tries to avoid whitespace dependency at all costs. Ironically, this became a problem when the verbatim string was added: verbatim strings are often used to explicitly insert indentation before each line. CSON solved this problem with a designated marker for verbatim strings (|). It may feel like you should align the markers in the same column, but you don’t have to. Still, their appearance gives a strong visual cue, and you will end up aligning them in the same column anyway. Isn’t it great?

Why JSON compatibility?

JSON has broad language, library and tool support. Designing a format on top of JSON automatically gives the same support. This requires round-trip compatibility, not merely forward compatibility as in YAML (YAML to JSON is not possible in general).

Could you explain bare string syntax a bit more?

Okay, that needs further explanation. CSON uses two identifier syntaxes as a baseline:

JavaScript (i.e. ECMAScript 5th edition) identifier syntax
XML (i.e. XML 1.0 5th edition) name syntax, excluding : (already taken by CSON)

Note that almost every JavaScript identifier is an XML name (the only exceptions are $, U+00AA, U+00B5 and U+00BA). It is also worth noting that the range of an XML name is greatly simplified; it contains lots of unassigned characters and punctuation that can be assumed to be letters for casual use. (For example, U+3002 IDEOGRAPHIC FULL STOP is actually punctuation but is included in an XML name anyway.) This makes matching an XML name a lot easier.

A CSON identifier is a union of the JavaScript identifier and the XML name, excluding : and including $ and - in every position. This should make both users and implementations comfortable enough.
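
Because the ranges are so coarse, a bare string can be matched with a single character-class regular expression. The Python sketch below transcribes the id-start and id-end ranges from the grammar above; the names ID_START, ID_END and is_bare_string are made up for this example, and the code is not taken from any CSON implementation.

```python
import re

# id-start, transcribed range by range from the grammar.
ID_START = (
    r"$\-A-Z_a-z\u00AA\u00B5\u00BA\u00C0-\u00D6\u00D8-\u00F6"
    r"\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    r"\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF"
    r"\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# id-end = id-start plus ".", digits and a few combining/connector ranges.
ID_END = ID_START + r".0-9\u00B7\u0300-\u036F\u203F-\u2040"

BARE_STRING = re.compile("[%s][%s]*" % (ID_START, ID_END))


def is_bare_string(text):
    """True if the whole text matches bare-string = id-start *id-end."""
    return BARE_STRING.fullmatch(text) is not None


for candidate in ["pi", "$type", "汉语-or-漢語", "foo bar", "1st", "a:b"]:
    print(candidate, is_bare_string(candidate))
```

The first three candidates are accepted. The last three are rejected because of the space, the leading digit and the colon respectively.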

What the hell is with the name?

CSON is written by hand, and cursive script is also written by hand. And I wanted to keep the -SON suffix. I apologize for the careless naming.

Note: I found out that CSON also stands for CoffeeScript Object Notation after the initial draft was finished. Sigh. Maybe I should find another name.

Implementations

CSON-js