CSON

CSON (Cursive Script Object Notation) is a superset of JSON that can be written by hand and translated to canonical JSON.

Status: Draft design with a known implementation.

Example

# CSON example
pi: 3.141592
e = 2.718281828, 'foo': 'bar'
"nested" = ["JSON array",
            {and = "JSON object"},
            "with a trailing comma", # yes!
            # yes, the comment can be inside JSON arrays/objects as well
           ]
"verbatim": |a verbatim string
            |  keeps the preceding whitespace
            |    and joins all lines with `\n`
            |      as you see, no escape sequence is processed
            |        and this string does not have a trailing \n -->
i18n: {
  한국어: "Korean"
  日本語: "Japanese"
  汉语-or-漢語: "Chinese"
  ᏣᎳᎩ: "Cherokee"
}

should translate to:

{"pi": 3.141592,
 "e": 2.718281828, "foo": "bar",
 "nested": ["JSON array",
            {"and": "JSON object"},
            "with a trailing comma"
           ],
 "verbatim": "a verbatim string\n  keeps the preceding whitespace\n    and joins all lines with `\\n`\n      as you see, no escape sequence is processed\n        and this string does not have a trailing \\n -->",
 "i18n": {
   "\ud55c\uad6d\uc5b4": "Korean",
   "\u65e5\u672c\u8a9e": "Japanese",
   "\u6c49\u8bed-or-\u6f22\u8a9e": "Chinese",
   "\u13e3\u13b3\u13a9": "Cherokee"
 }
}
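As a quick sanity check of the translation above, the following Python sketch uses only the standard `json` module on a subset of the canonical output. The variable names are illustrative and not part of CSON. It shows that the escaped keys decode to the bare names written in the CSON source, and that a backslash typed inside a verbatim string survives as a literal backslash.

```python
import json

# A subset of the canonical JSON output shown above, kept as raw text so that
# the JSON escapes (\n, \\n, \uXXXX) reach the JSON parser untouched.
canonical = r'''
{"verbatim": "a verbatim string\n  keeps the preceding whitespace\n    and joins all lines with `\\n`",
 "i18n": {"\ud55c\uad6d\uc5b4": "Korean", "\u65e5\u672c\u8a9e": "Japanese"}}
'''

data = json.loads(canonical)

# The escaped keys decode to the bare names written in the CSON source.
i18n_keys = list(data["i18n"])

# The verbatim value has real line breaks between fragments, while the
# backslash-n typed inside a fragment stays as two literal characters.
verbatim_lines = data["verbatim"].split("\n")

# Re-encoding a key with the default ASCII-only output gives the escapes back.
reencoded = json.dumps("한국어")
```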

Syntax

CSON is defined as a set of grammar additions to RFC 4627, which formally defines JSON. Without further ado, here is the delta:

  JSON-text = object
            / array
+           / object-items

  begin-array     = ws %x5B ws    ; [ left square bracket
  begin-object    = ws %x7B ws    ; { left curly bracket
  end-array       = ws %x5D ws    ; ] right square bracket
  end-object      = ws %x7D ws    ; } right curly bracket
  name-separator  = ws %x3A ws    ; : colon
+                 / ws %x3D ws    ; = equal sign
  value-separator = ws %x2C ws    ; , comma
+                 / newline ws

  ws = *(
            %x20 /                ; Space
            %x09 /                ; Horizontal tab
-           %x0A /                ; Line feed or New line
-           %x0D                  ; Carriage return
+           newline-char /
+           comment
        )
+ newline = *(%x20 / %x09) newline-char
+ newline-char = %x0A             ; Line feed or New line
+              / %x0D             ; Carriage return
+ comment = sharp *comment-char
+ sharp = %x23                    ; # sharp
+ comment-char = %x00-09 / %x0B-0C / %x0E-10FFFF

- value = false / null / true / object / array / number / string
+ value = false / null / true / object / array / number / string
+       / verbatim-string

  false = %x66.61.6c.73.65        ; false
  null  = %x6e.75.6c.6c           ; null
  true  = %x74.72.75.65           ; true

- object = begin-object [ member *( value-separator member ) ] end-object
+ object = begin-object [ object-items ] end-object
+ object-items = member *( value-separator member ) [ value-separator ]
- member = string name-separator value
+ member = name name-separator value
+ name = string / bare-string

- array = begin-array [ value *( value-separator value ) ] end-array
+ array = begin-array [ array-items ] end-array
+ array-items = value *( value-separator value ) [ value-separator ]

  number = [ minus ] int [ frac ] [ exp ]
  decimal-point = %x2E            ; .
  digit1-9 = %x31-39              ; 1-9
  e = %x65 / %x45                 ; e E
  exp = e [ minus / plus ] 1*DIGIT
  frac = decimal-point 1*DIGIT
  int = zero / ( digit1-9 *DIGIT )
  minus = %x2D                    ; -
  plus = %x2B                     ; +
  zero = %x30                     ; 0

- string = quotation-mark *char quotation-mark
+ string = quotation-mark *dquoted-char quotation-mark
+        / apostrophe-mark *squoted-char apostrophe-mark
- char = unescaped /
-        escape (
+ dquoted-char = dquoted-unescaped / escaped
+ squoted-char = squoted-unescaped / escaped
+ escaped = escape (
+            %x27 /               ; '    apostrophe      U+0027
             %x22 /               ; "    quotation mark  U+0022
             %x5C /               ; \    reverse solidus U+005C
             %x2F /               ; /    solidus         U+002F
             %x62 /               ; b    backspace       U+0008
             %x66 /               ; f    form feed       U+000C
             %x6E /               ; n    line feed       U+000A
             %x72 /               ; r    carriage return U+000D
             %x74 /               ; t    tab             U+0009
             %x75 4HEXDIG )       ; uXXXX                U+XXXX
  escape = %x5C                   ; \
  quotation-mark = %x22           ; "
+ apostrophe-mark = %x27          ; '
- unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
+ dquoted-unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
+ squoted-unescaped = %x20-26 / %x28-5B / %x5D-10FFFF

+ verbatim-string = verbatim-fragment *(newline ws verbatim-fragment)
+ verbatim-fragment = pipe *verbatim-char
+ pipe = %x7C                     ; |
+ verbatim-char = %x20-10FFFF

+ bare-string = id-start *id-end

+ ; a union of JavaScript identifier and XML name. See below for the rationale.
+ id-start = %x24 / %x2D / %x41-5A / %x5F / %x61-7A / %xAA / %xB5
+          / %xBA / %xC0-D6 / %xD8-F6 / %xF8-02FF / %x0370-037D
+          / %x037F-1FFF / %x200C-200D / %x2070-218F / %x2C00-2FEF
+          / %x3001-D7FF / %xF900-FDCF / %xFDF0-FFFD / %x10000-EFFFF
+ id-end = id-start / %x2E / %x30-39 / %xB7 / %x0300-036F / %x203F-2040

Changes from JSON:

Otherwise, the same usage and restrictions as JSON apply. For example, the Unicode encoding detection specified in Section 3 of RFC 4627 can be adapted to CSON as well.
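For reference, the RFC 4627 heuristic looks at the pattern of zero octets in the first four octets of the text. The following Python sketch implements that heuristic as written in the RFC; the function name and the fallback for short inputs are assumptions of this sketch, not part of CSON.

```python
def detect_encoding(head: bytes) -> str:
    """Guess the Unicode encoding from the first four octets (RFC 4627, Section 3)."""
    if len(head) < 4:
        # Too short to apply the heuristic; falling back to UTF-8 is an
        # assumption of this sketch.
        return "utf-8"
    zeros = tuple(octet == 0 for octet in head[:4])
    if zeros == (True, True, True, False):
        return "utf-32-be"   # 00 00 00 xx
    if zeros == (True, False, True, False):
        return "utf-16-be"   # 00 xx 00 xx
    if zeros == (False, True, True, True):
        return "utf-32-le"   # xx 00 00 00
    if zeros == (False, True, False, True):
        return "utf-16-le"   # xx 00 xx 00
    return "utf-8"           # xx xx xx xx
```

The heuristic relies on the first two characters being ASCII. RFC 4627 guarantees this for JSON, but a CSON text may begin with a non-ASCII bare name, so an adaptation would have to treat that case separately.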

There is one ambiguity in this grammar. The following fragment…

[
  |one
  |two
  |three
]

…can be interpreted both as ["one", "two", "three"] and as ["one\ntwo\nthree"]. The parser should choose the latter. If you want to write an array of verbatim strings, you can write it as follows:

[ |one
, |two
, |three
]
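One way to get the preferred interpretation is to read verbatim fragments greedily. The Python sketch below is a simplification under stated assumptions: the input is already split into lines, each fragment starts at the first non-blank character of its line, and comments or blank lines between fragments are not handled. The function name `read_verbatim` is hypothetical.

```python
def read_verbatim(lines, start):
    """Greedily read one verbatim string beginning at lines[start].

    Returns (string_value, index_of_next_unconsumed_line).
    """
    fragments = []
    i = start
    while i < len(lines):
        stripped = lines[i].lstrip(" \t")
        if not stripped.startswith("|"):
            break  # anything else (a comma, a bracket, ...) ends the string
        # Drop the indentation and the pipe; keep the rest of the line as is.
        fragments.append(stripped[1:])
        i += 1
    return "\n".join(fragments), i
```

With the first fragment above, `read_verbatim(["  |one", "  |two", "  |three", "]"], 0)` consumes all three lines as a single string. In the comma-led form, the line after `|one` starts with a comma, so the string ends after one fragment.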

Rationale

What is an intended use of CSON?

Same as JSON: machine-readable semi-structured data. CSON is equivalent to JSON but with a fancy and more readable syntax. It is a strict superset of JSON (its syntax is defined in terms of RFC 4627 JSON syntax) and can be converted to a canonical JSON.

Why not just JSON?

You know, JSON is very annoying to write or edit by hand. Wouldn’t it be great to have syntactic sugar for JSON with all the frills, bells and whistles?

Why not YAML?

“YAML is a human friendly data serialization standard for all programming languages.”

See the rationale for no user-defined types, no recursive structures and whitespace-insensitive syntax. Those should be enough.

Why not TOML?

“TOML is like INI, only better.”

While it does have a niche as a configuration format, I strongly feel that TOML has its own flaws:

CSON is explicitly designed to avoid these problems. It has a defined grammar that fixes many problems with JSON but avoids going too far.

Why no user-defined types?

JSON does not have them. See also the rationale for JSON compatibility.

Besides that, having them means that every implementation must recognize the major user-defined types (otherwise they are useless), which is a major complexity burden. Many user-defined types can be simulated as an object with a reserved key like $type, which can be readily inspected without prior knowledge of those types.

And of course, complexity is the enemy of security. YAML had a big security problem recently.
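As an illustration of the reserved-key approach, here is a minimal Python sketch that works on the canonical JSON form. The `date` tag, the `value` key, the sample document and the `DECODERS` registry are all hypothetical; CSON itself defines none of them.

```python
import json
from datetime import date

# Hypothetical registry: type tag -> function building a value from the object.
DECODERS = {
    "date": lambda obj: date.fromisoformat(obj["value"]),
}

def decode_tagged(obj):
    """object_hook: convert objects with a known "$type"; pass others through."""
    decoder = DECODERS.get(obj.get("$type"))
    return decoder(obj) if decoder else obj

# Made-up sample data, for illustration only.
text = '{"when": {"$type": "date", "value": "2000-01-02"}, "other": {"$type": "unknown", "x": 1}}'
doc = json.loads(text, object_hook=decode_tagged)
```

An object with an unrecognized tag remains an ordinary object, so a consumer that knows nothing about the tag can still read and inspect it.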

Why no recursive structures?

JSON does not have them. See also the rationale for JSON compatibility.

Besides that, such a restriction is actually a good thing! Unlike programming languages, data formats should be limited in computational power (and similarly, expressiveness) in order to be processed efficiently. If your data format is Turing-complete, you cannot inspect the data until it has been executed. If your data format is mutable (can be overwritten during data processing), you cannot be sure of its contents until processing has finished.

Yes, recursive structures are not necessarily harmful. LISP has supported recursive structures for a long time and even has proper serialization and deserialization algorithms. But this “feature” is not without complexity: a naive algorithm cannot traverse a recursive structure, and it introduces free-form identifiers that are tagged but not tied to the structures they point to. (This is similar to XML namespaces but much worse.) Restricting yourself to non-recursive structures saves you from such problems.
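The traversal problem is easy to reproduce. In the Python sketch below (names are illustrative), a naive recursive walk over a self-referencing structure never terminates on its own, and the standard `json` encoder refuses to serialize it.

```python
import json

node = {"name": "root"}
node["self"] = node  # a recursive (self-referencing) structure

def count_nodes(value):
    """Naive traversal: assumes the structure is a finite tree."""
    if isinstance(value, dict):
        return 1 + sum(count_nodes(v) for v in value.values())
    return 1

naive_failed = False
try:
    count_nodes(node)
except RecursionError:
    # The walk keeps following node["self"] until the interpreter gives up.
    naive_failed = True

dump_error = ""
try:
    json.dumps(node)
except ValueError as exc:
    # The encoder detects the cycle and rejects the value.
    dump_error = str(exc)
```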

If you really want, you can always use a supplementary standard like JSPON on top of CSON. That is much better than a rule-’em-all serialization format.

Why not whitespace-sensitive?

CSON is not a programming language but a data format that can potentially be rewritten in a variety of ways. Whitespace is invisible and therefore very prone to being changed or misinterpreted; confusion between tabs and spaces is a common problem even in Python.

CSON tries to avoid whitespace dependency at all costs. Ironically, this became a problem when verbatim strings were added: verbatim strings are often used to explicitly insert indentation before each line. CSON solves this problem with a designated marker for verbatim strings (|). It may feel as though you should align the markers in the same column, but you don’t have to. Still, their appearance gives a strong visual cue, and you will end up aligning them in the same column anyway. Isn’t it great?

Why JSON compatibility?

JSON has broad language, library and tool support. Designing a format on top of JSON automatically gives the same support. This requires round-trip compatibility, not merely forward compatibility as with YAML (where no YAML-to-JSON conversion is possible in general).

Could you explain bare string syntax a bit more?

Okay, that needs further explanation. CSON uses two repertoires of identifier syntax as a baseline: the JavaScript identifier and the XML name.

Note that almost every JavaScript identifier is an XML name (the only exceptions are $, U+00AA, U+00B5 and U+00BA). It is also worth noting that the range of an XML name is very simplified; it contains lots of unassigned characters and punctuation marks that can be assumed to be letters for casual use. (For example, U+3002 IDEOGRAPHIC FULL STOP is actually punctuation but is included in an XML name anyway.) This makes matching an XML name a lot easier.

The CSON identifier is the union of the JavaScript identifier and the XML name, excluding : and including $ and - in every position. This should make both users and implementations comfortable enough.
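Because the ranges are so simple, a bare string can be matched with a single character-class pattern. The Python sketch below transcribes the id-start and id-end ranges from the grammar into a regular expression; the names `ID_START`, `ID_END`, `BARE_STRING` and `is_bare_string` are illustrative.

```python
import re

# id-start, transcribed range by range from the grammar above.
ID_START = (
    r"$\-A-Z_a-z\u00AA\u00B5\u00BA\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    r"\u0370-\u037D\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    r"\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# id-end adds the full stop, digits, the middle dot and combining marks.
ID_END = ID_START + r".0-9\u00B7\u0300-\u036F\u203F-\u2040"

BARE_STRING = re.compile(f"[{ID_START}][{ID_END}]*")

def is_bare_string(name: str) -> bool:
    """Return True if the whole of `name` matches the bare-string rule."""
    return BARE_STRING.fullmatch(name) is not None
```

Under this pattern the keys from the example, such as `汉语-or-漢語` and `ᏣᎳᎩ`, are accepted, while a name that starts with a digit or contains a colon has to be quoted.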

What the hell is with the name?

CSON is written by hand, and cursive script is also written by hand. And I wanted to keep the -SON suffix. I apologize for the careless naming.

Note: I found out that CSON also stands for CoffeeScript Object Notation after the initial draft was finished. Sigh. Maybe I should find another name.

Implementations

