CSON (Cursive Script Object Notation) is a superset of JSON that can be written by hand and translated to canonical JSON.
Status: Draft design with a known implementation.
# CSON example
pi: 3.141592
e = 2.718281828, 'foo': 'bar'
"nested" = ["JSON array",
{and = "JSON object"},
"with a trailing comma", # yes!
# yes, the comment can be inside JSON arrays/objects as well
]
"verbatim": |a verbatim string
| keeps the preceding whitespace
| and joins all lines with `\n`
| as you see, no escape sequence is processed
| and this string does not have a trailing \n -->
i18n: {
한국어: "Korean"
日本語: "Japanese"
汉语-or-漢語: "Chinese"
ᏣᎳᎩ: "Cherokee"
}
should translate to:
{"pi": 3.141592,
"e": 2.718281828, "foo": "bar",
"nested": ["JSON array",
{"and": "JSON object"},
"with a trailing comma"
],
"verbatim": "a verbatim string\n keeps the preceding whitespace\n and joins all lines with `\\n`\n as you see, no escape sequence is processed\n and this string does not have a trailing \\n -->",
"i18n": {
"\ud55c\uad6d\uc5b4": "Korean",
"\u65e5\u672c\u8a9e": "Japanese",
"\u6c49\u8bed-or-\u6f22\u8a9e": "Chinese",
"\u13e3\u13b3\u13a9": "Cherokee"
}
}
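The \uXXXX escapes in the expected output are just the ASCII-only spelling of the same keys. As a quick check, here is a minimal sketch that uses Python's standard json module as a stand-in serializer. This is an assumption for illustration: the draft does not prescribe a particular tool for producing the canonical form.

```python
import json

# Keys taken from the i18n object in the example above.
keys = ["한국어", "日本語"]

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
# which yields the \uXXXX spelling used in the expected output.
escaped = [json.dumps(k) for k in keys]

# Decoding the escaped spelling gives back the original keys.
decoded = [json.loads(e) for e in escaped]
```

Both spellings denote the same string, so a consumer of the translated JSON sees exactly the keys that were written by hand.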
CSON is defined as grammar additions to RFC 4627, which formally defines JSON. So without further ado, here is the delta:
JSON-text = object
/ array
+ / object-items
begin-array = ws %x5B ws ; [ left square bracket
begin-object = ws %x7B ws ; { left curly bracket
end-array = ws %x5D ws ; ] right square bracket
end-object = ws %x7D ws ; } right curly bracket
name-separator = ws %x3A ws ; : colon
+ / ws %x3D ws ; = equal sign
value-separator = ws %x2C ws ; , comma
+ / newline ws
ws = *(
%x20 / ; Space
%x09 / ; Horizontal tab
- %x0A / ; Line feed or New line
- %x0D ; Carriage return
+ newline-char /
+ comment
)
+ newline = *(%x20 / %x09) newline-char
+ newline-char = %x0A ; Line feed or New line
+ / %x0D ; Carriage return
+ comment = sharp *comment-char
+ sharp = %x23 ; # sharp
+ comment-char = %x00-09 / %x0B-0C / %x0E-10FFFF
value = false / null / true / object / array / number / string
false = %x66.61.6c.73.65 ; false
null = %x6e.75.6c.6c ; null
true = %x74.72.75.65 ; true
- object = begin-object [ member *( value-separator member ) ] end-object
+ object = begin-object [ object-items ] end-object
+ object-items = member *( value-separator member ) [ value-separator ]
- member = string name-separator value
+ member = name name-separator value
+ name = string / bare-string
- array = begin-array [ value *( value-separator value ) ] end-array
+ array = begin-array [ array-items ] end-array
+ array-items = value *( value-separator value ) [ value-separator ]
number = [ minus ] int [ frac ] [ exp ]
decimal-point = %x2E ; .
digit1-9 = %x31-39 ; 1-9
e = %x65 / %x45 ; e E
exp = e [ minus / plus ] 1*DIGIT
frac = decimal-point 1*DIGIT
int = zero / ( digit1-9 *DIGIT )
minus = %x2D ; -
plus = %x2B ; +
zero = %x30 ; 0
- string = quotation-mark *char quotation-mark
+ string = quotation-mark *dquoted-char quotation-mark
+ / apostrophe-mark *squoted-char apostrophe-mark
- char = unescaped /
- escape (
+ dquoted-char = dquoted-unescaped / escaped
+ squoted-char = squoted-unescaped / escaped
+ escaped = escape (
+ %x27 / ; ' apostrophe U+0027
%x22 / ; " quotation mark U+0022
%x5C / ; \ reverse solidus U+005C
%x2F / ; / solidus U+002F
%x62 / ; b backspace U+0008
%x66 / ; f form feed U+000C
%x6E / ; n line feed U+000A
%x72 / ; r carriage return U+000D
%x74 / ; t tab U+0009
%x75 4HEXDIG ) ; uXXXX U+XXXX
escape = %x5C ; \
quotation-mark = %x22 ; "
+ apostrophe-mark = %x27 ; '
- unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
+ dquoted-unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
+ squoted-unescaped = %x20-26 / %x28-5B / %x5D-10FFFF
+ verbatim-string = verbatim-fragment *(newline ws verbatim-fragment)
+ verbatim-fragment = pipe *verbatim-char
+ pipe = %x7C ; |
+ verbatim-char = %x20-10FFFF
+ bare-string = id-start *id-end
+ ; a union of JS identifier and XML name. See below for the rationale.
+ id-start = %x24 / %x2D / %x41-5A / %x5F / %x61-7A / %xAA / %xB5
+ / %xBA / %xC0-D6 / %xD8-F6 / %xF8-02FF / %x0370-037D
+ / %x037F-1FFF / %x200C-200D / %x2070-218F / %x2C00-2FEF
+ / %x3001-D7FF / %xF900-FDCF / %xFDF0-FFFD / %x10000-EFFFF
+ id-end = id-start / %x2E / %x30-39 / %xB7 / %x0300-036F / %x203F-2040
Changes from JSON:
- A comment starting with # is allowed.
- An object key can be written as a bare string without quotes. (… true/false therefore. Ah, of course true/false in the key position should be converted to a string as in JavaScript.)
- A verbatim string starts with | (the first line can be prefixed with other constructs), and each occurrence of <newline><spaces>| is replaced with \n, except for the first | which is ignored. It does not undergo any other processing, so it can be used to write a verbatim string.
Otherwise the same usage and restrictions as JSON apply. For example, the Unicode encoding detection specified in Section 3 of RFC 4627 can be adapted to CSON as well.
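The verbatim string rule can be sketched in a few lines of Python. This is only an illustration, not the known implementation: the function name decode_verbatim is hypothetical, the raw text is assumed to have been isolated by a tokenizer already, and only spaces and tabs are accepted between a newline and the next | (the grammar also permits comments and blank lines there, which this sketch ignores).

```python
import re

# A newline, optional spaces/tabs, then the continuation marker.
_CONTINUATION = re.compile(r"(?:\r\n|\r|\n)[ \t]*\|")

def decode_verbatim(raw: str) -> str:
    """Decode the raw text of a verbatim string, starting at its first '|'."""
    if not raw.startswith("|"):
        raise ValueError("a verbatim string must start with '|'")
    # Drop the first marker, then turn every continuation into a line feed.
    # No escape sequence is processed anywhere.
    return _CONTINUATION.sub("\n", raw[1:])

raw = "|a verbatim string\n    | keeps the preceding whitespace\n    | with a literal \\n"
decoded = decode_verbatim(raw)
```

Because nothing but the markers is rewritten, a backslash followed by n in the source stays as two characters in the result.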
There is one ambiguity in this grammar. The following fragment…
[
|one
|two
|three
]
…can be interpreted both as ["one", "two", "three"] and as ["one\ntwo\nthree"]. The parser should choose the latter. If you want to write an array of verbatim strings, you can do as follows:
[ |one
, |two
, |three
]
Same as JSON: machine-readable semi-structured data. CSON is equivalent to JSON but with a fancy and more readable syntax. It is a strict superset of JSON (its syntax is defined in terms of RFC 4627 JSON syntax) and can be converted to canonical JSON.
You know, JSON is very annoying to write or edit by hand. Wouldn’t it be great to have syntactic sugar for JSON with all the frills, bells and whistles?
YAML is a human friendly data serialization standard for all programming languages.
See the rationale for no user-defined types, no recursive structures and whitespace-insensitive syntax. Those should be enough.
TOML is like INI, only better.
While it does have a niche as a configuration format, I strongly feel that TOML has its own flaws:
[][].[][]? How about foo bar = "quux"? How about 한글 = "Hangul"? And seriously, no Unicode escape? Really?
CSON is explicitly designed to avoid these problems. It has a defined grammar which fixes many problems with JSON but avoids going too far.
JSON does not have them. See also the rationale for JSON compatibility.
Besides that, having them means that every implementation should recognize major user-defined types (otherwise they are useless), which is a major complexity burden. Many user-defined types can be simulated as an object with a reserved key like $type, which can be readily inspected without prior knowledge of those types.
And of course, complexity is an enemy of security. YAML had a big one recently.
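As a sketch of the reserved-key idea, the snippet below reads the same JSON text twice with Python's standard json module: once as plain data, and once through an object_hook that understands one tagged type. The "date" tag, the "value" key and the sample date are hypothetical choices made for this illustration; the draft only suggests a reserved key like $type.

```python
import json
from datetime import date

def decode_typed(obj: dict):
    """object_hook: turn {"$type": "date", "value": ...} into a date object."""
    if obj.get("$type") == "date":
        return date.fromisoformat(obj["value"])
    return obj  # unknown or untagged objects pass through unchanged

text = '{"when": {"$type": "date", "value": "2000-01-01"}, "name": "CSON"}'

# A reader that knows nothing about the tag still gets ordinary data.
plain = json.loads(text)

# A reader that knows the tag can opt in to the richer type.
typed = json.loads(text, object_hook=decode_typed)
```

The tagged object remains an ordinary object for every other tool, which is the point: no implementation is forced to recognize the type.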
JSON does not have them. See also the rationale for JSON compatibility.
Besides that, such a restriction is actually a good thing! Unlike programming languages, data formats should be limited in computational power (and similarly, expressiveness) in order to be efficiently processed. If your data format is Turing-complete, you cannot inspect it until it has been executed. If your data format is mutable (can be overwritten during data processing), you can’t be sure of its contents for a while.
Yes, recursive structures are not necessarily harmful. LISP has supported recursive structures for a long time and even has a proper serialization and deserialization algorithm. But this “feature” is not without complexity; a naive algorithm cannot traverse a recursive structure, and it introduces free-form identifiers which are tagged but not associated with the structures they point to. (This is similar to XML namespaces but much worse.) Restricting yourself to non-recursive structures saves you from such problems.
If you really want, you can always use a supplementary standard like JSPON on top of CSON. That is much better than a rule-’em-all serialization format.
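The remark about naive traversal can be illustrated with a short Python sketch. The helper naive_depth is a hypothetical example of an algorithm written without any cycle check; the second half shows that Python's standard json module refuses to serialize the same structure.

```python
import json

def naive_depth(value) -> int:
    """Nesting depth of lists/dicts, computed by plain recursion (no cycle check)."""
    if isinstance(value, dict):
        return 1 + max((naive_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((naive_depth(v) for v in value), default=0)
    return 0

tree = {"a": [1, 2, {"b": []}]}   # an ordinary, non-recursive structure
cycle = []
cycle.append(cycle)               # a list that contains itself

try:
    naive_depth(cycle)
    traversal = "finished"
except RecursionError:
    traversal = "never terminates on its own"

try:
    json.dumps(cycle)
    serialization = "serialized"
except ValueError:
    serialization = "rejected"
```

A format without recursive structures never needs this kind of guard, because every value is a finite tree.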
It is not a programming language but a data format that can potentially be rewritten in a variety of ways. Whitespace is invisible and therefore very prone to being changed or misinterpreted; confusion between tabs and spaces is a popular problem even in Python.
CSON tries to avoid whitespace dependency at all costs. Ironically, this became a problem when the verbatim string was added: verbatim strings are often used to explicitly insert indentation before each line. CSON solved this problem with a designated marker for verbatim strings (|). It may feel as if you should align the markers in the same column, but you don’t have to. Still, the marker gives a strong visual cue, and you will end up aligning them in the same column anyway. Isn’t it great?
JSON has broad language, library and tool support. Designing a format on top of JSON automatically gives the same support. This requires round-trip compatibility, not merely forward compatibility as in YAML (YAML cannot be converted back to JSON).
Okay, that needs further explanation. CSON used two repertoires of identifier syntax as a baseline:
- JavaScript identifiers
- XML names, excluding : (already taken by CSON)
Note that almost every JavaScript identifier is an XML name (the only exceptions are $, U+00AA, U+00B5 and U+00BA). It is also worth noting that the range of an XML name is greatly simplified; it contains lots of unassigned characters and punctuation that can be assumed to be letters for casual use. (For example, U+3002 IDEOGRAPHIC FULL STOP is actually punctuation but is included in an XML name anyway.) This makes matching an XML name a lot easier.
A CSON identifier is a union of the JavaScript identifier and the XML name, excluding : and including $ and - in every position. This should make both users and implementations comfortable enough.
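To show how little an implementation needs for this, here is a minimal Python sketch of a bare-string matcher. The ranges are copied from the id-start and id-end rules in the grammar; the names ID_START, ID_END_EXTRA and is_bare_string are hypothetical and not part of the draft.

```python
# (low, high) code point ranges copied from the id-start rule.
ID_START = [
    (0x24, 0x24), (0x2D, 0x2D), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xAA, 0xAA), (0xB5, 0xB5), (0xBA, 0xBA), (0xC0, 0xD6), (0xD8, 0xF6),
    (0xF8, 0x2FF), (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Ranges that id-end allows in addition to id-start.
ID_END_EXTRA = [
    (0x2E, 0x2E), (0x30, 0x39), (0xB7, 0xB7), (0x300, 0x36F), (0x203F, 0x2040),
]

def _in_ranges(ch: str, ranges) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)

def is_bare_string(text: str) -> bool:
    """True if text matches bare-string = id-start *id-end."""
    if not text or not _in_ranges(text[0], ID_START):
        return False
    return all(
        _in_ranges(ch, ID_START) or _in_ranges(ch, ID_END_EXTRA)
        for ch in text[1:]
    )
```

A plain range check like this needs no Unicode character database, which is what the simplified XML-style ranges buy.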
CSON is written by hand and cursive script is also written by hand. And I wanted to keep the -SON suffix. I apologize for the careless naming.
Note: I found that CSON also stands for CoffeeScript Object Notation after the initial draft was finished. Sigh. Maybe I should find another name.