Okay, let's keep this simple: I'm going to modify the example RSS from the O'Reilly article Dive into XML.
Let's set some ground rules. XML lets you put attributes and sub-nodes in any order you like. There's no good reason for this in the context of simple syndication, so we're dispensing with it: order counts. In XML, whitespace is meaningless. Since this is less natural than its proponents would have you believe, I'm going to make some whitespace (end of line) count.
Based on that we could make a file something like this:
SSS v0.0 example
XML.com
http://www.xml.com/
XML.com features a rich mix of information and services for the XML community.
en-us

Normalizing XML, Part 2
http://www.xml.com/pub/a/2002/12/04/normalizing.html
In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.

The .NET Schema Object Model
http://www.xml.com/pub/a/2002/12/04/som.html
Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.

SVG's Past and Promising Future
http://www.xml.com/pub/a/2002/12/04/svg.html
In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.
What does this mean? This is exactly the same information that the RSS (v0.91) example conveys. It does so by defining that the first 4 lines describe the channel and every block of 3 lines thereafter describes an article. You can tell that all blocks represent, line by line, the title, URL, and description. Additionally, the 4th line represents the channel's language. What you can't see is that any blank line, whitespace-only line, or line starting with a non-alphanumeric character [^A-Za-z0-9] will be ignored and go uncounted. This allows me to add human-readable information and to separate the channel and articles.
Let's point out the problems here. First, if in the future I added a 5th line to describe the URL of the channel's icon, older readers would be confused; some sort of versioning is in order. Second, if an SSS author felt that it wasn't important to give the URL or description, it would be difficult to omit that information (considering that blank lines are uncounted).
So let's add some versioning, and while we're at it we'll identify each line's meaning, succinctly. Lines now start with an id character [cdhluv]. The order below matters. Any line not starting with one of these characters should be ignored. All SSS files must have a 'v' line and a 'c' line. The 'l' line may only appear after the 'c' and before the first 'h' of that 'c'. The order should be v, then a block of c, u, d, l, then repeating blocks of h, u, d, optionally followed by more 'c' blocks, with a minimum of one 'h' block after each 'c' block.
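The positional rules above are simple enough to sketch in a few lines of Python. This is only an illustration under my own assumptions: the function name `parse_sss_v00` and the dictionary keys are made up here, and the only validity check is that the counted lines add up to 4 plus a multiple of 3.

```python
import re

# A line counts only if its first character is alphanumeric; blank lines,
# whitespace-only lines and lines starting with anything else are skipped.
COUNTED = re.compile(r"[A-Za-z0-9]")


def parse_sss_v00(text):
    """Parse SSS v0.0: 4 channel lines, then blocks of 3 article lines."""
    lines = [ln for ln in text.splitlines() if COUNTED.match(ln)]
    if len(lines) < 4 or (len(lines) - 4) % 3 != 0:
        raise ValueError("not valid SSS v0.0: unexpected number of counted lines")
    channel = dict(zip(("title", "url", "description", "language"), lines[:4]))
    articles = [
        dict(zip(("title", "url", "description"), lines[i:i + 3]))
        for i in range(4, len(lines), 3)
    ]
    return channel, articles
```

Note that a title such as ".NET tips" would silently be treated as a comment, which is one more reason the later versions tag every line.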
SSS v0.1 example
v0.1-SSS

cXML.com
uhttp://www.xml.com/
dXML.com features a rich mix of information and services for the XML community.
len-us

hNormalizing XML, Part 2
uhttp://www.xml.com/pub/a/2002/12/04/normalizing.html
dIn this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.

hThe .NET Schema Object Model
uhttp://www.xml.com/pub/a/2002/12/04/som.html
dPriya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.

hSVG's Past and Promising Future
uhttp://www.xml.com/pub/a/2002/12/04/svg.html
dIn this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.
So there we have a more flexible format at the cost of only 22 more bytes. Let's say the first file was the now defunct SSS v0.0. Notice how an editor with syntax coloring would help? Technically we could get away with just highlighting the first character in, say, red, the channel title or article head in yellow, and the URLs in blue. That would separate the sections clearly. Whitespace does the same, of course. Now we have all the info of the RSS v0.91 file at 66% of the file size.
A note on parsing behavior. The parser could be flexible and allow the order of u, d and l to change. Don't write this kind of parser. Take the extra time to crash horribly on malformed data. It'll teach that pesky user that wrong is also bad. If the order is bad, the version is missing, or there are lines you should ignore that probably contain data, don't work with what you have. Inform the user that the input is not valid SSS v0.1, as the version line dictates; offer no compromise.
Next we want to add some of the information that the RSS1 and RSS2 examples have, namely authors/creators and publishing dates. So let's extend SSS to a new version that includes 2 new character lines, 'a' for the author and 't' for the time stamp. Also, to round it out, let's add something that RSS 2.0 supports but that we don't see in the examples (channel build date):
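Here is a minimal sketch of such an uncompromising parser for v0.1, in Python. It rests on assumptions of mine that the rules above don't settle: the u, d and l lines are treated as optional but must appear in that order, and the names `SSSError` and `parse_sss_v01` are invented for the example. It does not try to guess whether an ignored line "probably contains data".

```python
class SSSError(ValueError):
    """Raised for any input that is not strictly valid SSS v0.1."""


# Allowed field order inside a channel block and inside an article block.
FIELD_ORDER = {"c": "cudl", "h": "hud"}
ID_CHARS = set("cdhluv")


def parse_sss_v01(text):
    # Lines not starting with an id character are ignored entirely.
    lines = [ln for ln in text.splitlines() if ln[:1] in ID_CHARS]
    if not lines or lines[0] != "v0.1-SSS":
        raise SSSError("not valid SSS v0.1: version line missing or wrong")
    channels, block, kind, pos = [], None, None, 0
    for ln in lines[1:]:
        tag, value = ln[0], ln[1:]
        if tag == "c":
            if channels and not channels[-1]["articles"]:
                raise SSSError("not valid SSS v0.1: channel without articles")
            block = {"c": value, "articles": []}
            channels.append(block)
            kind, pos = "c", 0
        elif tag == "h":
            if not channels:
                raise SSSError("not valid SSS v0.1: 'h' line before any 'c' line")
            block = {"h": value}
            channels[-1]["articles"].append(block)
            kind, pos = "h", 0
        else:
            # find() gives -1 for a tag not allowed in this block (including
            # a second 'v'); a repeated or out-of-order tag gives idx <= pos.
            idx = FIELD_ORDER[kind].find(tag) if kind else -1
            if idx <= pos:
                raise SSSError("not valid SSS v0.1: unexpected %r line" % tag)
            block[tag] = value
            pos = idx
    if not channels or not channels[-1]["articles"]:
        raise SSSError("not valid SSS v0.1: channel without articles")
    return channels
```

Every failure raises the same kind of error with the same prefix, so the caller cannot accidentally carry on with half a feed.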
SSS v0.2 example
v0.2-SSS

cXML.com
uhttp://www.xml.com/
dXML.com features a rich mix of information and services for the XML community.
len-us

hNormalizing XML, Part 2
uhttp://www.xml.com/pub/a/2002/12/04/normalizing.html
dIn this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.
aWill Provost
tWed Dec 4 12:00:03 EDT 2002

hThe .NET Schema Object Model
uhttp://www.xml.com/pub/a/2002/12/04/som.html
dPriya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.
aPriya Lakshminarayanan
tWed Dec 4 12:05:03 EDT 2002

hSVG's Past and Promising Future
uhttp://www.xml.com/pub/a/2002/12/04/svg.html
dIn this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.
aAntoine Quint
tWed Dec 4 12:15:03 EDT 2002
Now we have all the information of the RSS1 or RSS2 example with a cost savings of roughly 54% or 37% respectively. Conversely, using RSS1 or RSS2 costs us 217.4% or 159.2% respectively of the cost of using SSS v0.2. The former numbers are less sensationalized, but both are true.
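These percentages can be checked against the byte counts in the file listing further down (2107 bytes for RSS1, 1543 for RSS2, 969 for SSS v0.2). A small Python sketch, with helper names of my own choosing, gives savings of about 54% and 37%, and relative costs of 217.4% and 159.2%:

```python
# Byte counts taken from the uncompressed file listing in this article.
SIZES = {"rss1": 2107, "rss2": 1543, "sss2": 969}


def savings_pct(big, small):
    """How much smaller `small` is than `big`, as a percentage of `big`."""
    return round(100 * (1 - small / big), 1)


def cost_pct(big, small):
    """How large `big` is, as a percentage of `small`."""
    return round(100 * big / small, 1)


for name in ("rss1", "rss2"):
    print(name,
          savings_pct(SIZES[name], SIZES["sss2"]),
          cost_pct(SIZES[name], SIZES["sss2"]))
```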
There is a feature of the RDF-based XML of RSS1 that's missing, and it addresses a general concern: extensibility. If we kept adding more types of information for a channel or article, we'd start to run out of single lowercase characters to use. This is why I've kept the numerals reserved thus far.
Right after the version line but before the channel line, I am now going to allow any number of lines beginning with integers between 1 and 999, without 0 padding. These lines map the number on the left to a name on the right, optionally followed by a dash and a version number. The name identifies a specific extension that the parser either knows or doesn't; it must be composed only of the characters [a-z_] (which doesn't include '-', in case you don't read regular expressions). Any mapping is acceptable; only the names of extensions that will be used must be listed. After this block of definitions (that is, after the first channel line), lines beginning with one of these numbers refer to the corresponding extension. The extension may define the meaning of up to 26 more lines, identified by a letter a-z after its mapped integer. See the example for adding the slash extensions for category, section, hit parade, and number of comments. Additionally, this is where you could add URN stuff, by including a line '2urn' and other lines like '2iurn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6', if that's the sort of thing you find useful.
What happens if an SSS v0.2 parser parses an SSS v0.1 file that has a comment starting with 'a' or 't' that should be ignored? Well, it should also ignore it, because it knows the version is v0.1-SSS. For v0.2 we can be more explicit by listing which character lines we will be using, in alphabetical order, rendering all others comment lines (except numerals defined after the version line). If this list is omitted, then all characters defined in this spec could denote valid, meaningful lines.
SSS v0.3 example
v0.3-SSS
acdhlut
1slash

cXML.com
uhttp://www.xml.com/
dXML.com features a rich mix of information and services for the XML community.
len-us

hNormalizing XML, Part 2
uhttp://www.xml.com/pub/a/2002/12/04/normalizing.html
dIn this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.
aWill Provost
tWed Dec 4 12:00:03 EDT 2002
1cXML
1sWeb
1h11,54,76,21
1m197

hThe .NET Schema Object Model
uhttp://www.xml.com/pub/a/2002/12/04/som.html
dPriya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.
aPriya Lakshminarayanan
tWed Dec 4 12:05:03 EDT 2002
1c.NET
1sLanguages
1h11,154,76,221
1m399

hSVG's Past and Promising Future
uhttp://www.xml.com/pub/a/2002/12/04/svg.html
dIn this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.
aAntoine Quint
tWed Dec 4 12:15:03 EDT 2002
1cSVG
1sWeb
1h11,54,76,21
1m197
An SSS v0.3 parser that doesn't know what the slash extensions mean (and thus can't do much with them) could choose to ignore the lines starting with 1[a-z], or could choose to inform the user that the article includes the following info, on separate lines: slash c XML, slash s Web, slash h 11,54,76,21, slash m 197. Optionally we could define something like an SSS lookup address [pretend sss.org isn't taken], http://www.sss.org/extension/slash, which would simply be a text file giving each of the alphabetically ordered letters a real name, e.g. hHit-Parade. This would allow the information to be expanded to "slash Hit-Parade 11,54,76,21". If we were seriously doing this, we could throw in a validating regular expression and any number of transformations. As such:
SSS extension definition v0.3 example
v0.3-SSSextDef
=slash
# first h line gives a human name
hHit-Parade
# second h line gives a validating regular expression
h\d+(,\d+)*
# third and subsequent h lines provide a transformation to HTML, where $u stands in for the article's u line and something like $slashc would stand in for the slash extension's c line.
h/^(\d+)/<a href="$u&comment=\1">\1/
h/,(\d+)/</a><a href="$u&comment=\1">\1/
h/$/</a>/
etc...
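As a rough illustration of how a reader could apply such a definition, the Python sketch below validates a hit-parade value and runs the three transformations over it. The rule syntax is not pinned down above, so this assumes that each rule is `/pattern/replacement/` split at the first inner slash, that every match is replaced (not just the first), and that only `$u` needs substituting; `parse_rule` and `render` are names made up for the example.

```python
import re

HIT_PARADE_VALIDATOR = r"\d+(,\d+)*"
HIT_PARADE_RULES = [
    r'/^(\d+)/<a href="$u&comment=\1">\1/',
    r'/,(\d+)/</a><a href="$u&comment=\1">\1/',
    r'/$/</a>/',
]


def parse_rule(rule):
    """Split '/pattern/replacement/' at the first slash after the pattern."""
    if len(rule) < 3 or not (rule.startswith("/") and rule.endswith("/")):
        raise ValueError("malformed rule: %r" % rule)
    pattern, _, replacement = rule[1:-1].partition("/")
    return pattern, replacement


def render(value, validator, rules, article_url):
    """Validate an extension value, then apply each rule in order."""
    if not re.fullmatch(validator, value):
        raise ValueError("value does not match the validating expression")
    for rule in rules:
        pattern, replacement = parse_rule(rule)
        # Assumes the URL contains no backslashes, which re.sub would
        # otherwise interpret as escapes in the replacement template.
        replacement = replacement.replace("$u", article_url)
        value = re.sub(pattern, replacement, value)
    return value
```

For an article URL of `http://example.com/a` and the value `11,54`, this yields two links, one per number, each pointing at the article with the corresponding `comment` parameter.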
What about future considerations? Will an SSS file need more than 999 extensions? If so, we could just make vX.X support more extension integers. What if I need to compound some more information into SSS without using a-z or numbers to represent extensions? Well, I've set aside 0: we could 0-pad some integers to provide new kinds of data. I don't think we'll need to.
Anyway, maybe this format isn't all that much better than RSS1. But by most common measures it is perfectly equivalent and notably more compact. I think if you had 1000 users downloading your RSS every 5 minutes for 8 hours a day, cutting the file size by more than half could make you a happy person. Note that SSS v0.2 contains a full time stamp where the RSS examples only contain a date, and v0.3 contains extra information for slash-specific syndication where none of the RSS demo files do. Yet all SSS files are smaller than the smallest RSS file. This remains generally true under compression.
File Sizes, rss vs. sss
-rw-r--r-- 1 daniell users 2107 Jul  6 16:35 x.rss1
-rw-r--r-- 1 daniell users 1543 Jul  6 13:46 x.rss2
-rw-r--r-- 1 daniell users 1237 Jul  6 16:33 x.rss91
-rw-r--r-- 1 daniell users  804 Jul  7 11:41 x.sss0
-rw-r--r-- 1 daniell users  826 Jul  7 11:41 x.sss1
-rw-r--r-- 1 daniell users  969 Jul  7 11:50 x.sss2
-rw-r--r-- 1 daniell users 1089 Jul  7 12:00 x.sss3
-rw-r--r-- 1 daniell users 1201 Apr 22 15:50 x.jssson

Gzipped File Sizes, rss vs. sss
-rw-r--r-- 1 daniell users  738 Jul  6 16:35 x.rss1.gz
-rw-r--r-- 1 daniell users  629 Jul  6 13:46 x.rss2.gz
-rw-r--r-- 1 daniell users  557 Jul  6 16:33 x.rss91.gz
-rw-r--r-- 1 daniell users  454 Jul  7 11:41 x.sss0.gz
-rw-r--r-- 1 daniell users  466 Jul  7 13:20 x.sss1.gz
-rw-r--r-- 1 daniell users  510 Jul  7 13:21 x.sss2.gz
-rw-r--r-- 1 daniell users  571 Jul  7 12:00 x.sss3.gz
-rw-r--r-- 1 daniell users  614 Apr 22 15:50 x.jssson.gz
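To put the scenario of 1000 users polling every 5 minutes for 8 hours a day into numbers, here is a back-of-the-envelope sketch in Python. It assumes every poll downloads the whole file (no caching or conditional requests), which is my simplification, and `daily_bytes` is just an illustrative name.

```python
def daily_bytes(file_size, users=1000, poll_minutes=5, hours=8):
    """Bytes served per day if every poll fetches the complete file."""
    polls_per_user = hours * 60 // poll_minutes  # 96 polls in 8 hours
    return file_size * users * polls_per_user


# Sizes from the listings above: RSS1 vs. SSS v0.2, plain and gzipped.
plain_saved = daily_bytes(2107) - daily_bytes(969)
gzip_saved = daily_bytes(738) - daily_bytes(510)
print(plain_saved, gzip_saved)
```

That works out to roughly 109 MB saved per day uncompressed and roughly 22 MB per day gzipped, for a feed with only three articles.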
So the point is, next time you're thinking that you need to come up with a way to format some information, if you settle on XML after a minute's thought, you may not have thought enough. Certainly XML could be perfect, but in running through this example I think you can see that a line-by-line marked-up text file also works, and could work better.
Also, out of curiosity, I retro-XMLized this example SSS format in two different manners: Manner 1 at 1324 bytes, and Manner 2 at 1250 bytes. Except for the matter of quoting within attributes, I think everyone should prefer Manner 2 (of these two only). Overall, I think SSS v0.3 is the best of these choices.
Now for comparison, here's the RSS 1.0 version of the example:
RSS v1.0 for comparison (RSS 1 is RDF, which is written in XML)
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
>
  <channel rdf:about="http://www.xml.com/cs/xml/query/q/19">
    <title>XML.com</title>
    <link>http://www.xml.com/</link>
    <description>XML.com features a rich mix of information and services for the XML community.</description>
    <language>en-us</language>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/normalizing.html"/>
        <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/som.html"/>
        <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/svg.html"/>
      </rdf:Seq>
    </items>
  </channel>
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/normalizing.html">
    <title>Normalizing XML, Part 2</title>
    <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
    <description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
    <dc:creator>Will Provost</dc:creator>
    <dc:date>2002-12-04</dc:date>
  </item>
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/som.html">
    <title>The .NET Schema Object Model</title>
    <link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
    <description>Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.</description>
    <dc:creator>Priya Lakshminarayanan</dc:creator>
    <dc:date>2002-12-04</dc:date>
  </item>
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/svg.html">
    <title>SVG's Past and Promising Future</title>
    <link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
    <description>In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.</description>
    <dc:creator>Antoine Quint</dc:creator>
    <dc:date>2002-12-04</dc:date>
  </item>
</rdf:RDF>
Lastly, here's a link to an article about how RSS can use terabytes of bandwidth. Of course, even reducing that to half-terabytes is better, but not exactly best. Delta for feeds would be good: some kind of protocol that says "What's new since unix-time X", for example the RFC 3229 approach discussed elsewhere.
April 28 2008: I've also retro-JSONized the SSS format. The size, without newlines, comes to 1201 bytes, and 614 gzipped. It required some adapting of the x1 extensions and of the article list as an array. Take a look at it. I think most programmers would prefer more descriptive object names, and to accomplish that they might pass it through an adapter, but you could pass SSS through a parser for the same effect in about as much code.
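The "what's new since unix-time X" idea boils down to a filter on the server side. The Python sketch below is only that filter, not RFC 3229 itself: it assumes each article has already been parsed into a dictionary with a hypothetical `published` field holding its time stamp as a unix time.

```python
def items_since(articles, since_unix_time):
    """Return only the articles published strictly after the given time."""
    return [a for a in articles if a["published"] > since_unix_time]


# Hypothetical parsed articles; the time stamps are arbitrary integers.
feed = [
    {"title": "older article", "published": 100},
    {"title": "newer article", "published": 200},
]
print(items_since(feed, 150))  # only the newer article is sent
```

A client that remembers the time of its last successful fetch would then download nothing at all on most polls.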