Life After JSON, Part 1

XML. Simply saying that term can elicit a telling reaction from people. Some will roll their eyes. Some will wait for you to say something meaningful. Some will put their headphones back on and ignore the guy who sounds like an enterprise consultant. JSON is where it’s at. It’s new. It’s cool. Why even bother with stupid old XML? JSON is just plain better. Right?

Wrong. But that’s not to say that it’s worse; it’s just not flat-out better, and I’m a little concerned that it’s become the de facto data format, because it’s really not very good at that role. I think its success is a historical accident, and the lesson to be learned here is not why JSON is so popular, but why XML failed.

Without going into too much detail, XML came out of the need to express more complex relationships in data than formats like CSV allowed. We had OOP steamrolling everything else in the business world, HTML was out in the wild showing that this “tag” idea had promise, and computers and I/O were finally able to handle this kind of thing at a cost that didn’t require accounting for every byte of space.

In the beginning, it worked great. It broke through many of the headaches of other formats, it was easy to write, even without fancy tools, and nobody was doing anything fancy enough in it that you had to spend a ton of time learning how to use it. After that though, things got a little, actually a lot … messy. User njl at Hacker News sums it up very nicely:

I remember teaching classes in ’99 where I needed to explain XML. I could do it in five minutes — it’s like HTML, but you get to make up tags that are case-sensitive. Every start tag needs to have an end tag, and attributes need to be quoted. Put this magic tag at the start, so parsers know it’s XML. Here’s the special case for a tag that closes itself.
Then what happened? Validation, the gigantic stack of useless WS-* protocols, XSLT, SOAP, horrific parser interfaces, and a whole slew of enterprise-y technologies that cause more problems than they solve. Like CORBA, Ada, and every other “enterprise” technology that was specified first and implemented later, all of the XML add-ons are nice in theory but completely suck to use.
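
To make that five-minute version concrete, here is roughly the kind of document those rules describe (the order, customer, and expedited tags are invented purely for illustration):

    <?xml version="1.0" encoding="UTF-8"?>   <!-- the magic tag at the start, so parsers know it's XML -->
    <order id="1042">                        <!-- made-up tags; attributes are quoted -->
      <customer>
        <name>Ada Lovelace</name>            <!-- every start tag has a matching end tag -->
      </customer>
      <expedited/>                           <!-- the special case: a tag that closes itself -->
    </order>

Five minutes, more or less.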

So we have a case here of a technology being so flexible and powerful that everyone started using it and “improving” it. We started hearing things from recruiters like “can you program in XML” and “they used to do C++ but now they’re trying to use a lot more XML”. We had insanely complex documents that were never in step with the schemas they supposedly used. We had big piles of namespaces nobody understood, in the interest of standards and reusable code.

Where the pain was really felt was in the increasing complexity of the libraries that were supposed to make XML easy to use, even transparent. There was essentially an inexhaustible list of features that a library “needed” to support, and thus none of them were ever actually complete, or stable. Fifteen years later there are still problems with XML parsing, library conflicts, and weird characters that are legal here but not there.

So, long before anyone heard of JSON, XML had a bad rep. While the core idea was sound, it ultimately failed to achieve its primary goal of being a universal data interchange format that could scale from the simplest document to the most complex. People were eager to find a solution that delivered the value XML promised without the headaches that came with it. In fact, they still are, as nothing has really come close yet, including JSON, but perhaps we can build on the lessons of both XML and JSON to build the real next big thing.

XML ended up being used for everything from service message requests and responses to configuration files to local data stores. It was used for files from a few hundred bytes to many gigabytes. It was used for data meant to represent a single model, a set of models, or a stream of simple models that had no relation to each other. The fact that you could use it for all of these things was the sexy lure that drew everyone to it. Not only could you use it for all of these things, but it was actually designed to be used for all of these things. Wow!

Ultimately, I think this was the reason for its downfall, and JSON’s unsuitability for doing everything has, not surprisingly, been the reason for its ascendance on its home turf (JavaScript calling remote services). Does it really make sense for my application configuration file to use the same format as my data caches and my web services? Even now it’s hard for me to say that it’s a bad idea, because the value of having one format to parse, one set of rules to keep in mind, one set of expectations I don’t need to cover in my documentation, is nice. But with XML and JSON both proving the idea wrong in different ways, we have to take it as such.

The problem on the horizon is that JSON has become the go-to format for new developers, the people who have been told all along that XML is bad and horrible and “harmful”, and they’re using it for everything. We’re heading towards a situation where we have replaced a language that is failing to do what it was designed to do with a language that is failing to do things it was never designed to do, which I think would actually be worse. Even if it’s not worse, it’s certainly still bad.

If we accept the collective verdict that XML is not the answer, or at least not the answer we like, and my prediction that JSON isn’t either, what is? And why is nobody working on it? Or are they?

To be continued…

To switch or not to switch, part 2


Continued from Part 1, I’m looking at candidates to replace my current main language: Java. I’m not actually eager to get rid of Java; I still really like it and enjoy working in it, but I need to convince myself that it’s the right choice for me to continue to invest unpaid time in, or I need to find something that is. My career is, optimistically, one-third over; I’ve got 20-30 years left, tops, and while I’d actually bet that Java programmers will be needed in 30 years, I’m not sure how much interesting stuff will be happening.

So, let’s add a couple more rules to winnow the set.

Rule 3: It must have some kind of network database support.
Almost everything I do involves a database at some point. The volumes of data and network architectures we deal with today rule out simple file I/O, or even local-only databases. I did not look especially hard for the answer to this question; in my opinion, if support isn’t easy to find, it isn’t sufficient. Technically, I could write or port my own driver, but if nobody else has done it, I have to suspect that the users are solving very different problems than I am. This eliminates:

  • Agena
  • ATS
  • BASIC
  • BETA
  • Diesel
  • E
  • FORTH
  • Icon
  • Ioke
  • Logo
  • Maple
  • MiniD
  • Miranda
  • Modula-3
  • Nu
  • Reia
  • Sather
  • SQL
  • Self
  • SPARK
  • Squirrel
  • Timber

Rule 4: This is a tricky one, but I’m going to say it must not have “fallen from grace”. This is essentially the state that Java is entering, depending on who you ask. It’s perfectly functional, and widely used, but it’s had its day and isn’t hip anymore. This doesn’t exclude languages that are just old but were never at the top, like Eiffel, but I don’t see any reason to abandon Java and go with COBOL. This eliminates:

  • C
  • C++
  • COBOL
  • Fortran
  • Pascal
  • Perl
  • PHP
  • Visual Basic .NET

Now, some of those languages, like C, are still very popular and important. You could even say that they are continuing to get better and stay modern. C will probably outlive most of these languages, as none of them are strong candidates to rewrite Linux in yet. My argument is that nobody is really using C to solve new problems in new ways. This leaves 41 languages that are active, capable of doing at least basic database operations, and have not entered decline:

  • Ada
  • ALGOL 68
  • Boo
  • C#
  • Clean
  • Clojure
  • Cobra
  • Common Lisp
  • D
  • Dylan
  • Eiffel
  • Erlang
  • F#
  • Factor
  • Falcon
  • Fantom
  • GameMonkey Script
  • Go
  • Groovy
  • Haskell
  • Io
  • Java
  • JavaScript
  • Lua
  • Mirah
  • MUMPS
  • Objective Caml
  • Objective-C
  • Pike
  • Processing
  • Pure
  • Python
  • Ruby
  • Scala
  • Scheme
  • Scratch
  • Squeak
  • Tcl
  • Tea
  • Unicon
  • Vala

Scratching an Itch: The Open Data Bank

Engineers, especially those of the software variety, have various types of projects to work on. Some pay the bills, some are for learning, some are to help others’ goals, and then there are the ones that we say “scratch an itch.”

It’s hard to operate in a world of ideas without having a few of your own, and some ideas just keep popping up. If you’re lucky, someone else does it right and you can reap the benefits, but often you just have to go out and do it. These projects are often done at personal expense, “to see if it works” or “because I can”, and not for fame or fortune. I have a few of these kicking around, and it was a new year’s resolution of mine to actually get some of them into the wild. So, as the first of these, I’d like to officially announce a new project that I’ve been working on (and one of the reasons for the lack of blog posts): The Open Data Bank.

The ODB is a simple idea. While tinkering with other projects, I’m often in need of data. Sometimes this is to test things out, sometimes it’s to get things started, but every time it seems like I have to go and find it anew and coax it into some useful format. I assume that others like me have the same problem, and hopefully the ODB will be a useful contribution to the tinkering ecosystem, complementing other tools like open source libraries.

For the layman, the ODB is a place where we can put “open data”, that is to say, data that can be shared without restriction. Not only is the data open, but the formats it is shared in are open as well. Formats like XML and JSON don’t have to be licensed from anyone, and therefore people are free to write tools to read them.

If you’re interested in participating or just keeping track of the ODB, there’s a Google Group you can join and share info, ask questions, or offer ideas to improve it.

Database Naming Conventions

Time for another naming convention. This time it’s something people care more about than Java packages: we’re talking about databases. Here are the rules, and the reasons behind them.

Use lowercase for everything

We’ve got 4 choices here:

  • Mixed case

    • Some servers are case-sensitive, some are not. MySQL, for example, is case-insensitive for column names, case-insensitive on Windows for table names, but case-sensitive on Linux for table names (see the example after this list).
    • Error prone
  • No convention

    • Same reasons as mixed case
  • Upper case

    • SQL is easier to scan when the reserved words are uppercase. This is valuable when scanning log files looking for things like WHERE clauses and JOINs.
    • MySQL will always dump table names on Windows in lowercase.
  • Lowercase

    • Works everywhere. Some servers, like Oracle, will appear to convert everything to uppercase, but it’s just case-insensitive and you can use lowercase.
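
To make the MySQL quirk above concrete, here’s a small sketch. The customer table and queries are hypothetical, and the behavior described assumes MySQL’s default settings on each platform:

    CREATE TABLE customer (customer_id INT PRIMARY KEY, name VARCHAR(100));

    SELECT name FROM customer;   -- works everywhere
    SELECT name FROM Customer;   -- works on Windows, fails on Linux with a "table doesn't exist" error
    SELECT NAME FROM customer;   -- works everywhere; column names are case-insensitive

Stick to lowercase and the question never comes up.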

Only use letters, underscores, and numbers (sparingly)

  • Most servers support other characters, but there are no other characters which all the major servers support.
  • Numbers should be used as little as possible. Frequent use is typically a symptom of poor normalization.

    • address1, address2 is OK
  • Whitespace isn’t allowed on most servers, and when it is you have to quote or bracket everything, which gets messy.
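
As an illustration of how messy the quoting gets, compare a column with an underscore to one with a space. The order_detail table is hypothetical; the quoting styles are the ones the major servers use:

    -- letters and underscores: works unquoted on every major server
    SELECT unit_price FROM order_detail;

    -- a space means quoting everywhere, and the quoting itself isn't portable
    SELECT "unit price" FROM "order detail";   -- standard SQL, Oracle, PostgreSQL
    SELECT `unit price` FROM `order detail`;   -- MySQL
    SELECT [unit price] FROM [order detail];   -- SQL Server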

Table and column names should be short, but not abbreviated.

  • You’ve seen the same thing abbreviated every way possible, like cust_add, cus_addr, cs_ad, cust_addrs, so why not just customer_address? We’re not writing code on 80-character terminals any more, and most people aren’t even writing SQL by hand, so let’s keep it clear, OK? (See the example after this list.)
  • 30 characters is considered the safe limit for portability, but give some serious thought before you go past 20.
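
A minimal before-and-after, using the abbreviations above against a hypothetical customer table:

    SELECT cust_add FROM cust;               -- or was it cus_addr? cs_ad? cust_addrs?
    SELECT customer_address FROM customer;   -- no guessing required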

Table names should be singular.

Yes, singular! Oh yes, I went there. I used to use plural names, because it’s more semantically accurate. After all, if each record is a person, then a group of them would be people, right? Right, but who cares. SELECT * FROM person isn’t any less clear than people, especially if you’ve got a solid convention. You don’t pluralize a class name just because you’re going to declare a generic collection of it, do you? Also:

  • English plurals are crazy, and avoiding them is good.

    • user -> users
    • reply -> replies
    • address -> addresses
    • data -> data (unless it’s geographic, then it’s datum -> data)
  • Singular names mean that your primary key can always be tablename_id, which reduces errors and saves time.
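
A quick sketch of the payoff, using hypothetical person and reply tables: every key name is predictable, so the join writes itself.

    CREATE TABLE person (person_id INT PRIMARY KEY, name VARCHAR(100));
    CREATE TABLE reply  (reply_id  INT PRIMARY KEY, person_id INT REFERENCES person (person_id), body VARCHAR(1000));

    -- no need to remember whether the key is id, personId, or people_id
    SELECT p.name, r.body
    FROM person p
    JOIN reply r ON r.person_id = p.person_id;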

Double Underscores for Associative Tables.

You’ve got your person table and your address table, and there’s a many-to-many between them. This table should be called address__person. Why the double underscore? Well, what if you have a legacy_customer table that also ties to address? Now you’ve got address__legacy_customer, and the double underscore makes it unambiguous where one table name ends and the next begins, even when a component name contains a single underscore of its own. A new developer can easily pick up this convention and will be able to break down the names accordingly. Remember, no matter what the Perl/Lisp/Ruby/etc guys say, clarity of code is judged by how someone reads it, not how they write it.
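
Here’s a minimal sketch of that layout, using the tables from the example above (the column types are just placeholders):

    CREATE TABLE person          (person_id          INT PRIMARY KEY, name VARCHAR(100));
    CREATE TABLE address         (address_id         INT PRIMARY KEY, city VARCHAR(100));
    CREATE TABLE legacy_customer (legacy_customer_id INT PRIMARY KEY, name VARCHAR(100));

    -- the double underscore marks the seam between the component names,
    -- even when a component (legacy_customer) contains an underscore of its own
    CREATE TABLE address__person (
        address_id INT REFERENCES address (address_id),
        person_id  INT REFERENCES person  (person_id),
        PRIMARY KEY (address_id, person_id)
    );

    CREATE TABLE address__legacy_customer (
        address_id         INT REFERENCES address (address_id),
        legacy_customer_id INT REFERENCES legacy_customer (legacy_customer_id),
        PRIMARY KEY (address_id, legacy_customer_id)
    );

Both associative names also happen to list their components in alphabetical order, which brings us to the next rule.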

Component Names of Associative Tables in Alphabetical Order.

This rule is somewhat arbitrary, but still beneficial. There’s no good way to determine which goes first: table size, “importance”, age, who knows what else, and those assessments may change over time. Or, you might find that your manager assigned the same task to two people, and now you’ve got an address__person and a person__address table co-existing peacefully, when you only need one. Everyone putting them in the same order makes reading and writing queries easier.

That’s all I’ve got for now, but I encourage you to offer your own, or even refute some of the ones above (with some reasoning, of course).