Software – Page 2 – Eric F. Savage

Security Club

Posted on June 6, 2011October 21, 2011 by Eric F. Savage

Sony has been getting repeatedly hacked. We’ve seen it before with the TJX incident, and many others, most of which never get reported, much less disclosed, or even discovered. In some of these cases, only email addresses are taken, or maybe passwords. In others, names and addresses are exposed, as well as medical conditions, social security numbers, and other very sensitive information. It seems to me that this is happening more often, and I think it’s for a few reasons.

Bigger Targets

The first reason is that targets are getting bigger and the rewards go up with size. Nobody is going to waste their time getting into a system with a few thousand random users when you can get tens of millions for the same effort. As more people use more sites, it’s only natural that there are going to be more million-user sites out there. This reason isn’t a big deal, it’s just the way things are.

Better Rewards

The second reason is that more companies are collecting more data about their users. This data is valuable, possibly the most valuable asset some of these companies has. Facebook and Google make much of their money from knowing about you, what you do online, the types of things you’re interested in, different ways to contact you.

Large companies like Sony can afford to take whatever information you give them and cross-reference it against various databases to get even more information about you. This lets them focus marketing efforts, tailor campaigns to you, shape product development and so on. This also lets them make the site more easy to use with pre-filled information, to increase sales and conversions.

We don’t even really question when a site asks us for our name any more. What’s the harm, right? Sure, I’ll give them my ZIP code too, and maybe even my phone number, they probably won’t call me anyways, right? Now ask yourself, why do you need to give your name, mailing address and phone number to a company to play a game where you are a pseudonymous elf?

The real answer is that they don’t. They might need it for billing purposes, but billing databases are kept much more secure for reasons I’ll explain later. They ask for this information because it’s free, and because you’ll give it to them, and because it’s valuable to them. It’s probably not protected very well, and when it gets stolen everyone shrugs, changes the password on the database, emails the FBI to make it look like they care, and gets back to more important business like social media strategies.

No Penalties

The companies involved are embarrassed and probably suffer some losses as a result, but these are mostly minor injuries. The news stories spin it to make the intruders the sole criminals, and lose interest. The only people who really pay for these incidents are the people whose data has been stolen. There are no requirements on what companies have to do to protect this information, no requirements on what they need to do if it is compromised, no penalties for being ignorant or reckless. Someone might figure out that it’s going to cost them some sales, and they put some money in the PR budget to mitigate that.

This is the reason why billing information is better secured. The credit card companies take steps to make sure you’re being at least a little responsible with this information. And in the event it leaks, the company who failed to protect it pays a real cost in terms of higher fees or even losing the ability to accept cards at all. These numbers make sense to CEOs and MBAs, so spending money to avoid them also makes sense.

How to Stop It

There are obviously a large number of technological measures that can be put in place to improve security, but there’s one that is far simpler and much more foolproof. But first, let’s look at banks. Banks as we know it have been around for a few hundred years. I’d bet that you could prove that in every single year, banks got more secure. Massive vaults, bullet-proof windows, armed guards, motion detectors, security cameras, silent alarms, behavioral analysis, biometric monitors, the list goes on and on, and all of these things actually work very well. But banks still get robbed. All the time. When was the last time you heard of a bank robber getting caught on their first attempt? They are always linked to dozens of other robberies when they do get caught. Why?

Because they’re full of money.

They can make it harder to rob them. They can make it easier to catch the people who did it. But the odds don’t always matter to someone who sees a pile of money sitting there for them to take if they can bypass these tricks.

People break into networks for many reasons, but the user data is often the pile of gold that they seek. So the most effective way to stop someone from breaking in and stealing it is to not have it in the first place. This advice works in 2011, it will work in 2020. It works on Windows, OS X and Linux. It works online and offline, mobile or laptop, and so on.

“The first rule of security club is you try not to join security club: minimize the amount of user data you store.” – Ben Adida

So if you’re in a situation where you need to figure out if your data is secure enough, or how to secure it, start with this question: Do you need it in the first place? Usability people say they want it. Marketing people say they need it. If you’re an engineer, it’s a losing battle to argue those points, because they’re right. Stop convincing people why you shouldn’t do it, and put a cost on it so they have to convince each other that it’s worth it.

Anyone who went to business school knows the cost/value balance of inventory. It’s pretty straightforward to discuss whether a warehouse should be kept opened or expanded or closed. Nobody wants to store anything for too long, or make too much product or have too many materials. But ask them about how much user data you should be storing and the response will be something like “all of it, why wouldn’t we?”.

So make sure that the conversion bump from using full names asking age and gender and doing geo-targeting covers the cost of the security measures required to protect that data. Make it clear that those costs are actually avoidable, they are not sunk. They are also not a one-time investment. They should show up every quarter on the bottom line of anyone who uses that data. And if nobody wants to pay for it, well, you’ve just solved a major part of your security problem, haven’t you?

Update 10/18/2011: “FTC Privacy Czar To Entrepreneurs: “If You Don’t Want To See Us, Don’t Collect Data You Don’t Need”

Life After JSON, Part 1

Posted on June 2, 2011June 1, 2011 by Eric F. Savage

XML. Simply saying that term can elicit a telling reaction from people. Some will roll their eyes. Some will wait for you to say something meaningful. Some will put their headphones back on and ignore the guy that sounds like a enterprise consultant. JSON is where it’s at. It’s new. It’s cool. Why even bother with stupid old XML. JSON is just plain better. Right?

Wrong. But that’s not to say that it’s worse, it’s just not flat-out better, and I’m a little concerned that it’s become the de facto data format because it’s really not very good at that role. I think its success is a historical accident, and the lessons to be learned here are not why JSON is so popular, but why XML failed.

Without going into too much detail, XML came out of the need to express data as more complex relationships than formats like CSV allowed. We had OOP steamrolling everything else in the business world, HTML was out in the wild showing that this “tag” idea had promise, and computers and I/O were finally able to handle this type of stuff at a cost level that didn’t require accounting for every byte of space.

In the beginning, it worked great. It broke through much of headaches of other formats, it was easy to write, even without fancy tools, and nobody was doing anything fancy enough in it that you had to spend a ton of time learning how to use it. After that though, things got a little, actually a lot … messy. User njl at Hacker News sums it up very nicely:

I remember teaching classes in ’99 where I needed to explain XML. I could do it in five minutes — it’s like HTML, but you get to make up tags that are case-sensitive. Every start tag needs to have an end tag, and attributes need to be quoted. Put this magic tag at the start, so parsers know it’s XML. Here’s the special case for a tag that closes itself.
Then what happened? Validation, the gigantic stack of useless WS-* protocols, XSLT, SOAP, horrific parser interfaces, and a whole slew of enterprise-y technologies that cause more problems than they solve. Like COBRA, Ada, and every other “enterprise” technology that was specified first and implemented later, all of the XML add-ons are nice in theory but completely suck to use.

So we have a case here of a technology being so flexible and powerful that everyone started using it and “improving” it. We started getting asked questions by recruiters like “can you program in XML” and “they used to do C++ but now they’re trying to use a lot more XML”. We had insanely complex documents that were never in step with the schemas they supposedly used. We had big piles of namespaces nobody understood in the interest of standards and reusable code.

Where the pain was really felt was in the increasing complexity of the libraries that were supposed to make XML easy to use, even transparent. There was essentially an inexhaustible list of features that a library “needed” to support, and thus none of them were ever actually complete, or stable. 15 years later there are still problems with XML parsing and library conflicts and weird characters that are legal here but not there.

So, long before anyone heard of JSON, XML had a bad rep. While the core idea of it was sound, it ultimately failed to achieve it’s primary goal of being a universal data interchange format that could scale from the simplest document to the most complex. People were eager to find a solution that delivered the value XML promised without the headaches it delivered. In fact, they still are, as nothing has really come close yet, including JSON, but perhaps we can build on the lessons of both XML and JSON to build the real next-big-thing.

XML ended up being used for everything from service message requests and responses to configuration files to local data stores. It was used for files that were a few hundred bytes to many gigabytes. It was used for data meant to represent a single model to a set of models to a stream of simple models that had no relation to each other. The fact that you could use it for all of these things was the sexy lure that drew everyone to it. Not only could you use it for all of these things but it was actually designed to be used for all of these things. Wow!

Ultimately, I think this was the reason for it’s downfall, and JSON’s unsuitability for doing everything has not surprisingly been the reason for it’s ascendance on its home turf (JavaScript calling remote services). Does it really make sense for my application configuration file to use the same format as my data caches and as my web services? Even now it’s hard for me to say that it’s a bad idea because the value of having one format to parse, one set of rules to keep in mind, one set of expectations I don’t need to cover in my documentation is nice. But with XML and JSON both proving this to be false in different ways, we have to take it as such.

The problem that’s on the horizon is that as JSON becomes the go-to format for new developers to use, the people that have been told all along that XML is bad and horrible and “harmful”, they’re using it for everything. We’re heading towards a situtation where we have replaced a language that is failing to do what it was designed to do with a language that is failing to do things it was never designed to do, which I think would actually be worse. Even if it’s not worse, it’s certainly still bad.

If we abide by the collective answer that XML is not the answer, or at least it’s the answer we don’t like, and my prediction that JSON isn’t either, what is? And why is nobody working on it? Or are they?

To be continued…

To switch or not to switch, part 3

Posted on May 26, 2011June 7, 2011 by Eric F. Savage

This entry is part 3 of 3 in the series To switch or not to switch

Continued from Part 2, we’re down to 41 languages that you could potentially build a modern, database-driven application with. I’d like to knock another 10-12 more off the list before I get into trying the language themselves, but I’m running out of black-and-white rules to do it with.

Concurrency is a big part of scaling, but that’s a tough characteristic to pin down. Any JVM-based language is going to have access to a great threading model, but people have been able to scale ~~languages that lack threads at all, like Python~~ (Edit: Python has threads) languages that have poor or difficult threading, to decent volumes as well, so clearly threads are not a requirement.

I spend much of my time writing web software, so robust support of that, like Java’s Servlet specification and third party additions like Spring MVC, is nice to have as well. Ruby and Python have made many of their gains based on the strengths of their web frameworks. However, an otherwise good language could add a web framework relatively easily, so again it’s not a requirement.

The way a language handles text is important when working with users, databases, web services, etc., but this can be addressed with libraries. Object orientation can be nice too, as can other models. Interpreted vs. compiled, machine code vs. byte code, exceptions, static typing, dynamic typing, the list goes on with important details that aren’t actually important enough that I can’t really live without them. I certainly prefer static typing, checked exceptions, painless threading, garbage collection, run-anywhere bytecode, running in a virtual machine, but I know smart people who can make valid arguments against every one of those, and maybe I just haven’t given the alternatives enough of a chance.

What is really important to me about a language is its community, leadership, and, for lack of a better term, motive. For this reason I’m going to knock the Microsoft-led CLI languages off the list. I know that CLI is a standard, and that the open source implementation is not beholden to it, but I’m simply not going to program in the Microsoft ecosystem. I think they just don’t care about open source and independent development. This eliminates:

C#
Boo
Cobra
F#

It’s kind of a shame because F# looks interesting and is something I’d like to tinker with some day. I also think C# is actually a pretty good language, and, while it initially seemed like Java Jr., it evolved to have some powerful features, moreso than Java in some ways. Unfortunately, even with Mono in the picture, it’s still a Microsoft product to me.

It should definitely be noted that I’m not exactly happy about Oracle owning Java now either. Due to Java’s inertia, I haven’t seen any Oracle decisions affect me yet, and I’ve probably got a number of years before I have to deal with that if and when they decide to do something bad. They could kill it entirely and people would still be using it for a long time.

Along similar lines, Objective-C’s fate seems to be tied closely with the whims of Apple. I don’t see any interest in using it for non-Apple OS projects, so it is bumped off the list as well.

Objective-C

Also some of these languages have such a small community or slow pace of development that they don’t seem worthy of investing in:

ALGOL 68
Clean
Dylan
GameMonkey Script
Mirah
Unicon

Three of these languages are active, but the community (and therefore the language) are solving very different problems, Processing with its graphics and visualizations, Scratch with its introductory/training aspects, and Vala is tied closely to GNOME applications.

Processing
Scratch
Vala

This leaves 27 languages moving on to part 4, most of which I’m definitely looking forward to learning more about:

Ada
Clojure
Common Lisp
D
Eiffel
Erlang
Falcon
Fantom
Factor
Go
Groovy
Haskell
Io
Java
JavaScript
Lua
MUMPS
Pike
Pure
Objective Caml
Python
Ruby
Scala
Scheme
Squeak
Tea
Tcl

To switch or not to switch, part 2

Posted on May 19, 2011June 7, 2011 by Eric F. Savage

This entry is part 2 of 3 in the series To switch or not to switch

Continued from Part 1, I’m looking at candidates to replace my current main language: Java. I’m not actually eager to get rid of Java, I still really like it, and enjoy working in it, but I need to convince myself that it’s the right choice for me to continue to invest unpaid time in, or I need to find something that is. My career is, optimistically, 1/3 over, I’ve got 20-30 years left tops, and while I’d actually bet that Java programmers will be needed in 30 years, I’m not sure how much interesting stuff will be happening.

So, let’s add a couple more rules to winnow the set.

Rule 3: It have some kind of network database support.
Almost everything I do involves a database at some point. The volumes of data and network architectures we deal with today rule out simple file I/O, or even local-only databases. I did not look especially hard for the answer to this question, in my opinion if it wasn’t easy to find, it’s not sufficient. Technically, I could write/port my own driver, but if nobody else has done it, I have to suspect that the users are solving very different problems than I am. This eliminates:

Agena
ATS
BASIC
BETA
Diesel
E
FORTH
Icon
Ioke
Logo
Maple
MiniD
Miranda
Modula-3
Nu
Reia
Sather
SQL
Self
SPARK
Squirrel
Timber

Rule 4: This is a tricky one, but I’m going to say it must not have “fallen from grace”. This is essentially the state that Java is entering, depending on who you ask. It’s perfectly functional, and widely used, but it’s had its day and isn’t hip anymore. This doesn’t exclude languages that are just old, but were never at the top, like Eiffel, but I don’t see any reason to abandon Java and go with COBOL.

C
C++
COBOL
Fortran
Pascal
Perl
PHP
Visual Basic .NET

Now, some of those those languages, like C, are still very popular, and important. You could even say that they are continuing to get better and stay modern. C will probably outlive most of these languages, as none of them are strong candidates to rewrite Linux in yet. My argument is that nobody is really using C to solve any new problems in new ways. This leaves 41 languages that are active, capable of doing at least basic database operations, and have not entered decline.

Ada
ALGOL 68
Boo
C#
Clean
Clojure
Cobra
Common Lisp
D
Dylan
Eiffel
Erlang
F#
Factor
Falcon
Fantom
GameMonkey Script
Go
Groovy
Haskell
Io
Java
JavaScript
Lua
Mirah
MUMPS
Objective Caml
Objective-C
Pike
Processing
Pure
Python
Ruby
Scala
Scheme
Scratch
Squeak
Tcl
Tea
Unicon
Vala

To switch or not to switch, Part 1

Posted on May 15, 2011May 31, 2011 by Eric F. Savage

This entry is part 1 of 3 in the series To switch or not to switch

I was listening to a talk the other day and the speaker derisively mentioned “those people who are happy writing Java for the rest of their lives”, and I thought “Am I one of those?” and then I thought “is that a bad thing?”. As part of my “question everything” journey, I decided that it was time, after 10+ years, to have Java report for inspection and force it to defend its title.

I should make it clear, that I am not a language geek, or collector. I generally disagree with “use the right language for the right problem”, I prefer “use the right language for most of your problems”. So far, Java has been that for me. Some things I do in Java are more easily done in other languages, but not so much so that it overtakes the headaches of heterogeneous codebases. If something is really difficult, or impossible in your main language, bring something else in, but keep it simple. I also think it’s fine to have more than one main language, a client of mine is currently transitioning off C#, keeping Java, and adding Python. What they don’t have is random parts of their infrastructure done in erlang or perl or tcl because that’s what someone wanted to use that day.

I could make this task easier and just look at the “marketable” skills out there, which is a small subset. While I think it’s unlikely that there is some forgetten language just waiting for its moment, it’s certainly possible I could find a neat one that’s fun to play with. Languages like Ruby and Python spent years before people could find jobs doing them. So I’m going to look at literally every single language I can find, and put them through a series of tests. If you find a language I haven’t mentioned, let me know and it will be given the same chance as the rest.

Round 1:

The point of this round is to identify languages that have any potential for being useful to me.

Qualifying Criteria

Rule 1. It must be “active”.
This is admitedly a subjective term, but we’ll see how it goes. Simula is clearly not active, while Processing clearly is, with a release only weeks ago.
Rule 2. It must compile and run on modern consumer hardware and operating systems.
This means, at minimum, it works on at least one modern flavor of Linux, because I will want this to run on a server somewhere, and I don’t want a Windows or OS X server, or worse, something obscure. For bonus points, it will also work on Windows 7 and/or OS X.

So, that’s it for now. There are no requirements for web frameworks or lambdas or preference for static versus dynamic typing, I think those elements will play out in later rounds.

Ada
Agena
ALGOL 68
ATS
BASIC
BETA
Boo
C
C#
C++
Clean
Clojure
COBOL
Cobra
Common Lisp
D
Diesel
Dylan
E
Eiffel
Erlang
F#
Factor
Falcon
Fantom
FORTH
Fortran
GameMonkey Script
Go
Groovy
Haskell
Icon
Io
Ioke
Java
JavaScript
Logo
Lua
Maple
MiniD
Mirah
Miranda
Modula-3
MUMPS
Nu
Objective Caml
Objective-C
Pascal
Perl
PHP
Pike
Processing
Pure
Python
Reia
Ruby
Sather
Scala
Scheme
Scratch
Self
SPARK
SQL
Squeak
Squirrel
Tcl
Tea
Timber
Unicon
Vala
Visual Basic .NET

This list is actually a LOT longer than I expected, and yes, there actually is a modern version of ALGOL 68. Stay tuned for part 2.

The Ultimate Music App

Posted on May 14, 2011May 11, 2011 by Eric F. Savage

There’s an ever-growing number of online music services out there, but none of them have really nailed it for me. Here’s my list of demands:

Instant Purchase – Simple one or two click purchase, which adds it to my portfolio. Downloading from one place and uploading to another is dumb.
Standard format/no DRM – This is why subscription-based services won’t work.
Automatic Download/Sync – As seamless as DropBox, maybe even with a few rules (per playlist, etc).
Smart Playlists – The only reason I use iTunes is that I can set up playlists with dynamic criteria, like “stuff I like that I haven’t heard in 2 weeks”. This entails tracking what I listen to and being able to rate stuff.
Upload My Own – No reason for me to have to buy things again. I’m fine with paying a small extra fee for this, but I should also be able to work that off by buying new stuff. Amazon hosts stuff I’ve bought from them for free, but charges me for uploads, so in the long run they could actually end up costing me more. They should give me a 50MB bonus per album to upload other files.
Mobile – My phone is my music player now, I should be able to stream/sync/download from it as well as my computer.

Bonus Features

API – Let me have another program talk to your service to do things like recommendations and missing tracks.
Podcasts – This doesn’t necessarily have to be done in-service, if the API allowed uploads someone else could do it, but it seems pretty trivial to add on if all of the above things are in place.

Don’t Really Care

Sharing – Nice to have but I’d be fine with a service I can’t share. I’d prefer the option to sign into more than one account at a time.

Software that isn’t afraid to ask questions

Posted on May 13, 2011June 2, 2011 by Eric F. Savage

An area that user-focused software has gotten better at in the past 10 years or so is being aware, and protective of, the context in which users are operating. Things like autocomplete and instant validation are expected behaviors now. An area that software is really picking up steam is analytics, understanding behaviors. You see lightweight versions of this creeping into consumer software with things like Mint.com and the graphs in Thunderbird, but most of the cool stuff is happening on a large scale in Hadoop clusters and hedge funds, because that where the money is right now.

But where software has not been making advancements is in being proactively helpful, using that context awareness, as well as those analytics. If that phrase puts you in a Clippy-induced rage, my apologies, but I think this is an area where software needs to go. I think Clippy failed because it was interfering with creative input. We’ve since learned that when I user wants to tell you something, you want to expedite that, not interfere. Google’s famed homepage doesn’t tell you how, or how to search. They’ve adapted to work with what people want to tell it.

I’m talking about software that gets involved in things computers are good at, like managing information, and gets involved in the process the way that a helpful person would. We’ve done some of this in simple, mechanical ways. Mint.com will tell me when I’ve blown my beef jerky budget, Thunderbird will remind you to attach a file if you have the word “attached” in your email. I think this is a teeny-tiny preview of where things will go.

Let’s say you get a strange new job helping people manage their schedule. You get assigned a client. What’s the first thing you do, after introducing yourself? You don’t sit there and watch them, or ask them to fill out a calendar and promise to remind them when things are due. No, you ask questions. And not questions a computer would currently ask, but a question like “what’s the most important thing you do every day?”. Once you’ve gotten a few answers, you start making specific suggestions like “Do you think you could do this task on the weekends instead of before work?”.

Now, we’re a long way from software fooling people into thinking it cares about them, or understand their quirks, but we’re also not even trying to do the simple stuff. When I enter an appointment on Google calendar, it has some fields I can put data in, but it makes no attempt to understand what I’m doing. It doesn’t try to notice that it’s a doctor’s appointment in Boston at 9am and that I’m coming from an hour away during rush hour, and maybe that 15 minute reminder isn’t really going to do much. It would be more helpful if it asks a question like “are you having blood drawn?”, because if I am, it can then remind me the night before that I shouldn’t eat dinner. It can look at traffic that morning and tell me that maybe I should leave even earlier because there’s an accident. It can put something on my todo list for two weeks from now to see if the results are in. All from asking one easy question.

Now, a programmer who got a spec with a feature like this would probably be speechless. The complexity and heuristics involved are enormous. It would probably get pared down to “put doctor icon on appointment if the word doctor appears in title”. Lame, but that’s a start, right? I think this behavior is going to be attacked on many fronts, from “dumb” rules like that, to fancy techniques that haven’t even been invented yet.

I’ve started experimenting with this technique to manage the list of ideas/tasks I have. In order to see how it might work, I’ve actually forbidden myself to even use a GUI. It’s all command line prompts, because I basically want it to ask me questions rather than accept my commands. There’s not much to it right now, it basically picks an item off the list, and says, “Do you want to do this?” and I have to answer it (or skip it, which is valid data too). I can say it’s already done, or that I can’t do it because something else needs to happen first, or that I just don’t want to do it today.

If it’s having trouble deciding what option to show me, it will show two of them and say “Which of these is more important?”. Again, I’m not re-ordering a list or assigning priorities, I’m answering simple questions. More importantly, I’m only answering questions that have a direct impact on how the program helps me. None of this is artificial intelligence or fancy math or data structures, the code is actually pretty tedious so far, but even after a few hours, it actually feels helpful, almost personable.

If you know of any examples of software that actually tries to help in meaningful ways, even if it fails at it, let me know!

Ubuntu: See you in 2012

Posted on April 14, 2011June 2, 2011 by Eric F. Savage

As a follow-up to my previous post, I’ve just finished moving off my Ubuntu VMs. I don’t necessarily blame Ubuntu for this, but it’s just a little too laggy in a VM. I bet it’s only a few ms most of the time, but it’s noticeable and it’s frustrating when you’re in a good flow. Perhaps next year, if there have been improvements on both the Linux and VMWare side.

I did try VirtualBox, which seemed slightly more responsive but was very flaky, it would randomly lock up in strange ways. I also tried Virtual PC, which isn’t really an option since it lacks real multi-monitor support, but didn’t offer any improvements anyways.

Another issue which may seem minor but any programmer will tell is definitely not is that my code font, Lucida Console, doesn’t work the same on Linux. I’m not familiar enough with font mechanics to say how, but I tried everything including fractional font sizes to going through probably 50 other fonts, and it just doesn’t have the right density. Fonts on Linux are getting better though, I did enjoy the Ubuntu font in the OS UI.

Alternatively running Windows 7 in VMWare so far has been almost completely transparent. I haven’t done it enough to say it’s a total success but I’ve already gotten confused when I’m in full-screen, which is a good sign. Of course the downside is that licensing issues are tricky and expensive, but since these are for paying projects, I can recoup the $200 fee. I may try an OEM license for the next one, it appears to be legal and about half the cost.

To Git or not to Git

Posted on March 17, 2011March 17, 2011 by Eric F. Savage

You might notice from my past few posts that I’m basically going through my entire stack of tools and re-evaluating everything. This time it’s version control.

A little history

I’ve mostly used SVN, and before that, CVS. I’ve tinkered with some of the more heavyweight ones like ClearCase, TrueChange, and Visual SourceSafe as part of consulting gigs, but only enough to know that they were skills unto themselves, and ones I didn’t especially want to let into my brain. My personal repo up till now has been SVN after finally switching over from CVS a few years ago.

Why SVN?

The short answer is, because it’s easy. The longer answer is that it’s easy to set up, it’s fairly hard to break, and it has a decent Eclipse plugin. You might notice that I didn’t mention anything about branches, or rollbacks, or speed, or centralized vs. distributed. Those things don’t really matter to me if the first three requirements aren’t satisfied.

Branches are the devil

I don’t hate branches because they were a legendary nightmare in CVS. I don’t hate branches because svn merge rarely works. I hate branches because of the mental cost they inflict on a team.

Having a team work in multiple branches is, as far as I’ve ever seen it, a sign that your team is too big or your project is too monolithic or your effective management and oversight capabilities are lacking.

There are cases where branches don’t impose such a cost, however. If there are no plans to ever merge a branch back to trunk, they are simply an experimental offshoot where a few snippets might be pulled into trunk/master, that’s fine.

If everyone is switching to a new branch, that’s also fine. At one point we had to rollback to code from over a month prior, and weren’t sure if we were going to get back to trunk or not. Everyone switched to the new branch, and luckily everyone was able to switch back to trunk a few months later. It took days to merge back, and that wasn’t SVN’s fault, that was just a lot of human time that was required to pull two very different versions together safely and not leave landmines all over the place.

Not on my resume

I don’t put any VCS software on my resume, because, while it’s absolutely important to use one, I don’t think it’s that important which one I know or use (it is a good interview question, though). They aren’t really that hard to learn, and since everyone uses them slightly differently there’s no avoiding some ramp-up time. If an employer ever has me and another candidate so close that our VCS experience is the deciding factor, please, take the other person.

Dirty little secret

I don’t actually have the command line version of svn or cvs installed on any of my workstations. Nor do I have standalone GUIs or shell integration like Tortoise. I know the command line, and use it on servers, but I do all my actual development with Eclipse’s integrated client. I’ve actually even used Eclipse to manage svn projects that were Flash or C. I just find the command line so restricting and linear for what really is a very non-linear task. The eclipse subversion plugin took years to go from bad to passable, and it’s still not as good as the CVS version, which is one reason I never grew too attached to SVN.

Why? I need to see everything that’s changed, that’s dirty. I diff every file individually before I commit. I often find changes I didn’t really need to make, and nuke them. Sometimes I find that I actually didn’t account for a situation the old code did. Many times just looking at my code in this way makes me think just different enough that I come up with a better way of doing it. I simply don’t have this visibility amidst the >>> and <<< of a command-line diff, because it’s not my native development environment.

Enter git

So, even though I’d be happy to continue using SVN, I need to see what all this git hubbub is about. It’s been around for over 5 years now, so clearly it’s not a fad. It’s also been vouched for by enough voices I respect that it deserved a shot. There is also Mercurial and Bazaar, but I haven’t seen nearly the same level of buy-in from trusted people for those.

My sideline view is that git is favored by the python people and the “ruby taliban”. Mercurial seems to be favored by the Microsoft and enterprise crowd, and bazaar is somewhere out back playing with package maintainers. Java is still mostly in SVN land, probably because it’s more mature, more corporate, and slower moving. 5 years isn’t a long time in Java years these days, so I’d bet that a high percentage of projects people are still working on are from when git was just Linus stomping his feet. The Spring/JBoss people seem to have gone the Mercurial route, while Eclipse is going git.

Git also has github, which is used by some people I know, while I don’t know anyone personally who is using Mercurial’s version, bitbucket (or even using Mercurial for that matter). So I ultimately went with what Eclipse and my friends were using over the other interests, and started with git. From what I understand the differences are slight in the early stages anyways, this was really more a matter of trying DVCS vs. VCS.

First steps

I started off with Github’s helpful handholding, which included installing msysgit. I imported my projects and used it in earnest for a few days. Once I was confident in my ability to actually get stuff up there, I dug a little deeper.

I read the Pro Git book, which I need to call special attention to because it’s really, really, good. It’s short, concise, has diagrams where you need diagrams, and ranks high in terms of how computer books should be written. If you don’t know anything about git, spend a night reading through this, and you will know plenty.

What about the secret?

So yeah, I was using the command line. Lean in closer so I can tell you why…BECAUSE THE GUIS ARE TERRIBLE. They’re not just ugly, they’re interface poisoning at a master level, think what an even more complicated Bugzilla would look like. And yes I say “they”, because there is more than one, and they’re all basically someone jamming the command line output into various GUI toolkits. Part of this is git’s design, but I know that someone will figure this out.

Git’s design allows for a huge number of permutations of workflow, which means there are a number of extra steps when comparing to something like subversion. On the command line, this doesn’t seem to hurt that much (in comparison). But GUIs don’t deal with situations like this very well. They can either be helpful and guide you down a path, or play dumb and wait for you to hold it’s hand. All of the Git guis I’ve seen so far do the latter.

Am I being a stick in the mud and saying that something as marvelous as git should be constrained to the simplicity of dumb old svn? Actually, yes. I should be able to edit some files, see a list of those files, diff them against the “real” version of the file (as in the one everyone else sees) and commit, with message. Then go home. I don’t care about SHA-1 hashes because I don’t remember them, I only need them when you need to tell me that two things are different. I don’t care about branches other than knowing which I’m in (we’ll get to this next). I don’t want to be bothered with any of this fancy information if nothing is broken (or going to break if I continue).

This isn’t actually a problem of git. This is a problem of people being indecisive when it comes to UIs. If you do everything, you fail. If you do nothing, you fail. If you do any subset of everything, you fail for some people. That’s OK, don’t worry, you can optimize as you go. Your first priority should not be exposing the power of git, it should be letting me put my code on the server so I can go home. Let me drop into all this fancy stuff with local branches and pushing tags and rebasing and such on a case-by-case basis, when I need it, and when I’m good at it.

What about the branches?

Git does branches right. I could go into more detail, but Pro Git does it so well I won’t bother, so just go read that. What git does not do very well, and I’m not sure if anything can do as long as humans are involved, is mitigate the mental cost of a distributed branch. It does make merging much more feasible, and allows for a larger number of cases where branches are not going to cost much, but they still need to be used with discipline and restraint.

The new idea that git adds is the local branch. I haven’t had a chance to use this much yet, but this is the feature that may ultimately win me over. I can look back and say “when have I ever needed a local branch?” and the answer would be “a few times, but not often”. But if I look back and ask “when would I have benefited from a local branch” and answer would be “hmm, I don’t know, but probably more often than I needed one”.

The example of the hotfix scenario (where you need to fix/test something from last week’s release and trunk/master isn’t ready) isn’t very compelling to me as an SVN user. It’s easy to make an SVN branch for something like this. I svn copy and switch it if my local copy is clean. If not, I can check out the project again, or if its a big one, I copy it over and svn switch it. Not as easy as git, but then again, I don’t generally have to make alot of hotfixes either.

The issue scenario (work one one issue per branch, merge when/if complete) is more compelling. I’d like to say that it isn’t and that I try to start and finish one issue at a time, but obviously that doesn’t happen enough. I like the fact that it’s so cheap to make a branch that I might as well just do it all the time. If I didn’t end up needing it, no harm done. If I did end up needing it, because some other issue suddenly got more important and the one I’m working needs to chill, then I’m glad it’s there.

The real win here is that nobody has to know about my branch, which means they never have to wonder what’s in it, or if its up to date. This means there is no cost to my team because I have a branch that is 2 weeks out of date. There is cost to me, but no more than having multiple versions of the project checked out, or a set of patch files sitting there waiting for me to get back to it.

One more thing

The fact that every developer has a full copy of the repository on their computer, basically for free? That’s really nice. Sure, you do backups and the chances of your computer being the last one on Earth with a copy of the repo is slim, but the peace of mind is undeniable.

The Verdict

The only reason this is really a decision at all is that git is harder to use for normal stuff. The command line can be scripted, so if someone got the GUI to the point where you start with an SVN-style workflow and deviate as needed, there really would be no argument to using SVN from what I can see.

My decision is that git is the winner here, despite that massive failing because I have faith that a combination of two things will happen. I will learn the GUIs better because it’s worth my time to do so, and someone will eventually figure out how to make it smarter and simpler.

I could not fault someone for sticking with SVN, even for a new project, because you can always import it later, but I will be starting new projects in git.

Why don’t websites have credits?

Posted on March 10, 2011June 7, 2011 by Eric F. Savage

Engineers of any discipline are largely an anonymous bunch. You don’t know who designed the fuel pump in your car, I’d even wager it would be extremely difficult for you find out if you wanted to. You don’t know who wrote the code for the OS X Dock or Windows Start bar or who wrote the Like button on Facebook. These people made decisions that affect you deeply every day, and you have no idea who they are.

The most interesting part of this is that those people are OK with it. If you ask them (myself included) they will tell you that it doesn’t matter, that what really matters is the quality of the work and the enjoyment you had doing it. Unfortunately, I think we’re wrong.

Should they?

I can’t seem to come up with a good framework for who figuring out who wants credit, never mind who deserves it. If you so much as make a photocopy during the production of a movie, you’re probably in the credits with some high-faluten title like “First deputy assistant duplication specialist”. Music credits are tied to royalties and managed very closely. Most authors wouldn’t think about publishing something anonymously, nor would artists or sculptors. Artists always sign their work.

This is not even strictly a software issue. Video games list credits, often in the box and at the end of the game, and they even have a IMDB-like site. Nor is it an “arts & entertainment” issue, any credible scientific paper will cite other works and acknowledge contributions. Patents have names on them, even when assigned to a company.

A few software packages have listed credits. If I remember correctly, Microsoft did it on old versions of Word and Excel, and Adobe had it on old versions of Photoshop and Illustrator. I’m curious why those were removed, or at least hidden. “The Social Network” had something about Saverin being removed and re-added to “masthead” of Facebook (although I don’t know what or where that is).

So it would seem that we might be in the minority here, perhaps due to convention rather than any specific reason. And if there’s one thing that bugs an engineer, it’s deviating from standards with no good reason.

So let’s do it.

Why do it?

Pride in your work – Sure there is some pride in doing a good job anonymously, but wouldn’t be just a little more motivated or happy now that your name is on it?
Being a stakeholder – We’ve all done projects we didn’t believe in, and consoled ourselves with the fact that “it’s not my project”. Well, now it is.
Reputation – We’ve got our resumes, but credits will verify them.
Honesty/Transparency – There is no good reason to withhold this information, so it should be out there.
All that money they spent on school – Show your parents your name on a website and watch them smile.

So who’s get listed?

I think the short answer here is, everyone. Movies do it, why not websites? It could be just a big list of names, or something more detailed with contributions, dates, whatever makes sense. Let’s just start throwing some names up there, and let the de facto standards evolve on their own.

If you know of any major sites that do this well, put it in the comments. Similarly, if you can think of a good reason why this shouldn’t happen, I’d love hear about it.