Tuesday, July 31, 2007

RFC? WTF is that?

Well, some technicians actually don't know.
I've learned this the hard way.
A recent task I've been assigned is to build a CSV parser.
We all know the format.
Comma-separated values. If you want to include a comma in a value, you quote it.
If a value is empty, you just insert the comma.
If you're done with the data row: carriage return, line feed.
The standard flatfile has been known for decades.
It was one of the very first de facto industry standards that pretty much everyone agreed on.
It was simple, and it just worked.
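
To make those four rules concrete, here is a minimal sketch of the writing side. It is in Python rather than PHP purely for illustration, and `to_csv_line` is a made-up name. Doubling a quote that appears inside a quoted value is the usual convention, but it is an assumption here, since the rules above do not mention it.

```python
def to_csv_line(fields):
    """Encode one row as a CSV line, following the four rules above."""
    out = []
    for field in fields:
        text = "" if field is None else str(field)
        # Rule: a value containing a comma (or a quote, or a line break) gets quoted.
        if any(ch in text for ch in ',"\r\n'):
            # Assumption: embedded quotes are escaped by doubling them.
            text = '"' + text.replace('"', '""') + '"'
        out.append(text)
    # Rule: empty values still get their comma; the row ends with CR LF.
    return ",".join(out) + "\r\n"


print(repr(to_csv_line(["id", "Smith, John", "", "42"])))
# prints 'id,"Smith, John",,42\r\n'
```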

Then came Microsoft.
The company that creates new standards, while destroying the ones already established.
It's like that with ODF, it was the same with SMB, and it's the same with CSV.
A file format whose full specification I just managed to fit into four sentences.

But nooooo....
Let's change the comma into a semicolon. And to avoid confusion, we will keep the name (CommaSeparatedValues).
Oh, and let's assume the encoding. After all, everyone knows what Excel's local standard encoding is, right?
Sure we all do. It's win-125x. Whatever the x.
Oh, and when a line finishes with empty values, let's not waste any bytes. We just drop them.
Just think of all those watts saved this way.
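
For what it's worth, those three quirks (semicolon, unstated code page, dropped trailing values) are easy enough to normalize once you know about them. A rough sketch, again in Python for illustration only: `read_excel_csv` is a hypothetical helper, and `cp1250` is just a guess at the code page, because the file itself never tells you.

```python
import csv
import io


def read_excel_csv(raw_bytes, encoding="cp1250", delimiter=";"):
    """Parse Excel-style CSV bytes into rows that all have the same width."""
    # Assumption: the caller knows (or guesses) the Windows code page.
    text = raw_bytes.decode(encoding)
    # newline="" leaves CR LF untouched so the csv module can handle it itself.
    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
    # Put back the trailing empty values that were dropped from short rows.
    width = max((len(row) for row in rows), default=0)
    return [row + [""] * (width - len(row)) for row in rows]


sample = "imię;miasto;uwagi\r\nŁukasz;Łódź\r\n".encode("cp1250")
for row in read_excel_csv(sample):
    print(row)
```

The second row comes back with three columns, the last one an empty string, instead of two.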

Heh.
But all of this is mostly a problem on Excel's side and can be worked around from inside PHP.
But not all of it.
You have to know (I didn't before stumbling upon a bug) that fgetcsv is tied to the locale (LC_CTYPE) your system is running.
I have no fucking idea what difference it makes whether the semicolon (below the 127 ASCII threshold) comes in as iso-8859-1/2/3 or win-125x.
But it does make a difference to PHP.
Any characters that are not valid in the locale's encoding get cut out.
Obviously without any error message/warning/whatever.
I mean, why bother.

Funny thing is - it's not so hard to hand-code an fgetcsv equivalent in PHP itself.
I managed to do it in fewer than 50 lines of code.
And now for the scary part.
Parsing a 7 MB file, the CSV parser written in interpreted PHP is 20% faster than the native C one you get with fgetcsv.
I was fairly certain that I had an error somewhere and was omitting data.
Doesn't look like it.
It's faster and has fewer restrictions.
Too bad I can't post it here - but it's a work project and belongs to the company.
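
Since the real thing can't be posted, here is only a generic sketch of what a hand-rolled line parser can look like. It is not the work code, it is written in Python rather than PHP, and `parse_csv_line` is a made-up name. It walks the line character by character with a single "inside quotes" flag, and it assumes one record per line (no line breaks inside quoted values) and doubled quotes as the escape.

```python
def parse_csv_line(line, delimiter=",", quote='"'):
    """Split one CSV record (a single line) into a list of field strings."""
    fields = []
    buf = []
    in_quotes = False
    line = line.rstrip("\r\n")  # assumption: the record ends at the line break
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < len(line) and line[i + 1] == quote:
                    buf.append(quote)  # doubled quote means a literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                buf.append(ch)  # delimiters inside quotes are plain data
        elif ch == quote:
            in_quotes = True
        elif ch == delimiter:
            fields.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
        i += 1
    fields.append("".join(buf))  # the last field has no delimiter after it
    return fields


print(parse_csv_line('id,"Smith, John",,42\r\n'))
# prints ['id', 'Smith, John', '', '42']
```

Because it only compares characters against the delimiter and the quote, nothing in it depends on the system locale.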