csvdocument: added some notes about CSV variations implementation
git-svn-id: https://svn.code.sf.net/p/lazarus-ccr/svn@1512 8e941d3f-bd1b-0410-a28a-d453659cc2b4
@@ -1,7 +1,9 @@
=== TODO ===
* Write more tests for different CSV variations

* Write more tests for different CSV format variations, especially those used by Excel and Calc.

=== Warning about speed optimizations ===

An attempt to speed up buffer operations (FCellBuffer, FWhitespaceBuffer)
by memory preallocation using a straightforward String Builder implementation
resulted in about a 25% slowdown compared with the current implementation based
@@ -16,8 +18,33 @@ With StrBuf: 2423, 2437, 2404, 2471, 2405 ms
This happened on Linux too and was not tested on other platforms.
These changes were not committed either.

=== Warning about CSV extensions like escaping special chars and line breaks ===
There are more problems in implementing them than it seems at first glance:
=== Some thoughts about CSV variations ===

There are two CSV specifications:
* RFC 4180 Common Format and MIME Type for Comma-Separated Values (CSV) Files
  http://tools.ietf.org/html/rfc4180
* An unofficial CSV specification
  http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm#FileFormat

The latter (unofficial) specification mentions two CSV format features
that are not part of RFC 4180. The first of them is mentioned as mandatory:
1) Leading and trailing space-characters adjacent to comma field separators are ignored.
   Fields with leading or trailing spaces must be delimited with double-quote characters.
The second feature is optional and comprises several variations
(http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm#CSVariations):
2) Embedded line-feeds in fields. This one is also escaped sometimes. Often like in C ("\n").
   Embedded commas in fields. Again, an escape character is sometimes used in place of the comma.
   Check if line feeds are \n
   Check if embedded double quotes are \"
   Check if ???

Here is some criticism of both suggested features.

Behavior (1) is explicitly forbidden by RFC 4180: "Spaces are considered part
of a field and should not be ignored". There is a reason for this: when (1) is obeyed,
simply loading and saving a CSV document (without any modifications) will result in data loss.

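The data loss mentioned above can be illustrated with a small sketch. It uses Python's standard csv module rather than csvdocument, and the helper names (load, save) are made up for this note. Trimming is emulated by stripping spaces from every field after parsing, which matches behavior (1) only because the sample contains no quoted fields.

```python
import csv
import io

def load(text, trim=False):
    """Parse CSV text into rows; optionally drop leading/trailing spaces."""
    rows = list(csv.reader(io.StringIO(text)))
    if trim:
        # Emulates behavior (1) for unquoted fields only (assumption of this sketch).
        rows = [[field.strip(" ") for field in row] for row in rows]
    return rows

def save(rows):
    """Serialize rows back to CSV text with CRLF line endings."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\r\n").writerows(rows)
    return out.getvalue()

original = "name, value \r\n x ,1\r\n"
exact = save(load(original))               # spaces kept: identical to the input
trimmed = save(load(original, trim=True))  # spaces gone: the document has changed
```

With spaces preserved, a load/save cycle reproduces the input byte for byte. With trimming, the saved document differs from the one that was loaded even though nothing was edited.
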
As for variations (2), there are more problems in implementing them than it seems at first glance:
* It should be clearly defined what escaping scheme should be used:
  - what characters must be escaped,
  - what escaped characters have special meaning (like \r and \n),
@@ -33,6 +60,26 @@ There are more problems in implementing them than it seems at first glance:
\w\w\wescaped non-trimmable whitespace\w\w\w
" quoted non-trimmable whitespace "

Implementing feature (1) at the CSV parser level still has a point.
This feature requires removing only outer whitespace (whitespace outside quotes)
and keeping inner whitespace (whitespace inside quotes) intact. However, an application
that uses a CSV parser does not have access to the quotes and cannot distinguish between
inner and outer whitespace. That is why this feature cannot be implemented by a client
application on top of the parser, and should therefore be implemented by the parser itself.
However, it should be optional and disabled by default to prevent data loss.

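A rough sketch of what such an optional parser-level feature could look like follows. This is not the csvdocument implementation: parse_line and its trim_outer parameter are hypothetical names, the sketch handles a single record without embedded line breaks, and treating tabs as trimmable whitespace is an assumption. The point is that only the parser knows which pieces of a field were quoted, so only the parser can strip the unquoted edges.

```python
def parse_line(line, delimiter=",", trim_outer=False):
    """Split one CSV record into fields (embedded line breaks not supported).

    trim_outer=False keeps the RFC 4180 behavior: spaces are part of the field.
    trim_outer=True removes whitespace outside quotes only.
    """
    fields = []
    i, n = 0, len(line)
    while True:
        parts = []  # (text, was_quoted) pieces of the current field
        while i < n and line[i] != delimiter:
            if line[i] == '"':
                i += 1
                buf = []
                while i < n:
                    if line[i] == '"':
                        if i + 1 < n and line[i + 1] == '"':
                            buf.append('"')  # doubled quote is a literal quote
                            i += 2
                            continue
                        i += 1  # closing quote
                        break
                    buf.append(line[i])
                    i += 1
                parts.append(("".join(buf), True))
            else:
                start = i
                while i < n and line[i] not in (delimiter, '"'):
                    i += 1
                parts.append((line[start:i], False))
        if trim_outer and parts:
            # Strip only unquoted text at the edges of the field.
            if not parts[0][1]:
                parts[0] = (parts[0][0].lstrip(" \t"), False)
            if not parts[-1][1]:
                parts[-1] = (parts[-1][0].rstrip(" \t"), False)
        fields.append("".join(text for text, _ in parts))
        if i >= n:
            break
        i += 1  # skip the delimiter
    return fields
```

For example, parse_line(' a , " b " ,c', trim_outer=True) keeps the spaces inside the quotes of the second field and drops the ones around it, while the default call leaves every space in place.
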
As for variations (2), they are too ambiguous to be implemented as is. The ambiguity
can be removed to some degree by the following limitations:
- traditional quoting takes precedence over backslash-escaping;
- backslash-escaping of separators and quotation marks is forbidden to obey RFC 4180.
These limitations allow client applications to implement backslash-escaping themselves
on top of the CSV parser, effectively turning backslash-escaping into a special field syntax.
Since CSV fields as defined by RFC 4180 can transparently store any sequence of characters,
applications are not limited in defining their own subformats (such as backslash-escaping)
and storing them in CSV fields. That is why there is no point in implementing variations (2)
at the parser level, unless they are made more specific and require access to CSV internals
like feature (1) does.

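As an illustration of backslash-escaping done by the client on top of an unmodified parser, here is a minimal sketch using Python's standard csv module in place of csvdocument. All helper names and the choice of escapes (\\, \n, \r) are assumptions made for this example. Separators and quotation marks are never backslash-escaped, so ordinary RFC 4180 quoting still handles them.

```python
import csv
import io

_ESCAPES = {"\\": "\\\\", "\n": "\\n", "\r": "\\r"}
_UNESCAPES = {"\\": "\\", "n": "\n", "r": "\r"}

def escape_field(value):
    """Application-level subformat: encode backslashes and line breaks."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)

def unescape_field(text):
    """Reverse escape_field; unknown escapes and a lone trailing backslash stay literal."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, None)
            if nxt is None:
                out.append("\\")
            else:
                out.append(_UNESCAPES.get(nxt, "\\" + nxt))
        else:
            out.append(ch)
    return "".join(out)

def dump_row(values):
    """Escape each value, then let the CSV layer do its usual quoting."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\r\n").writerow([escape_field(v) for v in values])
    return out.getvalue()

def load_row(line):
    """Parse one CSV record, then undo the application-level escaping."""
    return [unescape_field(field) for field in next(csv.reader([line]))]
```

The CSV layer sees only plain fields, so the record always fits on one physical line, and the parser needs no knowledge of the escaping scheme.
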
=== Links ===
http://tools.ietf.org/html/rfc4180#section-2
http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm#FileFormat