csvdocument: added some notes about CSV variations implementation
git-svn-id: https://svn.code.sf.net/p/lazarus-ccr/svn@1512 8e941d3f-bd1b-0410-a28a-d453659cc2b4
@@ -1,7 +1,9 @@
 === TODO ===
-* Write more tests for different CSV variations
+* Write more tests for different CSV format variations, especially those used by Excel and Calc.
 
 === Warning about speed optimizations ===
 
 An attempt to speed up buffer operations (FCellBuffer, FWhitespaceBuffer)
 by memory preallocation using a straightforward String Builder implementation
 resulted in about 25% slowdown compared with the current implementation based
@@ -16,8 +18,33 @@ With StrBuf: 2423, 2437, 2404, 2471, 2405 ms
 This happened on Linux too and was not tested on other platforms.
 These changes were not committed either.
 
-=== Warning about CSV extensions like escaping special chars and line breaks ===
-There are more problems in implementing them than it seems at first glance:
+=== Some thoughts about CSV variations ===
+There are two CSV specifications:
+* RFC 4180 Common Format and MIME Type for Comma-Separated Values (CSV) Files
+http://tools.ietf.org/html/rfc4180
+* An unofficial CSV specification
+http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm#FileFormat
+
+The latter (unofficial) specification mentions two CSV format features
+that are not part of RFC 4180. The first of them is described as mandatory:
+1) Leading and trailing space-characters adjacent to comma field separators are ignored.
+Fields with leading or trailing spaces must be delimited with double-quote characters.
+The second feature is optional and comprises several variations
+(http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm#CSVariations):
+2) Embedded line-feeds in fields. This one is also escaped sometimes. Often like in C ("\n").
+Embedded commas in fields. Again, an escape character is sometimes used in place of the comma.
+Check if line feeds are \n
+Check if embedded double quotes are \"
+Check if ???
+
+Here is some criticism of both of these suggested features.
+
+Behavior (1) is explicitly forbidden by RFC 4180: "Spaces are considered part
+of a field and should not be ignored". There is a reason for this: when (1) is obeyed,
+simply loading and saving a CSV document (without any modifications) will result in data loss.
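The data-loss argument can be illustrated with a minimal sketch. It uses Python's standard csv module purely for illustration and is not the csvdocument API. The helper names load and save and the ignore_outer_spaces flag are hypothetical, and the strip call is only a crude stand-in for behavior (1), acceptable here because the sample contains no quoted fields.

```python
import csv
import io


def load(text, ignore_outer_spaces=False):
    """Parse CSV text into rows; optionally imitate behavior (1)."""
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if ignore_outer_spaces:
        # Crude stand-in for behavior (1): drop spaces next to separators.
        # Only valid for this sample, which has no quoted fields.
        rows = [[cell.strip(" ") for cell in row] for row in rows]
    return rows


def save(rows):
    """Serialize rows back to CSV text with CRLF line endings."""
    out = io.StringIO(newline="")
    csv.writer(out, lineterminator="\r\n").writerows(rows)
    return out.getvalue()


original = "id, name ,note\r\n1, Ann ,ok\r\n"

# Load and save without touching the data.
strict = save(load(original))
trimmed = save(load(original, ignore_outer_spaces=True))

# RFC 4180 style: the spaces are part of the field and survive.
print(load(strict)[1])
# Behavior (1): the spaces are gone after a plain load/save cycle.
print(load(trimmed)[1])
```

Under these assumptions the second round trip no longer contains the spaces around "Ann", although the application never modified the document.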
+
+As for variations (2), implementing them poses more problems than it seems at first glance:
 * It should be clearly defined what escaping scheme should be used:
 - what characters must be escaped,
 - what escaped characters have special meaning (like \r and \n),
@@ -33,6 +60,26 @@ There are more problems in implementing them than it seems at first glance:
 \w\w\wescaped non-trimmable whitespace\w\w\w
 " quoted non-trimmable whitespace "
 
+Implementing feature (1) at the CSV parser level still has a point.
+This feature requires removing only outer whitespace (whitespace outside quotes)
+and keeping inner whitespace (whitespace inside quotes) intact. However, an application
+that uses a CSV parser does not have access to the quotes and cannot distinguish between
+inner and outer whitespace. That is why this feature cannot be implemented by a client
+application on top of the parser, and should therefore be implemented by the parser itself.
+However, it should be optional and disabled by default to prevent data loss.
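The point about inner and outer whitespace can be sketched with a tiny hand-written field splitter. This is an illustrative Python sketch under simplifying assumptions (one record per call, comma separator, no embedded line breaks), not the csvdocument parser. The function name split_fields and the ignore_outer_spaces option are hypothetical.

```python
def split_fields(line, ignore_outer_spaces=False):
    """Split one CSV record into fields.

    With ignore_outer_spaces=False (the default) spaces are always kept.
    With ignore_outer_spaces=True only spaces outside quotes are dropped.
    """
    fields = []
    i, n = 0, len(line)
    while True:
        start = i
        if ignore_outer_spaces:
            while i < n and line[i] == " ":
                i += 1
        if i < n and line[i] == '"':
            # Quoted field: everything up to the closing quote is inner text.
            i += 1
            buf = []
            while i < n:
                if line[i] == '"':
                    if i + 1 < n and line[i + 1] == '"':
                        buf.append('"')  # doubled quote is a literal quote
                        i += 2
                    else:
                        i += 1  # closing quote
                        break
                else:
                    buf.append(line[i])
                    i += 1
            tail_start = i
            while i < n and line[i] != ",":
                i += 1
            tail = line[tail_start:i]
            if ignore_outer_spaces:
                tail = tail.strip(" ")
            field = "".join(buf) + tail
        else:
            # Unquoted field: all of its whitespace is outer whitespace.
            i = start
            while i < n and line[i] != ",":
                i += 1
            field = line[start:i]
            if ignore_outer_spaces:
                field = field.strip(" ")
        fields.append(field)
        if i >= n:
            break
        i += 1  # skip the separator
    return fields


record = '" a ", b '
default_fields = split_fields(record)
trimmed_fields = split_fields(record, ignore_outer_spaces=True)
# What a client application could do on its own: strip every parsed value.
client_side = [value.strip(" ") for value in default_fields]
```

With the option off, both fields come back as text surrounded by spaces, so the caller cannot tell which one was quoted. A client-side strip (client_side) therefore also destroys the quoted spaces, whereas the parser-level option keeps them.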
+
+As for variations (2), they are too ambiguous to be implemented as is. The ambiguity
+can be reduced to some degree by the following restrictions:
+- traditional quoting takes precedence over backslash-escaping;
+- backslash-escaping of separators and quotation marks is forbidden, to comply with RFC 4180.
+These restrictions allow client applications to implement backslash-escaping themselves
+on top of the CSV parser, effectively turning backslash-escaping into a special field syntax.
+Since CSV fields as defined by RFC 4180 can transparently store any sequence of characters,
+applications are free to define their own subformats (such as backslash-escaping)
+and store them in CSV fields. That is why there is no point in implementing variations (2)
+on the parser level, unless they are made more specific and require access to CSV internals
+like feature (1) does.
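As a rough illustration of backslash-escaping living entirely on top of an RFC 4180 parser, here is a minimal Python sketch using the standard csv module rather than csvdocument. The helper names escape_field and unescape_field and the chosen escape set (\\, \n, \r) are assumptions made for this example. Separators and quotation marks are deliberately left to ordinary CSV quoting, in line with the restrictions above.

```python
import csv
import io

# Hypothetical application-level subformat: only backslash, LF and CR are escaped.
_ESCAPES = {"\\": "\\\\", "\n": "\\n", "\r": "\\r"}
_UNESCAPES = {"\\": "\\", "n": "\n", "r": "\r"}


def escape_field(value):
    """Encode a value so that it contains no raw line breaks."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def unescape_field(value):
    """Decode a value produced by escape_field."""
    out = []
    chars = iter(value)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            # Unknown escapes are kept literally instead of being rejected.
            out.append(_UNESCAPES.get(nxt, "\\" + nxt))
        else:
            out.append(ch)
    return "".join(out)


row = ["line one\nline two", 'a,b "quoted"', "C:\\temp"]

# Writing: the application escapes, the CSV layer quotes commas and quotes.
buf = io.StringIO(newline="")
csv.writer(buf).writerow([escape_field(cell) for cell in row])
text = buf.getvalue()

# Reading: the CSV layer unquotes, the application unescapes.
parsed = next(csv.reader(io.StringIO(text, newline="")))
decoded = [unescape_field(cell) for cell in parsed]
```

The CSV layer never needs to know about the backslashes. It stores the escaped strings as ordinary field content and returns them unchanged, so decoded equals row in this sketch.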
+
 === Links ===
 http://tools.ietf.org/html/rfc4180#section-2
 http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm#FileFormat