csvdocument: added some notes about CSV variations implementation

git-svn-id: https://svn.code.sf.net/p/lazarus-ccr/svn@1512 8e941d3f-bd1b-0410-a28a-d453659cc2b4
2011-02-27 11:34:33 +00:00
parent 4a643de30c
commit 78c52d3eeb
1 changed files with 53 additions and 6 deletions
--- a/components/csvdocument/doc/todo.txt
+++ b/components/csvdocument/doc/todo.txt
@@ -1,7 +1,9 @@
 === TODO ===
-* Write more tests for different CSV variations
+
+* Write more tests for different CSV format variations, especially those used by Excel and Calc.

 === Warning about speed optimizations ===
+
 An attempt to speed up buffer operations (FCellBuffer, FWhitespaceBuffer)
 by memory preallocation using straightforward String Builder implementation
 resulted in about 25% slowdown compared with current implementation based
@@ -16,8 +18,33 @@ With StrBuf:    2423, 2437, 2404, 2471, 2405 ms
 This happened on Linux too and was not tested on other platforms.
 These changes were not committed either.

-=== Warning about CSV extensions like escaping special chars and line breaks ===
-There are more problems in implementing them than it seems at first glance:
+=== Some thoughts about CSV variations ===
+
+There are two CSV specifications:
+* RFC 4180 Common Format and MIME Type for Comma-Separated Values (CSV) Files
+  http://tools.ietf.org/html/rfc4180
+* An unofficial CSV specification
+  http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm#FileFormat
+
+The latter (unofficial) specification mentions two CSV format features
+that are not part of RFC 4180. The first of them is mentioned as mandatory:
+1) Leading and trailing space-characters adjacent to comma field separators are ignored.
+   Fields with leading or trailing spaces must be delimited with double-quote characters.
+The second feature is optional and comprises several variations
+(http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm#CSVariations):
+2) Embedded line-feeds in fields. This one is also escaped sometimes. Often like in C ("\n").
+   Embedded commas in fields. Again, an escape character is sometimes used in place of the comma.
+   Check if line feeds are \n
+   Check if embedded double quotes are \"
+   Check if ???
+
+Here are some criticisms of both of these suggested features.
+
+Behavior (1) is explicitly forbidden by RFC 4180: "Spaces are considered part
+of a field and should not be ignored". There is a reason for this: when (1) is obeyed,
+simply loading and saving a CSV document (without any modifications) results in data loss.
+
+As for variations (2), there are more problems in implementing them than it seems at first glance:
 * It should be clearly defined what escaping scheme should be used:
  - what characters must be escaped,
  - what escaped characters have special meaning (like \r and \n),
@@ -33,6 +60,26 @@ There are more problems in implementing them than it seems at first glance:
    \w\w\wescaped non-trimmable whitespace\w\w\w
    "   quoted non-trimmable whitespace   "

+Implementing feature (1) at the CSV parser level still has a point.
+This feature requires removing only outer whitespace (whitespace outside quotes)
+and keeping inner whitespace (whitespace inside quotes) intact. However, an application
+that uses a CSV parser does not have access to the quotes and cannot distinguish between
+inner and outer whitespace. That is why this feature cannot be implemented by a client
+application on top of the parser, and should therefore be implemented by the parser itself.
+However, it should be optional and disabled by default to prevent data loss.
+
+As for variations (2), they are too ambiguous to be implemented as is. The ambiguity
+can be removed to some degree by the following limitations:
+- traditional quoting takes precedence over backslash-escaping;
+- backslash-escaping of separators and quotation marks is forbidden, to comply with RFC 4180.
+These limitations allow client applications to implement backslash-escaping themselves
+on top of the CSV parser, effectively turning backslash-escaping into a special field syntax.
+Since CSV fields as defined by RFC 4180 can transparently store any sequence of characters,
+applications are not limited in defining their own subformats (such as backslash-escaping)
+and storing them in CSV fields. That is why there is no point in implementing variations (2)
+on the parser level, unless they are made more specific and require access to CSV internals
+like feature (1) does.
+
 === Links ===
 http://tools.ietf.org/html/rfc4180#section-2
 http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm#FileFormat
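
The two conclusions of the notes added in this commit can be illustrated with a
small sketch. It is written in Python for brevity and is not the csvdocument
implementation: parse_csv_line, its ignore_outer_whitespace option and
unescape_backslashes are hypothetical names, and the parser is deliberately
simplified (one record per call, comma separator, no embedded line breaks).
The first function shows why feature (1) has to live in the parser, since only
the parser sees the quotes. The second shows how a client could layer
backslash-escaping on top of already parsed fields.

```python
def parse_csv_line(line, ignore_outer_whitespace=False):
    """Split one CSV record into fields (simplified, hypothetical sketch)."""
    fields = []
    cell = []        # characters kept for the current field
    pending = []     # outer whitespace whose fate is not yet known
    started = False  # True once the field has seen text or a quoted part

    def keep(chars):
        nonlocal started
        if started:
            cell.extend(pending)  # whitespace between kept parts is preserved
        pending.clear()           # leading outer whitespace is dropped
        cell.extend(chars)
        started = True

    def end_field():
        nonlocal started
        fields.append(''.join(cell))  # trailing outer whitespace is dropped
        cell.clear()
        pending.clear()
        started = False

    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if ch == '"':
            # Quoted part: copied verbatim, "" stands for a literal quote.
            i += 1
            quoted = []
            while i < n:
                if line[i] == '"':
                    if i + 1 < n and line[i + 1] == '"':
                        quoted.append('"')
                        i += 2
                        continue
                    break
                quoted.append(line[i])
                i += 1
            i += 1  # skip the closing quote
            keep(quoted)
        elif ch == ',':
            end_field()
            i += 1
        elif ch in ' \t' and ignore_outer_whitespace:
            pending.append(ch)  # outer whitespace: decide later
            i += 1
        else:
            keep(ch)
            i += 1
    end_field()
    return fields


def unescape_backslashes(field):
    """Client-side subformat: decode \\n, \\r, \\t and \\\\ inside a parsed field."""
    table = {'n': '\n', 'r': '\r', 't': '\t', '\\': '\\'}
    out, i = [], 0
    while i < len(field):
        if field[i] == '\\' and i + 1 < len(field) and field[i + 1] in table:
            out.append(table[field[i + 1]])
            i += 2
        else:
            out.append(field[i])
            i += 1
    return ''.join(out)


record = 'a,  b  ,"  c  "'
default_mode = parse_csv_line(record)
trimming_mode = parse_csv_line(record, ignore_outer_whitespace=True)
client_side = [f.strip() for f in default_mode]  # cannot tell inner from outer
```

With the option disabled (the default), every space survives, as RFC 4180
requires. With it enabled, only the spaces outside the quotes are dropped.
Trimming in the client after parsing also destroys the quoted spaces around
"c", which is the reason given above for putting feature (1) in the parser.
The unescape step needs nothing from the parser, so it can stay in the client.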