csvdocument: added some notes about CSV variations implementation

git-svn-id: https://svn.code.sf.net/p/lazarus-ccr/svn@1512 8e941d3f-bd1b-0410-a28a-d453659cc2b4
2011-02-27 11:34:33 +00:00
parent 4a643de30c
commit 78c52d3eeb
1 changed files with 53 additions and 6 deletions
--- a/components/csvdocument/doc/todo.txt
+++ b/components/csvdocument/doc/todo.txt
@@ -1,7 +1,9 @@
 === TODO ===
-* Write more tests for different CSV variations
+
 * Write more tests for different CSV format variations, especially those used by Excel and Calc.
 === Warning about speed optimizations ===
 An attempt to speed up buffer operations (FCellBuffer, FWhitespaceBuffer)
 by preallocating memory with a straightforward String Builder implementation
 resulted in about a 25% slowdown compared with the current implementation based
@@ -16,8 +18,33 @@ With StrBuf:    2423, 2437, 2404, 2471, 2405 ms
 This happened on Linux too and was not tested on other platforms.
 These changes were not committed either.
-=== Warning about CSV extensions like escaping special chars and line breaks ===
+=== Some thoughts about CSV variations ===
-There are more problems in implementing them than it seems at first glance:
+
 There are two CSV specifications:
 * RFC 4180 Common Format and MIME Type for Comma-Separated Values (CSV) Files
  http://tools.ietf.org/html/rfc4180
 * An unofficial CSV specification
  http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm#FileFormat
 The latter (unofficial) specification mentions two CSV format features
 that are not part of RFC 4180. The first of them is described as mandatory:
 1) Leading and trailing space-characters adjacent to comma field separators are ignored.
   Fields with leading or trailing spaces must be delimited with double-quote characters.
 The second feature is optional and comprises several variations
 (http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm#CSVariations):
 2) Embedded line-feeds in fields. This one is also escaped sometimes. Often like in C ("\n").
   Embedded commas in fields. Again, an escape character is sometimes used in place of the comma.
   Check if line feeds are \n
   Check if embedded double quotes are \"
   Check if ???
 Here are some criticisms of both suggested features.
 Behavior (1) is explicitly forbidden by RFC 4180: "Spaces are considered part
 of a field and should not be ignored". There is a reason for this: when (1) is obeyed,
 simply loading and saving a CSV document (without any modifications) results in data loss.
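The data loss can be illustrated with a minimal sketch. It uses Python's
standard csv module rather than csvdocument, and the load/save helpers and the
trim_outer_spaces flag are hypothetical names introduced only for this example.

```python
import csv
import io


def load(text, trim_outer_spaces=False):
    """Parse CSV text into rows of fields.

    trim_outer_spaces is a hypothetical switch that emulates behavior (1)
    for unquoted fields by stripping spaces around every field.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if trim_outer_spaces:
        rows = [[cell.strip(" ") for cell in row] for row in rows]
    return rows


def save(rows):
    """Serialize rows back to CSV text with LF line endings."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


original = "a,  b  ,c\n"

# RFC 4180 behavior: spaces belong to the field, so the round trip is lossless.
rfc_round_trip = save(load(original))

# Behavior (1): the padding around "b" is gone after a plain load and save.
trimmed_round_trip = save(load(original, trim_outer_spaces=True))
```

Under these assumptions rfc_round_trip equals the original text, while
trimmed_round_trip does not, although the document was never modified.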
 As for variations (2), implementing them poses more problems than it seems at first glance:
 * It should be clearly defined what escaping scheme should be used:
  - what characters must be escaped,
  - what escaped characters have special meaning (like \r and \n),
@@ -33,6 +60,26 @@ There are more problems in implementing them than it seems at first glance:
    \w\w\wescaped non-trimmable whitespace\w\w\w
    "   quoted non-trimmable whitespace   "
 Implementing feature (1) at the CSV parser level still makes sense.
 This feature requires removing only outer whitespace (whitespace outside quotes)
 and keeping inner whitespace (whitespace inside quotes) intact. However, an application
 that uses a CSV parser does not have access to the quotes and cannot distinguish between
 inner and outer whitespace. That is why this feature cannot be implemented by a client
 application on top of the parser, and should therefore be implemented by the parser itself.
 However, it should be optional and disabled by default to prevent data loss.
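A minimal sketch of why the quotes matter, written in Python rather than in the
component's own language. parse_line and its trim_outer option are hypothetical
and do not reflect the csvdocument API; the sketch handles a single record,
treats only the space character as whitespace, and ignores embedded line breaks.

```python
def parse_line(line, trim_outer=False, sep=","):
    """Split one CSV record into fields.

    trim_outer (hypothetical option, off by default): drop spaces that lie
    outside quotes at the edges of a field, keep spaces inside quotes.
    """
    fields = []
    i, n = 0, len(line)
    while True:
        parts = []  # (text, was_quoted) pieces of the current field
        while i < n and line[i] != sep:
            if line[i] == '"':
                i += 1  # skip the opening quote
                buf = []
                while i < n:
                    if line[i] == '"':
                        if i + 1 < n and line[i + 1] == '"':
                            buf.append('"')  # doubled quote is a literal quote
                            i += 2
                            continue
                        i += 1  # skip the closing quote
                        break
                    buf.append(line[i])
                    i += 1
                parts.append(("".join(buf), True))
            else:
                start = i
                while i < n and line[i] not in (sep, '"'):
                    i += 1
                parts.append((line[start:i], False))
        if trim_outer and parts:
            # Only unquoted pieces at the field edges are outer whitespace.
            text, quoted = parts[0]
            if not quoted:
                parts[0] = (text.lstrip(" "), False)
            text, quoted = parts[-1]
            if not quoted:
                parts[-1] = (text.rstrip(" "), False)
        fields.append("".join(text for text, _ in parts))
        if i >= n:
            break
        i += 1  # skip the separator
    return fields


record = '  x  ,"  x  "'
as_seen_by_client = parse_line(record)
trimmed_by_parser = parse_line(record, trim_outer=True)
```

With the default settings both fields reach the client as the same string, so
the client cannot tell which spaces were quoted. Only the parser, which still
sees the quotes, can trim the first field and leave the second one intact.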
 As for variations (2), they are too ambiguous to be implemented as is. The ambiguity
 can be reduced to some degree by the following limitations:
 - traditional quoting takes precedence over backslash-escaping;
 - backslash-escaping of separators and quotation marks is forbidden, so that RFC 4180 is obeyed.
 These limitations allow client applications to implement backslash-escaping themselves
 on top of the CSV parser, effectively turning backslash-escaping into a special field syntax.
 Since CSV fields as defined by RFC 4180 can transparently store any sequence of characters,
 applications are free to define their own subformats (such as backslash-escaping)
 and store them in CSV fields. That is why there is no point in implementing variations (2)
 at the parser level, unless they are made more specific and require access to CSV internals
 like feature (1) does.
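As an illustration, backslash-escaping can be layered on top of an RFC 4180
parser roughly as follows. This is a sketch in Python using the standard csv
module, not csvdocument. The escape set (backslash, \n, \r) and the helper names
are assumptions made for the example; separators and quotation marks are left
to traditional quoting, as the limitations above require.

```python
import csv
import io

# Assumed escape set for this sketch: backslash, line feed, carriage return.
_ESCAPES = {"\\": "\\\\", "\n": "\\n", "\r": "\\r"}
_UNESCAPES = {"\\": "\\", "n": "\n", "r": "\r"}


def escape_field(value):
    """Encode a value into the application-level backslash subformat."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def unescape_field(text):
    """Decode the subformat; unknown escape sequences are kept verbatim."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            out.append(_UNESCAPES.get(nxt, "\\" + nxt))
        else:
            out.append(ch)
    return "".join(out)


def dump(rows):
    """Escape every field, then let the CSV writer handle commas and quotes."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows([[escape_field(cell) for cell in row] for row in rows])
    return out.getvalue()


def load(text):
    """Parse plain CSV, then decode the subformat field by field."""
    reader = csv.reader(io.StringIO(text))
    return [[unescape_field(cell) for cell in row] for row in reader]


rows = [["line one\nline two", 'say "hi", ok', "C:\\temp"]]
text = dump(rows)
restored = load(text)
```

The parser never sees a backslash as anything special: it stores and returns
ordinary field text, and the application alone decides what the backslashes mean.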
 === Links ===
 http://tools.ietf.org/html/rfc4180#section-2
 http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm#FileFormat