AsciiIngest Delimiters

AsciiIngest is a tool for ingesting ASCII data into a database, using the DBIngestor library. Both packages are hosts at GitHub:

In order to read the ASCII files correctly and map the columns from the data file to fields in the database, a structure file needs to be define. The documentation (userguide.pdf) for AsciiIngest gives guidance on that. But there are some things to be aware of, if the delimiting character appears in a (quoted) string-column. Let's consider CSV-files as an example.

CSV-file with delimiting character in a column

For a data file with comma-separated values (CSV), one usually needs to define a comma ',' (or semi-colon) as delimiter in the structure file; except for the last column where the delimiter is usually '\n'. That works pretty well for most cases, but what if you have a column with an array, which shall be stored as a string?

The rows in your data file may look like this:

1.45,100,"(0,1,-2,5)",23,99.9
2.89,80,"(1,0,-2,4)",20,100.0

You may expect to be able to use a structure file that defines for each column ',' as delimiter (except for the last column). But this is not how it works: AsciiIngest will always look for the next delimiter-character, disregarding any quotes on the way. I'll repeat that, because it is so important: AsciiIngest will always look for the next delimiter-character, disregarding any quotes on the way. This may change in the future, but that's how it is currently implemented.

That means it will read the 1. row like this:

1.45
100
"(0
1
-2.5)",23,99.9

(If columns are defined as character-columns; otherwise there may be NULL-values for float-columns for which it read a string.)

This is an issue that may be resolved in a future release of AsciiIngest, but there is also a work-around that you can use now, without changing your data file: You have to specify different delimiting character for quoted columns in your structure file.

Since you specify the delimiting character for each column, you can also have different one for each of them, say a "." here, a "," there a space " " somewhere else and a ";" at the end. You are free to run wild with your data files. It's just (unfortunately) limited to only one character (or a number of the same character replicated). So for the example above, a working structure file would look like this:

5
0
D1 REAL4 0 0 ',' x
D2 INT4 0 0 ',' n
D3 CHAR 0 0 '"' SKIP_THIS_COL
D4 CHAR 0 0 '"' arrayvalues
D5 CHAR 0 0 ',' SKIP_THIS_COL
D6 INT4 0 0 ',' n2
D7 REAL4 0 0 ',' z

This means for the 3. field it just reads from the last ',' to the next double quote, the junk until the next double quote will be inserted as string into the database field "arrayvalues", then we need to skip everything again until the next comma and then we can continue as usual.

Space-separated file with string containing spaces

This is another interesting use case, e.g. if you have a comment-string to be inserted into a database from a space-delimited file. And here you can apply the same rules as before, provided that your string is enclosed in double quotes or anything else that is not appearing inside the string.

Happy ingesting!

Supercomputing and E-Science

CSV-file with delimiting character in a column

Space-separated file with string containing spaces