Hello,
The CSV Input Adapter expects a fixed number of fields on every line. If you attach the input adapter to a stream or window with three columns, it will always try to read three fields. A field can be empty, in which case it is read into the column as a NULL value, but the delimiters must still be present. For example, the file below loads into the window that follows, and row 4 arrives with c2 set to NULL:
1,1,1
2,2,2
3,3,3
4,,4
5,5,5
CREATE INPUT WINDOW inWindow SCHEMA (c1 integer, c2 integer, c3 integer) PRIMARY KEY (c1);
ATTACH INPUT ADAPTER File_Hadoop_CSV_Input2 TYPE toolkit_file_csv_input TO inWindow PROPERTIES csvExpectStreamNameOpcode = FALSE ,
dir = 'c:/temp' ,
file = 'test.csv' ,
csvDelimiter = ',' ;
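If you want to check a file before pointing the adapter at it, a small script along these lines might help. This is only a sketch in Python, not part of the adapter: the function name check_field_count and its parameters are made up for illustration, and it splits on the delimiter without handling quoted fields. It reports the lines whose field count differs from what the schema expects and turns empty fields into None, mirroring the NULL behavior described above.

```python
def check_field_count(text, expected_fields=3, delimiter=","):
    """Split each line on the delimiter and verify the field count.

    Returns (rows, bad_lines): rows holds the accepted lines with empty
    fields mapped to None, bad_lines holds the 1-based numbers of lines
    that do not contain exactly expected_fields fields.
    """
    rows = []
    bad_lines = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split(delimiter)
        if len(fields) != expected_fields:
            bad_lines.append(line_no)
            continue
        # An empty field is allowed as long as its delimiters are present.
        rows.append([field if field != "" else None for field in fields])
    return rows, bad_lines


if __name__ == "__main__":
    sample = "1,1,1\n2,2,2\n3,3,3\n4,,4\n5,5,5\n"
    rows, bad_lines = check_field_count(sample)
    print(rows)
    print(bad_lines)
```

A line such as "4,,4" passes because both delimiters are there, while "5,5" would be reported as a bad line.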
A workaround you might consider is to read each entire line of your file into a stream with a single string column. The trick is to choose a column delimiter character that you are certain will never show up in your data. Then, once you have read the line into a stream, you can use the string functions to parse out the columns as needed:
CREATE INPUT STREAM csv_instream SCHEMA (
log_line_message string
);
// Choose a character that is certain to never show up in the data in order to read the entire line
// from the CSV file into a single column
ATTACH INPUT ADAPTER File_Hadoop_CSV_Input1 TYPE toolkit_file_csv_input TO csv_instream PROPERTIES
csvExpectStreamNameOpcode = FALSE ,
dir = 'C:/temp' ,
file = 'error_log' ,
csvDelimiter = '@' ;
// Parse out the individual columns
CREATE OUTPUT STREAM csv_outstream SCHEMA (
log_datetime timestamp ,
debug_level string ,
host string ,
message string ) AS SELECT
to_timestamp(substr(CI.log_line_message, 1, 24), 'DY MON DD HH24:MI:SS YYYY') as log_datetime,
substr(CI.log_line_message, patindex(CI.log_line_message, '[', 2)+1, (patindex(CI.log_line_message, ']', 2) - patindex(CI.log_line_message, '[', 2))-1) AS debug_level,
replace(substr(CI.log_line_message, patindex(CI.log_line_message, '[', 3)+1, (patindex(CI.log_line_message, ']', 3) - patindex(CI.log_line_message, '[', 3))-1), 'client ', '') AS host,
substr(CI.log_line_message, patindex(CI.log_line_message, ']', 3)+1, 500) AS message
FROM csv_instream CI
WHERE CI.log_line_message IS NOT NULL;
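To make it easier to see what each expression in the SELECT pulls out, here is a rough Python imitation of the same slicing. It is a sketch under several assumptions: that substr and patindex use 0-based positions, that patindex's third argument selects the nth occurrence, and that the log lines look like the made-up sample shown in the code (an Apache-style error log line; the actual file contents are not shown in this thread). The helper names nth_index and parse_log_line are hypothetical.

```python
from datetime import datetime


def nth_index(text, ch, n):
    """Return the 0-based position of the nth occurrence of ch, or -1."""
    pos = -1
    for _ in range(n):
        pos = text.find(ch, pos + 1)
        if pos == -1:
            return -1
    return pos


def parse_log_line(line):
    """Mimic the four column expressions from the SELECT above."""
    # substr(line, 1, 24): skip the leading '[' and take the 24-character date.
    log_datetime = datetime.strptime(line[1:25], "%a %b %d %H:%M:%S %Y")

    # Text between the 2nd '[' and the 2nd ']'.
    open2, close2 = nth_index(line, "[", 2), nth_index(line, "]", 2)
    debug_level = line[open2 + 1:close2]

    # Text between the 3rd '[' and the 3rd ']', with 'client ' removed.
    open3, close3 = nth_index(line, "[", 3), nth_index(line, "]", 3)
    host = line[open3 + 1:close3].replace("client ", "")

    # Up to 500 characters after the 3rd ']' (the leading space is kept).
    message = line[close3 + 1:close3 + 1 + 500]
    return log_datetime, debug_level, host, message


if __name__ == "__main__":
    # Hypothetical sample line, assumed to resemble the real error_log.
    sample = ("[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] "
              "client denied by server configuration: /export/home/live/ap/htdocs/test")
    print(parse_log_line(sample))
```

Note that the message column keeps the space that follows the third ']', both here and in the CCL version, so you may want to trim it if that matters for your output.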
Thanks,
Neal