A command-line tool written in Rust that infers SQL data types from a CSV file and generates a CREATE TABLE SQL DDL statement. It aims to determine the strictest possible SQL type for each column that can accommodate all its values.
- infers sql types for columns:
integer,bigint,float,char(n),varchar(n),date,datetime,boolean. - Determines the strictest type that fits all values in a column.
- e.g., a column with
1,2,3000000000will beBIGINT. - e.g., a column with
1,2.0will beFLOAT. - e.g., a column with
apple,grape(both length 5, and not a more specific type like boolean/date etc.) will beCHAR(5). - e.g., a column with
apple,banana(lengths 5 and 6) will beVARCHAR(6). - e.g., a column with
1,text(different types, varying lengths if applicable) will beVARCHAR.
- e.g., a column with
- handles empty strings: empty strings are treated as
nulland do not influence the type inference for the non-empty values in the column. if all values in a column are empty, it defaults tovarchar(0)(which becomesvarchar(1)in sql due tomax(1)). - Generates
CREATE TABLESQL DDL statements. - Table name is derived from the input CSV filename (e.g.,
my_data.csvbecomes tablemy_data). - Column names are taken directly from the CSV header.
- Utilizes parallel processing for type inference across columns using Rayon for improved performance on multi-core CPUs.
-
Ensure you have Rust and Cargo installed. If not, follow the instructions at rust-lang.org.
-
Clone the repository:
git clone <your-repository-url> cd csv-sql-inference
-
Build the project:
cargo build --release
The executable will be located at
target/release/csv_sql_inference.
Run the tool from the command line, providing the path to your CSV file:
./target/release/csv_sql_inference <path_to_csv_file>or, if you've added the target/release directory to your path:
csv_sql_inference <path_to_csv_file>the generated create table sql statement will be printed to standard output.
given a csv file named products.csv:
id,product_name,quantity,price,entry_date,last_updated
1,apple,10,0.50,2023-01-01,2024-05-01 10:00:00
2,banana,,1.20,2023-01-02,2024-05-02 11:30:15
3,orange,5,0.75,invalid-date,2024-05-03 09:00:00
4,grape,20000000000,0.05,2023-03-10,2024-05-04 14:20:05running the command:
./target/release/csv_sql_inference products.csvwould produce output similar to this (exact varchar lengths depend on the longest string in each respective column):
CREATE TABLE "products" (
"id" integer,
"product_name" varchar(6),
"quantity" bigint,
"price" float,
"entry_date" varchar(12),
"last_updated" datetime
);explanation of example output:
"id": all are integers."product_name": mixed strings, sovarchar. length determined by "orange" (6)."quantity": contains an empty string and a very large number. the empty string forcesvarchar. "20000000000" is 11 chars."price": all are valid floats."entry_date": contains "invalid-date", so it becomesvarchar. length determined by "invalid-date" (12)."last_updated": all are valid datetime strings.
to run the unit and integration tests:
cargo test