tssql is a part of the txt-sushi toolset. The tssql utility allows you to use a subset of the SQL SELECT grammar on your flat files. tssql treats the first row of a CSV file as a table header and the other rows are treated as table data. Here is a summary of what's supported in the current version:
SELECT employees.*, employees.salary * 0.45, office.squareFeet from ...
SELECT ... FROM ... WHERE (employees.salary * 0.45 > 20000) OR (employees.dept = "IT")
SELECT ... FROM tbl1 JOIN tbl2 ON tbl1.id = tbl2.id WHERE ...It is also legitimate to write this join as:
SELECT ... FROM tbl1, tbl2 WHERE tbl1.id = tbl2.id AND ...but this older join syntax is completely unoptimized and will take a very long time for large tables. For the CS folks its a O(table1_rows * table2_rows) operation as opposed to O(table1_rows*log(table1_rows) + table2_rows*log(table2_rows)) for the "newer" join syntax.
SELECT ... FROM ... ORDER BY employee.salesPerAnnum / employees.salary
SELECT AVG(employees.salary) FROM ... WHERE ... GROUP BY employee.deptYou can also apply aggregate functions to the entire select if you leave the GROUP BY out. So something like this will give you the overall average.
SELECT AVG(employees.salary) FROM employees
SELECT emp_id, FOR s IN [salary_2000 .. salary_2005] YIELD s/base_salary FROM yearly_salaries
If you type
tssql -helpyou will see the following usage documentation:
Usage: tssql [-help ...] [-external-sort] [-table table_name CSV_file_name]I will only explain the -external-sort option since -table option is demonstrated in the example below. You only need to consider using the external sort option if you have a very large table that you are performing an "[INNER] JOIN", "GROUP BY" or "ORDER BY" operation. All of these operation require a row sort which can cause tssql to run out of memory if the table is very big. If this occurs you can use the external sort option to tell tssql to sort the rows on disk which is much slower but also requires less memory.
Here are a couple of selected example scripts. To see all of the examples you can go to the examples directory and even the test directory in the darcs repository. The first example comes from test1.2.bash:
#!/bin/bash # Example adapted from http://en.wikipedia.org/wiki/Join_(SQL)#Self-join tssql 'SELECT F.EmployeeID, F.LastName, S.EmployeeID, S.LastName, F.Country FROM `employees.csv` AS F JOIN `employees.csv` AS S ON F.Country = S.Country WHERE F.EmployeeID < S.EmployeeID ORDER BY F.EmployeeID, S.EmployeeID' | csvtopretty -If you don't use any command line options then tssql's default behavior is to assume that table names will match the file names of the comma-separated files that they refer to. So for this example the employees.csv file needs to be available in the working directory. You can also see from this example that tssql accepts standard identifiers like F.EmployeeID without any special syntax, but if you have an identifier that includes spaces or special characters you can quote it with backticks as in `employees.csv`. Here is the output you get from this example:
EmployeeID|LastName|EmployeeID|LastName |Country 123 |Rafferty|124 |Jones |Australia 123 |Rafferty|145 |Steinberg|Australia 124 |Jones |145 |Steinberg|Australia 305 |Smith |306 |Jasper |United KingdomHere is one more example to show how you can name tables using the -table option:
#!/bin/bash # replicating JOIN from # http://www.itl.nist.gov/div897/ctg/dm/sql_examples.htm tssql -table STATION station.csv -table STATS stats.csv \ 'SELECT LAT_N, CITY, TEMP_F FROM STATS JOIN STATION ON STATS.ID = STATION.ID WHERE MONTH = 7 ORDER BY TEMP_F' | csvtopretty -And the output you should get:
LAT_N|CITY |TEMP_F 47 |Caribou|65.8 40 |Denver |74.8 33 |Phoenix|91.7