SQL code that describes the data you desire and the format in which you want it.
SELECT the data I want FROM these database and tables WHERE these criteria are met GROUP (BY) this field HAVING this property ORDER (BY) this field or list;
- SELECT the data I want
- FROM these database and tables
- WHERE these criteria are met, and
- GROUP (BY) this field
- HAVING this property, then
- ORDER (BY) this field or list;
In most SQL interfaces, extra wide spaces are usually ingonred, unless they’re within a string of text and casing quotation marks
No data = NULL, rather than 0
This diagram shows us that data are actually grouped(
GROUP BY) before the
SELECTexpressions are applied.
That means that when a
GROUP BYexpression is included in an SQL query, there is no way to use a
SELECTstatement to summarize daa that cross multiple groups.
Every time we run a line of SQL code in Jupyter, we will need to preface the line with
Get to Know Our Data !
Before we start querying,
- Conform how many tables each database has
- Identify the fields contained in each table of the database.
SHOW tables: To determine how many tables each database has
SHOW columns FROM databasename.tablename
An alternate way to learn the same information would be to use the
DESCRIBEfunction. The syntax is:
SELECT dog_guid, breed
LIKEoperator: allows us to specify a pattern that the textual data we query has to match.
SELECT dog_guid, breed
In this syntax, the percent sign indicates a wild card. Wild cards represent unlimited numbers of missing letters. This is how the placement of the percent sign would affect the results of the query:
- WHERE breed LIKE (“s%”) = the breed must start with “s”, but can have any number of letters after the “s”
- WHERE breed LIKE (“%s”) = the breed must end with “s”, but can have any number of letters before the “s”
- WHERE breed LIKE (“%s%”) = the breed must contain an “s” somewhere in its name, but can have any number of letters before or after the “s”
DISCTINCTclause is used with multiple columns in a
SELECTstatement, the combinaiton of all the columns together is used to determine the uniqueness of a row in a result set.
When we use the
DISTINCTclause with the
LIMITclause in a statement, MySQL stops searching when it finds the number of
uniquerows specified in th
LIMITclause, not when it goes through the number of rows in the
LIMIT … OFFSET…
The expression that follows a WHERE clause has to be applicable to each row of data in a column
It’s userful to remember that SQL output is always a table.
How could we construct a valid table that would have columns for aggregate counts and individual columns at the same time?
The answer is, WE CAN’T.
Every non-aggregated field that is listed in the
SELECTlist must be listed in the
If we include a
GROUP BYclause in our query, but forget to aggregate a column included in our
SELECTlist, MySQL will output a random row from the un-aggregated column when the query is executed.
> SELECT COUNT(d.breed), d.dog_guid AS dDogID, d.breed_type, d.user_guid
> FROM dogs d
> GROUP BY d.user_guid
This query would be executed in MySQL, but a random value in the
d.breed_typewould be outputted for each output row tha reports a count of a given breed number.
SELECT/GROUP BY and
COUNT DISTINCT dose NOT count NULL values, while SELECT/GROUP BY clauses roll up NULL values into one group.
If we want to infer the number of disctinct entries from the results of a query using joins and GROUP BY clauses, remember to include an
IS NOT NULLclause to ensure we ae not counting NULL valuse.
The expression that follows a HAVING clause has to be applicable or computable using a group of data.
If we did not incldue the
WHEREclause to combine two or more tables, sql would request a Caetesian product from datasets, and we could be waiting for our query for days(and will probally get in trouble with our database administrator)
ALWAYS remember to tell the database how to join our tables!
Database operations, like those that join tables, are based on
settheory that assumes there are no duplicate row in your tables. In real life, duplicate rows get enttered all the time, and it can cause you to misinterpret your queries if you do no understand the consequences.
- Avoid making assumptions about your data or your analyses. For example, rather than assume that all the values in a column are unique just because some documentation says they should be, check for yourself!
- Always look at example outputs of your queries before you strongly interpret aggregate calculations. Take extra care to do this when your queries require joins.
- When your queries require multiple layers of functions or joins, examine the output of each layer or join first before you combine them all together.
- Adopt a healthy skepticsm of all your data and results. If you see something you don’t expect, make sure you explore it before interpreting it strongly or incorporating it into other analyses.
|Function||What is Returns|
|COUNT()||# of rows in a column|
|MAX()||a column’s highest value|
|MIN()||a column’s lowest value|
|SUM()||sum of all a column’s values|
- What questions will provide actionable insight?
- Will the answers to my questions matter
- Are the answers to my questions correct
-[ ] 1. Difference
return 20845 rows:
-[ ] 2. would be associated with ….
SELECT u.user_guid AS uUserID, d.user_guid AS dUserID, d.dog_guid AS dDogID, d.breed, COUNT(*) AS Numrows
FROM users u LEFT JOIN dogs d
GROUP BY u.user_guid
ORDER BY COUNT(*) DESC;
This query told us that user ‘ce7b75bc-7144-11e5-ba71-058fbc01cf0b’ would be associated with 913,138 rows in the output of the outer join we designed!