Coursera_Managing Big Data with MySQL_Note 2

Data Query Language

Query:

SQL code that describes the data you desire and the format in which you want it.

Six SQL Keywords

SELECT the data I want FROM these database and tables WHERE these criteria are met GROUP (BY) this field HAVING this property ORDER (BY) this field or list;

  • SELECT the data I want
  • FROM these database and tables
  • WHERE these criteria are met, and
  • GROUP (BY) this field
  • HAVING this property, then
  • ORDER (BY) this field or list;

In most SQL interfaces, extra wide spaces are usually ingonred, unless they’re within a string of text and casing quotation marks

No data = NULL, rather than 0

Run Order

SQL Run Order

This diagram shows us that data are actually grouped(GROUP BY) before the SELECT expressions are applied.
That means that when a GROUP BY expression is included in an SQL query, there is no way to use a SELECT statement to summarize daa that cross multiple groups.

SQL

Every time we run a line of SQL code in Jupyter, we will need to preface the line with %sql.

Get to Know Our Data !

Before we start querying,

  1. Conform how many tables each database has
  2. Identify the fields contained in each table of the database.
  • SHOW tables: To determine how many tables each database has

SHOW columns FROM databasename.tablename

An alternate way to learn the same information would be to use the DESCRIBE function. The syntax is: DESCRIBE tablename

Operator

  • IN operator
1
2
3
SELECT dog_guid, breed
FROM dogs
WHERE breed IN ("golden retriever", "poodle");
  • LIKE operator: allows us to specify a pattern that the textual data we query has to match.
1
2
3
SELECT dog_guid, breed
FROM dogs
WHERE breed LIKE ("s%");

In this syntax, the percent sign indicates a wild card. Wild cards represent unlimited numbers of missing letters. This is how the placement of the percent sign would affect the results of the query:

  • WHERE breed LIKE (“s%”) = the breed must start with “s”, but can have any number of letters after the “s”
  • WHERE breed LIKE (“%s”) = the breed must end with “s”, but can have any number of letters before the “s”
  • WHERE breed LIKE (“%s%”) = the breed must contain an “s” somewhere in its name, but can have any number of letters before or after the “s”

Clause

  • DISCTINCT clause

When the DISCTINCT clause is used with multiple columns in a SELECT statement, the combinaiton of all the columns together is used to determine the uniqueness of a row in a result set.

When we use the DISTINCT clause with the LIMIT clause in a statement, MySQL stops searching when it finds the number of unique rows specified in th LIMIT clause, not when it goes through the number of rows in the LIMIT clause.

  • OFFSET

LIMIT … OFFSET…

  • WHERE

The expression that follows a WHERE clause has to be applicable to each row of data in a column

  • GROUP BY

It’s userful to remember that SQL output is always a table.

How could we construct a valid table that would have columns for aggregate counts and individual columns at the same time?
The answer is, WE CAN’T.

A RULE:

Every non-aggregated field that is listed in the SELECT list must be listed in the GROUP BY list.

Note:

If we include a GROUP BY clause in our query, but forget to aggregate a column included in our SELECT list, MySQL will output a random row from the un-aggregated column when the query is executed.

e.g:

1
2
3
4
> SELECT COUNT(d.breed), d.dog_guid AS dDogID, d.breed_type, d.user_guid
> FROM dogs d
> GROUP BY d.user_guid
>

>

This query would be executed in MySQL, but a random value in the d.dog_guid and d.breed_type would be outputted for each output row tha reports a count of a given breed number.

Difference between SELECT/GROUP BY and COUNT DISTINCT:

COUNT DISTINCT dose NOT count NULL values, while SELECT/GROUP BY clauses roll up NULL values into one group.

If we want to infer the number of disctinct entries from the results of a query using joins and GROUP BY clauses, remember to include an IS NOT NULL clause to ensure we ae not counting NULL valuse.

  • HAVING

The expression that follows a HAVING clause has to be applicable or computable using a group of data.

JOIN

If we did not incldue the WHERE clause to combine two or more tables, sql would request a Caetesian product from datasets, and we could be waiting for our query for days(and will probally get in trouble with our database administrator)

ALWAYS remember to tell the database how to join our tables!

JOIN

The general strategy relational database use to join tables:

General Strategy of Join Table

Innter JOIN:

Inner JOIN

Inner_JOIN2

Left JOIN:

Left JOIN

Distinguish Between theory and real life:

Database operations, like those that join tables, are based on set theory that assumes there are no duplicate row in your tables. In real life, duplicate rows get enttered all the time, and it can cause you to misinterpret your queries if you do no understand the consequences.

Tips:
  • Avoid making assumptions about your data or your analyses. For example, rather than assume that all the values in a column are unique just because some documentation says they should be, check for yourself!
  • Always look at example outputs of your queries before you strongly interpret aggregate calculations. Take extra care to do this when your queries require joins.
  • When your queries require multiple layers of functions or joins, examine the output of each layer or join first before you combine them all together.
  • Adopt a healthy skepticsm of all your data and results. If you see something you don’t expect, make sure you explore it before interpreting it strongly or incorporating it into other analyses.

Function

Function What is Returns
AVG() Column average
COUNT() # of rows in a column
MAX() a column’s highest value
MIN() a column’s lowest value
SUM() sum of all a column’s values

Three Question need to be thinking about

  • What questions will provide actionable insight?
  • Will the answers to my questions matter
  • Are the answers to my questions correct

Question

-[ ] 1. Difference

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
return 20845 rows:
1. SELECT d.user_guid AS UserID, d.dog_guid AS DogID, d.breed, d.breed_type,
ct.test_name
FROM dogs d, complete_tests ct
WHERE d.dog_guid=ct.dog_guid AND ct.test_name="Yawn Warm-up"

return 17273 rows:
2. SELECT d.user_guid AS UserID, d.dog_guid AS DogID, d.breed, d.breed_type,
COUNT(ct.test_name)
FROM dogs d, complete_tests ct
WHERE d.dog_guid=ct.dog_guid AND ct.test_name="Yawn Warm-up"
GROUP BY DogID, d.breed, d.breed_type

return 16706 rows:
SELECT d.user_guid AS UserID, d.dog_guid AS DogID, d.breed, d.breed_type,
COUNT(ct.test_name)
FROM dogs d, complete_tests ct
WHERE d.dog_guid=ct.dog_guid AND ct.test_name="Yawn Warm-up"
GROUP BY UserID, d.breed, d.breed_type

-[ ] 2. would be associated with ….

1
2
3
4
5
6
%%sql
SELECT u.user_guid AS uUserID, d.user_guid AS dUserID, d.dog_guid AS dDogID, d.breed, COUNT(*) AS Numrows
FROM users u LEFT JOIN dogs d
ON u.user_guid=d.user_guid
GROUP BY u.user_guid
ORDER BY COUNT(*) DESC;

This query told us that user ‘ce7b75bc-7144-11e5-ba71-058fbc01cf0b’ would be associated with 913,138 rows in the output of the outer join we designed!