Using this Python notebook I will:
To complete the assignment problems in this notebook I will be using three datasets that are available on the city of Chicago's Data Portal:
This dataset contains a selection of six socioeconomic indicators of public health significance and a “hardship index,” for each Chicago community area, for the years 2008 – 2012.
For this assignment I will use a snapshot of this dataset which can be downloaded from: https://ibm.box.com/shared/static/05c3415cbfbtfnr2fx4atenb2sd361ze.csv
A detailed description of this dataset and the original dataset can be obtained from the Chicago Data Portal at: https://data.cityofchicago.org/Health-Human-Services/Census-Data-Selected-socioeconomic-indicators-in-C/kn9c-c2s2
This dataset shows all school level performance data used to create CPS School Report Cards for the 2011-2012 school year. This dataset is provided by the city of Chicago's Data Portal.
For this assignment I will use a snapshot of this dataset which can be downloaded from: https://ibm.box.com/shared/static/f9gjvj1gjmxxzycdhplzt01qtz0s7ew7.csv
A detailed description of this dataset and the original dataset can be obtained from the Chicago Data Portal at: https://data.cityofchicago.org/Education/Chicago-Public-Schools-Progress-Report-Cards-2011-/9xs2-f89t
This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days.
This dataset is quite large - over 1.5GB in size with over 6.5 million rows. For the purposes of this assignment I will use a much smaller sample of this dataset which can be downloaded from: https://ibm.box.com/shared/static/svflyugsr9zbqy5bmowgswqemfpm1x7f.csv
A detailed description of this dataset and the original dataset can be obtained from the Chicago Data Portal at: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
To analyze the data using SQL, it first needs to be stored in the database.
While it is easier to read the dataset into a Pandas dataframe and then PERSIST it into the database, it results in mapping to default datatypes which may not be optimal for SQL querying. For example a long textual field may map to a CLOB instead of a VARCHAR.
Therefore, it is highly recommended to manually load the table using the database console LOAD tool and specify the name of the table to create and then click "Next".
Let me first load the SQL extension and establish a connection with the database
%load_ext sql
Here I enter my db2 connection string.
%sql ibm_db_sa://vzb00750:5nd34pqbz%5Eqh562c@dashdb-txn-sbox-yp-lon02-01.services.eu-gb.bluemix.net:50000/BLUDB
%sql select count(*) from CHICAGO_CRIME_DATA;
%sql select * from CHICAGO_CRIME_DATA fetch first 10 rows only;
%sql select count(ARREST) from CHICAGO_CRIME_DATA WHERE ARREST=TRUE;
%sql select DISTINCT(PRIMARY_TYPE) from CHICAGO_CRIME_DATA WHERE LOCATION_DESCRIPTION ='GAS STATION';
%sql select COMMUNITY_AREA_NAME from CENSUS_DATA WHERE COMMUNITY_AREA_NAME LIKE 'B%';
%sql select S.NAME_OF_SCHOOL,C.COMMUNITY_AREA_NUMBER,C.COMMUNITY_AREA_NAME,S.healthy_school_certified from CENSUS_DATA as C \
LEFT OUTER JOIN CHICAGO_PUBLIC_SCHOOL as S \
on UPPER(C.COMMUNITY_AREA_NAME) = UPPER(S.community_area_name) \
where C.COMMUNITY_AREA_NUMBER between 10 and 15 AND \
S.healthy_school_certified = 'Yes';
%sql select AVG(safety_score) as AVERAGE_SCRORE from CHICAGO_PUBLIC_SCHOOL;
%sql select COMMUNITY_AREA_NAME,AVG(COLLEGE_ENROLLMENT) as COLLEGE_ENROLLMENT_AVG \
from CHICAGO_PUBLIC_SCHOOL GROUP BY COMMUNITY_AREA_NAME order by COLLEGE_ENROLLMENT_AVG desc LIMIT 5;
%sql select COMMUNITY_AREA_NAME,SAFETY_SCORE from CHICAGO_PUBLIC_SCHOOL \
WHERE SAFETY_SCORE = (SELECT MIN(SAFETY_SCORE) FROM CHICAGO_PUBLIC_SCHOOL);
%sql select COMMUNITY_AREA_NAME,per_capita_income \
from CENSUS_DATA \
where community_area_number = (select community_area_number from CHICAGO_PUBLIC_SCHOOL where safety_score = 1);