Welcome to the Urban Computing Skills Lab (UCSL). This course is a prerequisite for your Masters program in Applied Urban Science and Informatics, which immediately raises the question: what is Urban Science? The term is hard to define, especially given how ubiquitous it has become. In my mind, Urban Science is an emerging, interdisciplinary domain of research that seeks to exploit large-scale data from a variety of sources to understand and address urban challenges.
With this in mind, I would encourage you to think of Urban Informatics not as a new domain of knowledge to learn, but as a new set of skills that you can apply within your current area of expertise. By the end of this bootcamp, I hope to:
- Get you familiar with the Python programming language (which will be used throughout your Masters course)
- Introduce you to scientific tools for performing data analysis
- Demonstrate efficient ways of storing and visualizing data
- Give you hands-on experience developing solutions to real-world problems
The UCSL bootcamp started as an answer to the most common student question: "How should I learn Python?" Over time, however, we realized that students did not want to learn Python programming per se; they wanted to learn it as a tool for performing data analysis and visualization.
This bootcamp has therefore been customized to use Python and its powerful data science ecosystem as a tool for data-intensive and computational science. This Lab in no way aims to be a comprehensive introduction to Python or to programming languages in general. We will skim through Python's syntax, built-in types and data structures, control flow statements and functions before jumping into the scientific modules that we use for performing data analysis.
The goal is to build a basic understanding of the language that you will use as a tool to explore the data science stack, which includes modules like NumPy, Matplotlib and Pandas. As such, this Lab does not expect you to have any prior experience with Python or with data analysis.
The lab requires Python and a set of scientific packages to be installed. To make things easier, NYU CUSP has set up an environment where you can log in with your net id and password and use the packages.
Setting up your environment has two options: Using CUSP Data Facility and Setting up Locally.
It is highly advised to use the CUSP Data Facility for reading and executing the course content and for all your homework assignments.
However, if you would like to install everything locally on your machine (which I would strongly discourage unless you know what you are doing), you can follow the Setting up Locally section.
The UCSL Lab is divided into 6 major sections:
Introduction to Notebook and beyond
This section contains information about setting up the environment for the UCSL Lab, either via the CUSP Data Facility (preferred) or by setting up a local environment. We will then familiarize ourselves with simple but important terms like kernel and Jupyter notebook, and understand how the combination of these tools makes a productive environment for data scientists. Finally, we will be introduced to basic Python syntax, the concept of whitespace, the use of comments and the general PEP8 styling guidelines, which we will follow throughout our Lab (and which you should always follow).
Introduction to Python
This is where we will discuss Python variables and understand how to store values and perform operations on variables. We will also look into built-in types like integer, etc. We will then learn about control flow, which lets us alter the flow of execution of the code.
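As a small preview of these ideas, here is a minimal sketch of variables, a few built-in types and basic control flow. All names and values are made up for illustration and are not taken from the course material.

```python
# Variables store values; Python infers the type from the value assigned.
sensor_count = 12          # int
average_reading = 3.5      # float
station_name = "Brooklyn"  # str (hypothetical station name)
is_active = True           # bool

# Operations on variables produce new values.
total_reading = sensor_count * average_reading

# Control flow: branch on a condition...
if total_reading > 40:
    level = "high"
else:
    level = "low"

# ...and loop over a range of numbers, keeping only the even ones.
even_numbers = []
for number in range(10):
    if number % 2 == 0:
        even_numbers.append(number)

print(type(sensor_count).__name__, total_reading, level, even_numbers)
```

Note that the indentation under `if`, `else` and `for` is what tells Python which statements belong to each block.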
Python in Practice
Once you are familiar with Python's syntax and comfortable performing simple operations, we will learn about Python's different data structures. We will then look at creating and using functions, which will help make your code more readable and reusable. Finally, we will learn about syntax errors and exceptions and how to handle them.
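To make this concrete, the sketch below combines a few data structures, a custom function and exception handling. The function name `mean_reading` and the sample data are hypothetical and only serve the illustration.

```python
def mean_reading(readings):
    """Return the arithmetic mean of a list of numbers."""
    if not readings:
        # Raise an exception instead of dividing by zero.
        raise ValueError("readings must not be empty")
    return sum(readings) / len(readings)


# A dict maps keys to values; here each value is a list of readings.
readings_by_station = {
    "station_a": [2.0, 4.0, 6.0],
    "station_b": [],  # no data recorded
}

# A tuple is an immutable sequence; a set keeps only unique items.
coordinates = (40.69, -73.99)
unique_values = set([1, 2, 2, 3])

averages = {}
for station, readings in readings_by_station.items():
    try:
        averages[station] = mean_reading(readings)
    except ValueError:
        # Handle the error and keep going rather than crashing.
        averages[station] = None

print(averages)
```

Wrapping the calculation in a function means the empty-list check lives in one place, and the `try`/`except` block lets the loop continue when one station has no data.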
At this stage you should be comfortable writing basic Python code that uses built-in and custom functions to perform operations on different variables and data structures. This is a good time to go through the previous 2 sections again, just to be sure!
Starting with the next section, you will apply the knowledge you gained in the sections above to perform efficient computation on large datasets.
Introduction to NumPy
NumPy (Numerical Python) is the core of almost the entire data science ecosystem in Python. The time that you spend learning the core concepts of Python will be valuable for performing data-agnostic analysis. In this section we will learn some basic concepts of single- and multi-dimensional arrays and understand the difference between Python's built-in types and NumPy arrays. We will perform numerical operations on NumPy arrays and take a look at NumPy's universal functions. Since we will be dealing with large datasets, we will also cover indexing and slicing NumPy arrays, along with some fancy (and fast) ways of getting a chunk of data from a huge dataset.
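The following is a minimal sketch of the array concepts listed above, using a tiny made-up array rather than a large dataset. It assumes NumPy is installed and imported under its conventional alias `np`.

```python
import numpy as np

values = np.arange(12)        # 1-D array holding 0, 1, ..., 11
grid = values.reshape(3, 4)   # the same data viewed as a 3x4 2-D array

# Arithmetic on an array is applied element by element,
# whereas multiplying a Python list repeats the list.
doubled = values * 2
repeated_list = [1, 2] * 2    # [1, 2, 1, 2]

# Universal functions (ufuncs) also operate element by element.
roots = np.sqrt(grid)

# Indexing and slicing.
first_row = grid[0]              # first row of the 2-D array
middle_columns = grid[:, 1:3]    # all rows, columns 1 and 2

# "Fancy" ways of selecting data: boolean masks and index lists.
large_values = values[values > 8]
picked = values[[0, 5, 11]]

print(doubled, middle_columns.shape, large_values, picked)
```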
Introduction to Matplotlib
Because of the way the human brain processes information, it is important to be able to visualize the data and the results in the correct way. Matplotlib is a large project and can seem daunting at first. However, once you learn its components, it should begin to feel much smaller and more approachable. We will begin our quest to visualize complex data by plotting simple lines and curves and understanding every element that goes into creating an output figure. We will then look at an interesting companion module, seaborn, that will help beautify the output figures without much tinkering. Progressing slowly, we will learn about some of the different types of plots and look at specific examples where each type is suitable. Throughout this section we will be reminded to select the proper type of plot, color maps/palettes and ticks. In the end we will look at Matplotlib's style sheets, which will help you get publication-quality figures.
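As a rough illustration of "simple lines and curves" and of style sheets, here is a minimal sketch that plots two curves and labels the main elements of the figure. It assumes Matplotlib and NumPy are installed; the choice of the built-in `ggplot` style is arbitrary, and seaborn is deliberately left out to keep the example short.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# Apply a style sheet only within this block.
with plt.style.context("ggplot"):
    fig, ax = plt.subplots(figsize=(6, 4))   # the figure and one set of axes
    ax.plot(x, np.sin(x), label="sin(x)")
    ax.plot(x, np.cos(x), linestyle="--", label="cos(x)")
    ax.set_xlabel("x (radians)")
    ax.set_ylabel("value")
    ax.set_title("Simple curves")
    ax.legend()
```

In a notebook the figure is displayed inline after the cell runs; in a script you could call `fig.savefig(...)` with a file name of your choice.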
Introduction to Statistics *
For a change of pace from learning Python for data analysis, in this section we will take a refresher through some concepts of Statistics and see how we can use Python as a tool to practically implement and visualize statistical problems. This refresher will consist of the following topics:
- Linear Algebra
- Probability Distributions
- Conditional probabilities and Bayes' Theorem
- Optimization problems
* This section of the course has been created by Dr. Federica Bianco and Dr. Stanislav Sobolevsky
Getting back to learning the tools from Python's data science ecosystem, in this section we will take a look at Pandas Series and Pandas DataFrames, two well-organized and efficient implementations of single- and multi-dimensional arrays that support heterogeneous types and/or missing data. I know that can be a mouthful, but wait until we look into how Pandas DataFrames are implemented. Instead of the traditional way of introducing functions and showing example implementations, this time we will get our hands dirty with a real-world dataset and try to answer questions as data scientists, unraveling the mystery of Pandas!
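To give a feel for what "heterogeneous types and/or missing data" means in practice, here is a minimal sketch of a Series and a DataFrame. The data below is entirely made up for illustration and is not the real-world dataset used in the Lab; the sketch assumes pandas and NumPy are installed.

```python
import numpy as np
import pandas as pd

# A Series is a labeled 1-D array; NaN marks a missing value.
trips = pd.Series([120, np.nan, 95], index=["mon", "tue", "wed"], name="trips")

# A DataFrame is a table whose columns can hold different types.
df = pd.DataFrame({
    "borough": ["Brooklyn", "Queens", "Brooklyn", "Bronx"],  # strings
    "trips": [120, np.nan, 95, 60],                          # floats with a gap
    "is_weekend": [False, False, True, True],                # booleans
})

# Count missing values, then average per group (NaN is skipped by default).
missing_count = int(df["trips"].isna().sum())
mean_trips = df.groupby("borough")["trips"].mean()

print(trips.mean(), missing_count)
print(mean_trips)
```

A group with no valid values (Queens here) yields NaN rather than raising an error, which is one reason missing data needs explicit attention during analysis.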
At the end of every section you will be asked to complete an assignment. The assignment will be due within a week of the date it is posted. The following rules apply to completing and submitting the assignments:
- All solutions should be named ChallengeX_Solutions.ipynb and saved inside your home_directory/ucsl folder, where X is the challenge number and home_directory is your net id.
- Submissions will ONLY BE COLLECTED FROM YOUR HOME DIRECTORY/ucsl (refer to the Using CUSP Data Facility section of Setting up your environment). If you have set up the environment locally on your machine, start the notebook following the instructions in Using CUSP Data Facility and, after starting the notebook, use the Upload button in the right-hand corner to upload the solutions to your home directory.
- Create a directory (folder) named UCSL inside your Home Directory and save all the solutions in that directory
- The submissions will be fetched automatically and NO EMAIL SUBMISSIONS WILL BE ACCEPTED
- Passing grade is 60%
- Late submission:
  - Inform the TA as soon as possible.
  - 1 week late: evaluated out of 90% of the total grade
  - 2 weeks late: evaluated out of 75% of the total grade
- Plagiarism will not be tolerated. For more information, refer to: Academic Integrity for students at NYU
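The naming and late-submission rules above can be sketched in a few lines of Python. This is only an illustration, not the grading script used by the course: the helper names are hypothetical, and the sketch assumes that "90% of the total grade" means the raw score is scaled by 90% (and likewise 75% for two weeks late).

```python
PASSING_GRADE = 60  # percent, from the rules above

# Assumption: a late submission is scaled by these percentages.
LATE_PERCENT = {0: 100, 1: 90, 2: 75}


def solution_filename(challenge_number):
    """Build the required notebook name for a given challenge number."""
    return f"Challenge{challenge_number}_Solutions.ipynb"


def adjusted_grade(raw_percent, weeks_late):
    """Scale a raw score (in percent) according to how late it was submitted."""
    return raw_percent * LATE_PERCENT[weeks_late] / 100


print(solution_filename(3))
for weeks in (0, 1, 2):
    grade = adjusted_grade(80, weeks)
    print(weeks, grade, grade >= PASSING_GRADE)
```

Under this reading, a raw score of 80% submitted two weeks late lands exactly on the passing grade.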
Open Source license
This notebook will remain available to you after you have completed the UCSL bootcamp, as an HTML document at https://sharmamohit.com/tutorials/ucsl/ and in notebook form at https://github.com/Mohitsharma44/ucsl17, which you can download locally (or clone into your home directory on the CUSP Data Facility).
Feel free to make edits and use it in whatever way you want. Even better, if you find any errors or want to contribute to improving the bootcamp, make a pull request, or simply fork it for yourself!