Welcome

Welcome to the Urban Computing Skills Lab. This course is a prerequisite for your Master's program in Applied Urban Science and Informatics, which immediately raises the question: what is Urban Science? The term is hard to define, especially given how ubiquitous it has become. In my mind, Urban Science is an emerging, interdisciplinary domain of research that seeks to use large-scale data from a variety of sources to understand and address urban challenges.

With this in mind, I would encourage you to think of Urban Informatics not as a new domain of knowledge to learn, but as a new set of skills that you can apply within your current area of expertise. By the end of this bootcamp, I hope to:

  • Get you familiar with the Python programming language (which will be used throughout your Master's course)
  • Introduce you to scientific tools for performing data analysis
  • Demonstrate efficient ways of storing and visualizing data
  • Give you hands-on experience developing solutions to real-world problems

Target Audience

The UCSL bootcamp began as an answer to the most common student question: "How should I learn Python?" Over time, however, we realized that students did not want to learn Python programming per se; they wanted to learn it as a tool for performing data analysis and visualization.

This bootcamp has therefore been customized to use Python and its powerful data science ecosystem as a tool for data-intensive and computational science. This Lab in no way aims to be a comprehensive introduction to Python or to programming in general. We will brush over Python's syntax, built-in types and data structures, control flow statements, and functions before jumping into the scientific modules that we use for performing data analysis.

The goal is to build a basic understanding of the language that you will use as a tool to explore the data science stack, which includes modules like Numpy, Matplotlib, and Pandas. As such, this Lab does not expect you to have any prior experience with Python or with data analysis.

Using this Lab

The Lab requires Python and a set of scientific packages to be installed. To make things easier, NYU CUSP has set up an environment where you can log in with your net id and password and use the packages.

Setting up your environment covers two options: Using CUSP Data Facility and Setting up Locally.

It is highly advised to use the CUSP Data Facility for reading and executing the course content and for all your homework assignments. However, if you would like to install everything locally on your machine (which I would strongly advise against unless you know what you are doing), you can follow the Setting up Locally section.

Outline

The UCSL Lab is divided into 7 major sections:

Introduction to Notebook and beyond

This section contains information about setting up the environment for the UCSL Lab, either via the CUSP Data Facility (preferred) or by setting up a local environment. We will then familiarize ourselves with simple but important terms such as kernel and Jupyter notebook, and understand how the combination of these tools makes a well-suited environment for data scientists. Finally, we will be introduced to basic Python syntax, the role of whitespace, comments, and the general PEP 8 style guidelines, which we will follow throughout the Lab (and which you should always follow).
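To give a flavor of what whitespace and comments mean in practice, here is a minimal sketch. The variable names and values are made up for illustration and are not taken from the course notebooks.

```python
# Comments start with a hash and are ignored by Python.
readings = [12, 7, 30]  # hypothetical sensor readings

# Indentation defines which lines belong to a block.
# PEP 8 recommends 4 spaces per indentation level.
high_readings = []
for reading in readings:
    if reading > 10:
        # This line is inside both the loop and the if block.
        high_readings.append(reading)

# Back at the left margin: this line runs once, after the loop ends.
print(high_readings)
```

Changing the indentation of a line changes which block it belongs to, which is why Python treats whitespace as part of the syntax rather than as decoration.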

Introduction to Python

This is where we will discuss Python variables and understand how to store values and perform operations on them. We will also look into built-in types like strings and integers. We will then learn about control flow, with which we can alter the order in which code executes.
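As a rough preview of these ideas, the sketch below stores a few values, combines them, and uses an if statement and a for loop. All names and numbers are invented for illustration.

```python
# Variables hold values; Python infers the type from the value assigned.
station = "Example St"    # str (hypothetical name)
riders = 1250             # int
fare = 2.75               # float

# Operations on variables: int * float gives a float.
revenue = riders * fare

# Control flow: pick a label depending on a condition.
if riders > 1000:
    label = "busy"
else:
    label = "quiet"

# A for loop repeats a block once per item in a sequence.
total = 0
for count in [3, 5, 7]:
    total += count

print(type(station).__name__, type(riders).__name__, type(revenue).__name__)
print(label, total)
```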

Python in Practice

Once you are familiar with Python's syntax and comfortable performing simple operations, we will learn about Python's different data structures. We will then look at creating and using functions, which help make your code more readable and reusable. Finally, we will learn about syntax errors and exceptions and how to handle them.

At this stage you should be comfortable writing basic Python code that uses built-in and custom functions to perform operations on different variables and data structures. This is a good time to go through the previous 2 sections again, just to be sure!

Starting with the next section, you will apply the knowledge from the sections above to perform efficient computation on large datasets.
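A small sketch of how these three topics fit together: a few built-in data structures, a custom function, and an exception handled with try/except. The function name lookup_count and the toy data are hypothetical.

```python
# Built-in data structures (toy values for illustration)
modes = ["bus", "subway", "ferry"]        # list: ordered and mutable
location = (40.0, -74.0)                  # tuple: ordered and immutable
counts = {"bus": 12, "subway": 30}        # dict: maps keys to values
seen = {"bus", "bus", "subway"}           # set: keeps unique items only


def lookup_count(mode, table):
    """Return the count stored for a mode, or None if the mode is unknown."""
    try:
        return table[mode]
    except KeyError:
        # A missing dictionary key raises KeyError; handle it instead of crashing.
        return None


print(lookup_count("bus", counts))
print(lookup_count("ferry", counts))
print(len(seen))
```

Wrapping the lookup in a function means the error handling is written once and reused wherever a count is needed.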

Introduction to Numpy

Numpy (Numerical Python) is the core of almost the entire data science ecosystem in Python. The time you spend learning the core concepts of Python will pay off here when performing data-agnostic analysis. In this section we will learn some basic concepts of single and multi-dimensional arrays and understand the difference between Python's built-in types and Numpy arrays. We will perform numerical operations on Numpy arrays and take a look at Numpy's universal functions. Since we will be dealing with large datasets, we will also cover indexing and slicing Numpy arrays, along with some fancy (and fast) ways of getting a chunk of data from a huge dataset.
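The sketch below previews these ideas on a tiny array: how a Numpy array behaves differently from a Python list, a universal function, slicing, and boolean and fancy indexing. The values are arbitrary and chosen only to keep the results easy to check by hand.

```python
import numpy as np

values = [1, 2, 3, 4, 5, 6]
arr = np.array(values)

# A list times 2 repeats the list; an array times 2 multiplies each element.
doubled_list = values * 2     # 12 items
doubled_arr = arr * 2         # [2, 4, 6, 8, 10, 12]

# Universal functions (ufuncs) apply element by element without a Python loop.
roots = np.sqrt(arr)

# Multi-dimensional arrays: reshape the 6 values into 2 rows and 3 columns.
grid = arr.reshape(2, 3)
first_row = grid[0]           # [1, 2, 3]
last_column = grid[:, -1]     # [3, 6]

# Slicing, boolean masks, and fancy indexing with a list of positions.
middle = arr[1:4]             # [2, 3, 4]
evens = arr[arr % 2 == 0]     # [2, 4, 6]
picked = arr[[0, 2, 5]]       # [1, 3, 6]

print(doubled_arr, last_column, evens, picked)
```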

Introduction to Matplotlib

Because of the way the human brain processes information, it is important to be able to visualize the data and the results in the correct way. Matplotlib is a large project and can seem daunting at first. However, by learning its components, it should begin to feel much smaller and more approachable. We will begin our quest to visualize complex data by plotting simple lines and curves and understanding every element that goes into creating an output figure. We will then look at an interesting companion module, seaborn, that helps beautify the output figures without much tinkering. Progressing slowly, we will learn about some of the different types of plots and look at specific examples where each type is suitable. Throughout this section we will be reminded to select the proper type of plot, color maps/palettes, and ticks. Finally, we will look at Matplotlib's style sheets, which will help you produce publication-quality figures.
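As a first taste of "simple lines and curves", here is a minimal sketch that draws two curves and labels the main elements of the figure. The title and labels are placeholders, and the Agg backend line is only there so the sketch runs without a display; in a notebook you would normally drop it so the figure renders inline.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# A figure is the whole canvas; an axes is one plotting area inside it.
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), linestyle="--", label="cos(x)")

# Title, axis labels, and legend are separate elements you control.
ax.set_title("Simple curves")
ax.set_xlabel("x (radians)")
ax.set_ylabel("value")
ax.legend()
```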

Introduction to Statistics *

For a change of pace from learning Python for data analysis, in this section we will take a refresher through some concepts of statistics and see how we can use Python as a tool to implement and visualize statistical problems. This refresher will cover the following topics:

  • Linear Algebra
  • Probability
  • Probability Distributions
  • Conditional probabilities and Bayes' Theorem
  • Optimization problems
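To illustrate how Python can be used to work through one of these topics, here is a minimal sketch of Bayes' Theorem, P(A|B) = P(B|A) P(A) / P(B), with P(B) expanded by the law of total probability. The function name and the numbers are hypothetical and are not taken from the course material.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(A|B) given P(A), P(B|A), and P(B|not A)."""
    # Total probability of observing B, whether or not A holds.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical example: an event with a 1% base rate, a detector that fires
# 90% of the time when the event occurs and 5% of the time when it does not.
posterior = bayes_posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
print(round(posterior, 4))
```

Even with a fairly accurate detector, the posterior stays well below 50% here because the event itself is rare.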

* This section of the course was created by Dr. Federica Bianco and Dr. Stanislav Sobolevsky.

Pandas

Getting back to the tools from Python's data science ecosystem: in this section we will take a look at Pandas Series and Pandas DataFrames, two well-organized and efficient data structures that implement single and multi-dimensional arrays supporting heterogeneous types and/or missing data. I know that is a mouthful, but wait until we look into how Pandas DataFrames are implemented. Instead of the traditional approach of introducing functions and showing example implementations, this time we will get our hands dirty with a real-world dataset and try to answer questions as data scientists, unraveling the mystery of Pandas!
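To make "heterogeneous types and/or missing data" a little more concrete, here is a small sketch with a Series and a DataFrame. The column names and values are invented toy data, not the real-world dataset used in the section.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labelled array; NaN marks a missing value.
temps = pd.Series([21.5, np.nan, 19.0], index=["mon", "tue", "wed"])

# A DataFrame is a two-dimensional table whose columns can have different types.
df = pd.DataFrame({
    "station": ["A", "B", "C"],          # strings
    "riders": [120, 85, 240],            # integers
    "avg_delay_min": [2.5, np.nan, 4.0], # floats with one missing value
})

missing = int(df["avg_delay_min"].isna().sum())  # count of missing delays
mean_delay = df["avg_delay_min"].mean()          # NaN is skipped by default
busy = df[df["riders"] > 100]                    # boolean filtering of rows

print(missing, mean_delay)
print(busy)
```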

Homework/ Assignments

At the end of every section you will be asked to complete an assignment. The assignment will be due within a week of the date it is posted. The following rules apply to completing and submitting assignments:

  • All the solutions should be named ChallengeX_Solutions.ipynb and saved inside <home_directory>/ucsl, where X is the challenge number and home_directory is your net id.
  • All the submissions will ONLY BE COLLECTED FROM YOUR HOME DIRECTORY/ucsl (refer to the Using CUSP Data Facility section of Setting up your environment). If you have set up the environment locally on your machine, start the notebook following the instructions from Using CUSP Data Facility and then use the Upload button in the right-hand corner to upload the solutions to your home directory.
  • Create a directory/folder named ucsl inside your home directory and save all the solutions in that directory
  • The submissions will be fetched automatically and NO EMAIL SUBMISSIONS WILL BE ACCEPTED
  • Passing grade is 60%
  • Late submission:
    • Inform the TA as soon as possible.
    • 1 week late: graded out of 90% of the total grade
    • 2 weeks late: graded out of 75% of the total grade
  • Plagiarism will not be tolerated. For more information, refer to Academic Integrity for students at NYU.
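If you want to double-check where a solution file should live, the naming rule above can be sketched as a small helper. The function solution_path is hypothetical (it is not part of any course tooling), and it assumes the lowercase ucsl directory name used in the naming rule.

```python
from pathlib import Path


def solution_path(challenge_number, home=None):
    """Build the expected location of a solution notebook.

    Assumes the layout <home_directory>/ucsl/ChallengeX_Solutions.ipynb.
    """
    home_dir = Path.home() if home is None else Path(home)
    return home_dir / "ucsl" / f"Challenge{challenge_number}_Solutions.ipynb"


# Example with a made-up home directory
path = solution_path(3, home="/home/ab1234")
print(path.name)
print(path.parent.name)
```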

Contact/ Office Hours

The best way to contact me is via email: mohit.sharma@nyu.edu

Teaching Assistant:

Virtual Office Hours: Tuesday 2pm - 4pm EST.

Open Source license

This notebook will remain available to you after you have completed the UCSL bootcamp, as an HTML document at https://sharmamohit.com/tutorials/ucsl/ and in notebook form at https://github.com/Mohitsharma44/ucsl17, which you can download locally (or clone into your home directory on the CDF).

Feel free to make edits and use the material in whatever way you want. Even better, if you find any errors or want to contribute to improving the bootcamp, make a pull request or simply fork it for yourself!

Good Luck!
