March edition of the Berkeley Datahub Newsletter
Get regular updates about the cutting-edge work happening with Berkeley Datahub and across the Berkeley Data Science Teaching Stack.
Datahub Product Updates:
Julia Hub:
Julia Hub is one of the 15+ hubs managed by the Datahub infrastructure team. Its primary focus is enabling instructional workflows that teach scientific computing using the Julia programming language. Julia Hub provides more compute resources than the standard hub, supports advanced simulations, and has its own installation process for adding new packages. Here is an example notebook that runs simulations for the FUND model, which David Anthoff uses to estimate the Social Cost of Carbon. Courses like Math 124 (Programming for Mathematical Applications) use Julia Hub to teach scientific computing.
Datahub Instructor vs Feature Matrix:
Let’s take a look at our initial thinking on how to classify instructors who use Datahub as part of their instructional workflow, based on the characteristics of their coursework and the kinds of use cases they prefer. We classify their needs into the 3 use cases below:
Foundational Use Cases:
The first type of instructor focuses on foundational use cases. Characteristics of such an instructor’s coursework may include any of the following:
Focus is on teaching introductory data science material,
Uses standard packages in Python/R (e.g., NumPy and Matplotlib in Python; dplyr and ggplot2 in R),
Focus is mostly on asynchronous use cases (homework assignments), and
Computational requirements (CPU, RAM, disk) are modest, sufficient for basic data science workflows.
Features used for such an instructor’s coursework may include:
Jupyter Classic Notebook or RStudio to interact with Python or R code, respectively, and
The nbgitpuller extension to create shareable links for assignments.
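To illustrate what such a shareable link looks like, here is a minimal sketch that builds an nbgitpuller URL in Python. It assumes the commonly documented `git-pull` link format with `repo`, `branch`, and `urlpath` query parameters; the hub address, repository, and notebook path are placeholders, and `nbgitpuller_link` is a hypothetical helper, not part of nbgitpuller itself.

```python
from urllib.parse import urlencode


def nbgitpuller_link(hub_url, repo_url, branch="main", path=""):
    """Build a link that pulls a Git repository into the user's hub server.

    hub_url, repo_url and path are placeholders for illustration.
    path is the file to open after the pull, relative to the user's home
    directory (it normally starts with the repository's folder name).
    """
    params = {"repo": repo_url, "branch": branch}
    if path:
        # "tree/" opens the file in the classic notebook interface.
        params["urlpath"] = f"tree/{path}"
    return f"{hub_url.rstrip('/')}/hub/user-redirect/git-pull?{urlencode(params)}"


# Hypothetical hub and repository, used only as an example.
link = nbgitpuller_link(
    "https://hub.example.edu",
    "https://github.com/example-org/example-course",
    branch="main",
    path="example-course/hw01/hw01.ipynb",
)
print(link)
```

A student who clicks a link of this form is sent to their own server, the repository is pulled into their home directory, and the named notebook opens. In practice most instructors generate these links with the nbgitpuller link generator rather than by hand.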
Intermediate Use Cases:
The second type of instructor focuses on intermediate use cases. Characteristics of such an instructor’s coursework may include:
Higher compute requirements (greater than the standard compute offered as part of Datahub),
Regular package additions and upgrades,
Uses Datahub to support both synchronous (exams, labs, live demos, flipped classroom model) and asynchronous (homework assignments) components,
May require elevated (admin) access for GSIs to help students with troubleshooting, and
May require auto-grading solutions to grade student assignments.
Features used by this group may include the following, in addition to the features highlighted in the foundational use cases:
Calendar-based scheduling to handle temporary increases in compute requirements on specific dates and times,
Admin functionality for GSIs to access student servers,
Auto-grading in Python/R using Otter-Grader,
Newer Jupyter interfaces such as JupyterLab and RetroLab, for those inclined to try out the latest tools, and
File archival facility for students to retrieve their files stored in respective hubs.
Advanced/Complex Use Cases:
The third type of instructor focuses on complex use cases. Characteristics of such an instructor’s coursework may include:
Focus is on teaching advanced technical skills for students with prior experience using Datahub,
Synchronous components (labs) with complex requirements,
Project-based workflows, and
Most importantly, tech-savvy instructors who are willing to try new workflows.
Features used by this group of instructors may include the following, in addition to the features covered in the foundational and intermediate use cases:
Real-Time Collaboration (RTC), which lets multiple users work on shared notebooks,
Linux-based desktop environments available in the browser, and
Support for additional software such as PostgreSQL.
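Because each tier builds on the features of the tiers before it, the matrix above can be sketched as a small cumulative lookup. This is only an illustration of the classification: the tier names, feature labels, and the `features_for` helper are paraphrased or hypothetical, not part of any Datahub API.

```python
# Tiers in increasing order of complexity, as described above.
TIERS = ["foundational", "intermediate", "advanced"]

# Features introduced at each tier (labels paraphrased from the lists above).
NEW_FEATURES = {
    "foundational": [
        "Jupyter Classic Notebook / RStudio",
        "nbgitpuller links",
    ],
    "intermediate": [
        "calendar-based scheduling",
        "GSI admin access",
        "auto-grading",
        "JupyterLab / RetroLab",
        "file archival",
    ],
    "advanced": [
        "Real-Time Collaboration (RTC)",
        "browser-based Linux desktop",
        "additional software (e.g. PostgreSQL)",
    ],
}


def features_for(tier):
    """Return every feature available to a tier, including lower tiers."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    upto = TIERS.index(tier) + 1
    return [feature for t in TIERS[:upto] for feature in NEW_FEATURES[t]]


for tier in TIERS:
    print(f"{tier}: {len(features_for(tier))} features")
```

Storing only the features each tier adds keeps the cumulative relationship explicit: an advanced course is assumed to use everything listed for the foundational and intermediate tiers as well.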
Faculty Spotlight:
Instructor: Fernando Perez
Course Name: Stat 159/259 (Reproducible and Collaborative Statistical Data Science)
How did you leverage Berkeley Datahub as part of your course?
My vision is that Datahub should be a perfect environment for full-time, in-depth work for all students enrolled in my class. We also plan to teach students how to replicate the experience locally on their own devices. That's the path I'm pursuing in Stat 159, and I will explore this possibility in depth. Some of the activities students will perform using the Stat 159 hub (a Datahub tailored to the innovative pedagogical needs of the Stat 159 course) are listed below:
Use a base environment with JupyterLab pinned at 3.2.8.
Use Git/GitHub extensively for most workflows. All assignments are currently done via GitHub Classroom.
Collaborate as teams through GitHub repositories, with issues, PRs, etc.
Use the JupyterLab RTC features for teamwork, combined with Syncthing for file sharing.
Run notebooks, much like Data 8/100 for data analysis and programming tasks.
Write pure Python code in scripts.
Possibly compile some Cython/C code at the command line.
Develop documentation and build it with Jupyter Book/Sphinx.
Manipulate documents with LaTeX and Pandoc.
Develop small Python packages, perhaps even up to posting on PyPI (not sure yet).
Create and share new conda environments for projects.
Work with HDF5 and/or NetCDF data files.
Build code in public repositories with Binder support.
[If time permits] Do some simple distributed computing tasks with Dask/Xarray.
So far I'm teaching students how to work fully in JupyterLab + terminal, but I am starting to seriously consider the benefit of adding the virtual desktop package to our image to let them access some GUI applications occasionally. I'm starting to lean heavily towards yes, as I see more and more the value of them finding that the hub is a "suitable home for everything", and sometimes you do need GUI apps for certain tasks on the hub itself (QGIS is a good example of that need, but there are many more).