Notes from PyData London 2018

PyData London 2018

t-shirt sponsors

Around 700 attendees across three days. Talks covering new techniques and applications of machine learning (ML), natural language processing (NLP), image classification, more traditional methods, graph theory and architecture of data science projects. This was my second PyData conference having previously attended the 2016 event.

If you look at one thing from the talk, pick one of these talks (details below):

  • 'Winning with Simple, even Linear, Models' - for ways to get a lot out of linear models
  • 'Planes, Trains, and Skateboard Shoes - Bayesian methods in engineering and product design' - just a great talk and intro to Gaussian processes.
  • 'Building out data science at QBE' - to see how the competition are doing it.
  • 'Python Doesn’t Have to Be Slow: Speeding Up a Large-Scale Optimization Algorithm' - how to get the most out of Python for computationally heavy processes.

From the talks and my chats with other attendees my takeaways are:

  • Beyond particular techniques there is still a compelling argument for actuarial teams to adopt many data science practices (use of code, how to measuring model accuracy, version control, notebooks).
  • Data science is still a young profession with attendees average age around 30 and many people coming from PhDs in the sciences. There is an energy and enthusiasm among the attendees that is quite infectious.
  • Wide range of industries and academia (tech, fashion, taxi apps, supermarkets, insurance (only QBE), banking, telecoms, Google) represented - everyone is investing in this.
  • Pace of innovation and development of open source data science is accelerating with new features added to existing tools and new tools being released all the time.
  • As a result, fragmented ecosystem tools is a challenge as there are so many options and pros and cons of each option are not clear. Still the pros of open source out-way the cons in my opinion.
  • Most people work within mixed teams of developers, IT, data science folk and end customers (e.g. marketers, traders, etc) rather than just with other data scientists which is very different to most actuarial teams in the insurance industry.
  • Most people use Slack in their work teams for messaging (replacing email and instant messaging), macs and Linux for computing and Git for version control. Most people code and present their results through Jupyter notebooks. They are comfortable with cloud tools such as Amazon AWS and Google Cloud and avoid proprietary software as much as possible.
  • Code is king - no-one is talking about GUIs for data science (although being PyData it is a biased sample). They may use code to build interfaces for exploring data or communicating with their clients.

All talks should end up here.

Friday

Computer Vision: An (Un?)Expected Journey, with Keras and Tensorflow

Speaker: Rodolfo Bonnin (Machinalis)

GitHub repository

Can also be run on Google CoLab

Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.

Keras seems to be a ML library with a straightforward syntax that incorporates many standard modelling techniques into the default fitting routines (e.g. validation).

Most of the neural net stuff was over my head but this talk introduced many concepts which will make it easier to get started if I start looking at image classification in more detail.

Use of Google CoLab was interesting.

Apache Airflow in the Cloud: Programmatically orchestrating workloads with Python

Speakers: Satyasheel, Kaxil Naik (Airflow contributor) both at Data Reply

GitHub repo

Failed to run on my machine due to a combination of Windows and Python 2 issues (Airflow was not working on Python 3 at the time, bug fixed by Sunday apparently). Airflow looks like the go to tool for managing tasks in complex calculations.

Looks like the goto tool for running workflows - potentially useful for ML and capital modelling.

Network Science, Game of Thrones and US Airports

Speaker: Mridul Seth

GitHub repo

Subject was very interesting and the examples helped with understanding the concepts.

Hypothesis Testing with SciPy

Speaker: Hillary Green-Lerman, head of data science at Codecademy.com

Codecademy Course

A bit basic for actuaries I suspect.

Winning with Simple, even Linear, Models

Speaker: Vincent Warmerdam, koaning.io

http://koaning.io/theme/notebooks/bayes.pdf

Talk differed from the title. Covered a lot of topics but key takeaway was that linear models and feature engineering still have a lot of value compared to more complex ML techniques.

Saturday

Keynote: Democratising data journalism: building a collaborative and investigative network across the UK

Speaker: Charles Boutaud

Data, Science: Investigating 1 Million Galaxies with Humans and TensorFlow

projects/penguintom79/penguin-watch

An astrophysics post grad on labelling galaxy images. He is using a website called zooniverse to get volunteers to crowd source labelling of galaxies.

Python Doesn’t Have to Be Slow: Speeding Up a Large-Scale Optimization Algorithm

Speaker: Dat Nguyen (Zopa)

Slides

  • Speeding up their NP hard problem using a combination of Numpy over Pandas, joblib and Numba.
  • Recommends profiling before optimisation and using ipython profiling.
  • Preferred Numba over cython or C++ to keep code Pythonic.
  • NamedTuples
  • Joblib for parrallisation
  • Numba
  • Processes 1M assets in 20s

Creating correct and capable classifiers

Speaker: Ian Ozsvald modelinsight.io, founding member of PyData London

Slides

Working with QBE currently.

Data_science_delivered GitHub repo

  • pandas_profiling - useful for high level view of data frame
  • Use of dummy classifier for a baseline model
  • Do repeatedkfold on dummy classifier to understand variability in response
  • randomforest to provide new baseline
  • YellowBrick highly recommended for visualising ML results
  • improve confusion matrix with probabilities in cells
  • ELI5 'explain it like I'm 5' library for sklearn models.
  • Permeation importance method for finding variable importance across model types
  • classification report on splits of data by key factors
  • tSNE using fit error to spot poor performing regions
  • Pandas colouring in Jupiter notebook?

Building out data science at QBE

Speakers: Natalie, Liam P. Kirwin, both QBE

  • 8 data scientists, 4 software engineers, 20 people total
  • Report to head of business intelligence
  • One of 7 strategic themes of the company
  • Team started with fraud
  • Viewed QBE actuarial team as 'old school, better safe than sorry, do everything in Excel'
  • Fraud check - checking GP register for GP's details
  • A hard fight to get software they needed but now have good relationship with IT
  • Noticeably less detailed talk than others - possibly due to not wanting to give away IP (but would equally apply to other presenters)
  • Lime

Stationary data? Forget about it! Bayesian forgetting and Random Effects for forecasting TV ratings

Speaker: Ruadhán Stokes, Channel 4

Bayesian TV forecasting model. Could recursive least squares be applied to claims inflation? https://github.com/rstokes92/pydata

Making the Big Data ecosystem work together with Python: Apache Arrow, Spark, Beam, and Dask

Speaker: Holden Karau, Google

Sunday

Learning programming and science with Scientific Python

Speaker: Emmanuelle Gouillart ( director of joint CNRS / Saint-Gobain)

JupyterHub from the Ground Up with Kubernetes

Speaker: Camila Montonen, @spimescape, race-conditions.net

Slides

JupyterHub is a way for creating a consistent porgramming environment with computing resources on remote servers. Useful for companies, teaching or some.web services (quantopian)

Databases for Data-Science

Speaker: Alex Hendorf, Konigsweg

Slides

Do tyres dream of electric clouds?

Speaker: T. Alisi, Pirelli

Slides

  • Data-Science team of around 20
  • Used https://www.dominodatalab.com as a data analysis platform.
  • See slide 18 for mix of teams and tools.

Using Survival Analysis to understand customer retention

Speaker: Lorna Brightmore, tails.com

Use of survival models to predict retention of customers to a dog food subscription service. Python packages lifelines, sklearn-survival, or in R, KMSurv, OISurv

Planes, Trains, and Skateboard Shoes - Bayesian methods in engineering and product design

Speaker: Jim Parr, Maclaren Applied Technologies

GitHub rep

Used notebook slides which looked really cool. Use of Gaussian processes to optimise design of products. One of the best talks I saw.

Talks I missed

Test-Driven Data Analysis

https://github.com/tdda/pydatalondon2018ad

http://www.tdda.info/?from=@

Missed this one but looks interesting

Deep Probabilistic Methods with PyTorch

https://github.com/chrisorm/pydata-2018

Visualising NLP pipelines with Pynorama

Didn't see talk but new tool looks really useful. GitHub Repo.

Optimisation

https://github.com/gcampanella/pydata-london-2018

Evaluating fairness in machine learning with PyMC3

Couple of the papers mentioned in the fairness talk: - Counterfactual Fairness https://arxiv.org/abs/1703.06856 - Fairness in Criminal Justice Risk Assessments: The State of the Art https://arxiv.org/abs/1703.09207

For the Fairness talk, Google has published an interactive visualisation for understanding various fairness policies: https://research.google.com/bigpicture/attacking-discrimination-in-ml/

Based on paper ‘Equality of Opportunity in Supervised Learning’: https://arxiv.org/abs/1610.02413

RNN sequence labeling for document parsing in Tensorflow

Speaker: Carsten van Weelden, textKernel, Nederland

Using recurring neural network (RNN) to parse over CVs to create structured data. Used TensorBoard to visualise the NN.

Searching for Shady Patterns: Shining a light on UK corporate ownership

https://github.com/Cadarn/pyDataLDN2018_network_corporations

CatBoost

New ML technique that is yielding strong results.

Touchdown Localisation with aircraft flight data

More About Generators

Benefits of linear models

Making computation easier with cool Numpy tricks

https://github.com/kirit93/NumpyTutorial

Missed this one but heard lots of good things so need to watch the video.

Understanding and diagnosing your machine-learning models

Speaker: Gael Varoquaux, one of the founding contributors to sci-kit learnt

Missed this one but attendees said the talk was very good.

http://gael-varoquaux.info/interpreting_ml_tuto/

https://github.com/GaelVaroquaux/interpreting_ml_tuto

Demystifying pandas internals