Entrepreneurial Geekiness
Artificial Intelligence, high-tech entrepreneurship, coffee, London
About
This is Ian Ozsvald's blog (@IanOzsvald), I'm an entrepreneurial geek, a Data Science/ML/NLP/AI consultant, author of O'Reilly's High Performance Python book, co-organiser of PyDataLondon, a Pythonista, co-founder of ShowMeDo and also a Londoner. Here's a little more about me.
16 February 2018 - 21:49
PyData Conference & AHL Hackathon

Our 5th annual PyDataLondon conference will run this April 27-29th; this year we grow from 330 to 500 attendees. As before this remains a volunteer-run conference (with support from the lovely core NumFOCUS team), just as the monthly meetup is a volunteer-run event.

The Call for Proposals is open until the start of March (you have 2 weeks!) – first-time speakers are keenly sought. Our mentorship programme is in full swing to help new speakers craft a good proposal before it hits the (volunteer-run) review committee. As usual we expect 2-3 submissions per speaking slot, so the competition to speak at PyDataLondon will remain high. We also have a set of diversity grants to support those who might otherwise not attend the conference – don’t be afraid to apply for a grant.

Tickets are on sale already; this year’s programme will go live towards the end of March. If you’d like a taste of what goes on at a PyDataLondon conference, see my write-up from 2017 and the 2017 schedule.

The week before the conference our generous meetup hosts AHL are holding a Python Data Science Hackathon. You should definitely apply if you’re anywhere near London (I have!). They have budget to fly in some core developers – if your project hasn’t yet applied and you’re interested in being involved with a large open-source science hackathon, please do visit their site and apply. Here you have a chance to make a strong contribution to the open source tools that we all use.

Finally – if you’re interested in the jobs going in the UK Python Data Science world, take a look at my data science jobs list. 7-10 jobs get emailed out every 2 weeks to over 900 people, and people are successfully getting new jobs via this list.

Ian applies Data Science as an AI/Data Scientist for companies in ModelInsight; sign up for Data Science tutorials in London. Historically Ian ran Mor Consulting. He also founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.

Tags: Data science, pydata, Python
31 December 2017 - 15:23
Python Data Science jobs list into 2018

I’ve been building my data-science jobs list for a couple of years now. Almost 800 folk are on the list; they receive an email update once every two weeks containing around seven job ads. Many active members of PyDataLondon are on the list. The ads are mostly London-based, with a few spread into Europe. In addition to the jobs I’ve added a “book of the month” and “video of the month” recommendation, along with an open source project that is after contributions from the community. If a selection of jobs and educational recommendations every couple of weeks feels like a useful addition to your inbox – join the mailchimp list here. Your email is never revealed, you’re in control, and you can unsubscribe at any time.

“I’m very grateful for Ian’s job list as it enabled me to find a DS job in an interesting and meaningful domain, and furthermore connected me with likeminded folk. Strongly recommend.” – Frank Kelly, Senior Data Scientist @HAL24K

Companies who have advertised include AHL (our host for PyDataLondon), BBC, Channel 4, QBE Insurance, Willis Towers Watson, UCL and Cambridge Universities, HAL24K, Just Eat, Oxbotica, SkyScanner and many more. Roles range from junior to head-of-dept for data science and data engineering; most are permanent roles, some are contract roles.

“After placing a contract ad on this list I was contacted by a number of high quality and enthusiastic data scientists, who all proposed innovative and exciting solutions to my research problem, and were able to explain their proposals clearly to a non-specialist; the quality of responses was so high that I was presented with a real dilemma in choosing who to work with.” – Hazel Wilkinson, Cambridge University

Anyone can post to the list; PyDataLondon members get to make a first post to the list gratis (I take the time cost as a part of my usual activity of community-building in London). All posts come via me to check that they’re suitable, and they go out every two weeks for three iterations. Contact me directly (ian.ozsvald at modelinsight dot io) if you’re interested in making a post.

Tags: Data science, pydata, Python
15 November 2017 - 16:49
PyDataBudapest and “Machine Learning Libraries You’d Wish You’d Known About”

I’m back at BudapestBI and this year it has its first PyDataBudapest track. Budapest is fun! I’ve had a second iteration talking on a slightly updated “Machine Learning Libraries You’d Wish You’d Known About” (updated from PyDataCardiff two weeks back). When I was here to give an opening keynote talk two years back the conference was a bit smaller; it has grown by +100 folk since then. There’s also a stronger emphasis on open source R and Python tools. As before, the quality of the members here is high – the conversations are great!

During my talk I used my Explaining Regression Predictions Notebook to cover:

- Dask to speed up Pandas
- TPOT to automate sklearn model building
- Yellowbrick for sklearn model visualisation
- ELI5 with Permutation Importance and model explanations
- LIME for model explanations
Nick’s photo of me on stage

Some audience members asked about co-linearity detection and explanation. Whilst I don’t have a good answer for identifying these relationships, I’ve added a seaborn pairplot, a correlation plot and the Pandas Profiling tool to the Notebook, which help to show these effects. Although it is complicated, I’m still pretty happy with this ELI5 plot that’s explaining feature contributions to a set of cheap-to-expensive houses from the Boston dataset:
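The quickest co-linearity check behind those plots is a plain correlation matrix. A rough sketch on synthetic data (feature names invented; the Notebook uses seaborn’s pairplot and a heatmap for the visual versions):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
rooms = rng.uniform(1, 8, size=200)
# floor area is strongly driven by rooms, so the two are co-linear
area = rooms * 25 + rng.normal(0, 5, size=200)
age = rng.uniform(0, 100, size=200)  # independent of the others

df = pd.DataFrame({"rooms": rooms, "area": area, "age": age})

corr = df.corr()  # Pearson correlation between every pair of features
print(corr.round(2))

# flag highly correlated pairs as co-linearity candidates
threshold = 0.9
pairs = [(a, b) for a in corr.columns for b in corr.columns
         if a < b and abs(corr.loc[a, b]) > threshold]
print(pairs)  # rooms/area should be flagged; age should not
```

Anything this simple flags is worth inspecting in a pairplot before trusting per-feature explanations.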
Boston ELI5

I’m planning to do some training on these sorts of topics next year; join my training list if that might be of use.

Tags: Data science, pydata, Python
5 November 2017 - 22:47
PyConUK 2017, PyDataCardiff and “Machine Learning Libraries You’d Wish You’d Known About”

A week back I had the pleasure of talking on machine learning at PyConUK 2017 in the inaugural PyDataCardiff track. Tim Vivian-Griffiths and colleagues did a wonderful job building our second PyData conference event in the UK. The PyConUK conference just keeps getting better – 700 folk, 5 tracks, a huge kids track and lots of sub-events. Pythontastic! Cat Lamin has a lovely write-up of the main conference. If you’re interested in PyDataCardiff then note that Tim has set up an announcements list; join it to hear about meetup events around Cardiff and Bristol.

I spoke on the Saturday on “Machine Learning Libraries You’d Wish You’d Known About” (slides here) – this is a précis of topics that I figured out this year:

- Using Pandas multi-core with Dask
- Automating your machine learning with TPOT on sklearn
- Visualising your machine learning with YellowBrick
- Explaining why you get certain machine learning answers with ELI5 and LIME
- See my “Explaining Regression” Notebook for lots of examples with YellowBrick, ELI5, LIME and more (I used this to build my talk)
Audience at PyConUK 2017

As with last year I was speaking in part to existing engineers who are ML-curious, to show ways of approaching machine learning diagnosis with an engineer’s mindset. Last year I introduced Random Forests for engineers using a worked example. Below you’ll find the video for this year’s talk:
I’m planning to do more teaching on data science and Python in 2018 – if this might interest you, please join my training mailing list. Posts will go out rarely, to announce new public and private training sessions that’ll run in the UK.

At the end of my talk I made a request of the audience; I’m going to start doing this more frequently. My request was “please send me a physical postcard if I taught you something” – I’d love to build up some evidence on my wall that these talks are useful. I received my first postcard a few days back and I’m rather stoked. Thank you Pieter! If you want to send me a postcard, just send me an email. Do please remember to thank your speakers – it is a tiny gesture that really carries weight.
First thank-you postcard after my PyConUK talk

Thanks to O’Reilly I also got to participate in another High Performance Python signing, this time with Steve Holden (Python in a Nutshell: A Desktop Quick Reference), Harry Percival (Test-Driven Development with Python 2e) and Nicholas Tollervey (Programming with MicroPython):
I want to say a huge thanks to everyone I met – I look forward to a bigger and better PyConUK and PyDataCardiff next year! If you like data science and you’re in the UK, please do check out our PyDataLondon meetup. If you’re after a job, I have a data scientist’s jobs list.

Tags: Data science, pydata, Python
1 July 2017 - 17:38
Kaggle’s Mercedes-Benz Greener Manufacturing

Kaggle are running a regression machine learning competition with Mercedes-Benz right now; it closes in a week and runs for about 6 weeks overall. I’ve managed to squeeze in 5 days to have a play (I managed about 10 days on the previous Quora competition). My goal this time was to focus on new tools that make it faster to get to ‘pretty good’ ML solutions. Specifically I wanted to play with:

- TPOT “auto scikit-learn” (but not the auto-sklearn package, which is related)
- The YellowBrick sklearn visualiser

Most of the 5 days were spent either learning the above tools or making some suggestions for YellowBrick; I didn’t get as far as creative feature engineering. Initially I was in the top 50th percentile; now the competition has finished I’m at rank 1497 (top 37th percentile) on the leaderboard, using raw features, some dimensionality reduction and various estimators, with 5 days of effort.

TPOT is rather interesting – it uses a genetic algorithm approach to evolve the hyperparameters of one or more (Stacked) estimators. One interesting outcome is that TPOT was presenting good models that I’d never have used – e.g. an AdaBoostRegressor & LassoLars or GradientBoostingRegressor & ElasticNet. TPOT works with all sklearn-compatible classifiers including XGBoost (examples) but recently there’s been a bug with n_jobs and multiple processes. Due to this the current version had XGBoost disabled; it looks now like that bug has been fixed. As a result I didn’t get to use XGBoost inside TPOT; I did play with it separately, but the stacked estimators from TPOT were superior. Getting up and running with TPOT took all of 30 minutes; after that I’d leave it to run overnight on my laptop. It definitely wants lots of CPU time. It is worth noting that autosklearn has a similar n_jobs bug and the issue is known in sklearn.
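TPOT exports its winning configuration as a plain sklearn pipeline. The GradientBoostingRegressor & ElasticNet pairing mentioned above can be reproduced by hand with a modern sklearn’s `StackingRegressor` – a hedged sketch on toy data (TPOT’s actual exported pipeline uses its own `StackingEstimator` wrapper and will differ):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=10.0,
                       random_state=0)

# ElasticNet's prediction is fed, alongside the raw features, into the
# final gradient-boosted stage - mirroring the stacked pairs TPOT found
stack = StackingRegressor(
    estimators=[("enet", ElasticNet(alpha=1.0))],
    final_estimator=GradientBoostingRegressor(random_state=0),
    passthrough=True,  # keep raw features next to ElasticNet's output
)

scores = cross_val_score(stack, X, y, cv=5, scoring="r2")
print(scores.mean())
```

The appeal of TPOT is that it searches over thousands of such pipelines overnight rather than you wiring up one by hand.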
It does occur to me that almost all of the models developed by TPOT are subsequently discarded (you can get a list of configurations and scores). There’s almost certainly value to be had in building averaged models of combinations of these; I didn’t get to experiment with this. Having developed several different stacks of estimators, my final combination involved averaging these predictions with the trustable-model provided by another Kaggler. The mean of these three pushed me up to 0.55508. My only feature engineering involved various FeatureUnions with the FunctionTransformer based on dimensionality reduction.

YellowBrick was presented at our PyDataLondon 2017 conference (write-up) this year by Rebecca (we also did a book signing). I was able to make some suggestions for improvements on the RegressionPlot and PredictionError, along with sharing some notes on visualising tree-based feature importances (and noting a demo bug in sklearn). Having more visualisation tools can only help; I hope to develop some intuition about model failures from these sorts of diagrams. Here’s a ResidualPlot with my added inset prediction-errors distribution; I think this should be useful when comparing plots between classifiers to see how they’re failing:
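Averaging a handful of diverse models is a one-liner once you have their predictions. A rough sketch of the “mean of three” blend described above (the models and data here are invented stand-ins, not the competition models):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=8, noise=15.0,
                       random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

models = [
    RandomForestRegressor(n_estimators=50, random_state=1),
    GradientBoostingRegressor(random_state=1),
    Ridge(alpha=1.0),
]
preds = np.column_stack([m.fit(X_train, y_train).predict(X_test)
                         for m in models])

# the blend is just the row-wise mean of the three prediction vectors
blend = preds.mean(axis=1)
print(blend.shape)  # one averaged prediction per test row
```

The averaging helps most when the component models make uncorrelated errors, which is why TPOT’s unusual stacks are good blending candidates.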
Tags: Data science, pydata, Python
7 June 2017 - 18:10
Kaggle’s Quora Question Pairs Competition

Kaggle’s Quora Question Pairs competition has just closed; I’m pleased to say that with 10 days of effort I ranked in the top 39th percentile (rank 1346 of 3396 on the private leaderboard). Having just run and spoken at PyDataLondon 2017, taught ML in Romania and worked on several client projects, I only freed up time right at the end of this competition. Despite joining at the end I had immense fun – this was my first ‘proper’ Kaggle competition. I figured a short retrospective here might be a useful reminder to myself in the future.

Things that worked well:

- Use of github, Jupyter Notebooks, my research module template
- Python 3.6, scikit-learn, pandas
- RandomForests (some XGBoost but ultimately just RFs)
- Dask (great for using all cores when feature engineering with Pandas apply)
- Lots of text similarity measures, word2vec, some Part of Speech tagging
- Some light text clean-up (punctuation, whitespace, some mixed-case normalisation)
- Spacy for PoS noun extraction, some NLTK
- Splitting feature generation and ML exploitation into different Notebooks
- Lots of visualisation of each distance measure by class (mainly matplotlib histograms on single features)
- Fully reproducible Notebooks with fixed seeds
- Debugging code to diagnose the most-wrong guesses from the model (pulling out features and the raw questions was often enough to get a feel for “what it missed”, which led to thoughts on new features that might help)

Things that I didn’t get around to trying due to lack of time:

- PoS named entities in Spacy, my own entity recogniser
- GloVe, wordrank, fasttext
- Clustering around topics
- Text clean-up (synonyms, weights & measures normalisation)
- Use of an external corpus (e.g. Stackoverflow) for TF-IDF counts
- Dask on EC2
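The text-similarity measures in the list above are cheap to compute. A hedged sketch of two such features on a question pair – the actual competition features were far more numerous and included word2vec distances:

```python
from difflib import SequenceMatcher

def jaccard(q1, q2):
    """Token-set overlap: |intersection| / |union|."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b)

def char_ratio(q1, q2):
    """Character-level similarity from difflib (0.0 to 1.0)."""
    return SequenceMatcher(None, q1.lower(), q2.lower()).ratio()

q1 = "How do I learn Python quickly?"
q2 = "What is the fastest way to learn Python?"

features = {"jaccard": jaccard(q1, q2), "char_ratio": char_ratio(q1, q2)}
print(features)
```

Plotting each such measure as a histogram split by duplicate/non-duplicate class (as in the list above) quickly shows which features carry signal.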
Things that didn’t work so well:

- Fully reproducible Notebooks (great!) to generate features, but with no caching of no-need-to-rebuild-yet-again features, so I did a lot of recalculating of features (which really hurt in the last 2 days) – a possible solution below with named columns
- Notebooks are still a PITA for debugging; attaching a console with --existing works ok until things start to crash, and then it gets sticky
- Running out of 32GB of RAM several times on my laptop and having a semi-broken system whilst trying to persist partial models to disk – I should have started with an AWS deployment earlier so I could easily turn on more cores+RAM as needed
- I barely checked the Kaggle forums (only reading the Notebooks concerning the negative resampling requirement) so I missed a whole pile of tricks shared by others; some I folded in on the last day but there’s a huge pile that I missed – I think I might have snuck into the top 20% of rankings if I’d used this public information
- Calibrating RandomForests (I’m pretty convinced I did this correctly but it didn’t improve things; I’m not sure why)

Dask definitely made parallelisation easier, with only a few lines of overhead in a function beyond a normal call to apply. The caching, if using something like luigi, would add a lot of extra engineered overhead – not so useful in a rapidly iterating 10-day competition. I think next time I’ll try using version-named columns in my DataFrames. Rather than having e.g. “unigram_distance_raw_sentences” I might add “_v0”; if that calculation process is never updated then I can just use a pre-built version of the column. This is a poor man’s caching strategy. If any dependencies existed then I guess luigi/airflow would be the next step. For now at least I think a version number will solve my most immediate time-sink of recent days.

I hope to enter another competition soon. I’m also hoping to attend the London Kaggle meetup at some point to learn from others.
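The version-named-column idea is simple to implement: only compute a feature if its versioned column is absent from the DataFrame. A minimal sketch (the helper name and feature are invented for illustration):

```python
import pandas as pd

def cached_feature(df, name, version, build_fn):
    """Compute a feature column only if its versioned name is missing.

    Bumping `version` after changing `build_fn` forces a rebuild;
    otherwise the previously stored column is reused as-is.
    """
    col = f"{name}_v{version}"
    if col not in df.columns:
        df[col] = build_fn(df)
    return df[col]

df = pd.DataFrame({"q1_len": [5, 9, 3], "q2_len": [4, 9, 8]})

# first call computes and stores the column...
cached_feature(df, "len_diff", 0,
               lambda d: (d["q1_len"] - d["q2_len"]).abs())
# ...a second call with the same version is a cheap lookup; the
# (deliberately broken) build function below is never evaluated
diffs = cached_feature(df, "len_diff", 0, lambda d: 1 / 0)
print(diffs.tolist())  # [1, 0, 5]
```

With the DataFrame persisted to disk between Notebook runs, this avoids the expensive recomputation described above at the cost of a manual version bump.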
Tags: ArtificialIntelligence, Python
1 June 2017 - 15:30
PyDataLondon 2017 Conference write-up

Several weeks back we ran our 4th PyDataLondon (2017) conference – it was another smashing success! This builds on our previous 3 years of effort (2016, 2015, 2014) building both the conference and our over-subscribed monthly meetup. We’re grateful to our host Bloomberg for providing the lovely staff, venue and catering.

“Really got inspired by @genekogan’s great talk on AI & the visual arts at @pydatalondon” – @annabellerol

Each year we try some new ideas – this year we tried:

- A 4-hour Hackathon on NLP to help our keynote speakers from FullFact try new ideas
- A “Making your first open source submission” session to help 30 people step closer to contributing to open source (thanks Katy and Nick for leading that session!)
- A scavenger hunt
- A crazy pub quiz (James Powell is responsible for blowing quite a few minds) on Saturday evening
- A longer-running Slack channel
- New co-chair Ruby and review committee lead Linda
- More financial support for speakers and attendees to increase diversity
- Several unconference slots (filled by volunteers on the day) which spoke on Bayesian statistics and other topics
- Our first Women @ PyData reception just before the conference started
- More book signing
- Fast video turnaround into YouTube
- A drone giveaway (by sponsor [and my business partner] Endava)

“pros: Great selection of talks for all levels and pub quiz; cons: on a weekend, pub quiz (was hard). Overall would recommend 9/10” – @harpal_sahota

We’re very thankful to all our sponsors for their financial support and to all our speakers for donating their time to share their knowledge. Personally I say a big thank-you to Ruby (co-chair) and Linda (review committee lead) – I resigned both of these roles this year after 3 years and I’m very happy to have been replaced so effectively (ahem – Linda – you really have shown how much better the review committee could be run!). Ruby joined Emlyn as co-chair for the conference; I took a back seat in both roles and supported where I could. Our volunteer team was great again – thanks Agata for pulling this together. I believe we had 20% female attendees – up from 15% or so last year. Here’s a write-up from Srjdan and another from FullFact (and one from Vincent as chair at PyDataAmsterdam earlier this year) – thanks!

“#PyDataLdn thank you for organising a great conference. My first one & hope to attend more. Will recommend it to my fellow humanists!” – @1208DL

For this year I’ve been collaborating with two colleagues – Dr Gusztav Belteki and Giles Weaver – to automate the analysis of baby ventilator data with the NHS. I was very happy to have the 3 of us present to speak on our progress; we’ve been using RandomForests to segment time-series breath data to (mostly) correctly identify the start of baby breaths on 100 Hz single-channel air-flow data. This is the precursor step to starting our automated summarisation of a baby’s breathing quality. Slides here and video below:
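Our actual pipeline isn’t public, but the general shape of RandomForest-based breath segmentation – windowed features over a 1-D flow signal, classified per window – can be sketched like this. Everything here (the synthetic sinusoidal signal, the window size, the features, the labelling rule) is invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# synthetic 100 Hz "air-flow" trace: one sinusoidal breath every 2 s
t = np.arange(0, 10, 0.01)             # 10 seconds at 100 Hz
clean = np.sin(2 * np.pi * 0.5 * t)    # idealised breathing waveform
flow = clean + rng.normal(0, 0.1, t.size)

window = 20                            # 0.2 s windows
n = t.size // window
segments = flow[: n * window].reshape(n, window)
clean_seg = clean[: n * window].reshape(n, window)

# per-window features: mean, spread and net rise of the noisy signal
feats = np.column_stack([
    segments.mean(axis=1),
    segments.std(axis=1),
    segments[:, -1] - segments[:, 0],
])

# a window is a "breath start" if the clean waveform crosses zero
# upward inside it (a stand-in for hand-labelled annotations)
labels = (clean_seg[:, 0] <= 0) & (clean_seg[:, -1] > 0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(feats, labels)
print(clf.score(feats, labels))  # training accuracy on the toy data
```

Real ventilator traces are far messier than this sinusoid, which is why a learned classifier beats hand-tuned thresholds.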
This updates our talk at the January PyDataLondon meetup. This collaboration came about after I heard Dr. Belteki’s talk at PyConUK last year, whilst I was there to introduce RandomForests to Python engineers. You’re most welcome to come and join our monthly meetup if you’d like.

Many thanks to all of our sponsors again, including Bloomberg for the excellent hosting, Continuum for backing the series from the start and NumFOCUS for bringing things together behind the scenes (and for supporting lots of open source projects – that’s where the money we raise goes to!). There are plenty of other PyData and related conferences and meetups listed on the PyData website – if you’re interested in data then you really should get along. If you don’t yet contribute back to open source (and really – you should!) then do consider getting involved as a local volunteer. These events only work because of the volunteered effort of the core organising committees, and extra hands (especially new members to the community) are very welcome indeed.

I’ll also note – if you’re in London or the south-east of the UK and you want to get a job in data science you should join my data scientist jobs email list; a set of companies who attended the conference have added their jobs for the next posting. Around 600 people are on this list and around 7 jobs are posted out every 2 weeks. Your email is always kept private.

Tags: Data science, Life, pydata, Python
27 January 2017 - 13:06
Introduction to Random Forests for Machine Learning at the London Python Meetup

Last night I had the pleasure of returning to London Python to introduce Random Forests (this builds on my PyConUK 2016 talk from September). My goal was to give a pragmatic introduction to solving a binary classification problem (Kaggle’s Titanic) using scikit-learn. The talk (slides here) covers:

- Organising your data with Pandas
- Exploratory Data Visualisation with Seaborn
- Creating a train/test set and using a Dummy Classifier
- Adding a Random Forest
- Moving towards Cross Validation for higher trust
- Ways to debug the model (from the point of view of a non-ML engineer)
- Deployment

Code for the talk is a rendered Notebook on github. I finished with a slide on Community (are you contributing? do you fulfil your part of the social contract to give back when you consume from the ecosystem?) and another pitching PyDataLondon 2017 (May 5-7th). My colleague Vincent is over from Amsterdam – he pitched PyDataAmsterdam (April 8-9th). The Call for Proposals is open for both; get your talk ideas in quickly please.

I’m really happy to see the continued growth of the London Python meetup; this was one of the earliest meetups I ever spoke at. The organisers are looking for speakers – do get in touch with them via meetup to tell them what you’d like to talk on.

Tags: Data science, Python
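The baseline-first workflow from that talk outline can be sketched as follows, using synthetic stand-in data rather than the Titanic set:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# a binary classification problem standing in for Titanic survival
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# step 1: a DummyClassifier gives the score any real model must beat
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("baseline:", baseline.score(X_test, y_test))

# step 2: a RandomForest, evaluated first on the held-out test set...
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("forest:  ", forest.score(X_test, y_test))

# step 3: ...then with cross validation for a more trustworthy estimate
scores = cross_val_score(forest, X, y, cv=5)
print("cv mean: ", scores.mean())
```

The dummy baseline is the important habit: if the forest can’t beat predicting the majority class, the features (not the model) need work.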
20 January 2017 - 18:54
PyDataLondon 2017 Conference Call for Proposals Now Open

This year we’ll hold our 4th PyDataLondon conference during May 5th-7th at Bloomberg (thanks Bloomberg!). Our Call for Proposals is open and will run during February (closing date to be confirmed, so don’t just forget about it – get on with making a draft submission soon). We want talks at all levels (first-timers especially welcome) from beginner to advanced; we want both regular talks and tutorials. We’ll be experimenting with the overflow room just as we did last year (possibly including Office Hours and ‘how to contribute to open source’ workshops).

Take a look at the 2016 Schedule to see the range of talks we had – engineering, machine learning, deep learning, visualisation, medical, finance, NLP, Big Data – all the usual suspects. We want all of these and more. Personally I’m especially interested in:

- talks that cover the communication of complex data (think – bad Daily Mail Brexit graphics and how we might help people communicate complex ideas more clearly)
- encouraging collaborations between sub-groups
- building on last year’s medical track with more medical topics
- getting journalists involved and sharing their challenges and triumphs

and I’d love to be surprised – if you think it’ll fit, put in a submission! The process of submitting is very easy:

1. Go to the website and sign up to make an account (you’ll need a new one even if you submitted last year)
2. Post a first-draft title and abstract (just a one-liner will do if you’re pressed for time)
3. Give it a day, log back in and iterate to expand on this
4. If your submission is too short then the Review Committee will tell you that you don’t meet the minimum criteria, so you’ll get nagged – but only if you’ve made an attempt first!
5. Iterate, integrating feedback from the Committee, to improve your proposal
6. Keep your fingers crossed that you get selected

We’re also accepting Sponsorship requests; take a look on the main site and get in contact. We’ve already closed some of the options, so if you’d like the price list get in contact via the website right away. I’d like to extend a thank-you to the new and larger Review Committee. I’ve handed over the reins on this; many thanks to the new committee for their efforts.

Tags: Data science, pydata, Python
23 September 2016 - 12:28
Practical ML for Engineers talk at #pyconuk last weekend

Last weekend I had the pleasure of introducing Machine Learning for Engineers (a practical walk-through, no maths) [YouTube video] at PyConUK 2016. Each year the conference grows and maintains a lovely vibe; this year it was up to 600 people! My talk covered a practical guide to a 2-class classification challenge (Kaggle’s Titanic) with scikit-learn, backed by a longer Jupyter Notebook (github) and further backed by Ezzeri’s 2-hour tutorial from PyConUK 2014.
Debugging slide from my talk (thanks Olivia)

Topics covered include:

- Going from raw data to a DataFrame (notable tip – read Katharine’s book on Data Wrangling)
- Starting with a DummyClassifier to get a baseline result (everything you do from here should give a better classification score than this!)
- Switching to a RandomForestClassifier, adding Features
- Switching from a train/test set to a cross-validation methodology
- Dealing with NaN values using a sentinel value (robust for RandomForests, doesn’t require scaling, doesn’t require you to impute your own creative values)
- Diagnosing quality and mistakes using a Confusion Matrix and looking at very-wrong classifications to give you insight back to the raw feature data
- Notes on deployment

I had to cover the above in 20 minutes; obviously that was a bit of a push! I plan to cover this talk again at regional meetups, probably with 30-40 minutes. As it stands the talk (github) should lead you into the Notebook, and that’ll lead you to Ezzeri’s 2-hour tutorial. This should be enough to help you start on your own 2-class classification challenge, if your data looks ‘somewhat like’ the Titanic data.

I’m generally interested in the idea of helping more engineers get into data science and machine learning. If you’re curious – I have a longer set of notes called Data Science Delivered and some vague plans to maybe write a book (maybe) – for the book, join the mailing list here if you’d like to hear more (no hard sell, almost no emails at the moment, I’m still figuring out if I should do this). You might also want to follow up on Katharine Jarmul’s data wrangling talk and tutorial, Nick Radcliffe’s Test Driven Data Analysis (with a new automated TDD-for-data tool to come in a few months), Tim Vivian-Griffiths’ SVM Diagnostics, Dr. Gusztav Belteki’s Ventilator medical talk, Geoff French’s Deep Learning tutorial and Marco Bonzanini and Miguel’s Intro to ML tutorial. The videos are probably in this list.
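The NaN-sentinel tip above works because tree splits can isolate the sentinel value on its own branch, so no scaling or imputation is needed. A short sketch with a confusion matrix for diagnosis (toy data; the sentinel value is chosen arbitrarily, just well outside the feature range):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

# knock out 10% of values, then replace the NaNs with a sentinel far
# outside the data's range; tree splits can separate it cleanly
X[rng.rand(*X.shape) < 0.1] = np.nan
X = np.where(np.isnan(X), -999.0, X)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# rows are true classes, columns are predicted classes; the
# off-diagonal cells are the mistakes worth inspecting by hand
print(confusion_matrix(y_test, clf.predict(X_test)))
```

Pulling out the rows behind the off-diagonal cells and eyeballing their raw features is exactly the “very-wrong classifications” debugging loop from the talk.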
If you like the above then do think on coming to our monthly PyDataLondon data science meetups near London Bridge. PyConUK itself has grown amazingly – the core team put in a huge amount of effort. It was very cool to see the growth of the kids sessions, the trans track, all the tutorials and the general growth in the diversity of our community’s membership. I was quite sad to leave at lunch on the Sunday – next year I plan to stay longer; this community deserves more investment. If you’ve yet to attend a PyConUK then I strongly urge you to think on submitting a talk for next year, and I definitely suggest that you attend. The organisers were kind enough to let Kat and myself do a book signing; I suggest other authors think on joining us next year. Attendees love meeting authors and it is yet another activity that helps bind the community together.
Book signing at PyConUK

Tags: Data science, pydata, Python