Latest From the Blog

Learning a new programming language is an investment in human capital, so estimating its return on investment can be very informative. Since the requirements of each industry and each specific job are very particular, a generalizable answer to this question is hard to find. One approach, however, is to analyze the software skills required in job postings: postings reflect current demand and can therefore indicate a general return on investment. We downloaded all German job postings with a data focus from Indeed to get a rough idea of each software's popularity in the German job market.

2,807 data science job postings in Germany

In June 2017, we searched all job postings in Germany for data science keywords (data scientist, data analyst, big data, machine learning). This covers only the postings active at the time, but since postings typically stay online for several weeks, we can assume that this gives a fairly accurate picture of current market demand. The search returned 2,807 job postings, but only 70% of them name a specific software or programming language. Why do so many postings lack specific requirements? Employers often state only general requirements (e.g., knowledge of machine learning or data analytics), and some of the postings contain data science search terms but are not primarily aimed at data scientists.

SQL and Python are the most popular

We searched the 2,807 job postings for the 25 most popular data science software packages. 1,971 postings mention at least one of them, many of them several. The figure below shows the number of job postings that mention a given software. With around 1,000 postings, SQL is the most popular, followed by Python with around 900 mentions and Java with 670.

A similar picture in the US

Is the distribution above characteristic of Germany, or does it reflect global trends? Robert Muenchen has run the same analysis for the US market. The top three are identical: SQL (18,000 jobs), Python (13,000 jobs), and Java (13,000 jobs) dominate the market. Some differences appear further down the list: SAP, for example, is in higher demand in the German market (6th) than in the US market (12th). Overall, however, the two charts are very similar, confirming that software trends are global and that requirements are shaped by the technological frontier.

Investing in programming skills

If you are new to data science or thinking about moving into the field, this analysis gives you a good idea of which programming skills are likely to be particularly valuable in the near future. The high demand for SQL may be a sign that many companies expect not only data analysis skills but also smooth interaction with databases. Python appears to be on the rise: Robert Muenchen shows that Python's popularity has grown strongly over the last three years, with its growth overtaking the other major open-source player, R.
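As an aside, the counting behind the figure above is easy to reproduce. Here is a minimal sketch in Python; the posting texts and the keyword list are illustrative assumptions, not our actual pipeline:

```python
import re
import pandas as pd

# Illustrative posting texts; the real analysis ran over ~2,800 Indeed postings.
postings = pd.Series([
    "We are looking for a data scientist with strong Python and SQL skills.",
    "Machine learning engineer: Java, Python, experience with Spark a plus.",
    "Data analyst with solid SQL and Tableau knowledge.",
])

# A subset of the 25 software keywords; \b word boundaries keep single-letter
# names like R from matching inside other words.
keywords = ["sql", "python", "java", "r", "sas", "spark", "tableau"]

# Count how many postings mention each keyword at least once.
counts = {
    kw: int(postings.str.contains(rf"\b{re.escape(kw)}\b", case=False).sum())
    for kw in keywords
}
print(pd.Series(counts).sort_values(ascending=False))
```

Each posting is counted once per software, so a posting mentioning both SQL and Python contributes to both bars, which matches how the figure above is built.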
Overall, a combination of strong analytical skills in Python and R with solid SQL knowledge should provide a good foundation for a career in the growing data science job market.

Interested in analyzing jobs on Indeed yourself? You can access the jobs through their API. The jobbR package for R is helpful; similar tools exist for Python.
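For those who prefer to query the API directly, here is a minimal sketch in Python. It assumes the Indeed publisher API as it worked around 2017 (the endpoint, parameters, and response fields may have changed since, and you need your own publisher ID):

```python
import requests

# Indeed publisher API endpoint (as of 2017; requires a free publisher ID).
API_URL = "http://api.indeed.com/ads/apisearch"

params = {
    "publisher": "YOUR_PUBLISHER_ID",  # placeholder for your own key
    "q": "data scientist",             # search keywords
    "co": "de",                        # country: Germany
    "format": "json",
    "v": "2",                          # API version
    "limit": 25,                       # results per request
    "start": 0,                        # pagination offset
}

response = requests.get(API_URL, params=params)
response.raise_for_status()

# Print title and company for each returned posting.
for job in response.json().get("results", []):
    print(job.get("jobtitle"), "-", job.get("company"))
```

Paging through `start` in steps of `limit` collects the full result set, which is essentially how the postings for an analysis like the one above can be gathered.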
Companies use machine learning to improve their business decisions. Algorithms select ads, predict consumers' interest, or optimize the use of storage. Few stories about machine learning applications in public policy are out there, however, even though public employees often make comparable decisions. As in the business examples, decisions by public employees often aim to optimize the use of limited resources. Algorithms may assist tax authorities in improving the allocation of available working hours, or help bankers make lending decisions. Similarly, algorithms can be employed to guide decisions taken by social workers or judges.

This blog post lists three research papers that analyze and discuss the use of machine learning for very specific problems in public policy. While the potential seems huge, we do not want to neglect some of the many potential pitfalls of machine learning in public policy. Business applications usually maximize profits. For policy decisions, however, the outcome to maximize may be harder to define, or multidimensional. In many cases, not all relevant outcome dimensions are directly observable and measurable, which makes it more difficult to evaluate the impact of an algorithm. Tech companies usually obtain training datasets through experimentation, while datasets for public policy often contain outcomes only for a selected group of people. If tax authorities never scrutinize restaurants, how can we build a predictive model for this industry? Predictions for public policy problems often face this so-called selected labels problem, and getting around it requires innovative approaches and a willingness to run randomized experiments (the toy simulation at the end of this post makes the problem concrete). This is just a brief list. Susan Athey's paper provides more food for thought on the potential, and potential pitfalls, of using prediction in public policy.

Research on Machine Learning Applications in Public Policy

Improving refugee integration through data-driven algorithmic assignment

Developed democracies are settling an increased number of refugees, many of whom face challenges integrating into host societies. We developed a flexible data-driven algorithm that assigns refugees across resettlement locations to improve integration outcomes. The algorithm uses a combination of supervised machine learning and optimal matching to discover and leverage synergies between refugee characteristics and resettlement sites. The algorithm was tested on historical registry data from two countries with different assignment regimes and refugee populations, the United States and Switzerland. Our approach led to gains of roughly 40 to 70%, on average, in refugees' employment outcomes relative to current assignment practices. This approach can provide governments with a practical and cost-efficient policy tool that can be immediately implemented within existing institutional structures.

Bansak, K., Ferwerda, J., Hainmueller, J., Dillon, A., Hangartner, D., Lawrence, D., & Weinstein, J.; Science, 2018

Switzerland is currently implementing an algorithm-based allocation of refugees. We are excited to see the first results!

Human Decisions and Machine Predictions

Can machine learning improve human decision making? Bail decisions provide a good test case. Millions of times each year, judges make jail-or-release decisions that hinge on a prediction of what a defendant would do if released. The concreteness of the prediction task combined with the volume of data available makes this a promising machine-learning application.
Yet comparing the algorithm to judges proves complicated. First, the available data are generated by prior judge decisions. We only observe crime outcomes for released defendants, not for those judges detained. This makes it hard to evaluate counterfactual decision rules based on algorithmic predictions. Second, judges may have a broader set of preferences than the variable the algorithm predicts; for instance, judges may care specifically about violent crimes or about racial inequities. We deal with these problems using different econometric strategies, such as quasi-random assignment of cases to judges. Even accounting for these concerns, our results suggest potentially large welfare gains: one policy simulation shows crime reductions up to 24.7% with no change in jailing rates, or jailing rate reductions up to 41.9% with no increase in crime rates. Moreover, all categories of crime, including violent crimes, show reductions; these gains can be achieved while simultaneously reducing racial disparities. These results suggest that while machine learning can be valuable, realizing this value requires integrating these tools into an economic framework: being clear about the link between predictions and decisions; specifying the scope of payoff functions; and constructing unbiased decision counterfactuals.

Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J., & Mullainathan, S.; Quarterly Journal of Economics, 2018

Using Text Analysis to Target Government Inspections: Evidence from Restaurant Hygiene Inspections and Online Reviews

Restaurant hygiene inspections are often cited as a success story of public disclosure. Hygiene grades influence customer decisions and serve as an accountability system for restaurants. However, cities (which are responsible for inspections) have limited resources to dispatch inspectors, which in turn limits the number of inspections that can be performed. We argue that NLP can be used to improve the effectiveness of inspections by allowing cities to target restaurants that are most likely to have a hygiene violation. In this work, we report the first empirical study demonstrating the utility of review analysis for predicting restaurant inspection results.

Kang, J. S., Kuznetsova, P., Choi, Y., & Luca, M.; Technical Report, 2013

Here is a related paper on the same topic, suggesting ways for governments to obtain the required expertise: Crowdsourcing City Government: Using Tournaments to Improve Inspection Accuracy

Further reading, two papers with an excellent overview of the topic:

Machine Learning: An Applied Econometric Approach
Prediction Policy Problems

The Economist on the same topic: Of prediction and policy, The Economist, 2016
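To make the selected labels problem from the introduction concrete, here is a toy simulation in Python. Everything in it is synthetic and illustrative: because we simulate the data, we know the true violation labels for all restaurants and can compare the naive evaluation on the inspected subset with performance on the whole population. In the field, only the first number exists.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000

# Synthetic restaurants: two risk signals and a true violation label
# (normally unobserved unless the restaurant is inspected).
x1 = rng.normal(size=n)                      # e.g., past complaint rate
x2 = rng.normal(size=n)                      # e.g., review-based signal
p_violation = 1 / (1 + np.exp(-(x1 + x2)))   # true violation probability
y = rng.binomial(1, p_violation)

# Old inspection policy: only high-x1 restaurants are ever inspected,
# so labels exist only for this selected subset.
inspected = x1 > 0.5
X = np.column_stack([x1, x2])

model = LogisticRegression().fit(X[inspected], y[inspected])

acc_selected = model.score(X[inspected], y[inspected])  # all the data allow
acc_population = model.score(X, y)                      # needs ground truth
print(f"accuracy on inspected subset: {acc_selected:.3f}")
print(f"accuracy on full population:  {acc_population:.3f}")
```

The two numbers generally differ because the inspected restaurants are not representative of all restaurants; without experiments (e.g., occasionally inspecting at random), the population number is simply unknowable.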
Are you looking for real-world data science problems to sharpen your skills? In this post, we introduce you to four platforms hosting data science competitions. Data science competitions can be a great way to gain practical experience with real-world data, and to boost your motivation through the competitive environment they provide. Check them out, competitions are a lot of fun!

Kaggle

Kaggle is the best-known platform for data science competitions. Data scientists and statisticians compete to create the best models for describing and predicting the data sets uploaded by companies or NGOs. From predicting house prices in the US to the demographics of mobile phone users in China or the properties of soil in Africa, Kaggle offers many interesting challenges to solve real-world problems. Check out their No Free Hunch blog featuring the winners of each competition. The platform was recently acquired by Alphabet, Google's parent company, and also offers a wide range of datasets to train your algorithms and other useful resources to improve your data science skill set (a minimal sketch of the typical Kaggle workflow follows at the end of this post).

DrivenData

Similar to other platforms, the dataset is available online and participants submit their best predictive models. The great thing about DrivenData competitions is that the competition questions and datasets are related to the work of non-profits, which can be especially interesting to those who want to contribute to a good cause. The data problems are no less diverse, ranging from predicting dengue fever cases to estimating the penguin population in the Antarctic and forecasting energy consumption levels. For some challenges, the best model wins a prize; for others, you get the glory and the knowledge that you applied your skill set to make the world a better place. DrivenData offers great opportunities to tackle real-world problems with real-world impact.

Numerai

Numerai is a data science competition platform focusing on finance applications. What makes their competitions particularly interesting is that the participants' predictions are used in the underlying hedge fund. Data scientists entering Numerai's tournaments currently receive an encrypted data set every week. The data set is an abstract representation of stock market information that preserves its structure without revealing details. The data scientists then create machine-learning algorithms to find patterns in the data and test their models by uploading their predictions to the website. Numerai then creates a meta-model from all submissions to make its investments. The models are ranked, with the top 100 earning Numeraire coins, a cryptocurrency launched by Numerai. Numerai's mix of data science, cryptography, artificial intelligence, crowdsourcing, and cryptocurrency gives the fledgling business an exciting flair.

Tianchi

Tianchi is a data competition platform by Alibaba Cloud, the cloud computing arm of Alibaba Group, and has strong similarities with Kaggle. The platform focuses on Chinese data scientists, but most pages are also available in English. Tianchi boasts a community of over 150,000 data scientists and 3,000 institutes and business groups from over 80 countries. Besides the competitions, the platform also offers datasets and a notebook environment for running Python 3 scripts.
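As promised above, here is a minimal sketch of the typical Kaggle workflow using Kaggle's official Python package. Treat the details as assumptions on our part: it requires a Kaggle account with an API token stored under ~/.kaggle/kaggle.json, and the "titanic" competition name is just an example:

```python
# pip install kaggle
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads the API token from ~/.kaggle/kaggle.json

# Download the data for a competition (here: the classic Titanic starter).
api.competition_download_files("titanic", path="data/")

# ... train a model locally and write predictions to submission.csv ...

# Upload the predictions; the leaderboard score appears on the website.
api.competition_submit("submission.csv", "first baseline", "titanic")
```

The same download, train, and submit loop applies, with platform-specific tooling, on DrivenData, Numerai, and Tianchi as well.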