Your Job Board for Big Data

Be part of the digital revolution in Germany


DataCareer Blog

Big Data, AI and Machine Learning are the buzzwords of the day. Data nerds, but also business executives and politicians, are talking about the related opportunities and potential risks. But since when has this been the case, and how has interest developed over time? We address this question using Google Trends data.

Google searches reveal people's interests

Google search queries have become a powerful tool for capturing the shifting interests of our society. Nowadays, the verb "to google" is commonly used in many languages for finding information on a question or topic online. Google Trends offers insights into the 3.5 billion search queries the search engine processes every day. To analyze the development of interest in data science, we look at five data-related keywords (Artificial Intelligence, Big Data, Business Intelligence, Data Science, Machine Learning) and their trends since 2009. The first figure shows quarterly data for Germany over the last 8 years. Using quarters instead of months (or even weeks) reduces the noise and makes trends easier to spot. Importantly, Google Trends reports all numbers relative to the search volume of the most popular search term, where 100 reflects the highest search volume in the observed time period. The data therefore allow us to analyze trends and compare volumes across queries, but they offer no insight into absolute search traffic.

Machine Learning on the rise

The graph shows a large increase in search queries for the term “Big Data”. It took much longer for other common terms in the field of data science to take off. Interestingly, Machine Learning became the top search query in 2017, overtaking even Big Data after rapid growth over the last two years. Data Science and Artificial Intelligence seem to be on the rise but remain at a relatively low level.

How did the development in Germany compare to worldwide search queries?

The figure below shows results for the same keywords based on global search queries. Overall, the trends are very similar. Big Data took off in 2012, and Machine Learning has seen rapid growth in the past few years. An interesting difference can be observed for Artificial Intelligence: while the trend in the last two years is comparable to the German case, the global level of AI search queries was already high in 2009. Overall, the graphs confirm the general impression that data science not only was but still is on the rise. These days, Machine Learning in particular is seeing strong growth in interest, and it is even outpacing the more generic search term Big Data.

Google Trends can be easily accessed through the browser. If you want to import the data directly into R, the gtrendsR package is very helpful. For Python, PyTrends allows for automated downloading.
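To make the relative scaling and the quarterly aggregation described above more concrete, here is a minimal sketch in plain Python. It is not Google's actual pipeline and does not call Google Trends. The monthly counts are made-up illustration values, and the helper names `to_quarterly` and `rescale_to_100` are hypothetical.

```python
from collections import defaultdict


def to_quarterly(monthly):
    """Average monthly values into quarters.

    monthly maps (year, month) -> value; the result maps (year, quarter) -> mean.
    """
    buckets = defaultdict(list)
    for (year, month), value in monthly.items():
        quarter = (month - 1) // 3 + 1
        buckets[(year, quarter)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in sorted(buckets.items())}


def rescale_to_100(series_by_term):
    """Rescale all series so the highest value across all terms becomes 100."""
    peak = max(v for series in series_by_term.values() for v in series.values())
    return {
        term: {key: round(100 * v / peak) for key, v in series.items()}
        for term, series in series_by_term.items()
    }


# Made-up monthly search counts for two terms (illustration only).
monthly_counts = {
    "Big Data": {(2017, m): c for m, c in zip(range(1, 7), [400, 420, 380, 410, 430, 390])},
    "Machine Learning": {(2017, m): c for m, c in zip(range(1, 7), [300, 360, 390, 450, 500, 550])},
}

quarterly = {term: to_quarterly(series) for term, series in monthly_counts.items()}
relative = rescale_to_100(quarterly)

for term, series in relative.items():
    print(term, series)
```

After rescaling, only the ratios between values carry meaning: the quarter with the highest volume is pinned to 100 and every other value is expressed relative to it, which is why no absolute traffic can be read from such data.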
Image recognition has been a major challenge in machine learning, and working with large labelled datasets to train your algorithms can be time-consuming. One efficient approach for getting such data is to outsource the work to a large crowd of users. Google uses this approach with the game “Quick, Draw!” to create the world’s largest doodling dataset, which has recently been made publicly available. In this game, you are told what to draw in less than 20 seconds while a neural network predicts in real time what the drawing represents. Once the prediction is correct, you can't finish your drawing, which results in some drawings looking rather weird (e.g., animals missing limbs).

The dataset consists of 50 million drawings spanning 345 categories, including animals, vehicles, instruments and other items. By contrast, the MNIST dataset, also known as the “Hello World” of machine learning, includes no more than 70,000 handwritten digits. Compared with digits, the variability within each category of the “Quick, Draw!” data is much greater, as there are many more ways to draw a cat than to write the number 8, say.

A Convolutional Neural Network in Keras Performs Best

With the project summarized below, we aimed to compare different machine learning algorithms from scikit-learn and Keras tasked with classifying drawings made in “Quick, Draw!”. Here we present the results without providing any code, but you can find our Python code on GitHub. We used a dataset that had already been preprocessed to a uniform 28x28 pixel image size. Other datasets are available too, including the original resolution (which varies depending on the device used), timestamps collected during the drawing process, and the country where the drawing was made.

We first tested different binary classification algorithms to distinguish cats from sheep, a task that shouldn’t be too difficult. Figure 1 gives you an idea of what these drawings look like. While over 120,000 images are available per category, we only used up to 7,500 per category, as training gets very time-consuming with an increasing number of samples.

Figure 1: Samples of the drawings made by users of the game “Quick, Draw!”

We tested the Random Forest, K-Nearest Neighbors (KNN) and Multi-Layer Perceptron (MLP) classifiers in scikit-learn as well as a Convolutional Neural Network (CNN) in Keras. The performance of these algorithms on a test set is shown in Figure 2. As one might expect, the CNN had the highest accuracy by far (up to 96%), but it also required the longest computing time. The KNN classifier came in second, followed by the MLP and Random Forest.

Figure 2: Accuracy scores for RF, KNN, MLP and CNN classifiers

Optimizing the Parameters

The most important parameter for the Random Forest classifier is the number of trees, which is set to 10 by default in scikit-learn (the default at the time of writing; it has been 100 since scikit-learn 0.22). Increasing this number usually achieves a higher accuracy, but computing time also increases significantly. For this task, we found 100 trees to be sufficient, as there is hardly any increase in accuracy beyond that.

In the K-Nearest Neighbors classifier, the central parameter is K, the number of neighbors. The default option is 5, which we found to be optimal in this case.

For the Multi-Layer Perceptron (MLP), the structure of the hidden layer(s) is a major point to consider. The default option is one layer of 100 nodes. We tried one or two layers of 784 nodes, which slightly increased accuracy at the cost of a much higher computing time. We chose two layers of 100 nodes as a compromise between accuracy and time use (adding a second layer did not change the accuracy, but it reduced fitting time by half). Differences between learning rates were very small, with 0.001 giving the best results.

For the Convolutional Neural Network in Keras (using the TensorFlow backend), we adapted the architecture from a tutorial by Jason Brownlee. In short, this CNN is composed of the following 9 layers:

1) Convolutional layer with 30 feature maps of size 5×5
2) Pooling layer taking the max over 2×2 patches
3) Convolutional layer with 15 feature maps of size 3×3
4) Pooling layer taking the max over 2×2 patches
5) Dropout layer with a probability of 20%
6) Flatten layer
7) Fully connected layer with 128 neurons and rectifier activation
8) Fully connected layer with 50 neurons and rectifier activation
9) Output layer

Evaluation of the CNN Classifier

Besides the accuracy score, there are other useful ways to evaluate a classifier. We can get valuable insights by looking at the images that were classified incorrectly. Figure 3 shows some examples for the top-performing CNN. Many of these are hard or impossible even for competent humans to classify, but they also include some that seem doable.

Figure 3: Examples of images misclassified by the CNN (trained on 15,000 samples)

To obtain the actual number of images classified incorrectly, we can calculate the confusion matrix (Figure 4). In this example, the decision threshold was set to 0.5, implying that the label with the higher probability was predicted. For other applications, e.g. testing for a disease, a different threshold may be better suited. One metric that considers all possible thresholds is the Area Under the Curve (AUC) score, which is calculated from the Receiver Operating Characteristic (ROC) curve. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings (but it doesn’t show which threshold was used for a specific point of the curve). As the name says, the AUC is simply the area under the ROC curve, which would be 1 for a perfect classifier and 0.5 for random guessing. The ROC curve for the CNN is shown in Figure 4, and the associated AUC score is a very respectable 0.994.
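As a rough illustration of the confusion matrix, ROC curve and AUC described above, the following plain-Python sketch computes all three for a tiny made-up set of labels and predicted probabilities. It is not the code used for the CNN in this post. The function names are hypothetical, and the label convention (1 = sheep, 0 = cat) is an assumption made for the example.

```python
def confusion_matrix(labels, scores, threshold=0.5):
    """Return (tn, fp, fn, tp) for binary labels (0/1) and predicted probabilities."""
    tn = fp = fn = tp = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by sweeping the threshold over every distinct score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tn, fp, fn, tp = confusion_matrix(labels, scores, threshold)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up example: 1 = sheep, 0 = cat (assumed convention), with predicted P(sheep).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(confusion_matrix(labels, scores))  # counts at the default threshold of 0.5
print(auc(roc_points(labels, scores)))   # threshold-independent summary
```

The confusion matrix depends on the chosen threshold, whereas the AUC summarizes the classifier's ranking quality over all thresholds at once.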
Figure 4: Confusion matrix and ROC curve of the CNN

Multi-Class Classification

To make things more challenging, we also tested the algorithms on five different classes (dog, octopus, bee, hedgehog, giraffe), using 2,500 images of each class for training. As expected, we got a similar ranking as before, but the accuracies were lower: 79% for Random Forest, 81% for MLP, 82% for KNN, and 90% for CNN.

Looking at the actual probabilities placed on each class gives a better idea of what the classifier is predicting. Figure 5 shows some examples for the CNN. Most predictions come with a high degree of certainty, which is not surprising given an accuracy of 90%. If the predictions are well calibrated (which they are in this case), the average certainty of the predicted class should be close to 90% as well.

Figure 5: Some CNN prediction probabilities.

Figure 6 shows a few predictions that were incorrect. While these are rather unconventional drawings, competent humans might well recognize them all, or at least give more accurate probability distributions. There certainly is room for improvement, the easiest way being to simply increase the number of training examples (we only used about 2% of the images).

Figure 6: Some drawings that were misclassified by the CNN. The true label is shown below the image.

In conclusion, what have we learned from the above analysis? If you take a look at the code, you will see that implementing a CNN in Python takes more effort than the regular scikit-learn classifiers, which comprise just a few lines. The superior accuracy of the CNN makes this investment worthwhile, though. Due to its huge size, the “Quick, Draw!” dataset is very valuable if you’re interested in image recognition and deep learning. We have barely scratched the surface of the dataset here, and there remains huge potential for further analysis. To check out the code and some additional analyses, visit our GitHub page. We’re very happy to receive any feedback and suggestions for improvement by email.

A blog post by David Kradolfer
Much has been written on the most popular software and programming languages for Data Science (recall, for instance, the infamous “Python vs R battle”). We approached this question by scraping job ads from Indeed and counting how frequently each software is mentioned, as a measure of current employer demand. In a recent blog post, we analyzed the Data Science software German employers want job applicants to know (showing that Python has a slight edge in popularity over R). In this post, we look at worldwide job ads and analyze them separately for different job titles. We included 6,000 positions for Data Scientists, 7,500 for Data Analysts, 2,200 for Data Engineers and 1,700 for Machine Learning specialists, as well as a population of tens of thousands of data professionals on LinkedIn.

Our guiding questions: Which programming languages are most popular in general, as database technologies, as machine learning libraries, and for business analysis purposes? How well does the employers’ demand for software skills match the supply, i.e. the skills data professionals currently possess? And are there software skills in high demand that are rare among data professionals? (Yes!) Check out the results below. A detailed description of our methods can be found in this post.

The programming language most in demand: Python

Mentioned in over 60% of current job ads, Python is the top programming language for Data Scientists, Machine Learning specialists and Data Engineers. R is highly popular in Data Science job openings (62%), but less so in the areas of Machine Learning and Data Engineering. Java and Scala are important in Data Engineering, where they rank second and third with frequencies of 55% and 35%. In Machine Learning, Java and C/C++/C# are the second most popular languages, at about 40% each. Fewer programming skills are required of Data Analysts, with Python and R reaching 15%.
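The percentages above are shares of job ads that mention a given keyword at least once. As a minimal sketch of that counting step (not the scraper or the method used for this post, which is described in the separate methods post), the following plain-Python example computes such shares for a handful of invented ad texts. The patterns, the ad texts and the function name `mention_shares` are all hypothetical.

```python
import re

# Hypothetical keyword patterns. Word boundaries keep "Java" from matching
# "JavaScript"; the single letter "R" needs stricter context to avoid "R&D".
PATTERNS = {
    "Python": re.compile(r"\bPython\b", re.IGNORECASE),
    "R": re.compile(r"(?<![\w&])R(?![\w&])"),
    "Java": re.compile(r"\bJava\b", re.IGNORECASE),
    "Scala": re.compile(r"\bScala\b", re.IGNORECASE),
    "SQL": re.compile(r"\bSQL\b", re.IGNORECASE),
}


def mention_shares(ads):
    """Share of ads (in percent) that mention each keyword at least once."""
    counts = {name: 0 for name in PATTERNS}
    for text in ads:
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                counts[name] += 1
    return {name: 100 * n / len(ads) for name, n in counts.items()}


# Invented ad texts, for illustration only.
ads = [
    "Data Scientist: strong Python and SQL skills, R is a plus.",
    "Data Engineer with Java, Scala and SQL experience.",
    "Frontend developer: JavaScript, R&D mindset.",
    "Machine Learning specialist: Python, Java.",
]

shares = mention_shares(ads)
for name, share in shares.items():
    print(f"{name}: {share:.0f}%")
```

Short or ambiguous names such as "R" or "C" are the hard part of this kind of counting; any real analysis would need to check its patterns against a sample of ads for false positives and misses.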
Figure 1: Most popular programming languages for data-related job descriptions

The leading database technology: SQL

SQL is the most in-demand database skill for Data Scientists and Analysts, mentioned in 50% of current job ads. For Data Engineering, SQL, Hadoop and Apache Spark are about equally popular, at roughly 60% each. The same is true for Machine Learning, but at a much lower level of about 20%.

Figure 2: Most popular database technologies

Most popular machine learning library: TensorFlow

For Machine Learning jobs, TensorFlow leads the list with 20%, followed by Caffe, scikit-learn and Theano at about 9%. In Data Science job ads, scikit-learn is at 7% and slightly more frequent than TensorFlow.

Figure 3: Most popular Machine Learning libraries

Business-related software in high demand

Other frequent requirements are commercial software tools for statistics, business intelligence, visualization and enterprise resource planning (ERP). In this space, SAS and Tableau are the top mentions, followed by SPSS.

Figure 4: Software for statistics, business intelligence and enterprise resource planning

Scala programmers are a rare breed among Data Scientists

How well do these employer requirements match the available skills of Data Scientists? To obtain a rough estimate of the number of professionals knowledgeable in top-ranked programming languages, we searched LinkedIn for the different languages and filtered the results for people with the job title of Data Scientist. Figure 5 shows this Data Scientist population in relation to the number of Data Science job ads demanding the skill. By a large margin, Python and R turn out to be the most widespread programming skills in the Data Scientist population. Interestingly, MATLAB, C/C++/C# and Scala are roughly equally frequent in job ads, but the number of professionals possessing these skills varies greatly. Scala is the rarest skill by far, while MATLAB expertise is fairly common, being five times as frequent in the Data Scientist population as Scala. Also, knowledge of the Apache ecosystem (Hadoop, Spark and Hive) is still relatively rare among Data Scientists.

Figure 5: Software skills – current supply (proxy: LinkedIn) vs. demand (proxy: Indeed)

In conclusion, which programming languages should you focus on? Python and R remain the top choices, and mastering at least one of them is a must. If you decide to learn or refine additional languages, pick Scala, which many Data Scientists expect to be the language of the future (Jean-François Puget analyzed the impressive rise of Scala in search trends on Indeed). A further language holding significant potential is Julia. Julia is mentioned in only 2% of current Data Science job ads on Indeed, but its trendline is highly promising.

Performed a fascinating analysis you’d like to publish and share? Found a cool dataset that should be featured on our page? Contribute to our blog!

Post by David Kradolfer

Latest job opportunities

DXC Technology Böblingen, Germany
Full time
Category: Services

DXC Technology is the world’s leading independent, end-to-end IT services company. We guide clients on their digital transformation journeys, multiply their capabilities, and help them harness the power of innovation to thrive on change. CSC and HPE ES have together brought innovation to our clients for more than 60 years. Together, we serve nearly 6,000 private and public sector enterprises across 70 countries. Our clients benefit from our technology independence, global talent, expertise and extensive partner network. We are uniquely positioned to lead digital transformations, creating greater value for our people, clients and partners. DXC leads clients through accelerating change, helping them harness the power of technology to deliver new outcomes for their business.

Learning does not only happen through training. Relationships are among the most powerful ways for people to learn and grow, and this is part of our DXC culture. In addition to working alongside talented colleagues, you will have many opportunities to learn through coaching and stretch assignments. You’ll be guided by feedback and support to accelerate your learning and maximize your knowledge. We also have a “reverse mentoring” program which allows us to share our knowledge and strengths across our multi-generation workforce.

DXC Analytics helps clients gain rapid insights and accelerate their digital transformation journeys through our portfolio of analytics services and robust partner ecosystem. We help customers become effective IT-enabled businesses by engaging the right people, processes and technologies, and we provide strategy, transformation, implementation and management of custom, industry, packaged, and enterprise applications. Acting to Advise, Transform and Manage, we offer services embracing all of the leading Enterprise Applications solutions and assisting clients where they most need help. We offer services embracing IT and applications transformation, modernization, development, testing and management.

The Data Scientist needs to have three primary skills: proficiency in data science, business consulting capability, and team leadership.

From a data science perspective, the successful candidate will apply their knowledge of advanced analytics, data mining and statistical techniques to design and develop enterprise analytic solutions which are focused on solving specific challenges experienced by our customers.

From a business consulting perspective, the successful candidate will possess the ability to discuss business challenges with our automotive industry customers and suggest potential solutions based on analytics. They will also be able to comprehend the impact of their solution on the business processes of the customer at a detailed level.

From a leadership perspective, the successful candidate will have the ability to coordinate a team of more junior or best shore data scientists on a project to ensure that the most effective working structure is obtained.

Main responsibilities:

Has the ability to execute all of the steps of a data science project following a methodology such as CRISP-DM.
Translates data mining results into clear business-focused deliverables for decision makers.
Works with application developers to deploy data mining models into operational systems.
Integrates advanced analytics into end-to-end business intelligence solutions and operational business processes.
Designs and/or contributes to the development of solutions which can be repeatedly deployed to multiple customers.
Contributes to bid and proposal documents based on analytics expertise.
Provides consulting advice to customer senior leadership and contributes to setting the strategic direction for customers around analytics.
Proactively encourages re-use within practice or profession.
Is recognized by peers inside and outside DXC as an expert in analytics.

5+ years of professional experience in the area of analytics or in related areas supporting professional development in analytics.
3 years of experience in analytic application/solution development to operational readiness and repeatable deployment.
3 years of experience in the financial services industry sector with strong business focus.
Must excel at connecting business requirements to data mining objectives and to measurable business benefit.
Must be able to develop analytical cases in their industry.
3 years' use of R, Python, SAS Enterprise Miner or SPSS Clementine (or similar) for data mining and statistical analysis.
Expert use of data mining methodologies such as classical regression, logistic regression, CHAID, CART, neural nets, association rules, sequence analysis, cluster analysis, and text mining.
Some experience with Hadoop, Map/Reduce, Hive, Spark, Mahout or other parts of the broader Hadoop framework.
Ability to manage multiple projects efficiently and to meet deadlines.
Preferably presales experience and willingness to spend >25% on presales activities.
Able to communicate with external senior management confidently and demonstrate professionalism at all times.
Ability to adapt a consulting style appropriate to the situation and to identify up-sell opportunities.
Able to demonstrate a broad understanding of market dynamics in the financial services industry.
Ability to present within own area of expertise as part of a customer sales presentation, putting forward domain-specific information within the context of a DXC sales campaign.
Excellent communication and presentation skills; fluent in English and German.

What you can expect:

The team consists of ~15 Analytics solution principals from expert to master level across North and Central Europe with a great "can do" spirit and a broad spectrum of experience as well as social competencies. We work for different clients in different industries and take accountability for what we do, each single day!

What do we offer?

Extensive social benefits, flexible working hours, a competitive salary and shared values make DXC Technology one of the world's most attractive employers. At DXC our goal is to provide equal opportunities, work-life balance, and constantly evolving career opportunities. Part-time work or job-sharing is also applicable to this position.

EEO Tagline: DXC Technology is EEO F/M/Protected Veteran/Individual with Disabilities
