DataCareer Blog

Image recognition has been a major challenge in machine learning, and working with large labelled datasets to train your algorithms can be time-consuming. One efficient way to obtain such data is to outsource the work to a large crowd of users. Google uses this approach with the game “Quick, Draw!” to create the world’s largest doodling dataset, which has recently been made publicly available. In this game, you are told what to draw in less than 20 seconds while a neural network predicts in real time what the drawing represents. Once the prediction is correct, the round ends and you can’t finish your drawing, which is why some drawings look rather weird (e.g., animals missing limbs). The dataset consists of 50 million drawings spanning 345 categories, including animals, vehicles, instruments and other items. By contrast, the MNIST dataset – also known as the “Hello World” of machine learning – includes no more than 70,000 handwritten digits. Compared with digits, the variability within each category of the “Quick, Draw!” data is much greater, as there are many more ways to draw a cat than to write the number 8, say.

A Convolutional Neural Network in Keras Performs Best

With the project summarized below, we aimed to compare different machine learning algorithms from scikit-learn and Keras tasked with classifying drawings made in “Quick, Draw!”. Here we will present the results without providing any code, but you can find our Python code on Github. We used a dataset that had already been preprocessed to a uniform 28x28 pixel image size. Other datasets are available too, including the original resolution (which varies depending on the device used), timestamps collected during the drawing process, and the country where the drawing was made.

We first tested different binary classification algorithms to distinguish cats from sheep, a task that shouldn’t be too difficult. Figure 1 gives you an idea of what these drawings look like. While there are over 120,000 images per category available, we only used up to 7,500 per category, as training gets very time-consuming with an increasing number of samples.

Figure 1: Samples of the drawings made by users of the game “Quick, Draw!”

We tested the Random Forest, K-Nearest Neighbors (KNN) and Multi-Layer Perceptron (MLP) classifiers in scikit-learn as well as a Convolutional Neural Network (CNN) in Keras. The performance of these algorithms on a test set is shown in Figure 2. As one might expect, the CNN had the highest accuracy by far (up to 96%), but it also required the longest computing time. The KNN classifier came in second, followed by the MLP and Random Forest.

Figure 2: Accuracy scores for RF, KNN, MLP and CNN classifiers

Optimizing the Parameters

The most important parameter for the Random Forest classifier is the number of trees, which is set to 10 by default in scikit-learn. Increasing this number usually improves accuracy, but computing time also increases significantly. For this task, we found 100 trees to be sufficient, as there is hardly any gain in accuracy beyond that. In the K-Nearest Neighbors classifier, the central parameter is K, the number of neighbors. The default option is 5, which we found to be optimal in this case. For the Multi-Layer Perceptron (MLP), the structure of the hidden layer(s) is a major point to consider. The default option is one layer of 100 nodes. We tried one or two layers of 784 nodes, which slightly increased accuracy at the cost of a much higher computing time. We chose two layers of 100 nodes as a compromise between accuracy and time use (adding a second layer did not change the accuracy, but it cut the fitting time in half). The difference between learning rates was very small, with the best results at 0.001.
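To make the setup above concrete, here is a minimal sketch of the scikit-learn part with the parameters discussed in this section. It assumes the preprocessed 28x28 numpy bitmap files (here called cat.npy and sheep.npy) have been downloaded from the public dataset; the file names and sample sizes are placeholders, not necessarily the exact values used in our repository.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Load the preprocessed 28x28 bitmaps (each drawing is a flattened vector of 784 pixels)
# and build a balanced two-class sample of 7,500 drawings per category.
cats = np.load('cat.npy')[:7500]
sheep = np.load('sheep.npy')[:7500]
X = np.concatenate([cats, sheep]).astype('float32') / 255.0   # scale pixel values to [0, 1]
y = np.array([0] * len(cats) + [1] * len(sheep))              # 0 = cat, 1 = sheep

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

classifiers = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'MLP': MLPClassifier(hidden_layer_sizes=(100, 100), learning_rate_init=0.001, random_state=0),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))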
For the Convolutional Neural Network in Keras (using the TensorFlow backend), we adapted the architecture from a tutorial by Jason Brownlee. In short, this CNN is composed of the following 9 layers:

1) Convolutional layer with 30 feature maps of size 5×5
2) Pooling layer taking the max over 2×2 patches
3) Convolutional layer with 15 feature maps of size 3×3
4) Pooling layer taking the max over 2×2 patches
5) Dropout layer with a probability of 20%
6) Flatten layer
7) Fully connected layer with 128 neurons and rectifier activation
8) Fully connected layer with 50 neurons and rectifier activation
9) Output layer

Evaluation of the CNN Classifier

Besides the accuracy score, there are other useful ways to evaluate a classifier. We can gain valuable insights by looking at the images that were classified incorrectly. Figure 3 shows some examples for the top-performing CNN. Many of these are hard or impossible even for competent humans to classify, but they also include some that seem doable.

Figure 3: Examples of images misclassified by the CNN (trained on 15,000 samples)

To obtain the actual number of images classified incorrectly, we can calculate the confusion matrix (Figure 4). In this example, the decision threshold was set to 0.5, meaning that the label with the higher probability was predicted. For other applications, e.g. testing for a disease, a different threshold may be better suited. One metric that considers all possible thresholds is the Area Under the Curve (AUC) score, which is calculated from the Receiver Operating Characteristic (ROC) curve. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings (but it doesn’t show which threshold was used for a specific point of the curve). As the name says, the AUC is simply the area under the ROC curve, which would be 1 for a perfect classifier and 0.5 for random guessing. The ROC curve for the CNN is shown in Figure 4, and the associated AUC score is a very respectable 0.994.

Figure 4: Confusion matrix and ROC curve of the CNN
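All of these metrics are available directly in scikit-learn. A minimal sketch of the evaluation step follows; the function and variable names are illustrative rather than taken from our repository.

import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve, auc

def evaluate_binary(y_true, proba, threshold=0.5):
    """Confusion matrix at a fixed threshold, plus the threshold-free ROC curve and AUC."""
    pred = (np.asarray(proba) > threshold).astype(int)   # apply the decision threshold
    cm = confusion_matrix(y_true, pred)                  # rows: true classes, columns: predictions
    fpr, tpr, _ = roc_curve(y_true, proba)               # TPR vs. FPR over all possible thresholds
    return cm, fpr, tpr, auc(fpr, tpr)

# e.g. cm, fpr, tpr, auc_score = evaluate_binary(y_test, predicted_probabilities)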
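For reference, the nine-layer architecture listed above translates into Keras roughly as follows. This is a sketch written from the layer list, not necessarily the exact code in our repository; the 28x28x1 channels-last input shape and the softmax output activation are assumptions.

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

def build_cnn(num_classes=2):
    model = Sequential()
    model.add(Conv2D(30, (5, 5), activation='relu', input_shape=(28, 28, 1)))  # 1) 30 feature maps of size 5x5
    model.add(MaxPooling2D(pool_size=(2, 2)))                                  # 2) max over 2x2 patches
    model.add(Conv2D(15, (3, 3), activation='relu'))                           # 3) 15 feature maps of size 3x3
    model.add(MaxPooling2D(pool_size=(2, 2)))                                  # 4) max over 2x2 patches
    model.add(Dropout(0.2))                                                    # 5) dropout with probability 20%
    model.add(Flatten())                                                       # 6) flatten to a vector
    model.add(Dense(128, activation='relu'))                                   # 7) fully connected, 128 neurons
    model.add(Dense(50, activation='relu'))                                    # 8) fully connected, 50 neurons
    model.add(Dense(num_classes, activation='softmax'))                        # 9) output layer (softmax assumed)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model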
Multi-Class Classification

To make things more challenging, we also tested the algorithms on five different classes (dog, octopus, bee, hedgehog, giraffe), using 2,500 images of each class for training. As expected, we got a similar ranking as before, but the accuracies were lower: 79% for the Random Forest, 81% for the MLP, 82% for KNN, and 90% for the CNN. Looking at the actual probabilities placed on each class gives a better idea of what the classifier is predicting. Figure 5 shows some examples for the CNN. Most predictions come with a high degree of certainty, which is not surprising given an accuracy of 90%. If the predictions are well calibrated – which they are in this case – the average certainty of the predicted class must be 90% as well.

Figure 5: Some CNN prediction probabilities

Figure 6 shows a few predictions that were incorrect. While these are rather unconventional drawings, competent humans might well recognize them all, or at least give more accurate probability distributions. There certainly is room for improvement, the easiest route being to simply increase the number of training examples (we only used about 2% of the images).

Figure 6: Some drawings that were misclassified by the CNN. The true label is shown below the image.

In conclusion, what have we learned from the above analysis? If you take a look at the code, you will see that implementing a CNN in Python takes more effort than the regular scikit-learn classifiers, which require just a few lines. The superior accuracy of the CNN makes this investment worthwhile, though. Due to its huge size, the “Quick, Draw!” dataset is very valuable if you’re interested in image recognition and deep learning. We have barely scratched the surface here, and there remains huge potential for further analysis. To check out the code and some additional analyses, visit our Github page. We’re very happy to receive any feedback and suggestions for improvement by email.

A blog post by David Kradolfer
Much has been written on the most popular software and programming languages for Data Science (recall, for instance, the infamous “Python vs R” battle). We approached this question by scraping job ads from Indeed and counting the frequency at which each software is mentioned, as a measure of current employer demand. In a recent blog post, we analyzed the Data Science software German employers want job applicants to know (showing that Python has a slight edge in popularity over R). In this post, we look at worldwide job ads and analyze them separately for different job titles. We included 6,000 positions for Data Scientists, 7,500 for Data Analysts, 2,200 for Data Engineers and 1,700 for Machine Learning specialists, as well as a population of tens of thousands of data professionals on LinkedIn.

Our leading questions: Which programming languages are most popular in general, as database technologies, as machine learning libraries, and for business analysis purposes? How well does the employers’ demand for software skills match the supply, i.e. the skills data professionals currently possess? And are there software skills in high demand that are rare among data professionals? (Yes!) Check out the results below. A detailed description of our methods can be found in this post.

The programming language most in demand: Python

Being mentioned in over 60% of current job ads, Python is the top programming language for Data Scientists, Machine Learning specialists and Data Engineers. R is highly popular in Data Science job openings (62%), but less so in the areas of Machine Learning and Data Engineering. Java and Scala are important in Data Engineering, where they rank second and third with 55% and 35% frequencies. In Machine Learning, Java and C/C++/C# are the second most popular languages, at about 40% each. Fewer programming skills are required of Data Analysts, with Python and R reaching 15%.

Figure 1: Most popular programming languages for data-related job descriptions

The leading database technology: SQL

SQL is the most in-demand database skill for Data Scientists and Analysts, mentioned in 50% of current job ads. For Data Engineering, SQL, Hadoop and Apache Spark are about equally popular, with roughly 60% each. The same is true for Machine Learning, but at a much lower level of about 20%.

Figure 2: Most popular database technologies

Most popular machine learning library: Tensorflow

For Machine Learning jobs, Tensorflow leads the list with 20%, followed by Caffe, Scikit-learn and Theano at about 9%. In Data Science job ads, Scikit-learn is at 7% and slightly more frequent than Tensorflow.

Figure 3: Most popular Machine Learning libraries

Business-related software in high demand

Other frequent requirements are commercial software tools for statistics, business intelligence, visualization and enterprise resource planning (ERP). In this space, SAS and Tableau are the top mentions, followed by SPSS.

Figure 4: Software for statistics, business intelligence and enterprise resource planning
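The counting behind the frequencies in Figures 1 to 4 boils down to a simple step: for each skill, compute the share of job ads that mention it at least once. The following is a simplified illustration, not our actual scraping pipeline; it assumes the ad texts have already been collected into a list of strings.

import re

def mention_share(ads, keywords):
    """Share of job ads that mention each keyword at least once (case-insensitive, whole-word match)."""
    shares = {}
    for kw in keywords:
        pattern = re.compile(r'\b' + re.escape(kw) + r'\b', re.IGNORECASE)
        hits = sum(1 for ad in ads if pattern.search(ad))
        shares[kw] = hits / float(len(ads))
    return shares

# e.g. mention_share(data_scientist_ads, ['Python', 'R', 'SQL', 'Scala', 'Java'])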
Scala programmers are a rare breed among Data Scientists

How well do these employer requirements match the available skills of Data Scientists? To obtain a rough estimate of the number of professionals knowledgeable in the top-ranked programming languages, we searched LinkedIn for the different languages and filtered the results for people with the job title of Data Scientist. Figure 5 shows this Data Scientist population in relation to the number of Data Science job ads demanding the skill. By a large margin, Python and R turn out to be the most widespread programming skills in the Data Scientist population. Interestingly, MATLAB, C/C#/C++ and Scala are roughly equally frequent in job ads, but the number of professionals possessing these skills varies greatly. Scala is the rarest skill by far, while MATLAB expertise is fairly common, being five times as frequent in the Data Scientist population as Scala. Also, knowledge of the Apache ecosystem (Hadoop, Spark and Hive) is still relatively rare among Data Scientists.

Figure 5: Software skills – current supply (proxy: LinkedIn) vs. demand (proxy: Indeed)

In conclusion, which programming languages should you focus on? Python and R remain the top choices, and mastering at least one of them is a must. If you decide to learn or refine additional languages, pick Scala, which many Data Scientists expect to be the language of the future (Jean-François Puget analyzed the impressive rise of Scala in search trends on Indeed). A further language holding significant potential is Julia. Julia is mentioned in 2% of current Data Science job ads on Indeed, but its trendline is highly promising.

Performed a fascinating analysis you’d like to publish and share? Found a cool dataset that should be featured on our page? Contribute to our blog!

Post by David Kradolfer
Social science researchers collect much of their data through online surveys. In many cases, they offer incentives to the participants. These incentives can take the form of lotteries for more valuable prizes or of individual gift card codes. We are doing the latter in our studies here at CEPA Labs at Stanford. Specifically, our survey participants receive a gift card code from Amazon.

However, sending these gift card codes to respondents is challenging. In Qualtrics, our online survey platform, we can embed a code for each potential respondent and then trigger an email with the code attached after their survey is completed. While this is a very convenient feature, it has one substantial drawback: we need to purchase all codes up front, yet many participants may never answer. There is the option of waiting until the survey is closed and then purchasing the codes for all respondents at once. However, respondents tend to become impatient if they do not receive their code in a timely manner and start reaching out to you and possibly to the IRB office. This creates administrative work and might reduce response rates if potential respondents can talk to each other about their experience. Furthermore, we reach most of our participants with text messages and have found from experience that emails often go unnoticed or end up in the spam folder.

Given these problems, we decided to send codes using Python and the Qualtrics API. This way, we can send codes immediately and do not need to purchase all codes upfront. We used the Amazon Incentives API, which allows its users to request gift card codes on demand. Codes are generated on the fly, and the amount is charged to our account. An alternative would be Giftbit, which even allows its users to send gift links with a choice of different gift card providers. I believe this approach would be useful to many social science researchers.

After I had developed the program, we immediately thought of a second project where we already had enough codes for the anticipated 70% response rate. We stored those codes in a .csv file. In this post, for simplicity, I will describe a Python program that gets the codes from a .csv file and sends them out by email. The other (original) program fetched the codes from the Amazon Incentives API and sent them over the EZtexting.com API. Both versions can be found on GitHub. It is also possible to send text messages per email.

The program checks Qualtrics continuously for new responses and then sends each new respondent a code. In a loop, it downloads (new) responses, writes their contact information to an SQL database, assigns a code and adds it to the database, and then sends the codes by email. The program is written in a way that guarantees that, if it gets interrupted, it can simply be executed again without any problem.

Before I go into detail, here is a quick disclaimer. I am not a professional Python developer. If you see ways to improve the program, please let me know. I am always happy to learn new things. I also will not take any responsibility for the proper functioning of the program. If you decide to use it, it is your responsibility to adapt it to your application and thoroughly test it. Furthermore, you can find more scripts for working with the Qualtrics API on GitHub. I used Python 2.7.11 (64-bit). SQLite needs to be installed. Let’s get started with importing the required packages and setting up some parameters.
The locations for the survey download, the SQLite database, the backups, and the file containing the codes need to be specified. All elements of the code that you need to change are marked as placeholders.

import requests
import zipfile
import pandas as pd
import os
import sqlite3
import datetime
import time
import shutil
import smtplib
import sys
from email.MIMEMultipart import MIMEMultipart
from email.MIMEText import MIMEText
from email.Utils import COMMASPACE, formatdate

# Set path of the main folder
path = 'D:/YOUR PROJECT FOLDER'
os.chdir(path)

# Setting path for data export and storage
SurveyExportPath = path + '/DownloadFolder'
SQLDatabasePath = path + '/DataBase/'
SQLBackupPath = path + '/DataBase/Archive/'
SQLDatabaseName = 'ResponseTracker.sqlite'

# Set path for the file that holds the gift card codes
CODEPath = path + '/---Your file with codes---.csv'

Next, I declare which columns of the Qualtrics data will be loaded into memory later. You can adjust the program, for instance, to allow for different gift card amounts, send the codes to phone numbers, check whether the respondent has answered enough questions, or include their name in the email. To do so, you will have to upload that information within the contact lists, embed it within the Qualtrics survey flow, and adjust the columns here to import the columns containing the respective information.

# Columns to include when reading in data from surveys
pdcolumns = ['ResponseID', 'ExternalDataReference', 'RecipientEmail']

Here, you need to declare a dictionary with the Qualtrics survey IDs and survey names of all surveys that you want to include. The survey IDs are the keys; you can find them in the Qualtrics account settings. Note that it is important to add ‘.csv’ to each name, because the survey data is downloaded from Qualtrics as “SURVEYNAME.csv”.

# List survey IDs and file names in a dictionary. It is possible to include multiple surveys
surveyIdsDic = {'---Your survey ID 1---': '---Your survey name 1---.csv',
                '---Your survey ID 2---': '---Your survey name 2---.csv',
                '---Your survey ID 3---': '---Your survey name 3---.csv'}
surveyIds = surveyIdsDic.keys()

The next variables hold the number of times this script checks for new answers and sends out the codes, the time it waits between iterations in seconds, and the number of iterations it waits before creating a backup of the SQL database. With these settings, the program would run for at least 695 days (not counting the time each iteration takes) and create a backup every five hours.

# Number of repetitions for the loop, time to wait between each iteration in seconds
reps = 1000000
waittime = 60
backupn = 300

Next, we need to set the parameters of the Qualtrics API call. You have to declare your API token, which you can find in the account settings of your Qualtrics account, the format of the downloaded file (.csv), and your data center. The following code declares the API URL and the information to be sent in the API call.

# Setting user parameters for the Qualtrics API
# Add your Qualtrics token and data center
apiToken = "---Your Token---"
fileFormat = "csv"
dataCenter = "---Your data center---"

# Setting static parameters for the Qualtrics API
baseUrl = "https://{0}.qualtrics.com/API/v3/responseexports/".format(dataCenter)
headers = {
    "content-type": "application/json",
    "x-api-token": apiToken,
    }

The program defines a function that creates the message to the respondent containing the gift code. In this simple version, it only takes the code as a parameter.
It could easily be extended to personalize the message with the respondent’s name. The message is in HTML and can be styled with inline CSS.

# Set email message to send the code in HTML
def genMessage(code):
    message = """
    <html>
    <head></head>
    <body>
    <strong>Thank you!</strong>
    Thank you very much for taking the time to complete our survey!
    Please accept the electronic gift certificate code below.
    <span style="font-weight: bold; font-size: 1.5em; text-align: center;">""" \
    + code + \
    """</span>
    Thank you again
    </body>
    </html>
    """
    return message

The program then defines a function that sends the message to the respondent by email. You need to specify your correspondence email address and your SMTP host. The function could in principle take a list of email addresses, but that will not be the case in this program.

# Specify your email host and email address
def sendMail(to, subject, text):
    assert type(to) == list
    fro = "---Your email address---"  # add your correspondence email address to send out the code

    msg = MIMEMultipart()
    msg['From'] = fro
    msg['To'] = COMMASPACE.join(to)
    msg['Date'] = formatdate(localtime=True)
    msg['Subject'] = subject
    msg.attach(MIMEText(text, 'html'))

    smtp = smtplib.SMTP('---Your mail host---')  # add your SMTP host
    smtp.sendmail(fro, to, msg.as_string())
    smtp.close()

The next function creates a backup of the SQL database.

def createBackup(path, database):

    # Check if the provided path is valid
    if not os.path.isdir(path):
        raise Exception("Backup directory does not exist: {}".format(path))

    # Define file name for the backup, including date and time
    backup_file = os.path.join(path, 'backup' + time.strftime("-%Y%m%d-%H%M") + '.sqlite')

    # Lock database before making a backup
    cur.execute('begin immediate')

    # Make new backup file
    shutil.copyfile(database, backup_file)
    print("\nCreating {}...".format(backup_file))

    # Unlock database
    sqlconn.rollback()

Next, the program sets up the SQL database schema. It tries to connect to the database. If the database does not exist, the program creates two tables, “Survey” and “Respondent”. “Respondent” tracks the responses and the gift codes; “Survey” tracks the surveys that have been answered. If the database already exists, nothing happens. This SQL code ensures that no respondent is paid twice and that the same gift card code is not sent to more than one respondent by accident: the UNIQUE attribute guarantees that the program only creates a new record if the response ID, the respondent ID, the email address, and the gift card code do not already exist.

# SQL schema
# database path + file name
database = SQLDatabasePath + SQLDatabaseName

# Connect to SQLite API
sqlconn = sqlite3.connect(database)
cur = sqlconn.cursor()

# Execute SQL code to create a new database with a table for respondents and one for surveys
# If the database and these tables already exist, nothing will happen
cur.executescript('''
CREATE TABLE IF NOT EXISTS Survey (
    id    INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name  TEXT UNIQUE
);

CREATE TABLE IF NOT EXISTS Respondent (
    id             INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    individual_id  INTEGER UNIQUE,
    response_id    TEXT UNIQUE,
    survey_id      INTEGER,
    email          TEXT UNIQUE,
    create_date    DATETIME DEFAULT (DATETIME(CURRENT_TIMESTAMP, 'LOCALTIME')),
    redeem_code    TEXT UNIQUE,
    sentstatus     INTEGER,
    sent_date      DATETIME
)
''')

# Commit (save) changes to database
sqlconn.commit()

After all the setup, the actual program starts here. Everything below is repeated as many times as specified above.
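Before starting the loop, the two email helpers defined above can be smoke-tested in isolation. This is just an illustration with placeholder values (note that it really sends an email through your SMTP host):

# Quick manual test of genMessage() and sendMail() with placeholder values
test_message = genMessage("EXAMPLE-CODE-1234")
sendMail(["you@example.org"], "Test: gift card delivery", test_message)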
The program prints the date, the time, and the number of the iteration to the console so that we can check whether it is running properly.

# Everything below will be repeated for the specified number of iterations
for t in xrange(reps):

    # Provide some information in the console
    print "----------"
    print "Iteration:", t, "Date:", datetime.datetime.now().strftime("%m/%d/%y %H:%M:%S")

First, previously downloaded survey data files are deleted. This step is not strictly necessary, as the next download would replace these files. However, often enough surveys are not downloaded due to API connection problems. In that case, the old, already processed files would be processed again, which would slow down the program.

    # Delete all files in extraction path, in case they don't get downloaded in this iteration
    deletelist = [os.path.join(subdir, file) for subdir, dirs, files in os.walk(SurveyExportPath) for file in files]
    for file in deletelist:
        os.remove(file)

The next bit of code downloads the responses for each survey separately. It iterates over the surveys. First, it tries to fetch the ID of the last recorded response for the respective survey from the SQL database. If someone has answered the survey before, the program passes this last response ID as a parameter to the Qualtrics API call, so that only responses submitted afterwards are downloaded. If no one has answered the survey yet, an error exception handler sets the parameters such that all responses will be downloaded. The program then initiates the export. Once the export is complete, a zip file with the data is downloaded. The program catches several API connection errors and only allows the export to take up to three minutes, to make sure it does not get hung up at this step. If an error occurs, the survey is not downloaded in this iteration, but the program will try again in the next one. Finally, the program unzips the downloaded file.

    # Iterate over survey IDs to download each one separately
    for surveyId in surveyIds:

        survey = surveyIdsDic[surveyId]  # Identify the survey

        try:
            # Fetch last response id from database, used to download only new responses
            cur.execute('''SELECT response_id FROM Respondent
                           WHERE id == (SELECT max(id) FROM Respondent
                                        WHERE survey_id == (SELECT id FROM Survey WHERE name == ?))''', (survey,))
            lastResponseId = cur.fetchone()[0]

            # Set parameters to send to Qualtrics API
            downloadRequestPayload = '{"format":"' + fileFormat + '","surveyId":"' + surveyId + '","lastResponseId":"' + lastResponseId + '"}'

        # Set exception for the case that no one has answered this survey yet
        except (TypeError, sqlconn.OperationalError) as e:
            print e
            # Set parameters without specifying last response id (all responses will be downloaded)
            downloadRequestPayload = '{"format":"' + fileFormat + '","surveyId":"' + surveyId + '"}'

        downloadRequestUrl = baseUrl

        try:
            # Connect to Qualtrics API and send download request
            downloadRequestResponse = requests.request("POST", downloadRequestUrl, data=downloadRequestPayload, headers=headers)
            progressId = downloadRequestResponse.json()["result"]["id"]

            # Checking on data export progress and waiting until export is ready
            startlooptime = time.time()  # Record time to make sure the loop doesn't run forever
            requestCheckProgress = 0
            complete = 1

            # As long as export not complete, keep checking
            while requestCheckProgress < 100:
                requestCheckResponse = requests.request("GET", baseUrl + progressId, headers=headers)
                requestCheckProgress = requestCheckResponse.json()["result"]["percentComplete"]

                if time.time() - startlooptime > 180:
                    # Abort if download takes more than 3 minutes
                    print "Download took more than three minutes. Try again next iteration."
                    complete = 0
                    break

            # If export complete, download and unzip file
            if complete == 1:
                requestDownloadUrl = baseUrl + progressId + '/file'
                requestDownload = requests.request("GET", requestDownloadUrl, headers=headers, stream=True)
                with open("RequestFile.zip", "wb+") as f:
                    for chunk in requestDownload.iter_content(chunk_size=1024):
                        f.write(chunk)

        except (KeyError, requests.ConnectionError, ValueError, IOError) as e:
            # Something went wrong with the Qualtrics API (retry next iteration)
            print "Survey not downloaded: ", e
            continue

        try:
            zipfile.ZipFile("RequestFile.zip").extractall(SurveyExportPath)
        except (zipfile.BadZipfile, IOError) as e:
            print SurveyExportPath
            print "Zipfile not extracted: ", e
            continue

Once the files have been downloaded and unzipped, the program loops over all downloaded files. It loads the data of each file into memory and, if the data contain new responses, adds them to the SQL database. The IGNORE in the SQL code makes sure that a response that has already been recorded is neither added again nor causes an error. The program saves the changes to the SQL database after each record is created. This choice reduces speed but avoids problems from possible connection errors.

    SurveyFileList = [os.path.join(subdir, file) for subdir, dirs, files in os.walk(SurveyExportPath) for file in files]
    for file in SurveyFileList:

        data = pd.read_csv(file, encoding='utf-8-sig', sep=',', usecols=pdcolumns, low_memory=False, error_bad_lines=False)
        data = data.iloc[2:, :].reset_index(drop=True)

        survey = file.split('\\')[-1]  # Identify the survey

        if len(data.index) > 0:  # Only if new responses are recorded
            for row in xrange(0, len(data.index)):

                individual_id = data.loc[row, 'ExternalDataReference']
                response_id = data.loc[row, 'ResponseID']
                email = data.loc[row, 'RecipientEmail']

                # Record the survey name
                cur.execute('''INSERT or IGNORE INTO Survey (name) VALUES (?)''', (survey,))

                # Fetch survey id to enter in response table
                cur.execute('''SELECT id FROM Survey WHERE name == ?''', (survey,))
                survey_id = cur.fetchone()[0]

                cur.execute('''INSERT or IGNORE INTO Respondent (email, individual_id, response_id, survey_id)
                               VALUES (?,?,?,?)''', (email, individual_id, response_id, survey_id))
                sqlconn.commit()

After all new responses have been added to the SQL database, the program assigns a gift card code to each new respondent. It first collects all respondents without a gift card code from the SQL database. It then loads the file with the gift card codes into memory and looks up the last gift card code that has already been assigned to a respondent. It selects the codes that come after this last code and assigns them to the new respondents. If no codes have been assigned before, an error is raised and the exception handler makes sure that the first code is used. The newly assigned codes are added to the SQL database. The program counts the number of codes assigned and logs it to the console once all new respondents have been assigned a code.
    # Select new respondents who need a code
    cur.execute('''SELECT id FROM Respondent WHERE redeem_code IS NULL''')
    NeedGiftCards = cur.fetchall()

    numCodesAssigned = 0  # Count number of the codes assigned

    if len(NeedGiftCards) > 0:  # Only if there are new respondents

        # Import csv file that holds the codes
        allcodes = pd.read_csv(CODEPath, encoding='utf-8-sig', sep=',', low_memory=False, error_bad_lines=False)

        # Identify last redeem code used
        try:
            cur.execute('''SELECT redeem_code FROM Respondent
                           WHERE id == (SELECT max(id) FROM Respondent WHERE redeem_code IS NOT NULL)''')
            lastcode = cur.fetchone()[0]
            row = allcodes[allcodes.code == lastcode].index.values[0]  # Get index value for last code
        except TypeError:
            row = -1

        usecodes = allcodes[allcodes.index > row]  # Select all codes after that value

        for needcard in NeedGiftCards:
            row += 1

            # Extract data
            sqlDB_id = needcard[0]
            redeem_code = usecodes.code[row]
            numCodesAssigned += 1

            # Add code to SQL database
            cur.execute('''UPDATE Respondent SET redeem_code = ? WHERE id == ?''', (redeem_code, sqlDB_id))
            sqlconn.commit()

    print 'Number of gift card codes assigned:', numCodesAssigned

Finally, the program fetches all new responses for which the codes have not been sent out yet. For each respondent, it creates a message including the code and sends it via email. It counts the number of emails sent and logs it to the console to keep track.

    # Get all contacts and codes for which the code has not been sent
    cur.execute('''SELECT id, email, redeem_code FROM Respondent
                   WHERE redeem_code IS NOT NULL and sentstatus IS NULL''')
    contacts = cur.fetchall()

    numCodesSent = 0  # Count the number of codes sent

    if len(contacts) > 0:  # Only if there are new respondents
        for contact in contacts:

            sqlDB_id = contact[0]
            email = contact[1]
            code = contact[2]

            message = genMessage(code)
            TOADDR = [email]

            # Send message
            try:
                sendMail(TOADDR, "Thank you for your participation!", message)
                numCodesSent += 1
                sentstatus = 1
                cur.execute('''UPDATE Respondent
                               SET sentstatus = ?, sent_date = datetime(CURRENT_TIMESTAMP, 'localtime')
                               WHERE id == ?''', (sentstatus, sqlDB_id))
                sqlconn.commit()
            except:
                e = sys.exc_info()[0]
                print "Error:", e
                continue

    print 'Number of codes sent:', numCodesSent

The very last step is to create a backup if it is time for one. After all iterations are done, the program closes the SQL database connection.

    if t % backupn == 0:
        # Create timestamped database copy
        createBackup(SQLBackupPath, database)

    time.sleep(waittime)

# Close SQL connection
sqlconn.close()

I hope this program can help you with your projects. Feel free to contact me with questions or suggestions.

Guest post by Hans Fricke, Director of Research at CEPA Labs, Stanford University. This article was first published on the CEPA Labs blog.

Latest job opportunities

GetYourGuide Berlin, Germany
23/10/2017
Full time
As a Senior Data Engineer you will play a key role in designing and developing our data infrastructure, architecture and processes. You will also help make the vast amount of our data accessible, reliable and efficient to query.

About GetYourGuide: With 400 employees from 50 nationalities, GetYourGuide is on a mission to turn trips into amazing experiences, using our product and technology to change the way travelers find and book things to do and explore their destination. We connect these travelers to our vast catalog of amazing experiences through our online marketplace.

Team Mission: The Data Platform team plays a central role in our data-driven strategy, designing and developing our data infrastructure, processes and analytics capabilities. We work on problems such as event pipelines, analytics and large-scale applied machine learning. The team is also responsible for making the vast amount of data accessible, reliable and efficient to query, so our teams can make the best data-driven decisions. We work on a modern Spark stack, and our language of choice is Scala.

Responsibilities:
Design, build and refine our data and analytics infrastructure, scaling to terabytes of data
Grow our analytics capabilities with faster, more reliable data pipelines and better tools
Design and develop a real-time events pipeline and work with data scientists, backend engineers and product managers to find new ways to leverage this data
Develop complex and efficient functions to transform raw data sources into powerful, reliable components of our data lake
Optimize and improve existing features or data processes for performance and stability

Requirements:
You have 5+ years of experience as a Data Engineer
You have experience designing, developing and maintaining systems at scale
You have strong experience with Scala, Python or Java
You have strong experience with big data technologies (e.g. Spark, Hadoop)
You have experience with building stream-processing systems, using solutions such as Kafka and Spark Streaming
You write efficient, well-tested code with a keen eye on scalability and maintainability
You possess strong CS fundamentals and are familiar with agile development
You have an analytical mind and back your decisions with data

Nice to have:
Experience with AWS and cloud technologies such as S3
Experience with microservice architecture

We Offer:
Work on a product that helps create memorable travel experiences
Smart, engaged co-workers
Speak English in the office with people from over 50 nationalities
Virtual stock options – be part of our success story
Quarterly hackathons and weekly tech talks
Annual external training budget – be constantly learning
Well-stocked kitchen and our famous pizza & beer Fridays
GetYourGuide vouchers and free German lessons
Relocation assistance

Curious? Do you have the skills for the job, enthusiasm about our vision, and a fit with our culture? We are looking forward to hearing from you! For any further questions regarding this position, contact us via jobs@getyourguide.com. In the meantime, you can check our blog at inside.getyourguide.com to see what happens behind the scenes at GetYourGuide, and check out our Tech Radar at techradar.getyourguide.com for an insight into the stack we use to turn trips into amazing experiences!
Move24 Berlin, Germany
23/10/2017
Full time
Move24, an award-winning tech startup, is looking for a (Senior) Business Intelligence Manager (f/m) with a strong technical and business background to join our highly qualified team and help us shape the future of the global moving industry. The Business Intelligence department is a critical component of our growth and culture.

About Move24: From its Berlin headquarters, Move24 operates in different European markets such as Germany, Sweden and France. Move24's vision is to build the world's leading moving platform and to deliver high-quality products to customers and local moving partners, united in a strong and trusted brand. In order to balance supply and demand as well as ensure service quality, Move24 will create technology-enabled operations to revolutionize the moving industry.

Your Responsibilities:
Ownership and support of the data warehouse and reporting/analytics processes (e.g. ETL, log processing)
You proactively design a company-wide KPI framework
You implement and maintain data models using business intelligence tools/dashboards to allow different stakeholders to access the data directly
You enable other departments (Sales, Marketing, Operations, Product etc.) to translate high-level business problems into more specific questions which can then be answered by data-driven analysis and actions
You present findings with a clear point of view so that insights can be used to help drive business strategy and technology projects
The daily business will include the analysis and definition of KPIs, the reporting for all departments, the integration and comprehension of new and unfamiliar data sources, and ad hoc data analysis

Your Skills:
A degree in Mathematics, Statistics, Economics, Computer Science, Physics or a similar field with a heavy quantitative focus is strongly preferred
2+ years of relevant experience in a business intelligence role, including data warehousing and business intelligence tools, techniques and technology
Hands-on experience in working with Python, SQL, an ETL framework and PostgreSQL or other common database systems
Understanding of modern machine learning techniques regarding supervised and unsupervised learning algorithms
Familiarity with common frameworks like scikit-learn and MLlib
Relevant experience in data analysis using data visualization tools
Strong analytical skills, including the ability to define problems, collect data, establish facts, communicate results and arrive at valid conclusions independently and self-confidently
A very good command of English and strong written and verbal cross-functional communication skills

What We Offer:
Be part of an innovative, international company that is passionate about disrupting the rapidly growing moving industry
Meaning – our work has a direct impact on the international moving market
A very experienced management team, business angels and advisors
You work with a highly motivated and skilled team of young top performers with great team spirit
A very attractive compensation system that rewards the quality of your work and your commitment and provides further incentives
Open channels of communication, transparent and quick decision-making and results
Work in a company where independent thinking and personal responsibility are essential
Flat hierarchies and everyday opportunities to learn, grow and develop
Working with people from all around the globe
Enjoy a modern, light and spacious office loft with a big rooftop terrace in Berlin-Mitte, including free drinks and fresh fruit

Join the Move24 team! We look forward to receiving your application.
Telefonica Munich, Germany
23/10/2017
Full time
Who we are: We are Telefónica. Our vision is the OnLife Telco. We want to help people shape their digital lives easily and autonomously, according to their individual wishes and habits. To achieve this, we develop innovative products and services and are transforming our company, thereby also driving the digital transformation of society. Become part of this vision and help us shape the digital possibilities for our customers. Telefónica Deutschland is part of the Spanish telecommunications group Telefónica S.A., headquartered in Madrid. In Germany, we are the largest mobile network operator by number of customers and are best known for our premium brand O2. With a total of 125,000 motivated employees in 21 countries, we are shaping the digital world of tomorrow around people's needs.

Our department: The BMI unit is the company's central BI service provider. Based on our multi-terabyte Oracle- and Big Data-based data warehouse, we offer all kinds of BI services, such as reporting, business analytics and campaign management. Compliance with data protection regulations is always of central importance.

Your challenges with us:
Comprehensive consulting on complex BI projects with a view to data protection regulations: which kinds of processing are permissible, and how can planned projects be adapted so that they are legally compliant?
Independent and proactive development and refinement of processes and workflows for assessing projects for data protection compliance, and their documentation in a processing overview
Proactive support of the company data protection officer in the external representation of the BI unit
Strategic advice to BMI management on data protection questions
Coordination and moderation of the regular data protection jour fixe

What you bring:
A successfully completed degree in (business) informatics, business administration or law
At least 5 years of proven professional experience both in the field of data protection law and in a technical/IT role
BI know-how, ideally with BI tools, databases and/or campaign management systems
Strong communication skills, a confident manner and assertiveness
Know-how in the telecommunications sector is an advantage
Good written and spoken German and English; Spanish skills are a plus

What we offer: We offer you a dynamic working environment in which no two days are alike. We encourage and expect independent work and offer you the chance to be both part and a driver of digitalization and to shape it with us. With us you will find all the benefits of the modern working world, including flexible working structures, attractive compensation and many additional benefits. We actively support the compatibility of family and career and provide a range of health and fitness offers. A comprehensive O2 tariff package is also available for your private use.

How to apply: Convince us with a meaningful CV and the corresponding certificates. You do not need a cover letter. Please note that applications are only possible via our applicant portal; applications by email cannot be considered.

Anything else you need to know? If you have questions about our application process or would like further information about the position, please contact your recruiter. Recruiter: Sabrina Petermann, email: Sabrina.Petermann@telefonica.com. Applicants with severe disabilities will be given preferential consideration if equally qualified.
NorCom Information Technology Munich, Germany
23/10/2017
Full time
We are experts in big data and big infrastructure. Our team of data scientists, data analysts and software experts designs and develops big data solutions. Our own product, NDOS, extends Apache Hadoop with important enterprise features and thus allows an elegant integration into existing IT infrastructures and procedures. Building on this, we have developed our solutions EAGLE and DaSense, which are already being used successfully, above all in the automotive industry. Our goal is to use the latest technologies to make our enterprise customers fit for the future. Would you like to become part of our young international team in Munich and help us shape the future?

Your tasks:
You advise our customers on the use of big data technologies and revolutionize their data world by perfectly balancing velocity, variety and value
You keep an overview when preparing large amounts of data, check them for consistency and experiment with new approaches
You perfect data mining – whether search, measurement data, clustering, recommendation, classification or forecasting, you find patterns even in the largest mountain of data
You ensure a structured and clear visualization of the data
Your solutions enable our customers to work efficiently with big data
You redefine the technical barriers set by hardware, software and bandwidth

Your strengths:
A completed degree in computer science, mathematics or a similar field
Experience in systems programming and a command of both functional languages (Python, Scala, Javascript, R, Lisp) and declarative languages (SQL)
Working with tools such as Hadoop, Spark, Hive, HBase, Map-Reduce, Mahout, Weka or Elasticsearch is no challenge for you
Sound knowledge of machine learning and natural language processing (NLP), and experience with scikit-learn
You enjoy analyzing large amounts of data and visualizing them with tools such as d3js, ggplot2 or modern web standards

What you get from us:
An open, modern and varied working environment in the heart of Munich with flexible working-time models
We are team players and live a start-up culture with flat hierarchies and short decision paths
We are a cross-functional team of specialists from 19 nations with boundless ideas; what was not possible before, we make happen
We set new impulses every day, at the technological forefront and as a partner in the development of our employees

Interested? We look forward to receiving your detailed application, including your salary expectations and your earliest possible starting date. Sounds exciting? Send us your application at jobs@norcom.de. Any questions? Eneida Alikaj will be happy to help: +49 (0)89 939 480. Want to learn more about us? Get to know us as a team at www.norcom.de/join-us. NorCom Information Technology AG, Gabelsbergerstr. 4, 80333 München, 089/93948-346

