Your Career Platform for Big Data

Be part of the digital revolution in Germany

 

Latest Job Opportunities

ella media GmbH Köln, Germany
17/07/2019
Full time
Deep Learning, Data Science and NLP – everyone is talking about them; with us, you are right in the middle of it. We are working on the storytelling of the future: digital, connected and intelligent. To reach our ambitious goal over the coming years, we have a broad core team covering media, data science, AI and computational linguistics.

Your tasks
- As part of our agile data science team, you solve NLP and data science problems on the basis of big data and help drive the development of our core products
- You identify and aggregate the relevant data, build models for analysis and develop metrics to evaluate the results
- Beyond building and evaluating models, implementing the accompanying service is a matter of course for you
- You are interested in working with bleeding-edge technologies and new methods from research
- We are still young and not set in our ways; you actively shape our start-up

Your qualifications
- A university degree in (business) informatics or a comparable discipline
- A high degree of abstract thinking in algorithms and modelling
- Strong communication skills and enjoyment in solving even difficult problems
- Very good knowledge of at least one of the following programming languages: Python, R, C++11
- A GitHub / StackExchange account or relevant MOOCs are a nice-to-have

Our offer
- Permanent employment and development of your skills
- Flat hierarchies and our wish that you help shape the company
- Free choice of hardware and OS (as long as our software stack runs)
- Cloud computing and bare-metal hardware
- Flexible working hours and home office

Our company
We are ella. A newly founded start-up from Cologne with solid funding and very good contacts in the industry.
Driven by entrepreneurial spirit and a fascination with innovative future topics, we have been working since 2018 in a growing team on the entertainment of the future: artificial intelligence that can tell stories and entertain people with exactly the content they like, accompanying them through everyday life. This year, with our core team, we are laying the foundation for our growth strategy starting January 2020. We know the challenge, believe in what we do and, as experts, bring the building blocks needed to go this way!
REWE Systems Köln, Germany
15/07/2019
Full time
Whether in the supermarket or the discounter: modern retail is no longer conceivable without IT. With around 52 billion euros in revenue and 330,000 employees, the Cologne-based REWE Group is not only one of the leading retail and tourism groups in Europe, but also an IT company. The more than 1,000 IT specialists of REWE Systems GmbH set standards for the use of IT in the fast-changing world of retail. New technologies are developed and software applications are optimized precisely for different business needs. Business units such as REWE, PENNY and toom Baumarkt are demanding and important clients within the REWE Group. Besides everyday usability and technical feasibility, the IT experts increasingly have to keep sustainability in mind. In short, REWE Systems GmbH offers perspectives as diverse as the modern worlds of retail.

Job posting ID: 9200

At the earliest possible date, we are looking for a qualified and committed person to join us as Big Data Engineer (m/f/d) for the Portals and Standard Products functional area of REWE Systems GmbH at our Cologne location.

What we plan together: The functional area is responsible for the design and development of portal and web applications based on web technologies and standard products for the REWE Group. The scope of work ranges from consulting, developing solution ideas and concepts, through technical implementation, to supporting and maintaining the IT systems in production. Both classic and agile software development methods are used.

What you will achieve with us:
- Conception, design, implementation and qualified documentation of IT solutions based on Java and big-data technologies
- Support in preparing requirements analyses and specifications as well as in advising on solution approaches
- Involvement in preparing proposals (effort estimation, assessment of feasibility and risks)
- Development of innovative analyses and services in the retail environment based on big-data technologies
- Continuous assurance and optimization of data quality
- Care and maintenance of the products you are responsible for in ongoing operation
- Depending on your practical experience, the opportunity to contribute to software architectures

What convinces us:
- A successfully completed degree in computer science, mathematics or physics; alternatively, we also welcome talented and exceptionally committed career changers
- Several years of practical experience developing complex, multi-tier systems in Java
- Practical experience with various big-data technologies such as Hadoop, MapReduce, Hive, HBase, Mahout, Pig, Spark, Storm, Solr/Lucene
- Sound knowledge of common Java frameworks, build tools and version control systems (e.g. Spring, Maven, Git, IntelliJ)
- Practical experience applying software engineering methods
- Commitment, flexibility, team spirit and reliability towards customers and colleagues
- Good communication skills and a quick grasp of new topics
- Very good written and spoken German as well as good English

What we offer: At REWE Systems GmbH you can expect a varied range of tasks in the dynamic retail environment and innovative projects in which you can continuously expand your knowledge by working with new technologies and applications. You will work in a team with open communication and flat hierarchies.
This allows you to take on responsibility early and contribute your own ideas. Tailor-made training and development programmes also keep your career moving. Depending on your potential, personal strengths and needs, various career paths are open to you. As an employee of REWE Systems GmbH, you also benefit from the REWE Group's extensive range of social benefits, e.g. a voluntary company pension scheme. Beyond day-to-day and project business, events such as hackathons, collaboration days and university challenges regularly give our employees room for creativity and innovation. Below you will find a few impressions from our Hackathon 2019: https://youtu.be/khk-uHPi1xI Further information is available on our homepage at www.rewe-systems.com and www.rewe-group.com. We look forward to receiving your application, stating your availability and salary expectations. Please use our online form so that your application reaches the right contact in our recruiting centre directly. Unfortunately, we cannot return application documents submitted on paper. For questions about this position, please call us on 0221 149-7110. To make the text easier to read, we limit ourselves to masculine forms. We expressly emphasize that everyone is equally welcome with us, regardless of gender, nationality, ethnic and social origin, religion/belief, disability, age or sexual orientation.
Denodo Technologies München, Germany
15/07/2019
Full time
The Opportunity
Be part of an elite team in a rapidly growing international software product company. Your career with us will combine cutting-edge technology, exposure to worldwide clients across all industries (Financial Services, Automotive, Insurance, Pharma, etc.), exciting growth paths, direct mentorship, and access to senior management. Your mission will be to help our clients and prospects in the DACH region realize their full potential through accelerated adoption and productive use of Denodo's Data Virtualization capability in many solutions.

Duties & Responsibilities
As a Data Engineer - SQL / Big Data (m/f) you will successfully employ a combination of deep technical expertise, client communication, and coordination between clients and internal Denodo teams to achieve your mission. The main responsibilities for this position are:
- Conception, implementation and execution of customer-specific integration projects based on the Denodo Platform
- Education, coaching and support for Denodo Platform projects to achieve a high level of client satisfaction
- Diagnose and resolve client inquiries related to operating our products
- Participate in problem-escalation and call-prevention projects to help clients and other technical specialists increase their efficiency when using Denodo products
- Contribute to knowledge-management activities and promote best practices
- Implement product demos and pilots to showcase Data Virtualization
- Provide customer-based feedback regarding clients' business cases, requirements and issues

Qualifications
Desired Skills & Experience
- University degree related to information systems or computer science (Bachelor's or Master's)
- Solid knowledge of data integration, SQL, and relational and analytical database management
- Knowledge of Java, JDBC, XML, Web Services, Git, ...
- Experience in Windows & Linux/UNIX
- Basic experience in Big Data, NoSQL and in-memory environments is welcome
- Ability to learn new technologies; active listener
- Professional curiosity and continuous learning; creativity; team worker
- Good written/verbal skills in German (minimum C1 level) and in English
- Willingness to travel from time to time

Additional Information
Employment Practices: We are committed to equal employment opportunity. We respect, value and welcome diversity in our workforce. We do not accept resumes from headhunters or suppliers that have not signed a formal fee agreement. Therefore, any resume received from an unapproved supplier will be considered unsolicited, and we will not be obligated to pay a referral fee.
evertrace Berlin, Germany
15/07/2019
Full time
Powered by Berlin-based company builder Next Big Thing AG, Evertrace GmbH is a new startup at the intersection of supply chain and cutting-edge tech (blockchain & IoT). Our vision is to eliminate frustrating information silos and time-consuming, manual work during supply chain management; instead, we’re creating a simple and safe way to exchange data in real-time between involved parties. By providing a single platform for all participants of the logistics process - cargo owners, freight forwarders, brokers, insurers - Evertrace enables accessible data & real-time documentation.   Evertrace is seeking an experienced Java Developer to build the next generation supply chain product. In this role, you will work in an agile team and collaborate on the design, development and optimization of our platform. As one of the first employees of an IoT/Blockchain based supply-chain venture, you will have the opportunity to influence and shape processes. You’ll witness fast growth and co-write Evertrace’s success story, as well as your own.   Join our team to build a product that makes a huge difference to industry players!   
These will be your tasks:
- Design and develop our high-volume, low-latency, innovative IoT platform for the supply chain, delivering high availability and performance
- Contribute to all phases of the development lifecycle (requirements management with stakeholders, architecture, implementation, test automation, operation, and monitoring)
- Write well-designed and testable software components
- Prepare, produce and support software releases
- Support continuous improvement by investigating alternatives and technologies, and presenting these for architectural review

Some technologies you will be working with:
- Java 9, Spring Framework, Spring Boot
- RESTful APIs and microservices
- Akka/Reactor or similar, Kafka
- RDBMS, Cassandra, Elasticsearch
- Data processing stack (like Hadoop/Spark/Storm)
- AI, machine learning
- Docker, Kubernetes
- AWS/Azure

You're offering these qualifications:
- 5+ years of hands-on Java experience and a Bachelor's degree in Computer Science or a related field
- Proven record of successfully building and optimizing scalable, cost-effective and high-performing solutions
- Real-world project experience with real-time data ingestion and processing is a must
- A deep interest in blockchain & IoT technologies and data analytics & AI/machine-learning frameworks (experience is a big bonus)
- Experience with event-driven/reactive programming
- Experience with back-end integration with relational databases and NoSQL databases (like Cassandra)
- Experience working with microservices and APIs
- Experience with cloud deployment and agile environments
- Knowledge of and a great interest in secure development
- A strong sense of responsibility for product quality and a goal-oriented working style
- A true team player who enjoys contributing to the overall company success
- Fluency in English (German is not mandatory)

We're offering these benefits:
A sweet deal
- Full-time unlimited contract
- Competitive salary
- Flexible working hours
- Free drinks & fruit
- Hardware of your choice
- Subsidized membership to a Berlin sports club with 450+ locations
- Access to Factory Community events: lectures, social events, networking, a great restaurant and café, and a ball pit!

Work with the best
- Join an agile, cross-functional, dedicated team
- Take responsibility for your projects. No micromanagement.
- Work within a flat hierarchy among international team members from 20+ countries

Recruitment agencies: We do not work with recruitment agencies.

Relocation & EU work visa: We offer relocation assistance for this position regarding paperwork, visa and first steps in Germany.

DataCareer Blog

For the past few years, tasks involving text and speech processing have attracted enormous interest. Among the research areas belonging to Natural Language Processing and Machine Learning, sentiment analysis ranks very high. Sentiment analysis identifies and extracts subjective information from source data using data analysis and visualization, ML models for classification, and text mining. It helps to understand public opinion on a subject, which is why sentiment analysis is widely used in business and politics and is usually conducted on social networks.

Social networks as the main resource for sentiment analysis
Nowadays, social networks and forums are the main stage for people sharing opinions, which makes them especially interesting for researchers trying to figure out attitudes towards one object or another. Sentiment analysis tackles the problem of analyzing textual data created by users on social networks, microblogging platforms and forums, as well as business platforms, with respect to the opinions users hold about some product, service, person, or idea. In the most common approach, text is classified into two classes (binary sentiment classification), positive and negative, but sentiment analysis can involve many more classes in a multi-class problem. Sentiment analysis can also process hundreds of thousands of texts in a short period of time. This is another reason for its popularity: where people need many hours for the same work, sentiment analysis finishes it in seconds.

Common approaches for classifying sentiments
Sentiment analysis of text data can be done via three commonly used methods: machine learning, dictionaries, and a hybrid of the two.

Learning-based approach
The machine learning approach is one of the most popular today. Using ML techniques and various methods, one can build a classifier that identifies different sentiment expressions in text.
Dictionary-based approach
The main idea of this approach is to use a bag of words with polarity scores that help establish whether a word has a positive, negative, or neutral connotation. This approach doesn't require a training set, so even a small amount of data can be classified. However, many words and expressions are still missing from sentiment dictionaries.

Hybrid approach
As the name suggests, this approach combines machine learning and lexicon-based techniques. Although it is not widely used, the hybrid approach often shows more promising results than either of the two approaches used separately.

In this article, we will implement a dictionary-based approach, so let's dig into its basics. Dictionary- (or lexicon-) based sentiment analysis relies on special dictionaries and lexicons, many of which are available for computing sentiment in text. The main ones are:
- afinn
- bing
- nrc
All three are sentiment dictionaries that help evaluate the valence of textual data by matching words that express emotion or opinion.

Things to do before sentiment analysis
Before building a sentiment analyzer, a few steps must be taken. First of all, we need to state the problem we are going to explore and understand its objective. Since we will use data from Donald Trump's Twitter account, let's define our objective as analyzing which connotation his latest tweets have. With the problem outlined, we need to prepare our data for examination. Data preprocessing is the initial step in text and sentiment classification. Depending on the input data, various techniques can be applied to make the data more comprehensible and improve the effectiveness of the analysis. The most common steps in data preprocessing are:
- removing numbers
- removing stopwords
- removing punctuation
and so on.
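These cleaning steps are language-agnostic. As a rough illustration (independent of the R and Python tutorials that follow), a minimal Python sketch might look like this; the tiny stopword list here is only a stand-in for a real one such as tidytext's stop_words in R or nltk.corpus.stopwords in Python:

```python
import re

# Tiny illustrative stopword list; real projects use a full one.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "in"}

def preprocess(text):
    """Lower-case a text, strip URLs, numbers and punctuation, drop stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove links
    text = re.sub(r"[0-9]+", " ", text)         # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)       # remove punctuation
    return [w for w in text.split() if w not in STOPWORDS]

print(preprocess("Iran made a very big mistake! https://t.co/xyz"))
# → ['iran', 'made', 'very', 'big', 'mistake']
```

The same pipeline idea (links, then numbers, then punctuation, then stopwords) is what both tutorials below perform with library functions.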
Building sentiment classifier
The first step in building our classifier is installing the needed packages. We will need the following packages, which can be installed via install.packages("name of the package") directly from the development environment:
- twitteR
- dplyr
- splitstackshape
- tidytext
- purrr

As soon as all the packages are installed, let's load them:

```r
library(twitteR)
library(dplyr)
library(splitstackshape)
library(tidytext)
library(purrr)
```

We're going to get tweets from Donald Trump's account directly from Twitter, so we need to provide Twitter API credentials:

```r
api_key <- "----"
api_secret <- "----"
access_token <- "----"
access_token_secret <- "----"

setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
```

[1] "Using direct authentication"

And now it's time to get tweets from Donald Trump's account and convert the data into a dataframe:

```r
TrumpTweets <- userTimeline("realDonaldTrump", n = 3200)
TrumpTweets <- tbl_df(map_df(TrumpTweets, as.data.frame))
```

Here is how our initial dataframe looks. head(TrumpTweets) returns columns text, favorited, favoriteCount, replyToSN, created, truncated, replyToSID, id, replyToUID, statusSource, screenName, retweetCount, isRetweet, retweeted, longitude, latitude; a selection is shown below:

text | favoriteCount | created | retweetCount
It was my great honor to host Canadian Prime Minister @JustinTrudeau at the @WhiteHouse today!🇺🇸🇨🇦 https://t.co/orlejZ9FFs | 17424 | 2019-06-20 17:49:52 | 3540
Iran made a very big mistake! | 127069 | 2019-06-20 14:15:04 | 38351
“The President has a really good story to tell. We have unemployment lower than we’ve seen in decades. We have peop… https://t.co/Pl2HsZbiRK | 36218 | 2019-06-20 14:14:13 | 8753
S&P opens at Record High! | 43995 | 2019-06-20 13:58:53 | 9037
Since Election Day 2016, Stocks up almost 50%, Stocks gained 9.2 Trillion Dollars in value, and more than 5,000,000… https://t.co/nOj2hCnU11 | 62468 | 2019-06-20 00:12:31 | 16296
Congratulations to President Lopez Obrador — Mexico voted to ratify the USMCA today by a huge margin. Time for Congress to do the same here! | 85219 | 2019-06-19 23:01:59 | 20039

To prepare our data for classification, let's get rid of tweets containing links and reshape the dataframe so that there is only one word per row:

```r
TrumpTweets <- TrumpTweets[-(grep('t.co', TrumpTweets$'text')), ]
TrumpTweets$'tweet' <- 'tweet'
TrumpTweets <- TrumpTweets[ , c('text', 'tweet')]
TrumpTweets <- unnest_tokens(TrumpTweets, words, text)
```

And this is how our dataframe looks now:

tail(TrumpTweets): most, of, their, people, from, venezuela
head(TrumpTweets): iran, made, a, very, big, mistake

Obviously, the dataframe still contains various words without useful content, so it's a good idea to get rid of them.
```r
TrumpTweets <- anti_join(TrumpTweets, stop_words, by = c('words' = 'word'))
```

And here's the result:

tail(TrumpTweets): harassment, russia, informed, removed, people, venezuela
head(TrumpTweets): iran, mistake, amp, record, congratulations, president

Much better, isn't it? Let's see how many times each word appears in Donald Trump's tweets:

```r
word_count <- dplyr::count(TrumpTweets, words, sort = TRUE)
head(word_count)
```

words | n
day | 2
democrats | 2
enjoy | 2
florida | 2
iran | 2
live | 2

Now it's time to load a sentiment lexicon that will be used for tweet classification. We will use the bing dictionary, although you can easily use any other source:

```r
sentiments <- get_sentiments("bing")
sentiments <- dplyr::select(sentiments, word, sentiment)
TrumpTweets_sentiments <- merge(word_count, sentiments, by.x = c('words'), by.y = c('word'))
```

Above we did a simple classification of the words in Trump's tweets using our sentiment bag of words. This is how the result looks:

words | n | sentiment
beautiful | 1 | positive
burning | 1 | negative
congratulations | 1 | positive
defy | 1 | negative
enjoy | 2 | positive
harassment | 1 | negative
hell | 1 | negative
limits | 1 | negative
mistake | 1 | negative
scandal | 1 | negative
strong | 1 | positive
trump | 1 | positive

Let's look at the number of occurrences per sentiment in the tweets:

```r
sentiments_count <- dplyr::count(TrumpTweets_sentiments, sentiment, sort = TRUE)
sentiments_count
```

sentiment | n
negative | 7
positive | 5

We may also want to know the total count and the percentage of each sentiment:

```r
sentiments_sum <- sum(sentiments_count$'n')
sentiments_count$'percentage' <- sentiments_count$'n' / sentiments_sum
```

Let's now create an ordered dataframe for plotting the sentiment counts.
```r
sentiment_count <- rbind(sentiments_count)
sentiment_count <- sentiment_count[order(sentiment_count$sentiment), ]
sentiment_count
```

sentiment | n | percentage
negative | 7 | 0.5833333
positive | 5 | 0.4166667

And now it's time for visualization. We will plot the results of our classifier:

```r
sentiment_count$'colour' <- as.integer(4)
barplot(sentiment_count$'n', names.arg = sentiment_count$'sentiment',
        col = sentiment_count$'colour', cex.names = .5)
barplot(sentiment_count$'percentage', names.arg = sentiment_count$'sentiment',
        col = sentiment_count$'colour', cex.names = .5)
```

Conclusion
Sentiment analysis is a great way to explore emotions and opinions among people. Here we explored the most common and simple approach to sentiment analysis, one that is still great in its simplicity and gives quite an informative result. It should be noted, however, that different sentiment-analysis methods and lexicons work better or worse depending on the problem and the text corpus. The result of a dictionary-based approach also depends heavily on how well the chosen dictionary matches the textual data to be classified; a user can also create their own dictionary, which can be a good solution. Despite these limitations, dictionary-based methods often show results that compare well with more complex techniques.
The way other people think about a product or service has a big impact on our everyday decision-making. People used to rely on the opinions of friends and relatives, or on product and service reviews in print, but the era of the Internet has changed this significantly. Today, opinions are collected from people around the world via e-commerce review sites as well as blogs and social networks. To transform the gathered data into useful information about how a product or service is perceived, sentiment analysis is needed.

What is sentiment analysis and why do we need it
Sentiment analysis is the computational exploration of opinions, sentiments, and emotions expressed in textual data. Companies use it increasingly because the extracted information can help monetize products and services. Words express various kinds of sentiment: they can be positive, negative, or neutral, with no emotional overtone. To analyze the sentiment of a text, we need to understand the polarity of its words in order to classify sentiments into positive, negative, or neutral categories. This goal can be achieved through the use of sentiment lexicons.

Common approaches for classifying sentiment
Sentiment analysis can be done in three ways: using ML algorithms, using dictionaries and lexicons, or combining these techniques. The ML-based approach has become very popular, as it offers wide opportunities for identifying different sentiment expressions in text. For the lexicon-based approach, various dictionaries with polarity scores are available; such dictionaries help establish the connotation of a word. One advantage of this approach is that no training set is needed, so even a small piece of data can be classified successfully.
However, many words are still missing from sentiment lexicons, which somewhat diminishes the results of the classification. Sentiment analysis based on a combination of ML and lexicon-based techniques is less popular, but it can achieve more promising results than either approach used independently. The central part of lexicon-based sentiment analysis belongs to the dictionaries; the most popular are afinn, bing, and nrc, which are available through various NLP packages. All of them assign polarity scores that can be positive, negative, or neutral.

For Python developers, two sentiment tools are particularly helpful: VADER and TextBlob. VADER is a rule- and lexicon-based tool for sentiment analysis that is tuned to the kind of sentiment found in social media posts; it uses a list of tokens labeled according to their semantic connotation. TextBlob is a useful library for text processing that covers tasks like phrase extraction, sentiment analysis, classification and so on.

Things to do before sentiment analysis
In this tutorial, we will build a lexicon-based sentiment classifier of Donald Trump's tweets with the help of TextBlob, and see which sentiments generally prevail in his tweets. As with every data exploration, some steps are needed before the analysis: problem statement and data preparation. As the theme of our study is already stated, let's concentrate on data preparation. We will get the tweets directly from Twitter; the data will arrive in an unordered form, so we need to arrange it into a dataframe and clean it, removing links and stopwords.

Building sentiment classifier
First of all, we have to install the packages needed for the task.
```python
import tweepy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk
import nltk.corpus as corp
from textblob import TextBlob
```

The next step is to connect our app to Twitter via the Twitter API. Provide the credentials that will be used in our function for connecting and extracting tweets from Donald Trump's account:

```python
CONSUMER_KEY = "Key"
CONSUMER_SECRET = "Secret"
ACCESS_TOKEN = "Token"
ACCESS_SECRET = "Secret"

def twitter_access():
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    api = tweepy.API(auth)
    return api

twitter = twitter_access()

tweets = twitter.user_timeline("realDonaldTrump", count=600)
```

This is how a single entry of our dataset looks. tweets[0] is a tweepy Status object; its full representation repeats the author's profile JSON several times, so it is abridged here:

Status(_json={'created_at': 'Wed Jun 26 02:34:41 +0000 2019', 'id': 1143709133234954241, 'text': 'Presidential Harassment!', 'truncated': False, 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'user': {'id': 25073877, 'name': 'Donald J. Trump', 'screen_name': 'realDonaldTrump', 'location': 'Washington, DC', 'description': '45th President of the United States of America🇺🇸', 'followers_count': 61369691, 'statuses_count': 42533, ...}, 'retweet_count': 10387, 'favorite_count': 48141, 'favorited': False, 'retweeted': False, ...}, created_at=datetime.datetime(2019, 6, 26, 2, 34, 41), text='Presidential Harassment!', ...)
Trump', 'screen_name': 'realDonaldTrump', 'location': 'Washington, DC', 'description': '45th President of the United States of America🇺🇸', 'url': 'https://t.co/OMxB0x7xC5', 'entities': {'url': {'urls': [{'url': 'https://t.co/OMxB0x7xC5', 'expanded_url': 'http://www.Instagram.com/realDonaldTrump', 'display_url': 'Instagram.com/realDonaldTrump', 'indices': [0, 23]}]}, 'description': {'urls': []}}, 'protected': False, 'followers_count': 61369691, 'friends_count': 47, 'listed_count': 104541, 'created_at': 'Wed Mar 18 13:46:38 +0000 2009', 'favourites_count': 7, 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'verified': True, 'statuses_count': 42533, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': True, 'profile_background_color': '6D5C18', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/25073877/1560920145', 'profile_link_color': '1B95E0', 'profile_sidebar_border_color': 'BDDCAD', 'profile_sidebar_fill_color': 'C5CEC0', 'profile_text_color': '333333', 'profile_use_background_image': True, 'has_extended_profile': False, 'default_profile': False, 'default_profile_image': False, 'following': False, 'follow_request_sent': False, 'notifications': False, 'translator_type': 'regular'}, id=25073877, id_str='25073877', name='Donald J. 
Trump', screen_name='realDonaldTrump', location='Washington, DC', description='45th President of the United States of America🇺🇸', url='https://t.co/OMxB0x7xC5', entities={'url': {'urls': [{'url': 'https://t.co/OMxB0x7xC5', 'expanded_url': 'http://www.Instagram.com/realDonaldTrump', 'display_url': 'Instagram.com/realDonaldTrump', 'indices': [0, 23]}]}, 'description': {'urls': []}}, protected=False, followers_count=61369691, friends_count=47, listed_count=104541, created_at=datetime.datetime(2009, 3, 18, 13, 46, 38), favourites_count=7, utc_offset=None, time_zone=None, geo_enabled=True, verified=True, statuses_count=42533, lang=None, contributors_enabled=False, is_translator=False, is_translation_enabled=True, profile_background_color='6D5C18', profile_background_image_url='http://abs.twimg.com/images/themes/theme1/bg.png', profile_background_image_url_https='https://abs.twimg.com/images/themes/theme1/bg.png', profile_background_tile=True, profile_image_url='http://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg', profile_image_url_https='https://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg', profile_banner_url='https://pbs.twimg.com/profile_banners/25073877/1560920145', profile_link_color='1B95E0', profile_sidebar_border_color='BDDCAD', profile_sidebar_fill_color='C5CEC0', profile_text_color='333333', profile_use_background_image=True, has_extended_profile=False, default_profile=False, default_profile_image=False, following=False, follow_request_sent=False, notifications=False, translator_type='regular'), geo=None, coordinates=None, place=None, contributors=None, is_quote_status=False, retweet_count=10387, favorite_count=48141, favorited=False, retweeted=False, lang='in') Not very informative, huh? Let's make our dataset look more legible. 
In [101]:
tweetdata = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=["tweets"])

In [102]:
tweetdata["Created at"] = [tweet.created_at for tweet in tweets]
tweetdata["retweets"] = [tweet.retweet_count for tweet in tweets]
tweetdata["source"] = [tweet.source for tweet in tweets]
tweetdata["favorites"] = [tweet.favorite_count for tweet in tweets]

And this is how it looks now. Much better, isn't it?

In [103]:
tweetdata.head()
Out[103]:
   tweets                                             Created at           retweets  source              favorites
0  Presidential Harassment!                           2019-06-26 02:34:41  10387     Twitter for iPhone  48141
1  Senator Thom Tillis of North Carolina has real...  2019-06-25 22:20:42  11127     Twitter for iPhone  45202
2  Staff Sgt. David Bellavia - today, we honor yo...  2019-06-25 21:38:42  11455     Twitter for iPhone  48278
3  Today, it was my great honor to present the Me...  2019-06-25 20:27:19  10389     Twitter for iPhone  44485
4  ....Martha is strong on Crime and Borders, the...  2019-06-25 19:25:20   9817     Twitter for iPhone  52995

The next step is to clean the dataset of stop words that carry no meaning, and to enrich it so that, alongside the default tweet data, it also contains each tweet's connotation (positive, negative, or neutral), sentiment score, and subjectivity.
In [104]:
stopword = corp.stopwords.words('english') + ['rt', 'https', 'co', 'u', 'go']

def clean_tweet(tweet):
    tweet = tweet.lower()
    filteredList = []
    # Strip @mentions, URLs, and anything that is not alphanumeric.
    tweetList = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split()
    for i in tweetList:
        if i not in stopword:
            filteredList.append(i)
    return ' '.join(filteredList)

In [105]:
scores = []
status = []
sub = []
fullText = []
for tweet in tweetdata['tweets']:
    analysis = TextBlob(clean_tweet(tweet))
    fullText.extend(analysis.words)
    value = analysis.sentiment.polarity
    subject = analysis.sentiment.subjectivity
    if value > 0:
        sent = 'positive'
    elif value == 0:
        sent = 'neutral'
    else:
        sent = 'negative'
    scores.append(value)
    status.append(sent)
    sub.append(subject)

In [106]:
tweetdata['sentimental_score'] = scores
tweetdata['sentiment_status'] = status
tweetdata['subjectivity'] = sub
tweetdata.drop(tweetdata.columns[2:5], axis=1, inplace=True)

In [107]:
tweetdata.head()
Out[107]:
   tweets                                             Created at           sentimental_score  sentiment_status  subjectivity
0  Presidential Harassment!                           2019-06-26 02:34:41  0.000000           neutral           0.000000
1  Senator Thom Tillis of North Carolina has real...  2019-06-25 22:20:42  0.081481           positive          0.588889
2  Staff Sgt. David Bellavia - today, we honor yo...  2019-06-25 21:38:42  0.333333           positive          1.000000
3  Today, it was my great honor to present the Me...  2019-06-25 20:27:19  0.400000           positive          0.375000
4  ....Martha is strong on Crime and Borders, the...  2019-06-25 19:25:20  0.086667           positive          0.396667

For a better understanding of the obtained results, let's do some visualization.
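The cleaning step above relies on NLTK's English stopword corpus, which has to be downloaded separately (nltk.download('stopwords')). For readers without NLTK installed, here is a self-contained sketch of the same idea; the tiny hand-picked stopword set below is an illustrative stand-in for the full corpus, not a replacement for it:

```python
import re

# A tiny stand-in for NLTK's English stopword corpus, extended with
# the Twitter-specific tokens used in the article.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in",
             "rt", "https", "co", "u", "go"}

def clean_tweet_simple(tweet):
    """Lowercase, strip mentions/URLs/punctuation, then drop stopwords."""
    tweet = tweet.lower()
    # Remove @mentions first, then URLs, then any remaining non-alphanumerics.
    tweet = re.sub(r"(@[A-Za-z0-9]+)|(\w+://\S+)|([^0-9A-Za-z \t])", " ", tweet)
    return " ".join(w for w in tweet.split() if w not in STOPWORDS)
```

For example, `clean_tweet_simple("RT @user: Go to https://t.co/xyz now!")` keeps only `"now"`, since the mention, the URL, and the filler words are all filtered out.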
In [109]:
positive = len(tweetdata[tweetdata['sentiment_status'] == 'positive'])
negative = len(tweetdata[tweetdata['sentiment_status'] == 'negative'])
neutral = len(tweetdata[tweetdata['sentiment_status'] == 'neutral'])

In [110]:
fig, ax = plt.subplots(figsize=(10, 5))
index = range(3)
plt.bar(index[2], positive, color='green', edgecolor='black', width=0.8)
plt.bar(index[0], negative, color='orange', edgecolor='black', width=0.8)
plt.bar(index[1], neutral, color='grey', edgecolor='black', width=0.8)
plt.legend(['Positive', 'Negative', 'Neutral'])
plt.xlabel('Sentiment Status', fontdict={'size': 15})
plt.ylabel('Sentimental Frequency', fontdict={'size': 15})
plt.title("Donald Trump's Twitter sentiment status", fontsize=20)
Out[110]:
Text(0.5, 1.0, "Donald Trump's Twitter sentiment status")

Conclusion

Sentiment analysis is a great way to explore emotions and opinions in society. We created a basic sentiment classifier that can be used for analyzing textual data from social networks. Lexicon-based analysis also lets you build your own lexicon dictionaries, so you can fine-tune the sentiment scoring to the task, the textual data, and the goal of the analysis.
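To illustrate that last point about custom lexicons, here is a minimal sketch of a hand-rolled lexicon scorer. The words and weights below are invented for the example (they are not TextBlob's lexicon); in practice you would curate them for your domain:

```python
# Hypothetical lexicon: word -> polarity weight, curated by hand.
LEXICON = {
    "great": 1.0, "honor": 0.8, "strong": 0.5,
    "harassment": -1.0, "crime": -0.6, "weak": -0.5,
}

def lexicon_score(text):
    """Average the polarity of known words; unknown words count as 0."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(LEXICON.get(w, 0.0) for w in words) / len(words)

def classify(score):
    """Map a polarity score to the same three labels used above."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

With this scheme, `classify(lexicon_score("great honor"))` yields "positive" and `classify(lexicon_score("presidential harassment"))` yields "negative", and adjusting the weights in LEXICON directly tunes the classifier's behavior.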
Introduction

Nowadays PostgreSQL is probably one of the most powerful open-source relational databases. Its functional capabilities rival Oracle's and are well ahead of MySQL's. So if you are building apps in Python, sooner or later you will need to work with a database. Luckily, Python has a wide range of packages that make connecting to and using databases easy. In this article, we will go through the most common packages for working with PostgreSQL and show how to use one of them.

Python modules for working with PostgreSQL

How you access the database from Python usually depends on personal preference and the specifics of your project. One of the most common and easiest ways is to establish the connection through a dedicated Python driver; the most popular are psycopg2, PyGreSQL, py-postgresql, and pg8000.

psycopg2

psycopg2 is probably the most popular package for interacting with PostgreSQL from Python. Written in C as a wrapper around libpq, it provides a wide range of database operations. psycopg2 offers both client-side and server-side cursors, asynchronous communication and notifications, and "COPY TO / COPY FROM" support. It also adapts most Python data types to their matching PostgreSQL types. Among its key features are support for multiple connections and connection objects, various transaction-management methods, connection pooling, automatic parameter escaping, and asynchronous queries and cursor objects. Plus, rows can be returned as Python dictionaries keyed by column name. Possible weaknesses include gaps in the documentation and mostly outdated code examples.
PyGreSQL

PyGreSQL is the first PostgreSQL adapter for Python and one of the open-source PostgreSQL modules that is still actively developed. It embeds the PostgreSQL query library, making all of PostgreSQL's database-manipulation features easy to use from Python. Although the module offers great capabilities for working with databases, some users have reported problems when working with cursor objects.

py-postgresql

py-postgresql is a Python 3 module that provides rich facilities for interacting with PostgreSQL, including a high-level driver for deeper database work. Its main disadvantages are that it runs only on Python 3 and lacks direct support for high-level async interfaces.

pg8000

pg8000 is a pure-Python module for working with PostgreSQL that has extensive documentation and is actively developed. Notably, it complies with the Python Database API 2.0 (PEP 249), which gives users a standard, portable way to connect to the database from Python. Being pure Python, the module can be used in AWS Lambda functions without any extra work. It is still under active review and development, so you may run into some bugs while working with it.

Although many more PostgreSQL drivers are available, some of them are outdated or limited in use. Still, the ones listed above give users great options for all kinds of database manipulation.

Working with queries

We have covered some of the most common packages for connecting a Python app to PostgreSQL, and now it's time to see one in use. We chose the psycopg2 module and will use its operations as the example for connecting to and working with databases. It is assumed that PostgreSQL is already installed and running on your machine.

1. Installing the package.
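Because pg8000 follows the Python Database API 2.0, the basic connect/execute/commit lifecycle looks the same as with any other DB-API driver. As a runnable sketch we use the stdlib sqlite3 module (also a DB-API 2.0 implementation) with made-up census rows; against a real PostgreSQL server you would call pg8000.connect(...) instead, and pg8000 uses %s-style placeholders rather than ?:

```python
import sqlite3  # stands in for any DB-API 2.0 driver such as pg8000

# In-memory database so the sketch runs without a server; the rows are invented.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute(
    "CREATE TABLE census (community_area_name TEXT, per_capita_income INTEGER)"
)
cursor.executemany(
    "INSERT INTO census VALUES (?, ?)",  # pg8000/psycopg2 would use %s here
    [("Burnside", 33385), ("Douglas", 23791)],
)
conn.commit()  # make the inserts persistent

cursor.execute(
    "SELECT community_area_name FROM census WHERE per_capita_income > ?",
    (30000,),
)
rows = cursor.fetchall()  # [('Burnside',)]
cursor.close()
conn.close()
```

The same five steps (connect, get a cursor, execute, commit, close) carry over directly to pg8000 and psycopg2; only the connect arguments and the placeholder style change.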
Run conda install psycopg2 if you are using Anaconda, or pip install psycopg2 via the pip package manager.

2. Once the package is installed, let's check that the database is accessible.

To check the connection, we will write a simple script. Be sure to provide your own database name, username, and password. Save the script and run it with the python test.py command. As a result, we will get an empty array ([]), meaning the connection works properly.

Now it's time to explore some basic operations on databases. First, let's add some tables to our database to work with. Data is usually stored in .csv tables, so why not start exploring PostgreSQL interactions by loading some data into the database from a .csv file? For the next examples, we will use parts of two datasets that can be found on Kaggle:

Chicago census data;
Chicago public schools.

First, we establish the connection and create tables with the corresponding names. Note that identifiers containing characters such as + must be double-quoted in PostgreSQL:

connect_str = "dbname=yourdbname user=yourname host='localhost' " + "password='yourpwd'"
conn = psycopg2.connect(connect_str)
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE census(
        id integer PRIMARY KEY,
        community_area_number float,
        community_area_name text,
        percent_of_housing_crowded float,
        percent_households_below_poverty float,
        "percent_aged_16+_unemployed" float,
        "percent_aged_25+_without_high_school_diploma" float,
        percent_aged_under_18_or_over_64 float,
        per_capita_income integer,
        hardship_index float)
""")

The second table is created in the same way. After this, we can copy data from the corresponding .csv files into our tables:

with open('CENSUS.csv', 'r') as f:
    next(f)  # skip the header row
    cursor.copy_from(f, 'census', sep=',')
conn.commit()

The commit() method is needed to make the changes to our database persistent. Note that while copying data from .csv tables we do not need the header, so we skip it. The same operation is done for the other table. And now we can write some simple queries.
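Column names taken straight from CSV headers often contain characters (like the + in the census columns) that are not valid in unquoted SQL identifiers. One option is to quote them; another is to normalize them up front. Here is a small hedged helper for the latter; the snake-case rule is an assumption of this sketch, not part of psycopg2:

```python
import re

def to_sql_identifier(header):
    """Normalize a CSV header into a safe, unquoted SQL identifier."""
    name = header.strip().lower()
    name = name.replace("+", "_plus")            # e.g. 'AGED_16+' -> 'aged_16_plus'
    name = re.sub(r"[^a-z0-9_]+", "_", name)     # any other odd character -> '_'
    name = re.sub(r"_+", "_", name).strip("_")   # collapse and trim underscores
    return name
```

For example, `to_sql_identifier("PERCENT_AGED_16+_UNEMPLOYED")` gives `percent_aged_16_plus_unemployed`, which PostgreSQL accepts without quoting.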
Let's find out which community areas in the CENSUS table start with the letter 'B':

sql = 'SELECT community_area_name FROM census WHERE community_area_name LIKE %s'
cursor.execute(sql, ('B%',))
cursor.fetchall()

Let's list the top 5 community areas by average college enrollment:

sql = 'SELECT community_area_name, AVG(college_enrollment) AS avg_enrollment FROM schools GROUP BY community_area_name ORDER BY avg_enrollment DESC LIMIT 5'
cursor.execute(sql)
cursor.fetchall()

Now, let's find the per capita income of the community areas whose schools have a safety score of 1:

sql = 'SELECT community_area_number, community_area_name, per_capita_income FROM census WHERE community_area_number IN (SELECT community_area_number FROM schools WHERE safety_score = 1)'
cursor.execute(sql)
cursor.fetchall()

As you have already noticed, we used the fetchall() method. Among many other useful methods, the package has options to retrieve a single row (fetchone()) or all the rows matching a query (fetchall()). As soon as all the manipulations with the database are finished, the connection must be closed. This is done with two calls: cursor.close() and conn.close().

Conclusion

With the help of the various drivers, establishing a connection and interacting with PostgreSQL from Python is easy and quick. Every module has its own pros and cons, which lets each user choose the one that suits them. The psycopg2 module used in the example provides an easy and clear way to establish a connection and run flexible queries.
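The key habit in these queries is passing parameters as a separate tuple instead of formatting them into the SQL string, so the driver escapes them and SQL injection is prevented. A runnable sketch of the LIKE query and the fetchone()/fetchall() pair, using the stdlib sqlite3 driver (same DB-API 2.0 shape; psycopg2 uses %s placeholders instead of ?) with invented area names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE census (community_area_name TEXT)")
cursor.executemany(
    "INSERT INTO census VALUES (?)",
    [("Burnside",), ("Beverly",), ("Douglas",)],  # made-up sample rows
)

# The parameter is passed separately, never interpolated into the string;
# ORDER BY makes the result order deterministic.
sql = ("SELECT community_area_name FROM census "
       "WHERE community_area_name LIKE ? ORDER BY community_area_name")
cursor.execute(sql, ("B%",))
b_areas = cursor.fetchall()   # every matching row
cursor.execute(sql, ("B%",))
first = cursor.fetchone()     # just the first matching row
conn.close()
```

Here b_areas holds both 'B' areas while first holds only the first one; note that fetchone() consumes a row, so re-running the query resets the cursor before each fetch style.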