A Baseline Based Deep Learning Approach of Live Tweets

Social media now plays a vital role in shaping people's lives. Twitter, Facebook and Instagram are among the major social media platforms, and they give users a place to voice their opinions on the things and events around them. Twitter is a microblogging site that allows a user to post 6000 tweets per day, each up to 280 characters long. Data analysts rely on this data to draw conclusions about current events and to rate products. However, the massive volume of reviews makes it difficult for analysts to read through them and reach conclusions. To solve this problem we adopt sentiment analysis, an approach that classifies the sentiment of user reviews, documents and similar text as positive (good), negative (bad) or neutral (surprise). I suggest an enhanced Twitter sentiment analysis that retrieves data based on a baseline within a predefined time span and performs sentiment analysis using TextBlob. Unlike traditional and existing schemes, which perform sentiment analysis on previously saved data, this scheme performs sentiment analysis on real-time data fetched via the Twitter API, thereby providing a much more recent and relevant


I. INTRODUCTION
In the past few years, there has been a huge growth in the use of microblogging platforms such as Twitter. Spurred by that growth, companies and media organizations are increasingly seeking ways to mine Twitter for information about what people think and feel about their products and services. Data analysts also use this data to interpret public opinion about eminent personalities and current events. The online medium has become a significant way for people to express their opinions, and with social media there is an abundance of opinion information available. Sentiment analysis determines the polarity of an opinion (positive, negative or neutral) by analyzing its text. It has been useful for companies gathering customer opinions on their products, for predicting the outcomes of elections, and for extracting opinions from movie reviews. The information gained from sentiment analysis helps companies make future decisions. Many traditional approaches to sentiment analysis use the bag of words method. The bag of words technique does not consider language morphology, and it can incorrectly classify two phrases as having the same meaning because they have the same bag of words. The relationship between the collection of words is considered instead of the relationship between individual words. When determining the overall sentiment, the sentiment of each word is determined and the results are combined using a function. Bag of words also ignores word order, which causes phrases containing negation to be classified incorrectly. Other techniques discussed in sentiment analysis include Naive Bayes, Maximum Entropy, and Support Vector Machines. Sentiment analysis refers to the broad area of natural language processing that deals with the computational study of opinions, sentiments and emotions expressed in text.
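The word-order limitation of bag of words can be illustrated with a minimal sketch. The two example sentences and the helper name `bag_of_words` are illustrative assumptions, not part of the proposed system; the tokenizer is a plain whitespace split.

```python
from collections import Counter


def bag_of_words(text):
    """Lower-case the text, split on whitespace, and count tokens.

    Word order is discarded, which is the limitation discussed above.
    """
    return Counter(text.lower().split())


# Two illustrative sentences with opposite sentiment but identical words.
praise = "the phone is good not bad"
complaint = "the phone is bad not good"

# Both sentences map to the same bag, so a classifier that sees only
# the bag cannot tell them apart.
same_bag = bag_of_words(praise) == bag_of_words(complaint)
print(same_bag)
```

Because the negation word "not" attaches to a different adjective in each sentence, any model that relies on token counts alone would assign both sentences the same sentiment.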
Sentiment Analysis (SA) or Opinion Mining (OM) aims at learning people's opinions, attitudes and emotions towards an entity. The entity can represent individuals, events or topics. An immense amount of research has been performed in the area of sentiment analysis, but most of it has focused on classifying formal and larger pieces of text such as reviews. With the wide popularity of social networking and microblogging websites and the immense amount of data available from these resources, research on sentiment analysis has witnessed a gradual domain shift. Popular microblogging websites like Twitter have evolved to become a source of varied information. This diversity arises because microblogs have become platforms where people post real-time messages about their opinions on a wide variety of topics, discuss current affairs and share their experience of the products and services they use in daily life. Stimulated by the growth of microblogging platforms, organizations are exploring ways to mine Twitter for information about how people are responding to their products and services. A fair amount of research has been carried out on how sentiments are expressed in formal text such as product or movie reviews and news articles, but how sentiments are expressed given the informal language and message-length constraints of microblogging has been less explored. Twitter is an innovative microblogging service that was launched in 2006 and currently has more than 550 million users. The status messages created by users are termed tweets. The public timeline of Twitter displays the tweets of all users worldwide and is an extensive source of real-time information. The original concept behind microblogging was to provide personal status updates.
Today, however, tweets cover almost everything, ranging from current political affairs to personal experiences. Movie reviews, travel experiences, current events and more add to the list. Tweets (and microblogs in general) differ from reviews in their basic structure. While reviews are characterized by formal text patterns and are summarized thoughts of their authors, tweets are more casual and restricted to 280 characters of text. Tweets offer companies an additional avenue to gather feedback. Sentiment analysis of products, movie reviews and the like aids customers in decision making before they make a purchase or plan to watch a movie. Enterprises find this area useful for researching public opinion of their company and products, or for analyzing customer satisfaction. Organizations use this information to gather feedback about newly released products, which helps to improve later designs. Different approaches, including machine learning (ML) techniques, sentiment lexicons and hybrid approaches, have proved useful for sentiment analysis on formal texts, but their effectiveness for extracting sentiment from microblogging data still has to be explored. A careful investigation of tweets reveals that the 280-character length limit restricts the vocabulary that conveys the sentiment. The hyperlinks often present in tweets further restrict the vocabulary size. The varied domains discussed also pose hurdles for training. The frequency of misspellings and slang words in tweets (and microblogs in general) is much higher than in other language resources, which is another hurdle that needs to be overcome. On the other hand, the tremendous volume of data available from microblogging websites on varied domains is unmatched by other data resources. Microblogging language is characterized by expressive punctuation that conveys a lot of sentiment.
Bold-lettered phrases, exclamations, question marks, quoted text and similar cues leave scope for sentiment extraction. The proposed work attempts a novel approach on Twitter data: it aggregates an adapted polarity lexicon learnt from product reviews of the domains under consideration, tweet-specific features and unigrams to build a classifier model using machine learning techniques.

II. LITERATURE SURVEY
The related work section covers other uses of Twitter data, whose approaches differ entirely from the one discussed in this work. Birjali et al. conducted an analysis of the Big Data technologies InfoSphere BigInsights and Apache Flume [6]. Multiple sets of data for various research purposes were first collected from Twitter by Apache Flume, stored in Hadoop, and then displayed with BigSheets after being analyzed using InfoSphere BigInsights. They chose Twitter as their Big Data source because of the increasingly large amount of data generated daily by its users. This method uses the Hadoop Distributed File System (HDFS) in order to utilize the MapReduce feature, enabling the collection of larger data sets (Tweets). MapReduce counts the number of times a matching data set is iterated and then displays the results. Apache Flume Next Generation (NG) was used to collect the Tweets in this case study. Flume NG first collects data (Tweets) from multiple sources and holds them in memory, and then stores them in HDFS using a JAQL script; JAQL is a data processing and query language. After a thorough examination of InfoSphere BigInsights analytics, a separate data collection tool developed from Apache Flume was tested, and the results were analyzed using InfoSphere BigInsights. It was determined that the technique used by the tool developed from Apache Flume was not only superior to older methods, but faster as well. A paper on the Intelligent Mining of Public Social Networks' Influence in Society (MISNIS) tool [7] highlights several key limitations of current methods, such as Twitter API restrictions and dependency on hashtags and keywords for categorization, and demonstrates how MISNIS overcomes these limitations, increasing productivity by 80% and 40% respectively. MISNIS uses polarity sentiment analysis and does not use a language-dependent lexicon.
While this approach is limiting, it does not negate the apparent superiority of MISNIS, and it is open to further development in future. Joao P. Carvalho and his collaborators [7] demonstrate MISNIS by applying it to track, catalogue, analyze, and trace current events in Portugal; however, MISNIS can be applied in many other fields with various other research questions. It can collect, store, manage, mine, and display data by using Computational Intelligence, Information Retrieval, Big Data, Topic Detection, User Influence and Sentiment Analysis. This method uses geolocation to collect Tweets within Twitter's API restriction of 1% data collection, then traces the collected Tweets back to the users' accounts to collect additional Tweets that meet the search criteria from multiple Twitter API accounts. A file of every viable user was created and maintained to facilitate this process. MongoDB was used for all data storage, and a REST API was used to handle the data once it was collected. The REST API is also the tool used to collect data from individual users. This method does not make collection limitless, as it is still minimally restricted by Twitter. Another paper demonstrates the capabilities of the Tweets Characterization Methodology (TCHARM) [8], an insightful exploratory analyzer that organizes collected Tweets based on geographical location, the time of the Tweet, and its contents. TCHARM uses the Text And Spatio-TEmporal (TAST) distance measure to group similar Tweets based on all three categories. This means that TCHARM is capable of grouping Tweets about the same or similar subjects, from geographically close or specified regions, that were Tweeted around the same time. The case study conducted in that paper to demonstrate TCHARM's performance searched for and categorized Tweets related to the 2014 FIFA World Cup.
Through this study it was determined that the TAST measure used by TCHARM produced a more even distribution of the three factors tested for by TCHARM than other methods did. The authors also address avenues for future work based on TCHARM's limitations. One such limitation is the length of time it takes to set the specifications of TCHARM's features. It is also suggested that the K-means algorithm used by TCHARM may collect too broad a range of Tweets containing the three factors for categorization. While this means that some collections of Tweets are more loosely related than is desirable, it does not affect the overall higher efficiency demonstrated by this method. TCHARM can handle a high number of Tweets in its data collection because it uses Apache Spark as its platform, and it collects Tweets quickly on an hourly recurring basis.

III. LIVE TWEET ANALYSIS SYSTEM
In this system we suppose that a user searches for current tweets related to a particular keyword using his or her Twitter credentials, retrieves the tweets, and finally performs sentiment analysis on them in order to reach a conclusion.
A. Architecture
The architecture of the proposed scheme consists of four modules.
Creating Twitter API
To retrieve live tweets based on a baseline, the user must first request authentication credentials from Twitter.

Tweets Retrieval
Here tweets are retrieved from the Twitter API dynamically, based on the keyword entered and the requested count.
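The retrieval step might be sketched as follows with Tweepy. This is a sketch under stated assumptions rather than the system's actual code: the helper names (`make_api`, `status_to_row`, `fetch_rows`, `write_csv`) and the column names are hypothetical, the credentials are placeholders supplied by the caller, and the search method is named `search_tweets` in Tweepy 4.x (earlier releases called it `search`). Whether the search endpoint is available also depends on the access level of the developer account.

```python
import csv

# Hypothetical column names for the exported .csv file.
FIELDS = ["text", "source", "retweets", "likes", "language", "user"]


def make_api(consumer_key, consumer_secret, access_token, access_token_secret):
    """Build an authenticated Tweepy client from the user's credentials."""
    import tweepy  # third-party; imported lazily so the helpers below run without it

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    return tweepy.API(auth)


def status_to_row(status):
    """Pick the fields of interest from one tweet (status) object."""
    return {
        "text": status.text,
        "source": status.source,
        "retweets": status.retweet_count,
        "likes": status.favorite_count,
        "language": status.lang,
        "user": status.user.screen_name,
    }


def fetch_rows(api, keyword, count):
    """Search for recent tweets matching the keyword and flatten them."""
    # Tweepy 4.x method name; Tweepy 3.x used api.search instead.
    statuses = api.search_tweets(q=keyword, count=count)
    return [status_to_row(s) for s in statuses]


def write_csv(rows, fileobj):
    """Write the flattened tweets to an open text file as CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

A single search request returns only a limited number of tweets, so a larger count would need pagination, for example with `tweepy.Cursor`.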

Preprocessing
The tweets are imported from the Twitter API into a .csv file. These tweets contain unnecessary words, whitespace, hyperlinks and special characters, so we first filter them by removing all unnecessary content. In this method we use TextBlob to find the polarity of the text (positive, negative or neutral). The tweets are imported from Twitter using the API provided by the Twitter Developer platform. From this API, various fields such as tweets, source, retweets, likes, language and user can be scraped. After collecting this data, we can analyse the thoughts of various famous people on an event or occasion.
D. Software Description
In the system, outputs such as tables, bar graphs and line graphs are generated with the help of Jupyter Notebook. The libraries and built-in types used are pandas, NumPy, matplotlib (pyplot), lists and dictionaries. Pandas is used to load the csv file into a dataset. NumPy is one of the essential libraries for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with these arrays. Python includes several built-in container types: lists, dictionaries, sets and tuples. A list is the Python equivalent of an array, but it is resizable and can contain elements of different types. A dictionary stores (key, value) pairs, like a Map in Java or an object in JavaScript. The Python library TextBlob is used for processing textual data. It provides an API for natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more. Tweepy, which is open source, is used for accessing the Twitter API.
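The filtering described under Preprocessing could look roughly like the sketch below, which uses only the standard `re` module. The function name `clean_tweet` and the exact patterns are assumptions; in particular, removing @mentions and the leading "RT" retweet marker goes beyond the hyperlinks, whitespace and special characters named in the text.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # hyperlinks
MENTION_RE = re.compile(r"@\w+")                # @user mentions (assumed unwanted)
RETWEET_RE = re.compile(r"^\s*RT\b")            # leading retweet marker (assumed unwanted)
SPECIAL_RE = re.compile(r"[^A-Za-z0-9\s]")      # anything but letters, digits, whitespace


def clean_tweet(text):
    """Strip hyperlinks, mentions, the retweet marker and special characters,
    then collapse runs of whitespace into single spaces."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = RETWEET_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    return " ".join(text.split())


print(clean_tweet("RT @user: Great phone!!!   https://t.co/abc #win"))
```

Stripping every special character is a blunt choice: it also removes apostrophes and the expressive punctuation that the introduction notes can carry sentiment, so a gentler filter may suit some analyses better.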

E. Data Analysis and Visualization
On Twitter, users tweet their opinions on an occasion or on anything else, including a commodity or even a personality. From their thoughts, the importance of that occasion and the polarity of their tweets are analysed. Some of the analyses performed with the dataset are as follows.
• Visualize the various sources of the tweets.
• Calculate the polarity of the tweets fetched.
• Visualize the polarity of the tweets (positive, negative, neutral).
• Calculate the general review of the tweeters.
• Calculate the individual review of the tweeters.
• Visualise the tweeters' opinions in the form of a pie graph.
F. Advantages of Proposed Scheme
1. The system gives us a review of day-to-day happenings.
2. Provides impartial reviews.
3. Fast analysis.
4. Easily understandable by all.
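The polarity steps listed under Data Analysis and Visualization can be sketched as follows. TextBlob's `sentiment.polarity` is a score between -1 and 1; treating scores above zero as positive, below zero as negative and exactly zero as neutral is an assumption, as are the helper names `polarity_of`, `label_polarity`, `summarize` and `plot_pie`.

```python
from collections import Counter

LABELS = ("positive", "negative", "neutral")


def polarity_of(text):
    """Return TextBlob's polarity score for a piece of text."""
    from textblob import TextBlob  # third-party; imported lazily

    return TextBlob(text).sentiment.polarity


def label_polarity(polarity):
    """Map a polarity score to a class label (zero threshold is an assumption)."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


def summarize(polarities):
    """Return the percentage of tweets falling in each polarity class."""
    counts = Counter(label_polarity(p) for p in polarities)
    total = sum(counts.values())
    return {
        label: (100.0 * counts[label] / total) if total else 0.0
        for label in LABELS
    }


def plot_pie(summary):
    """Draw the class percentages as a pie graph with matplotlib."""
    import matplotlib.pyplot as plt  # third-party; imported lazily

    plt.pie(list(summary.values()), labels=list(summary.keys()), autopct="%1.1f%%")
    plt.title("Polarity of fetched tweets")
    plt.show()
```

With the cleaned tweets in a list `texts`, the general review would be `summarize([polarity_of(t) for t in texts])`, and the individual review of a tweeter is the same call restricted to that user's tweets.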

IV. EVALUATION
Our scheme differs in a few ways from traditional sentiment analysis schemes. The first difference is the adoption of live streaming of data. The second is that the output value is the tweeters' current opinion. Based on these features, our proposal has the following advantages:
• Lower computational cost
• High accuracy
• Support for the privacy of users
The polarity of tweets can be expressed at different levels, indicating whether the opinion expressed in a document or a sentence is positive or negative. Determining the subjectivity of tweets essentially means finding the subjective words and text that indicate the presence of opinions. In the result shown in Table 2 we can see the polarity of each baseline.

V. RESULT
This is the sample output of the project for the keyword Donald Trump and 1000 tweets. Twitter sentiment analysis comes under the category of text and opinion mining. It focuses on analyzing the sentiments of tweets and feeding the data to a machine learning model to train it and then check its accuracy, so that the model can be used in future according to the results. It comprises steps such as data collection, text pre-processing, sentiment detection, sentiment classification, and training and testing the model. This research topic has evolved during the last decade, with models reaching an efficiency of almost 85%-90%, but it still lacks diversity in the data. It also has many application issues caused by slang and the short forms of words. Many analyzers do not perform well when the number of classes is increased. In addition, it has not yet been tested how accurate the model will be for topics other than the one under consideration. Hence sentiment analysis has very bright scope for development in future.
A. Future Scope
We can perform deep sentiment analysis of text in different areas of application. It is not adequate to say that a text is positive or negative overall. Users would like to know which separate topics are discussed in the text, which of them are positive and which are negative. So there will be an inclination towards greater use of NLP techniques (such as syntactic parsing), in addition to machine learning methods.
• A more elaborate web-based application can be built for my work in future.
• By using various classification strategies we can further improve the results.
• By the use of sentiment analysis, I can forecast future consequences, or at least anticipate them better, when people tweet about the present scenario.