SaaS.ru - January 13, 2015 - by Anastasiia Tolstobrova
Alexander Sajenkov, Head of Business Intelligence at Schneider Electric Russia, describes various benefits and ways to handle unstructured data using EPAM’s Info(N)gen.
Globally, the number of information sources (websites, news and analytical portals, blogs, social networks, etc.) is growing rapidly, producing a very large volume of data that needs to be processed efficiently. As a result, there has been a lot of discussion about Big Data. In today’s world, full coverage of relevant content published on the web is a crucial part of the market intelligence conducted by any large company. This process involves monitoring news about competitors, partners, macroeconomic indicators, and key events in different market segments (oil and gas, electrical utilities, construction, etc.). It is hard to imagine a large international company without a professional current awareness solution for discovering and curating the volume of news content generated today. Companies are therefore seeking to invest in high-quality solutions designed to process unstructured data and help solve market intelligence challenges effectively.
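In terms of the content processing workflow, managing unstructured data comes down to three main stages: aggregating content from different sources; searching and visualizing the search results; and curating and distributing the discovered content to the target audience.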
So, what exactly is Info(N)gen?
First of all, it’s a “personalized” current awareness solution, or a “content aggregator tailored for the client.” It is no secret that there are many platforms and services on the market that aggregate content, ranging from free RSS applications to professional current awareness platforms. Well-known solutions in Russia include Factiva, LexisNexis, M-Adaptive from M-Brain, Scan-Interfax, Medialogia, Integrum, Public.ru and others. However, in most cases these content aggregators are generic solutions that categorize content by common (standard) topics or industries. Such media monitoring products usually have fixed functionality, and any ability to “tailor” the system to the needs of the client is limited or non-existent. The primary advantage of Info(N)gen is its flexibility: the system can be tailored to each user’s requirements and content monitoring needs.
Let’s review it in more detail by going through the stages of the content processing workflow mentioned above.
Stage #1. The aggregation of content from different sources
The Info(N)gen platform comes preconfigured with a large universe of sources. The primary focus is on online publications. The total number of sources now exceeds 70,000. Particularly interesting is the coverage of Russian sources, which is also quite broad and comprises more than 4,000 sources. As is the case with most news aggregators, the data primarily arrives in the form of RSS feeds (XML streams), which Info(N)gen processes every few minutes.
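As a rough illustration of what processing an RSS feed involves, the sketch below parses a small RSS 2.0 document with Python's standard library and extracts the title, link and publication date of each item. It is not Info(N)gen's actual ingestion code: the sample feed, the `parse_rss` helper and the example.com URLs are assumptions made purely for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real aggregator would download this XML
# from each source's feed URL on a schedule.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Energy News</title>
    <item>
      <title>New substation opens in the Ural region</title>
      <link>http://example.com/news/1</link>
      <pubDate>Tue, 13 Jan 2015 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Industrial automation market grows</title>
      <link>http://example.com/news/2</link>
      <pubDate>Tue, 13 Jan 2015 10:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return one dict per <item> element in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        }
        for item in root.iter("item")
    ]


for entry in parse_rss(SAMPLE_FEED):
    print(entry["published"], "|", entry["title"])
```

A production aggregator would also need to fetch feeds over the network, handle malformed XML and skip items it has already seen; those steps are left out here.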
The flexibility of the Info(N)gen content aggregation process extends to adding further sources. It often becomes necessary to monitor highly specialized publications and portals. An advantage of the Info(N)gen platform is that any new source can be added at the client’s request. In the case of Schneider Electric Russia, the sources are major energy and electrical publishers, specialized portals and sites, IT press, sources that cover industrial automation, as well as specialized press on key market segments.
In addition to content aggregation in RSS format (XML streams), the Info(N)gen platform can also digest content delivered via email. You can simply reroute emails from your content providers to a predefined Info(N)gen address to have your email subscriptions processed and analyzed by the system according to the classification rules built specifically for your business case. This use of client-specific rules is commonly known as a “custom taxonomy” approach to classification and content ranking.
Users can also connect external news aggregators. On the one hand, this is an opportunity to improve the quality of content aggregation; on the other, external search engines restrict the number of requests they accept, which limits this option. A balance therefore has to be found between quality and these limitations.
Info(N)gen can also aggregate your paid subscriptions. This content can be made available only to approved users in order to prevent unauthorized access. At the customer’s request, EPAM’s Info(N)gen services team adds any sources the client needs. This flexibility in solving aggregation and content management tasks is worth underlining.
Figure 1 below shows the process of content aggregation, as well as all major types of sources from which information is collected:
Stage #2. Search and visualization of search results
To make content discovery efficient across a large company, it is imperative to define the primary list of topics in your taxonomy against which Info(N)gen will analyze and categorize incoming content. Once the client has finalized the list of topics, EPAM develops and tailors linguistic models that determine how content is ranked in your search results.
In the case of Schneider Electric Russia, we chose to create five large independent taxonomies to facilitate content discovery and monitoring:
1. The main competitors;
2. Partners and other players on the target market;
3. All products and systems in the scope of interest;
4. Market Segments;
5. Geography (federal districts, regions and major cities).
Therefore, all the content digested by Info(N)gen is automatically tagged during processing based on all five taxonomies mentioned above using prebuilt and fine-tuned linguistic models.
What does such an approach allow for? The primary goal is to analyze a large volume of content in real time and give you powerful tools to monitor the subjects you care about as relevant stories break. Creating a highly tuned linguistic model is admittedly a rather complex process, since the system must be trained first. However, the experienced support team at EPAM handles this effectively. The accuracy of these linguistic models relies on close cooperation between the client and EPAM, since the “magic sauce” has two major ingredients: the subject matter expertise of Schneider Electric Russia and the linguistic and technical expertise of the Info(N)gen support team. Ultimately, the resulting models allow efficient “sifting” through all the news so that you can focus only on relevant content.
Once the system is configured, it becomes really easy to find or monitor relevant information by simply selecting the topic of interest in the navigation sidebar (Fig. 2, area #1).
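To make the idea of tagging each article against several taxonomies more concrete, here is a deliberately simplified sketch. It replaces tuned linguistic models with plain keyword matching, which is far cruder than what the article describes; the `TAXONOMIES` table, its keywords and the `tag_article` function are all hypothetical and exist only for illustration.

```python
# Hypothetical keyword lists standing in for tuned linguistic models.
# Structure: taxonomy name -> topic name -> trigger keywords.
TAXONOMIES = {
    "Market Segments": {
        "Hotels": ["hotel"],
        "Construction": ["construction"],
    },
    "Geography": {
        "Ural Region": ["ural", "yekaterinburg"],
        "Moscow": ["moscow"],
    },
}


def tag_article(text):
    """Return {taxonomy: [topics]} for every topic whose keyword occurs in text."""
    lowered = text.lower()
    tags = {}
    for taxonomy, topics in TAXONOMIES.items():
        matched = sorted(
            topic
            for topic, keywords in topics.items()
            if any(keyword in lowered for keyword in keywords)
        )
        if matched:
            tags[taxonomy] = matched
    return tags


print(tag_article("A new hotel opens in Yekaterinburg"))
```

Naive substring matching like this produces false positives and misses synonyms and inflected forms, which is one reason a tuned model built with subject matter experts is needed in practice.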
In addition to filtering content by topic, you can use advanced search capabilities (Fig. 2, area #2) and store your searches for future use (Fig. 2, area #3). For example, you can select the topic “Hotels” and add the topic “Ural Region” with the condition “AND.” This query returns news about hotels specifically in the Ural region of Russia. All topics built for the customer can thus be used to create search queries. Searches can also combine topics with full-text Boolean queries. In this regard, Info(N)gen really excels at providing the tools needed to customize searches:
In addition to advanced search capabilities, Info(N)gen gives you convenient access to your favorite sources through a dedicated navigation section (Fig. 2, area #4). Adding sources to your favorites list also lets you use the “favorites” tab in the search bar (Fig. 2, area #2), which is very helpful for narrowing down your search results with one click. In other words, you can run your query against all sources or quickly switch to a predefined list of sources such as “favorites” or “subscribed”.
Thus, it is the flexibility and broad spectrum of search options that sets Info(N)gen apart from similar solutions on the market.
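The “Hotels” AND “Ural Region” query described earlier can be pictured as a set operation over the topic tags attached to each article. The sketch below is an assumption about how such a filter could work, not a description of the Info(N)gen query engine; the `matches` and `search` functions and the sample articles are hypothetical.

```python
def matches(article_tags, all_of=(), any_of=(), none_of=()):
    """Evaluate a simple Boolean condition over a set of topic tags."""
    tags = set(article_tags)
    if not set(all_of) <= tags:                # AND: every topic must be present
        return False
    if any_of and not (set(any_of) & tags):    # OR: at least one topic present
        return False
    if set(none_of) & tags:                    # NOT: none of these may be present
        return False
    return True


def search(articles, **conditions):
    """Return the titles of all articles whose tags satisfy the conditions."""
    return [a["title"] for a in articles if matches(a["tags"], **conditions)]


# Hypothetical, already-tagged articles.
ARTICLES = [
    {"title": "Hotel chain expands in the Urals", "tags": {"Hotels", "Ural Region"}},
    {"title": "New hotel opens in Moscow", "tags": {"Hotels", "Moscow"}},
    {"title": "Oil terminal upgrade announced", "tags": {"Oil and Gas", "Ural Region"}},
]

print(search(ARTICLES, all_of=["Hotels", "Ural Region"]))
```

Full-text Boolean queries would need a text index in addition to the tags; this sketch covers only the topic part of a query.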
Each saved search can be modified, renamed, or shared with colleagues who have access to Info(N)gen. A team of people with different subject matter focuses can therefore collaborate and complement one another’s efforts while discovering valuable information.
It is particularly important to note the design of the Info(N)gen portal. All discovered articles and reports are positioned in the center of the portal and are easy to read and filter. The search results are shown in area #5 of Fig. 2, while area #6 shows the frequency and dynamics of your search results. The display format of news items can be changed using the settings in area #7. For example, you can:
The settings in area #8 (Fig. 2) allow you to build custom charts that show the dynamics and distribution of relevant topics in your search results. This lets you, for example, compare media activity around particular companies or developing stories of interest.
An additional advantage in terms of display and visualization is the ability to see “clouds” of trending phrases or topics (see Fig. 2, area #9). The difference between the two is that a “word cloud” is based on the most frequently mentioned phrases in the articles returned in the search results, while a “tag cloud” is based on the analysis performed by the linguistic models tailored to the customer. This function improves content discoverability and saves you time when searching for relevant information. Similarly, the metadata facets shown for the search results give Info(N)gen users a great advantage (see Fig. 2, area #10).
The idea of faceted search display is certainly not new, and it is often regarded as a system of additional filters. It is most frequently used in online stores, where the customer first chooses an item, then the manufacturer, then the product model, and so on. Unlike online stores, which offer static facets to choose from, Info(N)gen computes facets in real time, reflecting incoming content that changes constantly. Linguistic processing and facet visualization of the custom topics are thus carried out continuously as content arrives. What are the advantages of this approach? First, you can quickly and clearly see the whole array of topics trending in your search results. Second, you get a quick and easy way to narrow down your search results using relevant parameters (competitor company, region, market segment, etc.).
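A minimal sketch of the faceting idea, assuming each search result already carries topic tags per taxonomy: counts are recomputed from whatever the current result set is, so they change as new content arrives or as the user narrows the search. The field names, sample results and the `facet_counts` and `narrow` helpers are hypothetical and are not taken from Info(N)gen.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how often each value of `field` occurs in the current results."""
    counts = Counter()
    for article in results:
        counts.update(article.get(field, []))
    return counts.most_common()


def narrow(results, field, value):
    """Keep only results tagged with `value` in `field` (one facet click)."""
    return [a for a in results if value in a.get(field, [])]


# Hypothetical tagged search results.
RESULTS = [
    {"title": "Hotel chain expands in the Urals",
     "segment": ["Hotels"], "region": ["Ural Region"]},
    {"title": "Oil terminal upgrade announced",
     "segment": ["Oil and Gas"], "region": ["Ural Region"]},
    {"title": "New hotel opens in Moscow",
     "segment": ["Hotels"], "region": ["Moscow"]},
]

print(facet_counts(RESULTS, "region"))
moscow_only = narrow(RESULTS, "region", "Moscow")
print(facet_counts(moscow_only, "segment"))
```

Because the counts are derived from the result set rather than stored, applying one facet immediately updates the counts shown for all the others.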
Stage #3. Curation and distribution of discovered content to your target audience.
Info(N)gen gives you several options to channel your content distribution:
The great flexibility in building search queries, the ability to fully customize the system to the user’s requirements, the convenient and functional design of the aggregator, and the professional customer support from EPAM staff are the major reasons why Schneider Electric Russia selected Info(N)gen as its preferred solution for content monitoring and market intelligence.
Original publication is here.