After talking with some friends, I found that many data analysts are not clear about how machine learning can be applied to the business, nor are they sure whether they should learn algorithms at all. In real business, complex algorithmic scenarios such as product recommendation, content recommendation, and matching strategies actually require a great deal of exploration and validation work from data analysts.
Analysts can guide the modeling in the early stage, and provide new ideas and data insights for model optimization in the middle and late stages. Moreover, using algorithms can greatly improve both the efficiency and the rigor of analysis. Today, let us take a detailed look at data analysts and algorithms.
Contents of this article:
- Some understanding of algorithms
- Which scenarios call for machine learning algorithms
- Algorithm outputs and forms, and how to apply them to the business
- Why data analysts need machine learning
- The difference in responsibilities between data analysts and algorithm engineers
- How to divide labor in actual business to maximize utility
- What data analysts should master
1. Some understanding of algorithms
Before talking about analysts and algorithms, let’s first clarify what an algorithm is. Formal definitions appear in many books and articles; for a more general understanding, you can think of an algorithm as a fixed set of calculation methods and steps for solving a certain type of problem.
Breaking that sentence apart:
- Purpose: to solve a certain type of problem, which requires understanding the business background and scenarios behind it;
- Method: achieved through calculation, which means there must be specific, quantifiable inputs that can actually be computed, not an unexecutable abstraction;
- Steps: there is a sequence — what to do first and what to do next — each step must be feasible, and the number of executions must be finite;
- Conclusion: whether the problem is solved, and how effective the solution is; there must be an output in the end.

Beyond the algorithm itself, there are further layers of extension:

- Decision-making: judging from one or more conclusions whether the process meets expectations, how to adjust and optimize it, and whether it can be applied directly to the business;
- Application expansion: besides the original problem, what other problems of the same type can be solved, i.e., extending the scenario.
I won’t go into the specifics of building algorithms here; many reference books, textbooks, and case studies explain them in detail. Back to the question: what scenarios need algorithms? A few examples from daily life:
- Cooking: to eat better, choose a suitable recipe, prepare the main and auxiliary ingredients, follow the steps and techniques (“simmer on low heat, fry on medium heat, stir-fry on high heat”; “first fry, then stew, then braise, then boil”), and plate the dish;
- Going to school: start from the door, walk straight for 50 meters, turn right at the first intersection, continue straight for 100 meters to the bus station, take bus 402, get off after 5 stops, walk 200 meters along the sidewalk, turn left, go straight for another 150 meters, and finally reach the school gate.
Both of these can be understood as algorithms. They are everywhere in life, but in most cases they have simply become habits.
2. Which scenarios call for machine learning algorithms
Many scenarios call for machine learning algorithms. From another perspective, let me share my understanding of application scenarios: in essence, the problems solved by algorithms in my past projects can be roughly divided into the following categories.
1. The problem of matching supply and demand
Quantitative change produces qualitative change. Over the past decade, whether in B2C, B2B, S2B, or B2G, we have built user profiles for precision marketing, built recommendation systems that show each user a personalized view, classified and labeled users hierarchically, and split user reviews into positive and negative sentiment — all to better manage the matching of supply and demand.
Personalized video recommendation is supply-demand management; personalized product recommendation is supply-demand management; ride-hailing is supply-demand management. Supply-demand management means: who should find, or be shown, a relatively suitable thing (content, item, piece of information, lead, business opportunity) to consume, and which channels are needed along the way to connect the two sides.
The problems follow immediately: how to recall matches from tens of millions or even hundreds of millions of products; how to locate leads in trillions of conversation records; how to identify which people belong to our specific target population; how to push the right information to the most suitable people through the right channels; how to make good contact; and how to collect the feedback after those people receive the information.
If there are only a few thousand records and a team of 10 people who can confirm them one by one, the task can be done without any analysis at all — it just costs some manpower and time.
Therefore, in day-to-day demand matching, when a request comes in, a resource assessment is generally done first: can this be solved by adding manpower? If we take the manual route, can general rules be summarized from a small sample of data? What would the cost and output of researching and then implementing those rules be?
Then consider the algorithmic route: how many engineer person-months would it take, what equipment and performance resources are required and for how long, what scope can it affect, and what is the estimated final output? Finally, given this input-output ratio, decide whether to form rules through small-data analysis or to mine features with algorithms, and weigh the sustainability of each approach.
Large companies have richer resources, and the two approaches often proceed in parallel; to some extent this draws a strict boundary of responsibility between data analysis and data algorithms. Small and medium-sized enterprises have limited resources, which may lead to the phenomenon that the analyst is also the algorithm engineer.
We find that the algorithms involved in supply-demand matching are basically supervised. Whether it is population classification, product recall, or demand matching, preliminary labels can be established from past experience, and the accuracy of the division can then be gradually verified and optimized.
It is worth mentioning that some supply-demand scenarios involve a lot of IoT-adjacent knowledge — logistics scheduling, delivery matching, route optimization, warehouse siting, and other supply-chain optimization problems. In these scenarios, besides algorithms, you also need to understand operations research.
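The recall-then-rank idea described above can be sketched in a few lines. Everything here — the catalog, the tags, the user profile — is invented for illustration; real systems recall candidates from indexes over millions of items rather than a dictionary.

```python
# Minimal sketch of "recall then rank" for supply-demand matching.
# Products and the user profile are hypothetical illustrative data.
products = {
    "p1": {"tags": {"sports", "shoes"}, "sales": 500},
    "p2": {"tags": {"sports", "watch"}, "sales": 300},
    "p3": {"tags": {"kitchen", "knife"}, "sales": 800},
    "p4": {"tags": {"shoes", "leather"}, "sales": 200},
}

user_interests = {"sports", "shoes"}

# Recall: cheaply filter the full catalog down to candidates
# that share at least one tag with the user's interests.
candidates = {
    pid: info for pid, info in products.items()
    if info["tags"] & user_interests
}

# Rank: score candidates by tag overlap, breaking ties by popularity.
def score(info):
    overlap = len(info["tags"] & user_interests)
    return (overlap, info["sales"])

ranked = sorted(candidates, key=lambda pid: score(candidates[pid]), reverse=True)
print(ranked)  # p1 first: it matches both interest tags
```

In a supervised setting, the hand-written `score` would be replaced by a model trained on labeled match/no-match examples, as the section describes.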
2. Anomaly identification and diagnosis
Anomaly detection. In the years before the P2P lending platforms started collapsing, this was everywhere in the financial field. The main scenario was risk control, which can be subdivided:
- Credit card transaction anti-fraud: classification task, GBDT / XGBoost + LR (logistic regression);
- Credit card application anti-fraud: classification task, GBDT / XGBoost + LR;
- Loan application anti-fraud: classification task, GBDT / XGBoost + LR;
- Anti-money laundering: classification task, GBDT / XGBoost + LR.
In finance, almost everything related to risk control is GBDT / XGBoost + LR, because the financial industry has a very particular attribute: regulation.
The algorithm’s results must come with a very good model explanation. For an LR model this is a natural advantage: the features can be explained, the feature engineering is transparent, and the contribution and correlation of each feature can be quantified.
Switch to deep learning models and the final metrics — ROC/AUC/KS — may look fine, but the interpretability is extremely poor, which creates many barriers to application. In other words: very advanced, but impractical and flashy.
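The interpretability argument is easiest to see on the LR half of the stack. Below is a toy logistic regression trained by plain gradient descent on fabricated data — the GBDT feature-generation stage is omitted, and the feature names are invented. The point is that each feature ends up with a single coefficient whose sign and magnitude can be explained to a regulator.

```python
import math

# Toy data: feature 0 is a hypothetical "overdue count",
# feature 1 is noise. Label 1 = default. All values fabricated.
X = [[0, 1], [1, 0], [2, 1], [3, 0], [4, 1], [5, 0]]
y = [0, 0, 0, 1, 1, 1]

w = [0.0, 0.0]
b = 0.0
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Plain batch gradient descent on the logistic loss.
for _ in range(2000):
    gw, gb = [0.0, 0.0], 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        err = p - yi
        for j in range(2):
            gw[j] += err * xi[j]
        gb += err
    for j in range(2):
        w[j] -= lr * gw[j] / len(X)
    b -= lr * gb / len(X)

# The coefficient on "overdue count" comes out positive:
# more overdue events -> higher default probability. That
# one-sentence reading is exactly what regulation demands.
print(w, b)
```

A deep network fit to the same data would predict just as well here, but offers no such per-feature story — which is the application barrier the section describes.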
3. Ranking
Ranking is singled out here because its application scenarios are actually somewhat limited, yet how to rank objectively and reasonably is well worth studying. Common ranking scenarios include hot lists, search ranking, and recommendation ranking.
Zhihu’s answer ranking is a classic example. It must ensure that high-quality, highly upvoted content appears near the top, while guaranteeing new content a certain amount of exposure, and at the same time weigh factors such as topic popularity and community tone.
Therefore the ranking must algorithmically combine weights such as an answer’s upvotes and downvotes, the answering user’s authority in the domain, the authority of the users who voted, the answer’s recency, how controversial the answer is, and the answering user’s historical profile.
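A subset of those factors can be folded into one composite score, as a sketch. All weights and the decay constant below are invented for illustration — this is not Zhihu’s actual formula.

```python
# Hypothetical composite answer-ranking score. Weights are hand-set;
# a real system would learn or tune them against engagement data.
def answer_score(upvotes, downvotes, author_authority, age_hours):
    vote_score = upvotes - 2 * downvotes          # downvotes weighted heavier
    freshness = 1.0 / (1.0 + age_hours / 24.0)    # newer answers get a boost
    return 0.6 * vote_score + 0.3 * author_authority * 10 + 5.0 * freshness

# Fabricated answers: id, up/down votes, author authority in [0, 1], age.
answers = [
    {"id": "a", "up": 120, "down": 10, "auth": 0.9, "age": 72},
    {"id": "b", "up": 15,  "down": 0,  "auth": 0.4, "age": 2},
    {"id": "c", "up": 300, "down": 80, "auth": 0.2, "age": 500},
]

ranked = sorted(
    answers,
    key=lambda a: answer_score(a["up"], a["down"], a["auth"], a["age"]),
    reverse=True,
)
print([a["id"] for a in ranked])
```

Even in this toy, changing one weight reorders the list — which is why the section calls objective, reasonable ranking a matter worth studying.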
4. Forecasting
Both numerical prediction and classification prediction belong to forecasting. Sales forecasts, stock forecasts, and traffic forecasts are all common scenarios. Around 2011–2012 everyone used ARIMA — with SPSS in hand, it felt like no time-series problem was unsolvable. Later the defaults shifted to XGBoost and LightGBM.
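As a minimal illustration of numerical prediction, here is single exponential smoothing on made-up monthly sales — far simpler than the ARIMA or LightGBM approaches just mentioned, but it shows the shape of the task.

```python
# Single exponential smoothing on fabricated monthly sales.
# Real work would use statsmodels' ARIMA or LightGBM with lag
# features, as mentioned above; this is only the simplest baseline.
sales = [100, 110, 105, 120, 125, 130]
alpha = 0.5  # smoothing factor, hand-picked for illustration

level = sales[0]
for s in sales[1:]:
    # New level blends the latest observation with the old level.
    level = alpha * s + (1 - alpha) * level

forecast_next = level  # flat one-step-ahead forecast
print(round(forecast_next, 2))
```

A smaller `alpha` trusts history more and reacts slower; a larger one chases the latest point — the same bias-variance trade-off that tuning ARIMA orders or boosting hyperparameters plays out at larger scale.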
5. Knowledge Graph
In 2012, Google launched a product called Knowledge Graph, which lets you see, intuitively, the relationships between terms and the knowledge behind them.
Many large companies have already invested in building knowledge graphs. The earliest application was improving search engines; since then, knowledge graphs have shown rich value in intelligent question answering, natural language understanding, big data analysis, recommendation, IoT device interconnection, and explainable AI. One widely publicized AI application in recent years has been assisting judicial judgments.
- Information retrieval/search: accurate aggregation and matching of entity information in search engines, understanding of keywords, semantic analysis of search intent, etc.;
- Natural language understanding: knowledge in the graph serves as background information for understanding the entities and relations in natural language;
- Question answering: matching question-answer patterns to subgraphs of the knowledge graph;
- Recommendation systems: integrating the knowledge graph as auxiliary information to provide more accurate recommendations (knowledge graph + recommendation system);
- E-commerce: building a knowledge graph of products to accurately match users’ purchase intent with the product candidate set (knowledge graph + recommendation system);
- Financial risk control: using relationships between entities to analyze the risk of financial activities and provide remedies (e.g., anti-fraud) once risk is triggered;
- Public security and criminal investigation: analyzing entity-to-entity relationships to find clues in a case;
- Judicial assistance: structured representation of legal provisions and queries over them to assist case judgments;
- Education and medicine: visual knowledge representation for drug analysis, disease diagnosis, etc.;
- Social networks: social data is highly connected — friendships, follow relations, and the like, e.g. <user 1, follow, user 2>.
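The triple notation in the last bullet suggests a toy implementation. Below is a tiny in-memory triple store with pattern queries; all entities and relations are fabricated, and real knowledge graphs use dedicated stores and query languages such as SPARQL.

```python
# Toy triple store in the <subject, predicate, object> style used
# above (e.g. <user 1, follow, user 2>). Data is illustrative.
triples = [
    ("user1", "follow", "user2"),
    ("user1", "follow", "user3"),
    ("user2", "follow", "user3"),
    ("drugA", "treats", "diseaseX"),
]

def query(subject=None, predicate=None, obj=None):
    """Return triples matching the given pattern; None is a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]

# Who does user1 follow?
print(query(subject="user1", predicate="follow"))
# Who follows user3? Traversing relations like this is the basis of
# the clue-finding uses (criminal investigation, anti-fraud) above.
print(query(predicate="follow", obj="user3"))
```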
3. Algorithm outputs and forms, and how to apply them to the business
A phrase we often hear lately is “big data kills the familiar” (大数据杀熟 — charging loyal customers more), which is probably one of the most visible business applications of algorithms. Generally speaking, an algorithm has two kinds of output: first, results (groupings, classifications, predicted values); second, rules.
1. Output results
- Screening: whether the output is a classification or a numerical prediction, the business can use it to filter objects, further narrowing the target set and finding a clear boundary. At critical points, the algorithm reduces the cost of human decision-making by choosing the most promising of many candidate strategies to try;
- Refinement: use the results as labels and combine them with CRM, advertising, and marketing systems to help the business reach users more conveniently and accurately, strengthen user perception, create novelty to attract attention, and set rules that improve user stickiness;
- Strategy: cut costs and raise efficiency — essentially the two things any algorithm does. Algorithm outputs can effectively support strategy formulation and demonstrate feasibility with a clear yes or no.
2. Output rules
In many cases we pay attention only to the result itself — accuracy, precision, recall — while ignoring the rule-level applications an algorithm generates. The model interpretability mentioned earlier is, in fact, a concretization of rules.
Correlation analysis distinguishes strong correlation, weak correlation, and no correlation. A business person may say the output could have been known from business experience; the analyst’s job is to deconstruct that so-called “experience” into rules connected by numbers.
On the algorithm side, model interpretation will also surface features with strong rules, but it is easy to look only at the numbers and ignore their meaning and causality in the actual business process — which leads to the complaint that “the algorithm’s results are worse than decisions made from experience”.
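Since accuracy, precision, and recall were just named, here they are computed from scratch on toy labels so their definitions stay concrete.

```python
# Fabricated ground truth and predictions for a binary task.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Confusion-matrix cells, counted directly from the pairs.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)   # all correct / all
precision = tp / (tp + fp)           # of predicted positives, how many real
recall = tp / (tp + fn)              # of real positives, how many found
print(accuracy, precision, recall)
```

The numbers alone are exactly the trap the paragraph above warns about: they say nothing about *why* a feature drives the prediction, which is where rule-level interpretation comes in.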
4. Why data analysts need machine learning
Let us first clarify one thing: data analysis can exist either as a supplementary skill for professionals, or as the backbone occupation a professional builds a career on.
1. In most cases we merely follow the rules of the world without asking why they exist
In mining and analysis projects, the algorithm is the core element, and the implementation principles of most algorithms involve some advanced mathematics.
Mathematics is abstract and quickly forgotten after study, so algorithms naturally carry a sense of mystery for many people. Curiosity and ambition have driven human evolution and survival, so people want to lift that veil and learn.
Equally, people often overestimate their perseverance and what can be achieved in a short time, so the pattern is familiar: after spending great effort to understand the principles of a few algorithms, they never keep it up. At that point it may swing to the other extreme — as long as a third-party algorithm library runs successfully on your machine and produces output, you simply try another algorithm whenever the results look bad.
2. To meet business goals, analysts can use algorithms to validate ideas quickly
It is very necessary for analysts to understand algorithms. In recent years, job descriptions for data analysts have increasingly listed algorithm-related requirements.
My view is that a junior analyst can cover most of the job without understanding algorithms. But to take the career further and improve the rigor and efficiency of analysis — especially in businesses driven by algorithmic strategy — an analyst must understand some commonly used machine learning algorithms.
In practice, the focus of analysis is still decomposing, demonstrating, and delivering against the target problem. For most analysts, business requirements share these traits: short delivery times, fast-landing results, rich data dimensions, sufficient support for conclusions, and convenience for reporting.
Most business analysis scenarios can be drilled down and decomposed layer by layer with methods like DuPont analysis, and that process may involve very little mathematical or algorithmic knowledge.
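As a quick reminder of the DuPont-style drill-down just mentioned: return on equity factors into net profit margin × asset turnover × equity multiplier. The figures below are invented for illustration.

```python
# DuPont decomposition: ROE = margin x turnover x leverage.
# All figures fabricated (units: millions).
net_profit = 12.0
revenue = 150.0
total_assets = 200.0
equity = 80.0

net_margin = net_profit / revenue            # profitability layer
asset_turnover = revenue / total_assets      # efficiency layer
equity_multiplier = total_assets / equity    # leverage layer

roe_decomposed = net_margin * asset_turnover * equity_multiplier
roe_direct = net_profit / equity
print(roe_decomposed, roe_direct)  # the two agree by construction
```

The value of the decomposition is diagnostic: a falling ROE can be traced to whichever of the three layers moved, which is exactly the layer-by-layer drill-down the paragraph describes.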
The industry already has many mature algorithm applications. Sometimes, for the sake of demonstration and exploration, similar algorithms are needed, the goal being to find a breakthrough that yields conclusions in the shortest time. In practice there is a precondition: every algorithm has its appropriate scenarios and prerequisites, and hyperparameters have a large impact in concrete applications.
So if we do not understand algorithms at a higher level, then in actual use we may be “carving a notch in the boat to find the dropped sword” (刻舟求剑): either failing to achieve the expected effect, or prematurely rejecting a model that could properly solve the problem — simply because the supporting work (data cleaning, feature selection) was not given enough attention.
scikit-learn provides a large number of simple functions. To use them quickly and well on real problems, we still have to spend some time understanding the principles behind the algorithms. An architect does not need to master how reinforced concrete is manufactured, but does need to understand the properties and uses of different steels and cements and how they relate. The same applies here.
3. For analysts to grow well, breadth of knowledge is essential
The growth of a data analyst is like a marathon: time and energy must be allocated sensibly. Focus and self-control are scarce resources that should be spent where they matter most. Keep reminding yourself what the goal is and see things through — especially as an analyst.
And not only algorithms: in today’s environment, the market, industries, sub-sectors, verticals, positions, occupations, technologies, skills, and the business itself all deserve study — because analysis is just a skill, and as a profession it must produce reasonable strategies for realistic scenarios.
5. Differences in responsibilities between data analysts and algorithm engineers
1. Data analyst requirements
- Understanding the business is the prerequisite: the vision should be as wide as possible — the industry landscape, market dynamics, the company’s business, business models, and business processes — building your own cognition and critical thinking so that, in a given scenario, you can reach reasonable conclusions with scientific and rigorous methods;
- Understanding analysis is the core: basic methods and principles of data analysis, professional and efficient analysis methodology, flexible combination of skills, methodologies matched to the business, and high data sensitivity;
- Understanding reporting is the ladder: good analysis cannot be separated from good reports, and good reports cannot be separated from good presentation skills. What to say, and how to say it, in front of whom, is itself a craft.
2. Requirements of algorithm engineers
- Knowing the technology is the prerequisite: different algorithms may trade time, space, or efficiency differently on the same task, and running algorithms efficiently requires a certain level of coding skill;
- Specialties are finely subdivided: by research direction there are video algorithm engineers, image processing algorithm engineers, audio algorithm engineers, communication baseband algorithm engineers, signal algorithm engineers, NLP algorithm engineers, biomedical signal algorithm engineers, and so on — each with its own depth and breadth of knowledge.
3. The similarities and differences between the two
- Commonality: both explore data, discover the patterns and laws within it, and use rules and formulas to solve practical problems (and both need statistics and probability theory);
- Difference: data analysis uses relatively traditional methods to solve practical problems — the bar is lower, everyone can do some data analysis, and performance can be ignored once the effect is achieved. The bar for algorithm engineers is higher: they must innovate on existing methods to solve problems in a specific field, and must guarantee the performance, effect, and stability of the algorithm.
6. How to divide labor in actual business to maximize utility
In actual business, the demand sides of analysis and of algorithms differ somewhat. In collaboration it often happens that people in different departments are doing the same thing, and conclusions may diverge because the requirements arrive with different backgrounds and perspectives.
1. A case
Some people do not pay their telecom bills on time. How do we find them?
- Data analysis: observing the data, we find that low-income users account for 82% of those who pay late. Conclusion: low-income people often fail to pay on time; recommendation: lower the tariffs;
- Data algorithm: the model digs out a deeper cause — people living outside the Fifth Ring Road pay late because the area is remote. Conclusion: set up more business halls or self-service payment points.
2. How to collaborate
Before any algorithm work, explore and analyze the data: position and decompose the business problem, find the usable data dimensions and features, collect the data, form metrics, run statistical analysis across combinations of dimensions, and draw preliminary conclusions for reporting — as above: income is low, so a tariff reduction is suggested.
While focusing on the business, organize special research on topics that cannot yet be described concretely: construct data features and mine deeper with algorithms to surface the latent conclusions — as above: add more service points in remote areas.
Then, based on the algorithm’s conclusions, run a feasibility analysis against the actual business constraints: site selection, crowd coverage, package pricing, and so on.
3. Summary
Analysis and algorithms can blur into each other to a degree. In a small team, one or two senior analysts can cover both, and much of the work must be self-driven. In actual projects, the usual sequence is: analysis first, then special-topic research, then deep integration with business analysis, then analysis-driven algorithm iteration, and so on.
7. What data analysts should master
In summary, for a professional data analyst, the required level of each ability can be rated as follows:
- Industry knowledge ★★★★
- Business understanding ★★★★★
- Analytical thinking ★★★★★
- Data processing ★★★★
- Algorithm principles ★★★
- Coding ability ★★★
- Report writing ★★★★★
- Presentation ★★★★
- Summarization ★★★★★
- Resource integration ★★★★