
This article explains in detail some of the issues that you may face during your machine learning project.



By Alberto Artasanchez, Knowledgent.

Okay, so the title of this article is somewhat apocalyptic, and I don’t wish ill on anyone. I am rooting for you and hope your project succeeds beyond expectations. The purpose of this article is not to put a voodoo curse on you and assure your project’s failure. Rather, it is a tour of the most common reasons why data science projects fail, in the hope of helping you avoid the pitfalls.

1. Asking the wrong question

If you ask the wrong questions, you will get the wrong answers. An example comes to mind from the financial industry and the problem of fraud identification. The initial question might be “Is this particular transaction fraudulent or not?” To make that determination, you will need a dataset containing examples of fraudulent and non-fraudulent transactions. Most likely this dataset will be generated with human help; that is, the labels could be assigned by a group of subject matter experts (SMEs) trained to detect fraud. But those experts can only label the data based on the fraudulent behavior they have witnessed in the past, so a model trained on it will only catch fraud that fits the old patterns. If a bad actor finds a new way to commit fraud, the system will be unable to detect it. A potentially better question to ask could be “Is this transaction anomalous or not?” Rather than looking for transactions that match fraud seen in the past, the model would look for transactions that don’t fit the “normal” signature of a transaction. Even the most sophisticated fraud detection systems rely on humans to further analyze predicted fraudulent transactions and verify the model’s results. One side effect of this approach is that it will more than likely generate more false positives than the previous model.
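
To make the contrast concrete, here is a minimal sketch of the anomaly-detection framing using scikit-learn’s IsolationForest. The features and data are illustrative placeholders of my own, not from a real fraud system or from the original article.

```python
# Hypothetical sketch: framing fraud detection as anomaly detection
# rather than supervised classification. Features and values are
# illustrative placeholders only.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# "Normal" transactions: amount and hour-of-day clustered around typical values.
normal = np.column_stack([
    rng.normal(50, 15, size=1000),   # transaction amount
    rng.normal(14, 3, size=1000),    # hour of day
])

# A few transactions that do not fit the usual signature.
unusual = np.array([[900.0, 3.0], [750.0, 2.0], [5.0, 23.5]])

# Learn what "normal" looks like; no fraud labels required.
detector = IsolationForest(contamination=0.01, random_state=0)
detector.fit(normal)

# -1 flags an anomaly, 1 flags an inlier; flagged rows go to human review.
print(detector.predict(unusual))     # expected: mostly -1
print(detector.predict(normal[:5]))  # expected: mostly 1
```

The human-review step mentioned above would then act on the flagged rows, which is also where the extra false positives show up.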

Another of my favorite examples of this kind of failure also comes from the world of finance: the investing legend Peter Lynch.

During his tenure at the Fidelity Magellan fund, Lynch trounced the market overall and beat it in most years, racking up a 29 percent annualized return. But Lynch himself points out a fly in the ointment. He calculated that the average investor in his fund made only about 7 percent during the same period. When Lynch would have a setback, for example, the money would flow out of the fund through redemptions. Then when he got back on track it would flow back in, having missed the recovery.

Which would have been a better question to answer?

  • What will be the performance of the Fidelity Magellan fund for the next year?
  • What will be the number of purchases or redemptions for the Fidelity Magellan fund next year?

2. Trying to use it to solve the wrong problem

A common mistake made here is not focusing on the business use case. When shaping your requirements, there should be a laser focus on one question: “If we solve this problem, will it substantially add value to the business?” As you break the problem down into subtasks, the initial tasks should be the ones that answer that question. As an example, let’s say you come up with a great idea for an Artificial Intelligence product and now you want to start selling it. Say it’s a service where you upload a full-body photograph to a website and the AI program determines your measurements so it can then create a tailored suit that fits your body type and size. Let’s brainstorm some of the tasks that would need to be performed to accomplish this project.

  • Develop the AI/ML technology to determine body measurements from the photograph.
  • Design and create a website and a phone app to interact with your customers.
  • Perform a feasibility study to determine if there is a market for this offering.

As technologists, we are eager to design and code, so we might be tempted to start working on the first two tasks. As you can imagine, it would be a horrible mistake to save the feasibility study for last, only to learn that there is no market for our product after the first two tasks are already built.

3. Not having enough data

Some of my projects have been in the life sciences domain, and one problem we have run into is the inability to obtain certain data at any price. The life sciences industry is very sensitive about storing and transmitting protected health information (PHI), so most available datasets scrub this information out. In some cases, this information would have been relevant and would have enhanced the model results. For example, an individual’s location might have a statistically significant impact on their health: someone from Mississippi might have a higher probability of having diabetes than someone from Connecticut. But since this information is not available, we won’t be able to use it.

Another example comes from the financial industry. Some of the most interesting and relevant datasets can be found in this domain, but much of this information is very sensitive and closely guarded, so access to it may be highly restricted. Without that access, relevant results will not be possible.

4. Not having the right data

Using faulty or dirty data can lead to bad predictions even if you have the best models. In supervised learning, we use data that has been previously labeled. In many cases this labeling is done by a human, which can introduce errors. An extreme hypothetical example would be a model that scores perfectly against its labels while the labels themselves are wrong. Think of the MNIST dataset for a moment. When we run our models against it, we assume that the human labeling of the images was 100% accurate. Now imagine that a third of the digits are mislabeled. How difficult do you think it would be to produce any kind of decent result, regardless of how good your model is? The old adage of garbage in, garbage out is alive and well in the data science domain.
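
As a hedged illustration of this point (my own sketch, not part of the original article), the snippet below trains the same simple classifier on the scikit-learn digits dataset twice, once with clean labels and once with roughly a third of the training labels randomly corrupted, to show how label noise caps what even a reasonable model can achieve.

```python
# Sketch: the same model, clean labels vs. ~1/3 corrupted training labels.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Baseline: train on the labels as given.
clean_model = LogisticRegression(max_iter=2000).fit(X_train, y_train)
print("clean labels:", clean_model.score(X_test, y_test))

# Mislabel roughly a third of the training examples at random.
rng = np.random.default_rng(0)
noisy = y_train.copy()
idx = rng.choice(len(noisy), size=len(noisy) // 3, replace=False)
noisy[idx] = rng.integers(0, 10, size=len(idx))

# Same algorithm, corrupted labels: test accuracy drops noticeably.
noisy_model = LogisticRegression(max_iter=2000).fit(X_train, noisy)
print("noisy labels:", noisy_model.score(X_test, y_test))
```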


5. Having too much data

In theory, you can never have too much data (as long as it’s correct data). In practice, even though there have been tremendous advances in storage and computing costs and performance, we are still bound by the physical constraints of time and space. So currently, one of the most important jobs a data scientist has is to judiciously pick out the data sources they think will have an impact on achieving accurate model predictions. As an example, let’s assume we are trying to predict baby birth weights. Intuitively, the mother’s age seems like a relevant feature to include, but the mother’s name is probably not relevant, whereas her address might be. Another example that comes to mind is the MNIST dataset. Most of the information in the MNIST images is in the center of the image, and we can probably get away with removing a border around the images without losing much information. Again, in this example, human intervention and intuition were needed to determine that removing a certain number of border pixels would have minimal impact on predictions. One last option for dimensionality reduction is to use Principal Component Analysis (PCA) or t-distributed Stochastic Neighbor Embedding (t-SNE). Determining which features are going to be relevant before you run the models is still a hard problem for computers, but it is a field that is ripe with possibilities to automate the process. In the meantime, having too much data remains a potential trap that can derail your data science project.
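
As a small illustration of the dimensionality-reduction idea above (a sketch of mine, using the scikit-learn digits dataset as a stand-in for a wide feature set), PCA can keep most of the variance while discarding the majority of the raw features.

```python
# Sketch: reduce 64 pixel features to the components that explain ~95% of variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 64 pixel features per image
pca = PCA(n_components=0.95)          # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X)

print("original features:", X.shape[1])
print("reduced features:", X_reduced.shape[1])
print("variance retained:", pca.explained_variance_ratio_.sum())
```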

6. Hiring the wrong people

You wouldn’t trust your doctor to fix your car, and you shouldn’t trust your mechanic to perform your colonoscopy. If you have a small data science practice, you might have no choice but to rely on one or a few people to perform all tasks, from data gathering and procurement, data cleanup and munging, and feature engineering and generation, to model selection and deploying the model in production. But as your team grows, you should consider hiring specialists for each of these tasks. The skills required to be an ETL development expert do not necessarily overlap with those of a Natural Language Processing (NLP) expert. In addition, for certain industries – biotech and finance come to mind – deep domain knowledge can be valuable and even critical. However, pairing a subject matter expert (SME) with a data scientist who has good communication skills can be a suitable alternative. As your data science practice grows, having the right specialized resources is a tricky balancing act, but having the right resources and talent pool is one of the most important keys to your practice’s success.

7. Using the wrong tools

Many examples come to mind here. One common pitfall is the proverbial “when all you have is a hammer, everything looks like a nail.” A more industry-specific example: you recently sent your team to train on MySQL, and now that they are back, you need to set up an analytics pipeline. With the training fresh in their minds, they suggest using their shiny new tool. However, based on the amount of data your pipeline will be processing and the amount of analytics you need to perform on the results, this selection might be the wrong choice for the job. Many SQL offerings have practical limits on how much data a single table can hold and query efficiently. In this case, a better alternative might be a NoSQL offering like MongoDB or a highly scalable columnar database such as AWS Redshift.

8. Not having the right model

A model is a simplified representation of reality. These simplifications are made to discard unnecessary fluff, noise, and detail. A good model allows its users to focus on the specific aspect of reality that is important in a particular domain. For example, in a marketing application, keeping attributes such as customer email and address might be important, whereas in a medical setting a patient’s height, weight, and blood type might matter more. These simplifications are rooted in assumptions; the assumptions may hold under certain circumstances but not under others. This suggests that a model that works well in one scenario may not work in another.

There is a famous theorem in mathematics, the “No Free Lunch” (NFL) theorem, which states that no single model works best for every problem. The assumptions of a good model for one domain may not hold for another, so it is not uncommon in data science to iterate through multiple models, trying to find the one that fits best for a given situation. This is especially true in supervised learning. Validation or cross-validation is commonly used to assess the predictive accuracy of multiple models of varying complexity and find the most suitable one. In addition, a model that works well could be trained using multiple algorithms – for example, linear regression could be trained using the normal equations or gradient descent.

Depending on the use case, it is critical to ascertain the trade-offs between speed, accuracy, and complexity of different models and algorithms and to use the model that works best for a given domain.
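
A minimal sketch of this iterate-and-compare workflow is below. The candidate models and dataset are illustrative choices of mine, not ones prescribed by the article; the point is simply that cross-validation lets you measure several models of varying complexity on an equal footing.

```python
# Sketch: compare candidate models with 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate models of increasing complexity, all evaluated the same way.
candidates = {
    "logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold accuracy
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```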

9. Not having the right yardstick

In machine learning, it is critical to be able to evaluate the performance of a trained model. It is essential to measure how well the model performs against both the training data and the test data. This information is used to select the model, tune its hyperparameters, and determine whether the model is ready for production use.

To measure model performance, it is of the utmost importance to choose the best evaluation metrics for the task at hand. There is plenty of literature on metric selection so we won’t delve too deeply into this topic but some parameters to keep in mind when selecting metrics are:

  • The type of machine learning problem: supervised learning, unsupervised learning, or reinforcement learning.
  • The type of supervised learning: classification (binary or multiclass) or regression.
  • The dataset type: if the dataset is imbalanced, a different metric might be more suitable (see the sketch below).
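
As a quick, hedged illustration of the imbalanced-data point (synthetic data, my own sketch): on a roughly 95/5 class split, a model that never predicts the minority class still reports about 95% accuracy, while the F1 score for the minority class exposes the failure.

```python
# Sketch: accuracy vs. F1 on an imbalanced synthetic dataset.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)  # ~5% positive class
y_pred = np.zeros_like(y_true)                  # model that always predicts "negative"

print("accuracy:", accuracy_score(y_true, y_pred))                    # looks great (~0.95)
print("F1 (positive class):", f1_score(y_true, y_pred, zero_division=0))  # reveals the problem (0.0)
```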

Conclusion

There are many ways to fail and only one best way to solve a problem. The definition of success might also be measured against different yardsticks. Is the solution being sought a quick and dirty patch? Are you looking for a best-in-class solution? Is the emphasis on model training speed, inference endpoint availability, or model accuracy? Maybe finding that a solution did not work can be considered a win, because now the focus can shift to other alternatives. I would love to hear your thoughts on this topic. Which other ways have you found to fail? What solutions have worked best for you? Feel free to discuss below or reach out on LinkedIn. Happy coding!

Bio: Alberto Artasanchez is a Principal Data Scientist at Knowledgent. He has a Master’s in CS and AI and is an AI/ML researcher, practitioner, author, and educator with over 25 years in the tech industry. As Principal Data Scientist at Knowledgent since 2015, he has been involved in a variety of data science, cloud, and big data initiatives and projects in the financial and biotech industries, where he plays the role of advisor, mentor, designer, troubleshooter, and coder depending on the project’s requirements.

Original. Reposted with permission.



